Search engines
Analysis of search technology, products, services, and vendors.
The future of search
I believe there are two ways search will improve significantly in the future. First, since talking is easier than typing, speech recognition will allow longer and more accurate input strings. Second, search will be informed by much more persistent user information, with search companies having a very detailed understanding of searchers. Based on that, I expect:
- A small oligopoly dominating the conjoined businesses of mobile device software and search. The companies most obviously positioned for membership are Google and Apple.
- The continued and growing combination of search, advertisement/recommendation, and alerting. The same user-specific data will be needed for all three.
- A whole lot of privacy concerns.
My reasoning starts from several observations:
- Enterprise search is greatly disappointing. My main reason for saying that is anecdotal evidence — I don’t notice users being much happier with search than they were 15 years ago. But business results are suggestive too:
  - HP just disclosed serious problems with Autonomy.
  - Microsoft’s acquisition of FAST was a similar debacle.
  - Lesser enterprise search outfits never prospered much. (E.g., when’s the last time you heard mention of Coveo?)
  - My favorable impressions of the e-commerce site search business turned out to be overdone. (E.g., Mercado’s assets were sold for a pittance soon after I wrote that, while Endeca and Inquira were absorbed into Oracle.)
  - Lucene/Solr’s recent stirrings aren’t really in the area of search.
- Web search, while superior to the enterprise kind, is disappointing people as well. Are Google’s results any better than they were 8 years ago? Google’s ongoing hard work notwithstanding, are they even as good?
- Consumer computer usage is swinging toward mobile devices. I hope I don’t have to convince you about that one. 🙂
In principle, there are two main ways to make search better:
- Understand more about the documents being searched over. But Google’s travails, combined with the rather dismal history of enterprise search, suggest we’re well into the diminishing-returns part of that project.
- Understand more about what the searcher wants.
The latter, I think, is where significant future improvement will be found.
Categories: Autonomy, Coveo, Endeca, Enterprise search, FAST, Google, Lucene, Mercado, Microsoft, Search engines, Speech recognition, Structured search
Google’s version of an old joke
Search Google for “recursion” and it helpfully offers a link to let you search on — you guessed it — “recursion.” The joke has been implemented in German as well.
This idea is not, to put it mildly, new. I first saw the definition
Recursion: See recursion
in the glossary to Intellicorp’s KEE documentation, in 1984 or so. And I’d guess the joke is actually a lot older than that.
For another variation of the same idea, see this link.
Categories: Fun stuff, Google, Humor, Search engines
Data marts in the world of text
CMS (Content Management System) and search expert Alan Pelz-Sharpe recently decried “Shadow IT”, by which he seems to mean the departmental proliferation of data stores outside the control of the IT department. In other words, he’s talking about data marts, only for documents rather than tabular data.
Notwithstanding the manifest virtues of centralization, there are numerous reasons you might want data marts, in the tabular and document worlds alike. For example:
- Price/performance. Your main/central data manager might be too expensive to support additional large specialized databases. Or different databases and applications might have profiles different enough that they get their best price/performance from different kinds of data managers. This is particularly prevalent in the relational world, where column stores, sequentially-oriented row stores, and random I/O-oriented row stores each have compelling use cases.
- Different SLAs (Service-Level Agreements). Similarly, different applications may have very different requirements for uptime, response time, and the like. (In the relational world, think of operational data stores.)
- Different security requirements. Different subsets of the data may need different levels of security. This is particularly prevalent in the document world, where security problems are not as well-solved as in the tabular arena, and where it’s common for a search engine to index across different corpuses with radically different levels of sensitivity.
- Integrated application and user interfaces. In the relational world, there’s a pretty clean separation between data management and interface logic; most serious business intelligence tools can talk to most DBMS. The document world is quite different. Some search engines bundle, for example, various kinds of faceted or parameterized search interfaces. What’s more, in public-facing search, a major differentiator is the facilities that the product offers for skewing search results.
- Different text applications require different thesauruses or taxonomy management systems. Ideally, those should all be integrated — but the requisite technology still doesn’t exist.
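To make the security point concrete, here is a minimal sketch of "security trimming" when one index spans corpuses of differing sensitivity. Everything in it is an assumption for illustration: the `Doc` record, the integer `sensitivity` scale, and the `search` function are hypothetical and do not reflect any particular search product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    sensitivity: int  # hypothetical scale: 0 = public, higher = more restricted


def search(index, query, clearance):
    """Return ids of documents that contain every query term and that the
    searcher is cleared to see."""
    terms = query.lower().split()
    hits = []
    for doc in index:
        if doc.sensitivity > clearance:
            continue  # trim before matching, so restricted docs never surface
        words = doc.text.lower().split()
        if all(term in words for term in terms):
            hits.append(doc.doc_id)
    return hits


# Toy index mixing a public corpus with a restricted one.
index = [
    Doc("handbook", "vacation policy for all employees", 0),
    Doc("payroll", "salary policy for executives", 2),
]
print(search(index, "policy", clearance=0))
print(search(index, "policy", clearance=2))
```

A separate text data mart sidesteps the filter entirely: the sensitive corpus simply lives in its own index, so there is no trimming logic to get wrong.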
Bottom line: Text data marts, much like relational data marts, are almost surely here to stay.
Categories: Enterprise search, Ontologies, Search engines, Specialized search, Structured search
MEN ARE FROM EARTH, COMPUTERS ARE FROM VULCAN
The newsletter/column excerpted below was originally published in 1998. Some of the specific references are obviously very dated. But the general points about the requirements for successful natural language computer interfaces still hold true. Less progress has been made in the intervening decade-plus than I would have hoped, but some recent efforts — especially in the area of search-over-business-intelligence — are at least mildly encouraging. Emphasis added.
Natural language computer interfaces were introduced commercially about 15 years ago*. They failed miserably.
*I.e., the early 1980s
For example, Artificial Intelligence Corporation’s Intellect was a natural language DBMS query/reporting/charting tool. It was actually a pretty good product. But it’s infamous among industry insiders as the product for which IBM, in one of its first software licensing deals, got about 1700 trial installations — and less than a 1% sales close rate. Even its successor, Linguistic Technologies’ English Wizard*, doesn’t seem to be attracting many customers, despite consistently good product reviews.
*These days (i.e., in 2009) it’s owned by Progress and called EasyAsk. It still doesn’t seem to be selling well.
Another example was HAL, the natural language command interface to 1-2-3. HAL is the product that first made Bill Gross (subsequently the founder of Knowledge Adventure and idealab!) and his brother Larry famous. However, it achieved no success*, and was quickly dropped from Lotus’ product line.
*I loved the product personally. But I was sadly alone.
In retrospect, it’s obvious why natural language interfaces failed. First of all, they offered little advantage over the forms-and-menus paradigm that dominated enterprise computing in both the online-character-based and client-server-GUI eras. If you couldn’t meet an application need with forms and menus, you couldn’t meet it with natural language either.
Categories: BI integration, IBM and UIMA, Language recognition, Natural language processing (NLP), Progress and EasyAsk, Search engines, Speech recognition
Google Wave — finally a Microsoft killer?
Google held a superbly-received preview of a new technology called Google Wave, which promises to “reinvent communication.” In simplest terms, Google Wave is a software platform that:
- Offers the possibility to improve upon a broad range of communication, collaboration, and/or text-based product categories, such as:
  - Search
  - Word processing
  - Instant messaging
  - Microblogging
  - Blogging
  - Mini-portals (Facebook-style)
  - Mini-portals (Sharepoint-style)
- In particular, allows these applications to be both much more integrated and interactive than they now are.
- Will have open developer APIs.
- Will be open-sourced.
If this all works out, Google Wave could play merry hell with Microsoft Outlook, Microsoft Word, Microsoft Exchange, Microsoft SharePoint, and more.
I suspect it will.
And by the way, there’s a cool “natural language” angle as well.
Categories: Google, Language recognition, Microblogging, Microsoft, Natural language processing (NLP), Search engines, Social software and online media, Software as a Service (SaaS)
Google has a lot more features than I realized
A features and syntax page reveals that the basic Google search box now gives you flight times, weather, stock quotes, sports scores, currency conversion, calculator results, and a lot more. Wow. I did not know.
Since the early 1980s, I’ve thought that natural language interfaces, spoken or otherwise, would someday win. While this versatility isn’t natural language per se, it is still, in my opinion, evidence in favor of that belief.
Categories: Google, Search engines
Thoughts on the rumored Google/Twitter deal
Michael Arrington reports that Google and Twitter are contemplating both:
- A Google acquisition of Twitter
- Some other kind of relationship built around real-time search
I have three initial thoughts on this:
1. Clearly, in Google’s mission to “organize all the world’s information,” there are several web areas it isn’t yet doing well in, and one of those is microblogs. What’s more, much as in the case of YouTube, it’s hard to see how Google would do that organizing any time soon unless it owned or otherwise was in bed with the leading platform for that kind of content — i.e., Twitter.
2. The YouTube example is apt in another way as well — it’s not clear where the monetization would come from. Google famously doesn’t make much advertising revenue from YouTube. And Twitter is even worse as an advertising platform; sticking ads into the tweetstream would quickly drive users elsewhere, and any other advertising scheme would likely fail because of the broad variety of interfaces — such as various mobile phones — Twitterers use to get at the service.
3. I’ve been suggesting all along that Twitter needs radical user experience enhancements. But when has Google ever made user experience enhancements to a service? Its core search engine always looks pretty much the same. Ditto GMail. Ditto Blogger. Ditto YouTube.
Categories: Google, Microblogging, Search engines, Social software and online media, Twitter
Twitter shows some directions for growth
TechCrunch pointed out a Twitter jobs page. The specific job TechCrunch mentioned* isn’t up there any more, but at the moment I write this, 18 others are (copied below). That’s considerable growth, given that the same page says Twitter has fewer than 30 current employees. Note the emphasis on search and the mention of Japan.
*Care and feeding of celebrity tweeters. Celebrity tweeting is actually a subject I’ve written and even been interviewed about several times.
As of this writing, the full list is: Read more
Categories: Microblogging, Search engines, Social software and online media, Specialized search, Twitter
Yet more NoFollow whining
Andy Beal has a blog post up to the effect that NoFollow is a bad thing. (Edit: Andy points out in the comment thread that his opposition to NoFollow isn’t as absolute as I was suggesting.) Other SEO types are promoting this as if it were some kind of important cause. I think that’s nuts, and NoFollow is a huge spam-reducer.
The weakness of Andy’s argument is illustrated by the one and only scenario he posits in support of his crusade:
The result is that a blog post added to a brand new site may well have just broken the story about the capture of Bin Laden (we wish!)–and a link to said post may have been Tweeted and re-tweeted–but Google won’t discover or index that post until it finds a “followed” link. Likely from a trusted site in Google’s index and likely hours, if not days, after it was first shared on Twitter.
Helloooo. If I post something here, it is indexed at least in Google blog search immediately. (As in, within a minute or so.) Ping, crawl, pop, and there it is. The only remotely valid version of Andy’s complaint is that it might take some hours for Google’s main index to update, but even there, there’s a News listing at the top. This simply is not a problem.
Now, I think it would be personally great for me if all the links to my sites from Wikipedia and Twitter and the comment threads of major blogs pointed back with “link juice.” On the other hand, even with NoFollow out there, my sites come up high in Google’s rankings for all sorts of keywords, driving a lot of their readership. I imagine the same is true for most other sites containing fairly unique content that people find interesting enough to link to.
So other than making it harder to engage in deceptive SEO, I fail to see what problems NoFollow is causing.
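For readers who haven’t looked at the mechanics: NoFollow is just a `rel="nofollow"` token on a link, which asks a crawler not to follow or credit that link. Below is a toy sketch, using only Python’s standard `html.parser`, of how a crawler might separate the two kinds of links. The `LinkSorter` class, the `sort_links` helper, and the sample URLs are hypothetical, and this is in no way a description of how Google’s crawler actually works.

```python
from html.parser import HTMLParser


class LinkSorter(HTMLParser):
    """Collects hrefs, separating links marked rel="nofollow" from the rest."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        # rel may hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


def sort_links(html):
    """Return (followed, nofollow) lists of hrefs found in the HTML."""
    sorter = LinkSorter()
    sorter.feed(html)
    return sorter.followed, sorter.nofollow


sample = (
    '<p><a href="https://example.com/story">story</a> '
    '<a rel="nofollow" href="https://example.com/comment-link">comment</a></p>'
)
followed, skipped = sort_links(sample)
print(followed)
print(skipped)
```

A spammer’s link dropped into a comment thread lands in the second list, which is the whole point.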
Categories: Google, Online marketing, Search engine optimization (SEO), Search engines, Spam and antispam
Where “semantic” technology is or isn’t important
At Lynda Moulton’s behest, I spoke a couple of times recently on the subject of where “semantic” technology is or isn’t likely to be important. One was at the Gilbane conference in early December. The slides were based on my previously posted deck for a June talk I gave on a text analytics market overview. The actual Gilbane slides may be found here.
My opinions about the applicability of semantic technology include:
- The big bucks in web search are for “transactional” web search, and semantics isn’t the issue there. (Slides 3-4)
- When UIs finally go beyond the simple search box — e.g. to clusters/facets or to voice — semantics should have a role to play. (Slide 5)
- Public-facing site search depends — more than any other area of text analytics — on hand-tagging. (Slide 7)
- “Enterprise” search that searches specialized external databases could benefit from semantic technologies. (Slide 8)
- True enterprise search could benefit from semantic technologies in multiple ways, but has other problems as well. (Slides 10-11)
- Semantics — specifically extraction — is central to custom publishing. (Slide 12 — upon review I regret using the word “sophisticated”)
- Semantics is central to text mining. (Slide 18)
- Semantics could play a big role in all sorts of exciting future developments. (Slide 19)
So what would your list be like?