Enterprise search
Analysis of enterprise-specific search technology (as opposed to general web search).
Web search and enterprise search are coming together
Web search and enterprise search are in many ways fundamentally different problems. The biggest problem in web search is screening out pages that deliberately pretend to be relevant to a search. The second biggest problem is picking out the crème de la crème from a long list of essentially good hits. In enterprise search, on the other hand, the biggest problem is finding a single document, or single fact, that is lonely at best, and if you’re unlucky doesn’t exist in the corpus at all. Document structures are also completely different, as are linking structures and almost every other input to the ranking algorithms except the raw words themselves.
Even so, the businesses and technologies of web and enterprise search are beginning to combine.
Categories: Convera, Enterprise search, FAST, Search engines | 3 Comments |
Convera aka Excalibur aka ConQuest
Once upon a time, more than a decade before the founding of Autonomy, a New Mexico inventor had the idea for a generic pattern recognition tool. He implemented it on a PC add-in board that, if I recall correctly, plugged into the Apple II. This was the genesis of the company Excalibur Technologies.
Categories: Convera, Enterprise search, Ontologies, Search engines | 5 Comments |
Text mining for compliance and legal discovery
One theme that keeps recurring in my talks with text mining and other text analytics/text technology companies is compliance. Ditto legal discovery, which is closely related. Most of the focus seems to be on three kinds of data:
- Vehicle defect evidence. The TREAD Act is of course the big driver here (no pun intended).
- Drug side effect evidence. The FDA is pushing that one.
- Email/correspondence archives. Text search/filtering/clustering/mining whatever is now a standard part of legal discovery.
Categories: Enterprise search, Search engines, Text mining | 2 Comments |
Update: Autonomy/Verity merger
I had a couple of very interesting calls with Autonomy last week. One message I got was that they do not want to be pigeonholed in search, which they think on the whole is a primitive way of dealing with “unstructured information.” Nonetheless, my first post based on those calls will indeed focus on text indexing and search. You see, I wrote quite skeptically about the Autonomy/Verity merger when it was announced, and I’d like to amend that with an updated opinion. Autonomy’s claims can be summarized in part by the following:
Categories: About this blog, Autonomy, Enterprise search, Search engines | Leave a Comment |
Google’s internal text-based project/knowledge management
Slashdot turned up an amazing article in Baseline on Google’s infrastructure. There’s lots of gee-whiz stuff in there about server farms, petabytes of disk packed into a standard shipping container so as to allow the setup of more server farms around the globe, and so on. But even more interesting to me was another point, about Google’s internal use of its own technology. In at least one case – a hybrid of project and knowledge management – Google really seems to be doing what other firms only dream about as futures. Here’s the relevant excerpt:
Categories: About this blog, Enterprise search, Google, Search engines, Specialized search | 2 Comments |
The text technologies market 3: Here’s what’s missing
The text technologies market should be booming, but actually is in disarray. How, then, do I think it should be fixed? I think the key problem can be summed up like this:
There’s a product category that is a key component of the technology, without which it won’t live up to nearly its potential benefits. But there’s widespread and justified concern over its commercial viability. Hence, the industry cowers in niches where it can indeed eke out some success despite products that fall far short of their true potential.
The product category I have in mind, for lack of a better name, is an ontology management system. No category of text technology can work really well without some kind of semantic understanding. Automated clustering is very important for informing this understanding in a cost-effective way, but such clustering is not a complete solution – hence the relative disappointment of Autonomy, the utter failure of Excite, and so on. Rather, there has to be some kind of concept ontology that can be used to inform disambiguation. It doesn’t matter whether the application category is search, text mining, command/control, or anything else; semantic disambiguation is almost always necessary for the most precise, user-satisfying results. Maybe it’s enough to have a thesaurus – i.e., a list of synonyms. Maybe it’s enough to define “concepts” by simple vectors of word likelihoods. But you have to have something, or your search results will be cluttered, your information retrieval won’t fetch what you want it to, your text mining will have wide error bars, and your free-speech understanders will come back with a whole lot of “I’m sorry; I didn’t understand that.”
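To make the two simplest options concrete, here is a minimal Python sketch of a thesaurus used for query expansion and of "concepts" defined as vectors of word likelihoods used for disambiguation. Everything in it is an assumption made for illustration: the names `THESAURUS`, `CONCEPTS`, `expand_query` and `disambiguate`, the sample words, and the probabilities are hypothetical and do not describe any vendor's product.

```python
import math

# Hypothetical thesaurus: each term maps to a list of synonyms.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "defect": ["fault", "flaw"],
}

# Hypothetical concepts, each a simple vector of word likelihoods.
CONCEPTS = {
    "jaguar_animal": {"jaguar": 0.4, "jungle": 0.3, "cat": 0.2, "prey": 0.1},
    "jaguar_car": {"jaguar": 0.4, "engine": 0.3, "dealer": 0.2, "sedan": 0.1},
}


def expand_query(terms):
    """Add thesaurus synonyms to a list of query terms, preserving order."""
    expanded = []
    for term in terms:
        for candidate in [term] + THESAURUS.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def disambiguate(context_words):
    """Pick the concept whose word likelihoods best explain the context."""
    def log_score(vector):
        # Words missing from a vector get a small floor probability
        # (a crude smoothing assumption, chosen only for this sketch).
        return sum(math.log(vector.get(w, 1e-4)) for w in context_words)

    return max(CONCEPTS, key=lambda name: log_score(CONCEPTS[name]))
```

With these toy tables, a query for "car defect" also matches documents that say "automobile" or "fault", and the word "jaguar" next to "engine" and "dealer" resolves to the car sense rather than the animal. Real systems would need far larger tables, which is exactly where the management problem comes from.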
This isn’t just my opinion. Look at Inquira. Look at text mining products from SPSS and many others. Look at Oracle’s original text indexing technology and also at its Triplehop acquisition. For that matter, look at Sybase’s AnswersAnywhere, in which the concept network is really just an object model, in the full running-application sense of “object.” Comparing text to some sort of thesaurus or concept representation is central to enterprise text technology applications (and increasingly to web search as well).
Could one “ontology management system,” whatever that is, service multiple types of text applications? Of course it could. The ideal ontology would consist mainly of four aspects:
1. A conceptual part that’s language-independent.
2. A general language-dependent part.
3. A sensitivity to different kinds of text – language is used differently when spoken, for instance, than it is in edited newspaper articles.
4. An enterprise-specific part. For example, a company has product names, and competitors with product names, and those names have abbreviations, and so on.
Relatively little of that is application-specific; for any given enterprise, a single ontology should meet most or all of its application needs.
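As a rough illustration of how one ontology could serve several applications, the Python sketch below keeps the four aspects as separate layers and lets a search-style function and a text-mining-style function share the same lookup. The class, field and function names, the lookup order, and the sample product "Widget Pro" with its abbreviation "AWP" are all hypothetical, chosen only for this sketch.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Toy ontology with the four aspects kept as separate layers."""
    # 1. Language-independent part: concept id -> broader concept id (or None).
    concepts: dict = field(default_factory=dict)
    # 2. General language-dependent part: (language, term) -> concept id.
    lexicon: dict = field(default_factory=dict)
    # 3. Text-kind sensitivity: (text kind, term) -> concept id.
    text_kind_terms: dict = field(default_factory=dict)
    # 4. Enterprise-specific part: term or abbreviation -> concept id.
    enterprise_terms: dict = field(default_factory=dict)

    def resolve(self, term, language="en", text_kind=None):
        """Map a term to a concept id, most specific layer first."""
        key = term.lower()
        if key in self.enterprise_terms:
            return self.enterprise_terms[key]
        if (text_kind, key) in self.text_kind_terms:
            return self.text_kind_terms[(text_kind, key)]
        return self.lexicon.get((language, key))


SAMPLE = Ontology(
    concepts={"vehicle": None, "product:widget_pro": None, "future_intent": None},
    lexicon={("en", "car"): "vehicle", ("en", "auto"): "vehicle",
             ("de", "wagen"): "vehicle"},
    text_kind_terms={("spoken", "gonna"): "future_intent"},
    enterprise_terms={"widget pro": "product:widget_pro",
                      "awp": "product:widget_pro"},
)


def search_keys(query_terms, ontology):
    """Search-style use: normalize query terms to concept ids where known."""
    return [ontology.resolve(t) or t.lower() for t in query_terms]


def concept_counts(tokens, ontology, text_kind=None):
    """Text-mining-style use: count concept mentions in a token stream."""
    hits = (ontology.resolve(t, text_kind=text_kind) for t in tokens)
    return Counter(h for h in hits if h is not None)
```

Neither `search_keys` nor `concept_counts` carries any vocabulary of its own; both lean entirely on the shared `Ontology`, which is the sense in which little of the ontology is application-specific.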
Coming up: The legitimate barriers to the creation of an ontology management system market, and ideas about how to overcome them.
Categories: Enterprise search, Language recognition, Ontologies, Search engines, Speech recognition, Text mining | 5 Comments |
The text technologies market 2: It’s actually in disarray
The text technologies market should be huge and thriving. Actually, however, it’s in disarray. Multiple generations of enterprise search vendors have floundered, with the Autonomy/Verity merger being basically a combination of the weak. The RDBMS vendors came up with decent hybrid tabular/text offerings, and almost nobody cared. (Admittedly, part of the reason for that is that the best offering was Oracle’s, and Oracle almost always screws up its ancillary businesses.) Email searchability has been ridiculously bad since, well, since the invention of email. And speech technology has floundered for decades, with most of the survivors now rolled into the new version of Nuance.
Commercial text mining is indeed booming, but not to an extent that erases the overall picture of gloom. It’s at most a several hundred million dollar business, and one that’s highly fragmented. For example, at a conference on IT in life sciences not that long ago, two things became evident. First, the text mining companies were making huge, intellectually fascinating, life-saving contributions to medical research. Second, more than ten vendors were divvying up what was only around a $10 million market.
If text technology is going to achieve the prominence and prosperity it deserves, something dramatic has to change.
Categories: Enterprise search, Language recognition, Search engines, Speech recognition, Text mining | 2 Comments |
Autonomy + Verity — so what?
On some levels, the Autonomy/Verity merger makes total sense. The text search industry now has an unquestionably dominant vendor of shelfware. Somewhat less snarkily, I could say that it has a dominant OEM vendor of search technology. And while Verity’s management team has never recovered from the dizzying cycles of turnover in the 1990s, Autonomy’s obviously was quite effective. However, I see no obvious reason to believe that the combined company will actually ship good products, or ones that lead to fundamentally greater adoption for enterprise search than the fairly marginal role it plays today.
Verity and Autonomy represent different philosophies of text search — Boolean vs. concept-based, basically. Neither works very well on its own, whether in the enterprise or on the web, with concept-based being the weaker of the two. That’s why Altavista et al. failed, and Excite failed yet more completely. It’s why Verity’s text search is generally more respected, and has more hardcore users, than Autonomy’s. (Being a vastly older company than Autonomy helps a lot too, of course.)
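For readers who want the distinction in concrete terms, here is a toy Python contrast between the two styles. It is a sketch under loud assumptions: strict Boolean AND matching stands in for the Boolean philosophy, and cosine similarity over raw term counts stands in (very crudely) for concept-based ranking. Neither is Verity's or Autonomy's actual technology, and the documents and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical three-document corpus.
DOCS = {
    "d1": "brake defect reported in sedan",
    "d2": "engine fault found in truck brake line",
    "d3": "quarterly sales report for sedan dealers",
}


def tokenize(text):
    return text.lower().split()


def boolean_and(query, docs):
    """Return ids of documents that contain every query term."""
    terms = set(tokenize(query))
    return sorted(d for d, text in docs.items() if terms <= set(tokenize(text)))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def similarity_rank(query, docs):
    """Rank documents by similarity to the query, dropping zero scores."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), d) for d, text in docs.items()]
    return [d for score, d in sorted(scored, key=lambda p: (-p[0], p[1])) if score > 0]
```

For the query "brake defect", `boolean_and` returns only `d1`, while `similarity_rank` returns `d1` followed by `d2`, which mentions brakes but never says "defect". The Boolean result is precise but brittle; the similarity result is more forgiving but lets in weaker matches, which is the trade-off in miniature.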
I hope that the merged company will soon introduce some new and/or synthesized approaches to search, significantly improving the overall quality of available products. If anybody has the resources and motivation, it will. The recent boom in text data mining, and the general increase of seriousness about ontologies, at least raises the possibility that concept-oriented search will evolve into something significantly useful. But I’m not holding my breath.
Categories: Enterprise search, Ontologies, Search engines | 4 Comments |