Search engines
Analysis of search technology, products, services, and vendors. Related subjects include:
Text mining for compliance and legal discovery
One theme that keeps recurring in my talks with text mining and other text analytics/text technology companies is compliance. Ditto legal discovery, which is closely related. Most of the focus seems to be on three kinds of data:
- Vehicle defect evidence. The TREAD Act is of course the big driver here (no pun intended).
- Drug side effect evidence. The FDA is pushing that one.
- Email/correspondence archives. Text search, filtering, clustering, mining, or whatever is now a standard part of legal discovery.
Categories: Enterprise search, Search engines, Text mining | 2 Comments |
Autonomy on text mining
I asked Mike Lynch (Autonomy CEO) about text mining. He responded with an example:
A very well-known company “mines” its incoming emails for signs of trouble, not via any linguistics-driven approach, but just by clustering them. If a cluster changes size anomalously over time, it bears close investigation.
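To make the "cluster changes size anomalously" idea concrete, here is a minimal sketch. It assumes the emails have already been clustered by some other means and that per-period counts are kept for each cluster. The function name, the z-score test, and the threshold are my own illustrative choices, not anything Autonomy described.

```python
from statistics import mean, stdev

def anomalous_clusters(history, latest, threshold=3.0):
    """Flag clusters whose latest size deviates sharply from their own history.

    history: dict mapping cluster id -> list of past per-period message counts
    latest:  dict mapping cluster id -> message count in the current period
    Returns a list of (cluster id, z-score) pairs for the flagged clusters.
    """
    flagged = []
    for cluster_id, past in history.items():
        if len(past) < 2:
            continue  # not enough history to estimate a spread
        mu = mean(past)
        # Assumption: floor the spread at 1.0 so a perfectly flat history
        # does not cause a division by zero or flag tiny fluctuations.
        spread = max(stdev(past), 1.0)
        current = latest.get(cluster_id, 0)
        z = (current - mu) / spread
        if abs(z) >= threshold:
            flagged.append((cluster_id, round(z, 2)))
    return flagged

# Hypothetical weekly counts for two email clusters.
history = {
    "billing": [40, 42, 38, 41, 39],
    "shipping": [20, 22, 19, 21, 20],
}
latest = {"billing": 40, "shipping": 65}

for cluster_id, z in anomalous_clusters(history, latest):
    print(f"cluster {cluster_id!r} changed size anomalously (z = {z})")
```

Note that nothing here looks at the language of the messages. The only signal is the size of each cluster over time, which is the point of the example.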
Categories: About this blog, Autonomy, Search engines, Text mining | 1 Comment |
Update: Autonomy/Verity merger
I had a couple of very interesting calls with Autonomy last week. One message I got was that they do not want to be pigeonholed in search, which they think on the whole is a primitive way of dealing with “unstructured information.” Nonetheless, my first post based on those calls will indeed focus on text indexing and search. You see, I wrote quite skeptically about the Autonomy/Verity merger when it was announced, and I’d like to amend that with an updated opinion. Autonomy’s claims can be summarized in part by the following:
Categories: About this blog, Autonomy, Enterprise search, Search engines | Leave a Comment |
Towards an enterprise text architecture
My column this month for Computerworld is on enterprise text technology architecture. A sequel is promised for next month.
This month’s column focuses mainly on reciting application needs. Did I leave any important ones out?
Next time I’ll focus more on how to meet those needs. I need to write it in 2 1/2 weeks or so. I plan to talk with a lot of industry players between now and then.
Categories: About this blog, Ontologies, Search engines, Text mining | 4 Comments |
Google’s internal text-based project/knowledge management
Slashdot turned up an amazing article in Baseline on Google’s infrastructure. There’s lots of gee-whiz stuff in there about server farms, petabytes of disk packed into a standard shipping container so as to allow the setup of more server farms around the globe, and so on. But even more interesting to me was another point, about Google’s internal use of its own technology. In at least one case – a hybrid of project and knowledge management – Google really seems to be doing what other firms only dream about as futures. Here’s the relevant excerpt:
Categories: About this blog, Enterprise search, Google, Search engines, Specialized search | 2 Comments |
Product search fun at Tesco
Single-site websearch can be quite problematic. Here are screenshots of two examples:
Canned fish returned on a drain cleaner search
Rum returned on a condom brand search
Categories: Search engines, Specialized search | Leave a Comment |
Text analytics white papers
The Text Analytics Summit folks have created a page where you can download a bunch of white papers. Most of those are tripe, of course, but I’m finding some of them interesting. The first few Megaputer ones I looked at seemed to have non-obvious application stories. The two SPSS ones actually called “Case Study” are worth a quick glance. And the ones from Lexalytics and Intellisophic definitely seem worthy of more thought and/or research.
Edit: Also possibly worth a look on that page is a Henry Morris opus for ClearForest called “Putting the Why in BI.”
Categories: Ontologies, Search engines, Text Analytics Summit, Text mining | Leave a Comment |
Four (at least) issues in text and taxonomy federation
For any issue in text analytics, there are a lot of smart people who understand it very well (and, in particular, better than I do). But I suspect that even many of the discipline’s better thinkers have failed to grasp just what will be involved in taking text analytic technologies to the next level of usefulness and adoption. One example is what I see as the requirements for an ontology management system; I’m not aware of anybody else right now who has as expansive a view of this as I do. In this post, I’m going to address another issue whose complexity is in my opinion under-appreciated – federation. More specifically, I’ll argue that the various aspects of the federation task are a big part of what makes ontology management so complex.
The thing is – there are at least four different senses of “federation” in the text sphere, all of them important, most of them still technically unsolved. Namely:
Categories: Ontologies, Search engines | 1 Comment |
Misunderstandings of text management
I argue a lot with relational purists. On the whole they’re smart people, but they do have their blind spots.
One of the biggest is in the area of text. They fail to see how text data management is fundamentally different from tabular data management. Here’s a little article explaining why text doesn’t fit well into the relational model.
Categories: Search engines | 1 Comment |
The text technologies market 3: Here’s what’s missing
The text technologies market should be booming, but actually is in disarray. How, then, do I think it should be fixed? I think the key problem can be summed up like this:
There’s a product category that is a key component of the technology, without which it won’t come close to living up to its potential benefits. But there’s widespread and justified concern over its commercial viability. Hence, the industry cowers in niches where it can indeed eke out some success despite products that fall far short of their true potential.
The product category I have in mind, for lack of a better name, is an ontology management system. No category of text technology can work really well without some kind of semantic understanding. Automated clustering is very important for informing this understanding in a cost-effective way, but such clustering is not a complete solution – hence the relative disappointment of Autonomy, the utter failure of Excite, and so on. Rather, there has to be some kind of concept ontology that can be used to inform disambiguation. It doesn’t matter whether the application category is search, text mining, command/control, or anything else; semantic disambiguation is almost always necessary for the most precise, user-satisfying results. Maybe it’s enough to have a thesaurus – i.e., a list of synonyms. Maybe it’s enough to define “concepts” by simple vectors of word likelihoods. But you have to have something, or your search results will be cluttered, your information retrieval won’t fetch what you want it to, your text mining will have wide error bars, and your free-speech understanders will come back with a whole lot of “I’m sorry; I didn’t understand that.”
This isn’t just my opinion. Look at Inquira. Look at text mining products from SPSS and many others. Look at Oracle’s original text indexing technology and also at its Triplehop acquisition. For that matter, look at Sybase’s AnswersAnywhere, in which the concept network is really just an object model, in the full running-application sense of “object.” Comparing text to some sort of thesaurus or concept representation is central to enterprise text technology applications (and increasingly to web search as well).
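For readers who want to see what defining concepts by "simple vectors of word likelihoods" might look like, here is a toy sketch. The two concepts, their word lists, the probabilities, and the smoothing floor are all made up for illustration. No vendor named above is known to work exactly this way.

```python
import math

# Hypothetical concept definitions: each concept is a vector of word likelihoods.
CONCEPTS = {
    "jaguar (animal)": {"cat": 0.3, "jungle": 0.3, "prey": 0.2, "spots": 0.2},
    "jaguar (car)": {"engine": 0.3, "dealer": 0.3, "sedan": 0.2, "lease": 0.2},
}

def score(concept_words, context, floor=1e-3):
    """Sum of log-likelihoods of the context words under one concept.

    Words the concept does not list get a small floor probability
    (an assumed smoothing value) so that log() is always defined.
    """
    return sum(math.log(concept_words.get(word, floor)) for word in context)

def disambiguate(context, concepts=CONCEPTS):
    """Pick the concept under which the surrounding words are most likely."""
    words = [w.lower() for w in context]
    return max(concepts, key=lambda name: score(concepts[name], words))

print(disambiguate(["the", "dealer", "offered", "a", "lease"]))
print(disambiguate(["spots", "help", "it", "stalk", "prey", "in", "the", "jungle"]))
```

A thesaurus is the degenerate case of the same structure: every listed synonym gets the same weight and everything else gets none.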
Could one “ontology management system,” whatever that is, service multiple types of text applications? Of course it could. The ideal ontology would consist mainly of four aspects:
1. A conceptual part that’s language-independent.
2. A general language-dependent part.
3. A sensitivity to different kinds of text – language is used differently when spoken, for instance, than it is in edited newspaper articles.
4. An enterprise-specific part. For example, a company has product names, and competitors with product names, and those names have abbreviations, and so on.
Relatively little of that is application-specific; for any given enterprise, a single ontology should meet most or all of its application needs.
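As a rough illustration of how those four parts could sit in one shared structure, here is a sketch. The class, the field names, and especially the lookup precedence (enterprise terms first, then text-kind terms, then general language) are assumptions of mine, not a description of any existing product.

```python
from dataclasses import dataclass, field

@dataclass
class Ontology:
    # 1. Language-independent part: a set of concept identifiers.
    concepts: set = field(default_factory=set)
    # 2. General language-dependent part: term -> concept id.
    language_terms: dict = field(default_factory=dict)
    # 3. Text-kind sensitivity: (kind of text, term) -> concept id.
    register_terms: dict = field(default_factory=dict)
    # 4. Enterprise-specific part: product names, abbreviations, and so on.
    enterprise_terms: dict = field(default_factory=dict)

    def resolve(self, term, register=None):
        """Map a term to a concept id, or None if no layer knows it.

        Assumed precedence: the most specific layer wins.
        """
        key = term.lower()
        lookups = (
            (self.enterprise_terms, key),
            (self.register_terms, (register, key)),
            (self.language_terms, key),
        )
        for layer, lookup_key in lookups:
            concept = layer.get(lookup_key)
            if concept in self.concepts:
                return concept
        return None

# Hypothetical contents for a single enterprise.
onto = Ontology(
    concepts={"AUTOMOBILE", "PRODUCT_X"},
    language_terms={"car": "AUTOMOBILE"},
    register_terms={("spoken", "wheels"): "AUTOMOBILE"},
    enterprise_terms={"px": "PRODUCT_X"},
)

print(onto.resolve("car"))                       # general language layer
print(onto.resolve("wheels", register="spoken")) # text-kind layer
print(onto.resolve("PX"))                        # enterprise layer
```

Nothing in `resolve` knows whether the caller is a search engine, a text mining job, or a speech front end, which is the sense in which a single ontology could serve most of an enterprise's applications.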
Coming up: The legitimate barriers to the creation of an ontology management system market, and ideas about how to overcome them.