Categorization and filtering
Analysis of technologies that focus on the categorization and filtering of documents and text. Related subjects include:
- Any subcategory
- Text mining
- Search engines
Convera aka Excalibur aka ConQuest
Once upon a time, more than a decade before the founding of Autonomy, a New Mexico inventor had the idea for a generic pattern recognition tool. He implemented it on a PC add-in board that, if I recall correctly, plugged into the Apple II. This was the genesis of the company Excalibur Technologies.
Categories: Convera, Enterprise search, Ontologies, Search engines | 5 Comments |
Should ontology management be open sourced?
I’ve argued previously that enterprises need serious ontologies, and that this lack is holding back growth in multiple areas of text technology – search, text mining and knowledge extraction, various forms of speech recognition, and so on. The core point was:
The ideal ontology would consist mainly of four aspects:
1. A conceptual part that’s language-independent.
2. A general language-dependent part.
3. A sensitivity to different kinds of text – language is used differently when spoken, for instance, than it is in edited newspaper articles.
4. An enterprise-specific part. For example, a company has product names, it has competitors with product names, those names have abbreviations, and so on.
Categories: About this blog, Ontologies, Open source text analytics | Leave a Comment |
Towards an enterprise text architecture
My column this month for Computerworld is on enterprise text technology architecture. A sequel is promised for next month.
This month’s column focuses mainly on reciting application needs. Did I leave any important ones out?
Next time I’ll focus more on how to meet those needs. I need to write it in 2 1/2 weeks or so. I plan to talk with a lot of industry players between now and then.
Categories: About this blog, Ontologies, Search engines, Text mining | 4 Comments |
Procter & Gamble on text mining projects
Terry McFadden of Procter & Gamble made a number of interesting points in his Text Analytics Summit talk, in the area of how to build and “amass” (his word) lexicons. Above all, I’m thrilled that he recognized the necessity of amassing lexicography that can be reused from one app to the next. Beyond that, specific comments and tips included:
Categories: About this blog, ClearForest/Reuters, Companies and products, Ontologies, Text Analytics Summit, Text mining | 2 Comments |
Text analytics white papers
The Text Analytics Summit folks have created a page where you can download a bunch of whitepapers. Most of those are tripe, of course, but I’m finding some of them interesting. The first few Megaputer ones I looked at seemed to have non-obvious application stories. The two SPSS ones actually called “Case Study” are worth a quick glance. And the ones from Lexalytics and Intellisophic definitely seem worthy of more thought and/or research.
Edit: Also possibly worth a look on that page is a Henry Morris opus for ClearForest called “Putting the Why in BI.”
Categories: Ontologies, Search engines, Text Analytics Summit, Text mining | Leave a Comment |
Four (at least) issues in text and taxonomy federation
For any issue in text analytics, there are a lot of smart people who understand it very well (and, in particular, better than I do). But I suspect that even many of the discipline’s better thinkers have failed to grasp just what will be involved in taking text analytic technologies to the next level of usefulness and adoption. One example is what I see as the requirements for an ontology management system; I’m not aware of anybody else right now who has as expansive a view of this as I do. In this post, I’m going to address another issue whose complexity is in my opinion under-appreciated – federation. More specifically, I’ll argue that the various aspects of the federation task are a big part of what makes ontology management so complex.
The thing is – there are at least four different senses of “federation” in the text sphere, all of them important, most of them still technically unsolved. Namely:
Categories: Ontologies, Search engines | 1 Comment |
The text technologies market 4: Requirements for an industry-altering ontology management system
In previous posts I argued that what’s holding the text technology industry back is the lack of a viable ontology management system. The obvious objection to such a suggestion is: Who would use it? There is no business process for ontology management, even less than there is for “knowledge management,” and for that matter less than there was for “knowledge engineering” during the expert systems bubble of the 1980s. Enterprises do not have anything like a “chief ontologist.” Indeed, that job title sounds like a joke — a touchy-feely liberal-artsy nonstarter.
The only way a successful product category of ontology management systems can emerge is if the products are usable by ordinary IT personnel. Vendor-supplied product training can be required, of course. Some day there can be certifications, and maybe a single class in a computer science curriculum. But almost nobody is going to buy a product whose use requires a master’s degree in library science or “ontology management.”
So here are some very high-level requirements I think an ontology management system needs to meet.
1. Basic knowledge representation has to be flexible. It has to accommodate semantic net kinds of relationships (is_an_instance_of, is_a_subcategory_of). It also has to accommodate machine learning/statistical kinds of evidence (both positive and negative evidence).
2. There has to be strong layering/versioning. Pieces of the ontology will come from the vendor. Pieces will come from frequently-updated machine-learning exercises against an enterprise’s own corpus(es). Pieces will be added by hand, through a collaboration between IT and (at first) power users. It will have to be possible to reverse any of those pieces out, to apply different pieces for different specific applications, and so on.
3. There need to be standard, open ways for different kinds of applications to use the ontologies. UIMA could be a starting point.
4. The product needs to be industrial-strength – reliable, scalable, secure, sufficiently easy to administer, available on a sufficient range of platforms, and compliant with general standards (not just the text-specific ones).
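Requirements 1 and 2 can be made a little more concrete with a minimal sketch. Everything in it is an assumption made for illustration: the class names (Assertion, Layer, LayeredOntology), the "evidence_for" relation, the layer names, and the sample product "ExampleBook 3000" are all hypothetical, and no real product is being described. It stores both semantic-net relationships and weighted positive or negative evidence, grouped into layers that can be applied selectively or reversed out.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Assertion:
    """One statement in the ontology (hypothetical structure)."""
    subject: str
    relation: str        # e.g. "is_an_instance_of", "is_a_subcategory_of", "evidence_for"
    obj: str
    weight: float = 1.0  # negative values represent negative evidence


@dataclass
class Layer:
    """A named group of assertions that share a source."""
    name: str
    assertions: List[Assertion] = field(default_factory=list)


class LayeredOntology:
    def __init__(self) -> None:
        self._layers: Dict[str, Layer] = {}  # insertion order is preserved

    def add_layer(self, layer: Layer) -> None:
        self._layers[layer.name] = layer

    def remove_layer(self, name: str) -> None:
        """Reverse a whole layer out without touching the others."""
        self._layers.pop(name, None)

    def view(self, layer_names: Optional[Iterable[str]] = None) -> List[Assertion]:
        """Assertions visible to an application that uses only the given layers."""
        names = list(self._layers) if layer_names is None else list(layer_names)
        return [a for n in names if n in self._layers
                for a in self._layers[n].assertions]

    def evidence(self, subject: str, obj: str,
                 layer_names: Optional[Iterable[str]] = None) -> float:
        """Net statistical evidence that `subject` refers to `obj`."""
        return sum(a.weight for a in self.view(layer_names)
                   if a.relation == "evidence_for"
                   and a.subject == subject and a.obj == obj)


ontology = LayeredOntology()
ontology.add_layer(Layer("vendor", [
    Assertion("laptop", "is_a_subcategory_of", "computer"),
]))
ontology.add_layer(Layer("machine_learned", [
    Assertion("notebook", "evidence_for", "laptop", 0.8),
]))
ontology.add_layer(Layer("hand_edited", [
    Assertion("notebook", "evidence_for", "laptop", -0.5),  # negative evidence
    Assertion("ExampleBook 3000", "is_an_instance_of", "laptop"),
]))
```

An application that trusts only the machine-learned layer could call ontology.evidence("notebook", "laptop", ["machine_learned"]), while ontology.remove_layer("hand_edited") backs out every manual edit at once. A real system would also need versioning within each layer, which this sketch leaves out.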
Obviously, these requirements are nontrivial to achieve. But if some vendor does do a good job on them, the payoff could be huge. Dominance of the enterprise text technologies market – which would be a greatly expanded market – is at stake.
Categories: Ontologies | 4 Comments |
The text technologies market 3: Here’s what’s missing
The text technologies market should be booming, but actually is in disarray. How, then, do I think it should be fixed? I think the key problem can be summed up like this:
There’s a product category that is a key component of the technology, without which it won’t come close to living up to its potential benefits. But there’s widespread and justified concern over its commercial viability. Hence, the industry cowers in niches where it can indeed eke out some success despite products that fall far short of their true potential.
The product category I have in mind, for lack of a better name, is an ontology management system. No category of text technology can work really well without some kind of semantic understanding. Automated clustering is very important for informing this understanding in a cost-effective way, but such clustering is not a complete solution – hence the relative disappointment of Autonomy, the utter failure of Excite, and so on. Rather, there has to be some kind of concept ontology that can be used to inform disambiguation. It doesn’t matter whether the application category is search, text mining, command/control, or anything else; semantic disambiguation is almost always necessary for the most precise, user-satisfying results. Maybe it’s enough to have a thesaurus – i.e., a list of synonyms. Maybe it’s enough to define “concepts” by simple vectors of word likelihoods. But you have to have something, or your search results will be cluttered, your information retrieval won’t fetch what you want it to, your text mining will have wide error bars, and your free-speech understanders will come back with a whole lot of “I’m sorry; I didn’t understand that.”
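To illustrate the two simplest options mentioned above, here is a rough sketch of a thesaurus used for query expansion and of “concepts” defined as vectors of word likelihoods used for disambiguation. All of the data is made up, and the names (THESAURUS, CONCEPTS, FLOOR, expand_query, disambiguate) are hypothetical. Summing log-likelihoods is just one plausible way to score a concept against its context, not a description of any vendor’s method.

```python
import math

# A thesaurus in the simplest sense: a list of synonyms per term (made-up data).
THESAURUS = {
    "car": {"automobile", "auto"},
}

# Each concept is a vector of word likelihoods (made-up numbers).
CONCEPTS = {
    "jaguar_car": {"engine": 0.3, "dealer": 0.3, "speed": 0.2, "drive": 0.2},
    "jaguar_animal": {"jungle": 0.3, "prey": 0.3, "cat": 0.3, "speed": 0.1},
}

# Small likelihood assigned to words a concept has never been seen with.
FLOOR = 1e-3


def expand_query(terms):
    """Return the query terms plus any synonyms the thesaurus knows."""
    expanded = set(terms)
    for term in terms:
        expanded |= THESAURUS.get(term, set())
    return expanded


def disambiguate(context_words, concepts=CONCEPTS):
    """Pick the concept whose word likelihoods best explain the context."""
    def log_score(vector):
        return sum(math.log(vector.get(word, FLOOR)) for word in context_words)
    return max(concepts, key=lambda name: log_score(concepts[name]))
```

With this toy data, the context words "dealer" and "engine" select the car sense of “jaguar,” while "jungle" and "speed" select the animal sense, even though "speed" appears in both vectors.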
This isn’t just my opinion. Look at Inquira. Look at text mining products from SPSS and many others. Look at Oracle’s original text indexing technology and also at its Triplehop acquisition. For that matter, look at Sybase’s AnswersAnywhere, in which the concept network is really just an object model, in the full running-application sense of “object.” Comparing text to some sort of thesaurus or concept representation is central to enterprise text technology applications (and increasingly to web search as well).
Could one “ontology management system,” whatever that is, service multiple types of text applications? Of course it could. The ideal ontology would consist mainly of four aspects:
1. A conceptual part that’s language-independent.
2. A general language-dependent part.
3. A sensitivity to different kinds of text – language is used differently when spoken, for instance, than it is in edited newspaper articles.
4. An enterprise-specific part. For example, a company has product names, and competitors with product names, and those names have abbreviations, and so on.
Relatively little of that is application-specific; for any given enterprise, a single ontology should meet most or all of its application needs.
Coming up: The legitimate barriers to the creation of an ontology management system market, and ideas about how to overcome them.
Categories: Enterprise search, Language recognition, Ontologies, Search engines, Speech recognition, Text mining | 5 Comments |
Autonomy + Verity — so what?
On some levels, the Autonomy/Verity merger makes total sense. The text search industry now has an unquestionably dominant vendor of shelfware. Somewhat less snarkily, I could say that it has a dominant OEM vendor of search technology. And while Verity’s management team has never recovered from the dizzying cycles of turnover in the 1990s, Autonomy’s has obviously been quite effective. However, I see no obvious reason to believe that the combined company will actually ship good products, or ones that lead to fundamentally greater adoption for enterprise search than the fairly marginal role it plays today.
Verity and Autonomy represent different philosophies of text search — Boolean vs. concept-based, basically. Neither works very well on its own, whether in the enterprise or on the web, with concept-based being the weaker of the two. That’s why Altavista et al. failed, and Excite failed yet more completely. It’s why Verity’s text search is generally more respected, and has more hardcore users, than Autonomy’s. (Being a vastly older company than Autonomy helps a lot too, of course.)
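The difference between the two philosophies can be shown with a toy sketch. It is emphatically not how either vendor’s product works: Boolean matching is reduced to a plain AND over terms, and bag-of-words cosine similarity stands in, very crudely, for concept-based ranking. The documents and the function names (boolean_and, cosine, similarity_ranked) are invented for the example.

```python
import math
from collections import Counter

# Three made-up documents.
DOCS = {
    "d1": "enterprise search engine indexes documents",
    "d2": "text mining finds patterns in documents",
    "d3": "search engine ranking for the web",
}


def boolean_and(query_terms, docs=DOCS):
    """Boolean style: return ids of documents containing every query term."""
    wanted = set(query_terms)
    return sorted(doc_id for doc_id, text in docs.items()
                  if wanted <= set(text.split()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def similarity_ranked(query_terms, docs=DOCS):
    """Similarity style: rank every document that overlaps the query at all."""
    query = Counter(query_terms)
    scored = [(cosine(query, Counter(text.split())), doc_id)
              for doc_id, text in docs.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked if score > 0]
```

A query for "search" and "mining" returns nothing from boolean_and, because no document contains both terms, but similarity_ranked returns all three documents in ranked order. That is the tradeoff in miniature: the Boolean approach is precise but brittle, while the similarity approach is forgiving but prone to clutter.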
I hope that the merged company will soon introduce some new and/or synthesized approaches to search, significantly improving the overall quality of available products. If anybody has the resources and motivation, it will. The recent boom in text data mining, and the general increase of seriousness about ontologies, at least raise the possibility that concept-oriented search will evolve into something significantly useful. But I’m not holding my breath.
Categories: Enterprise search, Ontologies, Search engines | 4 Comments |
Entity-Attribute-Value everywhere?
Jon Udell’s rumination on metadata in InfoWorld reminds me how pervasive entity-attribute-value knowledge representation is becoming. You knew that, of course, because you’ve heard of “XML.” But did you also know that operating systems were gaining rich metadata representation at the file level? I must confess that I didn’t.
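For readers who haven’t run into the term, entity-attribute-value representation just means storing facts as (entity, attribute, value) triples rather than as fixed columns. Here is a minimal sketch applied to file-level metadata. The file names, attributes, and helper functions (attributes_of, entities_where) are hypothetical and are not taken from any particular operating system.

```python
from collections import defaultdict

# Each fact is one (entity, attribute, value) triple; the data is made up.
TRIPLES = [
    ("report.doc", "author", "alice"),
    ("report.doc", "topic", "ontologies"),
    ("notes.txt", "author", "bob"),
    ("notes.txt", "topic", "ontologies"),
]


def attributes_of(entity, triples=TRIPLES):
    """Collect every attribute recorded for one entity."""
    result = defaultdict(list)
    for ent, attribute, value in triples:
        if ent == entity:
            result[attribute].append(value)
    return dict(result)


def entities_where(attribute, value, triples=TRIPLES):
    """Find the entities that carry a given attribute/value pair."""
    return sorted({ent for ent, attr, val in triples
                   if attr == attribute and val == value})
```

The appeal is that a new attribute needs no schema change; it is simply another triple. The cost is that nothing stops two sources from describing the same thing with different attribute names.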
Categories: Ontologies | Leave a Comment |