Text mining

Analysis of text mining companies, technology, and trends.

June 23, 2006

The current state of text mining/analytics marketing?

One thing that didn’t go so well at the Text Analytics Summit was the marketing panel. Indeed, when we racked our brains afterward, Mary Crissey (who was on the panel) and I could only think of a single observation that was actually made about marketing. Namely, she referred to a core truth of marketing: Just selling features doesn’t work (nobody cares). Just selling benefits doesn’t work (you’re not differentiated). What you have to do is sell the connection between your features and desirable benefits.

So I’m going to try to gather some useful observations on marketing here, filling the gap that the panel left. Key questions I’d love input on include:

1. Which feature-benefit connections do you see customers easily accepting?

2. Which feature-benefit connections is it harder to get them to believe?

3. How are customers defining text analytics market segments?

4. What do they see as the key issues in each segment?

5. Which application areas are showing growth even beyond that of the market overall?

I’m particularly interested in comments from the larger vendors that are selling into multiple parts of the text mining and text analytics market. But everybody else’s input would be warmly appreciated too.

The comment thread to this post is open for business!

June 23, 2006

Notes from the Second Annual Text Analytics (formerly Text Mining) Summit

Thursday morning at the Text Analytics Summit featured, among other things, one excellent panel, a couple of lively and interesting presentations, and the usual tedious discussion of “What do we call this technology area anyway?” Lots of airtime went to industry stars such as Olivier Jouve (SPSS), Todd Wakefield (Attensity), Ramana Rao (Inxight), and Mary Crissey (SAS). There also was a gratifying repetition from the front of the room of the statement “Curt is right”, so as of when I’m writing these notes (midday) I’m happy, even if Ramana somehow neglected to join that chorus …

Repeated themes and messages included:

I’ve only slept one night in the past three, so I’ll stop here and blog more about the conference later.

June 16, 2006

Data capture for the sake of text mining

One of the major factors driving successful use of advanced analytic tools is direct initiatives to procure more data. The single best example I can think of is the gaming industry’s use of otherwise-contrived loyalty cards; improved marketing based on that data at chains like Harrah’s seems to produce upwards of 100% of total profits.

So can we apply the same approach to text mining? One place would be surveys. Rather than those annoying, contrived forms demanding we fill in a lot of choices as if we were taking the SATs all over again, maybe users would be more revealing if they could just write whatever they wanted? The obvious firm to ask is SPSS, which is big both in surveys and text mining, not to mention the intersection of the two markets. So I emailed Olivier Jouve, and he shot back an answer from an airport.

June 13, 2006

Text analytics white papers

The Text Analytics Summit folks have created a page where you can download a bunch of whitepapers. Most of those are tripe, of course, but I’m finding some of them interesting. The first few Megaputer ones I looked at seemed to have non-obvious application stories. The two SPSS ones actually called “Case Study” are worth a quick glance. And the ones from Lexalytics and Intellisophic definitely seem worthy of more thought and/or research.

Edit: Also possibly worth a look on that page is a Henry Morris opus for ClearForest called “Putting the Why in BI.”

June 10, 2006

DISCOUNT on Text Analytics Summit, June 22-23, 2006

Last year I blogged glowingly about the first annual Text Mining Summit. Now it’s time for the second annual conference, with the name changed to Text Analytics Summit. It is in Boston (next to the Prudential Center/Hynes Auditorium complex) on June 22-23, 2006. From the descriptions*, it sounds a lot like last year’s, with the same high-level industry attendance. Indeed, in many cases there will be exactly the same high-level industry speakers …

*Unlike last year, I wasn’t involved in the planning, and am not going to be a speaker.

Anyhow, I plan to be there. Even better, I am empowered to offer a discount to anybody else who cares to attend. I am told that if you register online, there will be a discount box, and if you enter “CAM” you will receive a $100 reduction in your registration fee.

December 11, 2005

The text technologies market 3: Here’s what’s missing

The text technologies market should be booming, but actually is in disarray. How, then, do I think it should be fixed? I think the key problem can be summed up like this:

There’s a product category that is a key component of the technology, without which it won’t come close to delivering its potential benefits. But there’s widespread and justified concern over its commercial viability. Hence, the industry cowers in niches where it can indeed eke out some success despite products that fall far short of their true potential.

The product category I have in mind, for lack of a better name, is an ontology management system. No category of text technology can work really well without some kind of semantic understanding. Automated clustering is very important for informing this understanding in a cost-effective way, but such clustering is not a complete solution; hence the relative disappointment of Autonomy, the utter failure of Excite, and so on. Rather, there has to be some kind of concept ontology that can be used to inform disambiguation. It doesn’t matter whether the application category is search, text mining, command/control, or anything else; semantic disambiguation is almost always necessary for the most precise, user-satisfying results. Maybe it’s enough to have a thesaurus, i.e., a list of synonyms. Maybe it’s enough to define “concepts” by simple vectors of word likelihoods. But you have to have something, or your search results will be cluttered, your information retrieval won’t fetch what you want it to, your text mining will have wide error bars, and your free-speech understanders will come back with a whole lot of “I’m sorry; I didn’t understand that.”
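To make the "simple vectors of word likelihoods" idea concrete, here is a minimal sketch. The concept names and weights are invented for illustration and do not come from any vendor's product. An ambiguous term is resolved by scoring each candidate concept against the words that surround it:

```python
from collections import Counter

# Hypothetical concept vectors: word -> likelihood weight (invented values).
CONCEPTS = {
    "jaguar_animal": {"cat": 0.4, "jungle": 0.3, "prey": 0.2, "zoo": 0.1},
    "jaguar_car": {"engine": 0.4, "dealer": 0.3, "sedan": 0.2, "drive": 0.1},
}

def disambiguate(context_words, concepts=CONCEPTS):
    """Return the concept whose vector best matches the context, or None if nothing matches."""
    counts = Counter(w.lower() for w in context_words)
    # Score each concept as the weighted count of its words seen in the context.
    scores = {
        name: sum(weight * counts[word] for word, weight in vector.items())
        for name, vector in concepts.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Calling `disambiguate("the dealer tuned the engine".split())` picks the car sense, while a context with no known words returns `None`, which is roughly the "I didn't understand that" case. A real system would presumably learn the weights from text rather than hard-code them.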

This isn’t just my opinion. Look at Inquira. Look at text mining products from SPSS and many others. Look at Oracle’s original text indexing technology and also at its Triplehop acquisition. For that matter, look at Sybase’s AnswersAnywhere, in which the concept network is really just an object model, in the full running-application sense of “object.” Comparing text to some sort of thesaurus or concept representation is central to enterprise text technology applications (and increasingly to web search as well).

Could one “ontology management system,” whatever that is, service multiple types of text applications? Of course it could. The ideal ontology would consist mainly of four aspects:

1. A conceptual part that’s language-independent.
2. A general language-dependent part.
3. A sensitivity to different kinds of text – language is used differently when spoken, for instance, than it is in edited newspaper articles.
4. An enterprise-specific part. For example, a company has product names, and competitors with product names, and those names have abbreviations, and so on.

Relatively little of that is application-specific; for any given enterprise, a single ontology should meet most or all of its application needs.
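One way to picture the four-part structure is as a shared concept table with three term layers on top of it. The sketch below is only an illustration under that assumption; the class name, field names, concept IDs, and sample entries are all hypothetical, not a description of any existing ontology management product:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Ontology:
    # 1. Language-independent part: concept ID -> description.
    concepts: Dict[str, str] = field(default_factory=dict)
    # 2. General language-dependent part: term -> concept ID.
    general_language: Dict[str, str] = field(default_factory=dict)
    # 3. Text-type part (e.g. spoken vs. edited usage): term -> concept ID.
    text_type: Dict[str, str] = field(default_factory=dict)
    # 4. Enterprise-specific part (product names, abbreviations): term -> concept ID.
    enterprise: Dict[str, str] = field(default_factory=dict)

    def lookup(self, term: str) -> Optional[str]:
        """Resolve a term to a concept ID, most specific layer first."""
        key = term.lower()
        for layer in (self.enterprise, self.text_type, self.general_language):
            concept_id = layer.get(key)
            if concept_id in self.concepts:
                return concept_id
        return None

# Invented sample data.
shared = Ontology(
    concepts={"C_PRICE": "cost of a product", "C_PRODUCT_X": "flagship product"},
    general_language={"price": "C_PRICE", "cost": "C_PRICE"},
    text_type={"pricey": "C_PRICE"},
    enterprise={"px": "C_PRODUCT_X", "product x": "C_PRODUCT_X"},
)
```

Because search, text mining, and command/control applications would all call the same `lookup`, nothing in the structure is tied to one application, which is the sense in which a single ontology could serve all of them.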

Coming up: The legitimate barriers to the creation of an ontology management system market, and ideas about how to overcome them.

December 9, 2005

The text technologies market 2: It’s actually in disarray

The text technologies market should be huge and thriving. Actually, however, it’s in disarray. Multiple generations of enterprise search vendors have floundered, with the Autonomy/Verity merger being basically a combination of the weak. The RDBMS vendors came up with decent hybrid tabular/text offerings, and almost nobody cared. (Admittedly, part of the reason for that is that the best offering was Oracle’s, and Oracle almost always screws up its ancillary businesses.) Email searchability has been ridiculously bad since, well, since the invention of email. And speech technology has floundered for decades, with most of the survivors now rolled into the new version of Nuance.

Commercial text mining is indeed booming, but not to an extent that erases the overall picture of gloom. It’s at most a several hundred million dollar business, and one that’s highly fragmented. For example, at a conference on IT in life sciences not that long ago, two things became evident. First, the text mining companies were making huge, intellectually fascinating, life-saving contributions to medical research. Second, more than ten vendors were divvying up what was only around a $10 million market.

If text technology is going to achieve the prominence and prosperity it deserves, something dramatic has to change.

December 9, 2005

The text technologies market 1: It should be huge

From a number of standpoints, the market for enterprise technologies that explicitly* manage text SHOULD be huge. Consider:

1. The market for consumer text search is huge — think of Google.

2. The market for implicit* management of text is huge. Email management is a significant fraction of the IT budget, if you factor in the predominance of email in the use of networks and PCs. Now regulations are compelling email to be stored and managed at great expense. People spend hours per day working on email, word processors, etc.

3. The text mining market has recently boomed, and good ROI appears to be the norm.

*My implicit vs. explicit distinction here is meant to distinguish technologies that manage text as some sort of BLOB or other blob vs. technologies that take account of the fact that it is text, which contains words, phrases, synonyms, and so on.

If text technologies could live up to researchers’ dreams, typical knowledge workers would save hours per week and in many cases hours per day. The benefits would rival at least those of the whole PC/office productivity/messaging set of technologies. Thus, at least in theory, the market potential for these technologies is enormous.

October 19, 2005

Linkage among different text technologies

The first post in this blog describes its subject as “a group of interrelated, linguistics-based technology sectors, including text mining, search, speech recognition, and text command-and-control.” So I might as well kick off the discussion by summarizing some reasons why I think these sectors really are connected to each other. Very quickly:

IBM says so, and nobody (that I know of) is contradicting them.
The essence of the UIMA story is that a lot of different pieces of technology need to be swapped in and out, not just among different brands of the same text applications, but among different kinds of text app. The vendors I checked with are uniformly skeptical about whether UIMA will have a real market impact, but none disputes UIMA’s underlying premise.

The tokenization chain. This general industry agreement is only one of three major reasons I believe in the general UIMA premise (while sharing the skepticism about that particular framework’s early adoption). A second dates back to when I was first learning about text search. At a Verity User Conference in, I think, April 1997, I had a very interesting conversation about Verity’s new architecture. (Probably with Phil Nelson, maybe with somebody else, such as Hugh Njemanze or Nick Arnett.) Basically, the system had been modularized, and the way it had been modularized was to create a flow of tokenization after tokenization after tokenization. The third reason is the observation that Inxight, so central to the tokenization strategies of text search vendors, plays pretty much the same role for the text mining companies.
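The "tokenization after tokenization" flow can be sketched as a chain of small stages, each taking a list of tokens and returning a new one. This is an illustrative assumption about what such a modular design looks like in general, not Verity's actual architecture or UIMA's API; the stage names are made up:

```python
from functools import reduce

def split_words(tokens):
    """Break each incoming token on whitespace."""
    return [word for token in tokens for word in token.split()]

def lowercase(tokens):
    """Normalize case so later stages can match terms consistently."""
    return [token.lower() for token in tokens]

def strip_punctuation(tokens):
    """Trim leading/trailing punctuation and drop tokens that become empty."""
    cleaned = (token.strip(".,;:!?\"'()") for token in tokens)
    return [token for token in cleaned if token]

def run_chain(text, stages):
    """Feed the text through each stage in order; the output of one is the input of the next."""
    return reduce(lambda tokens, stage: stage(tokens), stages, [text])
```

Since every stage has the same list-in, list-out shape, a stage from one vendor or one kind of text application can be swapped for another without touching the rest of the chain, e.g. `run_chain(text, [split_words, lowercase, strip_punctuation])`.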

The centrality of concept ontologies. I don’t currently have an opinion about the Semantic Web, but in a more limited sense it’s clear that ontologies will rule text applications. Whether for search, text data mining, or application command/control, it just doesn’t suffice to identify, find, weigh, or respond to individual words. Rather, you need to add other words indicating similar meaning – or a similar user “intent” — into the mix.*

This is a big deal, because simple-minded ontologies don’t work. They can’t just be automatically generated, and they can’t just be hand-built. They can’t just be custom to each user or user enterprise, but they also can’t be provided entirely by technology vendors. Almost no large enterprises currently have a good system of ontology building and management, but in the near future most will have to. Evolution in this area will be a crucial determinant of how multiple text technology submarkets are shaped.

In particular, this is a big enough deal that I think search and text data mining and other text technologies will, for each enterprise, tend to use the same ontology.

*Note: There’s a whole other question as to how long we’ll be able to get by just looking at semantics, or whether syntactic analysis absolutely also should be in the mix. But first things first; without a good ontology, syntactic analysis is a pretty hopeless endeavor.
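The point about adding words of similar meaning into the mix can be illustrated with the simplest possible case, thesaurus-based query expansion. The thesaurus entries below are invented, and this is a sketch of the general idea only, not of how any particular search or text mining product works:

```python
# Hypothetical thesaurus: each term maps to terms of similar meaning (invented entries).
THESAURUS = {
    "car": {"automobile", "vehicle"},
    "cheap": {"inexpensive", "affordable"},
}

def expand_query(query, thesaurus=THESAURUS):
    """Return the query terms plus their listed synonyms, in first-seen order."""
    expanded = []
    for term in query.lower().split():
        # Keep the original term first, then its synonyms in alphabetical order.
        for candidate in [term, *sorted(thesaurus.get(term, ()))]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded
```

A flat synonym list like this is exactly the kind of simple-minded ontology that falls short on its own: it expands "cheap" the same way whether the user means low price or low quality, which is where disambiguation against a richer concept model comes in.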

The use of text data mining in other areas. The automated part of the ontology building process involves a lot of text data mining. Large search engine companies generally do a lot of data mining to establish and validate tweaks to their search algorithms. The same goes for spam filters and more questionable forms of censorware. You can’t act intelligently without learning, and machines don’t learn well without doing statistical analyses.

I hope to post soon on each of these issues at more length, and I encourage comments on any of them as inputs to further work. But for now, I’ll just claim to have provided strong evidence for my initial point: Seemingly different text technologies are indeed closely related.
