Text mining – fact and fiction
Text mining is science-project artificial intelligence. Fiction. Text mining is proven in many practical applications.
To implement text mining, you need computational linguists. Fact. Monash’s Second Law of Commercial Semantics states “Where there are ontologies, there is consulting.” And it’s linguists, or reasonable facsimiles of same, who do the consulting.
To use text mining, you need computational linguists. Fiction. When last I counted, the number of known computational linguists working for end-user organizations, worldwide, was precisely 1, at Procter & Gamble. (Intelligence agencies excepted, of course.) I’d guess it’s higher now, but I probably could still count them all without taking my socks off.
CRM applications are driving the growth of text mining. Fact. Most current growth in text mining seems to come from Voice of the Customer and Voice of the Market/competitive intelligence applications. And a couple of years ago, when SAS and SPSS had a joint boom in text mining, a lot of that was coming from CRM.
Text mining products are useful mainly for large enterprises. More fact than fiction. Text mining makes the most sense when you have too much text for humans to read and summarize.
Text mining doesn’t fit well with relational databases. Fiction. The fastest-growing text mining companies seem to be Attensity and Clarabridge, who consistently extract textual information into relational databases.
Text mining imposes structure on unstructured* data. More fact than fiction. Most text mining applications involve examining free-text documents and creating entries in relational or XML databases. Most people would call that a transition from unstructured to structured form.
*I still don’t like the “structured/unstructured” distinction, but with repetition I’m getting somewhat inured to it.
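To make the "free text in, relational rows out" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the sample comments, the tiny word lists, and the `mentions` table are invented, and real products do serious linguistic extraction rather than keyword matching. The point is only the shape of the pipeline: once the facts are in a table, ordinary SQL takes over.

```python
import re
import sqlite3

# Hypothetical customer comments, invented for illustration.
COMMENTS = [
    (1, "The battery died after two days. Very disappointed."),
    (2, "Love the screen, but the battery is weak."),
    (3, "Shipping was fast and the screen looks great."),
]

# Toy lexicons standing in for real linguistic extraction (an assumption).
TOPICS = {"battery", "screen", "shipping"}
NEGATIVE = {"died", "disappointed", "weak"}
POSITIVE = {"love", "fast", "great"}


def extract_rows(doc_id, text):
    """Yield one (doc_id, topic, score) row per topic word found in each clause."""
    for clause in re.split(r"[.,;!?]| but ", text.lower()):
        words = set(re.findall(r"[a-z]+", clause))
        # Clause score: positive word count minus negative word count.
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        for topic in sorted(words & TOPICS):
            yield (doc_id, topic, score)


def build_db(comments):
    """Load the extracted rows into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mentions (doc_id INTEGER, topic TEXT, sentiment INTEGER)"
    )
    for doc_id, text in comments:
        conn.executemany(
            "INSERT INTO mentions VALUES (?, ?, ?)", extract_rows(doc_id, text)
        )
    return conn


def topic_scores(conn):
    """Plain SQL over the extracted facts: net sentiment per topic."""
    return conn.execute(
        "SELECT topic, SUM(sentiment) FROM mentions GROUP BY topic ORDER BY topic"
    ).fetchall()


if __name__ == "__main__":
    print(topic_scores(build_db(COMMENTS)))
```

Run on the three sample comments, the battery comes out net negative while the screen and shipping come out net positive, which is the kind of rollup a Voice of the Customer report is built on.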
Enterprise search is an alternative to text mining. Fact. You can use a high-end search engine to cluster documents and look for trends and insight. It’s not the real McCoy, but in some cases it gives you 80% of the benefit of the real thing.
Text mining is an ingredient, not a product category. Part fact, part fiction. The biggest text mining efforts in the world are probably at Google, Yahoo, Microsoft search, and Dow Jones/Factiva. Antispam vendors also invest a lot in text mining. Two of the top five independent text mining vendors were acquired this year (ClearForest and Inxight). And of the many dozens of small text mining independents, most are focused on specific niches.
Even so, Attensity, Clarabridge, and Temis show that, at least for now, text mining remains a legitimate product category.
The text mining industry is in trouble. Part fact, part fiction. As I recently ranted, even the leading text mining vendors are letting many opportunities pass them by. And like many software sectors, text mining seems poised to be absorbed via large-company acquisition. SAP has already secured a text mining business via BOBJ/Inxight, but at least one vendor each could easily be bought by Oracle, Microsoft (despite the in-house expertise from its search arm), and IBM (despite or even in connection with UIMA).
But in the meantime, a few small text mining vendors are still showing rapid growth.
Previous “fact and fiction” post: Data warehouse appliances.
Categories: Text mining | 5 Comments |
Scout Labs – yet more public-facing sentiment analysis
Scout Labs sounds like even more of what I was thinking of than Summize. It’s a shame that the “traditional” text mining vendors didn’t get there first.
Categories: Competitive intelligence, Text mining | 2 Comments |
The text mining vendors continue to lack constructive vision
I’ve been thinking for a long time that the various text mining companies doing sentiment analysis should try some public-facing (or at least multi-customer) services. Investors might love such a thing. So might marketing managers (actually, Factiva claims to be active there, at least as per their web site). And as a key part of the strategy, text mining companies selling to enterprises might brand such a site and gain massive awareness accordingly. Well, it seems that public-facing sentiment analysis sites are springing up. At least, Summize has. (Hat tip to TechCrunch.) And the text mining vendors are nowhere to be seen.
So what else is new? Read more
Categories: Application areas, Factiva/Dow Jones, Investment research and trading, Text mining | 1 Comment |
Attivio tries to do it all
When Andrew McKay was at FAST, I grumped about his search/BI integration story. Now that he’s trying to do the same thing at a startup called Attivio, it sounds more plausible.
Attivio is having a house party and product rollout in the latter part of January, and details are scarce in the meantime. But here are some highlights.
- Attivio was founded in August. It has 21 people and 1 VC. The VC has invested >$6 million and committed >$12 million total.
- Attivio has ambitious plans for a fully integrated data management/real-time BI stack. It’s currently called the “Active Intelligence Engine.” Read more
Categories: Attivio, BI integration, Investment research and trading, Lucene, Open source text analytics | 4 Comments |
Thoughts from an overview of technology marketing
As part of the Monash Advantage program, I published a proprietary Monash Letter about online marketing … and another one … and some further stuff so proprietary I’m not even putting out teasers about it. Now I’ve taken the next step, and written another Letter with a complete overview of software-centric marketing strategy and tactics (lead generation aside). That’s proprietary too, and only available in full if you have access to the secure Monash Advantage website, but here are some semi-random highlights for public consumption. Read more
Categories: Online marketing | 1 Comment |
Russian chatbot apparently passes Turing test
Ina Fried reports on a Russian chatbot that sure sounds like it passes the Turing test. To wit (emphasis mine):
A program that can mimic online flirtation and then extract personal information from its unsuspecting conversation partners is making the rounds in Russian chat forums, according to security software firm PC Tools.
The artificial intelligence of CyberLover’s automated chats is good enough that victims have a tough time distinguishing the “bot” from a real potential suitor, PC Tools said. The software can work quickly too, establishing up to 10 relationships in 30 minutes, PC Tools said.
That said, threat reports from PC security companies are notoriously hyped, so I wouldn’t get too excited until there’s stronger confirmation. Read more
Categories: Social software and online media | 11 Comments |
Windows Live search is rather different from MSN
Until the middle of this year, I got negligible search engine traffic from either MSN or Yahoo, or indeed any other search engine except Google. We’re literally talking a 90-95% share for Google, on each of my three main blogs, most months.
But in November, the Windows Live share was 19% on DBMS2, 29% on Text Technologies, and 41% on the Monash Report. And those aren’t blips; in each case there was steady August-November monthly growth. But on the other hand, early December month-to-date figures are all back down. Weird. Read more
QL2 – web text extraction and more
Here are some highlights of the QL2 story, per exec Mike McDermott.
- QL2’s main business is scraping price and other product offering data from the web for high-speed competitive analysis. For example, of their 250ish customers overall, over 90 are airlines. Online retailers are another big chunk of their customer base.
- QL2 also commonly partners with text mining companies in applications such as Voice of the Market or competitive intelligence. E.g., QL2 has been brought into a few deals each by Attensity, Clarabridge, and especially Temis.
- QL2 goes well beyond basic crawling. Notably, the system fills in forms with parameters. And of course it monitors pages for changes.
- QL2’s scripting language is, Mike tells me, very SQL-like. Hence the “QL” in the name.
- QL2 rolls its own filters, rather than using INSO or whoever. (Actually, what are the main file-reading filter choices these days? I’ve lost track.) Indeed, Mike fondly believes QL2 does a better job with PDFs than Adobe does.
- QL2 doesn’t want to be thought of as web-only. Rather, Mike likes my formulation of “text data ETL, web or otherwise.” That said, he freely admits QL2’s strength is in Extract rather than in Transform or Load.
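For readers who want a feel for two of the capabilities above, filling in forms with parameters and monitoring pages for changes, here is a rough Python sketch. It says nothing about how QL2 actually does either one; the function names, the `example.com` URL, and the fare snippets are all hypothetical, and a real system would also have to handle POST forms, sessions, and JavaScript.

```python
import hashlib
import re
from urllib.parse import urlencode


def build_query_url(form_action, params):
    """Build the URL a simple GET form would request for the given field values."""
    return form_action + "?" + urlencode(params)


def fingerprint(html):
    """Hash a page's visible text, ignoring tags and runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping, fine for a sketch
    text = " ".join(text.split())         # collapse whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed(old_html, new_html):
    """True if the visible text differs between two snapshots of a page."""
    return fingerprint(old_html) != fingerprint(new_html)


if __name__ == "__main__":
    # Hypothetical fare-search form on a made-up site.
    print(build_query_url("https://example.com/fares", {"from": "BOS", "to": "SFO"}))
    print(changed("<p>Fare: $199</p>", "<div>Fare: $199</div>"))  # markup-only edit
    print(changed("<p>Fare: $199</p>", "<p>Fare: $189</p>"))      # price moved
```

Hashing only the visible text is a deliberate choice here: a competitor redesigning their page template shouldn't register as a price change, while a changed number should.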
Categories: Application areas, Competitive intelligence, QL2, Text mining | Leave a Comment |
Danny Sullivan thinks blended vertical search is a game-changer
Danny Sullivan thinks blended vertical search — which he’s calling Search 3.0 — is a game changer. (In this context, “vertical” search denotes alternate result types such as video, image, map coordinates, or product listings.) In saying that, he’s focused on search marketers, who now have a lot more ways to try to get their messages onto Google searchers’ top result pages. But I presume what he’s really saying is that there will be a feedback effect — if Google tells all web searchers about videos and product listings, then internet marketers will be more motivated to post videos and product listings, and hence there will be more interesting choices of videos and product listings — which Google will naturally wind up featuring more prominently in its search results. And so on.
Given the YouTube explosion, I find it hard to argue with his claim.
Categories: Google, Search engine optimization (SEO), Search engines, Specialized search, Structured search | Leave a Comment |
So what’s the state of speech recognition and dictation software?
Linda asked me about the state of desktop dictation technology. In particular, she asked me whether there was much difference between the latest version and earlier, cheaper ones. My knowledge of the area is out of date, so I thought I’d throw both the specific question and the broader subject of speech recognition out there for general discussion.
Here’s much of what I know or believe about speech recognition:
- Most major independent commercial speech recognition efforts have wound up being merged into Nuance Communications. That goes for both desktop and server-side stuff. None was doing particularly well before its respective merger. Read more