Text mining
Analysis of text mining companies, technology, and trends.
Enterprise-specific web search: High-end web search/mining appliances?
OK. I have a vision of one way search could evolve, which I think deserves consideration on at least a “concept-car” basis. This is all speculative; I haven’t discussed it at length with the vendors who’d need to make it happen, nor checked the technical assumptions carefully myself. So I could well be wrong. Indeed, I’ve at least half-changed my mind multiple times this weekend, just in the drafting of this post. Oh yeah, I’m also mixing several subjects together here. All in all, this is not my crispest post …
Anyhow, the core idea is that large enterprises spider and index a subset of the Web, and use that for most of their employees’ web search needs. Key benefits would include:
- Filtering out spam hits. This is obviously important for search, and in some cases could help with public-web text mining as well. It should be OK to be more aggressive on spam-site filtering in an enterprise-specific index than it is in general web search.
- Filtering out malicious/undesirable downloads of various sorts. I’m thinking mainly of malware/spyware here, but of course it can also be used for netnannying, porn-prevention, and the like. Again, this is more easily done for the enterprise market than for the search world at large. (In any case, I think that Google could blow Websense out of the water any time they wanted to, except, of course, for the not-so-small matter of not being seen as participating in the censorship business. But that’s a separate discussion.)
- Capturing employees’ search strings. This could be useful for various purposes, including discerning their interests, and building the corporate ontology for internal web search.
- Freshness control. If there’s a site you really care about, you can make sure it’s re-indexed frequently.
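To make the concept-car a bit more concrete, here is a minimal Python sketch of the policy layer such an appliance might carry. Everything in it is an assumption for illustration: the class name `EnterpriseCrawlPolicy`, its fields, and the example domains are hypothetical, and no vendor's product is being described. It shows three of the benefits above: aggressive domain filtering, per-site freshness control, and capture of search strings.

```python
from collections import Counter
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class EnterpriseCrawlPolicy:
    """Hypothetical policy for an enterprise-specific web index (illustration only)."""
    blocked_domains: set = field(default_factory=set)   # spam, malware, unwanted sites
    recrawl_hours: dict = field(default_factory=dict)   # domain -> re-index interval
    default_recrawl_hours: float = 168.0                # assumed default: weekly
    query_log: list = field(default_factory=list)       # captured (user, query) pairs

    @staticmethod
    def _host(url: str) -> str:
        # hostname is None for URLs without a scheme; fall back to an empty string
        return (urlparse(url).hostname or "").lower()

    def allows(self, url: str) -> bool:
        """Reject a blocked domain and any of its subdomains."""
        host = self._host(url)
        return not any(host == d or host.endswith("." + d)
                       for d in self.blocked_domains)

    def filter_hits(self, urls):
        """Drop search hits that point at blocked sites."""
        return [u for u in urls if self.allows(u)]

    def is_due(self, url: str, hours_since_last_crawl: float) -> bool:
        """Freshness control: sites you care about get a shorter interval."""
        interval = self.recrawl_hours.get(self._host(url),
                                          self.default_recrawl_hours)
        return hours_since_last_crawl >= interval

    def record_query(self, user: str, query: str) -> None:
        """Capture a search string for later analysis."""
        self.query_log.append((user, query))

    def top_queries(self, n: int = 10):
        """Most frequent search strings, e.g. as raw material for an ontology."""
        return Counter(q for _, q in self.query_log).most_common(n)


policy = EnterpriseCrawlPolicy(
    blocked_domains={"spam.example"},
    recrawl_hours={"news.example": 1.0},
)
hits = policy.filter_hits([
    "http://spam.example/a",
    "http://www.spam.example/b",
    "http://news.example/c",
])
policy.record_query("alice", "text mining vendors")
policy.record_query("bob", "text mining vendors")
```

The point of keeping the blocklist and recrawl intervals as plain per-enterprise data is that each company can be as aggressive as it likes, which is exactly what a general-purpose web search engine cannot do.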
Categories: Categorization and filtering, Convera, Enterprise search, FAST, Google, IBM and UIMA, Search engines, Spam and antispam, Specialized search, Text mining, Website filtering | 1 Comment |
KXEN is getting into text mining
Data mining challenger KXEN is getting into text mining, and they’re writing all their own stuff. Not even any Inxight filters. Weird. It will be interesting to see if they stick with that plan.
EDIT: Actually, upon reviewing an e-mail I see that their text mining features are in beta already. So I guess they stuck with the plan, at least for Release 1.
Categories: BI integration, Text mining | Leave a Comment |
Business Objects’ perspective on text mining (and search)
I had a call with Business Objects, mainly about their overall EIM/ETL product line (Enterprise Information Management, a superset of Extract/Transform/Load). But I took the opportunity to ask about their deal with Attensity. (Attensity themselves posted more about the relationship, including some detailed links, here.) It actually sounds pretty real. They also mentioned that there seem to be a bunch of startups proposing search as a substitute for data warehousing, much as FAST sometimes likes to.
Categories: Attensity, BI integration, Search engines, Text mining | 1 Comment |
Text mining into big data warehouses
I previously noted that Attensity seemed to be putting a lot of emphasis on a partnership with Business Objects and Teradata, although due to vacations I’ve still failed to get anybody from Business Objects to give me their view of the relationship’s importance. Now Greenplum tells me that O’Reilly is using their system to support text mining (apparently via homegrown technology), although I wasn’t too clear on the details. I also got the sense Greenplum is doing more in text mining, but the details of that completely escaped me. 🙁
It’s just a couple of data points, but I feel a trend here.
Categories: BI integration, Text mining | 3 Comments |
More on free-form text surveys
I’m a huge fan of the idea that companies should deliberately capture as much information as possible for analysis. In the case of text, since I personally hate structured survey forms, I believe that free-form surveys have the potential to capture a lot more information than the traditional Procrustean abominations do. SPSS indicated that there’s indeed some activity in this regard.
I found another example.
Categories: ClearForest/Reuters, SPSS, Text mining | 1 Comment |
Principles of enterprise text technology architecture
My August Computerworld column starts where July’s left off, and suggests principles for enterprise text technology architecture. This will not run Monday, August 7, as I was originally led to believe, but rather in my usual second-Monday slot, namely August 14. Thus, I finished it a week earlier than necessary, and I apologize to those of you I inconvenienced with the unnecessary rush to meet that deadline.
The principles I came up with are:
- Deploy search widely across the enterprise.
- It’s OK for your text data to be distributed across a range of silos.
- Integrate fact extraction/text mining aggressively into your predictive analytics and dashboards.
- Having a preferred enterprise text technology tool suite is nice, but accept that there will probably be lots of departmental exceptions.
- Reinvent your customer communication (and other) processes to exploit text technologies.
- Integrate your taxonomies.
I’ll provide a link when the column is actually posted.
Categories: Enterprise search, Ontologies, Search engines, Text mining | 1 Comment |
UIMA data point
While talking with Attensity today about much else, I asked them about UIMA. What they said is not inconsistent with what I heard from IBM itself. According to Attensity:
- A year ago almost no customers cared about UIMA.
- Now UIMA is regularly showing up on government RFPs.
- Private sector interest in UIMA is still very limited.
Categories: Open source text analytics, Text mining | Leave a Comment |
More on Attensity
I had a long and far-ranging talk today with Attensity. Key takeaways included:
- They want to be positioned in the BI space, e.g. as “ETL for text/unstructured data.” They place a lot of value on their partnership with Business Objects and Teradata. And, as a key part of the exhaustive extraction/FRN story, they think that BI/data warehouse information roll-up tools are an excellent (if imperfect) substitute for hardcore semantic extraction.
- Attensity 4 is coming out soon, and it’s going to feature extremely multimodal text analysis.
Categories: Attensity, Text mining | 3 Comments |
Application processes in text mining – finding warning signs
Sergei Ananyan’s claim that analytic business processes involving text are still very primitive is absolutely correct. Indeed, analytic business processes have a lot of maturing to do overall. Still, there’s one area where the industry has devoted a lot of thought over the past few years, and some notion of process has emerged. This is in the finding of warning signs.
Categories: Application areas, Text mining | 6 Comments |
Pioneers moving on
Ramana Rao is leaving Inxight, or has by now. Today I also discovered that Todd Wakefield is leaving Attensity. Such things happen in all industries, of course.
Categories: Attensity, Business Objects and Inxight, Text mining | Leave a Comment |