Attensity, extractive exhaustion, and the FRN
Two of the clearest and most charismatic speakers in the text mining business are Attensity cofounders Todd Wakefield and David Bean. Last year, Todd’s Text Mining Summit speech gave an excellent overview of the various application areas in which text mining was being adopted; vestiges of that material may be found in a blog post I made at the time, and on Attensity’s web site. This time, David’s Text Analytics Summit speech was basically a pitch for Attensity’s latest product release – and it was a pitch well worth hearing.
The basic story is that selective fact extraction from text is a knowledge-engineering-intensive process. You need to determine which facts to extract, and then determine how to extract those particular kinds of facts. So Attensity has a better idea: it will extract all facts, not just some, and dump them in a “fact relationship network” (FRN). The FRN is two relational tables, one for facts and one for relationships, suitable for copying to a Teradata machine. Attensity calls this “exhaustive extraction.”
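To make the two-table idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not Attensity's actual schema, which the talk did not spell out. The table names (`fact`, `relationship`), every column, the `followed_by` relationship type, and the sample warranty-style rows are all assumptions made up for illustration.

```python
import sqlite3

def build_frn(conn):
    """Create a hypothetical two-table layout: one for facts, one for relationships."""
    conn.executescript("""
        CREATE TABLE fact (
            fact_id  INTEGER PRIMARY KEY,
            doc_id   TEXT NOT NULL,   -- which document the fact came from
            subject  TEXT,
            verb     TEXT,
            object   TEXT
        );
        CREATE TABLE relationship (
            rel_id        INTEGER PRIMARY KEY,
            from_fact_id  INTEGER NOT NULL REFERENCES fact(fact_id),
            to_fact_id    INTEGER NOT NULL REFERENCES fact(fact_id),
            rel_type      TEXT NOT NULL
        );
    """)

def load_example(conn):
    """Insert a few invented rows standing in for extractor output."""
    conn.executemany(
        "INSERT INTO fact (fact_id, doc_id, subject, verb, object) VALUES (?, ?, ?, ?, ?)",
        [
            (1, "claim-001", "brake pedal", "vibrate", None),
            (2, "claim-001", "dealer", "replace", "rotor"),
            (3, "claim-002", "engine", "stall", None),
        ],
    )
    conn.executemany(
        "INSERT INTO relationship (from_fact_id, to_fact_id, rel_type) VALUES (?, ?, ?)",
        [(1, 2, "followed_by")],
    )

def facts_linked_to(conn, fact_id):
    """Return the facts that a given fact points to, with the relationship type."""
    return conn.execute(
        """
        SELECT f.subject, f.verb, f.object, r.rel_type
        FROM relationship AS r
        JOIN fact AS f ON f.fact_id = r.to_fact_id
        WHERE r.from_fact_id = ?
        ORDER BY f.fact_id
        """,
        (fact_id,),
    ).fetchall()

conn = sqlite3.connect(":memory:")
build_frn(conn)
load_example(conn)
print(facts_linked_to(conn, 1))
```

The point of the sketch is only that once everything is in plain relational tables, deciding which facts matter becomes an ordinary SQL question rather than an extraction-time decision.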
To some extent, exhaustive extraction amounts to what in the math biz is called restating the problem.
- Old version: You need to determine which kinds of facts to get out of the documents, and what those facts might look like.
- New version: Same two challenges, but now vis-à-vis the FRN.
Still, this approach would seem to offer some nice advantages. Separating the initial extraction from later lexicography is pure goodness, for all the reasons that modularity is generally good. The same goes for separating the initial extraction from later decisions as to just what information it is you care about anyway. And generally, this approach should help in applications where somebody might say, in David’s phrase, “I don’t know what I’m looking for, but I’ll know it when I see it.”
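The modularity argument can be sketched in a few lines of Python. This is an illustration under stated assumptions, not anything Attensity showed: the extracted facts, the lexicons, and the `categorize` helper are all hypothetical. The extraction output stays fixed while the lexicon that groups it is written, and then revised, afterward.

```python
from collections import Counter

# Hypothetical output of an earlier extraction pass; invented for illustration.
EXTRACTED_FACTS = [
    {"doc_id": "claim-001", "subject": "brake pedal", "verb": "vibrate"},
    {"doc_id": "claim-002", "subject": "engine", "verb": "stall"},
    {"doc_id": "claim-003", "subject": "rotor", "verb": "warp"},
    {"doc_id": "claim-004", "subject": "radio", "verb": "buzz"},
]

# A first lexicon, written after extraction has already happened.
LEXICON_V1 = {
    "brake pedal": "brakes",
    "rotor": "brakes",
    "engine": "powertrain",
}

def categorize(facts, lexicon, default="uncategorized"):
    """Count facts per category, mapping each fact's subject through the lexicon."""
    return Counter(lexicon.get(fact["subject"], default) for fact in facts)

counts_v1 = categorize(EXTRACTED_FACTS, LEXICON_V1)

# Later, someone notices the uncategorized radio complaints and extends the
# lexicon. The extracted facts are reused as-is; nothing is re-extracted.
LEXICON_V2 = {**LEXICON_V1, "radio": "electronics"}
counts_v2 = categorize(EXTRACTED_FACTS, LEXICON_V2)

print(dict(counts_v1))
print(dict(counts_v2))
```

The "uncategorized" bucket in the first pass is where the "I'll know it when I see it" discoveries would show up.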
Comments
10 Responses to “Attensity, extractive exhaustion, and the FRN”
[…] ClearForest is one of the two companies whose name comes up for fact extraction applications, probably even a little ahead of Attensity. Their flagship account is the GM deal they did with IBM, kicking off the whole warranty report mining boom. Procter & Gamble is no slouch of a customer either. They’re involved enough in anti-terrorism that, when I asked Jay if he knew who Cogito was, he said “Of course.” And apparently one of their techie founders is the guy who coined the term “text mining” in the first place. […]
[…] Ramana Rao is leaving Inxight, or has by now. Today I also discovered that Todd Wakefield is leaving Attensity. Such things happen in all industries, of course. • • • […]
[…] They want to be positioned in the BI space, e.g. as “ETL for text/unstructured data.” They place a lot of value on their partnership with Business Objects and Teradata. And, as a key part of the exhaustive extraction/FRN story, they think that BI/data warehouse information roll-up tools are an excellent (if imperfect) substitute for hardcore semantic extraction. […]
[…] The closest analogy to what Clarabridge does is Attensity’s new(ish) strategy – extract “facts” from documents and dump them into a relational database management system. In particular, Clarabridge and Attensity alike make the case “Our categorization is more flexible because it’s applied only after the extraction happens.” […]
[…] Attensity FRN (fact-relationship network) […]
[…] with both Clarabridge and Attensity this week. Since they’re the two big proponents of exhaustive extraction, I naturally asked whether there are any cases exhaustive extraction should not be used. In […]
[…] Attensity, which uses a simple normalized relational schema, Clarabridge dumps the extracted data into a star schema. (The Clarabridge folks are from […]
[…] Data mining can involve very high-dimensional problems with super-sparse tables. And while exhaustive text extraction into flat tables works OK, getting from there to common-sense semantic hierarchies can be a bit of a kludge. […]
[…] and use super-sparse tables. And although exhaustive extraction of facts from text into flat tables works well, getting from there to practical, […]