MuseGlobal – ETL for text, sort of
Lynda Moulton introduced me to MuseGlobal, and specifically CEO Kate Noerr, last month. MuseGlobal sort of does ETL (Extract/Transform/Load) for text, although they prefer to call it Gather/Transform/Deliver. In any case, each of the three parts of the process is rather different for text than it is for traditional data warehousing. To wit:
- Gathering happens from a variety of repositories, which are often individually unstructured. Different repositories may use different file formats, manage different kinds of metadata, and require different kinds of authentication to get at. A significant part of MuseGlobal’s value-add seems to be knowing how to get at data from hundreds or thousands of third-party sources, from both a technology and a licensing standpoint.
- Transformation amounts to extracting document attributes and subject matter, and slathering on a whole lot of XML tags accordingly. More precisely, documents are mapped into a grand cosmic XML schema with 2200 or so nodes, and from there the desired output (usually XML) is produced. MuseGlobal has built its own extraction technology, but will work with Inxight or Temis for finer-level extraction as needed. They do some sentiment analysis, but that doesn’t seem to be a strong point. Extracting dates seems to be a strength of theirs, as is recognizing duplicate documents, both of which would be important in news-oriented applications.
- Delivery is, in database terms, record-at-a-time, not set-at-a-time. (Here I’m basically equating “document” and “record”.) And there are relatively few records, each of which typically undergoes a whole lot of transformation. So most ETL issues of latency and throughput don’t carry over to MuseGlobal’s use cases. Even the batch/real-time distinction is somewhat moot.
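To make the record-at-a-time idea concrete, here is a toy Python sketch of a Gather/Transform/Deliver flow. It is emphatically not MuseGlobal's implementation: the in-memory `SOURCES`, the function names, the three-element XML layout, and the hash-based duplicate check are all assumptions of mine, standing in for real connectors, a 2200-node schema, and whatever duplicate detection they actually use.

```python
import hashlib
import re
import xml.etree.ElementTree as ET
from typing import Dict, Iterator, List, Tuple

# Hypothetical in-memory "repositories". Real sources would each need their
# own connector, file-format handling, and authentication.
SOURCES: Dict[str, List[str]] = {
    "newswire": [
        "Acme buys Widgetco. Published 2008-03-14.",
        "Rates held steady. Published 2008-03-15.",
    ],
    "blog": ["Acme  buys WidgetCo. published 2008-03-14."],  # near-duplicate
}

# Only ISO-style dates are recognized here; real date extraction is far richer.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def gather(sources: Dict[str, List[str]]) -> Iterator[Tuple[str, str]]:
    """Yield (source_name, raw_text) one document at a time."""
    for name, docs in sources.items():
        for text in docs:
            yield name, text


def fingerprint(text: str) -> str:
    """Hash of case- and whitespace-normalized text, for a crude duplicate check."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def transform(name: str, text: str) -> ET.Element:
    """Wrap one document in XML, tagging the first ISO date if one is present."""
    doc = ET.Element("document", {"source": name})
    match = ISO_DATE.search(text)
    if match:
        ET.SubElement(doc, "date").text = match.group(0)
    ET.SubElement(doc, "body").text = text
    return doc


def deliver(sources: Dict[str, List[str]]) -> Iterator[str]:
    """Gather, drop duplicates, transform, and emit XML, one record at a time."""
    seen = set()
    for name, text in gather(sources):
        key = fingerprint(text)
        if key in seen:
            continue  # skip documents already delivered from another source
        seen.add(key)
        yield ET.tostring(transform(name, text), encoding="unicode")


if __name__ == "__main__":
    for record in deliver(SOURCES):
        print(record)
```

Because every stage is a generator, each document is fully transformed and handed off before the next one is even read, which is why batch-versus-real-time matters so little in this style of processing.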
The original market for MuseGlobal’s technology was scientific and professional publishers, but they’ve gotten into more general news and blog document handling as well. They also sell to enterprises. Mark Logic and Endeca are among their partners; in each case, MuseGlobal does the preprocessing. Indeed, MuseGlobal usually sells on an OEM basis, exceptions coming most commonly when they want to establish references in a new market segment.
As to MuseGlobal’s own dates and other facts: The company was founded in 2001, but technology development had been going on for 2-3 years prior. Headcount is around 70, and growing by 20% or so this year. The company is headquartered in San Francisco, with support in Salt Lake City. It has been self-funded all the way.