January 31, 2008

The biggest text analytics company you probably never heard of

I caught up with Expert System S.p.A. last week. It turns out they're doing $10 million a year in text technology revenue. That alone is surprising (sadly), but what’s really remarkable is that they did it almost entirely in the Italian market. As you might guess, that figure includes a little bit of everything, from search engines to Italian language filters for Microsoft Office to text mining. But only $3½ million of Expert System’s revenue is from the government (and I think that includes civilian agencies), and under 30% is professional services, so on the whole it seems like a pretty real accomplishment. Oh yes – Expert System says it’s entirely self-funded.

As of last year, Expert System also has English-language products, and a couple of minor OEM sales in the US (for mobile search and semantic web applications). German- and Arabic-language products are in beta test. The company says that its market focus going forward is national security – surely the reason for the Arabic – and competitive intelligence. It envisions selling through partners such as system integrators, although I think that makes more sense for the government market than it does vis-a-vis civilian companies. In February the company is introducing a market intelligence product focused on sentiment analysis.

Expert System is a bit of a throwback, in that it talks lovingly of the semantic network that informs its products. Read more

Categories: Application areas, Competitive intelligence, Enterprise search, Expert System S.p.A., Ontologies, Search engines, Text mining

January 28, 2008

Forrester says 2008 is the year of enterprise social software

We all know how “The Year of X” kinds of predictions go. Still, when I read that Forrester Research says enterprises are ready to seriously adopt wikis and message forums, it made sense to me. Email threads — via Notes/Exchange or otherwise — aren’t doing the job any more. It’s time to go straight to communally-created web pages.

Personally, I think it’s also time to further replace email disasters, by having broadcasts over something like an enterprise version of Twitter. Clearly, enterprise Twitter would have to have a lot more tagging, group filtering, and automated censorship — ::sigh:: — than current public Twitter. But that all fits very well into the CEP-based architecture (or some near equivalent) that I believe to be the future of Twitter anyway. So would a complete integration between enterprise Twitter and point-to-point enterprise instant messaging.

Categories: Social software and online media, Twitter

January 26, 2008

Anatomy of spam blogs

A post that gives you a clear sense of how gobbledygook is automatically generated (from another knowledgeable black-hat SEO who can’t be bothered to get his permalink structure sensible 😉 )

Categories: Online marketing, Search engine optimization (SEO), Spam and antispam

January 18, 2008

Google is putting more emphasis on phrases

I don’t know how pronounced this trend is, but Google web search seems to be putting more emphasis on phrases than it used to.

For starters, Google doesn’t always ignore stopwords. The Fly and Fly produce different search results. Beyond that, “or” is sometimes assumed to be a word you’re searching on, not an operator; for an example, try live free or die and see the line of text that comes back under the search box. (I’m not sure whether this ever works for “and” as well; even Sanford and Son returns the usual harangue that “the AND operator is unnecessary”.) This is all a pretty clear indicator that Google is looking at phrases. Bill Slawski’s patent-analysis-heavy SEO blog has a lot more to say on that subject, specifically on an indexing scheme that addresses the problems that indexing stopwords might otherwise cause.

Also, there’s a direct series of patents on “Phrase-Based Indexing.”

Finally, although I don’t recall a link, there seems to be a belief that:

Google is using or moving to Latent Semantic Indexing (LSI)
Word-based LSI is patented by somebody else.
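
To make the stopword point concrete, here is a minimal sketch of a toy phrase index. It is purely illustrative and says nothing about how Google actually indexes; the stopword list, the sample documents, and the function names (`normalize`, `build_phrase_index`, `search`) are all made up for the example. It shows why The Fly and Fly can only return different results if stopwords are kept in the index.

```python
from collections import defaultdict

# Hypothetical stopword list, chosen only for this example.
STOPWORDS = {"the", "a", "an", "is", "to", "or", "and", "of"}


def normalize(text, keep_stopwords):
    """Lowercase and split on whitespace, optionally dropping stopwords."""
    words = text.lower().split()
    return [w for w in words if keep_stopwords or w not in STOPWORDS]


def build_phrase_index(docs, keep_stopwords, max_len=4):
    """Map every word sequence of up to max_len words to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = normalize(text, keep_stopwords)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                index[" ".join(words[i:i + n])].add(doc_id)
    return index


def search(index, query, keep_stopwords):
    """Look up the whole query as a single phrase."""
    return index.get(" ".join(normalize(query, keep_stopwords)), set())


# Made-up sample documents.
docs = {
    1: "The Fly is a 1986 horror film",
    2: "How to fly a kite",
    3: "Live free or die is a state motto",
}

with_stopwords = build_phrase_index(docs, keep_stopwords=True)
without_stopwords = build_phrase_index(docs, keep_stopwords=False)

# With stopwords kept, the two queries differ; with them dropped, they collapse.
print(search(with_stopwords, "The Fly", True), search(with_stopwords, "Fly", True))
print(search(without_stopwords, "The Fly", False), search(without_stopwords, "Fly", False))
```

The obvious cost of keeping stopwords is a much bigger index, since every phrase containing "the" or "of" gets its own entry. That is presumably the kind of problem the indexing schemes discussed above are meant to address.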

Categories: Google, Search engines

January 17, 2008

Lynda Moulton on enterprise search

Lynda Moulton and I see enterprise search quite similarly, as I discovered when she called me yesterday to praise my post on the many differences between enterprise and web search, and followed up with this one of her own. One of Lynda’s big themes is that large enterprises, much as they use multiple database management systems, use multiple search engines too. Read more

Categories: Business Objects and Inxight, Enterprise search, Search engines

January 17, 2008

Dr. Dolittle in silicon

The Reg passes along a Reuters story that Hungarian scientists have built a system to automatically understand canine vocalizations. I’d like to say it’s a woof-to-Magyar translator, but apparently all it does is recognize the doggies’ emotional states. The story reports that the system has 43% accuracy, vs. 40% for humans.

I must confess, however, to being somewhat puzzled about how they measure success. Does the pooch fill out a survey form afterwards? Do they conclude that the beast wasn’t angry if the experimenter doesn’t get bitten?

I need to know a bit more about the research protocol before I know what to think about this.

EDIT: The CBC has a little more detail. The underlying research paper is appearing in Animal Cognition.

Categories: Language recognition, Speech recognition

January 16, 2008

Automation secrets of black hat SEO

XMCP writes one of the better black hat SEO blogs. In a post last November, he laid out a ton of advice about automating black hat SEO. Personally, I don’t approve of doing black hat SEO. Still, it’s an intellectually interesting subject. What’s more, black hat SEOs create a large fraction of all websites, and certainly of all blog comments, links, and so on. So it’s interesting to track them.

Most interesting to me and probably to most readers here is the part that shows where black hat SEOs get their content: Read more

Categories: Search engine optimization (SEO), Spam and antispam

January 14, 2008

An interesting Matt Cutts interview from December

Stephen Spencer has a great interview with Matt Cutts of Google, from last month’s Pubcon. Almost all of it is SEO-related. But it also contains a few tidbits that may be interesting even if one doesn’t care about SEO, such as:

Google now indexes up to 1/2 a megabyte per page, up from the old 101K limit.
Google needs to do a fair amount of image recognition, but they’re going fairly plain-vanilla. For Flash they use an Adobe-supplied SDK. For detecting hidden text (e.g., white-on-white) they use what Matt characterizes as pretty simple heuristics.
As I noted recently, Google seems to have a lot of heuristics for identifying particular types of pages. In this interview, the example was that a page that would otherwise seem spammy because it consisted only of links would be fine if it were serving as a true site map or archive.
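
For a sense of what a "pretty simple heuristic" for hidden text might look like, here is a rough sketch that flags text whose color is nearly identical to its background. This is my own guess at the general idea, not anything Matt described; the function names (`parse_hex_color`, `looks_hidden`) and the threshold of 16 are assumptions for illustration only.

```python
def parse_hex_color(value):
    """Turn '#fff' or '#ffffff' into an (r, g, b) tuple of ints in 0-255."""
    value = value.strip().lstrip("#")
    if len(value) == 3:
        # Expand shorthand such as 'fff' to 'ffffff'.
        value = "".join(ch * 2 for ch in value)
    if len(value) != 6:
        raise ValueError("expected a 3- or 6-digit hex color")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def looks_hidden(text_color, background_color, threshold=16):
    """Flag text whose color differs from the background by at most
    `threshold` in every RGB channel (threshold is an arbitrary assumption)."""
    fg = parse_hex_color(text_color)
    bg = parse_hex_color(background_color)
    return max(abs(f - b) for f, b in zip(fg, bg)) <= threshold


print(looks_hidden("#fff", "#FFFFFF"))       # white on white
print(looks_hidden("#000000", "#ffffff"))    # black on white
```

A real check would also have to cope with CSS inheritance, background images, off-screen positioning, and tiny font sizes, which this sketch ignores entirely.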

SEO highlights included: Read more

Categories: Google, Search engine optimization (SEO), Search engines

January 14, 2008