Categorization and filtering
Analysis of technologies that focus on the categorization and filtering of documents and text. Related subjects include:
- Text mining
- Search engines
The Clarabridge approach to text mining
And for my sixth text mining post this weekend, here are some highlights of the Clarabridge technology story. (Sorry if it sounds clipped, but I’m a bit burned out …)
- Like Attensity, Clarabridge practices exhaustive extraction.* That is, they do linguistics against documents, extract all sorts of entities and relationships among the entities from each document, and dump the results into a relational database.
- Unlike Attensity, which uses a simple normalized relational schema, Clarabridge dumps the extracted data into a star schema, as sketched below. (The Clarabridge folks are from MicroStrategy, which – surely not coincidentally – also favors star schemas.) Read more
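To make the star-schema point concrete, here’s a minimal sketch, in Python with sqlite3, of what a schema for extracted text facts could look like. To be clear, all table and column names are my own illustrative inventions, not Clarabridge’s actual design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical star schema for text-mining output: one central fact
# table of extracted relationships, surrounded by dimension tables.
# Illustrative only -- not Clarabridge's actual schema.
conn.executescript("""
CREATE TABLE dim_document (
    doc_id   INTEGER PRIMARY KEY,
    source   TEXT,
    doc_date TEXT
);
CREATE TABLE dim_entity (
    entity_id   INTEGER PRIMARY KEY,
    entity_type TEXT,   -- e.g. person, company, product
    canonical   TEXT    -- normalized surface form
);
CREATE TABLE dim_relationship (
    rel_id   INTEGER PRIMARY KEY,
    rel_type TEXT       -- e.g. 'acquired', 'complained-about'
);
-- Fact table: one row per extracted (subject, relationship, object)
-- occurrence, keyed to the dimensions above.
CREATE TABLE fact_extraction (
    doc_id  INTEGER REFERENCES dim_document,
    subj_id INTEGER REFERENCES dim_entity,
    rel_id  INTEGER REFERENCES dim_relationship,
    obj_id  INTEGER REFERENCES dim_entity
);
""")
```

The appeal of the star layout is that standard BI tools (MicroStrategy’s included) can then slice extracted facts by document, entity, or relationship type, much as they’d slice sales by store and product.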
Categories: BI integration, Clarabridge, Comprehensive or exhaustive extraction, Ontologies, Text mining | 2 Comments |
A tip for submitting to DMOZ — make your site description clear
I just picked out a few of the many unreviewed sites in my DMOZ categories to evaluate, and listed most of those I reviewed.
How did I choose which ones to screen? Mainly, I picked out those with focused titles, descriptions, and so on that seemed likely to be listable on that information alone (which is essentially all I see on the page where the various submitted sites are queued up). I correctly guessed that I’d be able to quickly understand what I was seeing, judge whether or not to list the site, quickly write the official site description, and so on. Read more
Categories: Categorization and filtering, Directories, ODP and DMOZ, Search engine optimization (SEO) | 2 Comments |
A challenge to DMOZ bashers
Give or take a corrected typo, here’s a challenge to DMOZ bashers I just wrote in the flame war thread.
If you want to do something that is:
A. Correct
B. Credible
C. Potentially useful
…just go find a specific category with terrible listings, and publicize the fact with overwhelmingly clear proof of your assessment.
If that’s not EASY for you to do … then maybe DMOZ isn’t so bad after all, eh?
In particular, I’d encourage you to post a version of the category that is clearly better than what is currently there.
Categories: Categorization and filtering, Directories, ODP and DMOZ, Social software and online media | 1 Comment |
DMOZ — yet another flame war
My latest thoughts about DMOZ and the ODP may be found in this blog comment thread.
The gist is:
- DMOZ has many problems, such as categories that are at least five years out of date.
- New sites that got listed corruptly are NOT high on the list of problems.
- In fact, the attention paid to avoiding such corruption is a terrible drain on ODP resources.
- There are a lot of liars and/or idiots bashing DMOZ in the website owner community.
- robjones is a sarcastic jerk, but he’s our sarcastic jerk.
Or something like that. As I said, it’s a flame war …
Anyhow, I’m flying off on a two-week snorkeling trip Saturday, and should be much mellower soon.
Categories: Categorization and filtering, Directories, ODP and DMOZ, Social software and online media | 8 Comments |
Text analytics marketplace trends
It was tough to judge user demand at the recent Text Analytics Summit because, well, very few users showed up. And frankly, I wasn’t as aggressive about pumping vendors for trends as I am at other times. That said, I have talked with most text analytics vendors recently,* and here are my impressions of what’s going on. Any contrary – or confirming! – opinions would be most welcome.
*Factiva is the most significant exception. Hint, hint.
If you think about it, text analytics is a “secret ingredient” in search, antispam, and data cleaning,* and that embedded usage dominates all other uses of the technology. A significant minority of the research effort at companies that do any kind of text filtering is – duh – text analytics. Cold comfort for specialist text analytics vendors, to be sure, but that’s the way it is.
*I.e., part of the “T” in “ETL” (Extract/Transform/Load).
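As a toy illustration of that “T” role, here’s a hedged Python sketch of text analytics cleaning free-text names inside a transform step. The alias table and function are invented for the example.

```python
# Toy example of text analytics as the "T" in ETL: normalize messy
# free-text company names before loading. The alias table is made up.
ALIASES = {
    "ibm corp": "IBM",
    "i.b.m.": "IBM",
    "intl business machines": "IBM",
}

def transform(raw_name: str) -> str:
    """Map a raw extracted string to a canonical entity name."""
    key = raw_name.strip().lower()
    return ALIASES.get(key, raw_name.strip())

rows = ["IBM Corp", "i.b.m.", "Attensity"]
print([transform(r) for r in rows])  # ['IBM', 'IBM', 'Attensity']
```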
Text-analytics-enhanced custom publishing will surely at some point become a must-have for business and technical publishers. However, it appears that we’re not quite there yet, as large publishers make do with simple-minded search and the like. In what I suspect is a telling market commentary, there’s no headlong rush among vendors to dump text mining for custom publishing, notwithstanding the examples of nStein and (sort of) ClearForest. I don’t want to be overly negative – either my friends at Mark Logic are doing just fine or else they’re putting up a mighty brave front – but I don’t think the nonspecialist publishing market is there yet. Read more
I’ve decided to trust Akismet/Bad Behavior
Akismet recently upgraded so that you can see all the spam it’s holding, not just the last 150 messages. This made me a lot happier — but ironically I quickly gave up, and decided to trust Akismet without checking. Why? Well, Akismet sequesters 15 days of spam, and I currently have the following numbers of messages stashed away in it:
- 2246 here on Text Technologies.
- 4427 on DBMS2.
- 816 on Software Memories.
- 5156 on the Monash Report.
That’s over 800 spam per day across four blogs. And when I did check, I almost never found a false positive, except occasionally a trackback of my own.
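For the record, here’s the arithmetic behind that figure (15-day window, four blogs):

```python
# Spam counts quoted above, accumulated over Akismet's 15-day window
counts = {"Text Technologies": 2246, "DBMS2": 4427,
          "Software Memories": 816, "Monash Report": 5156}
total = sum(counts.values())
print(total, round(total / 15))  # 12645 messages, 843 per day
```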
More problematic is my e-mail. Eudora flags pretty much everything that isn’t from an established sender as spam. So along with my 300+ true spam messages per day, I get a number of false positives, some of whose senders have turned into paying customers. So THAT spam directory I do check carefully …
Categories: Blogosphere, Spam and antispam | Leave a Comment |
Wise Crowds of Long-Tailed Ants, or something like that
Baynote sells a recommendation engine whose motto appears to be “popularity implies accuracy.” While that leads to some interesting technological ideas (below), Baynote carries that principle to an unfortunate extreme in its marketing, which is jam-packed with inaccurate buzzspeak. While most of that buzzspeak riffs on a few trendy meme-oriented books, the low point of my briefing today was probably the insistence, against pushback, that “95%” of Google’s results depend on “PageRank.” (I think what Baynote really meant is “all off-page factors combined,” but anyhow I sure didn’t get the sense that accuracy was an important metric for them in setting their briefing strategy. And by the way, one reason I keep repeating the company’s name rather than referring to Baynote by a pronoun is that on-page factors DO matter in search engine rankings.)
That said, here’s the essence of Baynote’s story, as best I could figure it out. Read more
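For flavor, here’s a minimal Python sketch of the generic “popularity implies accuracy” idea, i.e., recommend whatever the crowd clicks most. To be clear, this is my own illustration, not Baynote’s actual algorithm.

```python
from collections import Counter

# Minimal "popularity implies accuracy" recommender: suggest the
# most-clicked items. A generic illustration, NOT Baynote's algorithm.
clicks = [("a", "faq"), ("b", "faq"), ("c", "pricing"),
          ("a", "pricing"), ("b", "faq"), ("d", "docs")]

def recommend(n=2):
    """Return the n most-clicked items, the crowd's 'best' answers."""
    popularity = Counter(item for _, item in clicks)
    return [item for item, _ in popularity.most_common(n)]

print(recommend())  # ['faq', 'pricing']
```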
Categories: Baynote, Google, Ontologies, Search engine optimization (SEO), Search engines, Social software and online media, Software as a Service (SaaS), Specialized search | 4 Comments |
For search, extreme network neutrality must not be compromised
In a recent post on the Monash Report, I drew a distinction between two aspects of the Internet: Jeffersonet and Edisonet. Jeffersonet deals in thoughts and ideas and research and scholarship and news and politics, and in commerce too. It’s what makes people so passionate about the Internet’s democracy-enhancing nature. It’s what needs to be protected by extreme network neutrality. And it’s modest enough in its bandwidth requirements that net neutrality is completely workable. (Edisonet, by way of contrast, comprises advanced applications in entertainment, teleconferencing, etc. that probably do require new capital investment and tiered pricing schemes.)
And if there’s one application that’s at the core of Jeffersonet, it’s search. No matter how much scary posturing telecom CEOs do – and no matter how profitable or monopolistic Google becomes – telecom carriers must never be allowed to show any preference among search engines! At least, that’s the case for text-centric search engines such as those Google, Yahoo, and Microsoft run today. The reason is simple: The democratic part of the Internet only works so long as things can be found. And search will long be a huge part of how to find them. So search engine vendors must never be able to succeed based on a combination of good-enough results plus superior marketing and business development. They always have to be kept afraid of competition from engines that provide better actual search engine results. Read more
So THAT’S why Andrew Orlowski still has a job (Part 2)
Andrew Orlowski is an over-the-top jerk, and a pretty sloppy reporter and analyst to boot. But he occasionally makes a good point even so. In the most recent instance, he confronted Tim Berners-Lee. As the article makes clear, Berners-Lee reacted badly to Orlowski, reflecting an attitude that is probably shared by 99% of the people who encounter the guy, and that in the future will probably be adopted by sentient computers as well. Still, Orlowski’s underlying point is valid: If the Semantic Web is going to be any more spam-free than the current Web, nobody has adequately explained why.
Categories: Ontologies, Spam and antispam | 2 Comments |
InQuira’s and Mercado’s approaches to structured search
InQuira and Mercado both have broadened their marketing pitches beyond their traditional specialties of structured search for e-commerce. Even so, it’s well worth talking about those search technologies, which offer features and precision that you just don’t get from generic search engines. There’s a lot going on in these rather cool products.
In broad outline, Mercado and InQuira each combine three basic search approaches:
- Generic text indexing.
- Augmentation via an ontology.
- A rules engine that helps the site owner determine which results and responses are shown under various circumstances.
Of the two, InQuira seems to have the more sophisticated ontology. Indeed, InQuira makes the not-wholly-absurd claim that it does natural-language processing (NLP). Both vendors incorporate user information in deciding which search results to show, in ways that may be harbingers of what generic search engines like Google and Yahoo will do down the road. Read more
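As a rough illustration of how those three layers might compose, here’s a hedged Python sketch: a keyword index, ontology-based synonym expansion, and one site-owner rule that boosts certain results. All data, synonyms, and the rule are invented; neither vendor publishes its internals in this form.

```python
# Sketch of the three-layer structured-search pattern described above.
# All data, synonyms, and rules here are invented for illustration.
INDEX = {
    "laptop":   ["Laptop X100", "Laptop Pro 15"],
    "notebook": ["Laptop Pro 15", "Notebook Air"],
}
ONTOLOGY = {"laptop": ["notebook"]}      # synonym/ontology expansion
RULES = [lambda q, hits: sorted(         # site-owner rule: promote
    hits, key=lambda h: "Pro" not in h)] # "Pro" items for any query

def search(query):
    terms = [query] + ONTOLOGY.get(query, [])  # 2) ontology augmentation
    hits = []
    for t in terms:                            # 1) generic text indexing
        for h in INDEX.get(t, []):
            if h not in hits:
                hits.append(h)
    for rule in RULES:                         # 3) rules engine pass
        hits = rule(query, hits)
    return hits

print(search("laptop"))  # ['Laptop Pro 15', 'Laptop X100', 'Notebook Air']
```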