Categorization and filtering
Analysis of technologies that focus on the categorization and filtering of documents and text. Related subjects include:
- Any subcategory
- Text mining
- Search engines
Is DMOZ the cure to Wikipedia’s spam problem?
Joost de Valk makes an interesting suggestion, namely that Wikipedia should drop all external links other than to DMOZ, and rely on DMOZ as the outside link directory. As a division of labor, it makes perfect sense. However, it’s a total non-starter until at least two problems are solved.
Categories: Categorization and filtering, Directories, ODP and DMOZ, Ontologies, Spam and antispam | 5 Comments |
Fact and Fiction: DMOZ and the ODP
- DMOZ is dead. Fiction!
- New site submissions are being processed. Partial fact.
- Pending site submissions were lost in the outage. Partial fact.
- Other non-public ODP data was lost in the outage too. Partial fact.
- New editor applications aren’t being processed yet. Fact.
- ODP editors are corrupt. Fiction!
- The ODP is secretive and deceptive. Largely fiction.
- If a DMOZ category doesn’t have a listed editor, it’s unlikely to get much attention. Part fact, part fiction.
- ODP editors hate search engine optimization. Partial fact.
- ODP editors hate SEOs. Partial fact.
I shall explain.
Categories: Categorization and filtering, Directories, ODP and DMOZ, Search engine optimization (SEO) | 7 Comments |
A hobbit writes from the ODP Entmoot
Before saying anything about the Open Directory Project or the DMOZ directory it produces, I should offer several disclaimers.
- No editor speaks for the ODP, let alone for Time Warner/AOL/Netscape.
- No single editor’s opinions or choices control any edits in DMOZ, even if s/he is the sole listed editor of a category. Any of us can be overruled on any editing decision at any time.
- I’m effectively as new as they come, or at least was at the time DMOZ editing came back online (late December). There have been no new editors since the well-publicized outage, and I had next to no involvement with the project prior to the outage.
- Notwithstanding point #2, I’m quite opinionated, which I’m sure surprises approximately nobody. And my opinions quite often are different from those of the ODP mainstream.
Categories: Categorization and filtering, ODP and DMOZ | 1 Comment |
Please switch to my back-up e-mail address
At least for the moment.
monash.com e-mail has been turned off by my hosting company, due to what they claim is a still on-going attack. My backup address, however — FirstnameLastname@domain.com, where domain = dbms2 — is working fine. And my e-mail client traditionally checks them at the same time. So I suggest switching, at least for the moment.
Both are through the same hosting company, Hostgator, which I aspire to replace in the immediate future. I lost admin access to the blogs on two separate occasions this week, and support claims that over half my e-mails are unreadably empty and hence suitable for being ignored, even though I have never had that problem elsewhere. Thus, for other kinds of problems there might be a single point of failure. But in this case, the dbms2 address is a working alternative to the standard one.
Categories: About this blog, Spam and antispam | Leave a Comment |
Twist our arm, please!
Slashdot has a long, exclusive article on proposed US legislation to fight foreign internet censorship. The gist is that companies such as Yahoo and Google seem to be saying “Please, pass a law OBLIGATING us to resist censorship and other bad behavior.”
I think this is both admirable-if-true and, better yet, probably true. Clearly, US web search companies are vulnerable in theory to competition from less scrupulous competitors in other nations. But for now our search technology lead is strong enough that their main competition is with each other. If China (for example) can’t play one of them off against the other, there’s at least a chance it will be reluctant to throw the whole lot of them out.
Categories: Categorization and filtering, Censorship, Google, Search engines | Leave a Comment |
A great new (to me) phrase – “Adversarial Information Retrieval”
I’ve just discovered a great new phrase – adversarial information retrieval. It’s not really new, since papers are now being accepted for what will be the third annual conference on the subject. But it seems to have gained currency over the past few months.
Edit: The term seems to have been coined in 2000.
I think this area is really where the bulk of the research into public search engine algorithms goes. And that’s another way of saying that web and enterprise search are very different things.
Categories: Categorization and filtering, Search engine optimization (SEO), Search engines, Spam and antispam | 2 Comments |
The Chinese censorship threat continues to ratchet up
Ted Samsen of Infoworld is worried that the Chinese are attempting to ratchet up internet censorship yet further. Welcome to the club, buddy. This problem is a big one, and I don’t think it’s going to be addressed without vigorous action. In particular, I suspect that what is needed may be some major efforts in white-hat spamming. Lance Cottrell of Anonymizer has clever ideas along those lines for fighting censorship in the short term, but I think a bigger effort is needed as well.
Google, by the way, is caught in a tough spot and knows it.
Categories: Categorization and filtering, Censorship, Google, Search engines, Spam and antispam, Website filtering | Leave a Comment |
Text mining and search, joined at the hip
Most people in the text analytics market realize that text mining and search are somewhat related. But I don’t think they often stop to contemplate just how close the relationship is, could be, or someday probably will become. Here’s part of what I mean:
- Text mining powers search. The biggest text mining outfits in the world, possibly excepting the US intelligence community, are surely Google, Yahoo, and perhaps Microsoft.
- Search powers text mining. Restricting the corpus of documents to mine, even via a keyword search, makes tons of sense. That’s one of the good ideas in Attensity 4.
- Text mining and search are powered by the same underlying technologies. For starters, there’s all the tokenization, extraction, etc. that vendors in both areas license from Inxight and its competitors. Beyond that, I think there’s a future play in integrated taxonomy management that will rearrange the text analytics market landscape.
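To make the "joined at the hip" point concrete, here is a minimal sketch of search feeding text mining, with both steps sharing one tokenizer. It is purely illustrative and is not how Attensity, Inxight, or any other vendor does it; the function names, the stopword list, and the sample documents are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a much fuller one.
STOPWORDS = {"the", "a", "is", "and", "my", "after", "of", "was", "in", "too"}

def tokenize(text):
    """Shared tokenizer used by both the search step and the mining step."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_search(docs, keyword):
    """Search step: keep only documents that contain the keyword as a token."""
    kw = keyword.lower()
    return [d for d in docs if kw in tokenize(d)]

def term_counts(docs):
    """Mining step (deliberately crude): count the non-stopword terms."""
    return Counter(t for d in docs for t in tokenize(d) if t not in STOPWORDS)

# Made-up sample corpus.
docs = [
    "The battery died after a week",
    "Great screen and the battery is fine",
    "Shipping was slow",
]

subset = keyword_search(docs, "battery")  # restrict the corpus first
counts = term_counts(subset)              # then mine only that subset
```

The mining step never sees the shipping complaint, which is the whole point of restricting the corpus up front.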
Categories: Attensity, Business Objects and Inxight, Enterprise search, FAST, Google, IBM and UIMA, Ontologies, Open source text analytics, Search engines, Text mining | 3 Comments |
Enterprise-specific web search: High-end web search/mining appliances?
OK. I have a vision of one way search could evolve, which I think deserves consideration on at least a “concept-car” basis. This is all speculative; I haven’t discussed it at length with the vendors who’d need to make it happen, nor checked the technical assumptions carefully myself. So I could well be wrong. Indeed, I’ve at least half-changed my mind multiple times this weekend, just in the drafting of this post. Oh yeah, I’m also mixing several subjects together here. All in all, this is not my crispest post …
Anyhow, the core idea is that large enterprises spider and index a subset of the Web, and use that for most of their employees’ web search needs. Key benefits would include:
- Filtering out spam hits. This is obviously important for search, and in some cases could help with public-web text mining as well. It should be OK to be more aggressive on spam-site filtering in an enterprise-specific index than it is in general web search.
- Filtering out malicious/undesirable downloads of various sorts. I’m thinking mainly of malware/spyware here, but of course the same filtering can be used for netnannying, porn prevention, and the like. Again, this is more easily done for the enterprise market than for the search world at large. (I anyway think that Google could blow Websense out of the water any time they wanted to – except, of course, for the not-so-small matter of not being seen as participating in the censorship business – but that’s a separate discussion.)
- Capturing employees’ search strings. This could be useful for various purposes, including discerning their interests, and building the corporate ontology for internal web search.
- Freshness control. If there’s a site you really care about, you can make sure it’s re-indexed frequently.
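In the same concept-car spirit, the benefits above can be sketched as a small policy object that a hypothetical enterprise crawler and search front end might consult. Nothing here corresponds to a real product; the class name, the field names, the example hosts, and the one-week default re-crawl interval are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from urllib.parse import urlparse

@dataclass
class EnterpriseIndexPolicy:
    """Hypothetical policy for an enterprise-specific web index."""
    blocked_hosts: set = field(default_factory=set)     # spam/malware sites
    recrawl_hours: dict = field(default_factory=dict)   # host -> re-index interval
    default_recrawl_hours: int = 168                    # assumed default: one week
    query_log: Counter = field(default_factory=Counter)

    def allows(self, url):
        """Filtering: refuse to index or serve anything from a blocked host."""
        host = urlparse(url).hostname or ""
        return host not in self.blocked_hosts

    def is_stale(self, url, hours_since_crawl):
        """Freshness control: sites you care about get shorter intervals."""
        host = urlparse(url).hostname or ""
        limit = self.recrawl_hours.get(host, self.default_recrawl_hours)
        return hours_since_crawl >= limit

    def record_query(self, query):
        """Capture employees' search strings, lightly normalized."""
        self.query_log[query.strip().lower()] += 1

# Made-up hosts for illustration.
policy = EnterpriseIndexPolicy(
    blocked_hosts={"spam.example"},
    recrawl_hours={"news.example": 1},
)
policy.record_query(" Text Mining ")
```

An enterprise can afford a far more aggressive `blocked_hosts` list than a general web search engine can, since a false positive only inconveniences its own employees.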
Categories: Categorization and filtering, Convera, Enterprise search, FAST, Google, IBM and UIMA, Search engines, Spam and antispam, Specialized search, Text mining, Website filtering | 1 Comment |
Principles of enterprise text technology architecture
My August Computerworld column starts where July’s left off, and suggests principles for enterprise text technology architecture. This will not run Monday, August 7, as I was originally led to believe, but rather in my usual second-Monday slot, namely August 14. Thus, I finished it a week earlier than necessary, and I apologize to those of you I inconvenienced with the unnecessary rush to meet that deadline.
The principles I came up with are:
- Deploy search widely across the enterprise.
- It’s OK for your text data to be distributed across a range of silos.
- Integrate fact extraction/text mining aggressively into your predictive analytics and dashboards.
- Having a preferred enterprise text technology tool suite is nice, but accept that there will probably be lots of departmental exceptions.
- Reinvent your customer communication (and other) processes to exploit text technologies.
- Integrate your taxonomies.
I’ll provide a link when the column is actually posted.
Categories: Enterprise search, Ontologies, Search engines, Text mining | 1 Comment |