Twist our arm, please!
Slashdot has a long, exclusive article on proposed US legislation to fight foreign internet censorship. The gist is that companies such as Yahoo and Google seem to be saying “Please, pass a law OBLIGATING us to resist censorship and other bad behavior.”
I think this is both admirable-if-true and, better yet, probably true. Clearly, US web search companies are vulnerable in theory to competition from less scrupulous competitors in other nations. But for now our search technology lead is strong enough that their main competition is with each other. If China (for example) can’t play one of them off against the other, there’s at least a chance it will be reluctant to throw the whole lot of them out.
Categories: Categorization and filtering, Censorship, Google, Search engines
Government-specific search fails to impress
According to Steven Arnold, FirstGov (which has been renamed USASearch.gov) is by far the most effective US government-specific search engine. But there’s something odd about it; whatever the query, it’s determined to give no more than a little over 100 results. Queries for which I’ve noted results in this quantity range include Bush (and this covers all family members), Cheney (ditto), Kennedy (ditto), Condaleeza, Scalia, Coolidge, Red Sox, big dig, Burlingame, Redmond, Pluto, ethanol, spotted owl, and topology. The only ones I’ve found so far coming out above that results range (perhaps inevitably 😉) are death (137) and taxes (177).
Categories: Convera, Search engines, Specialized search
The Chinese censorship threat continues to ratchet up
Ted Samsen of Infoworld is worried that the Chinese are attempting to ratchet up internet censorship yet further. Welcome to the club, buddy. This problem is a big one, and I don’t think it’s going to be addressed without vigorous action. In particular, I suspect that what is needed may be some major efforts in white-hat spamming. Lance Cottrell of Anonymizer has clever ideas along those lines for fighting censorship in the short term, but I think a bigger effort is needed as well.
Google, by the way, is caught in a tough spot and knows it.
Categories: Categorization and filtering, Censorship, Google, Search engines, Spam and antispam, Website filtering
FAST said to be pursuing BI
Dave Kellogg thinks FAST will be ineffective and defocused because of its efforts in business intelligence. I can’t comment on whether that analysis is brilliant, self-serving, or both, because anything I’ve been told on the subject is under embargo.
Embargoes were a crucial PR tactic when Regis McKenna exploited them for the original rollout of the Macintosh in 1984. But I suspect that in many cases they’ve quite outlived their usefulness. If I wait between the time I’m briefed and the time the embargo is up to write something, my thoughts about it get fuzzy. If I write something at the time and put it on ice, it may be obsolete because of what other people write in the meantime.
More and more, if something is embargoed, I wind up not writing about it at all.
EDIT: Point #4 of my post on the mismatch between relational databases and text search is pretty relevant here.
Categories: BI integration, Enterprise search, FAST
But Google trumps most site search
Popular on Digg, for obvious reasons, is a post showing that Google is better for searching Digg than Digg’s own search engine. No shock there. If I want to search Wikipedia for information on astrowidgets, I’ll just google on the phrase wikipedia astrowidgets. That works much better than Wikipedia’s own search.
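For anyone who wants to script that trick, here is a minimal sketch in Python. The helper name `google_query_url` is made up for illustration, and the sketch only builds the search URL for the phrase; it does not fetch or parse any results.

```python
from urllib.parse import urlencode

def google_query_url(site_name, topic):
    """Build a Google web search URL for '<site_name> <topic>'.

    Hypothetical helper: it simply puts the site name in front of the
    topic, as in the "wikipedia astrowidgets" example above.
    """
    query = f"{site_name} {topic}"
    return "https://www.google.com/search?" + urlencode({"q": query})

url = google_query_url("wikipedia", "astrowidgets")
print(url)
```

Pasting the printed URL into a browser should be equivalent to typing the phrase into the Google search box.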
Speaking of which — if you want to search for my writing, I’m using Google web search technology too. It works like a charm.
Categories: Google, Search engines, Specialized search
41 differences between web and enterprise search
Based on a patent application, SEOmoz has discerned 65 aspects of the Google ranking algorithm.* I counted only 24 that really had much at all to do with enterprise search. This leaves 41 or so focused on spam/SEO-fighting and/or on-page linking issues that have no enterprise parallel. And for more depth, here’s a long article from another SEO site, on a specific phrase-concurrence spam-fighting technique that has no apparent applicability to trusted corpuses.
*I highly recommend this link. It is by far the best single-page overview of web search algorithmic issues I’ve ever seen.
I’ve said it before, but it bears repeating — web search and enterprise search (or search of a constrained corpus) are very different technical problems.
Categories: Enterprise search, Google, Search engines
Text analytics is finally being used for investment analysis
Jay Henderson of ClearForest tells me that hedge funds are one of their more interesting growth areas. It’s about time.
I think a lot of the reason for investment firms not making more use of text analytics has been structural — Factiva, the (relatively speaking) mammoth joint venture of Reuters and Dow Jones, is forbidden by its parent companies from meeting investment firms’ needs. And that’s kind of a pity, as it’s probably the best-positioned firm to do so. It’s good to hear that the little guys are finally filling the gap.
Telling Attensity and ClearForest apart
So far as I can tell, Attensity’s strategy when the company was originally founded was rather like ClearForest’s strategy today – and vice-versa. That said, here’s where they seem to stand at this time:
- Attensity wants to make text analytics very easy to integrate into business intelligence and data mining – at the moment, they’re not too focused on the differences between those two disciplines – and is trying to deliver the best possible fact extraction consistent with that charter.
- ClearForest wants to provide really great information extraction — to the limits of what can be done without excessive knowledge engineering – and is trying to integrate as well as possible with other technologies, the better to serve the customers who need what they offer.
Categories: Application areas, Attensity, ClearForest/Reuters, Custom publishing, Mark Logic, TEMIS, Text mining
Text mining and search, joined at the hip
Most people in the text analytics market realize that text mining and search are somewhat related. But I don’t think they often stop to contemplate just how close the relationship is, could be, or someday probably will become. Here’s part of what I mean:
- Text mining powers search. The biggest text mining outfits in the world, possibly excepting the US intelligence community, are surely Google, Yahoo, and perhaps Microsoft.
- Search powers text mining. Restricting the corpus of documents to mine, even via a keyword search, makes tons of sense. That’s one of the good ideas in Attensity 4.
- Text mining and search are powered by the same underlying technologies. For starters, there’s all the tokenization, extraction, etc. that vendors in both areas license from Inxight and its competitors. Beyond that, I think there’s a future play in integrated taxonomy management that will rearrange the text analytics market landscape.
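To make the "search powers text mining" point concrete, here is a toy Python sketch of my own. It is not how Attensity 4 or any other product works: the "search" is a plain substring match, the "mining" is a word count, and the three-document corpus is invented for the example.

```python
from collections import Counter
import re

def keyword_filter(docs, keyword):
    """Return only the documents that mention the keyword (case-insensitive)."""
    kw = keyword.lower()
    return [d for d in docs if kw in d.lower()]

def top_terms(docs, n=3):
    """Toy stand-in for text mining: the n most frequent words in the docs."""
    counts = Counter()
    for d in docs:
        counts.update(re.findall(r"[a-z]+", d.lower()))
    return counts.most_common(n)

# Invented corpus, for illustration only.
corpus = [
    "Ethanol subsidies rose again this year",
    "The Red Sox won last night",
    "Ethanol plants face corn shortages",
]

# Step 1: restrict the corpus with a keyword search.
subset = keyword_filter(corpus, "ethanol")

# Step 2: mine only the restricted subset.
print(top_terms(subset, 1))
```

The mining step never sees the baseball story, which is the whole point: a cheap search pass shrinks what the expensive step has to chew on.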
Categories: Attensity, Business Objects and Inxight, Enterprise search, FAST, Google, IBM and UIMA, Ontologies, Open source text analytics, Search engines, Text mining
Enterprise-specific web search: High-end web search/mining appliances?
OK. I have a vision of one way search could evolve, which I think deserves consideration on at least a “concept-car” basis. This is all speculative; I haven’t discussed it at length with the vendors who’d need to make it happen, nor checked the technical assumptions carefully myself. So I could well be wrong. Indeed, I’ve at least half-changed my mind multiple times this weekend, just in the drafting of this post. Oh yeah, I’m also mixing several subjects together here. All in all, this is not my crispest post …
Anyhow, the core idea is that large enterprises spider and index a subset of the Web, and use that for most of their employees’ web search needs. Key benefits would include:
- Filtering out spam hits. This is obviously important for search, and in some cases could help with public-web text mining as well. It should be OK to be more aggressive on spam-site filtering in an enterprise-specific index than it is in general web search.
- Filtering out malicious/undesirable downloads of various sorts. I’m thinking mainly of malware/spyware here, but of course the same filtering can be used for netnannying, porn prevention, and the like. Again, this is more easily done for the enterprise market than for the search world at large. (In any case, I think Google could blow Websense out of the water any time they wanted to, except, of course, for the not-so-small matter of not being seen as participating in the censorship business. But that’s a separate discussion.)
- Capturing employees’ search strings. This could be useful for various purposes, including discerning their interests, and building the corporate ontology for internal web search.
- Freshness control. If there’s a site you really care about, you can make sure it’s re-indexed frequently.
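Here is a very rough Python sketch of how two of those benefits, spam-site filtering and freshness control, might fit into a re-indexing schedule. Everything in it is an assumption made for illustration: the `Site` record, the `due_for_reindex` function, the hour-based clock, and the example URLs. It does no actual crawling.

```python
from dataclasses import dataclass

@dataclass
class Site:
    url: str
    reindex_every_hours: float      # smaller means fresher
    last_indexed_hour: float = 0.0  # hour at which the site was last indexed

def due_for_reindex(sites, now_hour, blocklist=frozenset()):
    """Return URLs whose re-index interval has elapsed, skipping blocked sites."""
    return [
        s.url
        for s in sites
        if s.url not in blocklist
        and now_hour - s.last_indexed_hour >= s.reindex_every_hours
    ]

# Hypothetical enterprise configuration.
sites = [
    Site("https://example.com/news", reindex_every_hours=1),       # a site you really care about
    Site("https://example.org/archive", reindex_every_hours=168),  # weekly is plenty
    Site("https://spam.example.net", reindex_every_hours=1),       # filtered out below
]

due = due_for_reindex(sites, now_hour=2, blocklist={"https://spam.example.net"})
print(due)
```

The enterprise, not the search vendor, sets both the blocklist and the per-site intervals, which is what lets it be more aggressive than a general web search engine could afford to be.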