Categorization and filtering
Analysis of technologies that focus on the categorization and filtering of documents and text. Related subjects include:
- Any subcategory
- Text mining
- Search engines
Drive-by Google de-listing
As previously noted, we got hit with some hidden text, probably by SQL injection, and that led to a Google de-listing. Of the three blogs affected by the attack, I got a de-indexing notice for one (DBMS2); another was de-listed without a notice (Text Technologies); and a third seems to have waltzed through still indexed (Software Memories). I also received a de-indexing notice for another site I have nothing to do with and indeed had never heard of before. Go figure …
We’ve now upgraded to WordPress 2.5, which should close the vulnerability. (Thank you Melissa Bradshaw!) Fearing our old, buggy theme would degrade further, we upgraded to a new one, Biru, designed by Bob. There are some teething-pain stability issues, but if they don’t cause a reversion in the next day, I’ll apply to Google for re-inclusion. (Uh, does anybody have a rough idea of how long that’s likely to take?)
All these hours of aggravation because some criminal wanted a bit of SEO advantage …
Categories: Google, Search engine optimization (SEO), Spam and antispam | 1 Comment |
Over 80 percent of blog posts are probably spam
Doug Caverly highlights a Matt Mullenweg quote indicating that about 1/4 of all the blogs ever on WordPress.com were spam (aka splogs). Now, that’s probably a higher fraction than for the blogoverse overall, because:
- WordPress.com provides costless hosting; using your own domain costs money.
- Besides being free, WordPress.com hosting may provide a little “google juice”, which is the whole SEO point of spam blogging.
But there’s one more factor. Splogs have much higher posting frequency than real ones. 10-20+ posts per day is not uncommon, and 50-100+ is not unheard of. That’s 5-10X the post frequency of even the more active human-written blogs. So let’s assume:
- 10% of all blogs are spam.
- 10% of all blogs are actively written by humans.
- 80% of all blogs belong to humans, but are updated very infrequently if at all.
In that case, over 80% (and indeed probably over 90%) of all blog posts are made by machines rather than by human beings.
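The arithmetic behind that estimate can be sketched as follows. This is a back-of-the-envelope calculation, not a measurement: it assumes the blog shares listed above, normalizes the posting rate of an active human blog to 1, treats the rarely updated blogs as contributing zero posts, and takes the 5-10X multiplier for splogs at face value. The function name and parameters are purely illustrative.

```python
def spam_post_share(spam_blogs, active_blogs, dormant_blogs,
                    spam_rate, active_rate=1.0, dormant_rate=0.0):
    """Fraction of all posts that come from spam blogs.

    Blog counts are shares of all blogs; rates are posts per blog,
    relative to an active human-written blog (assumed rate = 1).
    """
    spam_posts = spam_blogs * spam_rate
    human_posts = active_blogs * active_rate + dormant_blogs * dormant_rate
    return spam_posts / (spam_posts + human_posts)


# Assumed mix from the list above: 10% spam, 10% active, 80% dormant.
low = spam_post_share(0.10, 0.10, 0.80, spam_rate=5.0)    # splogs post 5X as often
high = spam_post_share(0.10, 0.10, 0.80, spam_rate=10.0)  # splogs post 10X as often
print(f"{low:.1%} to {high:.1%}")
```

At 5X the spam share is 5/6 of all posts, and at 10X it is 10/11, so the "over 90%" figure depends on splogs posting at the high end of that range or beyond.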
Categories: Blogosphere, Search engine optimization (SEO), Social software and online media, Spam and antispam | Leave a Comment |
19 Microsoft/Yahoo synergies that could revolutionize the Internet
Many, perhaps most, commentators on Microsoft’s bid for Yahoo are thoroughly missing the point. The most interesting part of Microsoft’s bid for Yahoo isn’t the horse-race retrospective “How did they screw up so much as to need each other?” It’s not the incipient bidding war for Yahoo. And it’s certainly not the antitrust implications.
The Microsoft/Yahoo combination could revolutionize the Internet. I’m serious. The opportunities for huge synergies might just be enough to blast the merged companies out of their current uncreative, Innovator’s Dilemma funks. Search is open for radical transformation in user interface, universal search relevancy, Web/enterprise integration, and just about everything to do with advertising and monetization. Email stands to be utterly reinvented. Portals and business intelligence have only scratched the surface of their potential. And social networking is of course in its infancy.
Here’s an overview of where some synergies and opportunities for a combined Microsoft/Yahoo lie.
Categories: Enterprise search, Google, Microsoft, Search engines, Social software and online media, Spam and antispam, Website filtering, Yahoo | 15 Comments |
The biggest text analytics company you probably never heard of
I caught up with Expert System S.p.A. last week. They turn out to be doing $10 million in annual text technology revenue. That alone is surprising (sadly), but what’s really remarkable is that they did it almost entirely in the Italian market. As you might guess, that figure includes a little bit of everything, from search engines to Italian language filters for Microsoft Office to text mining. But only $3.5 million of Expert System’s revenue is from the government (and I think that includes civilian agencies), and under 30% is professional services, so on the whole it seems like a pretty real accomplishment. Oh yes – Expert System says it’s entirely self-funded.
As of last year, Expert System also has English-language products, and a couple of minor OEM sales in the US (for mobile search and semantic web applications). German- and Arabic-language products are in beta test. The company says that its market focus going forward is national security – surely the reason for the Arabic – and competitive intelligence. It envisions selling through partners such as system integrators, although I think that makes more sense for the government market than it does vis-a-vis civilian companies. In February the company is introducing a market intelligence product focused on sentiment analysis.
Expert System is a bit of a throwback, in that it talks lovingly of the semantic network that informs its products.
Categories: Application areas, Competitive intelligence, Enterprise search, Expert System S.p.A., Ontologies, Search engines, Text mining | Leave a Comment |
Anatomy of spam blogs
A post that gives you a clear sense of how gobbledygook is automatically generated (from another knowledgeable black-hat SEO who can’t be bothered to make his permalink structure sensible 😉 )
Automation secrets of black hat SEO
XMCP writes one of the better black hat SEO blogs. In a post last November, he laid out a ton of advice about automating black hat SEO. Personally, I don’t approve of doing black hat SEO. Still, it’s an intellectually interesting subject. What’s more, black hat SEOs create a large fraction of all websites, and certainly of all blog comments, links, and so on. So it’s interesting to track them.
Most interesting to me and probably to most readers here is the part that shows where black hat SEOs get their content.
Categories: Search engine optimization (SEO), Spam and antispam | 2 Comments |
A very fast splogger
The first post ever on Strategic Messaging went up at 2:49 am. Within four hours, I had my first splog trackbacks, all from the same site. The strategicmessaging.com domain itself had just repropagated through DNS hours earlier, and had no incoming links other than Whois and the like.
Pretty impressive spamming. Not that it did him any good, of course, except insofar as he was stealing a bit of my content …
Restoring security and function to my mail and websites
OK. Here’s the story as I now know it.
- monash.com was hit by a massive mail-bomb Christmas Eve. My email and websites went down for a while as a consequence. What’s more, with a flooded mail queue, there were further mail problems through at least 12/28. Some mail bounced, and other mail that appeared to go through was lost forever. If you’ve mailed me since 12/24 and I haven’t answered, please send again.
- The mail-bomb paved the way for an injection of some malware. I started noticing possible trojans on monash.com 12/31. Melissa Bradshaw, my stellar web designer, noticed Javascript that she hadn’t written, both on monash.com and dbms2.com. So far as we could tell, standard anti-malware client protections were sufficient to keep any trojans from being successfully downloaded to clients.
- My very attentive web hosting company, Dimension Servers, is rebuilding its Linux kernel accordingly. Scheduled downtime for all my sites and mail is midnight to 3:00 am Eastern tonight, but that’s obviously just a rough estimate. Company president Jonathan MacAllister telephoned me to tell me this personally, notwithstanding that his wife delivered a baby by emergency C-section today. (Wife and baby are OK!)
- Jonathan also told me that after the attack, he bought a Cisco appliance. Every web hosting company needs to do that, as appliances are much more efficient at dealing with overloading attacks than the servers themselves. Cisco was a brand choice pretty much dictated by his remote data center.
- David Ferris and Richi Jennings have convinced me to move monash.com email to Google’s free mail hosting service. This is what they’re doing with ferris.com mail and all of Richi’s domains as well. NO analysts are more reliable on email than David and Richi. And hosting is surely no exception, as David and I did a research project together some years ago uncovering the Critical Path sham.
- The net effect of that move will be that monash.com and dbms2.com have their email managed quite separately, so if you can’t get me at one, please try the other. Generally, if you don’t know me you should write to monash.com, and I’ll probably write back from dbms2.com.
- I’ll post about all this again after things seem to have worked out, possibly over on the Monash Report.
Happy New Year,
CAM
Categories: Spam and antispam | 2 Comments |
I’m getting mailbombed again
Shortly after my first reference to Shoemoney’s DMOZ issues (who did you think I meant with “shoe in his mouth”?), I got mailbombed big time. Things calmed down after a month or so, although I did change web hosting companies in the fallout.
Starting Christmas Eve — which coincidentally was shortly after a forum mention of various Shoemoney flaps, and of the first attack — I got hit again. And there was another wave right after Christmas. A fair amount of email was lost forever, possibly both professional and personal. My blogs also were down for a while, as were other sites on the same server. (And if you sent me any email over that time period, please resend it.)
It seems that I should move my email/MX record to a different service from the one that hosts my websites, perhaps one that has invested in technology to efficiently deflect DDoS attacks. (Or perhaps I should move one domain with it, if a traditional hosting deal seems best.) Does anybody have any recommendations of such services?
An Occurrence at Owl Creek Bridge and other SEO spam explained
I average upwards of 100 spam comments per day per blog, very few of which actually get through (although those few are obviously enough to be quite annoying!). Recent research from Sunbelt explains part of what’s going on. (More here in Computerworld.) What’s going on is this:
1. Aggressive black-hat SEO is being done for all kinds of long-tail terms and phrases, by posting comment spam filled with little except links on those phrases. For example, one of the first spams I checked for this post consists simply of 10 links to the same .cn, with anchor text and subdomain name being the same keyphrase. Keyphrases included “an occurrence at owl creek bridge”, “allegheny assessment county tax”, and “am been hate i ive who who.” As this kind of spam came by, I’d been wondering why people bothered, since it didn’t seem terribly easy to monetize.
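The pattern described above (a comment made of little except links, all pointing at the same domain) is simple enough to flag with a rough heuristic. What follows is a minimal sketch, not how any particular anti-spam plugin works: the function names, the thresholds, and the `spam.example` domain are all hypothetical, and taking the last two hostname labels is a crude stand-in for real registrable-domain detection.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Simplistic patterns; a production filter would use a real HTML parser.
ANCHOR_RE = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
                       re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def registered_suffix(host, labels=2):
    """Crude approximation of the registrable domain: the last `labels` labels."""
    return ".".join(host.lower().split(".")[-labels:])


def looks_like_link_spam(comment_html, min_links=5, max_other_chars=40):
    """True if the comment is almost nothing but links to a single domain."""
    links = ANCHOR_RE.findall(comment_html)
    if not links or len(links) < min_links:
        return False
    hosts = Counter(registered_suffix(urlparse(href).hostname or "")
                    for href, _text in links)
    top_host_count = hosts.most_common(1)[0][1]
    # Text left over once the anchors and any other tags are removed.
    outside = TAG_RE.sub("", ANCHOR_RE.sub("", comment_html)).strip()
    return top_host_count == len(links) and len(outside) <= max_other_chars


# Hypothetical spam comment: 10 links whose subdomain matches the anchor text.
keyphrases = ["an-occurrence-at-owl-creek-bridge",
              "allegheny-assessment-county-tax",
              "am-been-hate-i-ive-who-who"]
spam = " ".join(
    f'<a href="http://{k}.spam.example/">{k.replace("-", " ")}</a>'
    for k in (keyphrases * 4)[:10]
)
legit = 'Nice post! See <a href="http://example.org/a">this</a> for background.'
print(looks_like_link_spam(spam), looks_like_link_spam(legit))
```

A check like this would only catch the crudest cases; anything with filler text around the links, or links spread across several domains, would need other signals.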
Categories: Search engine optimization (SEO), Spam and antispam | 1 Comment |