August 2, 2006

Petabyte-scale search scalability

I’ve had a couple of good talks with Andrew McKay of FAST recently. When discussing FAST’s scalability, he likes to use the word “petabytes.” I haven’t probed yet as to exactly which corpus(es) he’s referring to, but here’s a thought for comparison:

Google, if I recall correctly, caches a little over 100 KB/page (assuming, of course, that the page has at least that much text, which is not necessarily the case at all). And they were up into Carl Sagan range – i.e., “billions and billions” – before they stopped giving counts of how many pages they’d indexed.

10 billion times 100 KB is, indeed, a petabyte. So, in the roughest of approximations, the Web is a petabyte-range corpus.
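
For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes decimal (SI) units, a flat 10 billion pages, and every page at the full 100 KB, all of which are simplifications; the function name is made up purely for illustration.

```python
# Back-of-envelope corpus size, using decimal (SI) units.
KB = 10**3
PB = 10**15


def corpus_size_bytes(pages: int, bytes_per_page: int) -> int:
    """Total size of a corpus in which every page has the same cached size."""
    return pages * bytes_per_page


pages = 10 * 10**9      # assumption: 10 billion pages indexed
per_page = 100 * KB     # assumption: every page hits the ~100 KB figure

total = corpus_size_bytes(pages, per_page)
print(total / PB)  # 1.0
```

Since many pages carry far less than 100 KB of text, this is closer to an upper bound than an estimate; plugging a smaller average into `per_page` scales the result down proportionally.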

EDIT: Hah. I bet eBay and its 2-petabyte database is one of the examples Andrew is referring to …

Comments

4 Responses to “Petabyte-scale search scalability”

  1. Text Technologies»Blog Archive » Introduction to FAST on August 2nd, 2006 10:16 pm

    […] FAST, aka Fast Search & Transfer (www.fastsearch.com) is a pretty interesting and important company. They have 3500 enterprise customers, a rapidly growing $100 million revenue run rate, and a quarter billion dollars in the bank. Their core business is of course enterprise search, where they boast great scalability, based on a Google-like grid architecture, which they fondly think is actually more efficient than Google’s. Beyond that, they’ve verticalized search, exploiting the modularity of their product line to better serve a variety of niche markets. And they’re active in elementary fact/entity extraction as well. Oh yes – they also have forms of guided navigation, taxonomy-awareness, and probably everything else one might think of as a checkmark item for a search or search-like product. […]

  2. DBMS2 — DataBase Management System Services»Blog Archive » Really big databases on February 23rd, 2007 1:04 am

    […] Business Intelligence Lowdown has a well-dugg post listing what it claims are the 10 largest databases in the world. The accuracy leaves much to be desired, as is illustrated by the fact that #10 on the list is only 20 terabytes, while entirely unmentioned is eBay’s 2-petabyte database (mentioned here, and also here). Only one phone company was listed, and no credit-raters. And for some databases listed, the size given seemed too low. E.g., for Google I’d guess that the average page size it indexes is 10K+ (vs. the 100ish max), even with all the junk stuff in there, so it’s in the 100s of terabytes at a minimum, for raw data alone before considering indexes and so on. […]

  3. xlrdbms on January 5th, 2008 1:27 am

    With Web 2.0 companies capturing 10-50 TB of data a day (growing exponentially), petabytes aren’t all that much after all. It’s all about value density coupled with mixed workload. At a recent XLDB workshop at Stanford it became very clear that the designs currently in the works are for 10-100PB in a SINGLE rdbms to be deployed over the next 2-3 years. The 1-2PB systems are there right now; growing by 1-2 orders of magnitude will require new concepts and potentially the marriage of multiple technologies. It’s not going to be Vendor A vs Vendor B, but most likely Vendor X AND Vendor Y. Which brings up the point of consolidation in this industry. Now that the BI folks are consolidating, it will only be natural to see rdbms vendors do the same.
    There is a need for 100PB+ solutions out there, but the current mindset of vendors is either too immature or too greedy to make that a reality.

  4. DBMS2 — DataBase Management System Services » Blog Archive » Teradata apparently has crossed the petabyte barrier on April 25th, 2008 12:08 am

    […] of user data in a single instance. He wouldn’t disclose any names, but I’d guess one is eBay, who he did confirm is a customer. The intelligence area is another one where I’d speculate […]
