Petabyte-scale search scalability
I’ve had a couple of good talks with Andrew McKay of FAST recently. When discussing FAST’s scalability, he likes to use the word “petabytes.” I haven’t yet probed exactly which corpus(es) he’s referring to, but here’s a thought for comparison:
Google, if I recall correctly, caches a little over 100KB/page (assuming, of course, that the page has at least that much text, which is not necessarily the case at all). And they were up into the Carl Sagan range – i.e., “billions and billions” – before they stopped giving counts of how many pages they’d indexed.
10 billion pages times 100 KB is, indeed, a petabyte. So, in the roughest of approximations, the Web is a petabyte-range corpus.
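As a quick sanity check on that arithmetic – where the ~10 billion page count and ~100KB cache size are just the rough assumptions above, not published figures:

    # Back-of-envelope: ~10 billion pages, ~100 KB of cached text apiece
    pages = 10_000_000_000             # "billions and billions" -- call it 10 billion
    bytes_per_page = 100 * 1000        # ~100 KB per page, decimal units

    total_bytes = pages * bytes_per_page
    print(total_bytes / 10**15, "PB")  # -> 1.0 PB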
EDIT: Hah. I bet eBay, with its 2-petabyte database, is one of the examples Andrew is referring to …
4 Responses to “Petabyte-scale search scalability”
[…] FAST, aka Fast Search & Transfer (www.fastsearch.com) is a pretty interesting and important company. They have 3500 enterprise customers, a rapidly growing $100 million revenue run rate, and a quarter billion dollars in the bank. Their core business is of course enterprise search, where they boast great scalability, based on a Google-like grid architecture, which they fondly think is actually more efficient than Google’s. Beyond that, they’ve verticalized search, exploiting the modularity of their product line to better serve a variety of niche markets. And they’re active in elementary fact/entity extraction as well. Oh yes – they also have forms of guided navigation, taxonomy-awareness, and probably everything else one might think of as a checkmark item for a search or search-like product. […]
[…] Business Intelligence Lowdown has a well-dugg post listing what it claims are the 10 largest databases in the world. The accuracy leaves much to be desired, as is illustrated by the fact that #10 on the list is only 20 terabytes, while entirely unmentioned is eBay’s 2-petabyte database (mentioned here, and also here). Only one phone company was listed, and no credit-raters. And for some databases listed, the size given seemed too low. E.g., for Google I’d guess that the average page size it indexes is 10K+ (vs. the 100KB-ish max), even with all the junk stuff in there, so it’s in the 100s of terabytes at a minimum, for raw data alone before considering indexes and so on. […]
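For what it’s worth, the revised estimate in that excerpt works out the same way; a minimal sketch, again treating the ~10 billion page count and ~10KB average as rough assumptions rather than published numbers:

    # Same page count, but ~10 KB of indexable text per page on average
    pages = 10_000_000_000
    avg_bytes_per_page = 10 * 1000     # ~10 KB average, per the excerpt above

    total_bytes = pages * avg_bytes_per_page
    print(total_bytes / 10**12, "TB")  # -> 100.0 TB of raw page text, before indexes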
With Web 2.0 companies capturing 10-50 TB of data a day (growing exponentially), petabytes aren’t all that much after all. It’s all about value density coupled with mixed workload. At a recent XLDB workshop at Stanford it became very clear that the designs currently in the works are for 10-100PB in a SINGLE rdbms, to be deployed over the next 2-3 years. The 1-2PB systems are there right now; growing by 1-2 orders of magnitude will require new concepts and potentially the marriage of multiple technologies. It’s not going to be Vendor A vs. Vendor B, but most likely Vendor X AND Vendor Y. Which brings up the point of consolidation in this industry. Now that the BI folks are consolidating, it will only be natural to see rdbms vendors do the same.
There is a need for 100PB+ solutions out there, but the current mindset of vendors is either too immature or too greedy to make that a reality.
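To put those ingest rates in perspective – the 10-50 TB/day figures are the commenter’s; the annualization below is simple arithmetic and ignores the claimed exponential growth:

    # Annualize the claimed daily capture rates, decimal units (1 PB = 1000 TB)
    for tb_per_day in (10, 50):
        tb_per_year = tb_per_day * 365
        print(tb_per_day, "TB/day ->", tb_per_year, "TB/year =",
              tb_per_year / 1000, "PB/year")
    # -> 10 TB/day is 3.65 PB/year; 50 TB/day is 18.25 PB/year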
[…] of user data in a single instance. He wouldn’t disclose any names, but I’d guess one is eBay, who he did confirm is a customer. The intelligence area is another one where I’d speculate […]