19 bullet points about the difference between enterprise and web search
Eric Lai wrote in this week’s Computerworld about “Why is enterprise search harder than Google Web search?” Highlights included:
- He described enterprise search as consisting mainly of a search box plus faceted searching, with maybe some automated tagging as well.
- He observed that off-page factors such as PageRank don’t work nearly as well in an enterprise as they do on the Web, and that manual tagging by enterprise users falls far short of closing the gap.
- He stumbled a bit in comparing and contrasting search engines and “structured” DBMS.
- He basically endorsed the worldview of Ali Riaz, late of FAST, now of Attivio.
On the whole, that’s not bad. If this were an easy subject to write about, I’d have explained it a lot more clearly in the past myself. OK. Let me get off my duff and give it a whirl now.
Actually, when writing, I generally stay on my duff. At least, that’s true if I’m guessing correctly what a “duff” is. And this is not just a vaguely humorous digression — it’s also an example of why information retrieval is so hard if you only have the text itself to go by.
With that said, here are some notes on web search, enterprise search, single-site search, and database management.
- Web search has a huge problem that enterprise search doesn’t — adversarial information retrieval. Enterprise document creators generally don’t try to game search results.
- Single-site search sometimes works very well, and sometimes works laughably badly. (I often use regular Google after a site-search engine offers up a long list of bug reports and minor upgrade notes, instead of product overviews.) For an egregious example see Oracle.com, where search seems to have gotten even worse than it was a couple of years ago.
- The difference in almost every case, I think, is whether or not the site owner has done a good job of manual tagging. That would explain why changing a choice of search engine can make a site worse, if you don’t rebuild the tags; I suspect this is what happened in the Oracle case, after some acquisition.
- Full structured search technology can be the difference between “pretty good search” and an e-commerce site that really rocks, but it doesn’t work at all without a lot of manual tagging.
- When you have a revenue-generating e-commerce site, it’s easy to justify the work of manual tagging. When you have a marketing site, it can make sense. But inside an enterprise, the tagging isn’t going to happen much.
- Documents in an enterprise lie in a broad range of disparate corpuses. There are spreadsheets, PowerPoint presentations, structured-field database entries, free-text-field database entries, email, instant messages, and so on. And there also are many different corpuses of traditional text documents. A large-enterprise search engine needs tools to dial up or dial down the relevance of different corpuses, both in general and for a specific search.
- Each corpus may also have its own kinds of metadata that are helpful in ranking and summarizing search results.
- As Guy Creese already pointed out in a comment to Eric’s article, security comes into play in enterprise search. I think of that as different users seeing different corpuses.
- Link analysis doesn’t work inside enterprises. Indeed, most of the documents aren’t in HTML and don’t link to each other.
- In web and enterprise search alike, you’re often satisfied by a page that doesn’t help you directly, if it points you at a better resource. But the details of the two cases differ. In enterprises, the better resource is usually a person — i.e., a document author — that you can contact directly. (This is pretty much the main part of knowledge management that actually works.)
- I wrote two years ago about a key missing ingredient to enterprise search technology — an ontology management system. Unfortunately, I’m not aware of significant progress in that direction subsequently.
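To make the corpus-weighting and security bullets above a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Hit` type, the corpus names, the weights, and the `rank` function are hypothetical and do not describe how any particular search product works. It treats security as "which corpuses this user may see" and relevance tuning as a default weight per corpus plus an optional per-query boost.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    corpus: str
    text_score: float  # raw relevance reported by the underlying engine


# Hypothetical default weights per corpus; the values are illustrative only.
DEFAULT_WEIGHTS = {"wiki": 1.0, "email": 0.5, "spreadsheets": 0.7}


def rank(hits, allowed_corpuses, query_boosts=None):
    """Drop hits from corpuses the user may not see, then rescale by corpus."""
    boosts = query_boosts or {}
    visible = [h for h in hits if h.corpus in allowed_corpuses]

    def final_score(h):
        weight = DEFAULT_WEIGHTS.get(h.corpus, 1.0)
        return h.text_score * weight * boosts.get(h.corpus, 1.0)

    return sorted(visible, key=final_score, reverse=True)


hits = [
    Hit("memo-1", "email", 0.9),
    Hit("page-7", "wiki", 0.6),
    Hit("q3-budget", "spreadsheets", 0.8),
    Hit("hr-file", "hr", 0.95),  # a corpus this user is not allowed to see
]

# Dial email up for this one search; the "hr" corpus is filtered out entirely.
ranked = rank(hits, {"email", "wiki", "spreadsheets"}, query_boosts={"email": 2.0})
ranked_ids = [h.doc_id for h in ranked]
```

With the email boost, the memo comes first; without it, the default weights push email to the bottom. A real engine would fold this into scoring rather than re-sorting a result list, but the knobs are the same.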
Finally, I’d like to lay out a few points about the integration of text search and database management.
- Text search became an option in object/relational database management systems a little over a decade ago. The basic paradigm is that a document is stored in a single field, and a full-text index on it is integrated into the RDBMS. SQL syntax was extended to include text operators. Oracle, IBM, and Informix all did this cleanly; Microsoft found a workaround. Although all these vendors were very disappointing in the quality and performance of their search, these engines get a lot of use in application areas where it’s obviously beneficial to integrate search and relational queries.
- Search engines — both standalone and DBMS-integrated — have in recent years gained the capability to query any kind of database field. Why not? They typically can handle 1-200+ other document types as well. Anyhow, business intelligence based on that capability is part of the FAST and now Attivio stories.
- The fundamental technology of full-text indexing is similar to a number of things that go on in the relational/tabular/structured worlds, such as ADABAS’s inverted lists, or anything in the bitmapped and columnar areas. As just one example, SAP’s columnar BI Accelerator is directly built on TREX technology.
- Mark Logic makes a good case that if you’re going to integrate search and DBMS, XML may be the way to go rather than relational.
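The kinship between full-text indexing and inverted lists over structured fields can be sketched in a few lines of Python. This is a toy under stated assumptions: the `rows` data, the field names, and the `lookup` helper are all made up for illustration, and no vendor's engine is claimed to work this way. The point is only that one term-to-row-id structure can serve both a relational-style predicate and a text term.

```python
from collections import defaultdict

# Hypothetical rows mixing a structured field with a free-text field.
rows = [
    {"id": 1, "region": "EMEA", "notes": "customer asked about search pricing"},
    {"id": 2, "region": "APAC", "notes": "renewal pricing approved"},
    {"id": 3, "region": "EMEA", "notes": "search outage reported"},
]

# One inverted list per (field, value) pair, mapping a term to a set of row ids.
index = defaultdict(set)
for row in rows:
    index[("region", row["region"])].add(row["id"])
    for word in row["notes"].lower().split():
        index[("notes", word)].add(row["id"])


def lookup(*terms):
    """Intersect the posting sets for every (field, value) term given."""
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


# A "relational" predicate and a "text" predicate answered the same way.
emea_search = lookup(("region", "EMEA"), ("notes", "search"))
pricing = lookup(("notes", "pricing"))
```

Swap the Python sets for compressed bitmaps and the resemblance to bitmapped and columnar storage gets closer still.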
Comments
16 Responses to “19 bullet points about the difference between enterprise and web search”
Curt –
Do you have any thoughts on why enterprise search vendors (to the best of my knowledge) have so far assiduously ignored the existence and importance of program source code?
Source code is where the rules of the business have been automated, not MSWord policy documents.
I agree with you that ontology management would help tremendously, but to date anything remotely practical remains a distant dream.
– David
David,
Are you talking more about:
A. A non-programmer using a search engine and wanting source code as an answer?
B. A programmer not having a better way to find source code than via a search engine?
A strikes me as unlikely. B, if it’s likely, would be disappointing to me. I understand that configuration management systems use weird file types, or at least did a decade ago when I was more fully up to speed on them. But geez; I’d think they would have had that capability even then, let alone now.
CAM
Curt –
I could certainly see a savvy systems analyst (perhaps having come up through the ranks as a programmer so they have code reading/comprehension skills) being capable of searching code.
It is my assumption that since the rules of the business are buried in source code, someone (likely NOT a programmer) is going to need to find all the places across systems where changes need to be made. Example: I’ve heard that NASDAQ is expanding its trading symbol from 4 to 6 characters. At an absolute minimum that impacts both Windows-based systems and mainframes (with probably some Unix & AS400 thrown in for fun). First pass through for a project guesstimation needs to somehow understand how many different places need to be changed. A programmer typically isn’t going to have the sort of span-of-view to worry about both Windows and mainframe, but a business analyst would (should?).
Configuration management systems aren’t much help since they’re typically like the silos they hold… isolated & hard to get at. COBOL & .NET code are not likely to be stored in the same place or format.
In any case… by your answer I will (safely?) assume that this crazy idea—of there being value in searching source code more effectively—has not appeared on your enterprise search radar screen…?
I’ve found a few companies that appear to be stepping up to at least part of the challenge… Krugle (www.krugle.com), codefetch (www.codefetch.com), codase (www.codase.com), and koders (www.koders.com). I have no idea how strong these offerings are, but a first thought would be that they seem to only look at software languages that have risen to popularity in the past 10 years, such as C, HTML, JavaScript, Java, PHP, Ruby, etc. No mention of legacy—but still in widespread use—languages such as COBOL, JCL, CICS, Assembler, etc.
These vendors appear to be entirely disconnected from the enterprise search space.
– David
Such vendors should be pretty disconnected from enterprise search. There’s been the occasional attempt at an exception, but basically they’re very different problems. Search generally starts by identifying — tokenizing — words. Then it looks at syntax (grammatical structure) and semantics (synonyms and clusters). And it goes on from there.
There’s very little overlap between how that’s done in human languages and how it’s done in computer languages.
CAM
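A small Python illustration of the tokenizing point above. It is a sketch, not a description of any search product: the prose tokenizer is a deliberately naive regex, and the code tokenizer is Python's standard `tokenize` module applied to a made-up assignment statement. The two produce quite different views of the same string.

```python
import io
import re
import tokenize

line = "weeklyPay = hoursWorked * payRate"

# Naive prose-style tokenizing: lowercase, keep runs of word characters only.
prose_tokens = re.findall(r"\w+", line.lower())

# Language-aware tokenizing: keep identifiers and operators, with their types.
code_tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(line).readline)
    if tok.type in (tokenize.NAME, tokenize.OP)
]
```

The prose view folds case and throws away the operators, leaving three opaque "words". The code view keeps the assignment and the multiplication, which is where the business rule actually lives.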
[…] search quite similarly, as I discovered when she called me yesterday to praise my post on the many differences between enterprise and web search, and followed up with this one of her own. One of Lynda’s big themes is that large […]
Curt –
>
> There’s very little overlap between how that’s done in human languages and how it’s done in computer languages.
>
Acknowledged.
Obviously there’s been a huge amount of work done with human languages—proximity, stemming, probabilities, etc.—that simply will not work when applied to software languages.
My point is that the rules of the business are deeply buried and highly fragmented in extremely difficult to comprehend software, not in scores of well written document formats (MSWord, email, PowerPoint, etc.) intended for human consumption.
Enterprise search is going about the problem by looking for the lost keys under the street light (“Because that’s where the light is… but I lost the keys over by the car which is in the dark.”) simply because it’s easy.
My primary beef here is that software (source code) is simply not considered to be a document… and therefore is not worthy of being brought to the search table.
How are people going to know what’s really happening “behind the firewall” if source code is not included in the searching process?
When the CEO issues the command “I want social security number either encrypted or removed from use where not necessary.” you’re going to rely on what your easily findable word processing format documents tell you? Surely you’re not telling me that?
– David
What I’m saying, David, is that a specialized tool is needed. It’s not realistic to ask one search product to find EVERYTHING.
General search products don’t even work well across the full range I think they should cover. And the specific problem you’re referring to falls outside that range.
Configuration management and app dev tools have ever more understanding of software’s syntax and semantics. I’d start from them as a base, rather than from traditional inverted-file text-string indexing.
CAM
Curt –
>
> a specialized tool is needed. It’s not realistic to ask one search product to find EVERYTHING.
>
Again, agreed.
The ultimate enterprise search tool is obviously going to need a passel of highly specialized tools under the covers. Making it look as easy & slick as Google is going to be interesting. A major challenge, of course, is that most enterprises have only the foggiest idea of their applications inventory.
First, source code has to be considered a valuable & important BUSINESS search resource before we start thinking about what sort of exotic tools are needed.
It is my argument (clearly a voice of one) that the corporate knowledge buried in source code needs to be recognized as worthwhile to mine… rather than to leave it walking around in the heads of soon-to-retire experts. Currently, changing systems is slow, expensive, manual work, far too often highly dependent upon domain experts. It is my belief that enterprise search could be a significant help in whittling away at the “80% of my IT budget goes to legacy systems” problem.
Obviously (after a lot of non-obvious rat holes) you have to approach enterprise search with a “white list” approach… first pass you identify what it is you’re trying to read (e.g. PowerPoint, MSWord, COBOL, dBase, etc.), second pass you process it with the appropriate reader. If you can’t identify what it is, then don’t try to read more than a few lines. Probably best not to rely on extensions (.exe, .doc) as gospel as to what the document truly is.
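The two-pass "white list" idea could be sketched roughly like this in Python. The three byte signatures are the well-known leading bytes of PDF, ZIP-based (.docx, .pptx, .xlsx) and OLE2 (legacy .doc, .ppt, .xls) files. Everything else is an assumption made for illustration: the COBOL check is a crude heuristic, and the function names, the 512-byte sniff window, and the 200-byte cutoff for unknown files are invented here.

```python
import re

# Pass 1 whitelist: content signatures, checked instead of trusting extensions.
MAGIC = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip-container"),                # also .docx/.pptx/.xlsx
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy .doc/.ppt/.xls
]
# Rough heuristic (an assumption): COBOL sources open with this division header.
COBOL_HINT = re.compile(rb"IDENTIFICATION\s+DIVISION", re.IGNORECASE)


def identify(head: bytes) -> str:
    """Guess the document kind from its first bytes, ignoring the file name."""
    for magic, kind in MAGIC:
        if head.startswith(magic):
            return kind
    if COBOL_HINT.search(head):
        return "cobol"
    return "unknown"


def read_for_index(data: bytes, sniff_bytes: int = 512):
    """Pass 2: hand known kinds on whole; read only a little of unknown ones."""
    kind = identify(data[:sniff_bytes])
    if kind == "unknown":
        return kind, data[:200]
    return kind, data


# A file that might be named "report.doc" but is really COBOL source.
mislabeled = b"       IDENTIFICATION DIVISION.\n       PROGRAM-ID. PAYROLL.\n"
kind, _ = read_for_index(mislabeled)
```

In a fuller version, each identified kind would be dispatched to its own reader; here the second pass only decides how much of the file is worth reading at all.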
I’m not aware of any application development tools that bring semantic understanding to the table. But then maybe I’m quibbling over the definition of “semantics.”
Development tools (Xcode/ObjectiveC being my most current knowledge) that I’m familiar with are equally happy with:
a = b * c or
weeklyPay = hoursWorked * payRate.
If an Eclipse plug-in has brought something more robust to the table, please point me in the right direction.
– David
David,
Why don’t you take a look at the tools that purported to automate the finding (if not fixing) of Y2K 2-character data fields? That was, er, 8+ years ago, so they’ve had a lot of time to evolve since then.
CAM
Curt –
That’s precisely why I’m interested in enterprise search.
I was a Y2K inventory/impact analysis tool vendor, getting into the market in late 1994. We had a tool that explicitly handled “odd” languages (beyond the biggies of COBOL & PL/1)… things like EasyTrieve, EasyTrieve Plus (they’re not related), Natural and others long forgotten. It was a hoot chasing down folks who had these bizarre languages I’d never heard of before. Ever heard of Extracto?
I knew we were onto something interesting/challenging when Capers Jones sent me his Function Points languages list in 1995. There were 400+ software languages on the list. By 2005 that list had expanded to 650 before being “pruned” back to a more manageable 500.
To the best of my knowledge none of the Y2K inventory/impact analysis tools have survived. I know ours didn’t (I know of a single surviving site).
We got to 1/1/00. The world didn’t end. The tools & systems inventory knowledge went into the bit-bucket. End of story.
It’s my belief that most “civilians” see Y2K as a giant techie hoax. I’m sure a lot of IT departments did not cover themselves in glory in the eyes of business executives for heavily porking up IT budgets under the dodge of “we need it for Y2K.”
The business value of actively maintaining a complete, accurate & edge-to-edge inventory of an organization’s applications portfolio (with the additional benefit of being able to trace how the pieces are interrelated) is a very hard sell.
The high-school dropout running the local 7-11 knows how many candy bars & jugs of milk he has on hand (inventory). Why doesn’t IT keep an inventory? There was a news item last year about EDS doing an outsourcing contract for the US Navy. They went in believing there were 5,000 systems. EDS ultimately found 100,000+.
What is different now is that we have the delight of Google… which means people now want to have the same ease-of-use access to knowledge/answers/information behind the firewall. The fact that serious analysts clearly emphasize that Google & enterprise search are not even remotely comparable problems just falls on the floor as useless noise.
Thanks for being interested.
– David
I use my Desktop Search to search source code at a telephone billing software company. It is a non-indexing search. The first step is to “Merge/Append” all the source code into one file, then search that file. When the files are merged, a start and stop header is put in the merged file. When a match is found, the originating file name is displayed in the form title bar. It searches text at 20,000,000 cps. Any system worth its salt can export data to text. I have all my emails since 1996 in large text files. I can even use the search to extract lists of email addresses.
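For readers who want to picture the merge-then-scan approach just described, here is a minimal Python sketch. It is not the commenter's tool: the header markers, function names, and sample sources are hypothetical, the data is held in memory rather than in files, and the match is a plain case-insensitive substring test.

```python
START, STOP = "#### START ", "#### STOP "  # hypothetical header markers


def merge(sources):
    """Concatenate all sources into one text, bracketing each with headers."""
    parts = []
    for name, text in sources.items():
        parts.append(f"{START}{name}")
        parts.extend(text.splitlines())
        parts.append(f"{STOP}{name}")
    return "\n".join(parts)


def search(merged, needle):
    """Scan the merged text linearly, tracking which file each line came from."""
    current = None
    found = []
    for line in merged.splitlines():
        if line.startswith(START):
            current = line[len(START):]
        elif line.startswith(STOP):
            current = None
        elif needle.lower() in line.lower():
            found.append((current, line.strip()))
    return found


sources = {
    "billing.c": "int rate = 5;\ncharge(rate);",
    "tax.cob": "COMPUTE TAX = RATE * 2.",
}
merged = merge(sources)
matches = search(merged, "rate")
```

Because there is no index, every search rereads the whole merged text; the start and stop headers are what let each hit be traced back to its originating file.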
The search has evolved to randomly play mpg video and mp3 audio as well as pictures.
I have been arguing search with everybody on the net, for years now.
http://channel9.msdn.com/showuserthreads.aspx?userid=31672
http://forums.thedailywtf.com/forums/t/7593.aspx
[…] Text search is a huge business on the web, and a separate big business in enterprises. And text doesn’t fit well into the relational paradigm at […]
[…] “Looking at this list, you can see that the conceptual changes (breakthroughs?), with the exception of better phrase handling, are primarily focused around Web searches. When dealing with one-of-a-kind document collections behind the corporate firewall, many of these developments turn out not to add much to older approaches. So, at least for enterprise search, I too remain partial to some of the older products you mention, though I am disappointed that most of the old-time vendors have not updated their approaches beyond adding taxonomy support.” [CAM] Yep, web search and enterprise search are very different things. […]
[…] http://www.texttechnologies.com/2008/01/14/enterprise-search-versus-web-search/ […]
You guys are geeks.
[…] include links to the pieces that influenced the following comments. But one by Curt Monash in his piece on January 14 summarized the state of this industry and its already long history. It is noteworthy […]