On Search, the Series

Few people with a deep understanding of search can write eloquently about it. Search engine pioneer Tim Bray is one of the few. He’s written an absolutely fabulous series of essays that should be essential reading for anyone wanting a thorough understanding of search technology.

Bray’s worn many hats over his career. He’s best known as a coauthor of the XML specification. But in the very early days of the Web, he was deeply involved in creating and running one of the first search engines, the long-gone Open Text Index. Open Text was Yahoo’s original search partner.

These days, Bray is CTO of Antarctica Systems, best known for its visual search interfaces.

His series of essays, “On Search, the Series,” is virtually a textbook on search engine technology. The essays are highly readable and replete with Bray’s personal insights and opinions.

In an overview to the series, he says:

I’ve written these pieces because I care about search and because the lessons of experience are worth writing down; but also because I’d like to change this part of the world. In short, I’d like to arrange for basically every serious computer in the world to come with fast, efficient, easy-to-manage search software that Just Works.

There are 15 installments, as well as an informal overview/table of contents.

The series begins with a backgrounder covering the business of search, as well as an excellent history. Choice quote: “The fact of the matter is that there really hasn’t been much progress in the basic science of how to search since the seventies.”

Rather than diving head-first into search engine mechanics and engineering, the next essay considers what people search for. Analyzing user logs of searches on the Open Text Index between late 1994 and early 1996, Bray gained deep insight into the information needs of users, coming away with “two lessons that loom larger than all the others put together.”

The lessons? I won’t spoil the suspense, but you’ll likely nod your head in agreement once you read the essay.

Next, Bray discusses search engine basics: popular features of search engines and their costs and benefits. Things start to get a bit technical here, but it’s well worth the effort to understand the basic data structures and algorithms search engines use to provide “results.”
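To give a feel for the kind of data structure involved, here is a minimal sketch of an inverted index, the classic structure behind full-text search. It is an illustration only and is not taken from Bray’s essay: the function names, the sample documents, and the simple whitespace tokenization are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start with the postings for the first word, then intersect the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, keyed by id.
docs = {
    1: "open text index",
    2: "full text search engine",
    3: "search the open web",
}
index = build_index(docs)
print(sorted(search(index, "text search")))  # [2]
print(sorted(search(index, "open")))         # [1, 3]
```

The index is built once, so each query costs only a few set lookups and intersections rather than a scan of every document. Real engines add much more (word positions, compression, ranking), but the basic trade of indexing time for query speed is the same.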

How do you measure search engine effectiveness? How do you know whether one system is improving, or whether there are meaningful differences between two systems? One way is to measure “precision” and “recall,” the most common measures of search performance. The next essay shows that these measures are useful, but it also demonstrates their limitations as metrics.
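For readers new to the terms, precision is the fraction of returned results that are relevant, and recall is the fraction of all relevant documents that were returned. The short sketch below computes both from the standard definitions. The function name and the document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query.

    retrieved: ids the search engine returned
    relevant:  ids a human judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 results returned, 5 documents actually relevant,
# and 2 of the returned results are among the relevant ones.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)  # 0.5 0.4
```

Note that computing recall requires knowing every relevant document in the collection. That is rarely practical at Web scale, which is one commonly cited limitation of the measure.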

“Here’s the problem: searching for words isn’t really what you want to do. You’d like to search for ideas, for concepts, for solutions, for answers.” In the fifth essay, Bray considers keyword analysis, how search engines look at position, frequency, and word emphasis to distill meaning. This essay is somewhat bleak — Bray isn’t optimistic about the future of making search engines more “intelligent.”

Squirmy Words and Interface

The sixth installment looks at “squirmy words.” Language is inherently complex and often ambiguous — a major challenge for search engines. Interestingly, Bray concludes this lexical chaos has surprisingly moderate consequences for search systems.

Next comes a detour describing an unusual search user interface Bray built just as the Web was emerging, with philosophical observations on why it didn’t succeed at the time.

Returning to search mechanics, the next installment considers “stopwords,” common words that “appear unreasonably often and carry unreasonably little information,” causing many search engines to ignore them.
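A small sketch shows what ignoring stopwords looks like in practice. The stopword list here is a tiny hypothetical one chosen for the example, not a list from Bray’s essay or from any particular search engine.

```python
# A tiny, hypothetical stopword list; real lists are longer and vary by engine.
STOPWORDS = {"a", "an", "and", "be", "in", "is", "not", "of", "or", "the", "to"}


def index_terms(text, stopwords=STOPWORDS):
    """Return the lowercase words of text that would actually be indexed."""
    return [word for word in text.lower().split() if word not in stopwords]


print(index_terms("The history of the search engine"))
# ['history', 'search', 'engine']

print(index_terms("to be or not to be"))
# []
```

The second call shows a well-known side effect: a phrase made up entirely of stopwords disappears, so an engine that drops these words cannot match it.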

In the essay on metadata, Bray may surprise some readers with his much broader (but quite accurate) definition of metadata and how successful players such as Yahoo and Google use it to their advantage. “Neither has actually had better text search technology than the competition,” Bray writes, which may sound like heresy to some. It’s well worth reading what metadata is, where it comes from, and how to use it.

Internationalization is the focus of the tenth essay. What happens when people write in languages other than English, using characters not found in our alphabet? That’s a major issue for search engines, and one that will increase in importance over time.

Result ranking is the next topic to come under Bray’s critical eye. When you’ve got a lot of stuff in a big database (e.g., a search engine’s index of the Web), how do you decide what goes on top of result lists? Bray concludes the current state of result ranking (beyond the top few results for most searches) isn’t very good. But he also discusses some promising techniques he believes are underexplored.

In the next essay, Bray literally thinks outside the (search) box, describing the current state of search interfaces and again proposing an alternative approach he feels may provide a better user experience.

Since Bray is a coauthor of the XML specification, it’s no surprise that he includes an essay on XML searching. XML is gradually creeping into just about everything we do with computers. Bray believes it’s important to think about searching XML.

A “tour through Robot Village” looks at the crawlers, spiders, and other critters that traverse the Web, discover information, and bring it back to search engines for indexing.

To conclude the series, Bray proposes a model and conceptual framework for where search should go in the future.

Most technical writing about search technology is jargon-laden, filled with arcane equations and tightly knit logic. “On Search, the Series” offers a refreshingly different approach to explaining search tools we rely on daily. As a bonus, it’s filled with comments and anecdotes from the personal experiences of someone who can genuinely lay claim to being a Web search pioneer.
