A search engine doesn’t actually search the internet; it searches a pre-built index of the internet.
Let’s watch it happen. Imagine we want to index example.com.
Crawl:
A crawler, often called a "spider" or "bot," starts with a seed URL. It fetches the page content, parses it for links, and adds new URLs to a queue. For example.com, it might fetch https://example.com/ and find links to /about and /products. These are added to the queue. The crawler respects robots.txt, which tells it which paths to avoid. It also needs to manage politeness: it shouldn't hit the same server too frequently. This is usually done with a per-domain delay, say 1 second between requests to example.com.
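The fetch-parse-enqueue loop above can be sketched in a few lines of Python. This is a minimal single-threaded sketch: the PAGES map and fetch_links are stand-ins for real HTTP fetching and HTML link extraction, and a real crawler would also check robots.txt before each request.

```python
import time
from collections import deque
from urllib.parse import urljoin

# Stand-in for real HTTP fetching + HTML parsing: maps a URL to the
# links found on that page. A production crawler fetches over the
# network and honors robots.txt before requesting any path.
PAGES = {
    "https://example.com/": ["/about", "/products"],
    "https://example.com/about": ["/"],
    "https://example.com/products": ["/about"],
}

def fetch_links(url):
    return PAGES.get(url, [])

def crawl(seed, delay=0.0):
    """Breadth-first crawl from a seed URL with a per-request delay."""
    queue = deque([seed])
    visited = set()
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        time.sleep(delay)  # politeness: pause between requests
        for link in fetch_links(url):
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in visited:
                queue.append(absolute)
    return visited

print(sorted(crawl("https://example.com/")))
# ['https://example.com/', 'https://example.com/about', 'https://example.com/products']
```

The visited set is what prevents the crawler from looping forever on pages that link back to each other.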
Index:
Once content is fetched, it needs to be indexed. This is where we build the searchable map. For example.com/about, which might contain "About Us: We are a company.", we create an inverted index. This is a mapping from terms to the documents they appear in.
about -> document_id_1
us -> document_id_1
we -> document_id_1
are -> document_id_1
a -> document_id_1
company -> document_id_1
We also store document metadata, like the URL itself. A key step here is tokenization (splitting text into words) and normalization (e.g., converting "Company" to "company", removing punctuation). Stop words like "a", "is", and "the" are often removed because they add little discriminating power.
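The tokenize-normalize-index pipeline above fits in a short sketch. The STOP_WORDS set here is an illustrative toy list, not a standard one; note that with stop words removed, only "about" and "company" from the /about page would actually survive into the index.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "is", "are", "we", "us"}  # illustrative list

def tokenize(text):
    """Lowercase, strip punctuation, split into words, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_index(docs):
    """docs: {doc_id: text}. Returns an inverted index: term -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

index = build_index({"document_id_1": "About Us: We are a company."})
print(sorted(index))  # ['about', 'company']
```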
Rank:
When a user searches for "company about", the engine looks up "company" and "about" in the inverted index. It finds document_id_1 for both. Now, how do we decide if document_id_1 is the best result? This is ranking. A simple approach is term frequency-inverse document frequency (TF-IDF).
- Term Frequency (TF): How often does a term appear in a document? More occurrences suggest relevance.
- Inverse Document Frequency (IDF): How rare is the term across all documents? Rare terms are more discriminative.
A document gets a score for a query term based on TF * IDF. For a multi-term query, scores are summed. More advanced ranking uses link analysis (like PageRank, where a link from page A to page B is a vote of confidence for B), user click data, and machine learning models.
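The TF * IDF scoring just described can be sketched as follows. This assumes raw term counts for TF and log(N / df) for IDF; real systems use many variants (smoothing, length normalization, BM25), and the corpus statistics below are invented for illustration.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_terms, doc_freq, num_docs):
    """Sum TF * IDF over query terms.
    TF  = raw count of the term in the document.
    IDF = log(N / df), where df is the number of documents containing the term."""
    counts = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term appears in no document; contributes nothing
        score += counts[term] * math.log(num_docs / df)
    return score

# Hypothetical corpus stats: 10 docs total; "company" appears in 2, "about" in 5.
score = tf_idf_score(
    ["company", "about"],        # the query
    ["about", "company"],        # tokens of document_id_1
    {"company": 2, "about": 5},  # document frequencies
    10,                          # total documents
)
print(round(score, 3))  # 2.303
```

Note how the rarer term "company" (df = 2) contributes more to the score than the more common "about" (df = 5), which is exactly the discriminative effect IDF is meant to capture.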
The core challenge at scale is that the internet is enormous. We’re talking petabytes of data.
Crawl at Scale:
Crawlers run in parallel, managed by a distributed system. A URL frontier manages the queue of URLs to visit. This frontier needs to be highly available and capable of storing billions of URLs. Systems like Apache Nutch, or custom-built solutions using Kafka for queues and distributed storage for visited URLs, are common. The crawler agents themselves run on thousands of machines. They need to handle massive concurrency and network I/O efficiently, often using asynchronous programming models. Fetching and parsing robots.txt files and sitemaps must also be distributed across the fleet.
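One way a distributed frontier can preserve the per-domain politeness from earlier is to route every URL for a given host to the same worker, so that worker can enforce the delay locally. A minimal sketch of that routing, where NUM_WORKERS and the SHA-256-based assignment are illustrative choices rather than how any particular system does it:

```python
import hashlib
from urllib.parse import urlparse

NUM_WORKERS = 4  # illustrative; real frontiers span many machines

def worker_for(url):
    """Route a URL to a worker by hashing its host. All URLs for a domain
    land on the same worker, so per-domain politeness stays local."""
    host = urlparse(url).netloc
    digest = hashlib.sha256(host.encode()).hexdigest()
    return int(digest, 16) % NUM_WORKERS

# Both example.com URLs are guaranteed to route to the same worker.
assert worker_for("https://example.com/about") == worker_for("https://example.com/products")
```

The trade-off is load skew: a few huge domains can overload their assigned workers, which is why real frontiers add further balancing on top of simple hashing.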
Index at Scale:
The inverted index cannot fit on a single machine, so it's sharded (split) across thousands of servers. Each shard holds a portion of the index, typically based on a hash of the term. When a query comes in, it's broadcast to all relevant shards; each shard returns a list of documents matching its portion of the query, and these results are aggregated and re-ranked. Storage for the index can be in-memory for speed, backed by persistent storage like SSDs or distributed file systems (e.g., HDFS). Indexing pipelines are complex, involving data ingestion, parsing, tokenization, stop word removal, stemming/lemmatization, and finally writing to the sharded index. Updates to the index (new pages, changed content) are handled by incremental indexing processes that merge new data into existing shards.
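The scatter-gather flow over a term-sharded index can be sketched in miniature. Here the shards are in-process dictionaries rather than remote servers, and NUM_SHARDS and the MD5-based routing are illustrative stand-ins:

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 3  # illustrative; real deployments use thousands of servers

def shard_for(term):
    """Assign each term to a shard by hashing the term."""
    return int(hashlib.md5(term.encode()).hexdigest(), 16) % NUM_SHARDS

def build_shards(docs):
    """docs: {doc_id: [terms]}. Each shard holds only its slice of the index."""
    shards = [defaultdict(set) for _ in range(NUM_SHARDS)]
    for doc_id, terms in docs.items():
        for term in terms:
            shards[shard_for(term)][term].add(doc_id)
    return shards

def search(shards, query_terms):
    """Scatter each term to the shard that owns it, gather and union results."""
    results = set()
    for term in query_terms:
        results |= shards[shard_for(term)].get(term, set())
    return results

shards = build_shards({"doc1": ["about", "company"], "doc2": ["products"]})
print(search(shards, ["company", "about"]))  # {'doc1'}
```

In a real system each shard is a separate server, so "scatter" means parallel RPCs and "gather" means merging partial result lists before re-ranking.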
Rank at Scale:
Ranking computations need to be fast: users expect results in milliseconds. Pre-computing parts of the rank score (like IDF or PageRank) is crucial. Ranking models themselves can be complex. A simple TF-IDF model might be computed on the fly, but a modern search engine uses hundreds of signals. These signals are often pre-computed or updated periodically. For instance, PageRank is calculated offline in massive batch jobs. User behavior signals (click-through rates, dwell time) are collected in real-time and used to update ranking models. When a query arrives, the system retrieves candidate documents from the index, fetches pre-computed features for those documents, and then runs a fast scoring function (often a learned model) to produce the final ranked list. Distributed systems are used for both feature retrieval and model inference.
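The serve-time flow just described (retrieve candidates, fetch pre-computed features, apply a fast scorer) can be sketched with a toy linear model. FEATURES and WEIGHTS here are invented stand-ins for offline-computed signals and a learned model; the point is that the expensive work happened before the query arrived.

```python
# Hypothetical pre-computed, per-document signals (computed offline in
# batch jobs, e.g. PageRank, or aggregated from user behavior logs).
FEATURES = {
    "doc1": {"pagerank": 0.8, "clicks": 0.3},
    "doc2": {"pagerank": 0.2, "clicks": 0.9},
}
WEIGHTS = {"pagerank": 0.6, "clicks": 0.4}  # stand-in for a learned model

def score(doc_id):
    """Cheap linear scoring over pre-computed signals; at query time this
    is just a few lookups and multiplications per candidate."""
    feats = FEATURES[doc_id]
    return sum(WEIGHTS[name] * feats[name] for name in WEIGHTS)

def rank(candidates):
    """Order the candidate documents retrieved from the index."""
    return sorted(candidates, key=score, reverse=True)

print(rank(["doc1", "doc2"]))  # ['doc1', 'doc2']
```

Production rankers replace the linear combination with gradient-boosted trees or neural models, but the shape is the same: a small, fast function over features that were mostly computed ahead of time.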
The most surprising aspect of a large-scale search index is how much of the "ranking" work is actually done before the user even types a query. Pre-computation of scores, link analysis, and learning from past user behavior allows the system to respond in milliseconds to novel queries, rather than performing massive computations on the fly.
The next frontier after mastering crawl, index, and rank is understanding and implementing the nuances of relevance signals and user intent modeling.