Let's design a Web Crawler.

October 10, 2025

a web crawler is a system that visits pages across the internet, retrieves their data, and stores it for later processing. the processing itself is secondary and can be anything from ranking pages for a search engine to training an LLM over the data. the main question here is how to crawl the internet to get that data.

we can't crawl the entire internet; a good chunk of it sits in corners we can't even reach. the goal is broad coverage, not completeness.

the major requirements are as follows (functional):

  1. we should be able to take in a set of seed (root) urls, crawl their neighbors, and keep repeating the process.
  2. we should be able to store the data we fetch from each crawled page for later processing.

things that need to be kept in mind in order to maintain this system (non-functional):

  1. crawling needs to be resumable if interrupted mid-way. we can't lose the entire progress of a crawl, so we should be able to recover from any failure gracefully without losing progress. fault tolerance.
  2. we should respect the web servers we crawl by adhering to robots.txt and not flooding them with requests. politeness.
  3. scalability. the web is huge, so we should be able to handle a heavy load.

general parts in a crawl system include:

  1. frontier: a queue where urls sit until they are picked up by crawler workers. a centralized frontier is the go-to approach; a local (per-worker) frontier has major issues such as leaving some workers idle since there is no load balancing, duplication of work, etc.
  2. html downloader: the crawler itself. it is generally a worker that talks to DNS and fetches the webpage data. parsing the page and further processing could live here too, but it is better to separate them out in order to maintain fault tolerance.
  3. DNS and webpage: DNS helps the crawler resolve the location from which the webpage data can be fetched. the webpage is the webpage itself :) (the server hosting it).
  4. database: the storage unit where we store our data. it can be anything; generally blob stores like S3, google cloud storage, or azure blob storage are preferred.

the whole internet is a graph, and there are generally two ways to traverse a graph: DFS or BFS. DFS is generally not preferred as it goes to deeper depths and makes the system vulnerable to traps and infinite loops. also, the internet is enormous, and we wouldn't discover a diverse range of pages if we went depth-first. hence, BFS is considered the go-to in web crawling.
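
to make the BFS idea concrete, here is a minimal single-process sketch in python, assuming hypothetical fetch_page and extract_links helpers; the real system replaces the in-memory deque and set with the frontier and a shared seen-url store:

```python
from collections import deque

def bfs_crawl(seed_urls, fetch_page, extract_links, max_pages=1000):
    """Minimal BFS crawl: seed urls first, then their neighbors, level by level.
    fetch_page(url) -> html and extract_links(html) -> list[str] are assumed helpers."""
    frontier = deque(seed_urls)          # FIFO queue = BFS order
    visited = set(seed_urls)
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        html = fetch_page(url)           # network I/O happens here
        crawled.append(url)
        for link in extract_links(html):
            if link not in visited:      # skip urls we have already queued
                visited.add(link)
                frontier.append(link)
    return crawled
```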

for the frontier part, instead of a centralized frontier which requires an extra network call, we can employ a distributed frontier, something like this, to avoid that extra call:

  • urls are passed along between the nodes (in a distributed environment where multiple workers live on a single node) using a load balancer. every node contains its own frontier.
  • to avoid duplication, we have two options. one is to again maintain a centralized db of seen urls and ping it, which costs an extra network call. the other is hash-based balancing on the URL: if x == y, both x and y reach the same node based on their hash, and each node keeps its own logic to avoid crawling the same url twice (see the sketch after this list).
  • but how do we avoid duplicate content? this again requires an external store where we keep the hash of the content and manage it separately, something like redis. it offers strong consistency, and we can maintain replicas to take care of failures.
  • or we can maintain CRDTs (conflict-free replicated data types) with an anti-entropy background task to merge them. this is idempotent and eventually consistent, but it can still occasionally let a few duplicate crawl requests through.
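
a minimal sketch of the hash-based routing mentioned above; node_count and the sha1 choice are assumptions, any stable hash works:

```python
import hashlib

def node_for_url(url: str, node_count: int) -> int:
    """Route a url to a frontier node by its hash: the same url always lands on
    the same node, so each node can dedupe locally without a shared database."""
    digest = hashlib.sha1(url.encode()).hexdigest()
    return int(digest, 16) % node_count

# identical urls discovered by different workers map to the same node
assert node_for_url("https://example.com/a", 8) == node_for_url("https://example.com/a", 8)
```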

a distributed frontier requires a complex implementation, and it gets more cumbersome when we introduce priority-based queues, etc. everything below assumes we are maintaining a centralized frontier; i have noted wherever something also applies to the distributed case.


key things to maintain in web crawling:

  1. politeness:
    1. not hitting the same server again and again within a short period of time. we shouldn't be overloading any single server.
    2. hence, we have to maintain politeness. but say we are crawling a certain page: a domain generally has many pages under different paths, and they can all be discovered around the same time.
    3. so, in the frontier, we maintain domain-based queues in order to preserve politeness. this means all the webpages from the same domain sit in the same queue, and a crawler is assigned a whole queue instead of a single webpage.
    4. this ensures no two crawl workers hit the same domain at the same time. of course, this can be configured further: we can set a threshold on how many parallel requests may hit a certain domain and make our queue follow that.
    5. we should also adhere to the robots.txt file. it lists the paths we are not allowed to crawl and is maintained by the site owner/organization. it may also specify a crawl-delay, the delay we should keep between subsequent requests to the same domain (a minimal robots.txt check is sketched right after this list).
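
as a concrete example of the robots.txt piece, here is a small sketch using python's standard urllib.robotparser; the user-agent string is a placeholder:

```python
from urllib import robotparser

def robots_rules(domain: str, user_agent: str = "my-crawler"):
    """Fetch and parse robots.txt for a domain, returning a checker and the
    crawl-delay (None if the site does not specify one)."""
    rp = robotparser.RobotFileParser()
    rp.set_url(f"https://{domain}/robots.txt")
    rp.read()                                  # performs the HTTP fetch
    delay = rp.crawl_delay(user_agent)         # Crawl-delay directive, if any
    return rp, delay

# usage: skip disallowed paths and respect the advertised delay
# rp, delay = robots_rules("example.com")
# if rp.can_fetch("my-crawler", "https://example.com/some/path"):
#     ...
```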

in some cases, having separate queues is tiresome. instead, we can keep a single metadata store per domain that holds the disallowed paths (from robots.txt), the crawl-delay (an external metric provided by the domain), and so on. each crawl worker consults this metadata before fetching a page and then decides whether to fetch now or wait a few seconds. the crawl-delay should be the primary metric for pages of a domain when the domain provides it; otherwise we impose our own rate limit (generally 1 request per second per domain) to maintain politeness.

so, in the case of multiple crawl workers:

  • all the crawl workers check the metadata stored in some kind of in-memory store (like redis) and then decide whether to crawl a webpage or not. we can use a sliding-window algorithm to track the number of requests to a domain within a certain time window. this window is primarily the crawl-delay if the domain provides one, or else our usual rate-limiting threshold.
  • crawl workers return a webpage to the frontier queue if they are not allowed to crawl it at that moment. queues like SQS have a DelaySeconds parameter to push the retry out by a period.
  • there is a chance that multiple crawl workers ping the metadata at the same time while the domain is rate-limited, and they all get rejected. hence, we introduce some jitter (randomness) into the retry delays so that they do not all retry at the same instant (see the sliding-window sketch below).
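
a rough sketch of the sliding-window check plus jittered requeue described above, assuming redis-py and boto3 clients and a hypothetical frontier queue URL (note the check-then-add is not atomic; a lua script would make it so):

```python
import json
import random
import time

import boto3   # assumed available; SQS is only one option for the frontier queue
import redis   # assumed available; the shared metadata/rate-limit store

r = redis.Redis()
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789/frontier"  # hypothetical

def allow_crawl(domain: str, window_s: float, max_requests: int = 1) -> bool:
    """Sliding-window check in Redis: count recent requests to this domain and
    only allow a new one if we are under the per-window limit."""
    key = f"rate:{domain}"
    now = time.time()
    r.zremrangebyscore(key, 0, now - window_s)      # drop entries outside the window
    if r.zcard(key) >= max_requests:
        return False
    r.zadd(key, {str(now): now})                    # record this request
    r.expire(key, int(window_s) + 1)
    return True

def requeue_with_jitter(url: str, delay_s: float) -> None:
    """Push the url back to the frontier with a delay plus random jitter so
    rate-limited workers do not all retry at the same instant."""
    jittered = int(delay_s + random.uniform(0, delay_s))
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"url": url}),
                     DelaySeconds=min(jittered, 900))   # SQS caps DelaySeconds at 900
```
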
  2. priority:
    1. each webpage is given a certain priority based on PageRank, update frequency, number of visitors, etc.
    2. for any task, the crawler should consider the more valuable pages first, ahead of the ones that can be ignored.
    3. hence, a priority queue can sit in front of the politeness queues to separate webpages by priority. the crawler then picks webpages with a bias towards the higher-priority ones (a weighted-selection sketch follows this list).
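
one way to bias the crawler towards higher-priority pages is weighted selection across priority tiers; the tiers and weights below are arbitrary assumptions:

```python
import random
from collections import deque

# hypothetical priority tiers and the bias with which the crawler drains them
PRIORITY_QUEUES = {"high": deque(), "medium": deque(), "low": deque()}
WEIGHTS = {"high": 0.6, "medium": 0.3, "low": 0.1}

def next_url():
    """Pick a tier with probability proportional to its weight, restricted to
    non-empty tiers, so high-priority pages are crawled first while low-priority
    pages still make progress (no starvation)."""
    tiers = [t for t in PRIORITY_QUEUES if PRIORITY_QUEUES[t]]
    if not tiers:
        return None
    weights = [WEIGHTS[t] for t in tiers]
    tier = random.choices(tiers, weights=weights, k=1)[0]
    return PRIORITY_QUEUES[tier].popleft()
```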

how can we achieve fault-tolerance?

a crawler may fail at any time, and we need to be able to return to the state at the time of failure.
we separate the stages of the system: DNS look-up & fetching as one part, and parsing the webpage & retrieving the necessary information as another. this way we can scale each part as needed, and the stages do not depend on each other. we maintain queues in between to carry the work. we can't put the html data itself in the queue as it is too heavy, so we keep a separate url-metadata queue that points the parser at the fetched content sitting in blob storage (a hand-off sketch is below).
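
a sketch of that hand-off, assuming S3 as the blob store and SQS as the queue between fetcher and parser; the bucket and queue names are hypothetical:

```python
import hashlib
import json

import boto3  # assumed: S3 for raw html, SQS as the queue between fetcher and parser

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
BUCKET = "crawl-raw-html"                                             # hypothetical
PARSE_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789/parse"   # hypothetical

def hand_off_to_parser(url: str, html: str) -> None:
    """Fetcher side: write the heavy html payload to blob storage and push only a
    small metadata message (url + blob key) to the parse queue."""
    key = f"raw/{hashlib.sha256(url.encode()).hexdigest()}.html"
    s3.put_object(Bucket=BUCKET, Key=key, Body=html.encode())
    sqs.send_message(QueueUrl=PARSE_QUEUE,
                     MessageBody=json.dumps({"url": url, "s3_key": key}))
```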

fetching a webpage is the part most prone to failures, as webpages are often unresponsive, overloaded, or simply not reachable. hence, we have to maintain a system that respects both politeness and fault tolerance. there are several approaches, each with its own challenges:

  1. in-memory timer: we set a timer for every fetch and retry the webpage if the timer runs out. this is not feasible, as we may lose the timers if the system fails; we would also have to maintain our own logic to separate the normal webpages from the failed ones, and implement our own exponential back-off for the sake of politeness.
  2. kafka queue: kafka has no built-in exponential back-off or timeout handling, but we can implement our own. we create a separate topic that holds all the failed webpages along with a timestamp, and crawlers respect politeness and a retry-count threshold while re-fetching them. a crawler can also monitor the kafka log to know the status of a webpage: a webpage is considered processed only once a crawler that successfully fetched it has logged it (a retry-topic sketch follows).
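
a sketch of the kafka retry-topic idea, using the kafka-python client; the topic names, retry cap, and back-off schedule are assumptions:

```python
import json
import time

from kafka import KafkaProducer  # assumed: kafka-python client

producer = KafkaProducer(value_serializer=lambda v: json.dumps(v).encode())
MAX_RETRIES = 3

def handle_fetch_failure(msg: dict) -> None:
    """On a failed fetch, republish the url to a retry topic with an incremented
    retry count and an exponential back-off timestamp; give up after MAX_RETRIES.
    Consumers of 'crawl-retry' are expected to skip messages whose not_before
    time has not yet passed."""
    retries = msg.get("retries", 0)
    if retries >= MAX_RETRIES:
        producer.send("crawl-dead-letter", value=msg)   # keep for inspection
        return
    backoff_s = 2 ** retries * 60                       # 1 min, 2 min, 4 min, ...
    producer.send("crawl-retry", value={**msg,
                                        "retries": retries + 1,
                                        "not_before": time.time() + backoff_s})
```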

when it comes to scaling:

we can assume a target like fetching 10 billion pages in under 5 days (ref. the hellointerview article). the major part of our pipeline (unless the parsing gets complex) is fetching the pages, which is an I/O-intensive task. if we take each page to be 2 MB, we need roughly 20 PB of total storage; hence we use blob/object stores for the html and text data.
apart from that, if a network-optimized instance can handle 400 Gbps = 50 GB/s (a commonly quoted figure), a single instance could in theory handle 50,000 MB/s / 2 MB = 25,000 pages/second. that sounds like a lot, but constraints like politeness, priority, rate limiting, and DNS eat into it. say we manage to utilize 30% of that, i.e., 7,500 pages/second: a single instance would then take around 15.4 days to get through 10 B pages, and 4 instances would bring that down to about 3.85 days.
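
the same back-of-the-envelope numbers, written out so they can be re-run with different assumptions:

```python
PAGES = 10_000_000_000            # 10 B pages
PAGE_SIZE_MB = 2

storage_pb = PAGES * PAGE_SIZE_MB / 1_000_000_000       # MB -> PB, ~20 PB
peak_pages_per_s = 50_000 / PAGE_SIZE_MB                 # 50 GB/s per instance -> 25,000
effective_pages_per_s = 0.30 * peak_pages_per_s          # ~30% utilization -> 7,500
days_single = PAGES / effective_pages_per_s / 86_400     # ~15.4 days on one instance
days_four = days_single / 4                              # ~3.85 days on four

print(storage_pb, peak_pages_per_s, effective_pages_per_s, days_single, days_four)
```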

all of the above are just assumptions, but making them lets us reason about what is going on and paves the way forward.
when it comes to parsing, we can scale it up to keep pace with the I/O-intensive fetching.

DNS:

  1. for DNS, third-party resolvers generally impose rate limits, and adhering to them can become a bottleneck for our crawler. hence, maintaining a DNS cache is beneficial. the cache has to be managed on its own terms: when to serve from the cache and when to do an actual look-up.
  2. we can also use multiple DNS providers and distribute the load across them, both to stay within their rate limits and to improve performance.
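
a minimal in-process DNS cache sketch; the 300-second TTL is an assumption, ideally you would honor the record's actual TTL:

```python
import socket
import time

_dns_cache: dict[str, tuple[str, float]] = {}   # domain -> (ip, expiry timestamp)
DNS_TTL_S = 300                                  # assumed cache TTL; tune per setup

def resolve(domain: str) -> str:
    """Resolve via the local cache first; fall back to a real lookup only when the
    entry is missing or expired, to stay under third-party DNS rate limits."""
    cached = _dns_cache.get(domain)
    now = time.time()
    if cached and cached[1] > now:
        return cached[0]
    ip = socket.getaddrinfo(domain, 443)[0][4][0]   # actual DNS lookup
    _dns_cache[domain] = (ip, now + DNS_TTL_S)
    return ip
```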

duplicates:

  1. the internet is full of duplicates: redirections, mirrors, plagiarized websites, etc., where the same content is available across many different webpages.
  2. hence, we need to detect whether a piece of content is already present. just comparing URLs won't help, because different urls carrying the same content is the whole problem.
  3. we can hash the content and store the hash along with the URL. if the hash is already present, we skip storing the content; otherwise we store it and record the hash in the metadata. we can put an index on the hashes in the DB itself.
  4. another approach is to use a bloom filter to check whether the content has already been seen (sketched below).
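
a minimal content-hash dedup sketch; the in-memory set stands in for the DB index or bloom filter mentioned above:

```python
import hashlib

seen_hashes: set[str] = set()   # in production: a DB index or a bloom filter

def is_duplicate(content: bytes) -> bool:
    """Hash the page content; if we have seen the hash before, skip storing it.
    A bloom filter can replace the set to save memory, at the cost of a small
    false-positive rate (occasionally skipping a genuinely new page)."""
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```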

depth:

  1. as mentioned before, DFS runs into traps, i.e., loops. but BFS has this issue too, since the same content can be disguised behind different urls. hence, we maintain a depth parameter indicating the number of pages visited from the seed before reaching the current one.
  2. keeping a threshold on the depth helps us avoid looping forever.
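
a tiny sketch of depth propagation, assuming the frontier carries (url, depth) pairs and a hypothetical MAX_DEPTH:

```python
MAX_DEPTH = 10   # assumed threshold; tune to the crawl's goals

def enqueue_links(frontier, links, parent_depth):
    """Propagate depth from parent to child and drop anything past the threshold,
    so redirect chains and disguised loops cannot trap the crawler forever."""
    if parent_depth + 1 > MAX_DEPTH:
        return
    for link in links:
        frontier.append((link, parent_depth + 1))
```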

small bits to consider:

  1. dynamic content can be handled using headless browsers like puppeteer or playwright.
  2. large content can be handled with a size threshold or by checking the Content-Length header.
  3. we can use something like datadog or new relic to monitor the health of our system. as this is an extremely distributed system, monitoring is essential.
  4. updates and re-crawls can be handled by the priority aspect of our frontier. whenever an existing page needs to be crawled again, we schedule it based on data like update frequency, popularity, etc.

again, everyone has their own approach. i have combined several of them and given options at each stage of the system, so take everything with a grain of salt. good luck!

References

  • https://www.hellointerview.com/learn/system-design/problem-breakdowns/web-crawler
  • https://bytebytego.com/courses/system-design-interview/design-a-web-crawler
  • https://youtu.be/mMSJO4SrQLI?si=5CJAKiqFTmo3RL7R