How to build a web crawler in C#

What is a web crawler?

A web crawler, often referred to as a spider, is a bot used to browse the world wide web and discover web pages and their content, typically for the purpose of indexing. Search engines rely on this process to learn which pages and sites exist so that they can be retrieved quickly in response to a query.

The terms crawler and bot both refer to software that accesses web pages automatically. With the data obtained by these bots, search engines are able to provide relevant results for users’ queries.

This is the most typical use case for this type of application. However, similar methods can be used for other tasks, such as routinely obtaining information from a specific page or ensuring that your content is not being used elsewhere without your consent.

Designing a web crawler

The following outlines the logical flow of a web crawler:

  • Root or seed URLs
    • The crawler needs somewhere to start; this is provided by a seed file containing one or more known URLs from which the crawler can begin. These are added to a queue. 
  • URL queue
    • This is a list of URLs that are waiting to be crawled. For this particular application, only URLs that have never been visited should find their way into the queue. 
  • The crawl
    • The top item is taken from the queue and the web page is obtained. From here, its data could be processed or stored in some way.
  • URL parsing and filtering
    • The web page is scouted for other links for our crawler to explore, but before they are added to the queue they are vetted to see if they have already been crawled or if they are already in the queue.
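
Put together, the whole flow boils down to a single loop. Here is a minimal sketch in pseudocode-style C#; the names are placeholders for the real classes built later in this post:

while (queue.HasURLs)
{
    string url = queue.Top;                  // take the next URL waiting to be crawled
    string html = Fetch(url);                // download the page
    Process(html);                           // index, store or otherwise use the content

    foreach (string link in ExtractLinks(html))
        if (!crawled.Contains(link) && !queue.Contains(link))
            queue.Add(link);                 // only brand new URLs join the queue

    queue.Remove(url);
    crawled.Add(url);                        // this URL will not be visited again
}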

Design considerations

Our web crawler will only add URLs to the queue if they have not already been crawled. In other words, a URL will not be crawled again once it has been processed. This would be bad news for most search engines, as there would be no way of knowing how up to date the retrieved information is. The design of our crawler could be improved by recording a last crawled date for each URL. Instead of discarding URLs that have already been crawled, we could re-queue them once their last crawl date is older than a chosen threshold.
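
As a rough sketch (not part of the crawler built below), the Crawled file could store a timestamp alongside each URL, for example one "url|lastCrawledUtc" pair per line, and a check like the following could decide whether a URL is due another visit:

// Hypothetical helper on the Crawled class; assumes each line is "url|lastCrawledUtc".
public bool IsDueForRecrawl(string url, TimeSpan maxAge)
{
    foreach (string line in File.ReadAllLines(path))
    {
        string[] parts = line.Split('|');

        if (parts[0] == url.Trim().ToLower())
            return DateTime.UtcNow - DateTime.Parse(parts[1]) > maxAge;
    }

    return true; // never crawled before, so it is certainly due
}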

Some web pages are more important than others. For instance, a crawler may prioritise domains over subdomains, or favour particular sites such as news sites. There is no such priority filter in place for our crawler; however, you may choose to implement something along these lines.
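
As a purely illustrative sketch (none of these rules exist in the crawler built below), a priority score could be calculated for each URL before it is queued and used to decide crawl order:

// Hypothetical scoring function: higher scores would be crawled first.
static int GetPriority(string url)
{
    var uri = new Uri(url);
    int score = 0;

    if (uri.Host.Split('.').Length <= 2)  // bare domain rather than a subdomain
        score += 10;

    if (uri.Host.Contains("news"))        // naive example of favouring news sites
        score += 5;

    return score;
}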

For debugging purposes, the Seed, Queue and Crawled lists are wiped every time the application is started. You may want to alter this so that the application can be stopped and then pick up where it left off when started again.
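
A minimal sketch of that change, applied to the Crawled class shown later in this post, would be to create the backing file only when it does not already exist:

public Crawled(string path)
{
    this.path = path;

    // Keep any previously crawled URLs so a restart can pick up where it left off.
    if (!File.Exists(path))
        File.Create(path).Close();
}

The same idea applies to the Queue file, so that any URLs still waiting when the application stopped are crawled on the next run.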

Building the web crawler

We will be enlisting the help of a brilliant HTML parser called Html Agility Pack (installed via the HtmlAgilityPack NuGet package). This library will make light work of extracting all of the links from an HTML page.

The application itself is a C# console application with two main methods, Initialize and Crawl. Note that the code below uses features such as target-typed new expressions and the async file APIs, so it assumes a reasonably modern .NET (for example .NET 5 or later):

class Program
{
    private static Seed seed;
    private static Queue queue;
    private static Crawled crawled;

    static async Task Main(string[] args)
    {
        Initialize();

        await Crawl();
    }
}

The Initialize method instantiates three classes that represent three lists, Seed, Queue and Crawled. In this example, these lists are simply text files, however, database storage may also be considered. 

static void Initialize()
{
    string path = Directory.GetCurrentDirectory();
    string seedPath = Path.Combine(path, "Seed.txt");
    string queuePath = Path.Combine(path, "Queue.txt");
    string crawledPath = Path.Combine(path, "Crawled.txt");

    seed = new(seedPath);
    var seedURLs = seed.Items;
    queue = new(queuePath, seedURLs);
    crawled = new(crawledPath);
}

Our Seed class represents our list of seed or root URLs. As the class gets constructed, it uses the path passed into it to create a new file and writes our seed URL. In this case, our root URL is another blog post of mine, detailing how to build a book barcode scanner. You should definitely check it out. There is also a property that returns the full list of seed URLs. 

class Seed
{
    /// <summary>
    /// Returns all seed URLs.
    /// </summary>
    public string[] Items
    {
        get => File.ReadAllLines(path);
    }

    private readonly string path;

    public Seed(string path)
    {
        this.path = path;

        string[] seedURLs = new string[]
        {
            "https://trystanwilcock.com/2021/03/31/how-to-build-a-book-barcode-scanner-in-blazor/"
        };

        using StreamWriter file = File.CreateText(path);

        foreach (string url in seedURLs)
            file.WriteLine(url.ToCleanURL());
    }
}

Our Queue class represents our list of URLs that are waiting to be crawled. The constructor accepts a list of our seed URLs and will write these to a file on construction. 

Our class contains a number of properties to retrieve the first item in the queue, all items in the queue and an indicator as to whether there are URLs in the queue. 

There are also methods to add an item to the queue, remove an item from the queue and determine whether an item is already in the queue. 

class Queue
{
    /// <summary>
    /// Returns the first item in the queue.
    /// </summary>
    public string Top
    {
        get => File.ReadAllLines(path).First();
    }

    /// <summary>
    /// Returns all items in the queue.
    /// </summary>
    public string[] All
    {
        get => File.ReadAllLines(path);
    }

    /// <summary>
    /// Returns a value based on whether there are URLs in the queue.
    /// </summary>
    public bool HasURLs
    {
        get => File.ReadAllLines(path).Length > 0;
    }

    private readonly string path;

    public Queue(string path, string[] seedURLs)
    {
        this.path = path;

        using StreamWriter file = File.CreateText(path);

        foreach (string url in seedURLs)
            file.WriteLine(url.ToCleanURL());
    }

    public async Task Add(string url)
    {
        using StreamWriter file = new(path, append: true);

        await file.WriteLineAsync(url.ToCleanURL());
    }

    public async Task Remove(string url)
    {
        IEnumerable<string> filteredURLs = All.Where(u => u != url);

        await File.WriteAllLinesAsync(path, filteredURLs);
    }

    public bool IsInQueue(string url) => All.Any(u => u == url.ToCleanURL());
}

Our Crawled class represents our list of URLs that have already been crawled. This file will be recreated on construction, or its contents cleared if the file already existed. There are a couple of methods here we need, one that determines whether a URL has been crawled and another to add an item to the list. 

class Crawled
{
    private readonly string path;

    public Crawled(string path)
    {
        this.path = path;
        File.Create(path).Close();
    }

    public bool HasBeenCrawled(string url) => File.ReadAllLines(path).Any(c => c == url.ToCleanURL());

    public async Task Add(string url)
    {
        using StreamWriter file = new(path, append: true);

        await file.WriteLineAsync(url.ToCleanURL());
    }
}

Note that whenever anything is written to our three lists, the item being written is sanitised using an extension method aptly named ToCleanURL. This lives in a string extensions class, as seen below:

static class StringExtensions
{
    public static string ToCleanURL(this string str) => str.Trim().ToLower();
}
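
Trimming and lowercasing is deliberately lightweight. If you wanted stricter de-duplication, a hypothetical extra extension (not used by the crawler in this post) could also strip fragments and trailing slashes using System.Uri:

// Hypothetical, stricter alternative: also removes the #fragment and any trailing slash.
public static string ToNormalizedURL(this string str)
{
    var uri = new Uri(str.Trim());

    return uri.GetLeftPart(UriPartial.Query)  // scheme, host, path and query only
              .TrimEnd('/')
              .ToLowerInvariant();
}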

Once Initialize has been called and everything that needs to be has been constructed, the Crawl method is called, which consists of a loop that will iterate so long as there are URLs waiting in the queue:

static async Task Crawl()
{
    do
    {
        string url = queue.Top;

        Crawl crawl = new(url);
        await crawl.Start();

        if (crawl.parsedURLs.Count > 0)
            await ProcessURLs(crawl.parsedURLs);

        await PostCrawl(url);

    } while (queue.HasURLs);
}

The first URL in the queue is grabbed and passed to a new instance of a Crawl object. We then call Start, which initiates the crawl of that particular URL.

The Crawl class has three members: one for the URL of the web page currently being crawled, one for the web page content (all the downloaded HTML for that page) and one for the list of URLs it finds on that page. 

class Crawl
{
    public readonly string url;
    private string webPage;
    public List<string> parsedURLs;

    public Crawl(string url)
    {
        this.url = url;
        webPage = null;
        parsedURLs = new List<string>();
    }

    public async Task Start()
    {
        await GetWebPage();

        if (!string.IsNullOrWhiteSpace(webPage))
        {
            ParseContent();
            ParseURLs();
        }
    }

    public async Task GetWebPage()
    {
        using HttpClient client = new();

        client.Timeout = TimeSpan.FromSeconds(60);

        try
        {
            string responseBody = await client.GetStringAsync(url);

            if (!string.IsNullOrWhiteSpace(responseBody))
                webPage = responseBody;
        }
        catch (Exception)
        {
            // A failed or timed-out request shouldn't crash the whole crawl;
            // webPage stays null and this URL is simply skipped.
        }
    }

    public void ParseURLs()
    {
        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(webPage);

        // SelectNodes returns null (not an empty collection) when no anchors are found.
        HtmlNodeCollection links = htmlDoc.DocumentNode.SelectNodes("//a[@href]");

        if (links == null)
            return;

        foreach (HtmlNode link in links)
        {
            string hrefValue = link.GetAttributeValue("href", string.Empty);

            if (hrefValue.StartsWith("http"))
                parsedURLs.Add(hrefValue);
        }
    }

    public void ParseContent()
    {
        // You may want to process or parse elements of the web page here.
        // Html Agility Pack may also be useful for something like this!
    }
}

When the crawl starts, HttpClient is used to obtain the contents of the web page. Microsoft recommends this class, as WebClient and HttpWebRequest are both now obsolete. Note that we set a timeout so the crawler never spends too long on any one page.
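
One detail worth noting: the code above creates a new HttpClient for every page. That is fine for a small crawl, but HttpClient is designed to be shared, so a longer-running crawler would normally hold a single static instance instead. A minimal sketch of that change (not part of the original class) looks like this:

// Declared once on the Crawl class, shared by every page request.
private static readonly HttpClient client = new()
{
    Timeout = TimeSpan.FromSeconds(60)
};

// GetWebPage then simply reuses the shared client.
public async Task GetWebPage()
{
    try
    {
        string responseBody = await client.GetStringAsync(url);

        if (!string.IsNullOrWhiteSpace(responseBody))
            webPage = responseBody;
    }
    catch (Exception)
    {
        // Skip pages that fail to download or time out.
    }
}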

If content is retrieved successfully, there are two subsequent method calls. One is a call to an empty method named ParseContent. This may be where the contents of the page are processed or saved to a database. If you were a search engine, you’d probably index the information somehow.
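
As a small illustration of what ParseContent could do (a sketch only; the method is intentionally left empty in this post), Html Agility Pack can pull out the page title just as easily as it pulls out links:

public void ParseContent()
{
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(webPage);

    // Grab the page title, if the page has one, and write it to the console.
    HtmlNode titleNode = htmlDoc.DocumentNode.SelectSingleNode("//title");
    string title = titleNode?.InnerText.Trim() ?? "(no title)";

    Console.WriteLine($"{url}: {title}");
}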

The ParseURLs method makes use of Html Agility Pack to obtain every anchor node with an href attribute from the retrieved HTML. Only absolute links (those starting with http) are kept.
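
If you wanted to follow relative links such as /about as well, a small hypothetical tweak inside the foreach loop could resolve them against the page being crawled:

string hrefValue = link.GetAttributeValue("href", string.Empty);

// Hypothetical change: resolve relative links against the current page's URL.
if (Uri.TryCreate(new Uri(url), hrefValue, out Uri absolute) &&
    (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
{
    parsedURLs.Add(absolute.ToString());
}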

Going back to the Crawl method of our main program, if any URLs have been retrieved from the web page, they are passed to another method named ProcessURLs:

static async Task ProcessURLs(List<string> urls)
{
    foreach (var url in urls)
    {
        if (!crawled.HasBeenCrawled(url) && !queue.IsInQueue(url))
            await queue.Add(url);
    }
}

This method iterates through each URL and determines whether it has already been crawled or is already in the queue to be crawled. If neither, the URL is added to the queue. 

Lastly, the PostCrawl method is called which simply removes the current URL from the queue (as it’s already been processed by now) and adds it to the list of crawled URLs.

static async Task PostCrawl(string url)
{
    await queue.Remove(url);

    await crawled.Add(url);
}

I hope that this post has demonstrated how web crawling can be achieved. Although this particular example was written as a C# console application, the concepts can be taken over to any language and applied in the same way. 

I certainly had fun building this application and will look to improve it in the future. That being said, any comments or suggestions are welcome below.
