
Automated link checking

My website contains a lot of internal references, and it’s quite tedious to check that all of them are correct. I thought to myself that there must be a way to check this automatically. Not using generative AI, but using the “old-school” web-crawling method.

So I decided to try htmlproofer.

Why?

Well, it directly works with Jekyll, so it made sense to test the native solution.
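
For reference, a default run against Jekyll’s generated _site directory looks roughly like this:

# run html-proofer with all of its default checks against the built site
htmlproofer ./_site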

Without modifying anything and running all default tests, I got over 9,000 errors.

Hm.

So I decided to filter out the largest chunks of errors that weren’t relevant to me, such as:

* At ./_site/tags/ws/index.html:187:

  internal script reference /assets/js/dist/misc.min.js does not exist

The first sticking point was the parameter --ignore-files which, quoting the documentation, should be:

An array of Strings or RegExps containing file paths that are safe to ignore.

If you’ve ever worked with regular expressions, though, you might already anticipate the problem with this option: figuring out which syntax to use!

After a quick search, I learned that Ruby has its own regular expression syntax, and figuring it out wasn’t that complicated in the end.
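
As a quick sanity check (a throwaway one-liner for illustration, not part of my setup), Ruby regex literals use the /.../ syntax and can be tested straight from the shell:

# does the Ruby regex /tags.+/ match a path like the ones in the error output?
ruby -e 'puts("/tags/ws/index.html" =~ /tags.+/ ? "match" : "no match")'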

htmlproofer --assume-extension ./_site --check-external-hash --ignore-files "/tags.+/,/.+\/tabs.+/"

Also, I decided to check only for Links, instead of all of ['Links', 'Images', 'Scripts'].
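
If I read the html-proofer 4.x docs correctly, that narrowing is done with the --checks option, so a sketch of such a command (flag name taken from the docs, not from my original invocation) would be:

# restrict html-proofer to link checking only
htmlproofer --checks 'Links' ./_site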

However, there were still so many cases I needed to filter out that I decided to abandon the idea of static site checking. Although htmlproofer does accept a URL of a live server, the scan isn’t recursive, and I couldn’t find a way to make it so.

Then I found muffet which proved to be the tool I was looking for! After a bit of tinkering with the configuration, I got the following:

docker run --network host raviqqe/muffet --color=always --skip-tls-verification --buffer-size=32768 http://localhost:4000

which immediately gave me what I was looking for and found real broken links on my website. This is my final setup:

docker run --network host raviqqe/muffet --color=always --skip-tls-verification --buffer-size=65536 --ignore-fragments --accepted-status-codes='200,201,202,300,301,403' --timeout=60 --max-connections=4 http://localhost:4000 

The crucial factor for the stability of the output seemed to be the HTTP connection pool size, controlled by --max-connections. A value of 4 is fast and reliable enough; the default of 512 was too much.

Muffet is now part of my build pipeline, and the automation has already helped and relieved me tremendously.
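
For illustration, here is a minimal sketch of what such a pipeline step can look like, assuming the site is served locally on port 4000 (the serve/wait/cleanup steps are illustrative assumptions, not my exact setup):

# serve the built site in the background
bundle exec jekyll serve --detach --port 4000
sleep 5  # crude wait for the server to come up

# crawl the site; muffet exits non-zero when it finds broken links
docker run --network host raviqqe/muffet \
  --color=always --skip-tls-verification --buffer-size=65536 \
  --ignore-fragments --accepted-status-codes='200,201,202,300,301,403' \
  --timeout=60 --max-connections=4 \
  http://localhost:4000
status=$?

pkill -f jekyll  # stop the background server
exit $status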

