Automated link checking
My website contains a lot of internal references, and it’s quite tedious to check that all of them are correct. So I thought to myself that there must be a way to check this automatically. Not with generative AI, but with the old-school web-crawling method.
So I decided to try htmlproofer.
Why?
Well, it works directly with Jekyll, so it made sense to test the native solution first.
Without modifying anything and running all default tests, I got over 9,000 errors.
Hm.
So I decided to debug the largest chunks that weren’t relevant to me, such as:
```
* At ./_site/tags/ws/index.html:187:

  internal script reference /assets/js/dist/misc.min.js does not exist
```
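With thousands of errors, the first step was figuring out which missing references dominate the output. One way to triage it (a hypothetical grep/uniq pipeline, assuming the error format shown above) would be:

```sh
# Hypothetical triage: count which missing references appear most often
htmlproofer ./_site 2>&1 \
  | grep -o 'reference [^ ]* does not exist' \
  | sort | uniq -c | sort -rn | head
```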
The first sticking point was the --ignore-files parameter, which, quoting the documentation, takes: “An array of Strings or RegExps containing file paths that are safe to ignore.”
If you’ve ever worked with regular expressions, though, you might already anticipate the problem with this option: figuring out which syntax to use!
After a quick search, I learned that Ruby has its own regular-expression syntax, and figuring it out wasn’t that complicated in the end.
```sh
htmlproofer --assume-extension ./_site --check-external-hash --ignore-files ["/tags.+/", "/.+\/tabs.+/"]
```
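One quick way to sanity-check a pattern before handing it to htmlproofer is to evaluate it with Ruby itself (a minimal sketch; the path is just an example taken from the errors above):

```sh
# Does the Ruby regex tags.+ match a generated tag page path?
ruby -e 'puts Regexp.new("tags.+").match?("./_site/tags/ws/index.html")'
# => true
```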
Also, I decided to check only Links, instead of all of ['Links', 'Images', 'Scripts'].
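On the command line, restricting the checks looks roughly like this (an assumption on my part: --checks is the htmlproofer 4.x spelling of this option, and older versions name it differently):

```sh
# Run only the Links check, skipping Images and Scripts
htmlproofer --checks Links ./_site
```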
However, there were still so many cases I needed to filter out that I decided to abandon the idea of static site checking. Although htmlproofer does accept a URL of a live server, the scan isn’t recursive, and I couldn’t find a way to make it so.
Then I found muffet, which proved to be the tool I was looking for! After a bit of tinkering with the configuration, I got the following:
```sh
docker run --network host raviqqe/muffet --color=always --skip-tls-verification --buffer-size=32768 http://localhost:4000
```
which immediately did what I wanted and surfaced real broken links on my website. This is my final setup:
```sh
docker run --network host raviqqe/muffet --color=always --skip-tls-verification \
  --buffer-size=65536 --ignore-fragments \
  --accepted-status-codes='200,201,202,300,301,403' \
  --timeout=60 --max-connections=4 http://localhost:4000
```
The crucial factor for the stability of the output seemed to be the size of the HTTP connection pool, controlled by --max-connections. A value of 4 is fast and reliable enough; the default of 512 was too much.
Muffet is now part of my build pipeline, and the automation has already relieved me of a tremendous amount of tedious manual checking.
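For reference, a pipeline step could look roughly like this (a sketch, not my exact setup: the port, the sleep, and the use of jekyll serve --detach are assumptions):

```sh
#!/usr/bin/env sh
set -e
# Serve the freshly built site in the background
bundle exec jekyll serve --detach --port 4000
trap 'pkill -f jekyll' EXIT   # stop the server when the script exits
sleep 5                       # give the server a moment to come up
# Fail the build if muffet finds broken links (it exits non-zero)
docker run --network host raviqqe/muffet --skip-tls-verification \
  --buffer-size=65536 --ignore-fragments \
  --accepted-status-codes='200,201,202,300,301,403' \
  --timeout=60 --max-connections=4 http://localhost:4000
```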