Validating XHTML websites

If you’re a web developer using XHTML, you have to make sure your pages are valid XHTML (which also means they must be well-formed XML). This is easy to do using the command-line utility xmllint:

$ xmllint --valid --noout somefile.html

This works fine for all static content. However, if you use server-side scripts to create dynamic pages, you won’t have any real files for xmllint to check. Once again, xmllint comes to the rescue, because it also accepts a URL instead of a file name:

$ xmllint --valid --noout http://example.com/dynamic/page

This one works fine if you want to check just one page. Since xmllint is just an XML validator and not a web crawler, it doesn’t interpret HTML and is not able to follow links and recursively retrieve pages.
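If you only have a handful of dynamic pages, a simple loop over a hand-written list of URLs may already be enough. Here is a minimal sketch of that idea; check_urls is a name I made up for this example, the URLs in the comment are placeholders, and it assumes your xmllint build can fetch pages over HTTP as in the command above.

```shell
# check_urls (hypothetical helper): validate every URL passed as an argument.
# Prints one status line per URL and returns non-zero if any URL failed.
check_urls() {
    failed=0
    for url in "$@"; do
        if xmllint --valid --noout "$url"; then
            echo "OK:      $url"
        else
            echo "INVALID: $url"
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}

# Example call (placeholder URLs):
# check_urls http://example.com/dynamic/page http://example.com/dynamic/other
```

The drawback is obvious: you have to keep the list up to date yourself.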

That’s where wget comes in. This nifty command-line download utility has a -r flag to recursively grab a site off the net. Be careful what you’re doing, because wget can quickly generate a lot of network traffic and can dramatically increase your server load. Oh, I don’t think I need to tell you that you shouldn’t try this on other people’s websites.
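If you are worried about the load, wget can throttle itself with its --wait and --limit-rate options. The wrapper below is only a sketch: polite_mirror is a hypothetical name, and the one-second pause and 200k rate cap are arbitrary values I picked for illustration, not recommendations.

```shell
# polite_mirror (hypothetical wrapper): recursive download with throttling.
polite_mirror() {
    # -r: recurse, -np: never ascend to the parent directory,
    # --wait=1: pause one second between requests,
    # --limit-rate=200k: cap the download speed at about 200 KB/s.
    wget -r -np --wait=1 --limit-rate=200k "$1"
}

# Example call (placeholder URL):
# polite_mirror http://example.com/subdir/
```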

To automate the task of recursively retrieving pages from a server and having xmllint check those pages for validity, I wrote a small script. This script will grab all pages from a website and run xmllint on the retrieved data. All errors will be displayed on your screen.

The script, named check-valid-xhtml, should be called with the top URL you want to check as its first and only parameter. So, if you specify http://example.com/subdir/ it will only check URLs in and below that subdirectory.

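#!/bin/bash
# bash is required: 'read -n1 -p' below is not available in plain POSIX sh.
DUMPDIR=$(mktemp -d) || exit 1
echo "Using $DUMPDIR to store files"
cd "$DUMPDIR" || exit 1
echo "Downloading files..."
# -r: recurse, -np: stay below the start URL, -nH: no host directory,
# -nv: less verbose, -E: save pages with an .html extension
wget -r -np -nH -nv -E --header="Accept: text/xml,application/xml,application/xhtml+xml,text/html" "$1"
echo "All files downloaded"
echo "Checking files..."
# Validate each downloaded page on its own; xmllint reports errors on stderr.
find . -type f -name '*.html' -print0 | xargs -0 -r -n1 xmllint --valid --noout
echo "All files checked."
read -n1 -p "Clean up? (y/n) " CLEAN_UP
echo
if [ "$CLEAN_UP" = "y" ]; then
    echo -n "Cleaning up $DUMPDIR... "
    rm -r "$DUMPDIR"
    echo "done"
else
    echo "$DUMPDIR not cleaned."
fi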
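
The script above is interactive and only prints errors. If you would rather get a number back, for example to run the check unattended, the validation step could be wrapped in a function along these lines. This is a sketch, not part of the original script: count_invalid is a hypothetical name, and it assumes file names without newlines.

```shell
# count_invalid (hypothetical helper): validate every *.html file below the
# directory given as $1 and print how many files failed validation.
count_invalid() {
    failed=0
    # Store the file list in a temporary file so the loop runs in the current
    # shell and can update $failed (piping into 'while' would use a subshell).
    list=$(mktemp) || return 1
    find "$1" -type f -name '*.html' > "$list"
    while IFS= read -r file; do
        # </dev/null keeps the validator from reading the file list on stdin.
        xmllint --valid --noout "$file" </dev/null || failed=$((failed + 1))
    done < "$list"
    rm -f "$list"
    echo "$failed"
}

# Example call, using the download directory from the script:
# [ "$(count_invalid "$DUMPDIR")" -eq 0 ] || echo "Some pages are invalid"
```

The error messages from xmllint still end up on your screen, and the printed count tells you at a glance whether anything needs fixing.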

Happy fixing! I didn’t need to, because all sites I tried this script on produced perfectly valid XHTML. Yay for me!