Exporting WordPress as a static HTML site

Why static?

WordPress makes it really convenient to publish to the web. It’s flexible enough to be a blog or an e-commerce site. But for a purely content-oriented site, it has some downsides:

  • Performance: if the content is static, there is no need to generate each page dynamically on every view – hitting the database and running PHP.
  • Security: WordPress is essentially a program exposed on the public web. It is possible to run it securely, but one more thing running is one more risk – default behaviours like auto-upgrade effectively give it the ability to rewrite itself. Oh, and you must have a MySQL database running too.

Wouldn’t it be easier to just export the whole thing as plain old HTML files? The basic idea:

  1. Run WordPress on a non-public host or your own PC.
  2. Compose your content (posts, photos, etc) as normal.
  3. Export the site as static files.
  4. Upload the files to a public host. There is no need to run PHP, a database etc. there any more – any web server can serve files. GitHub and Amazon S3 can also serve static files as a website, in which case you don’t even need your own host and web server.
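
To make this concrete, here is a rough sketch of what steps 3 and 4 can look like as a shell script. This is only an outline under assumed names – LOCALHOST, PUBLIC_DOMAIN and EXPORT_DIR are placeholders, the S3 upload is just one possible final step, and the wget and sed commands are explained in detail later in this post:

#!/bin/bash
# Hypothetical end-to-end export script; names and paths are placeholders.
LOCALHOST=localhost            # where the private WordPress runs
PUBLIC_DOMAIN=mydomain.com     # where the static files get published
EXPORT_DIR=./static-export

mkdir -p "$EXPORT_DIR" && cd "$EXPORT_DIR"

# Step 3: crawl the local WordPress site into static files
wget --no-host-directories --recursive --page-requisites --no-parent \
     --timestamping "http://$LOCALHOST"

# Fix up the links in the exported HTML (see “Replacing URL / hyperlinks” below)
find . -name '*.html' -exec sed --in-place "s/https*:\/\/$LOCALHOST\//\//g" {} \;

# Step 4: upload to a public host – e.g. an S3 bucket set up as a static website
aws s3 sync . "s3://$PUBLIC_DOMAIN" --delete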

So I did some research… but first, I considered two other related approaches:

Caching Plug-ins

  • These serve a cached copy of each page instead of generating it on every view, which helps with performance.
  • They still require WordPress to be running, so this approach is out of the question anyway, but…
  • In one sense it should be the simplest way: just enable a plug-in and leave everything else as-is.
  • Though I have also heard that getting it exactly right takes some effort (e.g. content changes but the cache doesn’t refresh).

Static Template Engines, e.g. Jekyll

  • No WordPress in the picture – write posts as plain text (markdown) files, then run a tool that combines them with web layout templates (similar to WordPress themes) to generate static web pages. These can then be uploaded to the public server.
  • A clean approach: static output is what these engines are designed for, so it feels right; static output is not an intended use of WordPress, so it can feel like going against the grain.
  • But the main problem is losing WordPress’s convenience at the composing stage – in particular for photo and media management. For a mainly text-based blog, though, it would be a viable choice.

So, sticking with WordPress, how can we export it as a static site?

Export Static Plug-ins

  • At first glance, these fit the bill exactly. But…
  • When I first started this blog, WP Static HTML Output was the only game in town, but it had some fairly worrying reviews. It seemed to work for some people, but not always. I didn’t feel confident relying on it.
  • In late 2015 came Simply Static. It has been updated regularly and seems to have good reviews so far (as of November 2016), which looks promising. However, by the time I found out about it (after my site went live, of course) I had already come up with my own solution (described later), so I haven’t had a chance to try it.
  • Regardless of which one, a concern is that the export task could be fairly processing-intensive and time-consuming, particularly for a large site – probably not a suitable job for a PHP script.

Web Crawlers

Stand-alone tools that crawl from a given URL, following all the links and downloading all the pages and files, effectively producing a static copy of the whole website.

There seem to be two available choices:

httrack

  • Initially I picked this, as its stated features like “local cache” and “concurrent downloads” sounded attractive from a performance point of view.
  • But it turned out the speed was unacceptable – a fresh crawl of my early site, with just 25 posts, took 20 minutes; re-crawling with no changes (when its “cache” should be able to speed things up) still took 3 minutes.
  • It has both GUI and command-line versions; I wouldn’t use the GUI anyway, as I intend to run it from a script.

wget

  • The standard UNIX command-line tool for downloading files – it can actually be used as a crawler too.
  • Most importantly, it’s amazingly quick compared to httrack – a fresh crawl took just a few minutes.
  • I also felt more confident relying on a standard tool like wget for my workflow; httrack is less well known, and who knows when it might become an abandoned project.

So I decided to go with wget.

This is the command (a recursive crawl that also fetches page requisites like images and CSS; --no-parent stops it ascending above the start URL, --timestamping skips files unchanged since the last crawl, and --no-host-directories avoids creating a top-level directory named after the host):

wget --no-host-directories --recursive --page-requisites --no-parent --timestamping http://$LOCALHOST

…but not so fast. Running wget alone does give “a static site”, but a few things need to be sorted out.

Problems and solutions for static export (using wget)

Dynamic content

It’s stating the obvious – a static site means you cannot have any content that requires server-side processing. Search and comments are the obvious ones, but some minor features are easy to overlook.

Generally, the idea is to replace server-side dynamic content with client-side JavaScript and third-party server handling.

Site search

Replace the default WordPress site search with Google Custom Search. This lets you create a custom search engine that searches your site only. As both the search itself and the result display are handled by Google, it does not require any processing on your server. I will describe how to set up Custom Search in a separate post.
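
For reference, the embed snippet Google provides looks something like the below – the exact code comes from the Custom Search control panel, and YOUR_ENGINE_ID is a placeholder for your own engine’s ID:

<script async src="https://cse.google.com/cse.js?cx=YOUR_ENGINE_ID"></script>
<div class="gcse-search"></div>

Dropping this into a page (or a WordPress text widget) renders a search box whose queries and results are served entirely by Google.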

Comments

The most common replacement is Disqus. As with search, this lets you off-load the whole comment-handling part elsewhere, so there’s no hit on your server. But I have heard some poor reviews of Disqus, so I’m not sure it’s the best solution. There should be other alternatives, but currently I don’t use comments, so I simply disabled them.

Rotating banner images

WordPress has a built-in feature that lets you upload several top banner images and displays a randomly chosen one on each page load – so this feature also depends on server-side processing. In a static export, each page captures only the one specific banner that was chosen when it was exported, i.e. it’s fixed. On the static site, you no longer get the effect of a different banner showing each time you reload a page.

Having said that, this is probably an old feature now – these days most people use a slider included in their theme instead. Anyway, the code below shows the basic idea: specify the list of banner images, then use JavaScript to randomly pick one and set it on the display element:

var headerList = [          // the pool of banner images to rotate through
  "/resources/imgA.jpg",
  "/resources/imgB.jpg"
];
var headerImg;

// Pick one image at random from the list
function setRandomHeaderImage() {
  var num = Math.floor(Math.random() * headerList.length);
  headerImg = headerList[num];
}

// Start fetching the chosen image immediately, before the DOM is ready
function preloadHeaderImage() {
  var img = new Image();
  img.src = headerImg;
}

// Set the chosen image as the background of the header element
function displayHeaderImage() {
  var mh = document.getElementById("masthead");
  mh.style.backgroundImage = "url('" + headerImg + "')";
}

setRandomHeaderImage();
preloadHeaderImage();

// Apply it once the DOM is ready (the #masthead element exists by then)
jQuery(document).ready(function() {
  displayHeaderImage();
});

wget not using permalinks, outputting index.html?p=123 instead

All SEO articles – or even common sense – will tell you that the WordPress default query-string URL like http://myhost/?p=123 is no good compared to the more informative permalink URL such as http://myhost/my-new-article/. My site was configured to use permalinks, but somehow wget still created links with the query-string URL.

After some debugging, I found that despite the permalink config, WordPress outputs a “shortlink” in the HTML, and wget used that. So the solution is to remove this action (in the child theme’s functions.php):

remove_action('wp_head', 'wp_shortlink_wp_head', 10, 0);
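
A quick way to check the fix worked – fetch a page from the local site and look for the shortlink; assuming the site runs at http://localhost, this should now output nothing:

curl -s http://localhost/ | grep shortlink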

Replacing URL / hyperlinks

Avoid wget option --convert-links

This may seem like the right option to use for crawling, but it’s not at all desirable:

  • It changes all links to relative (see the later section on why that is bad)
  • It adds “index.html” to all the links (again, see the later section on why)
  • It is not clever enough to convert all links anyway – if a URL appears in a form action or a srcset attribute, wget misses it. So you end up with an inconsistent conversion!

So just avoid this option and have wget leave all the links as they are. We can run a simple custom script to convert them instead, as shown below.

Just to reiterate the expected setup:

  • WordPress site running on a local PC (or some private domain), so your site address would be http://localhost
  • Exported static files will be published to a public domain, e.g. http://mydomain.com

As explained above, we want wget to leave links as they are. We can then run a search-and-replace of (for example) “localhost” with “mydomain.com” in all the exported HTML files.

For me, I opted to replace it with just “/”, i.e. making all my links “root-relative”. Example command:

find . -name '*.html' -exec sed --in-place "s/https*:\/\/localhost\//\//g" {} \;

Why root-relative? A digression on relative and absolute URLs

Just to digress a bit – WordPress made the debatable design decision to use absolute URLs everywhere.

URL type        Link to some-post from the front page        Link to the category page from some-post
Absolute        http://mydomain.com/category-a/some-post/    http://mydomain.com/category-a/
Root-relative   /category-a/some-post/                       /category-a/
Relative        category-a/some-post/                        ../category-a/

Understandably, relative URLs are brittle: if you change the location of a page, all the links “relative to it” could break. So that option is definitely out.

Many people argue for absolute URLs because of this downside of relative URLs, but that’s not picking the right battle – the comparison should be against root-relative.

For root-relative:

  • Not brittle like relative URLs, because pages are not located relative to one another – everything is anchored at the site root.
  • Easy to change domain and environment (professional developers know to keep dev, staging etc. separate from the live environment, right?), because the domain part (http://mydomain.com) is inferred. Compare with absolute URLs, where you need a search-and-replace every time.
  • Doesn’t it look cleaner without the domain duplicated in every absolute URL? Do we really need to mention the same domain name in every link?

Anyway, since I am going static, I don’t have to agonise over this WordPress design – the replace command above changes everything to root-relative.

One caveat when replacing with root-relative URLs (instead of the public domain) is that the canonical link needs special care – it is a non-displayed HTML element, but it is important for SEO because it uniquely identifies a page, so it should be an absolute URL. Normally it looks like <link rel="canonical" href="http://mydomain.com/some-page/" />, so of course it’s no good if the href becomes just “/some-page/”!

The workaround is to remove the default WordPress function that outputs the canonical link (remove_action('wp_head', 'rel_canonical');), then add back a custom implementation that writes the absolute URL with the live domain. Since that link then doesn’t contain your local domain, the link-replace script leaves it alone.
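
As a rough sketch of that workaround (the function name is made up, http://mydomain.com stands in for your live domain, and this again goes in the child theme’s functions.php):

// Remove WordPress's default canonical link output...
remove_action('wp_head', 'rel_canonical');

// ...and add back a custom version that always writes the live domain.
function my_live_rel_canonical() {
    if (!is_singular()) {
        return;  // the default rel_canonical also outputs only on single pages
    }
    // get_permalink() returns the local absolute URL (http://localhost/some-page/);
    // keep only the path, then prepend the live domain.
    $path = wp_make_link_relative(get_permalink());
    echo '<link rel="canonical" href="http://mydomain.com' . esc_url($path) . '" />' . "\n";
}
add_action('wp_head', 'my_live_rel_canonical');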

Of course, if you are not fussed and simply keep absolute links everywhere, the above problem doesn’t apply.

Also, for the record, it is sort of possible to bend WordPress into using root-relative URLs. This SO answer explains a workaround: http://stackoverflow.com/questions/17187437/relative-urls-in-wordpress. I tried it, and it works – mostly, but not always. For example, if you change the default upload location (/wp-content/uploads/) to a custom location, the workaround fails.

What about “index.html”?

With a permalink like /category-a/some-post/, wget exports the corresponding static page as an index.html file inside the two-level directory /category-a/some-post/. That’s because anything ending with “/” must be a folder on the local filesystem; a file needs a proper name.

Exporting the page under the file name “index.html” is fine, but does that mean all the links also need it, like href="/category-a/some-post/index.html"? That looks ugly!

By not using wget’s --convert-links option, we keep “index.html” out of the links – a URL like /category-a/some-post/ still works, because web servers serve a directory’s index.html by default when the URL ends with “/”.

srcset images missing

OK, the custom script that converts URLs and links handles the ones inside srcset attributes properly too. The problem is that wget doesn’t recognise them as links: it downloads the particular image file in img src=, but misses all the other image sizes in srcset=. The nice responsive-image functionality is broken!

Until wget gets an upgrade to handle srcset, I use a custom workaround – simply rsync-ing the files from the WordPress upload directory to the static export site’s upload directory, e.g.:

# -a archive, -u skip files newer at the destination; --delete removes files no longer present; --modify-window=1 tolerates 1-second timestamp differences
rsync -auv --modify-window=1 --delete --progress -i ${WP_ROOT_DIR}${UPLOAD_DIR}/ ${STATIC_EXPORT_DIR}${UPLOAD_DIR}