Downloading an entire website on a Mac using wget
I recently had to take a copy of a client’s website before they transferred from another provider. It was running an old copy of Joomla, and getting backend access proved difficult. So we opted to grab a static copy of the site and keep that live until we had their new WordPress website ready.
There are plenty of apps out there that will download whole websites for you, but the simplest way is to use wget. If you don’t have a copy, you can install wget on a Mac without using MacPorts or Homebrew using this guide from OS X Daily.
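If you do already run Homebrew, mind, it’s a one-liner and you can skip the guide:

brew install wget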
Once it’s installed, open Terminal and type:
wget --help
You’ll see there are a ton of options. At its simplest, you can just type:
wget example.com
That will download a copy of the index page of example.com to whichever directory you’re calling wget from in Terminal. But I wanted a copy of the whole website, and I wanted it to work locally, i.e. using relative URLs, rather than referring back to the live example.com on the web.
So here’s the code:
wget --recursive \
  --no-clobber \
  --page-requisites \
  --html-extension \
  --convert-links \
  --restrict-file-names=windows \
  --random-wait \
  --domains example.com \
  --no-parent \
  www.example.com
Let’s step through the options used:
--recursive
Recursively download the directories, up to wget’s default maximum depth of five levels.
--no-clobber
Can also use “-nc”. Stops the same files being downloaded from the server more than once.
--page-requisites
Causes wget to download all the files that are necessary to properly display a given HTML page, including such things as inlined images, sounds, and referenced stylesheets.
--html-extension
Saves HTML files with a .html extension. Handy for converting PHP-based sites, such as the Joomla one I needed to copy. (Newer versions of wget call this option --adjust-extension.)
--convert-links
After the download is complete, converts the links in the documents to make them suitable for local viewing.
--restrict-file-names=windows
Escapes characters in filenames so they’re safe on your local system.
--random-wait
Varies the wait between requests, so we don’t look quite so much like we’re downloading the whole site…
--domains example.com
Restricts the crawl to the listed domain, so wget doesn’t follow links off to other sites.
--no-parent
Do not ever ascend to the parent directory when retrieving recursively. (The trailing www.example.com isn’t part of this option; it’s the URL wget starts crawling from.)
After all that you’re left with a folder that should be a complete copy of the domain you’ve targeted. Very handy.
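If you want to sanity-check the copy before pointing anything at it, serve the folder with any local static file server. Assuming you have Python 3 installed (most Macs with the developer tools do), something like this does the job:

cd www.example.com
python3 -m http.server 8000

Then browse to http://localhost:8000 and click around to check the links behave.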
However, typing all that out each time is a bit of a pain. A bash script taking the domain as an input would save the effort; maybe it could even be wrapped up into an app using Appify. Hmm, one for the to-do list. A rough starting point is sketched below.
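Something like this, assuming the site lives at www.<domain> as in the example above (the filename mirror-site.sh is just a placeholder):

#!/usr/bin/env bash
# mirror-site.sh — grab a static copy of a site with wget (rough sketch).
# Usage: ./mirror-site.sh example.com
set -euo pipefail

if [ "$#" -ne 1 ]; then
  echo "Usage: $0 domain" >&2
  exit 1
fi

domain="$1"

wget --recursive \
  --no-clobber \
  --page-requisites \
  --html-extension \
  --convert-links \
  --restrict-file-names=windows \
  --random-wait \
  --domains "$domain" \
  --no-parent \
  "www.$domain" # drop the www. if the site doesn't use it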