I recently learned how to use wget and thought I’d capture some of my thoughts and notes.
From the wget home page: “GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X support, etc.”
A great summary of wget usage can be found in Downloading without a Browser (LinuxGazette). This was the first place I learned about wget.
I’ve used it on Linux and Windows, and it’s available for many other platforms as well. Wget is usually included in Linux distributions; for Windows, see Heiko Herold’s Windows wget page. (There is also a Windows GUI wrapper for wget, but I have not tried it yet.)
Some simple usage examples:
- wget url
  - Grab a single web page from ‘url’.
- wget -p url
  - Grab a single web page from ‘url’, along with all of the content it references (images, CSS, etc.).
- wget -m url -nv -a ~/home.log
  - Mirror the site at ‘url’, writing a non-verbose log to ~/home.log.
The “-p” option is particularly useful for grabbing a copy of a complete web page. It comes in handy when you are trying to understand how a site (or page) was constructed. By using wget (instead of, say, saving from your web browser), you get the exact content the server delivered, including the referenced images, CSS files, etc.
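For example, to grab a single page with its requisites and rewrite the links so the local copy can be viewed offline, something like this should work (the “-k” convert-links option and the example URL are my own additions, not part of the commands above):

- wget -p -k http://www.example.com/somepage.html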
The “-m” option turns on all of the mirroring options. You can run wget in this manner to mirror an existing site (with a nightly update, for example). There are some additional options which can help throttle the HTTP requests so as to not impact the web server. I suspect it is only appropriate, and in good form, to mirror your own sites or those for which you already have permission. Brute-force copying of entire sites has probably helped label wget as a bad web robot or spider. By default, wget will obey robots.txt if it is present on the target system. This can be overridden if needed, but again, we don’t want to use wget against the wishes of a target web site.
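As a rough sketch of a polite nightly mirror, using the “-w” wait and “--limit-rate” throttling options (the wait time, rate limit, log file name, and URL here are just placeholders I chose for illustration):

- wget -m -w 2 --limit-rate=50k -nv -a ~/mirror.log http://www.example.com/

If you really do need to ignore robots.txt on a site you control, I believe adding “-e robots=off” will do it, but use that with care.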
Wget does have a problem with dynamic URLs. If the URL contains characters such as ‘?’ or ‘&’ (which is fairly common), wget can run into trouble when it tries to save those pages to the local filesystem. There may be wget options that address this, but I have not pursued it yet.
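One workaround I’ve seen (but have not tested thoroughly) is to quote the URL so the shell doesn’t interpret ‘?’ and ‘&’, and to add “--restrict-file-names=windows” so wget replaces characters that aren’t legal in local file names; the URL below is just a made-up example:

- wget --restrict-file-names=windows "http://www.example.com/page.php?id=1&view=full"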
Some other options for copying single files or mirroring entire sites include curl and sitecopy.
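For comparison, fetching a single file with curl looks something like this (the “-O” option saves the file under its remote name; the URL is a placeholder):

- curl -O http://www.example.com/file.tar.gz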