Archiving a blog to HTML files
or Mirroring your site for offline browsing
First, if it's a WordPress blog you're archiving,
you should log in to the blog's admin area and go to Options -> Permalink.
Make sure the permalink structure uses postname or postnum, not ?p=postnum.
So the URL for a post will be something like:
This way wget will add .html after the postname or number when it creates the flat file, such as:
Otherwise you'd get files called
itp.nyu.edu/~netid/blogname/?p=123.html which will cause problems, because a ? in a link is normally read as the start of a query string rather than as part of the file name.
To run wget
- ssh to itp.nyu.edu
- Change directory: cd to public_html
- make a directory to put your files in: mkdir somename
- Change directory: cd to where you'd like to save your message file
Run this command (it should all be on one line, though it may wrap on screen). The first line is the general form, the second is a filled-in example:
nohup wget --mirror -w 2 -p --html-extension --no-parent --convert-links -P /path/to/save/to http://itp.nyu.edu/path/to/blog >msg 2>&1 &
nohup wget --mirror -w 2 -p --html-extension --no-parent --convert-links -P /home/ndl5/public_html/save_myblog http://itp.nyu.edu/~ndl5/myblog >msg 2>&1 &
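Since the command is long and easy to mistype, it can help to put the two parts that change into variables and print the finished line before running it. The following is only a sketch: the function name mirror_cmd, the variable names, and the example directory and URL are made up for illustration.

```shell
#!/bin/sh
# Hypothetical values: replace with your own save directory and blog URL.
SAVE_DIR="$HOME/public_html/save_myblog"
BLOG_URL="http://itp.nyu.edu/~netid/blogname"

# Print the full wget command for a save directory ($1) and a URL ($2).
# Nothing is downloaded here; this only shows the line you would run.
mirror_cmd() {
    printf 'nohup wget --mirror -w 2 -p --html-extension --no-parent --convert-links -P %s %s >msg 2>&1 &\n' "$1" "$2"
}

mirror_cmd "$SAVE_DIR" "$BLOG_URL"
```

Check the printed line for typos (for example a ; in place of the : in http://), then paste it at the prompt to start the mirror.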
What it all means
nohup: means "no hangup", so if you log out or get disconnected the command should keep running
wget: command line tool for retrieving files using HTTP, HTTPS and FTP
--mirror -w 2 -p --html-extension --no-parent --convert-links -P /path/to/save/to : wget options (see explanation below)
http://itp.nyu.edu/path/to/blog: full URL of the top level of the blog you want to convert to HTML or the site you want to mirror
>msg: puts wget's messages in a file called msg, in the directory you run the command from
2>&1: puts error messages in the same file (msg)
&: at end means run in background, so you can do other things
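To see what the redirection part does without waiting for a real mirror, you can try it on a harmless command. This is a small sketch, assuming a stand-in command in place of wget; nohup is left out because nothing here runs long enough to need it.

```shell
cd "$(mktemp -d)"     # scratch directory so the demo leaves your files alone

# Stand-in for wget: one line on standard output, one on standard error.
sh -c 'echo "normal message"; echo "error message" >&2' >msg 2>&1 &

wait                  # demo only: pause until the background job has finished
cat msg               # both lines ended up in msg
```

The order matters: >msg has to come before 2>&1, so that the error stream is pointed at the file and not at the screen.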
Note: Because it's running in the background, you may need to check on the process from time to time.
- You can type 'tail msg' to see the last 10 lines of the message file
- You can take a look at what files have been created in the directory you're saving to
- You can check that wget is still running, or tell it to stop running, by following the instructions on dealing with a Runaway Process
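As a rough illustration of checking on and stopping a background job (this is a generic sketch, not the Runaway Process instructions themselves), the example below uses sleep as a stand-in for wget so it is safe to try.

```shell
sleep 60 &                 # stand-in for the long-running wget job
pid=$!                     # process ID of the job just started

# kill -0 sends no signal; it only reports whether the process exists.
if kill -0 "$pid" 2>/dev/null; then
    echo "process $pid is still running"
fi

kill "$pid"                        # ask the process to stop (SIGTERM)
wait "$pid" 2>/dev/null || true    # let the shell collect the finished job

if ! kill -0 "$pid" 2>/dev/null; then
    echo "process $pid has stopped"
fi
```

If you have logged out and back in, $! is no longer available. In that case ps -u "$USER" lists your processes with their IDs, and you can use the ID shown next to wget with kill.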
Explanation of wget options
Mostly taken from this page:
--mirror: Specifies to mirror the site. Wget will recursively follow all links on the site and download all necessary files. It will also only get files that have changed since the last mirror, which is handy in that it saves download time.
-w: Tells wget to wait or pause between requests, in this case for 2 seconds. This is not necessary, but is the considerate thing to do. It reduces the frequency of requests to the server, thus keeping the load down. If you are in a hurry to get the mirror done, you may eliminate this option.
-p: Causes wget to get all required elements for the page to load correctly. Apparently, the mirror option does not always guarantee that all images and peripheral files will be downloaded, so I add this for good measure.
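--html-extension: Downloaded HTML pages whose names do not already end in .html or .htm get .html added to the local file name. This gives pages generated by CGI, ASP or PHP an HTML extension for consistency. Newer versions of wget call this option --adjust-extension and still accept the older name.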
--convert-links: All links are converted so they will work when you browse locally. Otherwise, relative (or absolute) links would not necessarily load the right pages, and style sheets could break as well.
-P (prefix folder): The resulting tree will be placed in this folder. This is handy for keeping different copies of the same site, or keeping a browsable copy separate from a mirrored copy.
--no-parent: The simplest, and often very useful, way of limiting directories is disallowing retrieval of links that refer to the hierarchy above the beginning directory, i.e. disallowing ascent to the parent directory or directories. Here it keeps wget inside the blog's own directory instead of wandering into the rest of the site.
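For reference, the short options used above also have long names (-w is --wait, -p is --page-requisites, -P is --directory-prefix), which can make the command easier to read. Here is the same command sketched with one option per line; the save directory and URL are placeholders, and the last line only prints the command rather than running it.

```shell
# Hypothetical save directory and URL; substitute your own.
set -- wget \
    --mirror \
    --wait=2 \
    --page-requisites \
    --html-extension \
    --no-parent \
    --convert-links \
    --directory-prefix="$HOME/public_html/somename" \
    "http://itp.nyu.edu/path/to/blog"

# Dry run: show the command. To actually run it, replace this line with: "$@"
echo "$@"
```

Add nohup at the front and >msg 2>&1 & at the end, as in the one-line version, if you want it to keep running in the background after you log out.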