Hi. I'm trying to download the following from a website:
http://site.com/features/X/a.html
http://site.com/features/X/b.html
http://site.com/features/X/c.html
http://site.com/features/Y/d.html
http://site.com/features/Z/e.html
etc.
I also want to download the embedded images and all pages these link to, but without following links outside site.com/features/*
The wget man page led me to try this:
Quote:
wget -r -l2 http://site.com/features
But this simply gets site.com/features/index.html and site.com/robots.txt. Their robots.txt disallows several areas of the site, but not features/.
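Rereading the man page, --no-parent and --include-directories (-I) look like they might keep the recursion inside one directory tree. I haven't tested this yet, and the depth is a guess (index.html -> X/ -> a.html would need at least 2 hops from the start page), but I was thinking of something like:
Quote:
wget -r -l3 -p --no-parent --include-directories=/features http://site.com/features/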
Another thing I tried: I have a text file listing
http://site.com/features/X/
http://site.com/features/Y/
http://site.com/features/Z/
etc.
But using -i list.txt still just gets each directory's index.html and nothing below it.
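I'm guessing the list file needs -r as well, so wget actually recurses from each directory instead of just saving its index page, plus the same directory restriction. Something like this, again untested:
Quote:
wget -r -l1 -p --no-parent --include-directories=/features -i list.txt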
Finally, I have an HTML file which links to one file in each of /features/*/, but if I use recursive downloading like this:
Quote:
wget -x -r -l2 --convert-links -p -Dsite.com pagewithlinks.html
Then it tries to download masses of stuff from elsewhere on the site, and using -Dsite.com/features causes only pagewithlinks.html to be downloaded.
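If I'm reading the man page right, -D only accepts domain names (it's meant for host spanning with -H), which would explain why putting a path in it kills the recursion entirely. So maybe the directory part belongs in -I instead, something like:
Quote:
wget -x -r -l2 --convert-links -p -Dsite.com --include-directories=/features pagewithlinks.html
But I haven't verified that this actually stays inside /features/.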
I'm betting the last approach is the most promising. Anyone know how to restrict recursive downloading to links within site.com/somedirectory/* ?