
wget usage

Started by January 22, 2007 09:05 AM
0 comments, last by gumpy 17 years, 10 months ago
Hi. I'm trying to download the following from a website:

http://site.com/features/X/a.html
http://site.com/features/X/b.html
http://site.com/features/X/c.html
http://site.com/features/Y/d.html
http://site.com/features/Z/e.html
etc.

I also want to download the embedded images and all links for these pages, but without following links outside site.com/features/*. The wget man page led me to try this:
wget -r -l2 http://site.com/features
But this simply gets site.com/features/index.html and site.com/robots.txt. Their robots.txt disallows several areas of the site, but not features/.

I also have a text file listing:

http://site.com/features/X/
http://site.com/features/Y/
http://site.com/features/Z/
etc.

But using -i list.txt still just downloads files called index.html. Finally, I have an HTML file which links to one file in each of /features/*/, but if I use recursive downloading like this:
wget -x -r -l2 --convert-links -p -Dsite.com pagewithlinks.html
Then it tries to download masses of stuff from elsewhere in the site, and using -Dsite.com/features causes only pagewithlinks.html to be downloaded (-D accepts domain names, not paths). I'm betting the last approach is the most promising. Does anyone know how to restrict recursive downloading to links within site.com/somedirectory/* ?
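For the list-file approach above: -i on its own only fetches the listed URLs themselves, so each directory just yields its index.html. Recursion has to be requested explicitly alongside it. A sketch, assuming list.txt contains the directory URLs shown above (not tested against the real site):

```shell
# Sketch: combine -i with recursion so each listed directory
# index is crawled rather than merely fetched.
#   -r     recursive retrieval
#   -l 2   limit recursion depth to 2
#   -np    --no-parent: don't ascend above each starting directory
#   -p     fetch page requisites (embedded images, stylesheets)
wget -r -l 2 -np -p -i list.txt
```

Because -np is applied per starting URL, each /features/X/ entry in the list stays confined to its own subtree.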
spraff.net: don't laugh, I'm still just starting...
I haven't done this in ages, but try something like:

wget -r -p -np -l 2 -k -x http://whatever.com/whatever.htm

The -np is "no parent" (--no-parent), and -k is short for --convert-links.
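On the original question of staying inside site.com/features/*: -np only prevents ascending above the start URL, so for the page-of-links approach it won't stop recursion into sibling directories. wget's -I / --include-directories option restricts recursion to the given path prefixes instead. A hedged sketch using the placeholder URLs from the post:

```shell
# Sketch, not tested against the real site:
#   -I /features   only follow links whose path begins with /features
#   -r             recursive retrieval
#   -l 2           depth limit of 2
#   -p             fetch embedded images and other page requisites
#   -k             convert links for local viewing
#   -x             force creation of a directory hierarchy
wget -r -l 2 -p -k -x -I /features http://site.com/features/
```

The -D option, by contrast, filters by domain only, which is why -Dsite.com/features didn't work as hoped.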
This space for rent.

