
How to wget a website whose URLs end with a trailing slash, and save the pages as if they had no trailing slash

Developer https://www.devze.com 2023-03-05 06:42 Source: Internet

I created a crawler with Wget for personal use.

wget -k -m -Dwww.website.com -r -q -R gif,png,jpg,jpeg,GIF,PNG,JPG,JPEG,js,rss,xml,feed,.tar.gz,.zip,rar,.rar,.php,.txt -t 1 http://www.website.com/ &

An example post URL on the website is http://www.website.com/post-one/; every post URL ends with a trailing slash.

When saved, Wget will create:

www.website.net/post-one
www.website.net/post-one/index.html

The first line is a folder, while the second line is the actual HTML file I'm looking for. The problem is that Wget creates a folder for each post, which makes the data more difficult to work with.
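To make the naming rule concrete, here is a rough sketch of how a URL ends up as a saved path. It is a simplified model for illustration only, not Wget's actual logic: `url_to_path` is a hypothetical helper, and it ignores query strings, a bare host with no path, and options such as -nH or --cut-dirs.

```shell
#!/bin/bash
# Simplified model (an assumption, not Wget's real code) of URL -> saved path.
url_to_path() {
  path=${1#*://}                               # strip the scheme, e.g. "http://"
  case $path in
    */) printf '%s\n' "${path}index.html" ;;   # trailing slash: directory plus index.html
    *)  printf '%s\n' "$path" ;;               # no trailing slash: a plain file
  esac
}
```

Calling `url_to_path http://www.website.com/post-one/` prints a path ending in post-one/index.html, which is why each post gets its own folder.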

I want Wget to create www.website.net/post-one, where post-one is the HTML file itself, and not create a folder for each post.

I've tried many ways with no luck. Using -R .html results in folders with no contents.


The wget I use supports the following directory options:

-nd, --no-directories           don't create directories.
-x,  --force-directories        force creation of directories.
-nH, --no-host-directories      don't create host directories.
     --protocol-directories     use protocol name in directories.
-P,  --directory-prefix=PREFIX  save files to PREFIX/...
     --cut-dirs=NUMBER          ignore NUMBER remote directory components.

Maybe -nd OR -P can help you.

Otherwise, a shell script can easily move the files into a single-level directory after you have downloaded everything with your existing wget command.

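#!/bin/bash
# Dry run: print the mv/rmdir commands that would flatten each post directory.
cd www.website.net || exit 1
# -mindepth 1 skips "." itself, so the top-level index.html is left alone.
find . -mindepth 1 -type d -print | while IFS= read -r d ; do
   if [[ -f "$d/index.html" ]] ; then
     echo mv "$d/index.html" "$d.html" && echo rmdir "$d"
   fi
done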

Remove the echo commands once you are sure the loop prints the commands you want to run.
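If you want the result the question asks for (a file named post-one with no extension, instead of post-one.html), the file has to be moved aside first, because a file and a directory cannot share the same name. Here is a minimal sketch under that assumption; `flatten_posts` and the `.tmp.$$` temporary name are made up for illustration, and it is worth trying on a copy of the download first.

```shell
#!/bin/bash
# Hypothetical helper: turn ROOT/post-one/index.html into the plain file ROOT/post-one.
flatten_posts() {
  # -depth lists children before their parents; -mindepth 1 skips ROOT itself.
  find "$1" -mindepth 1 -depth -type d -print | while IFS= read -r d; do
    # Only touch directories whose sole entry is index.html.
    [ -f "$d/index.html" ] || continue
    [ "$(ls -A "$d")" = "index.html" ] || continue
    tmp="$d.tmp.$$"    # temporary name next to the directory
    # Move the page aside, remove the now-empty directory, then take over its name.
    mv "$d/index.html" "$tmp" && rmdir "$d" && mv "$tmp" "$d"
  done
}
```

Call it as `flatten_posts www.website.net`. Directories that hold anything besides index.html (other pages, sub-posts that could not be flattened) are left untouched, so nothing is overwritten or lost.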

I hope this helps.

P.S. as you appear to be a new user, if you get an answer that helps you please remember to mark it as accepted, and/or give it a + (or -) as a useful answer.
