I have a collection of HTML files that I gathered from a website using wget. Each file name is of the form details.php?id=100419&cid=13%0D, where the id and cid vary. Portions of the HTML files contain articles in an Asian language (Unicode text). My intention is to extract the Asian-language text only. Dumping the rendered HTML with a command-line browser is the first step I have thought of, since it will eliminate some of the frills.
The problem is that I cannot dump the rendered HTML to a file (using, say, w3m -dump). The dumping works only if I point the browser (at the command line) at the properly formed URL: http://<blah-blah>/<filename>. But that way I would have to spend time downloading the files from the web all over again. How do I get around this, and what other tools could I use?
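One workaround to try is renaming the saved files to plain names first, so that nothing in the name (the "?", "&", or the trailing %0D) can confuse w3m or any other tool. A minimal sketch, assuming the files sit in the current directory and all begin with "details.php?id="; the touch line just creates a stand-in file for demonstration:

```shell
# Create a stand-in file with the problematic name (demonstration only).
touch 'details.php?id=100419&cid=13%0D'

# Rename every downloaded file to "<id>.html".
for f in details.php*; do
  id=${f#*id=}     # drop everything up to and including the first "id="
  id=${id%%&*}     # cut at the first "&", keeping only the id value
  mv -- "$f" "$id.html"
done
```

After the rename, w3m -dump 100419.html may then work, since the argument no longer looks like a malformed URL.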
w3m -dump <filename>
complains saying:
w3m: Can't load details.php?id=100419&cid=13%0D.
file <filename>
shows:
details.php?id=100419&cid=13%0D: Non-ISO extended-ASCII HTML document text, with very long lines, with CRLF, CR, LF, NEL line terminators
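For the final extraction step, since the goal is only the Asian-language text, one rough approach is to filter the dumped text for lines containing non-ASCII characters. A minimal sketch; the printf line fabricates a tiny stand-in dump (in practice you would feed it real w3m -dump output):

```shell
# Stand-in for a dumped page: one English boilerplate line, one CJK line.
printf '%s\n' 'Home | Login | Share' '这是正文' > dump.txt

# Keep only lines that are NOT made up entirely of printable ASCII,
# i.e. lines containing at least one multibyte (e.g. CJK) character.
grep -v '^[ -~]*$' dump.txt
```

In practice that would look like: w3m -dump 100419.html | grep -v '^[ -~]*$' > article.txt. It is crude (any article line that happens to be pure ASCII is dropped too), but often good enough for isolating CJK text.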