I have a collection of HTML files that I gathered from a website using wget. Each file name is of the form details.php?id=100419&cid=13%0D, where the id and cid vary. Portions of the HTML files contain articles in an Asian language (Unicode text). My intention is to extract the Asian-language text only. Dumping the rendered HTML with a command-line browser is the first step I have thought of, since it will eliminate some of the frills.
The problem is that I cannot dump the rendered HTML to a file (using, say, w3m -dump). The dumping works only if I point the browser (at the command line) at the properly formed URL: http://<blah-blah>/<filename>. But that way I would have to spend time downloading the files from the web all over again. How do I get around this, and what other tools could I use?
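One workaround to try is renaming the saved files to plain names first, so that nothing in the name (the "?", "&", or the trailing %0D) can confuse w3m or any other tool. A minimal sketch, assuming the files sit in the current directory and all begin with "details.php?id="; the touch line just creates a stand-in file for demonstration:

```shell
# Create a stand-in file with the problematic name (demonstration only).
touch 'details.php?id=100419&cid=13%0D'

# Rename every downloaded file to "<id>.html".
for f in details.php*; do
  id=${f#*id=}     # drop everything up to and including the first "id="
  id=${id%%&*}     # cut at the first "&", keeping only the id value
  mv -- "$f" "$id.html"
done
```

After the rename, w3m -dump 100419.html may then work, since the argument no longer looks like a malformed URL.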
w3m -dump <filename>
complains saying:
w3m: Can't load details.php?id=100419&cid=13%0D.
file <filename>
shows:
details.php?id=100419&cid=13%0D: Non-ISO extended-ASCII HTML document text, with very long lines, with CRLF, CR, LF, NEL line terminators
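For the final extraction step, since the goal is only the Asian-language text, one rough approach is to filter the dumped text for lines containing non-ASCII characters. A minimal sketch; the printf line fabricates a tiny stand-in dump (in practice you would feed it real w3m -dump output):

```shell
# Stand-in for a dumped page: one English boilerplate line, one CJK line.
printf '%s\n' 'Home | Login | Share' '这是正文' > dump.txt

# Keep only lines that are NOT made up entirely of printable ASCII,
# i.e. lines containing at least one multibyte (e.g. CJK) character.
grep -v '^[ -~]*$' dump.txt
```

In practice that would look like: w3m -dump 100419.html | grep -v '^[ -~]*$' > article.txt. It is crude (any article line that happens to be pure ASCII is dropped too), but often good enough for isolating CJK text.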