I am reading an HTML file through an InputStream in a Java servlet. The content I read back is in a different format from the original, although both look the same when displayed in a web browser. These are the links to the two HTML files. Output after reading: http://www.fileflyer.com/view/gQREGAe Original output: http://www.fileflyer.com/view/mWXHVAE Is there a way to get the original HTML when reading, and why is this happening? My Java code is as follows:
InputStreamReader isr = new InputStreamReader(inputStream);
BufferedReader br = new BufferedReader(isr);
String line = null;
while ( (line = br.readLine()) != null)
{
System.out.println(line);
}
Any help would be greatly appreciated!!
Thank you, rana.
The one in a different format (the one named extracted.html) is clearly generated by Microsoft Word.
<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns="http://www.w3.org/TR/REC-html40">
Your problem lies in the source of the InputStream, not on the Java or Servlet side. They certainly do not randomly change the content of the InputStream without your intervention.
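One way to check this yourself is to copy the stream byte for byte and compare the result with the original file. The following is only a minimal sketch: the class name StreamCopyCheck and the helper readAll are made up for illustration, and a ByteArrayInputStream stands in for the servlet's InputStream. Note that it deliberately avoids readLine() and println(), since readLine() drops the line terminators and println() writes the platform's own, which can alter line endings even though the markup itself is untouched.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StreamCopyCheck {

    // Hypothetical helper: reads every byte from the stream without decoding it to text.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real file; in a servlet this would be the incoming InputStream.
        byte[] original = "<html>\r\n<body>caf\u00e9</body>\r\n</html>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] copy = readAll(new ByteArrayInputStream(original));
        // Prints true: the bytes that come out are the bytes that went in.
        System.out.println(Arrays.equals(original, copy));
    }
}
```

If a comparison like this reports a difference for your real files, the two files were already different before Java read them.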
You seem to be using MS Word as an HTML editor. You should not do that; it is not meant for that. Use a text-based editor such as Notepad, Notepad++, EditPlus, etc. for HTML editing instead.
I have looked at both HTML files. extracted.html obviously contains more tags, comments, and CSS information than you seem to be interested in. So the only option you are left with is to use one of the parsers below and remove the unnecessary nodes/attributes you don't require (or just extract what you need):
- Mozilla html parser
- HTML Parser
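To give a rough idea of what "removing the unnecessary nodes/attributes" means, here is a small sketch that uses plain regular expressions from the JDK instead of one of the parsers above. This is a swapped-in technique for illustration only: regexes are fragile on real-world HTML and a proper parser is the better choice. The class name WordHtmlCleaner and the three patterns are assumptions, based only on the kind of Word markup shown earlier (xmlns declarations, Office-namespaced tags, conditional comments).

```java
import java.util.regex.Pattern;

public class WordHtmlCleaner {

    // Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    private static final Pattern CONDITIONAL_COMMENT =
            Pattern.compile("<!--\\[if[^\\]]*\\]>.*?<!\\[endif\\]-->", Pattern.DOTALL);

    // Namespace declarations such as xmlns:o="..." or xmlns="..."
    private static final Pattern XMLNS_ATTR =
            Pattern.compile("\\s+xmlns(:\\w+)?=\"[^\"]*\"");

    // Opening and closing tags in the v:, o:, w: or m: namespaces, e.g. <o:p> and </o:p>
    private static final Pattern OFFICE_TAG =
            Pattern.compile("</?[ovwm]:[^>]*>");

    // Hypothetical helper: strips the three kinds of Word-specific markup listed above.
    static String clean(String html) {
        String result = CONDITIONAL_COMMENT.matcher(html).replaceAll("");
        result = XMLNS_ATTR.matcher(result).replaceAll("");
        result = OFFICE_TAG.matcher(result).replaceAll("");
        return result;
    }

    public static void main(String[] args) {
        String wordHtml = "<html xmlns:o=\"urn:schemas-microsoft-com:office:office\""
                + " xmlns=\"http://www.w3.org/TR/REC-html40\"><body>"
                + "<!--[if gte mso 9]><xml>x</xml><![endif]-->"
                + "<p>Hi<o:p></o:p></p></body></html>";
        // Prints <html><body><p>Hi</p></body></html>
        System.out.println(clean(wordHtml));
    }
}
```

This only covers the patterns listed in the comments; Word also emits inline mso-* styles and class attributes that this sketch leaves alone.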