I need to transform an HTML file by removing certain tags from it. To do this, I have something like the following:
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;
import java.io.File;
import java.io.IOException;

public class TestJsoup {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        String url = args[0];
        Document doc;
        if (url.startsWith("http")) {
            // Remote page: fetch and parse over HTTP
            doc = Jsoup.connect(url).get();
        } else {
            // Local file: with a null charset, jsoup detects it from the meta tag or falls back to UTF-8
            File f = new File(url);
            doc = Jsoup.parse(f, null);
        }
        /* remove some tags */
        doc.outputSettings().escapeMode(Entities.EscapeMode.extended);
        System.out.println(doc.html());
    }
}
The issue with the above code is that when I use the extended escape mode, the HTML tag attributes in the output are HTML-encoded. Is there any way to avoid this? Using the base or xhtml escape mode doesn't work, because some of the non-standard extended entities (like &rsquo;) cause problems. For example, for the HTML below,
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>Test®</title>
</head>
<body style="background-color:#EDEDED;">
<P>
<font style="color:#003698; font-weight:bold;">Testing HTML encoding - ’ © with a <a href="http://www.google.com">link</a>
</font>
<br />
</P>
</body>
</html>
The output I get is,
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>

<title>Test®</title>

</head>

<body style="background-color:#EDEDED;">

<p>
 <font style="color:#003698; font-weight:bold;">Testing HTML encoding - ’ © with a <a href="http://www.google.com">link</a></font> <br />
</p>




</body>
</html>
Is there any way to get around this issue?
What output character set are you using? (It defaults to the input charset, which, if you are loading from URLs, will vary from site to site.)
You probably want to set it explicitly, either to UTF-8, or to ASCII or some other restricted charset if you are working with systems that cannot deal with UTF-8. If you set the escape mode to base (the default) and the charset to ascii, then any character (like rsquo) that cannot be represented natively in the selected charset will be output as a numerical escape.
For example:
String check = "<p>’ <a href='../'>Check</a></p>";
Document doc = Jsoup.parse(check);
doc.outputSettings().escapeMode(Entities.EscapeMode.base); // default
doc.outputSettings().charset("UTF-8");
System.out.println("UTF-8: " + doc.body().html());
doc.outputSettings().charset("ASCII");
System.out.println("ASCII: " + doc.body().html());
Gives:
UTF-8: <p>’ <a href="../">Check</a></p>
ASCII: <p>&#8217; <a href="../">Check</a></p>
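To make the rule concrete, here is a minimal sketch of the idea using only the Java standard library. It is not jsoup's actual implementation, and the names `NumericEscapeSketch` and `escapeFor` are made up for illustration. It assumes the simplest possible policy: any code point the target charset cannot encode is written as a decimal numeric escape, and everything else passes through unchanged.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class NumericEscapeSketch {
    // Hypothetical helper (not part of jsoup): replaces every code point that
    // the named charset cannot encode with a decimal escape such as &#8217;.
    public static String escapeFor(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s);                            // representable: keep as-is
            } else {
                out.append("&#").append(cp).append(';');  // not representable: numeric escape
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "\u2019 Check"; // U+2019 is the right single quote (rsquo)
        System.out.println("UTF-8: " + escapeFor(text, "UTF-8"));
        System.out.println("ASCII: " + escapeFor(text, "US-ASCII"));
    }
}
```

With US-ASCII the quote comes out as &#8217;, while with UTF-8 it is left untouched, which is the same trade-off as choosing the output charset in jsoup.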
Hope this helps!