With the help of tucuxi from the existing post "Java remove HTML from String without regular expressions", I have built a method that strips basic HTML tags from a string. Sometimes, however, the original string contains HTML hexadecimal character references like &#x00E9; (which is an accented e, é). I have started to add functionality that translates these escaped characters into real characters.
You're probably asking: why not use regular expressions, or a third-party library? Unfortunately I cannot: I am developing for the BlackBerry platform, which does not support regular expressions, and I have never managed to add a third-party library to my project.
So, I have gotten to the point where any &#x00E9; is replaced with "e". My question now is: how do I add an actual accented e to a string?
Here is my code:
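public static String removeHTML(String synopsis) {
    char[] cs = synopsis.toCharArray();
    String sb = "";
    boolean tag = false;
    for (int i = 0; i < cs.length; i++) {
        switch (cs[i]) {
        case '<':
            // Start of a tag: skip everything up to the closing '>'.
            tag = true;
            break;
        case '>':
            if (tag) {
                tag = false;
            } else {
                sb += cs[i]; // a stray '>' outside a tag is ordinary text
            }
            break;
        case '&':
            if (tag) {
                break; // ignore anything inside a tag
            }
            // Compare the next 7 characters with the reference, without reading past the end.
            if (i + 7 <= cs.length && new String(cs, i, 7).equals("&#x00E9")) {
                sb += "e";
                i += 7; // lands on the trailing ';', which the loop's i++ then skips
            } else {
                sb += cs[i]; // not a reference we know: keep the '&'
            }
            break;
        default:
            if (!tag) {
                sb += cs[i];
            }
        }
    }
    return sb;
}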
Thanks!
Java Strings are Unicode.
sb += '\u00E9'; // lowercase e with an acute accent (é)
sb += '\u00C9'; // uppercase E with an acute accent (É)
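As a small illustration of the lines above, the sketch below shows that the Unicode escape, a hexadecimal cast and a decimal cast all denote the same character. The class name AccentedCharDemo and the helper appendAccentedE are made up for this example and are not part of the original code.

```java
// Sketch only: AccentedCharDemo and appendAccentedE are hypothetical names.
final class AccentedCharDemo {

    // Appends a lowercase e with an acute accent (U+00E9) to the given text.
    static String appendAccentedE(String sb) {
        sb += '\u00E9';
        return sb;
    }

    public static void main(String[] args) {
        char fromEscape = '\u00E9';   // Unicode escape
        char fromHex = (char) 0xE9;   // same code point written in hexadecimal
        char fromDecimal = (char) 233; // same code point written in decimal

        System.out.println(fromEscape == fromHex);   // true
        System.out.println(fromHex == fromDecimal);  // true

        String word = appendAccentedE("caf");
        System.out.println(word.length());           // 4
    }
}
```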
You can print out just about any character you like in Java as it uses the Unicode character set.
To find the character you want take a look at the charts here:
http://www.unicode.org/charts/
In the Latin-1 Supplement document you'll see the Unicode numbers for the accented characters. You should see the hex number 00E9 listed for é, for example. The numbers for the common Latin accented characters are in this document, so you should find it pretty useful.
To use a character in a String, just write the Unicode escape sequence: \u followed by the character code, like so:
System.out.print("Let's go to the caf\u00E9");
Would produce: "Let's go to the café"
Depending on which version of Java you're using, you might also find a StringBuilder (or a StringBuffer if you're multi-threaded) more efficient than using the + operator to concatenate Strings.
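Putting these points together, here is a minimal sketch of a tag stripper that builds its result in a StringBuffer and decodes numeric character references (hexadecimal such as &#x00E9; and decimal such as &#233;) without regular expressions. It assumes that StringBuilder is not available on the BlackBerry class library, which is why StringBuffer is used. The names HtmlStripper and decodeReference are invented for illustration. Named entities such as &eacute; and code points above 0xFFFF are deliberately not handled and are left in the text unchanged.

```java
// Sketch only: HtmlStripper and decodeReference are hypothetical names, not from the original post.
final class HtmlStripper {

    // Strips tags and decodes numeric character references into real characters.
    static String removeHTML(String synopsis) {
        StringBuffer sb = new StringBuffer(synopsis.length());
        boolean tag = false;
        int n = synopsis.length();
        for (int i = 0; i < n; i++) {
            char c = synopsis.charAt(i);
            if (tag) {
                if (c == '>') {
                    tag = false; // end of the tag; nothing inside it is copied
                }
            } else if (c == '<') {
                tag = true;
            } else if (c == '&') {
                int end = synopsis.indexOf(';', i);
                int decoded = (end > i) ? decodeReference(synopsis.substring(i + 1, end)) : -1;
                if (decoded >= 0) {
                    sb.append((char) decoded);
                    i = end; // the loop's i++ moves past the ';'
                } else {
                    sb.append(c); // not a numeric reference: keep the '&' as text
                }
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Takes the text between '&' and ';' (for example "#x00E9" or "#233").
    // Returns the character code, or -1 if the text is not a numeric reference.
    static int decodeReference(String body) {
        if (body.length() < 2 || body.charAt(0) != '#') {
            return -1;
        }
        try {
            boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
            int value = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1), 10);
            return (value >= 0 && value <= 0xFFFF) ? value : -1;
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

Parsing the digits means new references need no extra branches, whereas comparing against one fixed string per character requires a new case for every accented letter.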
Try this:
if (result.equals("&#x00E9")) {
    sb += (char) 233; // 233 (0xE9) is the Unicode code point of é
}
instead of
if (result.equals("&#x00E9")) {
    sb += "e";
}
The thing is that you're not adding an accent on top of the 'e' character; the accented e is a separate character altogether. This site lists the character codes.
For a table of accented characters in Java, take a look at this reference.
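To make the "separate character" point concrete, the sketch below compares the single character é (U+00E9) with a plain 'e', and also with a plain 'e' followed by the combining acute accent U+0301, which is a different, two-character way of writing the same letter. The class and constant names are made up for this example.

```java
// Sketch only: PrecomposedVsCombining and its constants are hypothetical names.
final class PrecomposedVsCombining {

    // One character: e with an acute accent.
    static final String PRECOMPOSED = "\u00E9";

    // Two characters: a plain 'e' followed by a combining acute accent.
    static final String COMBINING = "e\u0301";

    public static void main(String[] args) {
        System.out.println(PRECOMPOSED.length());           // 1
        System.out.println(COMBINING.length());             // 2
        System.out.println(PRECOMPOSED.equals("e"));        // false
        System.out.println(PRECOMPOSED.equals(COMBINING));  // false
    }
}
```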
To decode the HTML part, use StringEscapeUtils from Apache Commons Lang:
import org.apache.commons.lang.StringEscapeUtils;
...
String withCharacters = StringEscapeUtils.unescapeHtml(yourString);
See also this Stack Overflow thread: Replace HTML codes with equivalent characters in Java