
Filter/remove invalid XML characters from stream

Developer https://www.devze.com 2023-01-06 19:43 Source: Internet
First things first: I cannot change the output of the XML; it is being produced by a third party. They are inserting invalid characters in the XML. I am given an InputStream of the byte-stream representation of the XML. Is there a cleaner way to filter out the offending characters besides consuming the stream into a String and processing it? I found this: using a FilterReader, but that doesn't work for me as I have a byte stream and not a character stream.

For what it's worth, this is all part of a JAXB unmarshalling procedure, just in case that offers options.

We aren't willing to toss the whole stream if it has bad characters. We have decided to remove them and carry on.

Here is a FilterReader I tried to build.

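import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Log and LogFactory are assumed to come from Apache Commons Logging.
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class InvalidXMLCharacterFilterReader extends FilterReader {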

    private static final Log LOG = LogFactory
    .getLog(InvalidXMLCharacterFilterReader.class);

    public InvalidXMLCharacterFilterReader(Reader in) {
        super(in);
    }

    public int read() throws IOException {
        char[] buf = new char[1];
        int result = read(buf, 0, 1);
        if (result == -1)
            return -1;
        else
            return (int) buf[0];
    }

    public int read(char[] buf, int from, int len) throws IOException {
        int count = 0;
        while (count == 0) {
            count = in.read(buf, from, len);
            if (count == -1)
                return -1;

            int last = from;
            for (int i = from; i < from + count; i++) {
                LOG.debug("" + (char)buf[i]);
                if(!isBadXMLChar(buf[i])) {
                    buf[last++] = buf[i];
                }
            }

            count = last - from;
        }
        return count;
    }

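    private boolean isBadXMLChar(char c) {
        // A char is a single UTF-16 code unit, so it can never be >= 0x10000.
        // Code points above 0xFFFF arrive as surrogate pairs; let surrogates
        // through so that those characters are not stripped.
        if ((c == 0x9) ||
            (c == 0xA) ||
            (c == 0xD) ||
            ((c >= 0x20) && (c <= 0xD7FF)) ||
            ((c >= 0xE000) && (c <= 0xFFFD)) ||
            Character.isSurrogate(c)) {
            return false;
        }
        return true;
    }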

}

And here is how I am unmarshalling it:

jaxbContext = JAXBContext.newInstance(MyObj.class);
Unmarshaller unMarshaller = jaxbContext.createUnmarshaller();
Reader r = new InvalidXMLCharacterFilterReader(new BufferedReader(new InputStreamReader(is, "UTF-8")));
MyObj obj = (MyObj) unMarshaller.unmarshal(r);

And some example bad XML:

<?xml version="1.0" encoding="UTF-8" ?>
<foo>
    bar&#x01;
</foo>


In order to do this with a filter, the filter needs to be aware of XML character references, because (at least in your example, and likely sometimes in actual use) the bad characters appear in the XML as references such as &#x01; rather than as raw characters.

The filter sees your reference as a sequence of 6 perfectly acceptable characters (&, #, x, 0, 1, ;) and thus does not strip them.

The conversion that breaks JAXB happens later in the process, when the XML parser expands the reference into the actual character U+0001.
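
As a rough illustration of what a reference-aware filter could look like, here is a minimal sketch. The class name CharRefFilterReader and its helpers are hypothetical, not part of the original code. It assumes that dropping the whole numeric reference is acceptable, that references have at most 7 digits (longer ones pass through untouched), and that the input has no CDATA sections, where &#x01; would be literal text rather than a reference.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;

// Hypothetical sketch: drops raw invalid XML characters and numeric
// character references (&#N; / &#xH;) that point at invalid code points.
class CharRefFilterReader extends FilterReader {
    // Characters of a candidate reference that are waiting to be delivered.
    private final StringBuilder pending = new StringBuilder();
    private int pendingPos = 0;

    CharRefFilterReader(Reader in) {
        super(new PushbackReader(in, 1)); // one char of lookahead is enough
    }

    // The Char production from the XML 1.0 specification.
    static boolean isValidXmlCodePoint(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Reads one char; keeps it in pending if allowed, otherwise pushes it back.
    private boolean accept(PushbackReader src, String allowed) throws IOException {
        int c = src.read();
        if (c == -1) return false;
        if (allowed.indexOf(c) >= 0) {
            pending.append((char) c);
            return true;
        }
        src.unread(c);
        return false;
    }

    @Override
    public int read() throws IOException {
        PushbackReader src = (PushbackReader) in;
        while (true) {
            if (pendingPos < pending.length()) return pending.charAt(pendingPos++);
            pending.setLength(0);
            pendingPos = 0;

            int c = src.read();
            if (c == -1) return -1;
            if (c != '&') {
                // Surrogates are passed through so that pairs survive;
                // unpaired surrogates are not detected by this sketch.
                if (Character.isSurrogate((char) c) || isValidXmlCodePoint(c)) return c;
                continue; // drop a raw invalid character
            }
            pending.append('&');
            if (!accept(src, "#")) continue; // named entity or bare '&': pass through
            boolean hex = accept(src, "x");
            String digits = hex ? "0123456789abcdefABCDEF" : "0123456789";
            int start = pending.length();
            while (pending.length() - start < 7 && accept(src, digits)) { /* collect digits */ }
            if (pending.length() == start || !accept(src, ";")) continue; // malformed: pass through
            int cp = Integer.parseInt(pending.substring(start, pending.length() - 1), hex ? 16 : 10);
            if (!isValidXmlCodePoint(cp)) pending.setLength(0); // drop the whole reference
        }
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) break;
            buf[off + n++] = (char) c;
        }
        return (n == 0 && len > 0) ? -1 : n;
    }
}
```

In the unmarshalling code from the question, this reader would presumably take the place of InvalidXMLCharacterFilterReader, wrapped around the same BufferedReader and InputStreamReader. Reading one character at a time keeps the sketch short; a production version would likely work on whole buffers.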
