Text Processing - Detecting if you are inside an HTML tag in Java

开发者 https://www.devze.com 2023-02-22 22:49 Source: web

I have a program that does text processing on an HTML-formatted document, based on information from the same document with the HTML stripped out. Basically, I locate a word or phrase in the unformatted document, then find the corresponding word in the formatted document and alter its appearance using HTML tags to make it stand out (e.g. bold it or change its color).

Here is my problem. Occasionally, I want to format a word or phrase that might also appear inside an HTML tag (for example, I might want to format the word "font", but only where it is a word in the text and not part of a tag). Is there an easy way to detect whether a string in a block of text is part of an HTML tag or not?

By the way, I can't just strip the HTML tags out of the document and process the remaining text, because I need to preserve the HTML in the result. I need to add to the existing HTML, so I need to reliably distinguish between strings that are part of tags and strings that are not.

Any ideas?

Thank you,

Elliott

You could do a few things:

Write a regular expression for what you're doing. There are plenty of prewritten ones you can find on Google.
Find a library to parse the document (e.g., http://htmlparser.sourceforge.net/) and only replace text in the text nodes it gives you.

The first is likely to be the fastest and easiest, but the second will be more reliable.
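As a rough illustration of the first option, here is a minimal sketch that splits the input into tags and the text between them, and wraps the word in <b> only inside the text pieces. The class and method names are made up for this example, and it assumes tags never contain a literal '>' (so comments, scripts, and unusual attribute values can break it). A phrase that is interrupted by a tag will not be matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: highlights a word only where it appears outside tags.
class TextOnlyHighlighter {
    // One tag: '<', then any run of characters that are not '>', then '>'.
    // Assumption: no literal '>' inside attribute values, comments, or scripts.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String highlight(String html, String word) {
        // Match the word literally, on word boundaries.
        Pattern target = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        StringBuilder out = new StringBuilder();
        Matcher tags = TAG.matcher(html);
        int last = 0;
        while (tags.find()) {
            // Text between the previous tag and this one: safe to modify.
            out.append(bold(html.substring(last, tags.start()), target));
            // The tag itself: copied through untouched.
            out.append(tags.group());
            last = tags.end();
        }
        // Trailing text after the final tag.
        out.append(bold(html.substring(last), target));
        return out.toString();
    }

    private static String bold(String text, Pattern target) {
        return target.matcher(text).replaceAll("<b>$0</b>");
    }

    public static void main(String[] args) {
        String html = "<font color=\"red\">Pick a font</font>";
        // prints: <font color="red">Pick a <b>font</b></font>
        System.out.println(highlight(html, "font"));
    }
}
```

Only the "font" in the visible text is wrapped; the tag names are passed through unchanged.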

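Use the following regex to detect whether a string contains HTML tags: "<.*?>" (the angle brackets need no escaping, and "\<" is not a valid escape sequence in a Java string literal, so it would not compile).

And here you can learn how to effectively use regex in your Java code. Happy coding ;)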
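A regex like this only says that tags are present. To answer the original question for one specific match position, a simple backward scan may be enough. The sketch below is an assumption-laden illustration, not a full HTML tokenizer: the class and method names are invented, and it treats every '<' as the start of a tag, so an unescaped '<' in the text or a '>' inside an attribute value will mislead it.

```java
// Hypothetical helper: decides whether a character index lies inside a tag.
class TagPosition {
    // True when the nearest '<' at or before index has no '>' between it
    // and index, i.e. the position sits inside "<...>" markup.
    static boolean isInsideTag(String html, int index) {
        int lastOpen = html.lastIndexOf('<', index);
        if (lastOpen < 0) {
            return false; // no tag has been opened before this position
        }
        // Look for a '>' strictly before index, so that the closing '>'
        // itself still counts as part of the tag.
        int lastClose = html.lastIndexOf('>', index - 1);
        return lastClose < lastOpen;
    }

    public static void main(String[] args) {
        String html = "<font size=\"2\">font</font>";
        int inTag = html.indexOf("font");             // the tag name
        int inText = html.indexOf("font", inTag + 1); // the visible word
        System.out.println(isInsideTag(html, inTag));  // prints: true
        System.out.println(isInsideTag(html, inText)); // prints: false
    }
}
```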

If you have parsed the document into a DOM, which you should have if you are doing this correctly, then ask the current node for the parent tag that contains it, and keep doing that until you reach the tag you are looking for or run out of parents.
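Walking up the parents could look roughly like the sketch below. It uses the JDK's built-in XML parser purely for illustration, which assumes the input is well-formed XHTML; real-world HTML usually needs a tolerant HTML parser instead. The class and method names are hypothetical. Note that once the document is a DOM, visible text lives in text nodes, so a word found in a text node is never part of tag markup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: checks whether a node sits inside a given element.
class AncestorCheck {
    // Climbs from the node to the document root, looking for tagName.
    static boolean hasAncestor(Node node, String tagName) {
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (p.getNodeType() == Node.ELEMENT_NODE
                    && p.getNodeName().equalsIgnoreCase(tagName)) {
                return true;
            }
        }
        return false;
    }

    // Assumption: the input is well-formed XML (XHTML), not arbitrary HTML.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<p>plain <b>bold <i>both</i></b></p>");
        // The text node "both" is nested inside <i>, which is inside <b>.
        Node both = doc.getElementsByTagName("i").item(0).getFirstChild();
        System.out.println(hasAncestor(both, "b")); // prints: true
    }
}
```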

If you use some custom search or regex to parse HTML, then check the best answer to this question:

RegEx match open tags except XHTML self-contained tags (it has 4000+ upvotes for a reason)