How to abbreviate HTML with Java?

A user enters text as HTML in a form, for example:

<p>this is my <strong>blog</strong> post, 
very <i>long</i> and written in <b>HTML</b></p>

I want to be able to output only a part of the string (for example, the first 20 characters) without breaking the HTML structure of the user's input. In this case:

<p>this is my <strong>blog</strong> post, very <i>l</i>...</p>

which renders as

this is my <strong>blog</strong> post, very <i>l</i>...

Is there a Java library able to do this, or a simple method to use?

MyLibrary.abbreviateHTML(string,20) ?

Since this is not easy to do correctly, I usually strip all tags and truncate. This gives good control over the size and appearance of the text, which matters because truncated text usually ends up in places where you need that control.

Note that you may find my proposal very conservative, and it is not really a proper answer to your question. But most of the time the alternatives are:

strip all tags and truncate
provide alternate, content-manageable rich text that serves as the truncated text. This of course only works for CMSes and the like
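
A minimal sketch of the first alternative (strip all tags and truncate), using only the standard library. The names PlainTextTruncator and stripAndTruncate are made up for illustration, and the regular expression is a naive assumption about what a tag looks like: it does not decode entities and will misbehave if a > appears inside an attribute value or a comment.

```java
final class PlainTextTruncator {

    // Removes anything that looks like a tag, collapses runs of whitespace,
    // and cuts the remaining plain text to at most maxChars characters.
    static String stripAndTruncate(String html, int maxChars) {
        String text = html.replaceAll("<[^>]*>", "") // naive tag removal (assumption)
                          .replaceAll("\\s+", " ")   // newlines and tabs become one space
                          .trim();
        if (text.length() <= maxChars) {
            return text; // short enough, no ellipsis needed
        }
        return text.substring(0, maxChars) + "...";
    }
}
```

With the input from the question and a limit of 20, this should give "this is my blog post...".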

The reason truncating HTML is hard is that you don't know how the truncation would affect the structure of the HTML. How would you truncate in the middle of a <ul> or, even worse, in the middle of a complex <table>?

So the problem here is that HTML can contain not only content and styling (bold, italics) but also structure (lists, tables, divs, etc.). A good and safe implementation would therefore strip out everything except inline "styling" tags (bold, italics, etc.) and truncate while keeping track of unclosed tags.

I don't know of any library, but it should not be too complicated (for 80% of cases). You only need a simple "parser" that understands four types of tokens:

opening tags - everything that starts with < (but not </) and ends with > (but not />)
closing tags - everything that starts with </ and ends with >
self-closing tags (like <br/>) - everything that starts with < (but not </) and ends with />
normal characters - everything that is none of the other types
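
The four token types above could be told apart roughly like this. It is only a sketch: HtmlTokens, Kind and classify are hypothetical names, and it assumes the token has already been cut out of the input (a whole tag from < to >, or a single character).

```java
final class HtmlTokens {

    enum Kind { OPENING_TAG, CLOSING_TAG, SELF_CLOSING_TAG, TEXT }

    // Classifies one already-isolated token according to the four rules above.
    static Kind classify(String token) {
        boolean tagLike = token.length() >= 3
                && token.startsWith("<")
                && token.endsWith(">");
        if (!tagLike) {
            return Kind.TEXT;             // none of the tag types
        }
        if (token.startsWith("</")) {
            return Kind.CLOSING_TAG;      // e.g. </p>
        }
        if (token.endsWith("/>")) {
            return Kind.SELF_CLOSING_TAG; // e.g. <br/>
        }
        return Kind.OPENING_TAG;          // e.g. <p> or <a href="x">
    }
}
```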

Then walk through your input string and count the "normal characters". As you walk along the string and count, copy every token to the output as long as the number of normal characters counted is less than or equal to the amount you want.

You also need to maintain a stack of the currently open tags while you walk through the input. Every time you pass an opening tag, push its name onto the stack; every time you find a closing tag, pop the topmost tag name from the stack (hopefully the input is correct XHTML).

When you reach the required number of normal characters, you only need to write closing HTML tags for the tag names remaining on the stack.

But be careful: this only works if the input is well-formed XML.
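
Put together, the walk described above might look like the following sketch. HtmlAbbreviator, abbreviate and tagName are made-up names, not an existing library. The code assumes well-formed XHTML with no comments and no > inside attribute values, counts an entity such as &amp; as several characters, and appends "..." only when text was actually cut.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class HtmlAbbreviator {

    // Copies tokens to the output until maxChars normal characters have been
    // written, then closes whatever tags are still open on the stack.
    static String abbreviate(String html, int maxChars) {
        StringBuilder out = new StringBuilder();
        Deque<String> openTags = new ArrayDeque<>(); // names of currently open tags
        int count = 0;                               // normal characters copied so far
        boolean truncated = false;
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {
                int end = html.indexOf('>', i);
                if (end < 0) {
                    break; // unterminated tag: input is not well-formed, stop here
                }
                String tag = html.substring(i, end + 1);
                if (tag.startsWith("</")) {
                    if (!openTags.isEmpty()) {
                        openTags.pop();          // closing tag: drop the topmost name
                    }
                } else if (!tag.endsWith("/>")) {
                    openTags.push(tagName(tag)); // opening tag: remember its name
                }
                out.append(tag);                 // tags never count towards the limit
                i = end + 1;
            } else {
                if (count == maxChars) {
                    truncated = true;            // more text follows, but the budget is used up
                    break;
                }
                out.append(c);
                count++;
                i++;
            }
        }
        if (truncated) {
            out.append("...");
        }
        while (!openTags.isEmpty()) {
            out.append("</").append(openTags.pop()).append('>');
        }
        return out.toString();
    }

    // Extracts "a" from an opening tag such as <a href="x">.
    private static String tagName(String openingTag) {
        int end = 1;
        while (end < openingTag.length()
                && !Character.isWhitespace(openingTag.charAt(end))
                && openingTag.charAt(end) != '>') {
            end++;
        }
        return openingTag.substring(1, end);
    }
}
```

Note that in this sketch the ellipsis lands inside the innermost element that is still open: for the question's input on one line and a limit of 28, the result would be <p>this is my <strong>blog</strong> post, very <i>l...</i></p> rather than having the dots after </i>.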

I don't know what you want to do with this piece of code, but you should pay attention to HTML/JavaScript injection attacks.

If you really want to abbreviate HTML, then just do it (cut the text at the desired length), pass the abbreviated result through JTidy (http://jtidy.sourceforge.net/), and hope for the best.

It seems that there are a lot of libraries and tools for this common task:

truncateNicely from Jakarta Taglibs String (Jakarta Taglibs has been retired)
org.displaytag.util.HtmlTagUtil#abbreviateHtmlString from Display tag library 1.2 (already mentioned by Marnix van Bochove in his comment)