I am working on indexing feeds from the Internet. I would like to remove the HTML code that appears in some of them. I have used regular expressions for the cases I have seen, but I would like a way to remove all of it automatically, because I don't know whether I have seen every kind of HTML that can appear in my feeds. Is this possible? Here is an example of what I would like to remove: /0831/oly_g_liukin_576.jpg" height="49" width="41" /> BEIJING - AUGUST 15: Nastia Liukin of the...
Use jsoup, a Java library that works well for stripping HTML code from a string:
http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer
In C# it could look something like this (it removes HTML tags):
public static String RemoveHtmlTagsFromString(String source)
{
    // The output can never be longer than the input.
    char[] array = new char[source.Length];
    int arrayIndex = 0;
    // True while the scan is between a '<' and the next '>'.
    bool inside = false;
    foreach (char ch in source)
    {
        if (ch == '<')
        {
            inside = true;
            continue;
        }
        if (ch == '>')
        {
            inside = false;
            continue;
        }
        // Keep only characters that are outside a tag.
        if (!inside)
        {
            array[arrayIndex] = ch;
            arrayIndex++;
        }
    }
    return new string(array, 0, arrayIndex);
}
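Since the other answer points to a Java library, here is a rough Java port of the same character scan, as a minimal sketch rather than a full HTML parser. The class name `HtmlTagStripper` and the method name `removeHtmlTags` are made up for illustration. It assumes every `<` opens a tag and every `>` closes one, and it does not decode entities such as `&amp;`.

```java
public class HtmlTagStripper {

    // Hypothetical helper: drops everything between '<' and '>' (inclusive).
    public static String removeHtmlTags(String source) {
        StringBuilder out = new StringBuilder(source.length());
        boolean inside = false; // true while scanning inside a tag
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '<') {
                inside = true;
                continue;
            }
            if (c == '>') {
                inside = false;
                continue;
            }
            if (!inside) {
                out.append(c); // keep text that is outside any tag
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "Hello world"
        System.out.println(removeHtmlTags("<p>Hello <b>world</b></p>"));
    }
}
```

Note that a fragment whose opening `<` is missing, like the sample in the question, is not recognized as a tag by this scan: only the stray `>` is dropped and the attribute text stays in the output. Feeds with truncated markup like that would likely need extra handling.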