开发者

What would I use to remove escaped html from large sets of data

开发者 https://www.devze.com 2022-12-26 23:08 出处:网络
Our database is filled with articles retrieved from RSS feeds.I was unsure of what data I would be getting, and how much filtering was already setup (WP-O-Matic Wordpress plugin using the SimplePie li

Our database is filled with articles retrieved from RSS feeds. I was unsure of what data I would be getting, and how much filtering was already setup (WP-O-Matic Wordpress plugin using the SimplePie library). This plugin does some basic encoding before insertion using Wordpress's built in post insert function which also does some filtering. Between the RSS feed's encoding, the plugin's encoding using PHP, Wordpress's encoding and SQL escaping, I'm not sure where to start.

The data is usually at the end of the field after the content I want to keep. It is all on one line, but separated out for readability:

<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?i=xFxEpT2Add0:xFbIkwGc-fk:V_sGLiPBpWU" border="0"></img>

<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?d=qj6IDK7rITs" border="0"></img>

&lt;img src=&quot;http://feeds.feedburner.com/~ff/SoundOnTheSound?i=xFxEpT2Add0:xFbIkwGc-fk:D7DqB2pKExk&quot;

Notice how some of the images are escape and some aren't. I believe this has to do with the last part being cut off so as to be unrecognizable as an html tag, which then caused it to be html endcoded while the actual img tags were left alone.

Another record has only this in one of the fields, which means the RSS feed gave me nothing for the item (filtered out now, but I have a bunch of records like this):

&lt;img src=&quot;http://farm3.static.flickr.com/2183/2289902369_1d95bcdb85.jpg&quot; alt=&quot;post_img&quot; width=&quot;80&quot;

All extracted samples are on one line, but broken up for readability. Otherwise, they are copied exactly from the database from the command line mysql client.

Question: What is the best way to work with the above escaped html (or portion of an html tag), so I can then remove it without affecting the content?

I want to remove it, because the images at the end of the field are usually images that have nothing to do with content. In the case of the feedburner ones, feedburner adds those to every single article in a feed. Other times, they're broken links surrounding broken images. The point is not the valid html img tags which can be removed easily. It's the mangled tags which if unencoded will not be valid html, which will not be parsable with your standard html parsers.

[EDIT] If it was just a matter of pulling the html I wanted out and doing a strip_tags and reinserting the data, I wouldn't be asking this question.

The portion that I have a problem with is that what used to be an img tag was html encoded and the end cut off. If it's deencoded it will not be an html tag, so I cannot parse it the usual way.

With all the &lt;img src=&quot; crap, I can't get my head around searching for it other than SELECT ID, post_content FROM table WHERE post_content LIKE '&lt;img' which at least gets me those posts. But when I get the data, I need a way to find it, remove it, but keep the rest of the content.

[/EDIT]

[EDIT 2]

<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="post_img" width="80" />Through the first two months of the year, the volume of cargo handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.<div> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/bizj_portland?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:qj6IDK7rITs">&lt;img src=&quot;http://feeds.feedburner.com/~ff/bizj_portland?d=qj6IDK7rITs&quot;

The part I want to keep:

<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="post_img" width="80" />Through the 开发者_运维技巧first two months of the year, the volume of cargo handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.

To reiterate: It's not about removing the valid html img tags. That's easy. I need to be able to find specifically the &lt;img src=&quot;http://feeds.feedburner.com/~ff/bizj_portland?d=qj6IDK7rITs&quot; if it's part of the pattern of img tag img tag mangled img tag or anchor img anchor img img mangled image etc etc, but not remove &lt;img if it is indeed part of the article. Out of the few dozen samples I've reviewed, it's been pretty consistent that this mangled img tag is at the end of the field.

The other one is the single mangled image tag. It's consistently a mangled flickr img tag, but as above, I can't just search for &lt;img as it could be a valid part of the content.

The problem lies in that I can't simply decode it and parse it as HTML, because it will not be valid html. [/EDIT 2]


The best way is to:

  1. Install HTML::Entities from CPAN and use that to unescape the URIs.
  2. Install HTML::Parser from CPAN and use that to parse and remove the URIs after they're unescaped.

Regexes are not a suitable tool for this task.


Question updated...

To extract the data you want, you could use this approach:

use HTML::Entities qw/decode_entities/;

my $decoded = decode_entities $raw;

if ($decoded =~ s{ (<img .+? (?:>.+?</img>|/>)) } {}x) {  # grab the image
    my $img = $1;
    $decoded =~ s{<.+?>}      {}xg;  # strip complete tags
    $decoded =~ s{< [^>]+? $} {}x;   # strip trailing noise

    print $img.$decoded;
}

Using a regex to parse HTML is generally frowned upon, however, in this case, it is more about stripping out segments that match a pattern. After testing the regexes on a larger set of data, you should have an idea of what might need to be tweaked.

Hope this helps.


I wouldn't strip it out. It's far from unrecoverable junk.

First apply HTML::Entities::decode_entities conditionally (use the occurence of < as the first character as heuristic), then let HTML::Tidy::libXML->clean(…, 'UTF-8', 1) reconstruct the mark-up as intended. clean returns a whole document, but it's trivial to extract just the needed img element.


Your best bet will be to recollect all of the articles that are in the database so that they aren't truncated and corrupted. If this is not an option then...

Based on your examples above it looks like you're stripping out everything that follows the text content of each article. In your example the text content is followed by a DIV tag and a bunch of IMG tags that may or may not have been truncated and or been converted into HTML entities.

If all of your records are similar you can strip out everything after the Text content by removing the final div tag and everything that follows it using perl like this:

my $article = magic_to_get_an_article();
$article =~ s/<div>.*//s;
magic_to_store_article($article);

If your records include anything more complex than this you're better off using an HTML parsing module and reading the documentation carefully to find out how it handles invalid HTML.


How about a stupid simple Perl find and replace on the var containing your data...

foreach $line(@lines) {
    $line =~ s/&lt;/</gi;
    $line =~ s/&gt;/>/gi;
}


Given the sample input and output you give at the end of your post, the following will get you the desired output:

#!/usr/bin/perl

use strict; use warnings;

use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new( \*DATA );

if ( my $tag = $parser->get_tag('img') ) {
    print $tag->as_is;
    print $parser->get_text('div');
}

__DATA__
<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="post_img" width="80" />Through the first two months of the year, the volume of cargo handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.<div> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/bizj_portland?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:qj6IDK7rITs">&lt;img src=&quot;http://feeds.feedburner.com/~ff/bizj_portland?d=qj6IDK7rITs&quot;

Output:

<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="po st_img" width="80" />Through the first two months of the year, the volume of car go handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.

However, I am puzzled as to the size and scope of each chunk you are supposed to process.

0

精彩评论

暂无评论...
验证码 换一张
取 消