HTML::PullParser splits up text element randomly

Developer (https://www.devze.com), 2023-03-28 19:34, source: Internet

I'm using Perl module HTML::PullParser. I noticed that it sometimes splits up a text element (as far as I can tell) randomly.

For example, if I have an HTML file test.html with the following content:

<html>
...
<FONT STYLE="font-family:Times New Roman" SIZE="2">THE QUICK BROWN FOX</FONT>
...
</html>

And my Perl code looks something like this:

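use strict;
use warnings;
use HTML::PullParser;

my $html = HTML::PullParser->new(file => 'test.html', text => '"T", text');
while (my $token = $html->get_token) {
    print "$token->[1]\n";    # element 0 is the literal "T", element 1 is the text
}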

Then sometimes I get back

THE QUICK BROWN FOX    # correctly parsed

But other times I get

THE QUICK
 BROWN FOX

where the text element is parsed into two separate tokens. Yet at other times, depending on the other content of the html file, I get

THE QUICK BROWN
 FOX

where the breaking point is different. This behavior is extremely annoying. And I tried my best to isolate the problem. Looks like it is dependent on the entirety of the file (i.e. if I delete the rest of the file to have only that element left, then it is fine). However, I'm not able to identify what part of the rest of the file caused this. Wondering if anyone had similar experience and know how to get around the issue? Thx!!

UPDATE: the occurrence of this errant behavior is also NOT dependent on a single section of HTML elsewhere in the file. I was able to isolate two sections of HTML that come before that text element: when both of them are present, the error occurs, but when either one is present without the other, the problem goes away... I'm absolutely confused and annoyed.

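HTML::PullParser is a subclass of HTML::Parser. HTML::Parser has an unbroken_text attribute that controls whether it emits text events as soon as possible or buffers text until the parser knows that no more text is coming. The default is to report text as soon as possible. The input is fed to the parser in chunks, so a text element that happens to straddle the end of a chunk comes out in pieces (the parser only ever breaks text between whitespace and non-whitespace). That would explain why the split point moves when the rest of the file changes. A $p->unbroken_text(1) call ($html->unbroken_text(1) in the code above) should make it buffer :)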
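
A minimal sketch of that fix, parsing from an in-memory string so it runs without test.html. The text_tokens helper and the filler markup are hypothetical and only there for illustration. It also requests start and end events. This rests on my reading of the HTML::Parser documentation, which suggests that buffered text is only flushed when another reported event arrives, so with tags ignored, text from neighbouring elements could be merged into one token. Treat that part as an assumption rather than confirmed behaviour.

```perl
use strict;
use warnings;
use HTML::PullParser;

# Hypothetical helper: return the non-blank text tokens of an HTML string,
# with unbroken_text enabled so each run of text arrives in one piece.
sub text_tokens {
    my ($html) = @_;

    my $p = HTML::PullParser->new(
        doc   => \$html,
        start => '"S", tagname',   # report tags too, so buffered text is
        end   => '"E", tagname',   # flushed at element boundaries (assumption)
        text  => '"T", text',
    );
    $p->unbroken_text(1);          # buffer text instead of emitting it early

    my @texts;
    while (my $token = $p->get_token) {
        my ($type, $value) = @$token;
        next unless $type eq 'T' && $value =~ /\S/;   # skip tags and blank text
        push @texts, $value;
    }
    return @texts;
}

# Made-up filler so the FONT element sits well inside a larger document.
my $padding = "<p>filler</p>\n" x 300;
my $html = "<html>\n$padding"
         . '<FONT STYLE="font-family:Times New Roman" SIZE="2">THE QUICK BROWN FOX</FONT>'
         . "\n</html>\n";

print "$_\n" for grep { /QUICK|FOX/ } text_tokens($html);
```

The intent is that the FONT text comes back as a single token no matter where it falls in the input. In the original script the equivalent change is one line, $html->unbroken_text(1), placed between the constructor call and the while loop.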