I've attempted to build a program to scrape the web for company management teams. It's very accurate at obtaining many things, including:
-names
-job titles
-images
-emails
-qualifications (MD, PhD, etc.) and suffixes (II, III, Jr.)
The issue I'm running into is scraping the person's description. For instance, on Facebook's Executive Bios page I would want Mark Zuckerberg's description. However, with all the differences in HTML structure, it is very difficult to scrape this with close to 100% accuracy.
I am using Perl and many regular expressions, which I believe to be fairly advanced. Is there a better way or tool to approach the problem?
My latest attempt was to find the last occurrence of the person's full name on the page, then take all the text until I hit a co-worker's name. While this seems like it should work, it gives me less than desirable results.
EDIT: I realized this question came off as just trying to parse this specific page. I need something general enough to work on any company's "people page". I know 100% accuracy is unachievable; I'm looking for something that would get me to 50% or better, as I'm currently down around 15-20%.
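A minimal sketch of that heuristic, assuming the page has already been reduced to plain text. The helper name description_after_last_mention and the sample text are made up for illustration:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: return the text that follows the last occurrence of
# $name in $text, cut off at the first co-worker name that appears after it.
sub description_after_last_mention {
    my ( $text, $name, @coworkers ) = @_;

    # rindex finds the last occurrence; -1 means the name is absent.
    my $start = rindex( $text, $name );
    return if $start < 0;
    $start += length $name;

    # Find the nearest co-worker name after that point.
    my $end = length $text;
    for my $other (@coworkers) {
        my $pos = index( $text, $other, $start );
        $end = $pos if $pos >= 0 && $pos < $end;
    }

    my $description = substr( $text, $start, $end - $start );
    $description =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    return $description;
}

# Made-up page text for illustration only.
my $page_text = 'Our team: Ann Lee, Bob Ray. '
    . 'Ann Lee founded the firm in 2001. '
    . 'Bob Ray runs engineering.';

print description_after_last_mention( $page_text, 'Ann Lee', 'Bob Ray' ), "\n";
```

On this toy input it prints "founded the firm in 2001.". On real pages the last mention of a name is often in a footer, a caption, or someone else's bio, which is where the results fall apart.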
Using regular expressions to parse HTML is bound to fail sooner or later.
A few modules that can help with parsing HTML:
WWW::Mechanize
HTML::TreeBuilder
If you need more control over parsing HTML, you could use HTML::Parser.
Furthermore, there have been several questions about parsing HTML with Perl on Stack Overflow. The answers there can be helpful.
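For the general case, where the class names differ from site to site, a tree parser lets you reason about blocks instead of character offsets. The following is only a sketch under assumptions: the helper bio_block_text and the sample markup are hypothetical, and it presumes you already know the names on the page. It finds the innermost element mentioning a person, then climbs to the largest enclosing block that mentions nobody else:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: given a parsed tree, a person's name and the names of
# the other people on the page, return the text of the largest block that
# mentions this person and none of the others.
sub bio_block_text {
    my ( $tree, $name, @others ) = @_;

    # Innermost elements whose text contains the name: keep an element only
    # if none of its child elements also contains the name.
    my @hits = $tree->look_down(
        sub {
            my $el = shift;
            return 0 unless index( $el->as_text, $name ) >= 0;
            my $children_with_name =
                grep { ref $_ && index( $_->as_text, $name ) >= 0 }
                $el->content_list;
            return $children_with_name == 0;
        }
    );
    return unless @hits;

    # Start from the last mention and climb while the parent is still
    # free of other people's names.
    my $block = $hits[-1];
    while ( my $parent = $block->parent ) {
        my $text = $parent->as_text;
        last if grep { index( $text, $_ ) >= 0 } @others;
        $block = $parent;
    }
    return $block->as_text;
}

# Made-up markup for illustration only.
my $html = <<'HTML';
<html><body>
<div class="team">
  <div><h3>Ann Lee</h3><p>Ann Lee founded the firm in 2001.</p></div>
  <div><h3>Bob Ray</h3><p>Bob Ray runs engineering.</p></div>
</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);
print bio_block_text( $tree, 'Ann Lee', 'Bob Ray' ), "\n";
$tree->delete;
```

Note that as_text joins text nodes without separators, so the heading and the paragraph run together and need some post-processing. The sketch will also pick the wrong block when a person is mentioned inside a colleague's bio.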
A sample scraper for the Facebook Executive Bios page, which uses LWP::UserAgent to fetch the page content and HTML::TreeBuilder to parse it:
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

binmode STDOUT, ':utf8';

# Fetch the page; some servers reject requests without a browser-like agent.
my $ua       = LWP::UserAgent->new( 'agent' => 'Mozilla' );
my $response = $ua->get('http://www.facebook.com/press/info.php?execbios');
die $response->status_line() unless $response->is_success();

my $tree = HTML::TreeBuilder->new();
$tree->parse_content( $response->decoded_content() );

# Each executive's entry is wrapped in an element with class "biosummary".
for my $biosummary_tag ( $tree->look_down( 'class' => 'biosummary' ) ) {
    my $bioname_tag  = $biosummary_tag->look_down( 'class' => 'bioname' );
    my $biotitle_tag = $biosummary_tag->look_down( 'class' => 'biotitle' );
    my $biodescription_tag
        = $biosummary_tag->look_down( 'class' => 'biodescription' );

    # look_down returns undef when nothing matches, so guard each as_text().
    my $bioname  = $bioname_tag  ? $bioname_tag->as_text()  : '';
    my $biotitle = $biotitle_tag ? $biotitle_tag->as_text() : '';
    my $biodescription
        = $biodescription_tag ? $biodescription_tag->as_text() : '';

    print "Name: $bioname\n";
    print "Title: $biotitle\n";
    print "Description: $biodescription\n\n";
}

# Release the tree explicitly once done.
$tree->delete();
You are never going to get 100% accuracy, at least not with today's technology.
The most reliable way is to have the source marked up, but since you are web scraping you don't have that. Rather than regular expressions, you could try more sophisticated Natural Language Processing (NLP) techniques. I don't know what is available for Perl, but Python's NLTK is a good place to start. It is a toolkit designed so that you can pick and choose the pieces you need to extract the information you want, and there are a couple of good books available, including the open-source O'Reilly book Natural Language Processing with Python.