I've got a file of book data in MARC format, in which some lines are ISBNs. I'd like to replace each of these lines with the Google Books ID for that ISBN, if one exists. Here's the code so far, which just ends up removing the lines:
perl -pe "s#ISBN(.*)#$(wget --output-document=- --quiet --user-agent=Mozilla/5.0 \"http://books.google.com/books?jscmd=viewapi&bibkeys=\1\")#mg" < 5-${file} > 6-${file}
PS: Google are a bit fuzzy on the use of automated tools: The Books Data API recommends tools like curl / wget, but there are no instructions on how to avoid being blocked when using such tools. I'm also pretty sure I saw a clause in a ToS saying users can't send automated queries, but I can't find it again. This is discussed in their forum.
You end up having to lie about the user agent because you are violating Google's TOS. Don't do that.
Instead, use the Google Book Search API.
The code below is slightly hampered by my lack of familiarity with modules such as XML::Atom, Data::Feed, WWW::OpenSearch. However, it should provide a good starting point.
#!/usr/bin/perl
use strict;
use warnings;
use Business::ISBN qw( valid_isbn_checksum );
use Carp;
use LWP::Simple;
use XML::Simple;

# Replace every "ISBN:<digits>" in the input with a Google Books ID.
# If the lookup fails, the ID part is left empty.
while ( <> ) {
    s/ISBN:([0-9]+)/'Google Books ID:' . get_google_id_for_isbn($1)/ge;
    print;
}

# Build the feed URL that searches for a single ISBN.
sub make_google_books_query {
    sprintf 'http://books.google.com/books/feeds/volumes?q=isbn:%s', $_[0];
}

sub get_google_id_for_isbn {
    my ($isbn) = @_;
    # Any croak inside the eval leaves $google_id undefined.
    my $google_id = eval {
        defined(valid_isbn_checksum $isbn)
            or croak "Invalid ISBN: $isbn";
        my $query = make_google_books_query($isbn);
        my $xml = get $query;
        defined($xml)
            or croak "No response to <$query>";
        my $data = XMLin($xml, ForceArray => 1);
        # The first identifier is the Google Books ID; the next two
        # are expected to be the ISBNs of the matching volume.
        my @ids = @{ $data->{entry}[0]{'dc:identifier'} };
        unless ("ISBN:$isbn" eq $ids[1]
                or "ISBN:$isbn" eq $ids[2] ) {
            croak "Invalid search results: '@ids'";
        }
        $ids[0];
    };
    defined($google_id) ? $google_id : '';
}
Given a text file t.txt containing:
ISBN:0060930314 ISBN:9780596520106
it outputs:
Google Books ID:ioXFqlzsmK8C Google Books ID:lNVHi3TunxsC
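For readers without Business::ISBN installed, here is a rough sketch of the kind of check-digit test the script relies on before it queries Google. The helper name looks_like_valid_isbn is made up for this illustration and is not part of Business::ISBN; it only checks length and check digit, not whether the ISBN was ever assigned. Note that an ISBN-10 may end in X, which the ISBN:([0-9]+) pattern in the script above would not capture.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from Business::ISBN): returns 1 when the string
# is a 10- or 13-character ISBN with a correct check digit, otherwise 0.
sub looks_like_valid_isbn {
    my ($isbn) = @_;
    return 0 unless defined $isbn;

    if ( $isbn =~ /\A[0-9]{9}[0-9Xx]\z/ ) {
        # ISBN-10: weights run 10..1, a final X counts as 10,
        # and the weighted sum must be divisible by 11.
        my $sum    = 0;
        my $weight = 10;
        for my $c ( split //, uc $isbn ) {
            $sum += ( $c eq 'X' ? 10 : $c ) * $weight--;
        }
        return $sum % 11 == 0 ? 1 : 0;
    }

    if ( $isbn =~ /\A[0-9]{13}\z/ ) {
        # ISBN-13: weights alternate 1,3,1,3,... and the weighted
        # sum must be divisible by 10.
        my @digits = split //, $isbn;
        my $sum    = 0;
        for my $i ( 0 .. $#digits ) {
            $sum += $digits[$i] * ( $i % 2 ? 3 : 1 );
        }
        return $sum % 10 == 0 ? 1 : 0;
    }

    return 0;    # wrong length or stray characters
}

# The first two ISBNs are the ones from t.txt; the third has a wrong check digit.
for my $isbn (qw( 0060930314 9780596520106 0060930315 )) {
    printf "%s: %s\n", $isbn, looks_like_valid_isbn($isbn) ? 'ok' : 'bad';
}
```

Running the check locally first means malformed lines never trigger a network request.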
I think the OP is on the right track and could use a one-liner for this; the bash-style syntax just needs to be replaced with the correct Perl syntax. Something like this should work (newlines added for readability):
perl -pe 's#ISBN(\w+)#qx(wget --output-document=-
--quiet --user-agent=Mozilla/5.0
"http://books.google.com/books\\?jscmd=viewapi\\&bibkeys=$1")#ge' \
< 5-${file} > 6-${file}
You have to escape the ? and & characters in the URL (edit: double escaping seems to work).
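The command in the question most likely loses the lines because the $(...) sits inside a double-quoted shell string: the shell runs wget once, before Perl starts, with a literal \1 as the bibkey, and whatever it prints (probably nothing) becomes the fixed replacement text. With /e, Perl instead evaluates the replacement for every match. Here is a small sketch of that pattern with the network call swapped for a stub, so it can be tried offline. The names lookup_id and replace_isbns and the lookup table are invented for this example; the two IDs are simply the ones from the sample output earlier in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the real lookup (wget, LWP, ...). Illustrative data only.
my %fake_ids = (
    '0060930314'    => 'ioXFqlzsmK8C',
    '9780596520106' => 'lNVHi3TunxsC',
);

sub lookup_id {
    my ($isbn) = @_;
    return $fake_ids{$isbn};    # undef when the ISBN is unknown
}

sub replace_isbns {
    my ($line) = @_;
    # /e runs the replacement block as Perl code for each match, so $1
    # holds the ISBN that was just matched. Unknown ISBNs are written
    # back unchanged instead of being replaced with an empty string.
    $line =~ s{ISBN:([0-9]+[Xx]?)}{
        my $isbn = $1;
        my $id   = lookup_id($isbn);
        defined $id ? "Google Books ID:$id" : "ISBN:$isbn";
    }ge;
    return $line;
}

print replace_isbns("ISBN:0060930314 ISBN:1234567890\n");
```

Keeping the original text when no ID comes back avoids the "lines just disappear" symptom from the question, and swapping the stub for a real request only touches lookup_id.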