Normalising book titles - Python

Developer https://www.devze.com 2022-12-23 07:17 Source: Internet

I have a list of books titles:

  • "The Hobbit: 70th Anniversary Edition"
  • "The Hobbit"
  • "The Hobbit (Illustrated/Collector Edition)[There and Back Again]"
  • "The Hobbit: or, There and Back Again"
  • "The Hobbit: Gift Pack"

and so on...


I thought that if I normalised the titles somehow, it would be easier to implement an automated way to know what book each edition is referring to.

import string
# keep only ASCII letters and digits (note that this also drops spaces)
normalised = ''.join(char for char in title
                     if char in string.ascii_letters + string.digits)

or

def normalise(title):
    # cut the title at the first separator character
    normalised = ''
    for char in title:
        if char in ':/()|':
            break
        normalised += char
    return normalised

But neither works as intended: titles can contain special characters, and different editions can have very different title layouts.


Help would be very much appreciated! Thanks :)


It depends completely on your data. For the examples you gave, a simple normalization solution could be:

import re

book_normalized = re.sub(r':.*|\[.*?\]|\(.*?\)|\{.*?\}', '', book_name).strip()

This will return "The Hobbit" for all the examples. What it does is remove anything after and including the first colon, or anything in brackets (normal, square, curly) as well as leading and trailing spaces.
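
To check that claim against the titles from the question, here is a minimal sketch that wraps the same substitution in a helper. The names `EDITION_NOISE`, `normalise_title` and `titles` are introduced here for illustration only; they are not part of the original answer.

```python
import re

# Same pattern as above: a colon plus everything after it,
# or any group in round, square or curly brackets.
EDITION_NOISE = re.compile(r':.*|\[.*?\]|\(.*?\)|\{.*?\}')

def normalise_title(book_name):
    """Drop the colon suffix and bracketed parts, then trim whitespace."""
    return EDITION_NOISE.sub('', book_name).strip()

# The five example titles from the question
titles = [
    "The Hobbit: 70th Anniversary Edition",
    "The Hobbit",
    "The Hobbit (Illustrated/Collector Edition)[There and Back Again]",
    "The Hobbit: or, There and Back Again",
    "The Hobbit: Gift Pack",
]

for title in titles:
    print(repr(normalise_title(title)))  # prints 'The Hobbit' each time
```

Compiling the pattern once is only a convenience when many titles are processed; the behaviour is the same as the one-line `re.sub` call.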

However, this is not a very good solution in the general case, as some books have colons or bracketed parts in the actual book name. For example, a title may consist of the name of the series, followed by a colon, followed by the name of the particular entry in the series.
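
One possible way to soften that problem, sketched here under loud assumptions, is to drop the part after the colon only when it looks like edition information. The keyword list in `EDITION_WORDS`, the function name `normalise_keep_subtitle` and the series-style title used below are all hypothetical and would need tuning against real data.

```python
import re

# Hypothetical list of edition-like words; extend it to fit your own data.
EDITION_WORDS = re.compile(
    r'\b(edition|anniversary|gift pack|illustrated|collector)\b',
    re.IGNORECASE,
)
# Bracketed groups are still removed unconditionally, as in the pattern above.
BRACKETED = re.compile(r'\[.*?\]|\(.*?\)|\{.*?\}')

def normalise_keep_subtitle(book_name):
    """Remove bracketed parts; cut at the first colon only for edition-like suffixes."""
    name = BRACKETED.sub('', book_name)
    head, sep, tail = name.partition(':')
    if sep and EDITION_WORDS.search(tail):
        name = head
    return name.strip()

print(normalise_keep_subtitle("The Hobbit: 70th Anniversary Edition"))
print(normalise_keep_subtitle("Star Wars: Heir to the Empire"))
```

With this sketch the first title is reduced to "The Hobbit", while the series-style title is left untouched. The trade-off is that a genuine subtitle such as "The Hobbit: or, There and Back Again" is also left untouched, so this heuristic alone does not merge every edition.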


I would suggest using a third-party web service such as LibraryThing, which I believe can do what you're asking. As a starting point, see their documentation:

http://www.librarything.com/services/rest/documentation/1.0/librarything.ck.getwork.php
