Python HTML parsing with Beautiful Soup and filtering stop words

开发者 https://www.devze.com 2023-02-24 03:37 Source: the web
I am parsing specific information out of a website into a file. Right now my program fetches a web page, finds the right HTML tag, and extracts its contents. Now I want to filter these "results" further.

For example, on this page: http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx

I am parsing out the ingredients, which are located in the <div class="ingredients" ...> tag. The parser does the job nicely, but I want to process the results further.

When I run this parser, it removes numbers, symbols, commas, and slashes (\ or /) but leaves all the text. When I run it on the website I get results like:

cup olive oil
cup chicken broth
cloves garlic minced
tablespoon paprika

Now I want to process this further by removing stop words like "cup", "cloves", "minced", and "tablespoon", among others. How exactly do I do this? The code is written in Python, which I am not very good at. I am only using this parser to collect information that I could enter manually, but would rather not.

Any detailed help on how to do this would be appreciated! My code is below.

Code:

import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)

    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip('123456789.,/\ ') for s in ingreds.findAll('li')]

    fname = 'PorkRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__":
    main()
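
A side note on the strip call in the code above, as a minimal sketch: str.strip with a character argument only trims characters from the two ends of the string, and the set '123456789.,/\ ' has no '0' in it. The helper name strip_ends and the sample strings are made up for illustration; they are not taken from the recipe page.

```python
# Characters the original code strips; note that '0' is missing from the set.
# (Written with an escaped backslash so it is also valid in Python 3.)
STRIP_CHARS = '123456789.,/\\ '

def strip_ends(text):
    # str.strip(chars) removes any of `chars` from both ends only,
    # stopping at the first character that is not in the set.
    return text.strip(STRIP_CHARS)

# Hypothetical ingredient lines, for illustration only.
samples = ['1/4 cup olive oil', '2 cloves garlic, minced', '10 cups water']
for sample in samples:
    print(repr(strip_ends(sample)))
```

The comma in the middle of the second sample survives, and the third sample comes back as '0 cups water' because stripping stops at the '0'.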


import urllib2
import BeautifulSoup
import string

badwords = set([
    'cup','cups',
    'clove','cloves',
    'tsp','teaspoon','teaspoons',
    'tbsp','tablespoon','tablespoons',
    'minced'
])

def cleanIngred(s):
    # remove leading and trailing whitespace
    s = s.strip()
    # remove digits and punctuation from both ends of the string
    # (str.strip never touches characters in the middle)
    s = s.strip(string.digits + string.punctuation)
    # remove unwanted words
    return ' '.join(word for word in s.split() if word not in badwords)

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)

    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [cleanIngred(s.getText()) for s in ingreds.findAll('li')]

    fname = 'PorkRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__":
    main()

results in

olive oil
chicken broth
garlic,
paprika
garlic powder
poultry seasoning
dried oregano
dried basil
thick cut boneless pork chops
salt and pepper to taste

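The comma after "garlic" is left in because str.strip only removes characters from the two ends of the string. Presumably the original line read something like "2 cloves garlic, minced": the comma sits in the middle when s.strip(string.digits + string.punctuation) runs, and only ends up at the end of the line once "minced" has been filtered out.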
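
One way to get rid of the stray comma, as a minimal sketch: remove digits and punctuation everywhere in the string with a regular expression instead of strip, and then filter the stop words. The name clean_ingred and the sample strings are hypothetical. Treating every non-letter as a separator is an assumption that suits plain ASCII ingredient lines like the ones shown here; it would also split hyphenated words and drop accented letters.

```python
import re

# Same stop words as in the code above.
BADWORDS = set([
    'cup', 'cups',
    'clove', 'cloves',
    'tsp', 'teaspoon', 'teaspoons',
    'tbsp', 'tablespoon', 'tablespoons',
    'minced',
])

def clean_ingred(s):
    # Replace every character that is not an ASCII letter or whitespace
    # with a space, so digits and punctuation go away anywhere in the string.
    s = re.sub(r'[^A-Za-z\s]', ' ', s)
    # Drop stop words (compared case-insensitively) and rejoin the rest.
    return ' '.join(w for w in s.split() if w.lower() not in BADWORDS)

# Hypothetical ingredient lines, for illustration only.
print(clean_ingred('2 cloves garlic, minced'))
print(clean_ingred('1/4 cup olive oil'))
```

In main(), this would be called in place of cleanIngred inside the list comprehension; the rest of the script stays the same.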
