Parsing HTML in Python 3: re, html.parser, or something else?

Developer https://www.devze.com 2023-02-10 08:32 Source: Internet
I'm trying to get a list of craigslist states and their associated URLs. Don't worry, I have no intention of spamming; if you're wondering what this is for, see the * below.

What I'm trying to extract begins on the line after 'us states' and consists of the next 50 <li> elements. I read through the html.parser docs and it seemed too low-level for this: it looks aimed at building a DOM parser or doing syntax highlighting/formatting in an IDE rather than at searching, which makes me think my best bet is the re module. I would like to stick to what's in the standard library, just for the sake of learning. I'm not asking for help writing a regular expression (I'll figure that out on my own); I just want to make sure there isn't a better way to do this before I spend time on it.
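For comparison, here is a minimal sketch of how the extraction described above might look with html.parser alone. The class name StateLinkParser, the sample markup, and the example.org URLs are all made up for illustration; the real craigslist page structure is not shown in this thread and may well differ.

```python
from html.parser import HTMLParser


class StateLinkParser(HTMLParser):
    """Collect (name, url) pairs from <li><a> items that follow a marker text."""

    def __init__(self, marker="us states", limit=50):
        super().__init__()
        self.marker = marker      # text that precedes the list we want
        self.limit = limit        # stop collecting after this many links
        self.seen_marker = False
        self.in_li = False
        self.href = None          # href of the <a> currently being read
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if not self.seen_marker or len(self.links) >= self.limit:
            return
        if tag == "li":
            self.in_li = True
        elif tag == "a" and self.in_li:
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            self.links.append(("".join(self.text).strip(), self.href))
            self.href = None
        elif tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)
        elif not self.seen_marker and self.marker in data.lower():
            self.seen_marker = True


# Hypothetical markup, assumed only for this example.
SAMPLE = """
<h5>us states</h5>
<ul>
  <li><a href="https://example.org/al">alabama</a></li>
  <li><a href="https://example.org/ak">alaska</a></li>
</ul>
<h5>canada</h5>
<ul><li><a href="https://example.org/ab">alberta</a></li></ul>
"""

parser = StateLinkParser(limit=2)
parser.feed(SAMPLE)
print(parser.links)
```

The parser is event-driven, so the "search" is just a few flags that track whether the marker has been seen and whether the current tag is inside a list item. It is more code than a regular expression, but it does not depend on how the source happens to be split across lines.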

*This is my first program beyond simple Python scripts. I'm writing a C++ program to manage my posts and remind me when they've expired in case I want to repost them, and a Python script to download a list of all the US states and cities/areas to populate a combobox in the GUI. I don't really need it, but I'm aiming to make this 'production ready'/feature complete, both as a learning exercise and to build a portfolio that might help me get a job. I don't know whether I'll make the program publicly available; there's obvious potential for misuse, and it's probably against their ToS anyway.


The Python standard library includes xml.etree, an XML parser. You should not use regexes to parse XML. Go to the particular node that holds the information and extract the links from it.
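A rough sketch of that approach with xml.etree.ElementTree follows. It assumes the relevant fragment is well-formed XML (XHTML-style); the markup and example.org URLs are invented for the example. ElementTree raises ParseError on markup that is not well-formed, which is common in real-world HTML, so this may not work on the actual page without cleanup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; the real page may look different.
XHTML = """<div>
  <h5>us states</h5>
  <ul>
    <li><a href="https://example.org/al">alabama</a></li>
    <li><a href="https://example.org/ak">alaska</a></li>
  </ul>
</div>"""

root = ET.fromstring(XHTML)

# Walk to the node that holds the links and pull out name -> URL pairs.
states = {a.text: a.get("href") for a in root.findall("./ul/li/a")}
print(states)

# Markup that is not well-formed is rejected outright.
try:
    ET.fromstring("<ul><li>unclosed<br></ul>")
    rejected = False
except ET.ParseError:
    rejected = True
print("malformed input rejected:", rejected)
```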


Use lxml.html. It's the best Python HTML parser, and it supports XPath!
