What is the best way to organize scraped data into a csv? More specifically each item is in this form
url
"firstName middleInitial, lastName - level - word1 word2 word3, & wordN practice officeCity."
JD, schoolName, date
Example:
http://www.examplefirm.com/jang
"Joe E. Ang - partner - privatization mergers, media & technology practice New York."
JD, University of Chicago Law School, 1985
I want to put this item in this form:
(http://www.examplefirm.com/jang, Joe, E., Ang, partner, privatization mergers, media & technology, New York, University of Chicago Law School, 1985)
so that I can write it into a csv file to import to a django db.
What would be the best way of doing this?
Thank you.
There's really no shortcut here. Line 1 is easy: just assign it to url. Line 3 can probably be split on "," without any ill effects, but line 2 will have to be parsed manually. What do you know about word1-wordN? Are you sure "practice" will never be one of the words? Are you sure the words are only one word long? Can they be quoted? Can they contain dashes?
Then I would parse out the beginning and end bits, so you're left with a list of words, split it by commas and/or & (is there a consistent comma before &? Your format says yes, but your example says no.) If there are a variable number of words, you don't want to inline them in your tuple like that, because you don't know how to get them out. Create a list from your words, and add that as one element of the tuple.
>>> tup = (url, first, middle, last, rank, words, city, school, year)
>>> tup
('http://www.examplefirm.com/jang', 'Joe', 'E.', 'Ang', 'partner',
['privatization mergers', 'media & technology'], 'New York',
'University of Chicago Law School', '1985')
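As a rough illustration of the approach described above, here is a minimal sketch that builds that tuple from the three lines of one item. The function name parse_item and the regular expression are my own assumptions, not something from the question; it assumes the level and the word list are delimited by " - ", that the last occurrence of "practice" separates the words from the city, and that the school line always starts with the degree and ends with the year. Real data will probably need more cases.

```python
import re

# Assumed layout of line 2 (hypothetical pattern, adjust to the real data):
#   "First M. Last - level - words practice City."
BIO = re.compile(
    r'^"?(?P<first>\S+)\s+(?P<middle>\S+?),?\s+(?P<last>.+?)'
    r'\s+-\s+(?P<level>.+?)\s+-\s+'
    r'(?P<words>.+)\s+practice\s+(?P<city>.+?)\.?"?$'
)


def parse_item(url_line, bio_line, degree_line):
    """Turn the three scraped lines of one item into a tuple."""
    m = BIO.match(bio_line.strip())
    if m is None:
        # Fail loudly so odd records can be inspected by hand.
        raise ValueError("unrecognised line 2: %r" % bio_line)

    # Split the practice areas on commas; drop a leading "&" left over
    # from the "a, b, & c" variant of the format.
    words = [w.strip().lstrip("&").strip() for w in m.group("words").split(",")]
    words = [w for w in words if w]

    # Line 3: degree first, year last, everything in between is the school
    # (so a school name containing a comma still works).
    _degree, rest = degree_line.split(",", 1)
    school, year = rest.rsplit(",", 1)

    return (url_line.strip(), m.group("first"), m.group("middle"),
            m.group("last"), m.group("level"), words, m.group("city"),
            school.strip(), year.strip())


tup = parse_item(
    "http://www.examplefirm.com/jang",
    '"Joe E. Ang - partner - privatization mergers, media & technology practice New York."',
    "JD, University of Chicago Law School, 1985",
)
print(tup)
```

Raising on a non-matching line is deliberate: with scraped data it is usually better to collect the failures and look at them than to guess silently.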
More specifically? You're on your own there.
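For the CSV step itself, the standard csv module handles the quoting. The sketch below is only one possible layout: the column names in FIELDS and the helper write_rows are hypothetical, and it assumes you are happy to join the variable-length word list into a single cell with a separator of your choosing, which you would split again when importing.

```python
import csv
import io

# Hypothetical column names; rename them to match your Django model fields.
FIELDS = ["url", "first", "middle", "last", "level",
          "words", "city", "school", "year"]


def write_rows(rows, fileobj, word_sep="; "):
    """Write tuples shaped like `tup` above to an open text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(FIELDS)
    for url, first, middle, last, level, words, city, school, year in rows:
        # The word list becomes one cell, so every row has the same width.
        writer.writerow([url, first, middle, last, level,
                         word_sep.join(words), city, school, year])


row = ("http://www.examplefirm.com/jang", "Joe", "E.", "Ang", "partner",
       ["privatization mergers", "media & technology"], "New York",
       "University of Chicago Law School", "1985")

buf = io.StringIO()  # stand-in for a real file in this example
write_rows([row], buf)
print(buf.getvalue())
```

With a real file, open it as open("people.csv", "w", newline="", encoding="utf-8") (the file name is just an example) and pass that object instead of buf; newline="" is what the csv module documentation recommends in Python 3.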