Here is the mess of a script I have thus far: http://pastebin.com/prpdJXsq
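#Jeopardy!
#Goal is to create a list of lists, i.e.
#[[Category 1, Question 1, Answer 1], [Category 1, Question 2, Answer 2]]
#First iteration will just be Q
import urllib.request, re
Question = []
first_game_id = 3458
last_game_id = 3713
#wrong. ??? = figure out which re to insert?
#the pattern has to exist before the loop uses it; '(.*?)' captures the clue text
question = re.compile(r'clue_text">(.*?)</td>')
#answer = re.compile(r'correct_response">???&')
#+1 so that last_game_id itself is included
for gameid in range(first_game_id, last_game_id + 1):
    webpageid = "http://www.j-archive.com/showgame.php?game_id=" + str(gameid)
    temp = urllib.request.urlopen(webpageid)
    #read() returns bytes, so decode before matching against a str pattern
    webpage = temp.read().decode("utf-8", errors="replace")
    temp.close()
    #iterate over lines of text, not over the individual bytes of the response
    for line in webpage.splitlines():
        found = question.search(line)
        if found != None:
            Question.append(found.group(1))
print(Question)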
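#trying to use re match and compile to match the string and output tuple?
import urllib.request, re
webpageid = "http://www.j-archive.com/showgame.php?game_id=" + str(3713)
temp = urllib.request.urlopen(webpageid)
#read() returns bytes, so decode before matching against a str pattern
webpage = temp.read().decode("utf-8", errors="replace")
temp.close()
#'.*?' (not '*?') lazily matches the clue text between the tag and </td>
question = re.compile(r'clue_text">(.*?)</td>')
Question = []
##
##for line in webpage.splitlines():
##    print(line)
##
##    found = question.search(line)
##    if found != None:
##        Question.append(found.group(1))
##
##print(Question)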
I'm a novice (at best) trying to write a Python script to extract every Jeopardy question/answer from this awesome website: http://www.j-archive.com/showseason.php?season=27
My general approach was to follow the pseudocode I found in response to a similar question ("Jeopardy questions in Excel or other database format?"), but this is as far as I have gotten.
Any constructive criticism or outright jeering would be greatly appreciated.
I'd recommend using lxml and taking advantage of its XPath support:
import lxml.html
doc = lxml.html.parse('http://www.j-archive.com/showgame.php?game_id=1')
# get all td's with class="clue_text", these are the clues
clues = doc.xpath('//td[@class="clue_text"]')
# create a dict of clue_id, clue_text
# text_content() also picks up text nested inside child tags, which .text alone would cut off
clues_by_id = {x.attrib['id']: x.text_content() for x in clues}
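If installing lxml is not an option, roughly the same idea can be sketched with the standard library's html.parser instead of XPath. Treat this as a sketch under assumptions: the ClueParser class is a hypothetical helper, and the sample fragment (including its ids) is made up for illustration, so the real pages may be structured differently.

```python
from html.parser import HTMLParser


class ClueParser(HTMLParser):
    """Collect the text of every <td class="clue_text"> cell, keyed by its id."""

    def __init__(self):
        super().__init__()
        self.clues_by_id = {}
        self._current_id = None  # id of the clue cell we are inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Only track cells that have both the clue class and an id.
        if tag == "td" and attrs.get("class") == "clue_text" and "id" in attrs:
            self._current_id = attrs["id"]
            self.clues_by_id[self._current_id] = ""

    def handle_endtag(self, tag):
        if tag == "td":
            self._current_id = None

    def handle_data(self, data):
        # Text inside nested tags (e.g. <i>) is appended as well.
        if self._current_id is not None:
            self.clues_by_id[self._current_id] += data


# Made-up fragment for illustration; not copied from the real site.
sample = (
    '<table><tr>'
    '<td id="clue_J_1_1" class="clue_text">This <i>sample</i> clue</td>'
    '<td id="clue_J_1_2" class="clue_text">Another clue</td>'
    '<td class="other">ignored</td>'
    '</tr></table>'
)

parser = ClueParser()
parser.feed(sample)
parser.close()
clues_by_id = parser.clues_by_id
print(clues_by_id)
```

To run it against a real page, you would feed the parser the decoded text of the response instead of the sample string. Either way, a parser copes with attribute order and nested markup more gracefully than a regular expression does.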