I am parsing a huge (I mean really huuuuge) xml file. It contains some million article entrys like this one:
<article key="journals/cgf/HaeglerWAGM10" mdate="2010-11-12">
<author>Simon Haegler</author>
<author>Peter Wonka</author>
<author>Stefan Müller Arisona</author>
<author>Luc J. Van Gool</author>
<author>Pascal Müller</author>
<title>Grammar-based Encoding of Facades.</title>
<pages>1479-1487</pages>
<year>2010</year>
<volume>29</volume>
<journal>Comput. Graph. Forum</journal>
<number>4</number>
<ee>http://dx.doi.org/10.1111/j.1467-8659.2010.01745.x</ee>
<url>db/journals/cgf/cgf29.html#HaeglerWAGM10</url>
</article>
I step through the file and parse those articles by lxml. If I run the code without storing the items into my database (comenting out populate_database()
it makes some 1000 entrys in ~3 seconds. But if I activate the storage (uncomment
populate_database()`) it makes some 10 entrys per second. Is this normal? I remember parsing the file once upton a time and the database was not such a bottle neck. But I had a different approach... (looking throug my files to find it...)
Here is the function, that makes my head ache. I commented out those three cursor.executes
and the code was racing again. So it seams, that something is wrong with MySQL or something is wrong with the executes (my noobisch guesses). Any advice?
def add_paper(paper, cursor):
questionmarks = str(('?',)*len(paper)).replace("'", "")
# The line above: produces (?, ?, ?, ... ,?) for oursql query
keys, values = paper.keys(), paper.values()
keys = str(tuple(keys)).replace("'", "")
# The line above: produces (mdate, title, ... date, some_key)
query_paper = '''INSERT INTO dblp2.papers {0} VALUES {1};'''.\
format(keys, questionmarks)
values = tuple(v.encode('utf8') for v in values)
cursor.execute(query_paper, values)
paper_id = cursor.lastrowid
return paper_id
def populate_database(paper, authors, cursor):
paper_id = add_paper(paper, cursor)
query_author ="""INSERT INTO dblp2.authors
(name) VALUES (?) ON DUPLICATE KEY UPDATE
id=LAST_INSERT_ID(id)"""
query_link_table = "INSERT INTO dblp2.author_paper
(author_id, paper_id) VALUES (?, ?)"
for author in authors:
cursor.execute(query_author, (author.encode('utf8'),))
author_id = cursor.lastrowid
cursor.execute(query_link_table, (author_id, paper_id))
I added a profiling output from cProfile
:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 15.666 15.666 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 __init__.py:49(normalize_encoding)
1 0.000 0.000 0.000 0.000 __init__.py:71(search_function)
510 0.002 0.000 0.002 0.000 _elementpath.py:222(_build_path_iterator)
510 0.005 0.000 0.008 0.000 _elementpath.py:260(iterfind)
408 0.005 0.000 0.017 0.000 _elementpath.py:270(find)
102 0.003 0.000 0.011 0.000 _elementpath.py:285(findall)
10 0.000 0.000 0.000 0.000 _elementpath.py:70(xpath_tokenizer)
5 0.000 0.000 0.000 0.000 _elementpath.py:85(prepare_child)
987 0.009 0.000 0.013 0.000 _elementpath.py:87(select)
1 0.000 0.000 0.000 0.000 codecs.py:77(__new__)
1 0.000 0.000 0.000 0.000 utf_8.py:15(decode)
1 0.000 0.000 0.000 0.000 utf_8.py:33(getregentry)
102 0.008 0.000 5.601 0.055 xml2db.py:25(add_paper)
680 0.003 0.000 0.006 0.000 xml2db.py:31(<genexpr>)
102 0.005 0.000 15.468 0.152 xml2db.py:36(populate_database)
477 0.003 0.000 0.013 0.000 xml2db.py:45(clean_parse)
101 0.002 0.000 0.005 0.000 xml2db.py:52(clear_element)
103 0.019 0.000 0.024 0.000 xml2db.py:57(extract_paper_elements)
1 0.017 0.017 15.557 15.5开发者_如何学C57 xml2db.py:63(fast_iter)
1 0.004 0.004 15.665 15.665 xml2db.py:89(main)
1 0.000 0.000 0.000 0.000 {__import__}
1 0.000 0.000 0.000 0.000 {_codecs.utf_8_decode}
1 0.000 0.000 0.000 0.000 {built-in method __new__ of type object at 0x8245fc0}
5 0.000 0.000 0.000 0.000 {built-in method findall}
1 0.000 0.000 0.000 0.000 {hasattr}
2 0.000 0.000 0.000 0.000 {isinstance}
515 0.001 0.000 0.001 0.000 {iter}
107 0.000 0.000 0.000 0.000 {len}
477 0.010 0.000 0.010 0.000 {lxml.etree.strip_tags}
5 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
101 0.002 0.000 0.002 0.000 {method 'clear' of 'lxml.etree._Element' objects}
1 0.000 0.000 0.000 0.000 {method 'cursor' of 'oursql.Connection' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
778 0.007 0.000 0.007 0.000 {method 'encode' of 'str' objects}
5 0.000 0.000 0.000 0.000 {method 'encode' of 'unicode' objects}
516 15.544 0.030 15.544 0.030 {method 'execute' of 'oursql.Cursor' objects}
408 0.004 0.000 0.023 0.000 {method 'find' of 'lxml.etree._Element' objects}
102 0.001 0.000 0.012 0.000 {method 'findall' of 'lxml.etree._Element' objects}
103 0.001 0.000 0.001 0.000 {method 'format' of 'str' objects}
2 0.000 0.000 0.000 0.000 {method 'get' of 'dict' objects}
204 0.001 0.000 0.001 0.000 {method 'get' of 'lxml.etree._Element' objects}
100 0.000 0.000 0.000 0.000 {method 'getparent' of 'lxml.etree._Element' objects}
201 0.001 0.000 0.001 0.000 {method 'getprevious' of 'lxml.etree._Element' objects}
510 0.004 0.000 0.004 0.000 {method 'iterchildren' of 'lxml.etree._Element' objects}
1 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
102 0.000 0.000 0.000 0.000 {method 'keys' of 'dict' objects}
204 0.001 0.000 0.001 0.000 {method 'replace' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'split' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'translate' of 'str' objects}
102 0.000 0.000 0.000 0.000 {method 'values' of 'dict' objects}
2 0.000 0.000 0.000 0.000 {time.time}
It seems to me you are running a whole bunch of individual insert
statements. Turning on logging on the mysql db should show you a bunch of statements like this:
....
INSERT INTO dblp2.authors (name) VALUES (a) ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)
INSERT INTO dblp2.authors (name) VALUES (b) ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)
INSERT INTO dblp2.authors (name) VALUES (c) ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)
INSERT INTO dblp2.authors (name) VALUES (d) ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)
INSERT INTO dblp2.authors (name) VALUES (e) ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)
INSERT INTO dblp2.authors (name) VALUES (f) ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)
....
You want to run either paramaterized/multiple insert
statement, or package everything into a csv file and do a bulk insert
.
Bulk insert is faster, and it supports replacing duplicate rows (check the docs). If you want a true update, use multiple inserts or try this temp table idea.
Multiple insert statements would be sort of like this:
cursor.execute(query_author, [author.encode('utf8') for author in authors])
which should give you entries in the log like this:
INSERT INTO dblp2.authors (name) VALUES (a,b,c,d,e,f) ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)
You may have some issues with the ids for the second table.
Each database provides the EXPLAIN command for figuring out executing plans.
In addition: Python has a profiler to detect which code is slow.
So first: you making your analysis and and if you don't get any further you come back.
精彩评论