Close to serial text-file reading performance in MySQL

I am trying to perform some n-gram counting in python and I thought I could use MySQL (MySQLdb module) for organizing my text data.

I have a pretty big table, around 10 million records, representing documents that are indexed by a unique numeric id (auto-increment) and by a language varchar field (e.g. "en", "de", "es", etc.).

select * from table is too slow and devastating on memory. I ended up splitting the whole id range into smaller ranges (say 2000 records wide each) and processing each of those smaller record sets one by one with queries like:

select * from table where id >= 1 and id <= 2000
select * from table where id >= 2001 and id <= 4000

and so on...
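
For concreteness, here is a minimal sketch of that chunking loop, assuming the MySQLdb module from the question; the table name (documents), the column layout, and the connection parameters are placeholders, not anything from the original post.

import MySQLdb
CHUNK = 2000  # records per range, as described above
conn = MySQLdb.connect(host="localhost", user="user", passwd="pass", db="corpus")
cur = conn.cursor()
# Find the top of the id range so the loop knows when to stop.
cur.execute("SELECT MAX(id) FROM documents")
max_id = cur.fetchone()[0] or 0
for start in range(1, max_id + 1, CHUNK):
    cur.execute("SELECT * FROM documents WHERE id >= %s AND id <= %s", (start, start + CHUNK - 1))
    for row in cur.fetchall():
        pass  # n-gram counting goes here
cur.close()
conn.close()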

Is there any way to do it more efficiently with MySQL and achieve similar performance to reading a big corpus text file serially?

I don't care about the ordering of the records, I just want to be able to process all the documents that pertain to a certain language in my big table.


You can use the HANDLER statement to traverse a table (or index) in chunks. This is not very portable, and it interacts in an "interesting" way with transactions if rows appear and disappear while you're looking at the table (hint: you're not going to get consistency), but it makes the code simpler for some applications.
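
A minimal sketch of what that traversal could look like from Python, assuming MySQLdb and the hypothetical documents table from the earlier example (HANDLER READ without an index name walks the table in natural row order):

import MySQLdb
conn = MySQLdb.connect(host="localhost", user="user", passwd="pass", db="corpus")
cur = conn.cursor()
# Open a handler and walk the table in natural row order, 2000 rows at a time.
cur.execute("HANDLER documents OPEN")
cur.execute("HANDLER documents READ FIRST LIMIT 2000")
rows = cur.fetchall()
while rows:
    for row in rows:
        pass  # process each document row here
    cur.execute("HANDLER documents READ NEXT LIMIT 2000")
    rows = cur.fetchall()
cur.execute("HANDLER documents CLOSE")
conn.close()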

In general, you are going to take a performance hit: even if your database server is local to the machine, several copies of the data have to exist in memory, on top of other processing overhead. This is unavoidable, and if it really bothers you, you shouldn't use MySQL for this purpose.


Aside from having indexes defined on whatever columns you're using to filter the query (probably language and id, where id already has an index courtesy of the primary key), no.
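
As a sketch of how those indexes could be put to work, assuming a composite index on (lang, id) and the hypothetical documents table from the earlier examples, each chunk then becomes an index range scan that resumes where the previous one stopped (keyset pagination), instead of a fixed id range:

import MySQLdb
conn = MySQLdb.connect(host="localhost", user="user", passwd="pass", db="corpus")
cur = conn.cursor()
# One-time setup (assumed names): a composite index covering both the
# language filter and the id range condition.
# cur.execute("CREATE INDEX idx_lang_id ON documents (lang, id)")
last_id = 0
while True:
    # Resume from the last id seen instead of using fixed ranges, so gaps
    # in the id sequence never produce short or empty chunks.
    cur.execute("SELECT id, doc FROM documents WHERE lang = %s AND id > %s ORDER BY id LIMIT 2000", ("en", last_id))
    rows = cur.fetchall()
    if not rows:
        break
    for doc_id, doc in rows:
        pass  # n-gram counting goes here
    last_id = rows[-1][0]
conn.close()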


First: you should avoid using * when you can specify the columns you need (lang and doc in this case). Second: unless you change your data very often, I don't see the point of storing all this in a database, especially if you are storing file names. You could use an XML format, for example, and read/write it with a SAX API.

If you want a DB and something faster than MySQL, you can consider an in-memory database such as SQLite or Berkeley DB, both of which have Python bindings.
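
A minimal sketch of the SQLite route, using Python's standard-library sqlite3 module; the file name and schema are assumptions carried over from the earlier examples:

import sqlite3
# Pass ":memory:" instead of a file name for a purely in-memory database.
conn = sqlite3.connect("corpus.db")
cur = conn.cursor()
# sqlite3 cursors stream rows lazily, so iterating stays close to the speed
# of reading a file serially, without loading the whole table at once.
cur.execute("SELECT id, doc FROM documents WHERE lang = ?", ("en",))
for doc_id, doc in cur:
    pass  # n-gram counting goes here
conn.close()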

Greetz, J.
