SQLite database design with millions of 'url' strings - slow bulk import from CSV

I'm trying to create an SQLite database by importing a csv file of urls. The file has about 6 million strings. Here are the commands I've used:

create table urltable (url text primary key);
.import csvfile urltable

After about 3 million urls the import slows down drastically and my hard disk spins continuously. I've tried splitting the csv file into quarter-sized chunks, but I run into the same problem.

I read similar posts on Stack Overflow and tried using BEGIN...COMMIT blocks and PRAGMA synchronous=OFF, but neither helped. The only way I was able to create the database was by removing the primary key constraint from url. But then, when I run a select to find a particular url, it takes 2-3 seconds, which won't work for my application. With the primary key set on url, the select is instantaneous. Please advise me on what I'm doing wrong.

[Edit] Summary of suggestions that helped (a combined sketch follows the list):

  • Reduce the number of transactions
  • Increase page size & cache size
  • Add the index later
  • Remove redundancy from url
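
Putting those together, a minimal sketch of the whole load (url_index and the example URLs are placeholders of mine, not from the original post; the PRAGMA values come from the first answer below):

PRAGMA page_size = 4096;            -- must run before the first table is created
PRAGMA cache_size = 72500;          -- measured in pages; see the first answer below
PRAGMA synchronous = OFF;           -- fewer syncs; risks corruption if the machine crashes mid-import

CREATE TABLE urltable (url TEXT);   -- no PRIMARY KEY while loading

BEGIN;                              -- one transaction per large batch, not per row
INSERT INTO urltable VALUES ('http://example.com/a');
INSERT INTO urltable VALUES ('http://example.com/b');
COMMIT;

CREATE UNIQUE INDEX url_index ON urltable(url);   -- build the index once, at the end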

Still, with a primary index, the database is more than double the size of the original csv file I was trying to import. Any way to reduce that?


Increase your cache size to something large enough to hold all of the data in memory. The default page size and cache size are relatively small; if this is a desktop application, you can easily increase the cache size many times over.

PRAGMA page_size = 4096;
PRAGMA cache_size = 72500;

That will give you a cache of just under 300 MB (72,500 pages × 4,096 bytes ≈ 297 MB). Remember that the page size must be set before the database is created. The default page size is 1024 bytes and the default cache size is 2000 pages.

Alternatively (and almost equivalently), you can build the database entirely in an in-memory database and then use the backup API to move it to an on-disk database.
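
For instance, with the sqlite3 shell this might look like the following, where urls.db is an assumed output filename (a sketch; language bindings expose the same backup API programmatically):

$ sqlite3 :memory:
sqlite> create table urltable (url text primary key);
sqlite> .import csvfile urltable
sqlite> .backup main urls.db

The .backup command copies the fully built in-memory database to disk in one pass, so the random I/O of index maintenance happens in RAM rather than against the spinning disk.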


A PRIMARY KEY or UNIQUE constraint will automatically generate an index. An index will dramatically speed up SELECTs, at the expense of slowing down INSERTs.

Try importing your data into a non-indexed table, and then explicitly CREATE UNIQUE INDEX _index_name ON urltable(url). It may be faster to build the index all at once than to maintain it one row at a time.
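
A minimal sketch of that flow in the shell, reusing the commands from the question (keeping the answer's placeholder _index_name):

create table urltable (url text);                  -- note: no primary key
.import csvfile urltable                           -- bulk load with no index to maintain
CREATE UNIQUE INDEX _index_name ON urltable(url);  -- one bulk index build at the end

Note that CREATE UNIQUE INDEX will fail if the csv contains duplicate urls, so deduplicate first if that's a possibility.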
