
Using sqlite for very large merges and basic queries


I'm a new guy to databases, and I'm trying to figure out a good solution for dealing with large datasets. I mostly do statistical analyses using R, so I don't need a database as the backend of web pages or anything. My datasets are generally static - they are just big.

I was trying to do a simple left join of a ~10,000,000 record table on a ~1,400,000 record table. The 1.4 m table had unique records. After churning for 3 hours, it quit on me. The query was specified correctly - I ran it limiting the retrievals to 1000 records and it returned exactly as I expected. Eventually, I found a way to split this up into 10 queries and it ran, but by this time, I was able to do that merge in R pretty quickly, without all the fancy calls to sqlite and indexing.

I've been looking to use databases because I thought they were faster/more effective for these basic data manipulations, but maybe I'm just overlooking something. In the above example, I had indexed the appropriate columns, and I'm surprised that sqlite could not handle it whilst R could.
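Roughly, the setup looked like the sketch below (table and column names here are placeholders, not my real schema; I was calling SQLite from R via RSQLite):

    library(DBI)
    library(RSQLite)

    con <- dbConnect(RSQLite::SQLite(), "analysis.db")

    # index the join key on both tables before running the merge
    dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_big_key   ON big_table (key_id)")
    dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_small_key ON small_table (key_id)")

    # the left join of the ~10 m record table onto the ~1.4 m table of unique records
    result <- dbGetQuery(con, "
        SELECT b.*, s.extra_col
        FROM big_table b
        LEFT JOIN small_table s ON b.key_id = s.key_id")

    dbDisconnect(con)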

Sorry if this question is a little foggy (I'm a little foggy on databases), but if anyone has any advice on something obvious I'm doing wrong to not take advantage of the power of sqlite, that would be great. Or am I just expecting too much of it, and a 10 m X 1.4 m record merge is just too big to execute without breaking it up?

I would think that a database could outperform R in this respect?

thanks!

EXL


I am going through the same process. If you look through the questions I've asked recently, you may get some good pointers, or at least avoid a lot of the time I've wasted :). In short, here's what's been most helpful to me.

-- the RSQLite package

-- the RSQLite.extfuns package

-- the SQLite FAQ

I'm still a newbie, but in general, you should be using SQLite for subsetting data that is too large to bring into RAM. I would think that if the data are small enough to handle in RAM, then you're better off using the native R tools for joins/subsets. If you find that you become more comfortable with SQL queries, then there is the sqldf package. Also, JD Long has a great discussion on using sqldf with large datasets.
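For a rough idea of what sqldf usage looks like (the data frames below are invented for illustration), it lets you write plain SQL against ordinary R data frames, and its dbname argument keeps the working SQLite database on disk when the data won't fit in memory:

    library(sqldf)

    # two small stand-in data frames; your real tables would be much larger
    big   <- data.frame(key_id = sample(1:1000, 5000, replace = TRUE), x = rnorm(5000))
    small <- data.frame(key_id = 1:1000, label = sample(letters, 1000, replace = TRUE))

    # sqldf loads the data frames into a temporary SQLite db and runs the SQL
    merged <- sqldf("SELECT b.*, s.label
                     FROM big b
                     LEFT JOIN small s ON b.key_id = s.key_id")

    # for data too large for RAM, point sqldf at an on-disk database instead
    merged <- sqldf("SELECT b.*, s.label
                     FROM big b
                     LEFT JOIN small s ON b.key_id = s.key_id",
                    dbname = tempfile())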


I have to admit that I'm surprised that this has been a problem for you. SQLite has always worked well for me, at least speed-wise. However -- SQLite is easy because it is so flexible. SQLite can be dangerous because it is so flexible. SQLite tends to be very forgiving with data types. Sometimes this is an absolute god-send, when I don't want to spend a bunch of time tweaking things to perfection, but with great flexibility comes great responsibility.

I have noticed that I need to be careful moving data into SQLite. Text is easy. However, sometimes numbers get stored as text rather than numbers. Doing a JOIN on a column of numbers is faster than the same JOIN on a column of text. If your number columns are stored as text and then coerced into numbers for the comparison, you would lose most of the advantage of using an index.
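One quick way to check for that (a sketch only; the table and column names are hypothetical) is SQLite's typeof() function, which reports the storage class actually used for the values in a column:

    library(DBI)
    library(RSQLite)
    con <- dbConnect(RSQLite::SQLite(), "analysis.db")

    # if the join key was loaded as text, 'text' shows up here instead of 'integer'
    dbGetQuery(con, "SELECT typeof(key_id) AS storage_class, COUNT(*) AS n
                     FROM big_table
                     GROUP BY typeof(key_id)")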

I don't know how you got your data into SQLite, so the first thing I would do is look at your table schemas and make sure they make sense. And while they may seem obvious, indexes can be tricky. Taking a look at the queries might also result in something useful.
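As a sketch of how I would poke at that (table names are placeholders): PRAGMA table_info shows the declared column types, and EXPLAIN QUERY PLAN tells you whether SQLite actually plans to use an index for the join or will fall back to scanning the whole table:

    library(DBI)
    library(RSQLite)
    con <- dbConnect(RSQLite::SQLite(), "analysis.db")

    # declared type, NOT NULL flag, and primary-key flag for each column
    dbGetQuery(con, "PRAGMA table_info(big_table)")
    dbGetQuery(con, "PRAGMA table_info(small_table)")

    # the plan should mention 'USING INDEX ...' rather than a full scan on the lookup side
    dbGetQuery(con, "EXPLAIN QUERY PLAN
                     SELECT b.*, s.extra_col
                     FROM big_table b
                     LEFT JOIN small_table s ON b.key_id = s.key_id")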

Without being able to see the underlying structure and queries, answers to this question will be educated guesses.

