Basically I have an XML feed from an offsite server. The feed takes a single parameter, ?value=n, where n can only be between 1 and 30. Whatever value I pick, the XML file always returns 4,000 rows. Once a day my script will call the feed 30 times, once per value, so that's 120,000 rows in total. I will be running quite complicated queries on these rows, but the main thing is that I will always filter by value first, something like SELECT * FROM my_table WHERE value = N. That filter will ALWAYS be used.
Now, is it better to have one table where all 120k rows are stored, or 30 tables with 4k rows each?
EDIT: The database in question will be MySQL.
EDIT: Just to make it a bit clearer: the data will be UPDATED every day, so the old data will be overwritten. I do not want any archive solution, just the best way to store the data with as few performance bottlenecks as possible. The database results, once output, will be cached and refreshed daily.
EDIT: I guess I was being too vague for my own good :( Basically the feeds are leaderboards; each value is a different leaderboard location.
The values will only be updated if a leaderboard position changes, and there will always be ONLY 120k rows: no more, no less.
Let's say this is the current leaderboard:
- Blue
- Green
- Red
Then on the next update the feed returns:
- Blue
- Red
- Green
Only rows 2 and 3 will change. That is my plan anyway :)
ANOTHER EDIT >.<: The rows will contain at most 12 columns and take less than 1 KB per row, and the update will only happen ONCE a day, because the server the feeds come from is slow: it takes my server 80 minutes to fetch all 30 feed values.
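To make it concrete, here is roughly the kind of daily update I have in mind; the schema below is a placeholder, not my real table:

    -- Placeholder schema: one row per (leaderboard, position); ~12 columns in reality.
    CREATE TABLE leaderboard (
        feed_value TINYINT UNSIGNED  NOT NULL,  -- the ?value=n parameter, 1..30
        position   SMALLINT UNSIGNED NOT NULL,  -- 1..4000 within that leaderboard
        entry_name VARCHAR(255)      NOT NULL,
        PRIMARY KEY (feed_value, position)
    );

    -- For each row parsed from the feed; if the stored values already match,
    -- MySQL reports 0 affected rows, so unchanged positions cost very little.
    INSERT INTO leaderboard (feed_value, position, entry_name)
    VALUES (1, 2, 'Red')
    ON DUPLICATE KEY UPDATE entry_name = VALUES(entry_name);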
In terms of storage, there is little difference between one 120k-row table and thirty 4k-row tables.
In terms of maintenance, I would always go with one table. It makes your code and SQL much easier to work with, and since you are already filtering with a WHERE clause, I don't see any reason to split the table.
As long as you index properly, one table will be faster. You'll definitely need an index on your value column (which you should rename, as 'value' is a reserved word in SQL).
At the volume you're considering, storage should not be an issue. If you're doing this long term, you may want to investigate archiving solutions for old data.
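For instance, a minimal sketch with the filter column renamed to feed_value (all names here are assumptions, since no schema was posted):

    -- Index the column you always filter on; 'feed_value' sidesteps the reserved word.
    CREATE INDEX idx_feed_value ON leaderboard (feed_value);

    -- The daily queries then hit the index:
    SELECT * FROM leaderboard WHERE feed_value = 7;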
A single table is my preferred choice.
I understand it will contain data from more than a single import, but that's where a WHERE clause comes in. Queries may not initially come back as fast as you'd like, but you can solve that with proper indexing.
More importantly, what will you do if for some reason you move to 45 loads a day, or 90, or one every 5 minutes (12 * 24 = 288 a day)? Creating 288 tables and changing every query that references them would be a huge exercise.
You want one table. Otherwise, you'd have to write 30 different queries, or build a dynamic query solution (yuck).
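To make the contrast concrete (table and column names below are assumed, not from the question):

    -- One table: a single query, with the feed number as an ordinary parameter.
    SELECT * FROM leaderboard WHERE feed_value = 7;

    -- Thirty tables: the table name itself varies per feed, so the SQL has to be
    -- built as a string in application code (leaderboard_01 ... leaderboard_30):
    SELECT * FROM leaderboard_07;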
How wide are the rows? More to the point, how many rows fit on an 8 KB SQL page? (You can estimate your disk I/O from that.) How long does it take your hardware to read that much data? Or can it all fit in memory, so that you're rarely hitting the disk at all? My point being: do you actually have a performance problem?
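To put rough numbers on it using the figures in the question: at just under 1 KB per row, about 8 rows fit on an 8 KB page, so 120k rows come to roughly 15,000 pages, around 120 MB in total. That comfortably fits in the buffer cache of even a modest server, so after the first read the disk barely gets touched.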
Putting a compound clustered index on the table with your "n" value as the first column would optimize these reads (but only if you always have an "n" value in the WHERE clause). Alternatively, since "n" always falls in the fixed range 1 to 30, and if you are on SQL Server 2005 or later, you could implement table partitioning, which would give you the same performance boost and possibly a bit more flexibility when loading or unloading data.
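Since you're on MySQL rather than SQL Server, a rough equivalent sketch (MySQL 5.1 or later; all names are assumptions) might look like this:

    -- InnoDB clusters rows by primary key, so putting feed_value first
    -- gives the compound "clustered index on n" effect described above.
    CREATE TABLE leaderboard (
        feed_value TINYINT UNSIGNED  NOT NULL,  -- the 1..30 feed parameter
        position   SMALLINT UNSIGNED NOT NULL,
        entry_name VARCHAR(255)      NOT NULL,
        PRIMARY KEY (feed_value, position)
    )
    -- Optional: one partition per feed value (1..30 map one-to-one under MOD 30).
    PARTITION BY HASH (feed_value) PARTITIONS 30;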
As all the others have said, go with one table. One table of this size will not create a performance bottleneck on the database side, unless your database is already badly set up, in which case it will reveal the problem, not cause it. If you do a detailed performance analysis of every component in the flow, from the moment the user starts the request to the moment the results come back, you will see that in your example the database adds no significant overhead. And as the other answers have pointed out, you must define the right index or indexes, based on your specific queries.
As Oded said, there's no real scaling/performance issue at 120K rows, so I'd go for a single table, too (to keep things simple).
If in the future you need to scale a lot, keep in mind this article on "why SQL databases don't scale". Among other things, it explains why "partitioning" (or "sharding") is bad for an SQL database:
Sharding divides your data along some kind of application-specific boundary. For example, you might store users whose names start with A-M on one database, and N-Z on another. Or use a modulo of the user id by the number of databases.
This requires deep integration into the application and careful planning of the partitioning scheme relative to the database schema and the kinds of queries you want to do. Summary: big pain in the ass.
So while sharding is a form of horizontal scaling, it fails point #2: it is not transparent to the business logic of the application.
The deeper problem with sharding is that SQL databases are relational databases, and most of the value in a relational database is that it stores relationships. Once you split records across multiple servers, you're severing many of those relations; they now have to be reconstructed on the client side. Sharding kills most of the value of a relational database.
Although the article is talking about partitioning across multiple databases, the same reasoning applies to your case, where you'd be doing a sort of "internal" partitioning into 30 tables.
Conclusion: the answer for real scaling is NoSQL. Again, though, not at 120K rows :)