High read performance database solution for storing simple data on Windows

I thought I had a simple problem, but I have been struggling with it for the last few days. To make a long story short, here is a description of it:

  • I get about 1 million new records daily; each record consists of record_name (a string of about 20 characters), a date, and a value, and records are kept for about two years (~700 million records in the database);
  • most of the record_names repeat every day;
  • I need the ability to find the biggest value gains between two given dates, while preserving the ability to filter the results by wildcarding the record_name (see the sketch after this list);
  • the software running this needs to work on Windows XP;
  • the most important thing is the execution time of each query.
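
To make the query requirement concrete, here is a minimal sketch using SQLite purely as a neutral stand-in; the table layout, column names, and the LIKE-based wildcard are assumptions drawn from the description above:

```python
import sqlite3

# Illustrative schema: one row per (record_name, date) with a value.
conn = sqlite3.connect("records.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS records (
        record_name TEXT NOT NULL,
        date        TEXT NOT NULL,   -- ISO yyyy-mm-dd
        value       REAL NOT NULL,
        PRIMARY KEY (record_name, date)
    )
""")

def biggest_gains(start, end, pattern="%", limit=10):
    """Top value gains between two dates, with record_name
    filtered by a SQL LIKE pattern (e.g. 'AB%')."""
    return conn.execute("""
        SELECT s.record_name, e.value - s.value AS gain
        FROM records AS s
        JOIN records AS e ON e.record_name = s.record_name
        WHERE s.date = ? AND e.date = ?
          AND s.record_name LIKE ?
        ORDER BY gain DESC
        LIMIT ?
    """, (start, end, pattern, limit)).fetchall()

print(biggest_gains("2011-01-03", "2011-06-30", "AB%"))
```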

So far I have tried MySQL and Cassandra. While MySQL has rather acceptable performance on Linux (by acceptable I mean that my not-so-high skills were sufficient to program something that works), it is very slow on Windows. The same goes for Cassandra.

The data inserted into those databases is imported from .csv files. The first import takes about 5 minutes into MySQL and 20 minutes into Cassandra, and subsequent imports take even longer. I suspect that I may have misconfigured something, but to be honest, I didn't change anything performance-related in the config files.

The question is: what is the best solution for working with this kind of data with performance in mind? The programming language is not important; almost any will work, as the queries are simple and implementing them in any language will not take much time.

Thank you very much for your interest in helping.


Without more detail, the problem posed seems to be a fairly classic BI problem, where the solution is usually to prematerialize aggregate data, optimizing for querying rather than insertion.

I'd ask a simple question first: do you need all the data in the DB? I'd be tempted to just cook the different aggregates (overall, per record type, etc.) by time range, e.g. by day or hour, into simpler, smaller rowsets, using simple console apps to read the rows, compute the aggregates, and then dump them into SQL for access by query (see the sketch below).
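
A minimal sketch of that console-app step, assuming each .csv row is record_name,date,value; the file names, table name, and the min/max choice of aggregates are all illustrative:

```python
import csv
import sqlite3
from collections import defaultdict

# Accumulate per-(record_name, date) aggregates in memory:
# (name, date) -> [min_value, max_value]
agg = defaultdict(lambda: [float("inf"), float("-inf")])
with open("daily_dump.csv", newline="") as f:
    for name, date, value in csv.reader(f):
        v = float(value)
        lo, hi = agg[(name, date)]
        agg[(name, date)] = [min(lo, v), max(hi, v)]

# Dump the (much smaller) aggregate rowset into SQL for querying.
conn = sqlite3.connect("aggregates.db")
conn.execute("""CREATE TABLE IF NOT EXISTS daily_agg (
    record_name TEXT, date TEXT, min_value REAL, max_value REAL,
    PRIMARY KEY (record_name, date))""")
conn.executemany(
    "INSERT OR REPLACE INTO daily_agg VALUES (?, ?, ?, ?)",
    [(n, d, lo, hi) for (n, d), (lo, hi) in agg.items()])
conn.commit()
```

The raw .csv files stay on disk as the system of record; only the small aggregate table ever has to answer queries.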

While it's cool to SELECT * across all rows, resist it unless there's a real business need. With aggregate queries in mind, the NoSQL storage is a distraction; plain ol' files would be fine for retaining your raw row storage, and tools like SQL are more than enough for the aggregate queries.


You need to identify the bottleneck first. Possible options: hard drive, database, driver, application code. You should also try PostgreSQL, but honestly your chances of having something run reliably on Windows are slim unless you go with Microsoft SQL Server (not free). Also, Cassandra and MySQL are VERY different and solve different problems.


Another option to consider is Berkeley DB. It's small, fast, scalable, and very lightweight. It supports various APIs, including SQL, key-value pairs (NoSQL), and a Java object persistence API. Generally speaking, Berkeley DB will outperform a general-purpose RDBMS because it's much smaller, more efficient, and has much less overhead. Berkeley DB runs on Windows and is an excellent choice for developers looking for simple, easy-to-use, embedded data management.
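
A hedged sketch of the key-value style of use, via the third-party bsddb3 Python binding (an assumption; any Berkeley DB binding works similarly), with the composite-key layout purely illustrative:

```python
import bsddb3  # third-party Berkeley DB binding; its use here is an assumption

# Composite key "record_name|date" keeps each record's dates adjacent
# in the B-tree; Berkeley DB stores raw bytes, so everything is encoded.
db = bsddb3.btopen("records.bdb", "c")
db[b"ACME|2011-01-03"] = b"17.25"
db[b"ACME|2011-01-04"] = b"18.10"

# Sorted keys make prefix scans cheap: position at the first "ACME|" key
# and walk forward until the prefix no longer matches.
key, value = db.set_location(b"ACME|")
while key.startswith(b"ACME|"):
    print(key.decode(), float(value))
    try:
        key, value = db.next()
    except bsddb3.db.DBNotFoundError:  # reached the last key
        break

db.close()
```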


SQL Server Express should fit your needs just fine.

Just aggregate the data: use SQL to compute the values you need and store those in other tables.

Using this method, you can easily get around any performance difficulties and size limitations of SQL Server Express.
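
A minimal sketch of that roll-up, assuming a raw_records table has already been loaded and a daily_summary table exists; the pyodbc binding, the connection string, and all names are assumptions:

```python
import pyodbc  # standard ODBC binding; connection details are illustrative

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=.\\SQLEXPRESS;"
    "DATABASE=records;Trusted_Connection=yes"
)
cur = conn.cursor()

# Roll the raw rows up into a small per-day summary table once per import;
# queries then hit daily_summary instead of the ~700M-row raw table.
cur.execute("""
    INSERT INTO daily_summary (record_name, date, min_value, max_value)
    SELECT record_name, date, MIN(value), MAX(value)
    FROM raw_records
    GROUP BY record_name, date
""")
conn.commit()
```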
