Speed up processing from CSV file_问答_开发者_运维开发者技术经验分享

I have a project and I have to improve its performance. I have a big Mysql database constructed from a huge CSV file (100 millions rows). Insert time is not a problem, but response time of a request is much important and sometimes a query with 2 join takes around 20 hours...

In a goal to reduce this response time, I tried to migrate my database to Cassandra but without success : my data model is not appropried to Cassandra concepts. Then I would like to try another way to improve performance : Parallel Virutal File System. Instead insert data in Mysql database and send then send a query, I tried to read the whole csv file with multi-threading and did my computing. But the result was not good : 2m20s for only 1 000 000 rows.

For the moment, my computing is very simple : in C++ with the MPI-IO API, I just count the number of differents pair value from 2 columns. To perfom that computing, I use a hashmap where each key is a pair value from csv file. At the end, I return the hashmap size. Here a small code :

 MPI::Init(argc,argv); 
 cout << " INFO init done" << endl;
 int myrank = MPI::COMM_WORLD.Get_rank(); 
 int numprocs = MPI::COMM_WORLD.Get_size(); 
 get_filename(path_name, myrank);
 cout << " INFO open file : " << path_name << endl;
 MPI::File thefile = MPI::File::Open(MPI::COMM_WORLD, path_name.c_str(), 
                  MPI::MODE_RDONLY, 
                  MPI::INFO_NULL); 
 MPI::Offset offset = 101;
 MPI::Offset limit = thefile.Get_size();
 cout << " INFO go computing" << endl;
 do {
   thefile.Read_at(offset, buf, bufsize, MPI_CHAR, status);
   temp.assign(buf);
   Tokenize(temp,tokens,"\n");
   line.assign(tokens.at(0));
   tokens.clear();

   Tokenize(line,tokens,"\t");
   nidt_count(tokens);
   tokens.clear();
   offset += (line.size() + 1);
 }while(offset < limit);
 count = status.Get_count(MPI_INT);
 cout << "process " << myrank << " reads " << nidt_hash.size() << " nidt" << endl;

I work on a server with 4 cores, 8GB of ram. My data is on a NAS mounted in NFS or Samba on my server. I would be able to add 2 or 3 server for processing, but for the moment I just tried on a small file (1 Million rows) on one server to measure performance.

Finally my questions are :

Is it a good way to think to change to PVFS kind for my issue? I would like to say that I will process with more complicated query like : select all rows with a specific date(range hours), and specific pair value from specific columns.
Do you know others things that could help me to improve processing from a csv file? I am thinking to use Hadoop, Pytables or FasterCSV.

Here it is a sample of my data composed by 2 csv file :

The biggest one (100 Millions rows) is composed as follow :

ID        DATE             NUM_1        NUM_2     NB_UNITE TYPUNIT CODE_1 CODE_2

0  2007-05-13 15:37:48  33671624244  33698802900    547      s       0      17
0  2007-05-13 15:52:22  33671624244  33672211799      5      s       0      17 
....

The second one is more simple and small (90 000), it is like a dictionary where from a code_1 and code_2 I get a value named CODEVAL:

CODE_1 CODE_2 CODEVAL

  0       17     VS
  0       34     SS

As you expected, usually I create 2 tables one for each file and a typical query is :

Select CODEVAL, hour(date) AS HEURE, COUNT(*) AS NBSMSSOR 
From Tables_1 Join CODEVAL using(CODE_1,CODE_2) 
Where CODEVAL='SS'

Sorry for the presentation, I do not know how to make a array.

Here it is a sample of my data composed by 2 csv file :

the biggest one (100 Millions r开发者_如何学Pythonows) is composed as follow :

ID DATE NUM_1 NUM_2 NB_UNITE TYPUNIT CODE_1 CODE_2

0 2007-05-13 15:37:48 33671624244 33698802900 547 s 0 17
0 2007-05-13 15:52:22 33671624244 33672211799 5 s 0 17 ....
the second one is more simple and small (90 000), it's like a dictionary where from a code_1 and code_2 I get a value named CODEVAL:

CODE_1 CODE_2 CODEVAL

0 17 VS

0 34 SS

As you expected, usually I create 2 tables one for each file and a typical query is :

Select CODEVAL, hour(date) AS HEURE, COUNT(*) AS NBSMSSOR From Tables_1 Join CODEVAL using(CODE_1,CODE_2) Where CODEVAL='SS'

Sorry for the presentation, I do not know how to make a array.

It looks to me like you're I/O bound. It doesn't help that your data is over a network. I suspect that if you just add more machines then your performance will go DOWN because of the extra contention. Remember that there's still just one spindle and just one HD head reading your data. For the MPI solution I'd suggest making multiple copies of the data and putting them on the servers themselves.

For MySQL, I hear what you're saying. I found MySQL to be very inefficient with joins. I looks to me like it does full-table scans when it could get away without them. I remember MySQL taking over a minute on a query that Oracle would take less than a second. Maybe try PostgreSQL? I'm not sure if it's any better. Another approach could be to have the db sort the data for you so that you can then do the scan without a hashmap.

Unless your records are ginormous, 100M records shouldn't be that bad.

If you read the data from the CSV, I assume it won't change too often. So instead of loading it into a generic database product, you could also construct your own index over the CSV data. Or do you need to have full SQL support?

Apart from that you mention that you want to return the NUMBER of different K,V-Pairs. However, you really compute the actual pairs. I don't know if you need them for some other purpose, but you could also get that number as #distinctKeys x #distinctValues without actually building a HashMap.

Assuming you build an index for each colum of form

value -> {r | r is a byteOffset of a row that has "value" in the index column}

you could answer many, many queries and especially determining the number of distinct pairs should only take a couple of milliseconds.

I hope this answer is helpful, since I am not sure what other requirements have to be met. This solution is significantly less powerful that a DB supporting SQL (especially inserts will make stuff a lot more complicated) but at least determining the number of distinct pairs should be faster by several orders of magnitude

divide and conquer A hundred small databases should be WAY faster. you decide how to break it up - use split() or slice() I am currently using the first character of the first word of each line, so where there once was one huge slow DB there is now (A - Z + a - z + 0 - 9) 62 small faster databases. Another advantage is that a laptop can now do the job that only a powerful, expensive PC could do before