I have a large file, with 1.8 million rows of data, that I need to be able to read for a machine learning program I'm writing. The data is currently in a CSV file, but I can obviously move it into a database or another structure if required; it won't need to be updated regularly.
The code I'm using at the moment is below. I first import the data into an ArrayList and then pass it to a TableModel. This is very slow, currently taking six minutes to process just the first 10,000 rows, which is not acceptable, as I need to be able to test different algorithms against the data fairly often.
My program will only need to access each row of the data once, so there's no need to hold the whole dataset in RAM. Am I better off reading from a database, or is there a better way to read the CSV file line by line but do it much faster?
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;
import javax.swing.table.DefaultTableModel;
import javax.swing.table.TableModel;

public class CSVpaser {

    public static TableModel parse(File f) throws FileNotFoundException {
        ArrayList<String> headers = new ArrayList<String>();
        ArrayList<String> oneDdata = new ArrayList<String>();

        // Get the headers of the table.
        Scanner lineScan = new Scanner(f);
        Scanner s = new Scanner(lineScan.nextLine());
        s.useDelimiter(",");
        while (s.hasNext()) {
            headers.add(s.next());
        }

        // Now go through each line of the table and add each cell to the array list.
        while (lineScan.hasNextLine()) {
            s = new Scanner(lineScan.nextLine());
            s.useDelimiter(", *");
            while (s.hasNext()) {
                oneDdata.add(s.next());
            }
        }

        String[][] data = new String[oneDdata.size() / headers.size()][headers.size()];
        int numberRows = oneDdata.size() / headers.size();

        // Move the data into a vanilla array so it can be put in a table.
        for (int x = 0; x < numberRows; x++) {
            for (int y = 0; y < headers.size(); y++) {
                data[x][y] = oneDdata.remove(0);
            }
        }

        // Create a table and return it.
        return new DefaultTableModel(data, headers.toArray());
    }
}
Update: Based on the feedback I received in the answers, I've rewritten the code. It's now running in 3 seconds rather than 6 minutes (for 10,000 rows), which means only about ten minutes for the whole file... but any further suggestions for how to speed it up would be appreciated:
// Load the data file.
File f = new File("data/primary_training_short.csv");
Scanner lineScan = new Scanner(f);
Scanner s = new Scanner(lineScan.nextLine());
s.useDelimiter(",");

// Now go through each line of the results.
while (lineScan.hasNextLine()) {
    s = new Scanner(lineScan.nextLine());
    s.useDelimiter(", *");
    String[] data = new String[NUM_COLUMNS];

    // Get the data out of the CSV file so I can access it.
    int x = 0;
    while (s.hasNext()) {
        data[x] = s.next();
        x++;
    }

    // Insert code here which is executed for each line.
}
data[x][y] = oneDdata.remove(0);
That would be very inefficient: every time you remove the first entry from the ArrayList, all the remaining entries have to be shifted down one position.
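As a minimal sketch (reusing the variables from the question's parse method), copying by index avoids the shifting entirely:

int i = 0;
for (int x = 0; x < numberRows; x++) {
    for (int y = 0; y < headers.size(); y++) {
        // get() is O(1) on an ArrayList; no elements are shifted.
        data[x][y] = oneDdata.get(i++);
    }
}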
At a minimum you would want to create a custom TableModel so you don't have to copy the data twice.
If you want to keep the data in a database then search the net for a ResultSet TableModel.
If you want to keep it in CSV format then you can use the ArrayList as the data store for the TableModel. So your Scanner code would read the data directly into the ArrayList. See List Table Model for one such solution. Or you might want to use the Bean Table Model.
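To illustrate the idea (this is my own minimal sketch, not the List Table Model linked above), a read-only model can wrap the parsed rows directly so they are never copied into a String[][]:

import java.util.List;
import javax.swing.table.AbstractTableModel;

// Read-only TableModel backed directly by the parsed CSV rows.
public class CsvTableModel extends AbstractTableModel {
    private final List<String> headers;
    private final List<String[]> rows;

    public CsvTableModel(List<String> headers, List<String[]> rows) {
        this.headers = headers;
        this.rows = rows;
    }

    public int getRowCount() { return rows.size(); }
    public int getColumnCount() { return headers.size(); }
    public String getColumnName(int col) { return headers.get(col); }
    public Object getValueAt(int row, int col) { return rows.get(row)[col]; }
}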
Of course the real question is who is going to have time to browse through all 1.8M records? So you really should use a database and have query logic to filter the rows that are returned from the database.
My program will only need to access each row of the data once, so there's no need to hold the whole dataset in RAM
So why are you displaying it in a JTable? This implies the entire data will be in memory.
SQLite is a very lightweight, file-based database and, in my opinion, the best solution for your problem.
Check out this very good driver for Java. I use it for one of my NLP projects and it works really well.
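For example, with a SQLite JDBC driver on the classpath, streaming the rows could look roughly like this (the database file and table name are placeholders, and it assumes the CSV has already been imported into the database):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqliteScan {
    public static void main(String[] args) throws Exception {
        // Placeholder file/table names; the CSV is assumed to be imported beforehand.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:data/training.db");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM training")) {
            while (rs.next()) {
                String firstColumn = rs.getString(1); // JDBC columns are 1-based
                // feed the row to the learning algorithm here
            }
        }
    }
}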
This is what I understood: your requirement is to perform some algorithm on the loaded data, and to do so at runtime, i.e.:
- Load a set of data
- Perform some calculation on it
- Load another set of data
- Perform more calculation, and so on until we reach the end of the CSV
Since there is no correlation between the two sets of data, and the algorithm/calculation you're running on the data is custom logic (for which there is no built-in function in SQL), you can do this in Java without any database at all, and this should be the fastest approach.
However, if the logic/calculation you're performing on the data has an equivalent function in SQL, and a separate database is running on good hardware (i.e. more memory/CPU), executing the whole thing as a procedure/function in SQL could perform better.
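If you take the pure-Java route, a minimal streaming sketch could look like the following (the file name is taken from the question; a plain split(",") assumes the fields contain no quoted commas):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StreamingCsv {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("data/primary_training_short.csv"))) {
            String[] headers = in.readLine().split(",");
            String line;
            while ((line = in.readLine()) != null) {
                String[] row = line.split(",");
                // run the calculation on this row; nothing else is kept in memory
            }
        }
    }
}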
You can use the opencsv package; its CSVReader can iterate over large CSV files. You should also use online learning methods such as Naive Bayes or linear regression for data this large.
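A rough sketch with opencsv (the package name varies by version; older releases use au.com.bytecode.opencsv) might look like this:

import java.io.FileReader;
import com.opencsv.CSVReader; // older releases: au.com.bytecode.opencsv.CSVReader

public class OpenCsvScan {
    public static void main(String[] args) throws Exception {
        try (CSVReader reader = new CSVReader(new FileReader("data/primary_training_short.csv"))) {
            String[] row;
            while ((row = reader.readNext()) != null) {
                // one parsed row per call; quoted fields and embedded commas are handled
            }
        }
    }
}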