
Process 46,000 rows of a document in groups of 1000 using C# and Linq


I have this code below that executes. It has 46,000 records in the text file that I need to process and insert into the database. It takes forever if I just call it directly and loop over the rows one at a time.

I was trying to use LINQ to pull every 1,000 rows or so and throw them into a thread so I could process 3,000 rows at once and cut the processing time. I can't figure it out, though, so I need some help.

Any suggestions would be welcome. Thank you in advance.

var reader = ReadAsLines(tbxExtended.Text);
var ds = new DataSet();
var dt = new DataTable();

// Build the columns from the pipe-delimited header list
string headerNames = "Long|list|of|strings|";
var headers = headerNames.Split('|');
foreach (var header in headers)
    dt.Columns.Add(header);

// Skip the header line and add one row per record
var records = reader.Skip(1);
foreach (var record in records)
    dt.Rows.Add(record.Split('|'));

ds.Tables.Add(dt);
ds.AcceptChanges();

ProcessSmallList(ds);
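
Roughly, what I was picturing was something like this (sketch only, untested; BuildTable is just a stand-in for the table-building code above):

var lines = ReadAsLines(tbxExtended.Text).Skip(1);

// Batch the lines into groups of ~1,000 by index, then hand each batch to a task
var batches = lines
    .Select((line, index) => new { line, index })
    .GroupBy(x => x.index / 1000, x => x.line);

var tasks = batches
    .Select(batch => Task.Run(() => ProcessSmallList(BuildTable(batch))))
    .ToArray();

Task.WaitAll(tasks);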


If you are looking for high performance, look at SqlBulkCopy (bulk insert) if you are using SQL Server. The performance is significantly better than inserting row by row.

Here is an example using a custom CSVDataReader that I used for a project, but any IDataReader-compatible reader (SqlDataReader, OleDbDataReader, etc.), a DataRow[], or a DataTable can be passed to WriteToServer.

Dim sr As CSVDataReader
Dim sbc As SqlClient.SqlBulkCopy

' Table lock + keep identity values for maximum throughput
sbc = New SqlClient.SqlBulkCopy(mConnectionString, SqlClient.SqlBulkCopyOptions.TableLock Or SqlClient.SqlBulkCopyOptions.KeepIdentity)
sbc.DestinationTableName = "newTable"
'sbc.BulkCopyTimeout = 0

' Stream the CSV reader straight into the destination table
sr = New CSVDataReader(parentfileName, theBase64Map, ","c)
sbc.WriteToServer(sr)
sr.Close()

There are quite a number of options available; see the SqlBulkCopy documentation for the full list.
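
Since the question is in C#, the same idea with the DataTable that is already being built might look roughly like this (a sketch; the connection string and destination table name are placeholders):

// Sketch: bulk-copy an already-populated DataTable into SQL Server.
// Assumes using System.Data and System.Data.SqlClient (or Microsoft.Data.SqlClient);
// "YourConnectionString" and "dbo.NewTable" are placeholders.
static void BulkInsert(DataTable dt)
{
    using (var conn = new SqlConnection("YourConnectionString"))
    {
        conn.Open();
        using (var sbc = new SqlBulkCopy(conn, SqlBulkCopyOptions.TableLock, null))
        {
            sbc.DestinationTableName = "dbo.NewTable";
            sbc.BatchSize = 1000;        // rows sent to the server per round trip
            sbc.BulkCopyTimeout = 0;     // no timeout, as in the commented VB line

            // Map by name; the DataTable's column names must match the destination's
            foreach (DataColumn col in dt.Columns)
                sbc.ColumnMappings.Add(col.ColumnName, col.ColumnName);

            sbc.WriteToServer(dt);
        }
    }
}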


To bulk insert data into a database, you should probably use that database engine's bulk-insert utility (e.g. bcp for SQL Server). You might want to do the processing first, write the processed data out to a separate text file, and then bulk-insert it into the database of concern.

If you really want to do the processing on-line and insert on-line, memory is also a (small) factor, for example:

  1. ReadAllLines reads the whole text file into memory, creating 46,000 strings, which occupies a sizable chunk of memory. Use ReadLines instead, which returns an IEnumerable and yields strings one line at a time.
  2. Your dataset may contain all 46,000 rows in the end, which makes detecting changed rows slow. Clear() the dataset's table right after each insert.

I believe the slowness you observed actually came from the dataset. Datasets issue one INSERT statement per new record, which means that you won't be saving anything by doing Update() 1,000 rows at a time or one row at a time. You still have 46,000 INSERT statements going to the database, which makes it slow.

In order to improve performance, I'm afraid LINQ can't help you here, since the bottleneck is with the 46,000 INSERT statements. You should:

  1. Forgo the use of datasets
  2. Dynamically create an INSERT statement in a string
  3. Batch the update, say, 100-200 rows per command
  4. Dynamically build the INSERT statement with multiple VALUES clauses
  5. Run the SQL command to insert 100-200 rows per batch (see the sketch after this list)
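
A rough sketch of that batched INSERT (the file name, table name, column names, and connection string are placeholders; the values are parameterized rather than concatenated into the SQL text):

// Sketch: insert parsed rows 200 at a time with one multi-row INSERT per batch.
// Assumes using System.Collections.Generic, System.Data.SqlClient, System.IO,
// System.Linq and System.Text; fileName, dbo.NewTable and Col1/Col2 are placeholders.
const int batchSize = 200;

void InsertBatch(SqlConnection conn, List<string[]> rows)
{
    var sql = new StringBuilder("INSERT INTO dbo.NewTable (Col1, Col2) VALUES ");
    using (var cmd = new SqlCommand { Connection = conn })
    {
        for (int i = 0; i < rows.Count; i++)
        {
            if (i > 0) sql.Append(", ");
            sql.Append($"(@c1_{i}, @c2_{i})");
            cmd.Parameters.AddWithValue($"@c1_{i}", rows[i][0]);
            cmd.Parameters.AddWithValue($"@c2_{i}", rows[i][1]);
        }
        cmd.CommandText = sql.ToString();
        cmd.ExecuteNonQuery();   // one round trip per 200 rows instead of per row
    }
}

using (var conn = new SqlConnection("YourConnectionString"))
{
    conn.Open();
    var batch = new List<string[]>(batchSize);

    foreach (var line in File.ReadLines(fileName).Skip(1))
    {
        batch.Add(line.Split('|'));
        if (batch.Count == batchSize)
        {
            InsertBatch(conn, batch);
            batch.Clear();
        }
    }

    if (batch.Count > 0)
        InsertBatch(conn, batch);   // flush the final, smaller batch
}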

If you insist on using datasets, you don't have to do it with LINQ -- LINQ solves a different type of problems. Do something like:

// code to create dataset "ds" and datatable "dt" omitted
// code to create data adapter "adaptor" omitted

int count = 0;

foreach (string line in File.ReadLines(filename)) {
    // Do processing based on line, perhaps split it
    dt.Rows.Add(...);
    count++;

    // Push every 1,000 rows to the database, then clear the in-memory table
    if (count >= 1000) {
        adaptor.Update(dt);
        dt.Clear();
        count = 0;
    }
}

// Flush whatever is left over (the last batch is usually smaller than 1,000)
if (count > 0)
    adaptor.Update(dt);

This will improve performance somewhat, but you're never going to approach the performance you obtain by using dedicated bulk-insert utilities (or function calls) for your database engine.

Unfortunately, using those bulk-insert facilities will make your code less portable to another database engine. This is the trade-off you'll need to make.

