What is your recommended way to import .csv files into Microsoft SQL Server 2008 R2?
I'd like something fast, as I have a directory with a lot of .csv files (>500MB spread across 500 .csv files).
I'm using SQL Server 2008 R2 on Win 7 x64.
Update: Solution
Here's how I solved the problem in the end:
- I abandoned trying to use LINQ to Entities to do the job. It works, but it doesn't support bulk insert, so it's about 20x slower. Maybe the next version of LINQ to Entities will support this.
- Took the advice given in this thread and used bulk insert.
- I created a T-SQL stored procedure that uses BULK INSERT. Data goes into a staging table, is normalized, and then copied into the target tables (see the sketch after this list).
- I mapped the stored procedure into C# using the LINQ to Entities framework (there is a video on www.learnvisualstudio.net showing how to do this).
- I wrote all the code to cycle through files, etc., in C#.
- This method eliminates the biggest bottleneck, which is reading tons of data off the drive and inserting it into the database.
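For illustration, here is a minimal sketch of the shape of that stored procedure; the table names, columns, and file-path handling are hypothetical, not my actual schema. (BULK INSERT won't accept a variable for the file path, hence the dynamic SQL.)

CREATE PROCEDURE dbo.ImportCsvFile
    @FilePath NVARCHAR(260)
AS
BEGIN
    -- BULK INSERT reads the file straight off the drive into the staging table
    DECLARE @sql NVARCHAR(MAX) =
        N'BULK INSERT dbo.StagingTable FROM ''' + @FilePath +
        N''' WITH (FIELDTERMINATOR = '','', ROWTERMINATOR = ''\n'', FIRSTROW = 2);';
    EXEC sp_executesql @sql;

    -- Normalize the staged rows and copy them into the target table(s)
    INSERT INTO dbo.TargetTable (Col1, Col2)
    SELECT DISTINCT Col1, Col2
    FROM dbo.StagingTable;

    TRUNCATE TABLE dbo.StagingTable;
END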
Why is this method so quick at reading .csv files? Microsoft SQL Server gets to import the files directly from the hard drive straight into the database, using its own highly optimized routines. Most of the other C#-based solutions require much more code, and some (like LINQ to Entities) end up having to pipe the data slowly into the database via the C#-to-SQL-Server link.
Yes, I know it'd be nicer to have 100% pure C# code to do the job, but in the end:
- (a) For this particular problem, using T-SQL requires much less code than C# (about a tenth), especially for the logic to normalize the data from the staging table. This is simpler and more maintainable.
- (b) Using T-SQL means you can take advantage of the native bulk insert procedures, which turns a 20-minute wait into a 30-second pause.
Using BULK INSERT in a T-SQL script seems to be a good solution.
http://blog.sqlauthority.com/2008/02/06/sql-server-import-csv-file-into-sql-server-using-bulk-insert-load-comma-delimited-file-into-sql-server/
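For reference, a minimal form of the statement looks like this (the table name and file path are placeholders; adjust the terminators to whatever your files actually use):

BULK INSERT dbo.YourTable
FROM 'C:\csv\data001.csv'
WITH (
    FIELDTERMINATOR = ',',   -- column separator
    ROWTERMINATOR = '\n',    -- line separator
    FIRSTROW = 2             -- skip the header row, if present
);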
You can get the list of files in your directory with xp_cmdshell and the dir command (with a bit of cleanup). In the past, I tried to do something like this with sp_OAMethod and VBScript functions and had to use the dir method because I had trouble getting the list of files with the FSO object.
http://www.sqlusa.com/bestpractices2008/list-files-in-directory/
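Something along these lines should work (a sketch only; xp_cmdshell must be enabled on the server, and the directory path is a placeholder):

CREATE TABLE #Files (FileName NVARCHAR(260));

-- 'dir /b' returns bare file names, one per row
INSERT INTO #Files (FileName)
EXEC xp_cmdshell 'dir /b "C:\csv\*.csv"';

-- The cleanup: drop the null row xp_cmdshell appends (and any error text)
DELETE FROM #Files
WHERE FileName IS NULL OR FileName NOT LIKE '%.csv';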
If you have to do anything with the data in the files other than insert it, then I would recommend using SSIS. It can not only insert and/or update, but also clean the data for you.
The first officially supported way of importing large text files is the command-line tool bcp (Bulk Copy Utility), which is very useful for huge amounts of binary data.
Please check out this link: http://msdn.microsoft.com/en-us/library/ms162802.aspx
However, in SQL Server 2008 I presume the BULK INSERT command would be your number-one choice, because for a start it is part of the standard command set. If for any reason you have to maintain backward compatibility, I'd stick with the bcp utility, which is available for SQL Server 2000 too.
HTH :)
EDITED LATER: Googling around, I recalled that SQL Server 2000 had a BULK INSERT command too... however, there was obviously some reason I stuck with bcp.exe, and I cannot recall why... perhaps some limitation, I guess.
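By way of illustration, a typical character-mode invocation looks something like this (server, database, table, and file names are placeholders):

rem -c = character mode, -t, = comma field terminator, -T = trusted connection
bcp YourDB.dbo.StagingTable in "C:\csv\data001.csv" -c -t, -S YourServer -T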
I would recommend this:
using System;
using System.Data;
using Microsoft.VisualBasic.FileIO;

namespace ReadDataFromCSVFile
{
    static class Program
    {
        static void Main()
        {
            string csv_file_path = @"C:\Users\Administrator\Desktop\test.csv";
            DataTable csvData = GetDataTableFromCSVFile(csv_file_path);
            Console.WriteLine("Rows count: " + csvData.Rows.Count);
            Console.ReadLine();
        }

        private static DataTable GetDataTableFromCSVFile(string csv_file_path)
        {
            DataTable csvData = new DataTable();
            try
            {
                using (TextFieldParser csvReader = new TextFieldParser(csv_file_path))
                {
                    csvReader.SetDelimiters(new string[] { "," });
                    csvReader.HasFieldsEnclosedInQuotes = true;

                    // Use the first row as the column headers
                    string[] colFields = csvReader.ReadFields();
                    foreach (string column in colFields)
                    {
                        DataColumn dataColumn = new DataColumn(column);
                        dataColumn.AllowDBNull = true;
                        csvData.Columns.Add(dataColumn);
                    }

                    while (!csvReader.EndOfData)
                    {
                        string[] fieldData = csvReader.ReadFields();
                        // Treat empty values as null
                        for (int i = 0; i < fieldData.Length; i++)
                        {
                            if (fieldData[i] == "")
                            {
                                fieldData[i] = null;
                            }
                        }
                        csvData.Rows.Add(fieldData);
                    }
                }
            }
            catch (Exception ex)
            {
                // Don't swallow the exception silently; at least report it
                Console.WriteLine("Error reading CSV: " + ex.Message);
            }
            return csvData;
        }
    }
}
// Copy the DataTable to SQL Server using SqlBulkCopy
// (requires: using System.Data.SqlClient;)
private static void InsertDataIntoSQLServerUsingSQLBulkCopy(DataTable csvData)
{
    using (SqlConnection dbConnection = new SqlConnection("Data Source=ProductHost;Initial Catalog=yourDB;Integrated Security=SSPI;"))
    {
        dbConnection.Open();
        using (SqlBulkCopy s = new SqlBulkCopy(dbConnection))
        {
            s.DestinationTableName = "Your table name";
            foreach (DataColumn column in csvData.Columns)
                s.ColumnMappings.Add(column.ColumnName, column.ColumnName);
            s.WriteToServer(csvData);
        }
    }
}
If the structure of all your CSVs is the same, I recommend using Integration Services (SSIS) to loop over them and insert all of them into the same table.
I understand this is not exactly your question. But if you get into a situation where you use a straight insert, use TABLOCK and insert multiple rows per statement. It depends on the row size, but I usually go for 600-800 rows at a time. If it is a load into an empty table, then dropping the indexes and recreating them after the load is sometimes faster. If you can, sort the data on the clustered index before it is loaded. Use IGNORE_CONSTRAINTS and IGNORE_TRIGGERS if you can. Put the database in single-user mode if you can. (A sketch of these surrounding steps follows the example below.)
USE AdventureWorks2008R2;
GO
INSERT INTO Production.UnitMeasure WITH (TABLOCK)
VALUES (N'FT2', N'Square Feet ', '20080923'),
       (N'Y', N'Yards', '20080923'),
       (N'Y3', N'Cubic Yards', '20080923');
GO
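To make the surrounding tips concrete, the bracketing steps might look like this (database, table, and index names are hypothetical; note that disabling a clustered index makes the table inaccessible, so only disable nonclustered ones):

ALTER DATABASE YourDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

ALTER INDEX IX_Target_SomeColumn ON dbo.TargetTable DISABLE;

-- ... run the batched INSERT ... WITH (TABLOCK) statements here ...

ALTER INDEX IX_Target_SomeColumn ON dbo.TargetTable REBUILD;

ALTER DATABASE YourDB SET MULTI_USER;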