Yesterday in an interview I was asked this question, and the answers I gave don't seem to have impressed the interviewer. :(
Here is the scenario... A file has comma-delimited attribute values in newline-separated records. There is a requirement that this file has to go into a database table that has a matching sequence of columns. The file has a self-referencing column, i.e. one column "Id" holds the unique, not-null id for the record, and another column "LinkId" contains the id of some other record. This value can be empty.
Requirement: The file has to be inserted into the DB in the same order but all at once! That means no multiple inserts.
Question: How can this best be achieved in the .NET Framework? (You can use the latest 4.0 framework.)
Solution that I gave: Read the file line by line and parse it into the required XML. Then use a DataTable.ReadXml() call to load the XML at once into a DataTable and commit the DataSet that holds the DataTable.
Challenges:
What about a 100 GB text file? Will it work this way? How slow would the functionality get? If we have to hold the XML for 100 GB of data in memory, will the virtual address space of the system support it? Wouldn't the page swapping cause issues and become slow?
Would DataTable.ReadXml() work for such a huge XML? Would the DataSet be able to commit?
My answer: :-/
Any ideas guys?
Thx Harish.
If you're using SQL Server, you can use the BULK INSERT command
http://msdn.microsoft.com/en-us/library/ms188365.aspx
Basically, you set up a SqlCommand object and set the CommandText to something like:
sqlCommand.CommandText = "BULK INSERT " & tableName & " FROM '" & file & "' WITH (TABLOCK, FIELDTERMINATOR=',')"
sqlCommand.ExecuteNonQuery()
tableName - is the name of the table you want to insert into.
file - is the name of the file you want SQL Server to read. You need to ensure that SQL Server can get to the location where your file is stored. This includes ACL permissions, and network connectivity.
There's lots of options with the BULK INSERT command, I've used it in the past and it works really well.
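For completeness, here is a minimal C# sketch of the same thing; the connection string, table name, and file path are placeholders, and the file path must be readable by the SQL Server service itself, not just by your application:

using System.Data.SqlClient;

string tableName = "dbo.ImportedRecords";            // hypothetical destination table
string file = @"\\fileserver\share\records.txt";     // path as seen from the SQL Server machine

using (var connection = new SqlConnection("Server=.;Database=MyDb;Integrated Security=true"))
using (var command = connection.CreateCommand()) {
    command.CommandText =
        "BULK INSERT " + tableName +
        " FROM '" + file + "'" +
        " WITH (TABLOCK, FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')";
    connection.Open();
    command.ExecuteNonQuery();   // the server streams the file itself; nothing is buffered in .NET
}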
You're going to have multiple inserts regardless. If the file isn't too large, the easiest would be to create a database transaction first and then read line-by-line and create an insert command per line. Once the lines are all read, you can commit the transaction.
For a 100GB text file, I would first break it up into much smaller chunks for the inserts. Maybe commit the transaction every 1000 lines?
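Something like this, for example (just a sketch, assuming SQL Server via System.Data.SqlClient; the connection string, table name, and column names Id/LinkId/Value are made up):

using System;
using System.Data.SqlClient;
using System.IO;

static void ImportInBatches(string path, string connectionString) {
    const int batchSize = 1000;

    using (var connection = new SqlConnection(connectionString)) {
        connection.Open();
        SqlTransaction transaction = connection.BeginTransaction();
        int count = 0;

        foreach (string line in File.ReadLines(path)) {          // streams the file, never loads it whole
            string[] fields = line.Split(',');

            using (SqlCommand command = connection.CreateCommand()) {
                command.Transaction = transaction;
                command.CommandText =
                    "INSERT INTO dbo.Records (Id, LinkId, Value) VALUES (@id, @linkId, @value)";
                command.Parameters.AddWithValue("@id", fields[0]);
                command.Parameters.AddWithValue("@linkId",
                    fields[1] == "" ? (object)DBNull.Value : fields[1]);   // LinkId can be empty
                command.Parameters.AddWithValue("@value", fields[2]);
                command.ExecuteNonQuery();
            }

            if (++count % batchSize == 0) {                       // commit every 1000 lines
                transaction.Commit();
                transaction = connection.BeginTransaction();
            }
        }

        transaction.Commit();                                     // commit whatever is left over
    }
}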
The "Id" and "LinkId" fields look like a classic Parent > Child relationship to me. So as long as the parent is inserted first, there shouldn't be a problem.
You didn't mention what type of database this would be, but if it's MSSQL, then you can also use Linq to Sql to create a new object per line and add it to a list. Assuming you're using C#, you can use:
List<YourObject> YourList = new List<YourObject>();
string line;

using (System.IO.StreamReader file = new System.IO.StreamReader(@"C:\yourstorage.txt")) {
    // Read one line at a time and map the comma-separated fields onto the entity.
    while ((line = file.ReadLine()) != null) {
        string[] fields = line.Split(',');
        YourObject obj = new YourObject();
        obj.FieldX = fields[0];
        obj.FieldY = fields[1];
        obj.FieldZ = fields[2];
        YourList.Add(obj);
    }
}

using (YourDataContext db = new YourDataContext()) {
    db.YourObjects.InsertAllOnSubmit(YourList);
    db.SubmitChanges();
}
Where YourObject is a Linq to Sql model that's already added to the project and YourObjects is the actual table name in the database. But this can get very memory intensive for a huge file.
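If memory becomes a problem, one variation (again just a sketch, reusing the hypothetical YourObject/YourDataContext types from above) is to submit in batches and use a fresh DataContext per batch, so the number of tracked entities stays bounded:

const int batchSize = 1000;
var batch = new List<YourObject>();
string line;

using (var file = new System.IO.StreamReader(@"C:\yourstorage.txt")) {
    while ((line = file.ReadLine()) != null) {
        string[] fields = line.Split(',');
        batch.Add(new YourObject { FieldX = fields[0], FieldY = fields[1], FieldZ = fields[2] });

        if (batch.Count == batchSize) {
            SubmitBatch(batch);        // flush every 1000 rows
            batch.Clear();
        }
    }
}
if (batch.Count > 0)
    SubmitBatch(batch);                // flush the final partial batch

// A new DataContext per batch avoids accumulating tracked entities in memory.
static void SubmitBatch(List<YourObject> batch) {
    using (var db = new YourDataContext()) {
        db.YourObjects.InsertAllOnSubmit(batch);
        db.SubmitChanges();
    }
}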
Apparently you can use OLEDB to open a CSV file (see this for an example), and from there it should be easy to get the data from one source into the other.
I'm not sure if it actually gets rid of the memory problem, but it is a lot nicer than constructing an XML copy in-memory.
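Something along these lines, for example (a sketch only: the Jet text provider, folder, file name, and destination table are assumptions, and on 64-bit you may need the ACE provider instead), reading the CSV through OLE DB and streaming it into SQL Server with SqlBulkCopy so the file never has to fit in memory:

using System.Data;
using System.Data.OleDb;
using System.Data.SqlClient;

// For the text driver, "Data Source" is the folder; the file name is used as the table name.
string csvConnection =
    @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\import\;" +
    "Extended Properties=\"text;HDR=No;FMT=Delimited\"";

using (var source = new OleDbConnection(csvConnection))
using (var command = new OleDbCommand("SELECT * FROM [yourstorage.txt]", source)) {
    source.Open();
    using (IDataReader reader = command.ExecuteReader())
    using (var bulkCopy = new SqlBulkCopy("Server=.;Database=MyDb;Integrated Security=true")) {
        bulkCopy.DestinationTableName = "dbo.ImportedRecords";
        bulkCopy.WriteToServer(reader);   // rows stream from the reader into the table
    }
}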