I have to read in historical customs tariff data from a large text file into a database, but the data is a bit messed up.
A tariff consists of a type and a measure (the actual rates and effective dates)
A type is defined by a typecode and a description. Measures contain rates, the geographical area that it applies to, a rate and start date and end date.
The p开发者_JS百科roblem is that there are multiple entries for the same tariff with different effective dates that needs to be compounded into one entry.
The text file looks something like this:
(TypeCode, Area, Rate, StartDate, EndDate, Description)
1: 01021000#GEN #FREE #20050101#20061231#PURE-BRED BREEDING ANIMALS # 2: 01021000#GEN #FREE #20070101#20071231#PURE-BRED BREEDING ANIMALS # 3: 01021000#GEN #FREE #20080101#99999999#PURE-BRED BREEDING ANIMALS # 4: 01029000#GEN #00000040.000% #20050101#20061231#OTHER # 5: 01029000#GEN #00000040.000% #20070101#20071231#OTHER # 6: 01029000#GEN #00000030.000% #20080101#20091231#OTHER # 7: 01029000#EU #00000030.000% #20070101#20071231#OTHER #In this example:
- 1, 2 and 3 needs to be compounded into one measure with the first startdate and the last enddate (01021000#GEN #FREE #20050101#99999999#PURE-BRED BREEDING ANIMALS #)
- 4, 5 needs to be compounded into one measure with the first startdate and last enddate (01029000#GEN #00000040.000% #20050101#20071231#OTHER #)
- 6 must stay separate because it has a different rate
- 7 must stay separate because it is from a different geographical area
I am using c# and Sql Compact Edition. I have got it mostly working but it is EXTREMELY slow... There has to be a more efficient way to do this at the moment it takes about 40 minutes on my Intel i3 laptop (66000 entries)
I've written down my steps and given code for the compounding part. I need to check if dates are subsequent as well.
Steps:
Read the textfile line by line Split the line into tokens Insert the unique TypeCodes and their descriptions into the Type table Insert the values into the Measure table with the following code:// check to see if a measure with the same typecode, area and rate has already been inserted
String select = string.Format("SELECT TypeCode FROM Measure WHERE TypeCode = '{0}' AND AreaCode = '{1}' AND Rate = '{2}'", tokens[1], tokens[3], tokens[4]);//string.Format("SELECT TypeCode FROM Measure WHERE TypeCode = '{0}'", tokens[1]);
SqlCeDataAdapter adapter = new SqlCeDataAdapter(select, con);
DataTable table = new DataTable(); // Use DataAdapter to fill DataTable
adapter.Fill(table);
// if there are no similar records insert this one
if (table.Rows.Count <= 0)
{
string insert = "INSERT INTO Measure VALUES (@TypeCode, @UOM, @AreaCode, @Rate, @StartDate, @EndDate)";
SqlCeCommand com = new SqlCeCommand(insert, con);
com.Parameters.AddWithValue("@TypeCode", tokens[1]);
com.Parameters.AddWithValue("@UOM", tokens[2]);
com.Parameters.AddWithValue("@AreaCode", tokens[3]);
com.Parameters.AddWithValue("@Rate", tokens[4]);
com.Parameters.AddWithValue("@StartDate", tokens[5]);
com.Parameters.AddWithValue("@EndDate", tokens[6]);
com.ExecuteNonQuery();
}
else
{
// update the current record with the new enddate
string update = "UPDATE Measure SET EndDate = @EndDate WHERE TypeCode = @TypeCode AND AreaCode = @AreaCode AND Rate = @Rate";
SqlCeCommand com = new SqlCeCommand(update, con);
com.Parameters.AddWithValue("@EndDate", tokens[6]);
com.Parameters.AddWithValue("@TypeCode", tokens[1]);
com.Parameters.AddWithValue("@AreaCode", tokens[3]);
com.Parameters.AddWithValue("@Rate", tokens[4]);
com.ExecuteNonQuery();
}
Any help or suggestions will be appreciated!
If you're using SQL Server 2008 CE, you can use the MERGE statement to go through the table only once (http://blog.sqlauthority.com/2008/08/28/sql-server-2008-introduction-to-merge-statement-one-statement-for-insert-update-delete/). You may want to create an index to Measures on TypeCode, AreaCode and Rate to speed up the process as well.
This is something I would handle on the SQL side, rather than with C#. Import all the data into a temp table, with the fields you gave (TypeCode, Area, Rate, StartDate, EndDate, Description), then query out a summary like this:
INSERT INTO PermanentTable (TypeCode, Area, Rate, StartDate, EndDate, Description)
SELECT
TypeCode, Area, Rate, MIN(StartDate), MAX(EndDate), MIN(Description)
FROM
TempLoad
GROUP BY
TypeCode, Area, Rate
As PoppaVein points out, if you have existing data to update, you'll want a MERGE instead. Oh, and beyond just an index, it sounds like { TypeCode, AreaCode, Rate } should be the primary key for your permanent table. Doing this will probably speed things up even more, and will ensure that you don't import any duplicate data.
I don't believe you specified whether a change in Description
warrants creating a new row in your permanent table; I'm assuming it won't.
There's one big limitation in the query I just gave: if you have non-contiguous periods where a Rate applies, this will "paper over" the gap. If that's a possibility, I can help you with the more complex query needed. Actually, that topic (merging non-contiguous date ranges) pops up rather often on SO, you can probably find a suitable solution pretty easily.
精彩评论