Comparing two datasets

I need to import data from an Oracle database into MySQL and update the MySQL data with the data from Oracle. I have the code set up to obtain the data from both sources, but I am having trouble figuring out the best way to update the information.

I have tried DataSet.Merge, but that doesn't actually flag the RowState properly. I was hoping to use:

ds1 = GetMySQLData();
ds2 = GetOracleData();

ds1.Merge(ds2);

changesDataSet = ds1.GetChanges(DataRowState.Modified);

The RowState is not being altered. I know for a fact that the merge is modifying the data, because I purposely changed something in my MySQL test database and saw the change after calling the merge.

Is there a known way (an algorithm) that I can use to compare the datasets against one another and insert, update, and delete records?


I originally suggested calling a Merge() overload with preserveChanges = true, but that only works if ds2 already has row states indicating its differences with ds1. As you've said in your question, that is what you need to accomplish. So, an algorithm? Here are two: the clean, simple, obvious way; and an adaptation of sort-merge join. There is no way around examining every row of ds2, but the second algorithm attempts to reduce the amount of searching on ds1 by expecting the data to be ordered.
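For reference, one way to call that overload is shown below; preserveChanges = true keeps whatever pending changes ds1 already holds when ds2 is merged in, but it does not create any row states by itself, which is why an algorithm like the ones below is still needed.

ds1.Merge(ds2, true);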

1) Simple, clean, obvious; using DataRowCollection.Find(pk) and object[].SequenceEqual(). A primary key is required in each table, but no ordering of data is necessary, and the type of the primary key is immaterial.

for (int i = 0; i < ds2.Tables.Count; i++)
{
    foreach (DataRow dr in ds2.Tables[i].Rows)
    {
        DataRow drOrig = ds1.Tables[i].Rows.Find(dr[0]);
        if (drOrig != null)
        {
            if (!drOrig.ItemArray.SequenceEqual(dr.ItemArray))
            {
                dr.SetModified();
            }
        }
        else
        {
            dr.SetAdded();
        }
    }
}

ds1.Merge(ds2);
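
Once the rows have been flagged and merged, the changes can be written back to MySQL with a data adapter. The following is only a minimal sketch: it assumes the merged ds1 carries the Added/Modified states set above, that each table maps to a straight single-table SELECT with a primary key (so MySqlCommandBuilder can generate the commands), and that MySql.Data is referenced; the connection string is a placeholder.

// Requires a reference to MySql.Data and `using MySql.Data.MySqlClient;`.
DataSet changes = ds1.GetChanges();
if (changes != null)
{
    using (var conn = new MySqlConnection("server=...;database=...;uid=...;pwd=..."))
    {
        conn.Open();
        foreach (DataTable table in changes.Tables)
        {
            var adapter = new MySqlDataAdapter("SELECT * FROM " + table.TableName, conn);
            var builder = new MySqlCommandBuilder(adapter); // generates INSERT/UPDATE/DELETE commands
            adapter.Update(table);                          // writes the Added/Modified rows to MySQL
        }
    }
    ds1.AcceptChanges();
}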

2) Grittier, messier; still using object[].SequenceEqual(). Data must be ordered, though the 'pk' / row identifier need not be unique. However, its type must be known for each individual table, and if the types differ, you cannot simply loop through the tables.

// Assuming first column of each table is int, primary key; and that all data are ordered by pk.
for (int i = 0; i < ds2.Tables.Count; i++)
{
    int indexDs1 = 0;
    int indexDs2 = 0;
    DataRow nextDs1Row = ds1.Tables[i].Rows[indexDs1];
    DataRow nextDs2Row = ds2.Tables[i].Rows[indexDs2];
    int nextDs1Pk = (int)nextDs1Row[0];
    int nextDs2Pk = (int)nextDs2Row[0];
    while ((indexDs1 < ds1.Tables[i].Rows.Count) && (indexDs2 < ds2.Tables[i].Rows.Count))
    {
        if (nextDs1Pk == nextDs2Pk)
        {
            // Set row state to modified if any differences exist.
            if (!nextDs1Row.ItemArray.SequenceEqual(nextDs2Row.ItemArray))
            {
                nextDs2Row.SetModified();
            }
            // Advance both iterators by one row.
            indexDs1++;
            if (indexDs1 < ds1.Tables[i].Rows.Count)
            {
                nextDs1Row = ds1.Tables[i].Rows[indexDs1];
                nextDs1Pk = (int)nextDs1Row[0];
            }
            indexDs2++;
            if (indexDs2 < ds2.Tables[i].Rows.Count)
            {
                nextDs2Row = ds2.Tables[i].Rows[indexDs2];
                nextDs2Pk = (int)nextDs2Row[0];
            }
        }
        else if (nextDs1Pk < nextDs2Pk)
        {
            // Advance through ds1, doing nothing, until the next pk of ds2 is reached.
            do
            {
                indexDs1++;
                if (indexDs1 < ds1.Tables[i].Rows.Count)
                {
                    nextDs1Row = ds1.Tables[i].Rows[indexDs1];
                    nextDs1Pk = (int)nextDs1Row[0];
                }
                else
                {
                    break;
                }
            } while (nextDs1Pk < nextDs2Pk);
        }
        else //nextDs1Pk > nextDs2Pk
        {
            // Advance through ds2, setting row state to added, until the next pk of ds1 is reached.
            do
            {
                nextDs2Row.SetAdded();
                indexDs2++;
                if (indexDs2 < ds2.Tables[i].Rows.Count)
                {
                    nextDs2Row = ds2.Tables[i].Rows[indexDs2];
                    nextDs2Pk = (int)nextDs2Row[0];
                }
                else
                {
                    break;
                }
            } while (nextDs1Pk > nextDs2Pk);
        }
    }
}

If your machine multi-tasks well, and you do not need to enforce foreign key constraints on the individual tables in the set, I would make each table's row analysis a separate task, start them all in parallel, and then merge the tables one by one as the tasks complete. If that is enough to make algorithm 1 meet your requirements, I would go with it in the name of simplicity. The Find() and SequenceEqual() methods it uses are probably highly optimized, and algorithm 2 didn't perform much faster in my testing. If neither is fast enough and you know something about your data, you may be able to improve on SequenceEqual().
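
If you go that route with algorithm 1, a rough sketch of the parallel analysis might look like the following. FlagChanges is a hypothetical helper that wraps algorithm 1's row loop for a single pair of tables; each task touches only its own table, so no locking is needed, and the same assumptions hold (a primary key on every table, tables appearing in the same order in both sets).

// Hypothetical helper: algorithm 1's row loop for one pair of tables.
static void FlagChanges(DataTable orig, DataTable incoming)
{
    foreach (DataRow dr in incoming.Rows)
    {
        DataRow drOrig = orig.Rows.Find(dr[0]);
        if (drOrig == null)
            dr.SetAdded();
        else if (!drOrig.ItemArray.SequenceEqual(dr.ItemArray))
            dr.SetModified();
    }
}

// Analyse each table pair on its own task (uses System.Linq and System.Threading.Tasks),
// then merge once all of them have finished.
var tasks = Enumerable.Range(0, ds2.Tables.Count)
    .Select(i => Task.Run(() => FlagChanges(ds1.Tables[i], ds2.Tables[i])))
    .ToArray();
Task.WaitAll(tasks);
ds1.Merge(ds2);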
