开发者

Improving speed of queries/processing of RegExp function with two database tables

开发者 https://www.devze.com 2023-01-22 03:09 出处:网络
I have a program with a function that retrieves a relatively short text string from one SQL database tabl开发者_如何学Goe (\"html\"), runs a number of regular expression queries on it, and then stores

I have a program with a function that retrieves a relatively short text string from one SQL database tabl开发者_如何学Goe ("html"), runs a number of regular expression queries on it, and then stores the output in another database table ("profiles"). "Html" has about 8 million records and "profiles" stores only the results. It is currently very slow - well over 1 second per record on a fast server. I'm hoping someone might be able to point out something obvious that I would be able to fix:

dataComm dc = new dataComm(@"SERVER1", "html");
dataComm dc_bp = new dataComm(@"SERVER1", "profiles");

Int32 numinserted = 0;

for (int selectc = 0; selectc < 85000; selectc++)
{
  DataTable dt = dc.fetchData("SELECT top 100 * from html where processed=0");

  label3.Text = "Iteration " + selectc + " of 85,000";

  string sq="";
  string squpdate = "";

  Int32 thisid=0;

  for (int i = 0; i < dt.Rows.Count; i++)
  {                    
    // CODE THAT RUNS THE REGULAR EXPRESSIONS.
    thisid = Convert.ToInt32(dt.Rows[i]["ID"]);

    sq += @"INSERT INTO profiles (field1, field2, [etc.]) VALUES ('value1','value2', [etc.]); ";
    numinserted++;

    squpdate += "UPDATE html SET processed=1 WHERE ID=" + thisid.ToString() + "; ";
  }

  dc.executeNonQuery(squpdate);

}


If you have SQL 2005+, you can build some CLR to execute regular expressions in SQL. Very handy. Rather fast. Then, you can handle everything in SQL and push your code towards set-based operation rather than looping.

http://msdn.microsoft.com/en-us/magazine/cc163473.aspx


The answer might be simply not running regular expressions on that HTML but parse it in some other way, e.g. with the HTML agility pack. Parsing HTML with regex is "bad", doing a lot of it is very "bad". (It will depend on the regex, but they can be written so that they are really inefficient for this task.)

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号