High performance PHP simaliraty checking on large database_问答_开发者

High performance PHP simaliraty checking on large database

开发者 https://www.devze.com 2023-03-09 09:03 出处：网络

I have 30,000 rows in a database that need to be similarity checked (using similar_text or another such function).

I have 30,000 rows in a database that need to be similarity checked (using similar_text or another such function).

In order to do this it will require doing 30,000^2 checks for each columns.

I estimate I will be checking on average 4 columns.

This means I will have to do 3,600,000,000 checks.

What is the best (fastest, and most reliable) way to do this with PHP, bearing in mind request memory limits and time limits etc?

The server need to still actively serve webpages at the 开发者_如何学编程same time as doing this.

PS. The server we are using is an 8 core Xeon 32 GB ram.

Edit:

The size of each column is normally less that 50 characters.

I guess you just need FULL TEXT search.

If that not fits you, you have only one chance to solve this: cache the results. So you will not have to parse 3bil of records for each requests

Anyway here how you can do it:

   $result = array();       

   $sql = "SELECT * FROM TABLE";
   while( $row = ... ) {
      $result[] = $row;    //> Append the current record     
   }

Now results contains all the rows from your table.

At this point you said you want to similar_text() all columns with each other.
To do that and cache the results you need at least a table (as I said in the comment).

   //> Starting calculating the similarity
   foreach($result as $k=>$v) {
      foreach($result as $k2=>$v2) {

           //> At this point you have 2 rows, $v and $v2 containing your column

           $similarity = 0;

           $similartiy += levensthein($v['column1'],$v2['column1']);
           $similartiy += levensthein($v['column2'],$v2['column2']);               
           //> What ever comparison you need here between columns

           //> Now you can finally store the result by inserting in a table the $similarity
           "INSERT DELAYED INTO similarity (value) VALUES ('$similarity')"; 

      }                      
   }

2 Things you have to notice:

I used levensthein because it's much faster than similar_text (notice it's value it's the contrary of similar_text, because the greater the value levensthein returns the less the affinity between string)
I Used INSERT DELAYED to greatly lower the database cost

oy... similar_text() is O(n^3) !

do you really need a percentage similarity for each comparison or can you just do a quick compare of the first/middle/last X bytes of the strings to narrow the field?

if you're just looking for dups say... you can probably narrow down the number of comparisons you need to do, and that will be the most effective tack imho.