Faster way to run through a couple hundred lines in a for loop and check if match exists in 200k+ line file

开发者 https://www.devze.com 2023-01-16 12:39 Source: web
I have a couple hundred single words that are identified in a foreach routine and pushed into an array.

I would like to check each one (word) to see if it exists in an existing txt file that is single column 200k+ lines.

(Similar to a huge "bad word" routine, I guess, but in the end this will add to the "filter" file.)

I don't know whether I should do this with preg_match in the loop, or combine the arrays somehow and use array_unique.

I would like to add the ones not found to the main file as well, and use flock() to avoid any multi-access issues.

Is this a pipe dream? Well, it is for this beginner. My attempts have timed out at the 30-second limit.

Stackoverflow has been such a great resource. I don't know what I would do without it. Thanks in advance either way.


Sorry, but that sounds like a really awful approach! Doing a full scan (of a table, list, or whatever) just to check whether something already exists is wrong; this is what hash tables are for. Your case sounds like a classic database job. If you don't have a database available, a local SQLite file will provide the essential functionality. Some background: looking up "foo" in a hash table takes O(1), i.e. constant, time, because the algorithm knows where to look and can see whether the value is there. Hash tables can run into ambiguity (collisions) because of the one-way nature of hashing, but that hardly matters in practice: the table yields a small set of candidate matches that can be compared directly (for any reasonable number of elements; probably not the Google index). So if, for some reason, you want to stay with your text-file approach, consider the following:

  • Sort your file and insert new data at the right place (alphabetically would be the most intuitive approach). Then you can jump from position to position and isolate the area where the word should be. There are several algorithms available; just have a Google. Keep in mind it takes longer the more data you have: typically such a lookup runs in O(log n) time, where n is the number of entries.
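The sorted-file idea above boils down to a binary search. A minimal sketch, assuming the word list fits in memory once sorted (the function name and sample data are illustrative):

```php
<?php
// Binary search over a sorted word list: O(log n) membership checks,
// instead of scanning every entry.
function wordIndex(array $sorted, $word) {
    $lo = 0;
    $hi = count($sorted) - 1;
    while ($lo <= $hi) {
        $mid = (int) (($lo + $hi) / 2);
        $cmp = strcmp($sorted[$mid], $word);
        if ($cmp === 0) return $mid;   // found at position $mid
        if ($cmp < 0)  $lo = $mid + 1; // word sorts after the midpoint
        else           $hi = $mid - 1; // word sorts before the midpoint
    }
    return -1; // not present
}

$sorted = array('apple', 'banana', 'cherry', 'fig'); // stand-in for the sorted file
var_dump(wordIndex($sorted, 'cherry')); // int(2)
var_dump(wordIndex($sorted, 'grape'));  // int(-1)
?>
```

The same jumping works on disk with fseek() if every line is padded to a fixed width, which is what makes the sorted-file approach viable without loading everything.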

Well, this is all basically just to guide you onto the right track. You could also shard your data: for example, save every word beginning with "a" in a.txt, and so on. Or split the word into characters, create a folder for each character with the last character as the file name, and then just check whether that file exists. These are deliberately silly suggestions (you would probably run out of inodes on your disk), but they illustrate that you can check for existence without doing a full scan.

The main thing is that you have to project some search-tree structure onto a reasonable storage layout (which a database system does automatically for you). The folder example illustrates the basic principle.

This Wikipedia entry might be a good place to start: http://en.wikipedia.org/wiki/Binary_search_tree
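The SQLite suggestion above might look like this. A minimal sketch, assuming PHP's PDO SQLite driver is available; the file name, table name, and helper function are illustrative:

```php
<?php
// A local SQLite file as the "filter" store. The PRIMARY KEY gives
// indexed lookups, and INSERT OR IGNORE makes duplicate inserts a no-op.
$db = new PDO('sqlite:words.db');
$db->exec('CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)');

$insert = $db->prepare('INSERT OR IGNORE INTO words (word) VALUES (?)');
$lookup = $db->prepare('SELECT 1 FROM words WHERE word = ?');

function wordExists(PDO $db, PDOStatement $lookup, $word) {
    $lookup->execute(array($word));
    return (bool) $lookup->fetchColumn(); // false when no row matched
}

// Seed a couple of words, then check membership.
foreach (array('apple', 'banana') as $w) {
    $insert->execute(array($w));
}
var_dump(wordExists($db, $lookup, 'apple'));  // bool(true)
var_dump(wordExists($db, $lookup, 'cherry')); // bool(false)
?>
```

Importing the existing 200k-line file is a one-time loop over the prepared INSERT statement (wrap it in a transaction so it finishes in seconds rather than minutes), and SQLite handles the concurrent-access locking that flock() was meant to cover.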


If the file is too large, then it is not a good idea to read it all into memory. You can process it line by line:

<?php
$words = array('a', 'b', 'c'); # words to insert, assumed to be unique

$fp = fopen('words.txt', 'r+');
while (!feof($fp))
{
    $line = trim(fgets($fp));
    $key = array_search($line, $words);
    if ($key !== false)
    {
        unset($words[$key]);
        if (!$words) break;
    }
}

# reposition explicitly before switching from reading to writing
fseek($fp, 0, SEEK_END);

foreach ($words as $word)
{
    fputs($fp, "$word\n");
}

fclose($fp);

?>

It loops through the entire file, checking to see if the current line (assumed to be a single word) exists in the array. If it does, that element is removed from the array. If there is nothing left in the array, then the search stops. After cycling through the file, if the array is not empty, it adds each of them to the file.

(File locking and error handling are not implemented in this example.)
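One refinement: array_search() scans the $words array on every line, so the loop is O(n·m). Flipping the candidate words into array keys makes each membership test a constant-time hash lookup. A sketch under the same assumptions as the example above (the two-line demo file stands in for the real 200k-line file):

```php
<?php
// Demo setup: a small stand-in for the real word file.
file_put_contents('words.txt', "a\nx\n");

$words = array_flip(array('a', 'b', 'c')); // candidate words as array keys

$fp = fopen('words.txt', 'r+');
while (!feof($fp)) {
    $line = trim(fgets($fp));
    if (isset($words[$line])) {    // O(1) hash lookup per line
        unset($words[$line]);      // already present in the file
        if (!$words) break;        // every candidate found; stop early
    }
}

fseek($fp, 0, SEEK_END);           // reposition before switching to writing
foreach (array_keys($words) as $word) {
    fputs($fp, "$word\n");         // append the words that were not found
}
fclose($fp);
?>
```

After the run, words.txt contains the original lines plus the missing words "b" and "c".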

Note that this is a very bad way to store this data (file-based, unsorted, etc.). Even SQLite would be a big improvement. You could always write an exporter to .txt if you needed the list in plain text.
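The file-locking caveat mentioned above can be handled with flock(). A minimal sketch (file name and appended word are illustrative) showing an exclusive lock held around the write so concurrent requests don't interleave their appends:

```php
<?php
// Demo setup: a small stand-in word file.
file_put_contents('words.txt', "a\n");

$fp = fopen('words.txt', 'r+');
if (flock($fp, LOCK_EX)) {    // block until we hold the exclusive lock
    fseek($fp, 0, SEEK_END);
    fputs($fp, "b\n");        // the append happens under the lock
    fflush($fp);              // push buffered writes out before unlocking
    flock($fp, LOCK_UN);
}
fclose($fp);
?>
```

In the full script the whole read-then-append pass would sit inside the locked region, so another process can't append the same word between your scan and your write.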

