I'm working with an unstructured plain text file. In addition to a lot of clutter, the file includes blocks of text that are separated from the rest of the text by empty lines.
How can I use PHP to extract all blocks of texts with more than 100 words?
The right approach depends on how large the file is, or could grow to be.
The simplest approach works when the file is small enough that handling it all in memory is feasible: use a regular expression to split the text into chunks, then loop through and keep the chunks with more than 100 words.
The safest approach, I think, is to open the file and read it one line at a time until you reach an empty line. If the block you have accumulated contains more than 100 words, store it; then continue with the next block.
Here's an example:
// Option 1: read the whole file into memory.
$contents = file_get_contents($filename);
$blocks = array();
// Split the contents on 2 or more consecutive line breaks,
// i.e. 3 blank lines in a row are treated the same as 1 blank line.
foreach (preg_split('/\n{2,}/', $contents) as $block) {
    if (str_word_count($block) > 100) {
        $blocks[] = $block;
    }
}
// Option 2: longer, but does not hold the whole file in memory.
$blocks = array();
$fp = fopen($filename, 'r');
$block = '';
while (($line = fgets($fp)) !== false) {
    if (!ctype_space($line)) { // depends on your definition of an empty line
        $block .= $line;
    }
    elseif ($block != '') {
        if (str_word_count($block) > 100) {
            $blocks[] = $block;
        }
        $block = '';
    }
}
fclose($fp);
// Don't forget the final block if the file doesn't end with a blank line.
if (str_word_count($block) > 100) {
    $blocks[] = $block;
}
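As a quick sanity check, the split-and-count logic above can be exercised on a hypothetical in-memory string (the word threshold is lowered to 2 here only so the sample stays short):

```php
<?php
// Two blocks separated by a blank line; only the first has more than 2 words.
$contents = "alpha beta gamma\n\nshort";
$blocks = array();
foreach (preg_split('/\n{2,}/', $contents) as $block) {
    if (str_word_count($block) > 2) { // use > 100 for the real task
        $blocks[] = $block;
    }
}
// $blocks now contains only "alpha beta gamma".
```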
Use a regex like \n\n (two consecutive newlines). You'll probably end up with something like this:
$text_split = preg_split('/\n\n/', $text);
$good_split = array();
foreach ($text_split as $v) {
    if (str_word_count($v) > 100) {
        array_push($good_split, $v);
    }
}
Good luck. Read up on regular expressions; in practice you may want something other than a literal \n\n.
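For instance, if the file might use Windows-style \r\n line endings, or its "blank" lines might actually contain spaces or tabs, a more forgiving pattern is needed. This is a sketch under those assumptions; in PCRE, \R matches any newline sequence, including \r\n:

```php
<?php
// Hypothetical input with Windows line endings.
$text = "first block\r\n\r\nsecond block";
// Split on a newline, optional trailing spaces/tabs, then one or more newlines.
$parts = preg_split('/\R[ \t]*\R+/', $text);
// $parts === array("first block", "second block")
```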