
Script to fix broken lines in a .txt file?

Developer https://www.devze.com 2023-01-02 16:57 Source: Internet
I'd love to read books properly on my Kindle.

To achieve my dream, I need a script to fix broken lines in a txt file.

For example, if the txt file has this line:

He watched Kahlan as she walked with her shoulders slumped
down.

... then it should fix it by deleting the newline before the word "down":

He watched Kahlan as she walked with her shoulders slumped down.

So, fellow programmers, what's (a) the easiest way to do this and (b) the best language?

p.s. The solution will involve searching for a lowercase letter in column 1, and deleting the newline before it to stitch the lines together. There are 1.2 million occurrences of this "rogue line break" in the novel I am trying to fix.


There are a bunch of ways to do it. I would recommend something along the lines of Perl, Python, or Ruby. If you're looking to do this with a quick-and-dirty one-liner, Perl has an edge in that department.

For example, this will do what you asked for:

# Slurp entire file.
# Convert newlines followed by lower-case letter.
perl -p -e 'BEGIN {$/ = undef}    s/\n(?=[a-z])/ /g' book.txt

But this is probably better if paragraphs are separated by 2 newlines.

# Process file a "paragraph" at a time.
# Convert newlines followed by at least 2 characters.
perl -p -e 'BEGIN {$/ = qq{\n\n}} s/\n(?=..)/ /g'    book.txt


If there are blank lines between paragraphs: read the text in by paragraphs (set $/ = "\n\n") and then use Text::Autoformat from CPAN.

Example (substitute a regular filehandle for DATA -- I only used it for convenience in the example):

use strict;
use warnings;
use Text::Autoformat;

local $/ = "\n\n";
while (<DATA>) {
    print autoformat $_, {left=>1, right=>80};
}


__DATA__
He watched Kahlan as she walked with her shoulders slumped 
down. 

He watched Kahlan as she walked with her shoulders slumped 
down. 
He watched Kahlan as she walked with her shoulders slumped 
down. 
He watched Kahlan as she walked with her shoulders slumped 
down. 

He watched Kahlan as she walked with her shoulders slumped 
down. 
He watched Kahlan as she walked with her shoulders slumped 
down. 

Output:

He watched Kahlan as she walked with her shoulders slumped down.

He watched Kahlan as she walked with her shoulders slumped down. He watched
Kahlan as she walked with her shoulders slumped down. He watched Kahlan as she
walked with her shoulders slumped down.

He watched Kahlan as she walked with her shoulders slumped down. He watched
Kahlan as she walked with her shoulders slumped down.
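
For anyone without Perl at hand, roughly the same idea can be sketched in Python with the standard-library textwrap module. This is an illustration only, not a port of Text::Autoformat: the function name reflow is made up here, paragraphs are assumed to be separated by blank lines, and the exact wrapping may differ from the output above.

```python
import re
import textwrap


def reflow(text: str, width: int = 80) -> str:
    """Re-wrap each blank-line-separated paragraph to `width` columns."""
    # Split on blank lines; lines holding only whitespace count as blank.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    wrapped = []
    for paragraph in paragraphs:
        # Collapse every run of whitespace (including the rogue newlines)
        # into a single space before wrapping.
        single_line = " ".join(paragraph.split())
        wrapped.append(textwrap.fill(single_line, width=width))
    return "\n\n".join(wrapped) + "\n"
```

For a Kindle, where the device does its own wrapping, a very large width (so that each paragraph ends up on one line) is probably the more useful setting.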


If there are newlines between paragraphs, you might be able to just open it up in a good text editor which has an option to "unwrap text". One such is TextMate for the Mac, but there are probably options for Windows as well.


Using a regular expression to match on lower case characters which are immediately preceded by a newline, then replacing that newline with a space, should do the trick.

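Here's a C# implementation:

    // Requires: using System.Text.RegularExpressions;

    // Replace each line break that is immediately followed by a
    // lowercase letter with a single space.
    string UnwrapText(string input)
    {
        // \r?\n matches both Windows and Unix line endings; the
        // lookahead (?=[a-z]) leaves the lowercase letter in place.
        return Regex.Replace(input, @"\r?\n(?=[a-z])", " ");
    }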
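
The same regular-expression idea could look like this in Python. It is a minimal sketch: the function name unwrap_lowercase is invented for the example, and it assumes the whole book fits in memory as one string.

```python
import re


def unwrap_lowercase(text: str) -> str:
    """Join a line onto the previous one when it starts with a lowercase letter."""
    # [ \t]* absorbs trailing blanks so the join yields exactly one space.
    # \r?\n handles both Unix and Windows line endings.
    # (?=[a-z]) checks the next character without consuming it.
    return re.sub(r"[ \t]*\r?\n(?=[a-z])", " ", text)
```

Note the limit of the heuristic: a broken line whose second half starts with a capital letter (a name such as "Kahlan", say) is left alone.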


If paragraphs start with a tab, the most efficient way may be to replace every newline that does not precede a tab with a space.

If not, you could nuke all newlines that aren't in a sequence of 2 or more newlines.

You could also nuke all newlines that don't follow a period, but as noted this will fail in the case that a sentence ends a line but not a paragraph.
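
The first two rules are easy to try out with lookarounds. Below is a rough Python sketch of both; the function names are invented for the example, and each function assumes the file has been read into a single string.

```python
import re


def unwrap_tab_paragraphs(text: str) -> str:
    """Replace every newline that is not followed by a tab with a space."""
    # Set the file's trailing newline(s) aside so they are not turned into spaces.
    body = text.rstrip("\n")
    return re.sub(r"\n(?!\t)", " ", body) + "\n"


def unwrap_single_newlines(text: str) -> str:
    """Replace newlines that are not part of a run of two or more newlines."""
    body = text.rstrip("\n")
    # (?<!\n) and (?!\n) require that neither neighbor is a newline.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", body) + "\n"
```

Which one fits depends entirely on how the book marks its paragraphs, so it is worth checking a few pages of the file first.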


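Open the file with vim, :set tw=0 noai, then gggqG. If the file is reasonably well-behaved, that should take out all linebreaks within paragraphs, while retaining paragraph breaks. One caveat: according to :help gq, when 'textwidth' is 0 the lines are formatted to the screen width (79 columns at most), so you may need a very large 'textwidth' instead of 0 to get each paragraph onto a single line.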


I would say parse through the book and look for occurrences of the newline character. If it doesn't come after a period, then remove it. The only problem is that this doesn't work when a line break falls right after the end of a sentence inside a paragraph. You would be left with:

He watched Kahlan as she walked with her shoulders slumped down.\n

He watched Kahlan as she walked with her shoulders slumped down.

Instead of:

He watched Kahlan as she walked with her shoulders slumped down. He watched Kahlan as she walked with her shoulders slumped down.

In that case, you will have to determine how paragraphs are separated (by two newline characters?). If so, check whether the period is followed by two newline characters. If it is not, delete the newline.
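
If paragraphs do turn out to be separated by two newline characters (that is, by a blank line), the period check becomes unnecessary: everything between two blank lines is one paragraph. Here is a small Python sketch under that assumption. The name unwrap_paragraphs is made up for the example, and it works line by line so the whole novel never has to sit in memory.

```python
from typing import Iterable, Iterator


def unwrap_paragraphs(lines: Iterable[str]) -> Iterator[str]:
    """Yield one joined line per paragraph; blank lines separate paragraphs."""
    current = []  # stripped lines of the paragraph being collected
    for line in lines:
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            # A blank line ends the paragraph: emit it as a single line.
            yield " ".join(current)
            current = []
    if current:
        # Flush the last paragraph if the input does not end with a blank line.
        yield " ".join(current)
```

An open file object can be passed in directly, since iterating over it yields lines; write each result followed by a blank line to keep the paragraph breaks. Because every line is stripped, any leading indentation is lost.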
