How to substitute character matches with trailing characters from the same text line?

Developer https://www.devze.com 2023-01-18 07:53 Source: Internet
I'm using pdftotext to convert Spanish language text. Characters with accents or tildes are output in a systematic way that requires further conversion. Accents and tildes appear in the converted text in the correct position but without the letter. The letter almost always appears at the end of the output line. When it doesn't, I can fix those by hand.

For example, the pdf sentence

¿Por qué?

becomes

¿Por qu´? e

I know enough about sed, awk and grep to think it can be done with some combination of those - and that it would take me a long time. I intend to use this to process all the pdf files in a folder.

The sentences appear in Spanish-English pairs on separate lines. I'd like to concatenate the two with a semicolon delimiter, the import format of my flash card app (Anki), and delete all content that is not a Spanish-English sentence pair.

For example, convert this output

B:

¿Por qu´? e
Why?

into

¿Por qué?;Why?
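The pairing step on its own can be sketched in Python as below. This is a minimal illustration, not part of the original question: the function name `to_anki` is hypothetical, and it assumes the speaker labels and blank lines have already been removed, so that the list alternates Spanish, English, Spanish, English.

```python
def to_anki(sentence_lines):
    """Join alternating Spanish/English lines as 'Spanish;English'.

    Assumes the list holds only sentence lines, Spanish first.
    A final line without a partner is silently dropped by zip().
    """
    pairs = zip(sentence_lines[0::2], sentence_lines[1::2])
    return [f"{spanish};{english}" for spanish, english in pairs]


for card in to_anki(["¿Por qué?", "Why?"]):
    print(card)
```

A sentence that itself contains a semicolon would produce an extra field on import, so such lines may need to be checked by hand.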

Where there are multiple accents, tildes or a mix of both, the letters trailing the line are in the correct order and may be separated by commas. For example, the pdf sentence

Sí pero vi en la televisión que iba a llover.

becomes

S´ pero vi en la televisi´n que iba a llover. ı, o

or S´ pero vi en la televisi´n que iba a llover. ı o
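The substitution described above (each stray mark takes the next trailing letter, in order) could look roughly like this in Python. It is a sketch under stated assumptions: the stray marks are taken to be U+00B4 and U+02DC and the dotless i to be U+0131, as they appear in the samples, and `restore_accents` is a hypothetical name. Lines where the number of marks and trailing letters differ are returned untouched so they can be fixed by hand.

```python
import re
import unicodedata

# Assumed code points, taken from the sample output above.
ACUTE, TILDE, DOTLESS_I = "\u00b4", "\u02dc", "\u0131"
COMBINING = {ACUTE: "\u0301", TILDE: "\u0303"}   # combining acute / tilde
ALLOWED = {ACUTE: "aeiouAEIOU", TILDE: "nN"}     # letters each mark may sit on

# Group 1: the sentence up to its last end punctuation; group 2: what trails it.
LINE_RE = re.compile(r"^(.*[.?!])\s*([^.?!]*)$")


def restore_accents(line):
    """Move trailing letters back under their marks, or return the line unchanged."""
    match = LINE_RE.match(line)
    if not match:
        return line
    sentence, tail = match.groups()
    letters = [c for c in tail if not c.isspace() and c != ","]
    marks = [c for c in sentence if c in COMBINING]
    if len(letters) != len(marks):
        return line  # irregular line: leave it for manual fixing
    remaining = iter(letters)
    out = []
    for c in sentence:
        if c in COMBINING:
            letter = next(remaining)
            base = "i" if letter == DOTLESS_I else letter
            if base not in ALLOWED[c]:
                return line
            # NFC turns base + combining mark into the single accented letter.
            out.append(unicodedata.normalize("NFC", base + COMBINING[c]))
        else:
            out.append(c)
    return "".join(out)


print(restore_accents("S´ pero vi en la televisi´n que iba a llover. ı, o"))
```

English lines contain no stray marks and no trailing letters, so they pass through unchanged and the function can be applied to every sentence line.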

Output File Format

The sentences always have an end punctuation, either "!", "?" or ".". For those unfamiliar with Spanish, vowels (aeiou) are the only letters which may have an accent, the letter "n" is the only one that may have a tilde, and the 2 special characters may be found on both upper and lower case letters.

The first output line may contain the level and title of the pdf. The level and title always precede the first occurrence of "A:"

I'm not interested in the line "Key Vocabulary" or anything that appears on any subsequent lines.
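The two rules above (drop everything up to the first "A:", stop at "Key Vocabulary") can be sketched in Python as follows. The name `dialogue_lines` is hypothetical, and the sketch assumes, based on the samples, that speaker labels are only "A:" or "B:" and that a line consisting of a parenthesised code such as (B0089) is never a sentence.

```python
import re

# Matches lines to drop: blank lines, speaker labels, parenthesised lesson codes.
SKIP_RE = re.compile(r"(?:[AB]:|\(.*\))?")


def dialogue_lines(text):
    """Return the sentence lines of one pdftotext output file, in order."""
    # Ignore "Key Vocabulary" and everything after it.
    body = text.split("Key Vocabulary", 1)[0]
    # The level and title always precede the first "A:".
    _, found, after_label = body.partition("A:")
    if not found:
        return []
    lines = (ln.strip() for ln in after_label.splitlines())
    return [ln for ln in lines if not SKIP_RE.fullmatch(ln)]


sample = "Elementary - Credit Card A:\n\n(B0089)\n\nMe da la cuenta, por favor.\nBring me the check, please.\n\nKey Vocabulary\n\ncuenta\n"
print(dialogue_lines(sample))
```

Because the split happens on the text after "A:" rather than on whole lines, a sentence that shares the first line with the title is kept as well.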

pdftotext was run with UTF-8 encoding. My OS is Linux Mint 9, which is based on Ubuntu 10.04.

Below are two sample output files.

Output 1

Elementary - Credit Card A:

(B0089)

Me da la cuenta, por favor.
Bring me the check, please.

B:

Se la doy enseguida.
I’ll bring it to you right away.

B:

Perd´n se˜or, pero no aceptamos tarjeta. o n
Sorry sir, but we don’t take cards.

A:

¿No aceptan ninguna tarjeta de cr´dito? e
You don’t take any credit cards?


Key Vocabulary

tarjeta cr´dito e cuenta

Noun Noun Noun

card credit bill

Output 2

Elementary - My computer is not working A: ¡No puede ser!
It can’t be!

(B0079)

B:

¿Qu´ pasa? e
What happened?

A:

Mi computadora no est´ funcionando. a
My computer is not working.

B:

Rein´ ıciala.
Restart it.


Key Vocabulary

funcionar

Verb

to work


Edit: Minor change to the NR == 1 line to accommodate variations in the first line of the input file. For this to work, "A:" must appear only once in the first line.

I also should add that this program depends on features of GNU AWK (gawk).

There seem to be some inconsistencies between your two output examples. The program below works with the first one. In the second example, this line contains both header and a data line:

Elementary - My computer is not working A: ¡No puede ser!

and this line contains the character to be substituted within the line rather than after the final punctuation.

Rein´ ıciala.

These issues could be accommodated by modifying the program if needed.
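For the second irregularity, where the letter follows its mark inside the sentence instead of trailing the line, one possible accommodation is sketched below in Python rather than awk. It is an assumption-laden illustration: `fix_inline` is a hypothetical name, the code points are those seen in the samples, and the fix is only attempted on lines that end in punctuation, since on a line with trailing letters a mark followed by a space is more likely the end of a word.

```python
import re
import unicodedata

# Assumed code points, taken from the sample output.
ACUTE, TILDE, DOTLESS_I = "\u00b4", "\u02dc", "\u0131"
COMBINING = {ACUTE: "\u0301", TILDE: "\u0303"}
ALLOWED = {ACUTE: "aeiouAEIOU", TILDE: "nN"}

# A stray mark, one space, then the letter it belongs to.
INLINE_RE = re.compile("([" + ACUTE + TILDE + "]) (\\w)")


def _compose(match):
    mark, letter = match.groups()
    base = "i" if letter == DOTLESS_I else letter
    if base not in ALLOWED[mark]:
        return match.group(0)  # not a valid pair: leave the text as it is
    return unicodedata.normalize("NFC", base + COMBINING[mark])


def fix_inline(line):
    """Join 'mark + space + letter' pairs, but only on lines with no trailing letters."""
    if not line.rstrip().endswith((".", "?", "!")):
        return line
    return INLINE_RE.sub(_compose, line)


print(fix_inline("Rein´ ıciala."))
```

This heuristic can still misfire when a word ending in a stray mark is followed by a word starting with a vowel and the trailing letter is missing, so its output is worth checking by eye.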

Also, you mention that these characters will be separated by commas, but the examples don't have them (in the one place where it might have appeared). It doesn't matter because my program ignores commas.

You can run the following program like this:

$ ./scriptname inputfile

Here it is in all its kludginess:

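#!/usr/bin/awk -f
# Requires GNU awk (gawk) for gensub().
BEGIN {
    # Split each line at the end punctuation:
    # $1 is the sentence, $2 holds the trailing letters.
    FS = "[.?!]"
    # chars[letter] = stray mark followed by the composed character
    chars["n"] = "˜ñ"
    chars["N"] = "˜Ñ"
    chars["a"] = "´á"
    chars["A"] = "´Á"
    chars["e"] = "´é"
    chars["E"] = "´É"
    chars["ı"] = "´í"
    chars["I"] = "´Í"
    chars["o"] = "´ó"
    chars["O"] = "´Ó"
    chars["u"] = "´ú"
    chars["U"] = "´Ú"
}

# Stop at the vocabulary section.
/Key Vocabulary/ {exit}

# Strip the level and title, up to and including "A:", from the first line.
NR == 1 { sub(".*A: *","",$1) }

# Skip lesson codes such as (B0089), speaker labels and blank lines.
/^\(.*\) *$/ || \
/^(A|B): *$/ || \
/^ *$/ \
    {next}

{
    # Recover the punctuation character that FS consumed.
    punct = gensub($1"(.)"$2,"\\1","",$0)

    # For each trailing letter, in order, replace the first remaining
    # mark of its kind with the composed character. substr() is 1-based.
    for (i=1; i<=length($2); i++) {
        char = substr($2,i,1);
        if (char in chars) {
            sub(substr(chars[char],1,1),substr(chars[char],2,1),$1)
        }
    }

    # Print the Spanish sentence, a semicolon, then the English line.
    printf "%s%s;", $1, punct
    getline
    print
}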


I think it would be difficult with sed or awk…

I suggest using Perl or Vim commands to do that (if you know how to use Vim):

A vim command would be:

:%s/^.\{-}\zs´\(.*\.\) ı\(,\|$\)/í\1/
:%s/^.\{-}\zs´\(.*\.\) o\(,\|$\)/ó\1/
:%s/^.\{-}\zs´\(.*\.\) e\(,\|$\)/é\1/
: " etc

Repeat until no line has a vowel left after the full stop at its end.

\zs sets the start of the match, and \1 is a back-reference to the group \(.*\.\) in the pattern, that is, everything from the accent up to and including the full stop.

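If you want to process all the files, open the text files produced by pdftotext (by default it writes file.txt for file.pdf) rather than the PDFs themselves:

vim *.txt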
:set hidden   "allows modifying a not-on-display buffer
:bufdo %s/^.\{-}\zs´\(.*\.\) ı\(,\|$\)/í\1/
:bufdo %s/^.\{-}\zs´\(.*\.\) o\(,\|$\)/ó\1/
: " etc
:next         "allows you to see other buffers to validate
:bufdo w      "will save all buffers
:q            "will quit