How can I search a file for all text lines prefaced with numbers in a certain format and move them to a new line

I'm searching a flat text file version of the KJV bible for a word or group of words, and each match should return the line, the book, the chapter, and the verse in which the word was found. My problem is that I had to manually find the line number that each book starts on and put those numbers in a dictionary, but I didn't consider, at the time, that the file has jumbled lines, for example:

1:16 And God made two great lights; the greater light to rule the day,
and the lesser light to rule the night: he made the stars also.

1:17 And God set them in the firmament of the heaven to give light
upon the earth, 1:18 And to rule over the day and over the night, and
to divide the light from the darkness: and God saw that it was good.

So, if I search for God, the line that comes immediately after 1:16 is correctly listed as chapter 1, verse 16, and the same goes for 1:17 ... but the text belonging to 1:18 is also listed as chapter 1, verse 17, because 1:18 starts in the middle of a line.

I need to figure out how to find every verse marker that, like 1:18, sits in the middle of a line and move it to a new line. Obviously, the line numbers in the first_lines dictionary in the code below will change, but that's minor (I will simply go back through the text file and manually look up the starting line numbers). I really appreciate any help. The text bible can be found here: http://www.gutenberg.org/ebooks/10 Also, here is the code:

import os
import sys
import re


print "%30s %-3s %s %4s\n" % ("","King", "James", "Bible")
word_search = raw_input(r'Enter a word to search: ')
book = open("KJV.txt", "rb")
first_lines = {36: 'Genesis', 4812: 'Exodus', 8867: 'Leviticus', 11749: 'Numbers', 15718: 'Deuteronomy',
           18909: 'Joshua', 21070: 'Judges', 23340: 'Ruth', 23651: 'I Samuel', 26641: 'II Samuel',
           29094: 'I Kings', 31990: 'II Kings', 34706: 'I Chronicles', 37378: 'II Chronicles',
           40502: 'Ezra', 41418: 'Nehemiah', 42710: 'Esther', 43352: 'Job', 45937: 'Psalms', 53537: 'Proverbs',
           56015: 'Ecclesiastes', 56711: 'Song of Solomon', 57076: 'Isaiah', 61550: 'Jeremiah',
           66480: 'Lamentations', 66961: 'Ezekiel', 71548: 'Daniel', 72933: 'Hosea', 73620: 'Joel',
           73874: 'Amos', 74359: 'Obadiah', 74441: 'Jonah', 74604: 'Micah', 74985: 'Nahum', 75160: 'Habakkuk',
           75348: 'Zephaniah',75550: 'Haggai', 75676: 'Zechariah', 76428: 'Malachi', 76646: 'Matthew',
           79708: 'Mark', 81680: "Luke", 85006: 'John', 87543: 'Acts', 90654: 'Romans', 91851: 'I Corinthians',
           93065: 'II Corinthians', 93830: 'Galatians', 94257: 'Ephesians', 94612: 'Philippians', 94896: 'Colossians',
           95145: 'I Thessalonians', 95390: 'II Thessalonians', 95515: 'I Timothy', 95833: 'II Timothy',
           96063: 'Titus', 96183: 'Philemon', 96243: 'Hebrews', 97113: 'James', 97430: 'I Peter', 97719: 'II Peter',
           97906: 'I John', 98249: 'II John', 98295: 'III John', 98340: 'Jude', 98427: 'Revelation'}

for ln, line in enumerate(book):
     match = re.match(r'(\d+):(\d+)', line)

     if match:
          chapter = match.group(1)
          verse = match.group(2)

     if word_search in line: 
          first_line = max(l for l in first_lines if l < ln)
          bibook = first_lines[first_line]

          template = "\nLine: {0}\nString: {1}\nBook: {2}\nChapter: {3}\nVerse: {4}\n"
          output = template.format(ln, line, bibook, chapter, verse)
          print output

This is a fairly complex problem. Since I don't know Python, below is a Perl
solution that shows one of possibly many regex approaches. It's what I came up
with in 5 minutes; I'm sure it can be refactored to be more efficient, but you should
get the drift.

use strict;
use warnings;

my $str = '
1:16 And God made two great lights; the greater light to rule the day,
and the lesser light to rule the night: he made the stars also.

1:17 And God set them in the firmament of the heaven to give light
upon the earth, 1:18 And to rule over the day and over the night, and
to divide the light from the darkness: and God saw that it was good.
';

my $word_search = 'God';

while ( $str =~ /

  (?:^|\s)
  (\d+) : (\d+)    # group 1,2
  (?:\s|$)
  (                # group 3
    (?:
        (?!
           \s+ \d+ : \d+ (?:\s|$)
        )
        .
    )*
    $word_search
    (?:
       (?!
          \s+ \d+ : \d+ (?:\s|$)
       )
       .
    )*
  )

/xsg )

{
  print "\nChapter $1, Verse $2\n";
  print "Verse: $3\n";
}

__END__

Output:

Chapter 1, Verse 16
Verse: And God made two great lights; the greater light to rule the day,
and the lesser light to rule the night: he made the stars also.

Chapter 1, Verse 17
Verse: And God set them in the firmament of the heaven to give light
upon the earth,

Chapter 1, Verse 18
Verse: And to rule over the day and over the night, and
to divide the light from the darkness: and God saw that it was good.

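Edit: Compressed, it looks like this: /(?:^|\s)(\d+):(\d+)(?:\s|$)((?:(?!\s+\d+:\d+(?:\s|$)).)*$word_search(?:(?!\s+\d+:\d+(?:\s|$)).)*)/sg

The flags are /s ('single line', so . also matches newlines) and /g ('global'). The expanded version above additionally uses /x so that whitespace and comments inside the pattern are ignored.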

Let's look at this snippet around 9:3:

stand before the children of Anak! 9:3 Understand therefore this day,

If you search for children of Anak, then the code you posted (assuming the regex can be fixed) would return 9:3, even though it should be 9:2. So we need to rethink how we want to approach the problem.

I suggest

contents=book.read()
re.split(r'(\d+:\d+)',contents)

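This cleaves the entire text on the chapter/verse numbers. Because the pattern is wrapped in a capturing group, the numbers themselves are kept in the resulting list, alternating with the verse text.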
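To see what that split produces, here is a minimal sketch run on the two-verse sample from the question. It is written for Python 3 (the full program below is Python 2), and the names sample, markers, texts and verses are only illustrative.

```python
import re

# Sample from the question: verse 1:18 starts in the middle of a line.
sample = (
    "1:17 And God set them in the firmament of the heaven to give light\n"
    "upon the earth, 1:18 And to rule over the day and over the night, and\n"
    "to divide the light from the darkness: and God saw that it was good.\n"
)

# The capturing group makes re.split keep the chapter:verse markers.
parts = re.split(r'(\d+:\d+)', sample)

# parts[0] is whatever precedes the first marker (empty here); after that
# the list alternates marker, text, marker, text, ...
markers = parts[1::2]
# Collapse line breaks and runs of spaces inside each verse.
texts = [' '.join(t.split()) for t in parts[2::2]]
verses = list(zip(markers, texts))

for ref, text in verses:
    print(ref, text)
```

Each verse now carries its own marker, so a hit inside the 1:18 text is reported as 1:18 regardless of where the original line breaks fell.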

import re
import itertools
import textwrap

if __name__=='__main__':
    print "{0:^78}".format("King James Bible")

    books=iter(['Genesis', 'Exodus', 'Leviticus', 'Numbers', 'Deuteronomy', 'Joshua',
           'Judges', 'Ruth', 'I Samuel', 'II Samuel', 'I Kings', 'II Kings',
           'I Chronicles', 'II Chronicles', 'Ezra', 'Nehemiah', 'Esther', 'Job', 'Psalms',
           'Proverbs', 'Ecclesiastes', 'Song of Solomon', 'Isaiah', 'Jeremiah',
           'Lamentations', 'Ezekiel', 'Daniel', 'Hosea', 'Joel', 'Amos', 'Obadiah',
           'Jonah', 'Micah', 'Nahum', 'Habakkuk', 'Zephaniah', 'Haggai', 'Zechariah',
           'Malachi', 'Matthew', 'Mark', 'Luke', 'John', 'Acts', 'Romans', 'I Corinthians',
           'II Corinthians', 'Galatians', 'Ephesians', 'Philippians',
           'Colossians', 'I Thessalonians', 'II Thessalonians', 'I Timothy', 'II Timothy',
           'Titus', 'Philemon', 'Hebrews', 'James', 'I Peter', 'II Peter', 'I John',
           'II John', 'III John', 'Jude', 'Revelation'])

    with open("KJV.txt", "rb") as book:
        contents=book.read()
        data=re.split(r'(\d+:\d+)',contents)[1:]    
        del contents

    word_search = raw_input(r'Enter a word to search: ')

    for chapter_verse, line in itertools.izip(*[iter(data)]*2):
        if chapter_verse=='1:1':
            book=next(books)
        line=' '.join(line.split())
        if word_search in line:
            line=textwrap.fill(line,width=78)
            print('''\
{b} {c}
{l}
'''.format(b=book,c=chapter_verse,l=line))

Running test.py on "consuming fire" yields

% test.py 
                               King James Bible                               
Enter a word to search: consuming fire
Deuteronomy 4:24
For the LORD thy God is a consuming fire, even a jealous God.

Deuteronomy 9:3
Understand therefore this day, that the LORD thy God is he which goeth over
before thee; as a consuming fire he shall destroy them, and he shall bring
them down before thy face: so shalt thou drive them out, and destroy them
quickly, as the LORD hath said unto thee.

Hebrews 12:29
For our God is a consuming fire.

PS. Hard-coding the first_line numbers of books is fragile -- don't use them. (What happens if someone decides to delete the header text that comes with the Gutenberg file, or accidentally inserts some blank newlines somewhere, etc.)

All you really need is the order of the books, since each new book starts with chapter_verse 1:1.
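As a rough illustration of that idea, the sketch below (Python 3) advances to the next book name every time a 1:1 marker appears. The function name label_verses and the shortened two-book text are stand-ins made up for the example, and it assumes nothing before the first book contains a chapter:verse marker.

```python
import re


def label_verses(text, book_names):
    """Yield (book, 'chapter:verse', verse_text) for each verse in text."""
    # Drop everything before the first marker (for example a file header).
    parts = re.split(r'(\d+:\d+)', text)[1:]
    names = iter(book_names)
    book = None
    for ref, body in zip(parts[0::2], parts[1::2]):
        if ref == '1:1':
            # A new book begins whenever the numbering restarts at 1:1.
            book = next(names)
        yield book, ref, ' '.join(body.split())


# Shortened stand-in for the real file: two books, two verses each.
text = (
    "Header text that is ignored\n"
    "1:1 In the beginning God created the heaven\n"
    "and the earth. 1:2 And the earth was without form,\n"
    "\n"
    "1:1 Now these are the names of the children of Israel, "
    "1:2 Reuben, Simeon, Levi, and Judah,\n"
)

result = list(label_verses(text, ['Genesis', 'Exodus']))
for book, ref, verse in result:
    print(book, ref, verse)
```

No line numbers are involved, so deleting the header or adding blank lines does not change which book a verse is assigned to.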

Try changing your regular expression to:

^(\d+):(\d+)

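The ^ should anchor matches to the beginning of the text. Note that re.match already only matches at the start of the string, so with re.match the anchor makes no difference; it matters when the pattern is used with re.search.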
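A quick check of how anchoring behaves in Python's re module, using one line from the question (Python 3; the variable names are only illustrative):

```python
import re

# A line from the question where the verse marker sits mid-line.
line = "upon the earth, 1:18 And to rule over the day and over the night, and"

# re.match only tries the very start of the string, with or without ^.
plain_match = re.match(r'(\d+):(\d+)', line)
anchored_match = re.match(r'^(\d+):(\d+)', line)

# re.search scans the whole string and does find the mid-line marker.
found = re.search(r'(\d+):(\d+)', line)

print(plain_match)      # None
print(anchored_match)   # None
print(found.groups())   # ('1', '18')
```

So in a line-by-line loop, a marker like 1:18 is only seen when the whole line is scanned, not just its start.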

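Here's a regex that matches (I think!) the chapter:verse headings that do not start a line, i.e. those preceded by a character that is neither a newline nor a digit.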

r'[^\n\d](\d+:\d+)'

If you want them grouped, as in your code

r'[^\n\d](\d+):(\d+)'
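To check what the first pattern picks up, here is a small sketch run on the sample from the question (Python 3; sample and mid_line are just names for the example):

```python
import re

sample = (
    "1:16 And God made two great lights; the greater light to rule the day,\n"
    "and the lesser light to rule the night: he made the stars also.\n"
    "\n"
    "1:17 And God set them in the firmament of the heaven to give light\n"
    "upon the earth, 1:18 And to rule over the day and over the night, and\n"
    "to divide the light from the darkness: and God saw that it was good.\n"
)

# [^\n\d] requires one preceding character that is neither a newline nor a
# digit, so markers at the start of a line (1:16, 1:17) are skipped.
mid_line = re.findall(r'[^\n\d](\d+:\d+)', sample)
print(mid_line)  # ['1:18']
```

Keep in mind that the match also consumes the character in front of the marker (here a space), so a substitution based on this pattern replaces that character as well.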

I used the code below to re-lineate the text from Project Gutenberg. This still leaves some awkward line breaks, though -- it's not one verse per line.

>>> import re
>>> with open('pg10.txt', 'r') as kjb_file:
...     kjb_text = kjb_file.read()
... 
>>> kjb_text = re.sub(r'[^\n\d](\d+:\d+)', r'\r\n\r\n\g<1>', kjb_text)
>>> with open('kjb_new.txt', 'w') as kjb_new:
...     kjb_new.write(kjb_text)
...