How to extract a string with regular expressions in Python

Developer https://www.devze.com 2023-02-20 13:17 Source: Internet
I am trying to extract a substring from a string in Python.

My data file contains lines of the Quran, where each line is marked with the chapter and verse number at the beginning of the string. I want to extract the first number and the second number and write these to a line in another text file. Here is an example of a few lines of the txt file:

2|12|Of a surety, they are the ones who make mischief, but they realise (it) not.
2|242|Thus doth Allah Make clear His Signs to you: In order that ye may understand.

As you can see, the chapter and verse numbers can contain multiple digits, so just counting a fixed number of characters from the start of the string would not be adequate. Is there a way to use regular expressions to extract, as strings, the first number (chapter) and the second number (verse)?

The code that I am writing this for will write the chapter and verse strings to an ARFF file. An example of a line in the ARFF file would be:

1,0,0,0,0,0,0,0,0,2,12

where the last two values are the chapter and verse.

Here is the for loop that writes, for each verse, the attributes I am interested in. I then want to write the chapter and verse to the end of the line, using regular expressions to extract the relevant substring from each line.

for line in verses:
    for item in topten:
        count = line.count(item)
        ARFF_FILE.write(str(count) + ",")
    # Here is where i could use regular expressions to extract the desired substring 
    # verse and chapter then write these to the end of a line in the arff file.
    ARFF_FILE.write("\n")

I think the regular expression should be something like this, using group(0) to get the chapter number (the first number, before the pipe):

"^(\d+)\|(\d)\|" 

and then the verse number should be obtained with group(1),

but I don't know how to implement this in Python. Does anyone have any ideas?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Response to a question:

I have just tried to implement your technique but am getting an "IndexError: list index out of range". My code is:

for line in verses:
 for item in topten:
     parts = line.split('|')

     count = line.count(item)
     ARFF_FILE.write(str(count) + ",")
 ARFF_FILE.write(parts[0] + ",")
 ARFF_FILE.write(parts[1])  
 ARFF_FILE.write("\n")


If all your lines are formatted like A|B|C, then you don't need any regex, just split it.

for line in fp:
    parts = line.split('|') # or line.split('|', 2) if the last part can contain |
    # use parts[0], parts[1]
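
A minimal sketch of the split approach with a guard for malformed lines. It assumes that an "IndexError: list index out of range" on parts[1] comes from a blank line or a line without pipes in the data file, which is only a guess about the input. The function name chapter_and_verse and the sample list are illustrative and not part of the original code.

```python
def chapter_and_verse(line):
    """Return (chapter, verse) as strings, or None if the line is malformed."""
    # maxsplit=2 keeps any '|' inside the verse text in the third part
    parts = line.rstrip("\n").split("|", 2)
    if len(parts) < 3:
        return None  # blank line, or fewer than two pipes
    return parts[0], parts[1]


# Hypothetical sample input: two data lines with a blank line between them
lines = [
    "2|12|Of a surety, they are the ones who make mischief, but they realise (it) not.\n",
    "\n",  # on a line like this, parts[1] would raise IndexError
    "2|242|Thus doth Allah Make clear His Signs to you: In order that ye may understand.\n",
]

for line in lines:
    result = chapter_and_verse(line)
    if result is None:
        continue  # skip lines that do not look like chapter|verse|text
    chapter, verse = result
    print(chapter + "," + verse)
```

Splitting once per line, outside any inner loop, also avoids repeating the same work for every item being counted.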


I think the easiest way would be to use re.split() to get the text of the verses and re.findall() to get the chapter and verse numbers. The results are stored in lists that can be used later. Here is an example of the code:

#!/usr/bin/env python

import re

# string to be parsed
Quran= '''2|12|Of a surety, they are the ones who make mischief, but they realise (it) not.
2|242|Thus doth Allah Make clear His Signs to you: In order that ye may understand.'''

# list containing the text of all the verses
verses=re.split(r'[0-9]+\|[0-9]+\|',Quran)
verses.remove("")

# list containing the chapter and verse numbers:
#
#   strictly, the regex would be r'[0-9]+\|[0-9]+\|'
#   I omitted the last pipe character so that later, when you split
#   the string to get the chapter and verse numbers, you won't have an
#   empty string at the end of the list
#
chapter_verse=re.findall(r'[0-9]+\|[0-9]+',Quran)


# looping over the text of the verses, assuming len(verses) == len(chapter_verse)
for index in range(len(verses)):
    chapterNumber, verseNumber = chapter_verse[index].split("|")
    print("Chapter :", chapterNumber, "\tVerse :", verseNumber)
    print(verses[index])


With parentheses? Isn't that how all regular expressions work?
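
On the parentheses: in Python's re module, each parenthesised part of a pattern becomes a numbered capture group. group(0) is the whole match, and the numbered groups start at 1. The sketch below shows this on one of the sample lines. The name LINE_RE is hypothetical, and the pattern uses \d+ for both numbers so that multi-digit values such as 242 match.

```python
import re

# Hypothetical pattern name; \d+ (rather than \d) allows several digits
LINE_RE = re.compile(r"^(\d+)\|(\d+)\|")

line = "2|242|Thus doth Allah Make clear His Signs to you: In order that ye may understand."

match = LINE_RE.match(line)
if match:
    whole = match.group(0)    # the entire matched prefix, "2|242|"
    chapter = match.group(1)  # first parenthesised group, "2"
    verse = match.group(2)    # second parenthesised group, "242"
    print(chapter + "," + verse)
```

If a line does not start with two pipe-separated numbers, match() returns None, so the `if match:` check also protects against malformed lines.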
