python regex mediawiki section parsing

Source: https://www.devze.com, 2023-04-13 08:52 (reposted from the web)

I have text similar to the following:

==Mainsection1==  
Some text here  
===Subsection1.1===  
Other text here  

==Mainsection2==  
Text goes here  
===Subsecttion2.1===  
Other text goes here. 

In the above text, Mainsection1 and Mainsection2 have different names, which can be anything the user wants. The same goes for the subsections.

What I want to do with a regex is get the text of each main section, including its subsections (if there are any). Yes, this is from a wiki page. All main section names start with == and end with ==. All subsection names have more than two = signs on each side.

regex =re.compile('==(.*)==([^=]*)', re.MULTILINE)  
regex.findall(text)

But the above returns each section separately. That is, it returns a main section correctly, but it treats each subsection as a section of its own.

I hope someone can help me with this, as it's been bugging me for some time.
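To make the problem concrete, here is a small sketch that runs the pattern above on a trimmed copy of the sample text (the variable names are only for illustration). Because (.*) is free to match =, each ===...=== heading is picked up as a title of its own, and ([^=]*) stops at the first = of the next heading:

```python
import re

# Trimmed copy of the sample text from the question (no trailing spaces).
text = (
    "==Mainsection1==\n"
    "Some text here\n"
    "===Subsection1.1===\n"
    "Other text here\n"
    "\n"
    "==Mainsection2==\n"
    "Text goes here\n"
    "===Subsecttion2.1===\n"
    "Other text goes here."
)

regex = re.compile('==(.*)==([^=]*)', re.MULTILINE)
matches = regex.findall(text)

# Every heading, main or sub, starts a new tuple.
titles = [title for title, _body in matches]
print(titles)
# ['Mainsection1', '=Subsection1.1=', 'Mainsection2', '=Subsecttion2.1=']
```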

edit: The result should be:

[('Mainsection1', 'Some text here\n===Subsection1.1===\nOther text here\n'), ('Mainsection2', 'Text goes here\n===Subsecttion2.1===\nOther text goes here.\n')]

Edit 2:

I have rewritten my code to not use a regex. I came to the conclusion that it's easy enough to just parse it myself, which makes it a bit more readable for me.

So here is my code:

def createTokensFromText(text):    
    sections = []
    cur_section = None
    cur_lines = []


    for line in text.split('\n'):
        line = line.strip()
        if line.startswith('==') and not line.startswith('==='):
            if cur_section:
                sections.append( (cur_section, '\n'.join(cur_lines)) )
                cur_lines = []
            cur_section = line
            continue
        if cur_section:
            cur_lines.append(line)

    if cur_section:
        sections.append( (cur_section, '\n'.join(cur_lines)) )
    return sections
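For comparison with the requested output, here is a hedged sketch of a variant of the function above that strips the surrounding == from each title. The name split_main_sections and the sample variable are hypothetical, introduced only for this illustration; the line-by-line approach is the same.

```python
def split_main_sections(text):
    """Hypothetical variant of createTokensFromText: titles without '=='."""
    sections = []
    title = None
    lines = []
    for line in text.split('\n'):
        line = line.strip()
        # A main heading starts with exactly two '=' signs.
        if line.startswith('==') and not line.startswith('==='):
            if title is not None:
                sections.append((title, '\n'.join(lines)))
            title = line.strip('=').strip()
            lines = []
            continue
        # Text before the first main heading is ignored.
        if title is not None:
            lines.append(line)
    if title is not None:
        sections.append((title, '\n'.join(lines)))
    return sections


# Sample text with the trailing spaces seen in the question.
sample = (
    "==Mainsection1==  \n"
    "Some text here  \n"
    "===Subsection1.1===  \n"
    "Other text here  \n"
    "\n"
    "==Mainsection2==  \n"
    "Text goes here  \n"
    "===Subsecttion2.1===  \n"
    "Other text goes here. "
)

result = split_main_sections(sample)
for name, body in result:
    print(repr(name), repr(body))
```

Note that joining the lines drops the final newline of each body, so the strings differ slightly from the ones listed in the first edit.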

Thanks everyone for the help!

All the answers provided have helped me a lot!


First, it should be known, I know a little about Python, but I have never programmed formally in it... Codepad said this works, so here goes! :D -- Sorry the expression is so complex:

(?<!=)==([^=]+)==(?!=)([\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))

This does what you asked for, I believe! On Codepad, this code:

import re

wikiText = """==Mainsection1==
Some text here
===Subsection1.1===
Other text here

==Mainsection2==
Text goes here
===Subsecttion2.1===
Other text goes here. """

# Raw string so that \s and \S reach the regex engine unchanged
outputArray = re.findall(r'(?<!=)==([^=]+)==(?!=)([\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))', wikiText)
print(outputArray)

Produces this result:

[('Mainsection1', '\nSome text here\n===Subsection1.1===\nOther text here\n\n'), ('Mainsection2', '\nText goes here\n===Subsecttion2.1===\nOther text goes here. ')]

EDIT: Broken down, the expression essentially says:

01 (?<!=)        # First, look behind to assert that there is not an equals sign
02 ==            # Match two equals signs
03 ([^=]+)       # Capture one or more characters that are not an equals sign
04 ==            # Match two equals signs
05 (?!=)         # Then verify that there are no equals signs following this
06 (             # Start a capturing group
07   [\s\S]*?    #   Match zero or more of ANY character (even CrLf), but BE LAZY
08   (?=         #   Look ahead to verify that either...
09     $         #     this is the end of the source text
10     |         #     -OR-
11     (?<!=)    #     when I look behind there is no equals sign
12     ==        #     then there are two equals signs
13     [^=]+     #     then one or more characters that are not equals signs
14     ==        #     then two equals signs
15     (?!=)     #     then verify that there are no equals signs following this
16   )           #   End look-ahead group
17 )             # End capturing group 

Line 03 and Line 06 specify the capturing groups for the Main Section Title and the Main Section Content, respectively.

Line 07 begs for a lot of explanation if you're not pretty fluent in Regex...

  • The \s and \S inside a character class [] will match anything that is whitespace or is not whitespace (i.e. ANYTHING WHATSOEVER). One alternative is the . operator, but depending on the flags in effect (re.DOTALL in Python) it might or might not match line breaks (CrLf, or Carriage-Return/Line-Feed). Since you want to match across multiple lines, [\s\S] is the easiest way to ensure a match.
  • The *? at the end means that it will match zero or more instances of the "anything" character class, but BE LAZY ABOUT IT. "Lazy" quantifiers (sometimes called "reluctant") are the opposite of the default "greedy" quantifier (without the ? following it): a lazy quantifier will not consume a source character unless the text at the current position cannot be matched by the part of the expression that follows the quantifier. In other words, this will consume characters until it finds either the end of the source text OR another main section, which is marked by exactly two equals signs on either side of one or more characters that are not equals signs (whitespace included). Without the lazy operator, it would try to consume the entire source text and then "backtrack" until it could match one of the things after it in the expression (end of source or a section header).

Line 08 is a "look-ahead" that specifies that the expression following it should be ABLE to be matched, but should not be consumed.
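The difference between the lazy and the greedy quantifier described above can be checked with a short sketch. The names HEADER, NEXT and wiki_text are made up for this illustration; the pattern pieces are the ones from the expression above. The last call assumes Python's re.DOTALL flag as the way to let . match line breaks.

```python
import re

wiki_text = (
    "==Mainsection1==\n"
    "Some text here\n"
    "===Subsection1.1===\n"
    "Other text here\n"
    "\n"
    "==Mainsection2==\n"
    "Text goes here\n"
    "===Subsecttion2.1===\n"
    "Other text goes here. "
)

HEADER = r'(?<!=)==([^=]+)==(?!=)'       # lines 01-05 of the breakdown
NEXT = r'(?=$|(?<!=)==[^=]+==(?!=))'     # lines 08-16 of the breakdown

# Lazy: stop at the first position where the look-ahead succeeds.
lazy = re.findall(HEADER + r'([\s\S]*?' + NEXT + r')', wiki_text)

# Greedy: run to the end of the text, where the $ alternative succeeds.
greedy = re.findall(HEADER + r'([\s\S]*' + NEXT + r')', wiki_text)

# With re.DOTALL, "." behaves like [\s\S].
dotall = re.findall(HEADER + r'(.*?' + NEXT + r')', wiki_text, re.DOTALL)

print(len(lazy))       # 2: one tuple per main section
print(len(greedy))     # 1: the first section swallows everything after it
print(dotall == lazy)  # True
```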

END EDIT

AFAIK, it has to be this complex in order to properly exclude the subsections... If you want to match the Section Name and Section Content into named groups, you can try this:

(?<!=)==(?P<SectionName>[^=]+)==(?!=)(?P<SectionContent>[\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))
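As a rough sketch of how the named groups might be used, the following walks the matches with re.finditer and builds a dictionary. The variable names and the .strip() call are choices made for this example only, not part of the expression above.

```python
import re

pattern = re.compile(
    r'(?<!=)==(?P<SectionName>[^=]+)==(?!=)'
    r'(?P<SectionContent>[\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))'
)

wiki_text = (
    "==Mainsection1==\n"
    "Some text here\n"
    "===Subsection1.1===\n"
    "Other text here\n"
    "\n"
    "==Mainsection2==\n"
    "Text goes here\n"
    "===Subsecttion2.1===\n"
    "Other text goes here. "
)

# Map each main section name to its content, trimmed of outer whitespace.
sections = {
    m.group('SectionName'): m.group('SectionContent').strip()
    for m in pattern.finditer(wiki_text)
}

for name, content in sections.items():
    print(name, '->', repr(content))
```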

If you'd like, I can break it down for you! Just ask! EDIT (see edit above) END EDIT


The problem here is that ==(.*)== also matches ===Subsection===, capturing =Subsection= as the title, so the first thing to do is to make sure there is no = inside the title: ==([^=]*)==([^=]*).

We then need to make sure that there is no = before the beginning of the match; otherwise the first = of the three is skipped and the subtitle is still matched. A negative lookbehind does the trick: (?<!=)==([^=]*)==([^=]*). Here (?<!=) means "match only if not preceded by =".

We can do the same at the end of the title with a negative lookahead, to be sure, which gives the final result (?<!=)==([^=]*)==(?!=)([^=]*).
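The effect of each step can be checked on a single heading. This is a small sketch with made-up variable names (sub, main, step1 to step3); only the title part of each pattern is used.

```python
import re

sub = '===Subsection1.1==='
main = '==Mainsection1=='

step1 = r'==([^=]*)=='              # no "=" allowed inside the title
step2 = r'(?<!=)==([^=]*)=='        # plus: no "=" right before the match
step3 = r'(?<!=)==([^=]*)==(?!=)'   # plus: no "=" right after the match

print(re.findall(step1, sub))   # ['Subsection1.1']  (first "=" skipped)
print(re.findall(step2, sub))   # []
print(re.findall(step3, sub))   # []
print(re.findall(step3, main))  # ['Mainsection1']
```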

>>> re.findall('(?<!=)==([^=]*)==(?!=)([^=]*)', x,re.MULTILINE)
[('Mainsection1', '\nSome text here\n'),
 ('Mainsection2', '\nText goes here\n')]

You could also remove the check at the end of the title and replace it with a newline. That may be better if you are sure there is a new line at the end of each title.

>>> re.findall('(?<!=)==([^=]*)==\n([^=]*)', x,re.MULTILINE)
[('Mainsection1', 'Some text here\n'), ('Mainsection2', 'Text goes here\n')]

EDIT :

section = re.compile(r"(?<!=)==([^=]*)==(?!=)")

result = []
mo = section.search(x)
previous_end = 0
previous_section = None
while mo is not None:
    start = mo.start()
    if previous_section:
        result.append((previous_section, x[previous_end:start]))
    previous_section = mo.group(0)
    previous_end = mo.end()
    mo = section.search(x, previous_end)
result.append((previous_section, x[previous_end:]))
print(result)

It's simpler than it looks: we repeatedly search for the next section title after the previous one, and we add the previous title to the result together with the text between the end of that title and the beginning of the new one. Adjust it to suit your style and your needs. The result is:

[('==Mainsection1==',
  '  \nSome text here  \n===Subsection1.1===  \nOther text here  \n\n'),
 ('==Mainsection2==',
  '  \nText goes here  \n===Subsecttion2.1===  \nOther text goes here. ')]
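The same idea can also be sketched with re.finditer, wrapped in a function. The names SECTION and split_sections are hypothetical. Two deliberate differences from the loop above: the title is taken from group(1), so it comes back without the == markers, and the pattern uses [^=]+ rather than [^=]*, an assumption made here so that a ====...==== heading cannot match with an empty title.

```python
import re

# Main-section heading: exactly two "=" on each side of a non-empty title.
SECTION = re.compile(r"(?<!=)==([^=]+)==(?!=)")


def split_sections(text):
    """Return (title, body) pairs, one per main section (illustrative)."""
    matches = list(SECTION.finditer(text))
    result = []
    for i, mo in enumerate(matches):
        # The body runs up to the next main heading, or to the end of text.
        if i + 1 < len(matches):
            end = matches[i + 1].start()
        else:
            end = len(text)
        result.append((mo.group(1), text[mo.end():end]))
    return result


sample = (
    "==Mainsection1==\n"
    "Some text here\n"
    "===Subsection1.1===\n"
    "Other text here\n"
    "\n"
    "==Mainsection2==\n"
    "Text goes here\n"
    "===Subsecttion2.1===\n"
    "Other text goes here."
)

sections = split_sections(sample)
for title, body in sections:
    print(repr(title), repr(body))
```

Unlike the loop above, this returns an empty list when the text contains no main heading.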