Python regex to match text in single quotes, ignoring escaped quotes (and tabs/newlines)

Developer https://www.devze.com 2023-02-20 08:16 Source: Internet
Given a file of text in which the strings I want to match are delimited by single quotes, but may contain zero or one escaped single quote as well as zero or more unescaped tab and newline characters, I want to match the text only. Example:

menu_item = 'casserole';
menu_item = 'meat 
            loaf';
menu_item = 'Tony\'s magic pizza';
menu_item = 'hamburger';
menu_item = 'Dave\'s famous pizza';
menu_item = 'Dave\'s lesser-known
    gyro';

I want to grab only the text (and spaces), ignoring the tabs/newlines, and I don't actually care if the escaped quote appears in the results, as long as it doesn't affect the match:

casserole
meat loaf
Tonys magic pizza
hamburger
Daves famous pizza
Dave\'s lesser-known gyro # quote is okay if necessary.

I have managed to create a regex that almost does it. It handles the escaped quotes, but not the newlines:

menuPat = r"menu_item = \'(.*)(\\\')?(\t|\n)*(.*)\'"
for line in inFP.readlines():
    m = re.search(menuPat, line)
    if m is not None:
        print(m.group())

There are definitely a ton of regular expression questions out there - but most are using Perl, and if there's one that does what I want, I couldn't figure it out :) And since I'm using Python, I don't care if it is spread across multiple groups, it's easy to recombine them.

Some answers have said to just parse the text with ordinary code. While I'm sure I could do that, I'm so close to having a working regex :) And it seems like it should be doable.

Update: I just realized that I am doing a Python readlines() to get each line, which obviously is breaking up the lines getting passed to the regex. I'm looking at re-writing it, but any suggestions on that part would also be very helpful.
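One possible way around the readlines() issue, sketched under assumptions: read the whole file into a single string and search that instead. In the snippet below, `in_fp` is a hypothetical in-memory stand-in for the real file object, and `menu_pat` is an illustrative pattern, not the one from the question.

```python
import io
import re

# Hypothetical stand-in for the input file; `in_fp` mirrors inFP above.
in_fp = io.StringIO(
    "menu_item = 'casserole';\n"
    "menu_item = 'meat \n"
    "            loaf';\n"
    "menu_item = 'Tony\\'s magic pizza';\n"
)

# Quoted body: non-quote, non-backslash characters, or a backslash
# followed by any character (re.DOTALL lets "." match a newline).
menu_pat = re.compile(r"menu_item = '((?:[^'\\]|\\.)*)'", re.DOTALL)

# Line by line: the item split over two lines is never seen whole.
per_line = [m.group(1) for m in map(menu_pat.search, in_fp.readlines()) if m]

# Whole file at once: every quoted value reaches the regex intact.
in_fp.seek(0)
whole = menu_pat.findall(in_fp.read())

print(len(per_line), len(whole))
```

On this sample the per-line pass finds two items and the whole-file pass finds three, because only the latter sees 'meat ... loaf' as one string.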


This tested script should do the trick:

import re
re_sq_long = r"""
    # Match single quoted string with escaped stuff.
    '            # Opening literal quote
    (            # $1: Capture string contents
      [^'\\]*    # Zero or more non-', non-backslash
      (?:        # "unroll-the-loop"!
        \\.      # Allow escaped anything.
        [^'\\]*  # Zero or more non-', non-backslash
      )*         # Finish {(special normal*)*} construct.
    )            # End $1: String contents.
    '            # Closing literal quote
    """
re_sq_short = r"'([^'\\]*(?:\\.[^'\\]*)*)'"

data = r'''
        menu_item = 'casserole';
        menu_item = 'meat 
                    loaf';
        menu_item = 'Tony\'s magic pizza';
        menu_item = 'hamburger';
        menu_item = 'Dave\'s famous pizza';
        menu_item = 'Dave\'s lesser-known
            gyro';'''
matches = re.findall(re_sq_long, data, re.DOTALL | re.VERBOSE)
menu_items = []
for match in matches:
    match = re.sub(r'\s+', ' ', match) # Collapse whitespace runs (raw string avoids an invalid-escape warning)
    match = re.sub(r'\\', '', match)   # Remove escapes (all backslashes)
    menu_items.append(match)           # Add to menu list

print(menu_items)

Here is the short version of the regex:

'([^'\\]*(?:\\.[^'\\]*)*)'

This regex is optimized using Jeffrey Friedl's "unrolling-the-loop" efficiency technique. See Mastering Regular Expressions (3rd Edition) for details.

Note that the above regex is equivalent to the following one (which is more commonly seen but is much slower on most NFA regex implementations):

'((?:[^'\\]|\\.)*)'
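As a quick sanity check of that equivalence, here is a minimal sketch that runs both patterns over the same sample and compares the captured strings. The names `unrolled`, `alternation`, and `sample` are made up for this illustration, and the check only covers this one input. It says nothing about relative speed.

```python
import re

# Both patterns are compiled with re.DOTALL so "\\." may also match
# a backslash followed by a newline.
unrolled = re.compile(r"'([^'\\]*(?:\\.[^'\\]*)*)'", re.DOTALL)
alternation = re.compile(r"'((?:[^'\\]|\\.)*)'", re.DOTALL)

# Raw string, so \' reaches the regex as a backslash plus a quote.
sample = r"""
menu_item = 'casserole';
menu_item = 'meat
            loaf';
menu_item = 'Tony\'s magic pizza';
"""

unrolled_hits = unrolled.findall(sample)
alternation_hits = alternation.findall(sample)

# Same captured strings from both forms on this input.
print(unrolled_hits == alternation_hits)
```

To compare speed on your own data, the standard-library timeit module could be wrapped around each findall call.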


This should do it:

menu_item = '((?:[^'\\]|\\')*)'

Here the (?:[^'\\]|\\')* part matches any sequence made up of characters other than ' and \, or of the literal two-character sequence \'. The first alternative, [^'\\], also matches line breaks and tabs, which you then need to replace with a single space.
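A minimal sketch of how this pattern might be combined with the whitespace cleanup, assuming the input is already in one string; `text`, `raw`, and `clean` are illustrative names only:

```python
import re

# No re.DOTALL needed: the negated class [^'\\] already matches newlines.
pattern = re.compile(r"menu_item = '((?:[^'\\]|\\')*)'")

# One item with an escaped quote, a newline and a tab inside the quotes.
text = "menu_item = 'Dave\\'s lesser-known\n\tgyro';"

raw = pattern.search(text).group(1)

# Collapse each run of whitespace to one space, then unescape \' to '.
clean = re.sub(r"\s+", " ", raw).replace("\\'", "'")
print(clean)
```

Note that this pattern only accepts \' as an escape; a backslash followed by anything else inside the quotes would stop the match.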


You could try it like this:

pattern = re.compile(r"menu_item = '(.*?)(?<!\\)'", re.DOTALL)

The match starts at the single quote that follows menu_item = and ends at the first single quote that is not preceded by a backslash. Because of re.DOTALL, the captured group also includes any newlines and tabs between the two quotes.
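A short sketch of this pattern in use, on made-up input (`text` and `tricky` are hypothetical). It also illustrates one limitation: the lookbehind checks only the single character before the quote, so a value that ends in an escaped backslash is not closed where one might expect.

```python
import re

pattern = re.compile(r"menu_item = '(.*?)(?<!\\)'", re.DOTALL)

text = "menu_item = 'Tony\\'s magic pizza';\nmenu_item = 'meat \n    loaf';"
found = pattern.findall(text)
print(found)

# Edge case: the value ends with an escaped backslash (two backslashes),
# so the closing quote is preceded by a backslash and gets rejected.
tricky = "menu_item = 'C:\\\\';"
print(pattern.search(tricky))
```

For input like the question's, where only \' appears as an escape, this limitation should not matter.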
