Adding anchors to h2 in text using python and regexp

Developer https://www.devze.com 2022-12-31 19:34 Source: web

I'm trying to add anchors to all h2's in my html, using python. This code will add those anchors, but I need to fill the name of the anchors too.

Any idea if the name can be the number of the match in the loop or a slugified version of the text between the h2 tags?

Here's the code so far:

regex = '(?P<name><h2>.*?</h2>)'
text = re.sub(regex, "<a name=''/>"+r"\g<name>", text)

You can take advantage of the fact that the second argument to re.sub can be a function to do pretty much anything you'd like. Here's an example that will slugify the text inside the <h2> element:

import re

regex = r'(?P<name><h2>(.*?)</h2>)'  # Note the extra group inside the <h2>

def slugify(s):
    return s.replace(' ', '-')  # bare-bones slugify

def anchorize(matchobj):
    # group(2) is the heading text; group(1) is the whole <h2>...</h2> element
    return '<a name="%s"/>%s' % (slugify(matchobj.group(2)), matchobj.group(1))

text = re.sub(regex, anchorize, text)

(That slugify function could obviously use some work.)
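
As one possible direction for that work, here is a slightly more careful slugify. It is only a sketch: the choices to fold accents to ASCII, drop nested inline tags, and lowercase the result are assumptions about what a "good" anchor name looks like, not something the question specifies.

```python
import re
import unicodedata

def slugify(s):
    # Fold accented characters to their closest ASCII equivalents (assumed policy).
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    # Drop any inline tags nested inside the heading text, e.g. <em>...</em>.
    s = re.sub(r"<[^>]+>", "", s)
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    s = re.sub(r"[^a-zA-Z0-9]+", "-", s)
    # Trim hyphens left over at the ends and normalize case.
    return s.strip("-").lower()
```

It can be dropped in place of the bare-bones version, since anchorize only calls slugify by name. Note that two headings with the same text would still produce the same anchor name.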

You could also implement a counter with a version of anchorize that used a global counter or, better yet, a class that kept track of its own counter and implemented the special __call__ method.
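
The class-based counter could look roughly like this. The class name NumberedAnchorizer and the "section-" prefix are made up for illustration; any callable that accepts a match object works as the second argument to re.sub.

```python
import re

HEADING_RE = re.compile(r"(?P<name><h2>(.*?)</h2>)")

class NumberedAnchorizer:
    """Callable replacement that names each anchor after its match number."""

    def __init__(self, prefix="section-", start=1):
        self.prefix = prefix       # hypothetical naming scheme
        self.count = start - 1     # incremented before each use

    def __call__(self, matchobj):
        # re.sub calls this once per match, in document order.
        self.count += 1
        return '<a name="%s%d"/>%s' % (
            self.prefix, self.count, matchobj.group("name"))

text = "<h2>Intro</h2><p>Hi</p><h2>Usage</h2>"
result = HEADING_RE.sub(NumberedAnchorizer(), text)
print(result)
```

Creating a fresh NumberedAnchorizer() for each document restarts the numbering, which is the main advantage over a global counter.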

Not sure if I understand correctly, but is placing the author as the name attribute sufficient? Maybe you could use (as long as the author name doesn't contain invalid chars for an attribute):

regex = r'(?P<name><h2>(.*?)</h2>)'
print(re.sub(regex, r"<a name='\g<2>'/>\g<name>", text))
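
If the text might contain characters that are not safe inside an attribute, one option is to escape it in a replacement function first. This is a minimal sketch using html.escape from the standard library; the function name escaped_anchor is hypothetical, and it assumes the heading text does not already contain HTML entities (those would be escaped a second time).

```python
import html
import re

regex = r'(?P<name><h2>(.*?)</h2>)'

def escaped_anchor(matchobj):
    # quote=True also escapes ' and ", so the value cannot close the attribute early.
    name = html.escape(matchobj.group(2), quote=True)
    return "<a name='%s'/>%s" % (name, matchobj.group("name"))

text = "<h2>Tom's \"guide\"</h2>"
result = re.sub(regex, escaped_anchor, text)
print(result)
```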

If you need a more advanced substitution method, such as parsing the author name or looking up some sort of related id, you could define a replacement function (see the re.sub documentation):

def name_substitution(matchobj):
    name = matchobj.group(2)
    # do some processing on name here ...
    name = name.replace(' ', '_')
    return "<a name='%s'>%s</a>" % (name, matchobj.group(0))

print(re.sub(regex, name_substitution, text))