I am using Beautiful Soup to replace occurrences of a pattern with an href link inside an HTML file.
I am running into the problem described below, using this substitution:
modified_contents = re.sub("([^http://*/s]APP[a-z]{2}[0-9]{2})", "<a href=\"http://stack.com=\\1\">\\1</a>", str(soup))
Sample input 1:
Input File contains APPdd34
Output File contains <a href="http://stack.com=APPdd34"> APPdd34</a>
Sample input 2:
Input File contains <a href="http://stack.com=APPdd34"> APPdd34</a>
Output File contains <a href="http://stack.com=<a href="http://stack.com=APPdd34"> APPdd34</a>"> <a href="http://stack.com=APPdd34"> APPdd34</a></a>
The desired Output File 2 is the same as Sample Input File 2, i.e. text that is already linked should be left alone.
How can I rectify this problem?
This may not entirely solve your problem, because I don't know what an entire input file could look like, but I hope it gives you a direction to take.
from BeautifulSoup import BeautifulSoup, Tag  # BeautifulSoup 3, Python 2
# Case 1: the ID appears as plain text
text = """APPdd34"""
soup = BeautifulSoup(text)
var1 = soup.text
# Case 2: the ID is already wrapped in a link, so read the link text only
text = """<a href="http://stack.com=APPdd34"> APPdd34</a>"""
soup = BeautifulSoup(text)
var2 = soup.find('a').text
# Build fresh <a> tags from the extracted text
soup = BeautifulSoup("<p>Some new html</p>")
tag1 = Tag(soup, "a",{'href':'http://stack.com='+var1,})
tag1.insert(0,var1) # Insert text
tag2 = Tag(soup, "a",{'href':'http://stack.com='+var2,})
tag2.insert(0,var2)
soup.insert(0,tag1)
soup.insert(3,tag2)
print soup.prettify()
So basically, just use BeautifulSoup to extract the text and then you can build Tags from there.
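If you would rather stay with re.sub, here is a rough sketch of a different technique (plain regex, not Beautiful Soup): let the pattern also match whole existing <a>...</a> elements and return those unchanged, so only bare IDs get wrapped. The names PATTERN and link_ids are made up for this example, and it assumes links are not nested and that the IDs never appear inside attributes of other tags.

```python
import re

# Alternative 1 matches a complete existing <a>...</a> element (group 1).
# Alternative 2 matches a bare ID such as APPdd34 (group 2).
# Because the regex scans left to right, an ID inside an existing link is
# consumed as part of group 1 and is never seen as a bare ID.
PATTERN = re.compile(r'(<a\b[^>]*>.*?</a>)|(APP[a-z]{2}[0-9]{2})', re.DOTALL)


def link_ids(html):
    """Wrap bare IDs in a link; leave already-linked IDs untouched."""
    def repl(match):
        if match.group(1):
            # Existing link: return it exactly as it was.
            return match.group(1)
        ident = match.group(2)
        return '<a href="http://stack.com=%s">%s</a>' % (ident, ident)
    return PATTERN.sub(repl, html)


if __name__ == "__main__":
    print(link_ids("Input File contains APPdd34"))
    print(link_ids('Input File contains <a href="http://stack.com=APPdd34"> APPdd34</a>'))
```

Running the function on its own output should change nothing, which is the behaviour asked for in sample 2. For arbitrary HTML a parser-based approach is still the safer route; this only covers the simple case shown in the question.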