Removing spaces and newlines between tags in HTML (aka unformatting) in Python

Developer https://www.devze.com 2023-01-04 13:59 Source: Internet

An example:

<p> Hello</p>
<div>hgello</div>
<pre>
   code
    code
</pre>

turns into something like:

<p> Hello</p><div>hgello</div><pre>
    code
     code
</pre>

How can I do this in Python? I also make heavy use of <pre> tags, so replacing every '\n' with '' is not an option.

What's the best way to do that?


You could use re.sub(r">\s*<", "><", html), where html is your HTML string.
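
As a minimal sketch of that call, here it is applied to the sample from the question. The variable names html and compact are only illustrative, and the sample's last tag is written as a proper closing </pre>.

```python
import re

# Sample input from the question, with the final tag written as </pre>.
html = "<p> Hello</p>\n<div>hgello</div>\n<pre>\n   code\n    code\n</pre>"

# Remove any run of whitespace (including newlines) that sits
# directly between a '>' and the next '<'.
compact = re.sub(r">\s*<", "><", html)

print(compact)
```

The lines inside the <pre> block are left alone here because the whitespace after <pre> is followed by text rather than by another tag.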

Maybe string.replace(">\n", ">"), i.e. look for a closing bracket followed by a newline and remove the newline.
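
A small sketch of that plain-string variant on the question's sample, mainly to show where it behaves differently. The names html and joined are illustrative only.

```python
# Sample input from the question, with the final tag written as </pre>.
html = "<p> Hello</p>\n<div>hgello</div>\n<pre>\n   code\n    code\n</pre>"

# str.replace works on literal text: every '>' followed directly by a
# newline loses that newline. This includes the newline right after <pre>.
joined = html.replace(">\n", ">")

print(joined)
```

Note that this only handles a single '\n' directly after a '>'. Indentation or '\r\n' line endings between tags would be left in place, so it is a narrower tool than a regex.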


I would use a Python regular expression via re.sub (str.replace matches its argument literally and does not understand '\s'):

re.sub(r">\s+<", "><", string)

Here '\s' matches any whitespace character, and the '+' after it means one or more of them. Since the pattern matches only whitespace between a '>' and a '<', the text inside the block is never touched, so there is no risk of the substitution replacing

<pre>
    code
     code
</pre>

with

<pre></pre>

More information about regular expressions can be found here, here and here.
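
One case the patterns above do not cover is a <pre> block that itself contains tags: whitespace between those inner tags would be collapsed too. The following is only a rough sketch of one way around that, not something from the answers above. The helper name unformat and the PATTERN constant are hypothetical, and it assumes every <pre> is closed by a matching </pre> and that <pre> blocks are not nested.

```python
import re

# Hypothetical helper, not part of the original answers.
# Alternative 1 captures a whole <pre>...</pre> block so it can be kept as-is.
# Alternative 2 matches whitespace that sits between a '>' and a '<'.
PATTERN = re.compile(
    r"(<pre\b.*?</pre>)|(?<=>)\s+(?=<)",
    re.DOTALL | re.IGNORECASE,
)

def unformat(html):
    # Keep captured <pre> blocks unchanged; drop inter-tag whitespace.
    return PATTERN.sub(lambda m: m.group(1) or "", html)

sample = "<div>a</div>\n<pre><b>x</b>\n<i>y</i>\n</pre>\n<p>z</p>"
result = unformat(sample)
print(result)
```

Regular expressions are not a full HTML parser, so for messy real-world markup a proper parser is likely the safer choice. For input shaped like the question's, this should be enough.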
