MySQL REGEXP for detecting long lines

Developer https://www.devze.com 2023-01-26 08:24 Source: Internet
I have some records in my database that look like this:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.......
<PRE>
one short line
an other short line
a very long line I want to detect with more than 80 caracterssssssssssssssssss
again some short lines
</PRE>
Nullam tristique nisl eu lacus fringilla porta. ........

I would like to detect long lines (more than 80 characters) inside the PRE tags so that I can edit them manually.

I tried something like this:

SELECT * FROM table WHERE column 
    REGEXP "<PRE>.*[\n\r]+[^\n\r]{80,}[\n\r]+.*</PRE>"

but it also returns records that contain no long lines.

Can someone point me in the right direction?


The [^\n\r]{80,} isn't necessarily matching a line in the PRE element where it starts searching. The .* could be matching the closing </PRE> tag and beyond, so the long line could be in another PRE element if there is one, or even in the text between PRE elements.
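The false positive can be reproduced with a small sketch, written here in Python rather than MySQL. It is only an approximation: Python's re module is not MySQL's regex engine, and re.DOTALL is used on the assumption that "." must also match line breaks for the original pattern to span several lines. The names ORIGINAL and sample and the sample record itself are made up for illustration.

```python
import re

# The pattern from the question. re.DOTALL lets "." cross line breaks,
# which is an assumption about how the MySQL engine treats ".".
ORIGINAL = re.compile(r"<PRE>.*[\n\r]+[^\n\r]{80,}[\n\r]+.*</PRE>", re.DOTALL)

# Hypothetical record: both PRE blocks contain only short lines,
# but a 90-character line sits in the text between them.
sample = (
    "<PRE>\nshort line\n</PRE>\n"
    + "x" * 90 + "\n"
    + "<PRE>\nanother short line\n</PRE>\n"
)

match = ORIGINAL.search(sample)

# The match starts at the first <PRE> and ends at the last </PRE>,
# so the long line outside the PRE blocks satisfies the {80,} part.
false_positive = match is not None
print(false_positive)
```

The record is reported even though neither PRE block contains a long line, which is the behaviour described in the question.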

I don't think there's a bullet-proof way to do what you want in MySQL, but you could try this:

<PRE>[^<]*[\n\r][^\n\r<]{80,}

You've said there won't be any other markup inside the PRE element, so any angle bracket in its content should be in the form of an escape sequence like &lt;, and the first < the regex encounters should be one in the </PRE> tag.
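The effect of excluding "<" can be checked with a short sketch, again in Python as a stand-in for MySQL's engine (an assumption, since the two engines are not identical). CONSTRAINED, between_blocks and inside_block are illustrative names, and both records are made-up data.

```python
import re

# The suggested pattern: nothing between <PRE> and the long line may be "<",
# so the match cannot run past the closing </PRE> tag.
CONSTRAINED = re.compile(r"<PRE>[^<]*[\n\r][^\n\r<]{80,}")

# The long line sits between two PRE blocks: it should not be reported.
between_blocks = (
    "<PRE>\nshort line\n</PRE>\n"
    + "x" * 90 + "\n"
    + "<PRE>\nanother short line\n</PRE>\n"
)

# The long line sits inside the PRE block: it should be reported.
inside_block = "<PRE>\nshort line\n" + "y" * 90 + "\n</PRE>\n"

print(CONSTRAINED.search(between_blocks) is None)
print(CONSTRAINED.search(inside_block) is not None)
```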

It's a hack, but without lookaheads, this is the only way I can think of to constrain the match to within the same PRE element. To do this job right, you should do it outside MySQL altogether.
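One possible way to do it outside MySQL is to fetch the column values and check each PRE block line by line. The following is a minimal sketch under stated assumptions: the function name long_pre_lines, the limit parameter and the sample record are illustrative, and the code assumes the PRE tags are uppercase, have no attributes and are not nested, as in the question. Fetching the rows from the database is left out.

```python
import re

# Matches one PRE element at a time; the lazy ".*?" stops at the first </PRE>.
PRE_BLOCK = re.compile(r"<PRE>(.*?)</PRE>", re.DOTALL)


def long_pre_lines(text, limit=80):
    """Return the lines inside PRE blocks that are longer than `limit` characters."""
    found = []
    for block in PRE_BLOCK.finditer(text):
        # splitlines() handles \n, \r and \r\n line endings.
        for line in block.group(1).splitlines():
            if len(line) > limit:
                found.append(line)
    return found


# Hypothetical record: one long line inside a PRE block (85 characters)
# and one long line outside any PRE block (120 characters).
record = (
    "Lorem ipsum dolor sit amet.\n"
    + "<PRE>\none short line\n" + "z" * 85 + "\n</PRE>\n"
    + "p" * 120 + "\n"
    + "<PRE>\nagain some short lines\n</PRE>\n"
)

print(long_pre_lines(record))
```

Because each PRE block is extracted first and then split into lines, a long line in the surrounding text is never counted, and the length check is an ordinary comparison instead of a quantifier.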


Use .*? instead of .* so that the quantifier isn't greedy.


If there could be more than one <PRE> block, your expression can swallow the text in between them. Change [^\n\r]{80,} to [^\n\r]{80,}?.


<PRE>\s*[^\n\r]{80,}.*?</PRE>

Note that this assumes that the </PRE> tag never comes on the same line as the content. (If it did, you could consume 74 characters of 'long line' followed by the closing tag, and then you would consume a lot of content up until the next closing tag.)
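The same-line caveat is easy to reproduce. This sketch again uses Python's re module as a stand-in for MySQL, which is an assumption: \s, the lazy .*? and "." matching line breaks (enabled here with re.DOTALL) may behave differently or be unavailable in MySQL. The data is made up: the only content line is 74 characters long, with the closing tag directly after it.

```python
import re

PATTERN = re.compile(r"<PRE>\s*[^\n\r]{80,}.*?</PRE>", re.DOTALL)

# 74 characters of content plus the 6 characters of "</PRE>" make 80,
# so the closing tag is counted as part of the "long line".
tricky = (
    "<PRE>\n" + "a" * 74 + "</PRE>\n"
    + "text between the blocks\n"
    + "<PRE>\nshort\n</PRE>\n"
)

match = PATTERN.search(tricky)

# The first </PRE> was consumed by {80,}, so the match only ends
# at the closing tag of the second PRE block.
spans_two_blocks = match is not None and match.group(0).count("</PRE>") == 2
print(spans_two_blocks)
```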
