MySQL text extraction with regex

Developer https://www.devze.com 2022-12-08 19:21 Source: web
I'm trying to extract text from HTML stored in a database.

This is an example:

<P style="FONT-SIZE: 13px; MARGIN-LEFT: 6px"><FONT color=#073b66><STRONG><A 
href="/generic.asp?page_id=p00497">Practice Exams</A> - </STRONG><FONT 
color=#000000>ours are the most realistic exam simulations, and the best way to 
prepare for your exams. Get detailed correct and incorrect answers and 
explanations. Free Flash Cards are included.</FONT></FONT> </P>

If I search for "generic", the regex should find it only when that text is outside the HTML tags.

Please help


The following MySQL regex will match all the HTML tags, so you can strip them out:

"<" +       -- Match the character “<” literally
"[^>]" +    -- Match any character that is NOT a “>”
"*" +       -- Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
">"         -- Match the character “>” literally

Assembled, the pattern is <[^>]*>.

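To make the effect of that pattern concrete, here is a minimal sketch in Python (the asker's language is not stated, so Python is only an assumption for illustration). The helper names strip_tags and visible_text_contains are hypothetical and the sample is shortened from the question.

```python
import re

# The pattern from the answer: "<", any run of characters that are not ">", then ">".
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(html: str) -> str:
    """Remove everything that looks like an HTML tag."""
    return TAG_RE.sub("", html)

def visible_text_contains(html: str, term: str) -> bool:
    """True if term occurs in the text left after removing tags (case-insensitive)."""
    return term.lower() in strip_tags(html).lower()

sample = '<A href="/generic.asp?page_id=p00497">Practice Exams</A> - the best way to prepare'

print(strip_tags(sample))
print(visible_text_contains(sample, "generic"))   # "generic" only occurs inside the href attribute
print(visible_text_contains(sample, "practice"))  # "Practice" is part of the visible text
```

This pattern is a rough approximation: it will misfire on attribute values that contain ">" and it does nothing special for script or style content.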
OR

I know this does not answer your question directly, but if you have access to a scripting language, it will normally have built-in functions for stripping HTML tags from text.

For example, in PHP you can do this:

$htmltext = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';
$plaintext = strip_tags($htmltext);

// or use regex...
$result = preg_replace('/<[^>]*>/', '', $htmltext);

http://php.net/manual/en/function.strip-tags.php


I suggest parsing the HTML with a proper parser in the language you're programming in before inserting it into your database.

If you post which language you're working in, perhaps I, or someone else, can make a recommendation.
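As one possible illustration of the parser approach, here is a minimal sketch that assumes Python and its standard-library html.parser module. The class name TextExtractor and the function html_to_text are hypothetical; other languages have their own parsers.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes; tags, attributes and comments are ignored."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(html: str) -> str:
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())

fragment = ('<P style="FONT-SIZE: 13px"><STRONG><A href="/generic.asp?page_id=p00497">'
            'Practice Exams</A> - </STRONG><FONT color=#000000>ours are the best</FONT> </P>')
print(html_to_text(fragment))
```

The result could be stored alongside the original HTML at insert time, so that searches never have to look at markup.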


I'd suggest adding another column to the table with a text-only copy of the HTML column and using that column for full-text queries. Regular expressions are the wrong tool for this.
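A rough sketch of the two-column idea, assuming Python and using an in-memory SQLite database purely as a stand-in for MySQL. The table name pages, the columns body_html and body_text, and the helper functions are all hypothetical. In MySQL the LIKE query would typically be replaced by a FULLTEXT index queried with MATCH ... AGAINST.

```python
import re
import sqlite3

TAG_RE = re.compile(r"<[^>]*>")

def to_plain(html: str) -> str:
    """Crude tag stripper used only to fill the text column; a real parser is more robust."""
    return " ".join(TAG_RE.sub(" ", html).split())

conn = sqlite3.connect(":memory:")  # stand-in for the real MySQL connection
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, body_html TEXT, body_text TEXT)")

def insert_page(html: str) -> None:
    # Write both columns together so the text copy never drifts from the HTML.
    conn.execute(
        "INSERT INTO pages (body_html, body_text) VALUES (?, ?)",
        (html, to_plain(html)),
    )

def search(term: str) -> list:
    # Search only the text-only column, so matches inside tags are impossible.
    rows = conn.execute(
        "SELECT id FROM pages WHERE body_text LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]

insert_page('<A href="/generic.asp?page_id=p00497">Practice Exams</A>')
insert_page("<P>A generic answer</P>")
print(search("generic"))
```

The cost of this design is the extra storage and the need to refresh body_text whenever body_html changes.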

For large amounts of text you might also consider Sphinx (http://www.sphinxsearch.com), which has a built-in option to ignore HTML while searching.
