How to learn regular expressions

Developer https://www.devze.com 2023-01-29 17:57 Source: web
I get a list of words, and I want to construct a simple regular expression from it that matches at least all of those words (but maybe more).

I want an algorithm for that: its input is a list of words and its output is a regular expression. Obviously, there will be some restrictions. For example, if the regular expression should match an infinite number of words and I only give the algorithm a finite number of them, the result will always have to match more words than I supplied. Alternatively, I would need some more compact representation of the input. I am also thinking about supplying an existing regular expression plus a list of additional words, and getting back a regular expression that matches all of them together (and maybe more). In any case, the algorithm should try to construct a regular expression that is as simple as possible.

What techniques are available that can do that?


I was quite misunderstood. I know the general principles behind regular expressions and what they are, and in most cases I can quite easily come up with a regular expression for a given language by hand. But I am looking for algorithms that do that.


Formulated again, a bit differently:

Let L be a regular language. Let M_n be a finite subset of L with n elements. Let M_n be a subset of M_(n+1).

I want to have an algorithm LRE which gets a finite set of words and outputs a regular expression. And I want to have the property:

lim_n->infinity | diff( LRE(M_n), L ) | = 0
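
The same property written out as a formula, as a sketch only: it assumes that diff means the symmetric difference of the two languages, which the question does not state explicitly.

```latex
\[
  M_1 \subseteq M_2 \subseteq \dots \subseteq L, \qquad |M_n| = n,
\]
\[
  \lim_{n \to \infty} \bigl|\, \mathrm{LRE}(M_n) \,\triangle\, L \,\bigr| = 0,
  \qquad \text{where } A \,\triangle\, B = (A \setminus B) \cup (B \setminus A).
\]
```

Under that reading, the size of the difference is a non-negative integer (or infinite), so the limit can only be 0 if LRE(M_n) describes exactly L for all sufficiently large n.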


See this website to learn the general principles: http://www.regular-expressions.info/

If all you have is a list of words such as dog, cat, cow, mouse, the simplest regex to match any of these would be: dog|cat|cow|mouse. Note, however, that it will also match inside doggone, scatological, etc. It may or may not match DOGGONE, COWPATTY, etc., depending on whether or not you are doing case-sensitive matching. Better patterns can be suggested if you give more particulars about your problem.
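
A minimal sketch of building that alternation from a list, assuming Python's re module (the question does not name a platform). The helper name words_to_alternation and its whole_word flag are made up for this illustration; the \b word boundaries are one way to keep dog from matching inside doggone.

```python
import re

def words_to_alternation(words, whole_word=True):
    """Build a pattern that matches any one of the given words."""
    # Escape each word so characters like "." or "+" are taken literally.
    # Longer words go first so a short word cannot shadow a longer one
    # when word boundaries are switched off.
    ordered = sorted(set(words), key=lambda w: (-len(w), w))
    body = "(?:" + "|".join(re.escape(w) for w in ordered) + ")"
    return r"\b" + body + r"\b" if whole_word else body

pattern = words_to_alternation(["dog", "cat", "cow", "mouse"])
regex = re.compile(pattern, re.IGNORECASE)  # case-insensitive by choice

print(pattern)                               # \b(?:mouse|cat|cow|dog)\b
print(bool(regex.search("a DOG barked")))    # True
print(bool(regex.search("doggone")))         # False
```

Dropping re.IGNORECASE or passing whole_word=False gives the looser behaviour described above.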

It's also a good idea to get a regex testing tool. I like Expresso; it is good for .NET patterns. Since regex capabilities may vary between platforms, make sure your tool supports your platform.


This problem has been studied over the last decade. You might want to google "DFA learning" and download a couple of papers to get a sense of the state of the art.

Once you have the DFA, generating a regular expression from it is trivial. To avoid the problems @FrustratedWithDesign mentions, some extra conditions are introduced, such as requiring the DFA with the smallest number of nodes. From a machine learning point of view, this is similar to adding a regularization condition that favors the simplest hypothesis.
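
As a rough illustration of where such a DFA can come from: state-merging approaches in the DFA-learning literature commonly start from a prefix tree acceptor, a tree-shaped DFA that accepts exactly the sample words, and then merge states to generalize. The sketch below only builds that starting point; the function names build_pta and accepts are hypothetical, and the merging step, which is where the actual learning happens, is not shown.

```python
def build_pta(words):
    """Build a prefix tree acceptor: a tree-shaped DFA accepting exactly `words`."""
    transitions = {0: {}}   # state -> {symbol: next state}; state 0 is the start
    accepting = set()
    for word in words:
        state = 0
        for ch in word:
            if ch not in transitions[state]:
                new_state = len(transitions)      # next unused state number
                transitions[new_state] = {}
                transitions[state][ch] = new_state
            state = transitions[state][ch]
        accepting.add(state)                      # the word ends in this state
    return transitions, accepting

def accepts(transitions, accepting, word):
    """Run the DFA on `word`; a missing transition means rejection."""
    state = 0
    for ch in word:
        state = transitions[state].get(ch)
        if state is None:
            return False
    return state in accepting

transitions, accepting = build_pta(["cat", "cow", "dog"])
print(len(transitions))                           # 9 states, "c" is shared
print(accepts(transitions, accepting, "cow"))     # True
print(accepts(transitions, accepting, "cows"))    # False
```

Without any merging, this automaton matches only the input words, which is the "no generalization" end of the trade-off described above.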


Use this site to learn the basics and use rubular for live testing.


If you have a list of distinct words that you want to match, it doesn't sound like you're matching something that a regular expression is best at.

As FrustratedWithFormsDesigner pointed out, in the worst case your regex is just a one-to-one mapping of the items in the list; in the best case you can factor out common prefixes. And if you automate the regex construction, why bother with the regex? What is the use case?
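
For what it's worth, factoring out common prefixes can be automated with a trie. This is a minimal sketch in Python, not something proposed in the thread; trie_regex is a hypothetical helper, and the resulting pattern matches exactly the input words and nothing more, so it does no generalization.

```python
import re

def trie_regex(words):
    """Return a regex matching exactly the given words, sharing common prefixes."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}                  # the empty key marks "a word ends here"

    def emit(node):
        ends_here = "" in node
        branches = [re.escape(ch) + emit(child)
                    for ch, child in sorted(node.items()) if ch != ""]
        if not branches:
            return ""
        if len(branches) == 1 and not ends_here:
            return branches[0]         # no grouping needed for a single path
        body = "(?:" + "|".join(branches) + ")"
        # If a word also ends at this node, the continuation is optional.
        return body + "?" if ends_here else body

    return emit(trie)

pattern = trie_regex(["cat", "cow", "dog"])
print(pattern)                                # (?:c(?:at|ow)|dog)
print(bool(re.fullmatch(pattern, "cow")))     # True
print(bool(re.fullmatch(pattern, "ca")))      # False
```

Here re.fullmatch is used so that the pattern has to cover the whole input string.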

But if your list is beyond a trivial size, you'd probably be better off looping through it.


http://www.regular-expressions.info is a fantastic site for Regex Reference.

When building a complex regex, I typically use Expresso (http://www.ultrapico.com/Expresso.htm). It's a free app that helps you build regular expressions. It breaks them down into a tree view so that it is easy to see what each part is doing. It is made to work with .NET languages, but there are plenty of tools like this available for other languages.

To build my Regex, I'll usually start with an acceptable value and start replacing characters with Regex syntax.

For example, if I was trying to match a URL I would start with

http://www.mydomain.com

I would then escape anything that needs escaping

http://www\.mydomain\.com

then I would start replacing characters

http://www\.\w+\.\w+

Obviously this expression needs some more work, but you get the idea.
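
The steps above can be checked quickly in code. This sketch assumes Python's re module rather than the .NET tooling mentioned earlier, and the generalized pattern is only one possible generalization picked for illustration. The test URLs other than the original one are made up.

```python
import re

literal = r"http://www.mydomain.com"        # unescaped: "." matches any character
escaped = r"http://www\.mydomain\.com"      # "\." matches only a literal dot
generalized = r"http://www\.\w+\.\w+"       # assumed generalization, for illustration

# Why escaping matters: the unescaped pattern accepts a wrong character.
print(bool(re.fullmatch(literal, "http://wwwXmydomain.com")))        # True
print(bool(re.fullmatch(escaped, "http://wwwXmydomain.com")))        # False

# The generalized pattern still matches the value we started from...
print(bool(re.fullmatch(generalized, "http://www.mydomain.com")))    # True
print(bool(re.fullmatch(generalized, "http://www.example.org")))     # True

# ...but \w does not cover "-", one reason the pattern needs more work.
print(bool(re.fullmatch(generalized, "http://www.my-domain.com")))   # False
```

Re-testing against the original value after every replacement is a cheap way to catch a step that narrows the pattern by accident.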


Here is a site for Perl regex:

http://perldoc.perl.org/perlre.html