We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
开发者_开发百科 Improve this questionMy Questions is
Is there any regular expression engine that does Just-In-Time compiling during regex pattern parsing and use when matching / replacing the texts? Or where can I learn JIT for i386 or x64 architecture?
Why I need it
I was recently trying to benchmark Python’s built-in regex engine compared with normal C code with around 10MB of data.
I found that for a straightforward replacing (for example ab
to zzz
) it’s relatively fast: just 2 to 3 times slower than C.
But for [a-z]c
it took around 5 to 8 times as much time as C.
And with grouping (e.g. ([a-z])(c)
to AA\2\1BB
) it took 20 to 40 times as much time as C.
It’s not Just-In-Time compiling yet, but I think, if I could do Just-In-Time compiling, It could speed up a lot more.
PS: I use profiling for each regex pattern during compiling patterns,
for example, profile 1 for simple one like ab
, profile 2 for range [a-z]c
, profile 3 with grouping ([a-z])(c)
, each profile has separate codes, so no extra cost needed when matching and replacing simple patterns.
Update 1
I have tried it with psyco, and it doesn’t improve the speed that much. May be because I am doing text replacing against big data, not looping many times.
If I am not wrong, Python’s re.sub
is running it natively already I think, so pysco cannot improve the speed that much.
Update 2
I have tried with boost regex wrapped into python, but it’s even slower than Python’s regex, so it seems like the bottleneck is in Python’s string processing and Jan Goyvaerts has also pointed me to that in the answer.
Update
I’d like to convert regex pattern ab[a-z]c
to machine code, like the following equivalent C code (*s
points to 10MB long texts):
do{
if(*s=='a' && s[1]=='b' && s[2]>='a' && s[2]<='z' && s[3]=='c') return 1;
}while(*s++);
return 0;
Any ideas?
PCRE has a JIT compiler since 8.20. You can read about here: http://sljit.sourceforge.net/pcre.html
The only regex engine that I know that can compile regular expressions into executable code is the one in .NET when you pass RegexOptions.Compiled. That causes the Regex class to emit MSIL which can then be JITted like any other .NET code.
Whether than makes the .NET regex engine faster than others is a totally different matter. When searching and replacing using relatively simple regular expressions on large data sets, string handling becomes far more important. .NET strings are immutable, so much will depend on how many times the string needs to be reallocated.
Hand-coding the operation will always be faster, because the code isn't equivalent. The regex code maintains certain information about the regex match and the capturing groups which your code does not. In most situations, the extra time you spend hand-coding the search-and-replace instead of using a regex isn't worth the effort, particularly if you factor in that switching to a different regex is trivial when your requirements change, while rewriting the search-and-replace using procedural code takes much more time.
In my experience, PCRE is one of the fastest regex engines around. It doesn't include a ready-made search-and-replace, however.
I'm no Python expert, but you could give Psycho a try:
http://www.ibm.com/developerworks/library/l-psyco.html
http://psyco.sourceforge.net/
The regular expression engine in Firefox compiles some (not all!) regular expressions to machine code. I believe Safari and Chrome do too.
I don't see it in your question, so I ask: Did you test with precompiled regular expressions e.g. "re.compile(pattern)" ??
Since compiled regexes should be faster. OK, it is not JIT, but most of the time you are fine with simply precompiled ones!
See here:
re.compile
Another idea: When you have a library (in C) that is more optimal than the Python regex module or that does just-in-time compilation of Regexes, then you could write your own regex module for python that does just wrap your C-Library.
That of course is somewhat more work and only recommended when you really, really need the speed.
You could also try Cython (personally I did not use it yet, but it sounds rather good) to do the job of wrapping.
As much as I understand your problem now, the Python surrounding is not your problem (so I doubt whether psyco will help) -- also the preparation of the regex-run is not your problem, but the run itself must be top-speed. That of course depends on the library you use and how good it can handle large strings. I would think, that the standard python regex-lib is not optimized for such long strings and top-of-the-notch speed.
So if I understand you correctly, you use a programming language that by default does not do just-in-time compiling and now you are looking for a regex library that does precisely that?
I think you should compile all of your python code to binary using e.g. Psyco
http://www.devshed.com/c/a/Python/How-Python-Runs-Programs/4/
also discussed here:
Is it feasible to compile Python to machine code?
and here:
Is it possible to compile Python natively (beyond pyc byte code)?
If these solutions either don't work or are still not fast enough and if you absolutely want to write the rest of your application in python, there is the boost python c++ library:
http://www.boost.org/doc/libs/1_41_0/libs/python/doc/index.html
The boost.python library allows full interoperability between python and c++. Then, you could use the boost.regex c++ regex matcher:
http://www.boost.org/doc/libs/1_41_0/libs/regex/doc/html/index.html
I may be wrong, but I believe that Python's regex module is in C, so any suggestion to compile Python (like using Psycho) would not make much difference---what you're actually comparing is the performance of one C regex library (Python's) with another (whatever library you are using).
Thompson had a paper published in the communications of the ACM in 1968 that described a working JIT compiler for regular expressions into IBM 7094 code. I don't know what language(s) he used; Fortran or LISP would be the obvious suspects, with LISP being especially suspect since it already had JIT compiling.
精彩评论