开发者

Can this regex be further optimized?

开发者 https://www.devze.com 2023-03-29 17:35 出处:网络
I wrote this regex to parse entries from srt files. (?s)^\\d++\\s{1,2}(.{12}) --> (.{12})\\s{1,2}(.+)\\r?$

I wrote this regex to parse entries from srt files.

(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$

I don't know if it matters, but this is done using Scala programming language (Java Engine, but literal strings so that I don't have to double the backslashes).

The s{1,2} is used because some files will only have line breaks \n and others will have line breaks and carriage returns \n\r The first (?s) enables DOTALL mode so that the third capturing group can also match line breaks.

My program basically breaks a srt file using \n\r?\n as a delimiter and use Scala nice pattern matching feature to read each entry for further processi开发者_JAVA技巧ng:

val EntryRegex = """(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$""".r

def apply(string: String): Entry = string match {
  case EntryRegex(start, end, text) => Entry(0, timeFormat.parse(start),
    timeFormat.parse(end), text);
}

Sample entries:

One line:

1073
01:46:43,024 --> 01:46:45,015
I am your father.

Two Lines:

160
00:20:16,400 --> 00:20:19,312
<i>Help me, Obi-Wan Kenobi.
You're my only hope.</i>

The thing is, the profiler shows me that this parsing method is by far the most time consuming operation in my application (which does intensive time math and can even reencode the file several times faster than what it takes to read and parse the entries).

So any regex wizards can help me optimize it? Or maybe I should sacrifice regex / pattern matching succinctness and try an old school java.util.Scanner approach?

Cheers,


(?s)^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(.+)\r?$

In Java, $ means the end of input or the beginning of a line-break immediately preceding the end of input. \z means unambiguously end of input, so if that is also the semantics in Scala, then \r?$ is redundant and $ would do just as well. If you really only want a CR at the end and not CRLF then \r?\z might be better.

The (?s) should also make (.+)\r? redundant since the + is greedy, the . should always expand to include the \r. If you do not want the \r included in that third capturing group, then make the match lazy : (.+?) instead of (.+).

Maybe

(?s)^\d++\s\s?(.{12}) --> (.{12})\s\s?(.+?)\r?\z

Other fine high-performance alternatives to regular expressions that will run inside a JVM &| CLR include JavaCC and ANTLR. For a Scala only solution, see http://jim-mcbeath.blogspot.com/2008/09/scala-parser-combinators.html


I'm not optimistic, but here are two things to try:

  1. you could do is move the (?s) to just before you need it.
  2. remove the \r?$ and use a greedy .++ for the text .+

    ^\d++\s{1,2}(.{12}) --> (.{12})\s{1,2}(?s)(.++)$

To really get good performance, I would refactor the code and regex to use findAllIn. The current code is doing a regex for every Entry in your file. I imagine the single findAllIn regex would perform better...But maybe not...


Check this out:

(?m)^\d++\r?+\n(.{12}) --> (.{12})\r?+\n(.++(?>\r?+\n.++)*+)$

This regex matches a complete .srt file entry in place. You don't have to split the contents up on line breaks first; that's a huge waste of resources.

The regex takes advantage of the fact that there's exactly one line separator (\n or \r\n) separating the lines within an entry (multiple line separators are used to separate entries from each other). Using \r?+\n instead of \s{1,2} means you can never accidentally match two line separators (\n\n) when you only wanted to match one.

This way, too, you don't have to rely on the . in (?s) mode. @Jacob was right about that: it's not really helping you, and it's killing your performance. But (?m) mode is helpful, for correctness as well as performance.

You mentioned java.util.Scanner; this regex would work very nicely with findWithinHorizon(0). But I'd be surprised if Scala doesn't offer a nice, idiomatic way to use it as well.


I wouldn't use java.util.Scanner or even strings. Everything you're doing will work perfectly on a byte stream as long as you can assume UTF-8 encoding of your files (or a lack of unicode). You should be able to speed things up by at least 5x.


Edit: this is just a lot of low-level fiddling of bytes and indices. Here's something based loosely on things I've done before, which seems about 2x-5x faster, depending on file size, caching, etc.. I'm not doing the date parsing here, just returning strings, and I'm assuming the files are small enough to fit in a single block of memory (i.e. <2G). This is being rather pedantically careful; if you know, for example, that the date string format is always okay, then the parsing can be faster yet (just count the characters after the first line of digits).

import java.io._
abstract class Entry {
  def isDefined: Boolean
  def date1: String
  def date2: String
  def text: String
}
case class ValidEntry(date1: String, date2: String, text: String) extends Entry {
  def isDefined = true
}
object NoEntry extends Entry {
  def isDefined = false
  def date1 = ""
  def date2 = ""
  def text = ""
}

final class Seeker(f: File) {
  private val buffer = {
    val buf = new Array[Byte](f.length.toInt)
    val fis = new FileInputStream(f)
    fis.read(buf)
    fis.close()
    buf
  }
  private var i = 0
  private var d1,d2 = 0
  private var txt,n = 0
  def isDig(b: Byte) = ('0':Byte) <= b && ('9':Byte) >= b
  def nextNL() {
    while (i < buffer.length && buffer(i) != '\n') i += 1
    i += 1
    if (i < buffer.length && buffer(i) == '\r') i += 1
  }
  def digits() = {
    val zero = i
    while (i < buffer.length && isDig(buffer(i))) i += 1
    if (i==zero || i >= buffer.length || buffer(i) != '\n') {
      nextNL()
      false
    }
    else {
      nextNL()
      true
    }
  }
  def dates(): Boolean = {
    if (i+30 >= buffer.length) {
      i = buffer.length
      false
    }
    else {
      d1 = i
      while (i < d1+12 && buffer(i) != '\n') i += 1
      if (i < d1+12 || buffer(i)!=' ' || buffer(i+1)!='-' || buffer(i+2)!='-' || buffer(i+3)!='>' || buffer(i+4)!=' ') {
        nextNL()
        false
      }
      else {
        i += 5
        d2 = i
        while (i < d2+12 && buffer(i) != '\n') i += 1
        if (i < d2+12 || buffer(i) != '\n') {
          nextNL()
          false
        }
        else {
          nextNL()
          true
        }
      }
    }
  }
  def gatherText() {
    txt = i
    while (i < buffer.length && buffer(i) != '\n') {
      i += 1
      nextNL()
    }
    n = i-txt
    nextNL()
  }
  def getNext: Entry = {
    while (i < buffer.length) {
      if (digits()) {
        if (dates()) {
          gatherText()
          return ValidEntry(new String(buffer,d1,12), new String(buffer,d2,12), new String(buffer,txt,n))
        }
      }
    }
    return NoEntry
  }
}

Now that you see that, aren't you glad that the regex solution was so quick to code?

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号