Is it more efficient to grep twice or use a regular expression once?

开发者 https://www.devze.com 2023-03-07 02:53 Source: web
I'm trying to parse a couple of 2gb+ files and want to grep on a couple of levels.

Say I want to fetch lines that contain "foo" and lines that also contain "bar".

I could do grep foo file.log | grep bar, but my concern is that it will be expensive running it twice.

Would it be beneficial to use something like grep -E '(foo.*bar|bar.*foo)' instead?


grep -E '(foo|bar)' will find lines containing 'foo' OR 'bar'.

You want lines containing BOTH 'foo' AND 'bar'. Either of these commands will do:

sed '/foo/!d;/bar/!d' file.log

awk '/foo/ && /bar/' file.log

Both commands -- in theory -- should be much more efficient than your cat | grep | grep construct because:

  • Both sed and awk perform their own file reading; no need for pipe overhead
  • The 'programs' I gave to sed and awk above use Boolean short-circuiting to quickly skip lines not containing 'foo', thus testing only lines that do contain 'foo' against the /bar/ regex

However, I haven't tested them. YMMV :)
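A quick sanity check of both one-pass commands against a few made-up sample lines (the file name and contents here are hypothetical, just for illustration):

```shell
# Three sample lines; only the first contains both words.
printf 'foo and bar\nonly foo here\nonly bar here\n' > sample.log

# sed: delete (!d) any line missing /foo/, then any line missing /bar/
sed '/foo/!d;/bar/!d' sample.log
# awk: print only lines matching both patterns
awk '/foo/ && /bar/' sample.log
# Both print just: foo and bar

rm sample.log
```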


In theory, the fastest way should be:

grep -E '(foo.*bar|bar.*foo)' file.log

For several reasons: First, grep reads directly from the file, rather than adding the step of having cat read it and stuff it down a pipe for grep to read. Second, it uses only a single instance of grep, so each line of the file only has to be processed once. Third, grep -E is generally faster than plain grep on large files (but slower on small files), although this will depend on your implementation of grep. Finally, grep (in all its variants) is optimized for string searching, while sed and awk are general-purpose tools that happen to be able to search (but aren't optimized for it).
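The alternation covers both orderings in a single pass; a small illustration (the sample lines are made up):

```shell
# The first two lines contain both words, in opposite orders;
# the third contains only one. One grep invocation finds both matches.
printf 'foo then bar\nbar then foo\nfoo only\n' |
  grep -E '(foo.*bar|bar.*foo)'
# Prints:
#   foo then bar
#   bar then foo
```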


These two operations are fundamentally different. This one:

cat file.log | grep foo | grep bar

looks for foo in file.log, then looks for bar in whatever the last grep output. Whereas cat file.log | grep -E '(foo|bar)' looks for either foo or bar in file.log. The output should be very different. Use whatever behavior you need.

As for efficiency, they're not really comparable because they do different things. Both should be fast enough, though.
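To see the difference concretely (demo.log is a hypothetical file created just for this comparison):

```shell
printf 'foo bar\nfoo only\nbar only\n' > demo.log

# AND semantics: each filter in the pipe must match -> only the first line
grep foo demo.log | grep bar

# OR semantics: either pattern may match -> all three lines
grep -E '(foo|bar)' demo.log

rm demo.log
```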


If you're doing this:

cat file.log | grep foo | grep bar

You're only printing lines that contain both foo and bar in any order. If this is your intention:

grep -e "foo.*bar" -e "bar.*foo" file.log

will be more efficient, since the file only has to be parsed once.

Notice I don't need the cat, which is more efficient in itself. You rarely ever need cat unless you are concatenating files (which is the purpose of the command). 99% of the time you can either add a file name to the end of the first command in a pipe, or, if you have a command like tr that doesn't take a file argument, you can always redirect the input like this:

tr 'a-z' 'A-Z' < "$fileName"

But, enough about useless cats. I have two at home.

You can pass multiple regular expressions to a single grep which is usually a bit more efficient than piping multiple greps. However, if you can eliminate regular expressions, you might find this the most efficient:

fgrep "foo" file.log | fgrep "bar"

Unlike grep, fgrep doesn't interpret regular expressions, which means it can match lines much, much faster. Try this:

time fgrep "foo" file.log | fgrep "bar"

and

time grep -e "foo.*bar" -e "bar.*foo" file.log

And see which is faster.
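One way to run that comparison yourself on a synthetic file (the file size, line counts, and patterns below are arbitrary assumptions; tune them to something closer to your real data):

```shell
# Build a throwaway test file with a mix of matching and non-matching lines
for i in $(seq 1 100000); do
  echo "line $i with foo and bar"
  echo "line $i with foo only"
  echo "line $i with neither"
done > big.log

# Both approaches select the same lines here; compare their wall-clock times
time fgrep "foo" big.log | fgrep "bar" > /dev/null
time grep -e "foo.*bar" -e "bar.*foo" big.log > /dev/null

rm big.log
```

The `> /dev/null` keeps terminal output from dominating the measurement, so the timings reflect the searching itself.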
