How to make this sed script faster?

Developer https://www.devze.com 2022-12-13 11:01 Source: web
I have inherited this sed script snippet that attempts to remove certain empty spaces:

s/[\s\t]*|/|/g
s/|[\s\t]*/|/g
s/[\s] *$//g
s/^|/null|/g

that operates on a file of around 1 GB. The script runs for 2 hours on our Unix server. Any ideas on how to speed it up?

Note that \s stands for a space and \t stands for a tab; the actual script uses literal space and tab characters, not those symbols.

The input file is pipe-delimited and is stored locally, not on the network. The 4 lines are in a file executed with sed -f.


The best I was able to do with sed, was this script:

s/[\s\t]*|[\s\t]*/|/g
s/[\s\t]*$//
s/^|/null|/

In my tests, this ran about 30% faster than your sed script. The increase in performance comes from combining the first two regexen and omitting the "g" flag where it's not needed.
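
To check that merging the rules does not change the result, here is a small sketch in Python (chosen only for illustration, not as a speed recommendation). The names ORIGINAL, COMBINED and apply_rules are made up for this example, and the patterns use a literal space and tab as the question describes. One known difference is left out of the samples: the original third rule strips only trailing spaces, while the combined script also strips trailing tabs.

```python
import re

# The question's four rules, with \s written as a literal space.
ORIGINAL = [
    (re.compile(r"[ \t]*\|"), "|"),
    (re.compile(r"\|[ \t]*"), "|"),
    (re.compile(r" +$"), ""),
    (re.compile(r"^\|"), "null|"),
]

# The three combined rules from this answer.
COMBINED = [
    (re.compile(r"[ \t]*\|[ \t]*"), "|"),
    (re.compile(r"[ \t]*$"), ""),
    (re.compile(r"^\|"), "null|"),
]

def apply_rules(rules, line):
    """Apply each (pattern, replacement) pair in order, like sed -f does."""
    for pattern, replacement in rules:
        line = pattern.sub(replacement, line)
    return line

# Sample lines without a trailing newline and without trailing tabs.
samples = ["a | b |c  ", " |x\t| y", "|first", "no pipes here "]
for s in samples:
    print(repr(s), "->", repr(apply_rules(COMBINED, s)))
```

On these samples both rule sets give the same output, so the gain comes purely from doing fewer passes over each line.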

However, 30% faster is only a mild improvement (it should still take about an hour and a half to run the above script on your 1GB data file). I wanted to see if I could do any better.

In the end, no other method I tried (awk, perl, and other approaches with sed) fared any better, except -- of course -- a plain ol' C implementation. As would be expected with C, the code is a bit verbose for posting here, but if you want a program that's likely going to be faster than any other method out there, you may want to take a look at it.

In my tests, the C implementation finishes in about 20% of the time it takes for your sed script. So it might take about 25 minutes or so to run on your Unix server.

I didn't spend much time optimizing the C implementation. No doubt there are a number of places where the algorithm could be improved, but frankly, I don't know if it's possible to shave a significant amount of time beyond what it already achieves. If anything, I think it certainly places an upper limit on what kind of performance you can expect from other methods (sed, awk, perl, python, etc).

Edit: The original version had a minor bug that caused it to possibly print the wrong thing at the end of the output (e.g. could print a "null" that shouldn't be there). I had some time today to take a look at it and fixed that. I also optimized away a call to strlen() that gave it another slight performance boost.
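
The C source is not reproduced here, so the following is only an illustrative sketch of how a single pass over each line could apply the same rules without any regex engine. It is written in Python for brevity, it is not the implementation described above, and it makes no performance claim. The function name clean_line_single_pass is hypothetical, and the input line is assumed to have had its newline removed already.

```python
def clean_line_single_pass(line):
    """Drop spaces/tabs around pipes and at end of line in one scan."""
    out = []
    pending = []        # whitespace seen but not yet emitted
    after_pipe = False  # True while only whitespace has followed a pipe
    for ch in line:
        if ch in " \t":
            pending.append(ch)
        elif ch == "|":
            pending.clear()          # whitespace before a pipe is dropped
            out.append("|")
            after_pipe = True
        else:
            if not after_pipe:
                out.extend(pending)  # whitespace inside a field is kept
            pending.clear()          # whitespace right after a pipe is dropped
            out.append(ch)
            after_pipe = False
    # Anything still pending is trailing whitespace and is never emitted.
    result = "".join(out)
    if result.startswith("|"):
        result = "null" + result     # empty first field
    return result

print(clean_line_single_pass(" |x\t| y  "))
```

The idea is that each character is looked at exactly once and whitespace is only emitted after it is known not to touch a pipe or the end of the line.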


Try changing the first two lines to:

s/[ \t]*|[ \t]*/|/g


My testing indicated that sed can become CPU bound pretty easily on something like this. If you have a multi-core machine you can try spawning off multiple sed processes with a script that looks something like this:

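#!/bin/sh
INFILE=data.txt
OUTFILE=fixed.txt
SEDSCRIPT=script.sed
# Lines per chunk: split the input into roughly 20 pieces (rounded up)
SPLITLIMIT=`wc -l < "$INFILE" | awk '{print int($1 / 20) + 1}'`

split -d -l "$SPLITLIMIT" "$INFILE" x_

# Run one sed process per chunk in the background
for chunk in x_??
do
  sed -f "$SEDSCRIPT" "$chunk" > "$chunk.out" &
done

wait

# The numeric suffixes sort in order, so the chunks are reassembled in sequence
cat x_??.out > "$OUTFILE"

rm -f x_??
rm -f x_??.out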


It seems to me from your example that you are cleaning up white space from the beginning and end of pipe (|) delimited fields in a text file. If I were to do this, I would change the algorithm to the following:

for each line
    split the line into an array of fields
    remove the leading and trailing white space from each field
    join the fields back together as a pipe-delimited line, handling the empty first field correctly

I would also use a different language such as Perl or Ruby for this.

The advantage of this approach is that the code that cleans up the lines now handles fewer characters for each invocation and should execute much faster even though more invocations are needed.
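
A minimal sketch of that split, trim and join loop, written in Python rather than Perl or Ruby purely for illustration. The function name clean_line is made up here. Note one behavioural difference from the sed script: trimming every field also removes leading whitespace from the first field, which the sed rules leave alone.

```python
def clean_line(line):
    """Trim spaces and tabs around each pipe-delimited field of one line."""
    fields = line.rstrip("\n").split("|")
    fields = [field.strip(" \t") for field in fields]
    # An empty first field becomes "null", but only when the line has a pipe,
    # so blank lines stay blank.
    if len(fields) > 1 and fields[0] == "":
        fields[0] = "null"
    return "|".join(fields)

for raw in [" a | b\t|c \n", "| x |y\n", "\n"]:
    print(repr(clean_line(raw)))
```

Whether this beats the regex version depends on the language and the data, so it is worth timing on a sample of the real file.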


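This Perl script should be much faster:

s/[ \t]*\|[ \t]*/|/g;
s/[ \t]+$//;
s/^\|/null|/;

Basically, there is no need to use 'g' on regexes that apply only to the end or beginning of a line. The pipe has to be escaped as \| in Perl, because a bare | means alternation. (The 'o' flag, "compile once", only matters for patterns that interpolate variables; constant patterns like these are already compiled once, so it is omitted here.)

Also, in Perl [\s\t]* is equivalent to \s*, since \s already matches a tab. However, \s matches the newline as well, so the script above uses [ \t] to keep the line endings intact.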


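This might work. I've only tested it a little. It trims spaces and tabs from both ends of every field (including the start of the first field, which the sed script leaves alone) and leaves whitespace inside a field untouched.

awk 'BEGIN {FS="|"; OFS="|"} {for (i=1; i<=NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i); if (NF > 1 && $1 == "") $1 = "null"; print}' file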


How about Perl:

#!/usr/bin/perl

while(<>) {
    # [ \t] rather than \s, because \s would also eat the trailing newline
    s/[ \t]*\|[ \t]*/|/g;
    s/^[ \t]+//;
    s/[ \t]+$//;
    s/^\|/null|/;
    print;
}

EDIT: changed approach significantly. On my machine this is almost 3x faster than your sed script.

If you really need the best speed possible, write a specialized C program to do this task.


Use gawk, not sed.

awk -vFS='|' '{for(i=1;i<=NF;i++) gsub(/^[ \t]+|[ \t]+$/,"",$i)} NF>1 && $1=="" {$1="null"} 1' OFS="|" file


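Try doing it in one command (note that this only keeps the text from the first pipe through the last pipe of each line; it does not trim whitespace around the pipes in between):

sed 's/[^|]*\(|.*|\).*/\1/'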


Have you tried Perl? It may be faster.

#!/usr/local/bin/perl -p

s#[\t ]+\|#|#g;
s#\|[\t ]+#|#g;
s#[\t ]*$##;
s#^\|#null|#;

Edit: Actually, it seems to be about three times slower than the sed program. Strange...


I think the * in the regular expressions in the question and most of the answers can be a major slowdown compared to using a +. Consider the first replace in the question

s/[\s\t]*|/|/g

The * matches zero or more items followed by a |, so every | is replaced, even those that do not need replacing. Changing the replacement to

s/[\s\t]+|/|/g

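will only change the | characters that are preceded by one or more spaces or tabs. (In sed's basic regular expressions a bare + is a literal character; GNU sed spells the quantifier \+, and the portable form repeats the bracket, as in [\s\t][\s\t]*|.)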

I do not have sed available, but I did an experiment with Perl. On the data I used, the script with the * took almost 7 times longer than the script with the +.

The times were consistent across the runs. For the + the difference between minimum and maximum times was 4% of the average, and for the * it was 3.6%. The ratio of the average times (+ to *) was 1 : 6.9.
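
The extra work done by the * form can be seen by counting substitutions. The sketch below uses Python's re.subn purely as an illustration; the sample line is made up, and the counts say nothing about how sed or Perl are implemented internally.

```python
import re

line = "a|b |c\t|d"

# The star form matches every pipe, including ones with no whitespace before them.
star_result, star_count = re.subn(r"[ \t]*\|", "|", line)

# The plus form matches only pipes that actually have whitespace in front.
plus_result, plus_count = re.subn(r"[ \t]+\|", "|", line)

print(star_result, star_count)  # a|b|c|d 3
print(plus_result, plus_count)  # a|b|c|d 2
```

Both forms produce the same text, but the * form performs one substitution that replaces a | with an identical |.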

Details of experiment

Tested using an 80 MB file with just over 180,000 occurrences of [st]\. (these are the lowercase characters s and t, followed by a literal dot).

The test used a batch command file with 30 of each of these two commands, alternating star and plus.

perl -f TestPlus.pl input.ltrar > zz.oo
perl -f TestStar.pl input.ltrar > zz.oo

One script is below; the other merely changes the * to + and "star" to "plus".

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );

my $t0 = [gettimeofday()];
while(<>)
{
    s/[st]*\././g;
}

my $elapsed = tv_interval ( $t0 );
print STDERR "Elapsed star $elapsed\n";

Perl version used:

c:\test> perl -v
This is perl 5, version 16, subversion 3 (v5.16.3) built for MSWin32-x64-multi-thread
(with 1 registered patch, see perl -V for more detail)

Copyright 1987-2012, Larry Wall

Binary build 1603 [296746] provided by ActiveState http://www.ActiveState.com
Built Mar 13 2013 13:31:10