I've combined a bunch of email files into one large text file, and now I'm trying to delete all the header lines from the emails in this new text file. I have a set of unique characters I can use as markers to delete between them, but I'm coming up short on a regex that will strip out the header lines. An example set is below (including the two asterisks at the top and the double equals at the bottom).
** w54cs6547wem; Sat, 30 Oct 2010 00:06:43 -0700 (PDT) s10mr13764658ybi.218.1288422402631; Sat, 30 Oct 2010 00:06:42 -0700 (PDT) p13si451872ybk.2.2010. .36; Sat, 30 Oct 2010 00:06:42 -0700 (PDT) Sat, 30 Oct 2010 02:01:23 -0500 Date: Sat, 30 Oct 2010 02:01:22 -0500 Subject: Message-ID: Thread-Index: Act4ABHi0HfIPTIzRwe9oy8ojziTig==
sed -i '/\*\*/,/==/d' FILE
changes your file in place (-i),
sed '/\*\*/,/==/d' FILE > MODIFIED
saves the modification to a newly created file.
I don't know bash replacement syntax, but the regex you want is:
/\*\*.*?==/s
In PHP, the code would be:
$str = preg_replace('/\*\*.*?==/s', '', $str);
The s modifier lets . match newlines, which is needed when a header block spans several lines. Hopefully you can translate that into bash without any trouble.
Explanation:
The trick here is the .*? part. The ? makes the .* lazy, so it will start at ** and match everything until the first == it finds. Without the ?, the .* would be greedy and grab everything between the first ** and the last == in the document. So if you have something like this:
**foo==bar **baz==quux **abc==xyz
...using /\*\*.*?==/ as your regex would give you bar quux xyz, while /\*\*.*==/ would give only xyz.
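For anyone who wants to see the lazy/greedy difference run, here is a minimal sketch in Python rather than PHP or bash. The variable names and the multi-line sample text are made up for illustration. It assumes the whole text fits in memory, and it uses re.DOTALL as the counterpart of a "dot matches newline" modifier.

```python
import re

sample = "**foo==bar **baz==quux **abc==xyz"

# Lazy: each match stops at the first "==" that follows its "**".
lazy = re.sub(r"\*\*.*?==", "", sample)

# Greedy: a single match runs from the first "**" to the last "==".
greedy = re.sub(r"\*\*.*==", "", sample)

print(lazy)    # bar quux xyz
print(greedy)  # xyz

# Hypothetical multi-line header block: without DOTALL, "." stops at
# each newline and the pattern would never reach the closing "==".
multiline = "keep this\n**\nX-Header: value\nThread-Index: abc==\nkeep that\n"
stripped = re.sub(r"\*\*.*?==\n?", "", multiline, flags=re.DOTALL)
print(stripped, end="")
```

The optional \n? at the end of the last pattern also removes the line break after the ==, so no blank line is left where the header used to be.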
If you go the regex route, you will most probably end up processing the entire file in memory. Here's a line-by-line approach.
$> cat file
some words
here that i want
**
w54cs6547wem; Sat, 30 Oct 2010 00:06:43 -0700 (PDT)
s10mr13764658ybi.218.1288422402631; Sat, 30 Oct 2010 00:06:42 -0700 (PDT)
p13si451872ybk.2.2010. .36; Sat, 30 Oct 2010 00:06:42 -0700 (PDT)
Sat, 30 Oct 2010 02:01:23 -0500
Date: Sat, 30 Oct 2010 02:01:22 -0500 Subject:
Message-ID:
Thread-Index: Act4ABHi0HfIPTIzRwe9oy8ojziTig==
other words
here that i also want
$> awk '/^\*\*/{f=1;next} f&&/==$/{f=0;next} f{next} !f' file
some words
here that i want
other words
here that i also want
The idea is to set a flag when the ** is found, then skip lines until == is found.
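The same flag idea can be sketched outside awk. The Python below is only an illustration, and strip_headers is a made-up name. It assumes a header block opens on a line starting with ** and closes on a later line ending with ==, as in the sample file above.

```python
def strip_headers(lines):
    """Yield only the lines that fall outside a '**' ... '==' block."""
    in_header = False
    for line in lines:
        text = line.rstrip("\n")
        if in_header:
            # Inside a block: drop the line, and leave the block
            # once a line ending in "==" has been seen.
            if text.endswith("=="):
                in_header = False
            continue
        if text.startswith("**"):
            # Opening marker: drop it and start skipping.
            in_header = True
            continue
        yield line


sample = [
    "some words\n",
    "here that i want\n",
    "**\n",
    "w54cs6547wem; Sat, 30 Oct 2010 00:06:43 -0700 (PDT)\n",
    "Thread-Index: Act4ABHi0HfIPTIzRwe9oy8ojziTig==\n",
    "other words\n",
    "here that i also want\n",
]
kept = list(strip_headers(sample))
print("".join(kept), end="")
```

Because it is a generator, it can be fed an open file object and will only hold one line at a time. Note that a block with no closing == swallows everything up to the end of the input.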
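It is easily expressed in perl as perl -ne 'print unless /^\*\*/ .. /==$/' file, and in sed as sed -e '/^\*\*/,/==$/d' file. Both print the result to standard output; piping the file in through cat is unnecessary, and perl's -i flag only has an effect when the file is passed as an argument.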