How to extract block of XML from a log file on Linux_问答_开发者

I have a log file that looks like the following:

2010-05-12 12:23:45 Some sort of log entry
2010-05-12 01:45:12 Request XML: <RootTag>
<Element>Value</Element>
<Element>Another Value</Element>
</RootTag>
2010-05-12 01:45:32 Response XML: <ResponseRoot>
<Element>Value</Element>
</ResponseRoot>
2010-05-12开发者_开发知识库 01:45:49 Another log entry

What I want to do is extract the Request and Response XML (and ultimately dump them into their own single files). I had a similar parser that used egrep but the XML was all on one line, not multiple ones like above.

The log files are also somewhat large, hitting 500-600 megs a log. Smaller logs I would read in via a PHP script and use regex matching, but the amount of memory required for such a large file would more than likely kill the script.

Is there an easy way using the built-in tools on a Linux box (CentOS in this case) to extract multiple lines or am I going to have to bite the bullet and use Perl or PHP to read in the entire file to extract it?

# Example usage:
# perl script.pl data.xml RootTag > RootTag.xml

use strict;
use warnings;

my $tag = pop;

while (<>){
    if ( s/.*(<$tag>)/$1/ .. s/(<(\/)$tag>).*/$1/ ){
        print;
        last if $2;
    }
}

See the docs for details on the flip-flop operator.

Sounds like a job for sed (I was so tempted to say SuperSed ;-)

sed -n '/^<.\+>/H; /\(Request\|Response\) XML/{s/^.*</</;x;p}; ${x;p}' xmllog

where xmllog is your log file's name. You'll get a blank line at the beginning, but that can be filtered out with egrep '.+' or even just tail -n +2.

By way of explanation, sed is a little interpreter for programs that consist of a list of matching conditions and corresponding actions. sed runs through a file line by line (hence the name, "stream editor" -> "sed") and for each line, for each condition in the program that matches the text on the line, it applies the corresponding action. In this case:

/^<.\+>/

is a regular expression condition that matches any line which contains < followed by any character (.) repeated one or more times (\+) followed by > - basically any line with an XML tag. The associated action is H which appends the line to a "hold buffer". The other condition is

/\(Request\|Response\) XML/

which, of course, is a regexp that matches either Request or Response followed by a space and then XML. The corresponding action is

{s/^.*</</;x;p}

which first does a substitution (s) of the beginning of the line (^) followed by any character (.) repeated any number of times (*) followed by <, with just <. Basically that gets rid of anything before the first XML tag on the line. Then it switches (x) the line just read with the "hold buffer" (which contains the XML of the previous log message) and prints (p) the stuff that was just swapped in from the hold buffer. Finally,

matches the end of the input, and {x;p} again just swaps the contents of the hold buffer into the "print buffer" and then prints it.

You can alter the command to suit your needs, for example if you need something to delimit the different records, this'll put a blank line between them:

sed -n '/^<.\+>/H; /\(Request\|Response\) XML/{s/^.*</\n</;x;p}; ${x;p}' xmllog

(in that case, of course, don't use egrep to filter out the blank line at the beginning).

Your question implies you're not thinking right; if there's a way to do what you're asking in one language (there is) ... then you can do it in any language.

There's no reason to read the entire log into memory. You just read it line by line and extract the information you want. You just need to keep a state as to where you are (not in tag, inside RootTag, inside ResponseRoot, etc) and process the data as you wish.