Selectively parsing log files using Java

I have to parse a big bunch of log files, which are in the following format.

SOME SQL STATEMENT/QUERY

DB20000I  The SQL command completed successfully.

SOME OTHER SQL STATEMENT/QUERY

DB21034E  The command was processed as an SQL statement because it was not a 
valid Command Line Processor command.

EDIT 1: The first 3 lines (including a blank line) indicate an SQL statement executed successfully, while the next three show the statement and the exception it caused. darioo's reply below, suggesting the use of grep instead of Java, works beautifully for a single line SQL statement.

EDIT 2: However, the SQL statement/query might not be a single line, necessarily. Sometimes it is a big CREATE PROCEDURE...END PROCEDURE block. Can this problem be overcome using only Unix commands too?

Now I need to parse through the entire log file and pick all occurrences of the pair of (SQL statement + error) and write them in a separate file.

Please show me how to do this!

My answer is not Java-based, since this is a classic example of a problem that is much easier to solve with standard command-line tools.

All you need is the tool grep. If you're on Windows, you can find it here.

Assuming your logs are in the file log.txt, the solution to your problem is a one-liner:

grep -hE --before-context 1 "^DB2[0-9]+E" log.txt > filtered.txt

Explanation:

-h - don't print the file name
-E - treat the pattern as an extended regular expression
--before-context 1 - print one line before each matching error message (this works if every SQL query fits on one line)
^DB2[0-9]+E - match lines that begin with "DB2", followed by one or more digits and then "E" (the error codes)

The command above writes every line that you need to a new file called filtered.txt.
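Since the question asks about Java, here is a minimal sketch of the same idea: keep each line that matches the error pattern together with the line directly before it. The class and method names (ErrorContext, errorsWithPreviousLine) are made up for illustration, the sample log lines are invented, and List.of assumes Java 9 or later. Unlike grep, it does not print "--" separators between groups.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ErrorContext {
    // Same pattern as the grep call: "DB2", one or more digits, then "E", at line start.
    private static final Pattern ERROR = Pattern.compile("^DB2[0-9]+E");

    /** Returns every error line, each preceded by the line directly before it. */
    static List<String> errorsWithPreviousLine(List<String> lines) {
        List<String> out = new ArrayList<>();
        String previous = null;
        boolean previousEmitted = false;
        for (String line : lines) {
            boolean isError = ERROR.matcher(line).find();
            if (isError) {
                // Do not repeat the previous line if it was itself an error line.
                if (previous != null && !previousEmitted) {
                    out.add(previous);
                }
                out.add(line);
            }
            previous = line;
            previousEmitted = isError;
        }
        return out;
    }

    public static void main(String[] args) {
        // Invented sample input, only to show the call.
        List<String> log = List.of(
                "SELECT 1 FROM t",
                "DB20000I  The SQL command completed successfully.",
                "DROP TABLE missing",
                "DB21034E  The command was processed as an SQL statement.");
        errorsWithPreviousLine(log).forEach(System.out::println);
    }
}
```

Like the grep one-liner, this only helps when each SQL statement fits on a single line.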

Update: after some fumbling around, I managed to get what's needed using only standard *nix utilities. Beware, it's not pretty. The final expression:

grep -nE "^DB2[0-9]+" log.txt | cut -f 1 -d " " | gawk "/E$/{y=$0;print x, y};{x=$0}" | sed -e "s/:DB2[[:digit:]]\+[IE]//g" | gawk "{print \"sed -n \\\"\" $1+1 \",\" $2 \"p\\\" log.txt \"}" | sed -e "s/$/ >> filtered.txt/g" > run.bat

Explanation:

grep -nE "^DB2[0-9]+" log.txt - prints lines that begin with DB2... and their line number at beginning. Example:

6:DB20000I  The SQL command completed successfully.
12:DB21034E  The command was processed as an SQL statement because it was not a valid Command Line Processor command.
19:DB21034E  The command was processed as an SQL statement because it was not a valid Command Line Processor command.
26:DB21034E  The command was processed as an SQL statement because it was not a valid Command Line Processor command.
34:DB20000I  The SQL command completed successfully.
41:DB20000I  The SQL command completed successfully.
47:DB21034E  The command was processed as an SQL statement because it was not a valid Command Line Processor command.
54:DB20000I  The SQL command completed successfully.

cut -f 1 -d " " - prints only the "first column", that is, removes everything after the message code. Example:

6:DB20000I
12:DB21034E
19:DB21034E
26:DB21034E
34:DB20000I
41:DB20000I
47:DB21034E
54:DB20000I

gawk "/E$/{y=$0;print x, y};{x=$0}" - for every line that ends with "E" (an error line), print the line before it and then the error line. Example:

6:DB20000I 12:DB21034E
12:DB21034E 19:DB21034E
19:DB21034E 26:DB21034E
41:DB20000I 47:DB21034E

sed -e "s/:DB2[[:digit:]]\+[IE]//g" - removes the colon and the message code, leaving only line numbers. Example:

6 12
12 19
19 26
41 47

gawk "{print \"sed -n \\\"\" $1+1 \",\" $2 \"p\\\" log.txt \"}" - formats above lines for sed processing and increments first line number by one. Example:

sed -n "7,12p" log.txt 
sed -n "13,19p" log.txt 
sed -n "20,26p" log.txt 
sed -n "42,47p" log.txt

sed -e "s/$/ >> filtered.txt/g" - appends >> filtered.txt to lines, for appending to final output file. Example:

sed -n "7,12p" log.txt  >> filtered.txt
sed -n "13,19p" log.txt  >> filtered.txt
sed -n "20,26p" log.txt  >> filtered.txt
sed -n "42,47p" log.txt  >> filtered.txt

> run.bat - finally, prints the last lines to a batch file named run.bat

After you execute this file, content you wanted will appear in filtered.txt.
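For comparison, the same selection can be sketched in Java without generating an intermediate script. This is only a sketch under the assumptions of the pipeline above: every statement is followed by a status line starting with DB2 and digits, and a block is everything after the previous status line up to and including the error line (so, like the pipeline, continuation lines of a wrapped message end up in the next block). The class name ErrorBlocks and the method extract are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ErrorBlocks {
    // Any status line, e.g. DB20000I or DB21034E.
    private static final Pattern STATUS = Pattern.compile("^DB2[0-9]+");
    // Status lines whose code ends in E (errors).
    private static final Pattern ERROR = Pattern.compile("^DB2[0-9]+E");

    /** Keeps each run of lines that ends in an error status line. */
    static List<String> extract(List<String> lines) {
        List<String> out = new ArrayList<>();
        List<String> block = new ArrayList<>();
        for (String line : lines) {
            block.add(line);
            if (STATUS.matcher(line).find()) {
                if (ERROR.matcher(line).find()) {
                    out.addAll(block);   // statement lines plus the error line
                }
                block.clear();           // start collecting the next statement
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java ErrorBlocks <log file> <output file>");
            return;
        }
        // readAllLines and write both use UTF-8 by default.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), extract(lines));
    }
}
```

Run as "java ErrorBlocks log.txt filtered.txt", this should produce roughly what executing run.bat produces, in a single pass over the file.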

Update 2:

Here is another version that works on Ubuntu (previous version was written on Windows):

grep -nE "^DB2[0-9]+" log.txt | cut -f 1 -d " " | gawk '/E/{y=$0;print x, y};{x=$0}' | sed -e "s/:DB2[[:digit:]]\+[IE]//g" | gawk '{print "sed -n \""$1+1" ,"$2 "p\" log.txt" }' | sed -e "s/$/ >> filtered.txt/g" > run.sh

Two things did not work with the previous version:

for some reason, gawk '/E$/' did not work (it did not recognize that E is at the end of the line), so I used /E/ instead, since E won't be found anywhere else in those lines.
quoting: the double quotes around the gawk programs were changed to single quotes (in a Unix shell, $0 and $1 inside double quotes are expanded by the shell before gawk sees them); the quoting inside the last gawk expression was then adjusted to match

Assuming that you are looking for a block of non-blank lines, followed by a blank line, followed by a block of non-blank lines the first of which starts with DB, then try:

Pattern regex = Pattern.compile(
    "(?:.+\\n)+    # Match one or more non-blank lines\n" +
    "\\n           # Match one blank line\n" +
    "DB(?:.+\\n)+  # Match one or more non-blank lines, the first one starting with DB", 
    Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    // matched text: regexMatcher.group()
    // match start: regexMatcher.start()
    // match end: regexMatcher.end()
}

This assumes a blank line between each match, and assumes Unix line endings. If it's a DOS/Windows file, then replace \\n with \\r\\n.
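As written, the pattern matches every statement/status pair, successes included. If only the error pairs are wanted, one option is to tighten the status prefix. The following is a sketch of that variation, not the pattern from the answer: it assumes error codes look like DB2 plus digits plus E, as in the question, and it uses \R (Java 8 and later) so that both \n and \r\n line endings match. The names ErrorPairRegex and findErrorPairs are made up, and the sample log in main is invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ErrorPairRegex {
    private static final Pattern PAIR = Pattern.compile(
            "(?:.+\\R)+"                // one or more non-blank statement lines
            + "\\R"                     // one blank line
            + "DB2\\d+E.*(?:\\R.+)*");  // error line plus any continuation lines

    /** Returns each "statement + blank line + error message" block as one string. */
    static List<String> findErrorPairs(String log) {
        List<String> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(log);
        while (m.find()) {
            pairs.add(m.group());
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Invented sample input, only to show the call.
        String log = "SELECT 1\n\nDB20000I  ok\n\n"
                + "CREATE BAD\nMORE\n\nDB21034E  bad\nvalid command.\n";
        for (String pair : findErrorPairs(log)) {
            System.out.println(pair);
            System.out.println("-----");
        }
    }
}
```

Because the last group is written as a line break followed by text, the final line of the log does not need a trailing newline.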

Personally, I would go about it slightly differently. Instead of finding all the errors, I would remove all the successes.

Something like this:

Read the log file (Use a read method, not readLine as the latter will drop newline chars) into a String
Use the following regex with replaceAll(regex, "") on the String to remove all successful entries: (?:.+\r\n)+\r\n+DB2.+I(?:.+\r\n)+
Write the resulting String out to a new file.

And in code (Just call processLog with the File object for the log):

private void openAndProcessLog(){
    JFileChooser chooser = new JFileChooser();
    chooser.showOpenDialog(this);
    if (chooser.getSelectedFile() != null) {
        processLog(chooser.getSelectedFile());
    }
}

private void processLog(File logfile){
    String originalLog = readFile(logfile);
    String onlyFailures = removeAllSuccessFull(originalLog);
    System.out.println(onlyFailures);
}

private String readFile(File file) {
    // Read in chunks rather than with readLine() so line terminators are preserved.
    StringWriter out = new StringWriter();
    try (BufferedReader in = new BufferedReader(new FileReader(file))) {
        char[] buf = new char[10000];
        int n;
        while ((n = in.read(buf)) >= 0) {
            out.write(buf, 0, n);
        }
    } catch (IOException e) {
        // Report the failure instead of silently returning an empty log.
        e.printStackTrace();
        return "";
    }
    return out.toString();
}

private String removeAllSuccessFull(String text) {
    String sep = System.getProperty("line.separator");
    Pattern regex = Pattern.compile(
            "(?:.+"+sep+")+"+sep+"+DB2.+I(?:.+"+sep+")+");
    return regex.matcher(text).replaceAll("");
}

Give this a try:

#!/usr/bin/awk -f
$1 ~ /^DB.*I$/ {lines=""; nl=""; next} # discard successes
$1 ~ /^DB.*E$/ {print lines; print $0; print "-----"; lines=""; next} # print error blocks
$0 !~ /^$/ { lines = lines nl $0; nl="\n" } # accumulate lines in block

If you don't want to strip blank lines, remove the $0 !~ /^$/.

Run it like this:

./script.awk inputfile

If you are using a Linux shell or Cygwin on Windows, I'd recommend grep with the flags -A (lines after a match) and -B (lines before a match):

grep -A 2 "The SQL command completed successfully" mylog.log

This prints each matching line together with the 2 lines after it.

If you wish to write your own, I'd recommend the following:

Iterate over the lines until you meet line that meets your pattern. Then continue reading N lines (e.g. 2 lines) and print them somewhere. Then continue reading.
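A minimal Java sketch of that loop, assuming the matching line itself should be kept as well (as grep does) and that a new match inside the trailing window restarts the count. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class AfterContext {
    /** Returns each matching line followed by up to n lines after it. */
    static List<String> matchesWithFollowing(List<String> lines, Pattern pattern, int n) {
        List<String> out = new ArrayList<>();
        int remaining = 0;               // context lines still to copy
        for (String line : lines) {
            if (pattern.matcher(line).find()) {
                out.add(line);
                remaining = n;           // restart the window at every match
            } else if (remaining > 0) {
                out.add(line);
                remaining--;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Invented sample input, only to show the call.
        List<String> log = List.of("a", "DB21034E  bad", "b", "c", "d");
        matchesWithFollowing(log, Pattern.compile("^DB2[0-9]+E"), 2)
                .forEach(System.out::println);
    }
}
```

The counter means the whole file never has to be held in memory; the same loop works line by line over a BufferedReader.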