I am basically grepping with a regular expression. In the output, I would like to see only the strings that match my regexp.
In a bunch of XML files (mostly single-line files with huge amounts of data on one line), I would like to get all the words that start with MAIL_.
Also, I would like the grep command on the shell to give only the words that matched and not the entire line (which is the entire file in this case).
How do I do this?
I have tried
grep -Gril MAIL_* .
grep -Grio MAIL_* .
grep -Gro MAIL_* .
First of all, with the GNU grep that ships with Ubuntu, the -G flag (use basic regexp) is the default, so you can omit it. Better still, use extended regexps with -E.
The -r flag means a recursive search through the files of a directory, which is what you need.
You are also right to use the -o flag to print only the matching part of a line. To omit the file names as well, you will need the -h flag.
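To make the effect of these flags concrete, here is a small sketch that builds a throwaway directory and runs the recursive search with and without -h. The file names, their contents, and the simple pattern 'MAIL_[A-Z]+' are made up purely for illustration.

```shell
# Build a temporary directory with two single-line XML files (hypothetical names).
tmpdir=$(mktemp -d)
mkdir "$tmpdir/sub"
printf '<a x="MAIL_FROM">text</a>\n' > "$tmpdir/one.xml"
printf '<b>MAIL_TO here</b>\n' > "$tmpdir/sub/two.xml"

# -r recurses into the directory, -o prints only the matched part.
# Without -h, each match is prefixed with the name of the file it came from.
with_names=$(cd "$tmpdir" && grep -Ero 'MAIL_[A-Z]+' . | sort)

# Adding -h suppresses the file-name prefix, leaving just the matched words.
without_names=$(cd "$tmpdir" && grep -Ehro 'MAIL_[A-Z]+' . | sort)

printf 'with file names:\n%s\n' "$with_names"
printf 'without file names:\n%s\n' "$without_names"

# Clean up the temporary files.
rm -rf "$tmpdir"
```

The sort is only there to make the order of the results predictable, since the order in which grep walks a directory is not guaranteed.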
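The only mistake you made is in the regular expression itself. There is no character specification before *, so MAIL_* means "MAIL followed by zero or more underscores". The pattern should also be quoted so that the shell does not expand it as a glob. Your command should look like this: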
grep -Ehro 'MAIL_[^[:space:]]*' .
Sample output (not recursive):
$ echo "Some garbage MAIL_OPTION comes MAIL_VALUE here" | grep -Eho 'MAIL_[^[:space:]]*'
MAIL_OPTION
MAIL_VALUE
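The difference between the original pattern and the corrected one can be checked on a single line of input. This is only a sketch: the sample text is invented, and both patterns are quoted so the shell does not try to expand the * as a glob.

```shell
# A tiny sample line; the words are made up for illustration.
sample='MAIL_OPTION and MAIL alone'

# In a regex, `_*` means "zero or more underscores", so only the prefix
# is matched, and a bare MAIL with no underscore matches too.
star_only=$(printf '%s\n' "$sample" | grep -Eo 'MAIL_*')

# Giving `*` a character class to repeat captures the whole word.
whole_word=$(printf '%s\n' "$sample" | grep -Eo 'MAIL_[^[:space:]]*')

printf 'MAIL_* matched:\n%s\n' "$star_only"
printf 'MAIL_[^[:space:]]* matched:\n%s\n' "$whole_word"
```

The first pattern yields MAIL_ and MAIL on separate lines, while the second yields only MAIL_OPTION.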
Try the following command:
grep -Eo 'MAIL_[[:alnum:]_]*'
grep -o (or --only-matching) outputs only the matching text instead of complete lines, but the problem could be that your regex is not restrictive enough, or is too greedy, and actually matches the whole file.
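How much the choice of character class matters shows up on XML-like input where the words are not separated by whitespace. The sample element below is invented for illustration; the two patterns are the ones suggested in the answers above.

```shell
# A single-line XML fragment with no whitespace around the interesting words.
xml='<a x="MAIL_FOO">MAIL_BAR</a>'

# "Anything that is not whitespace" keeps going through quotes and tags.
loose=$(printf '%s\n' "$xml" | grep -Eo 'MAIL_[^[:space:]]*')

# Restricting the class to letters, digits and underscore stops at the word boundary.
strict=$(printf '%s\n' "$xml" | grep -Eo 'MAIL_[[:alnum:]_]*')

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```

Here the loose pattern returns one long match, MAIL_FOO">MAIL_BAR</a>, whereas the strict one returns MAIL_FOO and MAIL_BAR separately. On single-line XML files the stricter class is therefore likely the safer choice.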
From your comment on Thor's answer, it seems you also want to distinguish whether the MAIL_.* text is a text node or an attribute, not just isolate it wherever it appears in the XML document. grep cannot parse XML; you need a proper XML parser for that.
A command-line XML parser is xmlstarlet. It is packaged in Ubuntu.
Using it on this example file:
$ cat test.xml
<some_root>
<test a="MAIL_as_attribute">will be printed if you want matching attributes</test>
<bar>MAIL_as_text will be printed if you want matching text nodes</bar>
<MAIL_will_not_be_printed>abc</MAIL_will_not_be_printed>
</some_root>
For selecting text nodes you can use:
$ xmlstarlet sel -t -m '//*' -v 'text()' -n test.xml | grep -Eo 'MAIL_[^[:space:]]*'
MAIL_as_text
And for selecting attributes:
$ xmlstarlet sel -t -m '//*[@*]' -v '@*' -n test.xml | grep -Eo 'MAIL_[^[:space:]]*'
MAIL_as_attribute
Brief explanations:
//* is an XPath expression that selects all elements in the document, and text() outputs the value of their child text nodes, so everything except text nodes gets filtered out.
//*[@*] is an XPath expression that selects all elements that have at least one attribute, and @* then outputs the values of those attributes.