Say I have a bunch of XML files which contain no newlines, but basically contain a long list of records, delimited by </record><record>
If the delimiter were </record>\n<record>
I would be able to do something like cat *.xml | grep xyz | wc -l
to count instances of records of interest, because cat would emit the records one per line.
Is there a way to write SOMETHING *.xml | grep xyz | wc -l
where SOMETHING can stream out the records one per line? I tried using awk for this but couldn't find a way to avoid streaming the whole file into memory.
Hopefully the question is clear enough :)
This is a little ugly, but it works:
sed 's|</record>|</record>\
|g' *.xml | grep xyz | wc -l
(Yes, I know I could make it a little bit shorter, but only at the cost of clarity.)
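To see the effect on a small input, here is a minimal sketch that builds two throwaway files and runs the same sed substitution over them. The file contents and the temporary directory are made up for illustration, and grep -c stands in for grep | wc -l (it reports the same number of matching lines).

```shell
# Hypothetical sample data: two files, each one long line with no newline.
tmpdir=$(mktemp -d)
printf '<records><record>abc xyz</record><record>def</record><record>xyz 2</record></records>' > "$tmpdir/a.xml"
printf '<records><record>nothing</record><record>more xyz</record></records>' > "$tmpdir/b.xml"

# Break the line after every closing tag, then count the lines that mention xyz.
count=$(sed 's|</record>|</record>\
|g' "$tmpdir"/*.xml | grep -c xyz)

echo "$count"   # prints 3
rm -rf "$tmpdir"
```

Keep in mind that sed works a line at a time, so a file with no newlines is still held in memory as one line while the substitution runs; the command splits the output, it does not by itself bound memory use.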
If your record bodies never contain the character <, /, or > (each command below relies on a different one of them being absent), then you may try one of these:
grep -E -o 'SEARCH_STRING[^<]*</record>' *.xml| wc -l
or
grep -E -o 'SEARCH_STRING[^/]*/record>' *.xml| wc -l
or
grep -E -o 'SEARCH_STRING[^>]*>' *.xml| wc -l
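As a small check of what grep -o counts here, the sketch below runs the first variant on a made-up one-line file (the sample records and the temporary directory are assumptions for illustration). It also shows why the pipe into wc -l is needed: grep -c counts matching lines, and the whole file is a single line.

```shell
# Hypothetical sample: three records on one line, two of them containing xyz.
tmpdir=$(mktemp -d)
printf '<records><record>abc xyz</record><record>def</record><record>xyz and xyz</record></records>' > "$tmpdir/a.xml"

# -o prints every match on its own line, so wc -l counts matches.
# tr strips the padding that some wc implementations add.
matches=$(grep -E -o 'xyz[^<]*</record>' "$tmpdir/a.xml" | wc -l | tr -d ' ')

# -c counts matching lines instead, and the file has only one line.
lines=$(grep -c 'xyz' "$tmpdir/a.xml")

echo "matches=$matches lines=$lines"   # prints matches=2 lines=1
rm -rf "$tmpdir"
```

The third record contains xyz twice but is counted once, because the match runs from the first xyz through to the closing tag.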
Here is a different approach using xsltproc, grep, and wc. Warning: I am new to XSL so I can be dangerous :-). Here is my count_records.xsl file:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" /> <!-- Output text, not XML -->
<xsl:template match="record"> <!-- Search for "record" node -->
<xsl:value-of select="text()"/> <!-- Output: contents of node record -->
<xsl:text>&#10;</xsl:text> <!-- Output: a new line -->
</xsl:template>
</xsl:stylesheet>
On my Mac, I found a command line tool called xsltproc, which reads instructions from an XSL file and uses them to process XML files. So the command would be:
xsltproc count_records.xsl *.xml | grep SEARCH_STRING | wc -l
- The xsltproc command displays the text in each node, one line at a time
- The grep command filters out the text you are interested in
- Finally, the wc command produces the count
You may also try xmlstarlet for gigabyte-sized files:
# cf. http://niftybits.wordpress.com/2008/03/27/working-with-huge-xml-files-tools-of-the-trade/
xmlstarlet sel -T -t -v "count(//record[contains(normalize-space(text()),'xyz')])" -n *.xml |
awk '{n+=$1} END {print n}'
# or, summing the per-file counts with paste and bc instead of awk:
xmlstarlet sel -T -t -v "count(//record[contains(normalize-space(text()),'xyz')])" -n *.xml |
paste -s -d '+' /dev/stdin | bc
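Both pipelines end by adding up one count per input file. The sketch below isolates that summing step using made-up numbers in place of the xmlstarlet output, so it can be tried without any XML files. It reads standard input via "-" rather than /dev/stdin and only calls bc if it is installed.

```shell
# Made-up per-file counts, standing in for the xmlstarlet output (one number per line).
counts='3
0
5'

# Variant 1: let awk accumulate the total.
total_awk=$(printf '%s\n' "$counts" | awk '{n+=$1} END {print n}')

# Variant 2: join the numbers with "+" to build an expression for bc.
expr_str=$(printf '%s\n' "$counts" | paste -s -d '+' -)

echo "$total_awk"   # prints 8
echo "$expr_str"    # prints 3+0+5

# bc is not present on every system, so only evaluate the expression if it exists.
if command -v bc >/dev/null 2>&1; then
    echo "$expr_str" | bc
fi
```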