I've got a directory of HTML files courtesy of wget, and I need to extract the title tag and all metadata from each file -- but separately, so I can copy/paste into a spreadsheet (OK, if I were better at scripting this wouldn't be a requirement). I've got a script with two problems: the extracted text contains lots of extra white space, and when I tried to write the output to a file, the file was 600 GB (no kidding, good thing I routed it to my external drive). I'm open to any solution native to *NIX. TIA for any help.
#!/bin/bash
for LINE in `cat htmllist.txt`
do
    awk 'BEGIN{IGNORECASE=1;FS="<title>|</title>";RS=EOF} {print $2}' $LINE
done
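(For reference, a minimal sketch of what a quoting-safe version of this loop could look like, with tab-separated filename/title output that pastes cleanly into a spreadsheet; it assumes each line of htmllist.txt holds exactly one file path and that each file has a single <title> element:)

#!/bin/bash
# Sketch only: read htmllist.txt line by line so paths with spaces survive.
while IFS= read -r file; do
    # Flatten the file to one line, pull out the <title> contents
    # case-insensitively, then squeeze runs of whitespace to single spaces.
    title=$(tr '\n' ' ' < "$file" \
        | sed -E 's|.*<[Tt][Ii][Tt][Ll][Ee][^>]*>([^<]*)</[Tt][Ii][Tt][Ll][Ee]>.*|\1|' \
        | tr -s '[:space:]' ' ')
    # Tab-separated output: filename, then title.
    printf '%s\t%s\n' "$file" "$title"
done < htmllist.txt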
First off, you should get rid of all the lines with just white space. You can do this by using awk like so:
cat <file> | awk '{ if (NF > 0) printf("%s\n", $0); }'
In your case, you could just pipe the output of your awk command into this one. You could also collapse runs of whitespace using awk. Since whitespace is the default field separator, you could do this:
cat <file> | awk '{
    for (i = 1; i <= NF; i++) {
        printf("%s ", $i);
    }
    printf("\n");
}'
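Putting the pieces together in one pipeline (a sketch: page.html is just a placeholder file name, IGNORECASE requires GNU awk, and this assumes the <title> element sits on a single line):

# Extract the title, then drop blank lines and squeeze whitespace.
awk 'BEGIN{IGNORECASE=1; FS="<title>|</title>"} /<title>/ {print $2}' page.html \
    | awk 'NF > 0 { for (i = 1; i <= NF; i++) printf("%s ", $i); printf("\n") }'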