I have an input file with a list of movies (note that there may be repeated entries):
American_beauty__1h56mn38s_
As_Good_As_It_Gets
As_Good_As_It_Gets
_DivX-ITA__Casablanca_M_CURTIZ_1942_Bogart-bergman_
Capote_EN_DVDRiP_XViD-GeT-AW
_DivX-ITA__Casablanca_M_CURTIZ_1942_Bogart-bergman_
For each entry in the first file, I would like to find the corresponding match (line number) in another reference file:
American beauty.(1h56mn38s)
As Good As It Gets
Capote.EN.DVDRiP.XViD-GeT-AW
[DivX-ITA] Casablanca(M.CURTIZ 1942 Bogart-bergman)
Quills (2000)(7.4)
The desired output would be something like (Reference Movie + Line number from the Reference File):
American beauty.(1h56mn38s) 1
As Good As It Gets 2
As Good As It Gets 2
[DivX-ITA] Casablanca(M.CURTIZ 1942 Bogart-bergman) 4
Capote.EN.DVDRiP.XViD-GeT-AW 3
[DivX-ITA] Casablanca(M.CURTIZ 1942 Bogart-bergman) 4
Basically, the difference between the entries in the two files is that some characters (blank spaces, parentheses, periods, square brackets, etc.) have been replaced by underscores.
Could anybody shed some light on this?
Best wishes,
Javier
Awk will work:
gawk '
    NR == FNR {
        # read the reference file first, capture the line numbers and transform
        # the "real" title to one with underscores
        line[$0] = NR
        u = $0
        gsub(/[][ .()]/,"_",u)
        movie[u] = $0
        next
    }
    $0 in movie {
        print movie[$0] " " line[movie[$0]]
    }
' movies.reference movies.list
The regular expression could be simplified if hyphens were also turned into underscores (it would then be /\W/).
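If the list file keeps its hyphens, as in the sample above, one way to use the simpler pattern is to normalize both files with the same function before comparing. The following is only a sketch under that assumption: `match_titles` and `norm` are hypothetical names, and the explicit class `[^A-Za-z0-9_]` stands in for `\W` so that it does not depend on gawk.

```shell
#!/bin/bash
# Sketch: apply the same normalization to BOTH files, so any character
# outside letters, digits and "_" (hyphens included) compares as "_".
# match_titles is a hypothetical helper: match_titles REFERENCE LIST
match_titles() {
    awk '
    # norm() works on a copy of its argument, so the original line is kept
    function norm(s) { gsub(/[^A-Za-z0-9_]/, "_", s); return s }
    NR == FNR {                      # first file: the reference titles
        key = norm($0)
        if (!(key in title)) {       # keep the first line for each key
            title[key] = $0
            line[key] = FNR
        }
        next
    }
    {                                # second file: underscore-style entries
        key = norm($0)
        if (key in title) print title[key] " " line[key]
    }
    ' "$1" "$2"
}
```

It would be called as `match_titles movies.reference movies.list`. Entries with no counterpart in the reference file are silently skipped.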
Maybe you could just strip all the undesired characters (from both the file listing and the text file) using sed?
e.g.
ls | sed -e 's/[^a-z0-9]//gi'
Or, if you want more fuzziness, you could compute an edit distance on the processed filenames (or on a tokenized version of them).
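As a rough illustration of the edit-distance idea, the sketch below lowercases each title, strips everything except letters and digits, and then picks the reference line with the smallest Levenshtein distance to the query. The names `closest_title`, `canon` and `lev` are hypothetical, and the distance is a plain dynamic-programming implementation, so it will be slow on large reference files.

```shell
#!/bin/bash
# Sketch: closest_title REFERENCE QUERY (hypothetical helper).
# Prints "<reference title> <line number> <distance>" for the best match.
# Note: awk -v interprets backslash escapes in QUERY.
closest_title() {
    awk -v query="$2" '
    # lowercase and drop everything that is not a letter or digit
    function canon(s) { s = tolower(s); gsub(/[^a-z0-9]/, "", s); return s }
    function min3(a, b, c) { if (b < a) a = b; if (c < a) a = c; return a }
    # Levenshtein distance using two rows of the DP table
    function lev(s, t,    m, n, i, j, prev, cur, cost) {
        m = length(s); n = length(t)
        for (j = 0; j <= n; j++) prev[j] = j
        for (i = 1; i <= m; i++) {
            cur[0] = i
            for (j = 1; j <= n; j++) {
                cost = (substr(s, i, 1) == substr(t, j, 1)) ? 0 : 1
                cur[j] = min3(prev[j] + 1, cur[j-1] + 1, prev[j-1] + cost)
            }
            for (j = 0; j <= n; j++) prev[j] = cur[j]
        }
        return prev[n]
    }
    BEGIN { q = canon(query); best = -1 }
    {
        d = lev(q, canon($0))
        if (best < 0 || d < best) { best = d; title = $0; num = NR }
    }
    END { if (best >= 0) print title " " num " " best }
    ' "$1"
}
```

For example, `closest_title movies.reference 'Capote_EN_DVDRiP_XViD-GeT-AW'` reports the matching reference line with a distance of 0. In practice you would probably want to reject matches whose distance exceeds some threshold, which this sketch does not do.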
Give this a try. It won't be particularly fast:
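#!/bin/bash
chars='[]() .'
while read -r line
do
    # search an underscore-normalized copy of the reference file;
    # -x -F matches the whole line literally, -m1 stops at the first hit
    num=$( grep -m1 -nxF -- "$line" <( tr "$chars" '_' < movies.reference ) | cut -d: -f1 )
    # skip entries that have no match instead of passing an empty address to sed
    [ -n "$num" ] && echo "$( sed -n "${num}{p;q}" movies.reference ) $num"
done < movies.input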