Regular Expression approach_问答_开发者_运维开发者技术经验分享

I'm trying to come up with a function that produces possible database matches for specified strings where most these specified strings can't easily be matched as they are in different naming forms for example acronyms for movies. Database values only use fullnames at this stage. What i have come up with so far is a function that produces a pattern which has the initial letter of each word separated by .* from database candidates:

pkgName matched:
The.Fighter.2010.DVDRip.XviD.AC3-TiMPE,
for pattern: .*0.*M.*, title: 007
Moonraker   pkgName matched:
The.Fighter.2010.DVDRip.XviD.AC3-TiMPE,
for pattern: .*1.*A.*M.*, title: 12
Angry Men  pkgName matched:
The.Fighter.2010.DVDRip.XviD.AC3-TiMPE,
for pattern: .*3.*, title: 300 
pkgName matched:
The.Fighter.2010.DVDRip.XviD.AC3-TiMPE,
for pattern: .*A.*P.*, title: A
Prophet  pkgName matched:
The.Fighter.2010.DVDRip.XviD.AC3-TiMPE,
for pattern: .*A.*, title: Adaptation 
pkgName matched:
The.Fighter.2010.DVDRip.XviD.AC3-TiMPE,
for pattern: .*A.*, title:
Adventureland  pkgName matched:
The.Fighter.2010.DVDRip.XviD.AC3-TiMPE,
for pattern: .*A.*, title: Amelie 
pkgName开发者_运维问答 matched:
The.Fighter.2010.DVDRip.XviD.AC3-TiMPE,
for pattern: .*A.*P.*, title: American
Psycho

The problem is this method is producing too many unwanted suggested matches (all unwanted in my previous example). Can anyone suggest a better method that would trim down unwanted these matches? Are regular expressions even suitable for this?

public ArrayList<Movie> databaseMatches(String pkgName) {
    Connection conn = getConnection();
    ArrayList<Movie> dbMatches = new ArrayList<Movie>();
    try {
        for (Movie dbTitle : getDatabaseMovies(conn)) {
            Pattern p = Pattern.compile(createTitlePattern(dbTitle.getTitle()));
            Matcher m = p.matcher(pkgName);
            if (m.find()) {
                System.out.println("pkgName matched: " + pkgName + ", for pattern: " + createTitlePattern(dbTitle.getTitle()) + ", title: " + dbTitle.getTitle());
                dbMatches.add(dbTitle);
            }
        }
    } catch (SQLException e) {
        e.printStackTrace();
    }
    return dbMatches;
}

private String createTitlePattern(String dbTitle) {

    // System.out.println("dbTitle: " + dbTitle + "split(' ')");

    String titleParts[] = dbTitle.split(" ");
    String searchPattern = ".*";
    for (int i = 0; i < titleParts.length; i++) {
        char c = titleParts[i].charAt(0);
        searchPattern += (c + ".*");
    }
    // System.out.println("pattern produced: " + searchPattern);
    return searchPattern;
}

Edit: I have come across instances of strings with various characters between each acronym's letter so i thought this pattern would be appropriate.

Because you have so little criteria around the format of your data you may need to use a slightly different approach, which may or may not be feasible depending on the size of your data / throughput need for the application. One suggestion is begin with a full text match and only if that fails to produce results move to a more generalized search, or other variations.

With the preceeding example you may start with a full keyword search:

.*American.*Psycho.*

and if that fails to produce results try a pure acronym search

.*AP.*

and if that fails a single keyword search

.*((American)|(Psycho)).*

and then onto a mixed keyword / abbreviation search

.*(A|(American)).*(P|(Psycho))

etc. Again this approach can be significantly hindered based on how quickly the searches run / how quickly you need them to run.

If this is not acceptable you might try using a single "loose" pattern as above with the modifications of trying to allow a full word match if possible, and also minimizing the grouping between keywords.

.*(A[merican]*)(.*?)(P[sycho]*)

Note that we a character class (square brackets) instead of a regular grouping (round brackets) to allow a partial match on the remaining title. i.e. the previous would match "Amer. Psy.". Then based on the matches you get you could do further examination of the grouping to eliminate false positives. For example if group 1 only matched "A" you might expect group 2 to be empty, or contain only non alpha-numerics, and if it didn't you would reject it as a false positive.

In order to match unpredictable abbreviations, you need to use a better technique than First Letters. This post on Stack Overflow has some ideas, including alternative algorithms for matching the distance between two words:

Regex - Matching Abbreviations of a Word

A regex in the form of .*x.*y.*z.* means "any string in which we can find x, y, z in that order, separated by any amount of any characher", and there is no indication that x, y, or z must be at the first letter of a single word.

Before the initials, you have to put a character class with all charachters you expect as word separator.

You can use the predefined \W character class to consider all non-word characters as word separator.

Word characters are A-Z, a-z, 0-9 and _ (underscore). All others are non-word charachters.

If this can go well for you, replace ".*" with ".*\W".