Let say I have an array like this:
- Band of Horses - Is There a Ghost
- Band Of Horses - No One's Gonna Love You
- Band of Horses - The Funeral
- Band of Horses - The Funeral (lyrics in description)
- Band of Horses - Laredo
- Band Of Horses - Laredo on Letterman 5.20.10
- Band of Horses - "The Great Salt Lake" Sub Pop Records
- Band Of Horses - "No One's Gonna Love You"
- Band of Horses perform Marry Song at Tromso Wedding
- Band Of Horses - No One's Gonna Love You
- 'Laredo' by Band of Horses on Q TV
- Band of Horses, On My Way Back Home
- Band of Horses - cigarettes wedding bands
- Band Of Horses - "Cigarettes Wedding Bands"
- Band Of Horses - I Go To The Barn Because I Like The
- Our Swords - Band of Horses
- Band Of Horses - "Marry song"
- Band of Horses - Monsters
- Band Of Horses - No One's Gonna Love You
The New Array would have:
- Band of Horses - Is There a Ghost
- Band Of Horses - No One's Gonna Love You
- Band of Horses - The 开发者_运维问答Funeral
- Band of Horses - Laredo
- Band of Horses - "The Great Salt Lake" Sub Pop Records
- Band of Horses, On My Way Back Home
- Band of Horses - cigarettes wedding bands
- Band Of Horses - I Go To The Barn Because I Like The
- Our Swords - Band of Horses
- Band Of Horses - "Marry song"
- Band of Horses - Monsters
How would you go about comparing each string to every other string in the list in PHP and if they are similar, remove them.
I consider these similar:
- Band of Horses - The Funeral
- Band of Horses - The Funeral (lyrics in description)
Another example:
- Band of Horses - Laredo
- Band Of Horses - Laredo on Letterman 5.20.10
You have multiple options.
For each option you should probably massage the album names before performing the comparisons. You can do this by stripping punctuation, sorting the words in the album name alphabetically (in certain cases), etc.
In each case, when you do a comparison, if you remove one of the album names from the array, then your comparison is order sensitive, unless you make a rule as to which album name to remove. So, it probably makes sense to always remove the longer album name if two album names are compared and found to be "similar."
The main comparison options are
Simple substring comparisons. Check if an album name is inside another. Strip punctuation first and compare case insensitively (see my second code snippet below).
Check album name similarity using
levenshtein()
. This string comparison is more efficient thensimilar_text()
. You should strip punctuation and order words alphabetically.Check album name similarity using
similar_text()
. I had the most luck with this method. In fact I got it to pick the exact album names you wanted (see first code snippet below).There are various other string comparison functions you can play around with including
soundex()
andmetaphone()
Anyway... here are 2 solutions.
The first uses similar_text()
... but it calculates the similarity only after all punctuation has been stripped and words put into alphabetical order and lowercased...... the downside is you have to play around with the threshold similarities... The second uses a simple case insensitive substring test after all punctuation and white space is stripped.
The way both code snippets work is that they use array_walk()
to run the compare()
function on each album in the array. Then inside the compare()
function, I use foreach()
to compare the current album to all the other albums. There's ample room to make things more efficient.
Note that I should be using the 3rd argument as a reference in array_walk
can someone help me do this? The current work around is a global variable:
Live example (69% similarity threshold)
function compare($value, $key)
{
global $array; // Should use 3rd argument of compare instead
$value = strtolower(preg_replace("/[^a-zA-Z0-9 ]/", "", $value));
$value = explode(" ", $value);
sort($value);
$value = implode($value);
$value = preg_replace("/[\s]/", "", $value); // Remove any leftover \s
foreach($array as $key2 => $value2)
{
if ($key != $key2)
{
// collapse, and lower case the string
$value2 = strtolower(preg_replace("/[^a-zA-Z0-9 ]/", "", $value2));
$value2 = explode(" ", $value2);
sort($value2);
$value2 = implode($value2);
$value2 = preg_replace("/[\s]/", "", $value2);
// Set up the similarity
similar_text($value, $value2, $sim);
if ($sim > 69)
{ // Remove the longer album name
unset($array[ ((strlen($value) > strlen($value2))?$key:$key2) ]);
}
}
}
}
array_walk($array, 'compare');
$array = array_values($array);
print_r($array);
The output of the above is:
Array
(
[0] => Band of Horses - Is There a Ghost
[1] => Band Of Horses - No One's Gonna Love You
[2] => Band of Horses - The Funeral
[3] => Band of Horses - Laredo
[4] => Band of Horses - "The Great Salt Lake" Sub Pop Records
[5] => Band of Horses perform Marry Song at Tromso Wedding
[6] => Band of Horses, On My Way Back Home
[7] => Band of Horses - cigarettes wedding bands
[8] => Band Of Horses - I Go To The Barn Because I Like The
[9] => Our Swords - Band of Horses
[10] => Band of Horses - Monsters
)
Note that the short version of Mary's song is missing... so it must have been a false positive against something else, since the long version is still in the list..... but they are precisely the album names you wanted.
The substring method:
Live Example
function compare($value, $key)
{
// I should be using &$array as a 3rd variable.
// For some reason couldn't get that to work, so I do this instead.
global $array;
// Take the current album name and remove all punctuation and white space
$value = preg_replace("/[^a-zA-Z0-9]/", "", $value);
// Compare current album to all othes
foreach($array as $key2 => $value2)
{
if ($key != $key2)
{
// collapse the album being compared to
$value2 = preg_replace("/[^a-zA-Z0-9]/", "", $value2);
$subject = $value2;
$pattern = '/' . $value . '/i';
// If there's a much remove the album being compared to
if (preg_match($pattern, $subject))
{
unset($array[$key2]);
}
}
}
}
array_walk($array, 'compare');
$array = array_values($array);
echo "<pre>";
print_r($array);
echo "</pre>";
For your example string the above outputs (it shows 2 that you don't want shown):
Array
(
[0] => Band of Horses - Is There a Ghost
[1] => Band Of Horses - No One's Gonna Love You
[2] => Band of Horses - The Funeral
[3] => Band of Horses - Laredo
[4] => Band of Horses - "The Great Salt Lake" Sub Pop Records
[5] => Band of Horses perform Marry Song at Tromso Wedding // <== Oops
[6] => 'Laredo' by Band of Horses on Q TV // <== Oops
[7] => Band of Horses, On My Way Back Home
[8] => Band of Horses - cigarettes wedding bands
[9] => Band Of Horses - I Go To The Barn Because I Like The
[10] => Our Swords - Band of Horses
[11] => Band Of Horses - "Marry song"
[12] => Band of Horses - Monsters
)
You might want to try similar_text
, maybe in combination with levenshtein
, and through experimentation establish a threshold of what score you consider to be similar enough. Also look at the user discussions for some more hints. You can then loop through the array, compare each element to each other element and remove elements that you deem too similar.
I hope this serves as a start for you. The problem is rather complex, since there are many things that may be considered to have the same content, but have completely different syntax ("Our Swords - Band of Horses" vs. "Band of Horses - Our Swords"). It depends if this rather simplistic solution is good enough for what you're trying to do.
Here's my (somewhat complicated?) solution.
It splits up the input strings into an array of words (getWords
). Then, it compares them all with each other, grouping them by "equality" (titlesMatch
), which doesn't care for word ordering. It stores arrays for reach group of matches so you can review similar titles.
Here's the script (assuming $array
is the input):
function getWords($str) {
// Remove non-alpha characters and split by spaces
$normalized = preg_replace('/[^a-z0-9\s]/', '', strtolower($str));
$words = preg_split('/\s+/', $normalized, -1, PREG_SPLIT_NO_EMPTY);
return $words;
}
function titlesMatch($words1, $words2) {
$intersection = array_intersect($words1, $words2);
sort($words1);
sort($words2);
sort($intersection);
return $intersection === $words1 || $intersection === $words2;
}
$wordedArray = array_map('getWords', $array);
$uniqueItems = array();
foreach ($wordedArray as $words1Index => $words1) {
$isUnique = true;
foreach ($uniqueItems as &$words2Indices) {
foreach ($words2Indices as $words2Index) {
if (titlesMatch($words1, $wordedArray[$words2Index])) {
$words2Indices[] = $words1Index;
$isUnique = false;
break;
}
}
}
if ($isUnique) {
$uniqueItems[] = array($words1Index);
}
}
// Show the first matches as an example
foreach ($uniqueItems as $indices) {
echo $array[$indices[0]] . "\n";
}
Output with your input data:
Band of Horses - Is There a Ghost
Band Of Horses - No One's Gonna Love You
Band of Horses - The Funeral
Band of Horses - Laredo
Band of Horses - "The Great Salt Lake" Sub Pop Records
Band of Horses perform Marry Song at Tromso Wedding
Band of Horses, On My Way Back Home
Band of Horses - cigarettes wedding bands
Band Of Horses - I Go To The Barn Because I Like The
Our Swords - Band of Horses
Band of Horses - Monsters
(Note: this looks O(n3) but it's really O(n2).)
The best implementation will depend greatly on your data. The more you know about your data, the better results you can get with the least amount of work. Anyway, here's a sample script I put together:
<?php
$list = array(); # source data
$groups = array();
foreach ($list as $item)
{
$words = array_unique(explode(' ', trim(preg_replace('/[^a-z]+/', ' ', strtolower($item)))));
$matches = array();
foreach ($groups as $i => $group)
{
foreach ($group as $g)
{
if (count($words) < count($g['words']))
{
$a = $words;
$b = $g['words'];
}
else
{
$a = $g['words'];
$b = $words;
}
$c = 0;
foreach ($a as $word1)
{
foreach ($b as $word2)
{
if (levenshtein($word1, $word2) < 2)
{
++$c;
break;
}
}
}
if ($c / count($a) > 0.85)
{
$matches[] = $i;
continue 2;
}
}
}
$me = array('item' => $item, 'words' => $words);
if (!$matches)
$groups[] = array($me);
else
{
for ($i = 1; $i < count($matches); ++$i)
{
$groups[$matches[0]] = array_merge($groups[$matches[0]], $groups[$matches[$i]]);
unset($groups[$matches[$i]]);
}
$groups[$matches[0]][] = $me;
}
}
foreach ($groups as $group)
{
echo $group[0]['item']."\n";
for ($i = 1; $i < count($group); ++$i)
echo "\t".$group[$i]['item']."\n";
}
?>
The output with your list:
Band of Horses - Is There a Ghost
Band Of Horses - No One's Gonna Love You
Band Of Horses - "No One's Gonna Love You"
Band Of Horses - No One's Gonna Love You
Band Of Horses - No One's Gonna Love You
Band of Horses - The Funeral
Band of Horses - The Funeral (lyrics in description)
Band of Horses - Laredo
Band Of Horses - Laredo on Letterman 5.20.10
'Laredo' by Band of Horses on Q TV
Band of Horses - "The Great Salt Lake" Sub Pop Records
Band of Horses perform Marry Song at Tromso Wedding
Band Of Horses - "Marry song"
Band of Horses, On My Way Back Home
Band of Horses - cigarettes wedding bands
Band Of Horses - "Cigarettes Wedding Bands"
Band Of Horses - I Go To The Barn Because I Like The
Our Swords - Band of Horses
Band of Horses - Monsters
The basic principle here is to group like list items together. Any new item coming in is compared against the existing groups. The shorter item is checked against the larger ones. If enough of the words (85%) are close enough (2 characters different), then it is considered a match, and is added to the list.
This might be good enough for you, if you tweak the parameters. Other things to consider: ignoring small words altogether, similar phrases, etc.
精彩评论