Is there a way to search for fields that contain similar values in a sql db? For example I have a table of over a million records where one column contains url values and is associated with a country column. Previously I tried to match urls that are equal where it contained a null value for the country as was able to update it using the following:
UPDATE t1
SET t1.country = t2.country
FROM Sources AS t1
JOIN sources AS t2
ON t1.url = t2.url;
Then I altered the query to use the like word as follows:
UPDATE t1
SET t1.country = t2.country
FROM Sources AS t1
JOIN sources AS t2
ON t1.url = t2.url
WHERE t1.url like t2.url;
when I just use the select statement to find the records where urls are like then I get the results but the update does not work. A better example is as follows:
- http://www.pantagraph.com
- http://pantagraph.com
- http://www.pantagraph.com/news
These are all the same domain url and I just want to update the country column f开发者_JS百科or each one to avoid doing it manually because there are around 200000 to do.
How about:
UPDATE t1
SET t1.country = t2.country
FROM Sources AS t1
JOIN sources AS t2
ON t1.url LIKE t2.url
See what kind of joins you get when you run that on your dataset...it may make too many bad matches.
At some point you'll probably need to do some matching based on exact portions of the url, but i don't know how to do that in a query like this. See this links for info:
http://www.w3schools.com/SQL/sql_wildcards.asp
Oh and if all the URLs contain the http://www. portion you could always do something like
WHERE left(t1.url,16) = left(t1.url,16)
That might cut down on your execution time and enfore better joins
First, I don't see how your two queries could return different results. In both cases, you are requiring that the Url match exactly in the ON clause.
Second, what do you suppose will happen if there are two rows in the Sources
table with identical Urls but different countries? This is one of the reasons to be careful with using the FROM clause in an UPDATE statement (and one of the reasons it is not officially supported by the ANSI standard). In the case of two rows with identical Urls but different countries, you need to decide which one will win. Let's suppose that the lowest sorted one will win:
Update Sources
Set Country = (
Select Min(T2.Country)
From Sources As T2
Where T2.Url = Sources.Url
)
If you want "similar" urls, we first need to know how you are defining "similar". The use of LIKE with no wildcard is effectively the same as using =
. If you use a leading wildcard (e.g. LIKE ('%' + Url)
), you end up doing a table scan on your million row table. However, if we use only a trailing wildcard (e.g. LIKE (Url + '%')
), then SQL can still use an index:
Update Sources
Set Country = (
Select Min(T2.Country)
From Sources As T2
Where T2.Url Like ( Sources.Url + '%' )
)
EDIT
In a later comment, it was hinted that you only wanted to update rows with NULL Country values. If that is true, it requires a trivial addition of a WHERE clause:
Update Sources
Set Country = (
Select Min(T2.Country)
From Sources As T2
Where T2.Url = Sources.Url
)
Where Country Is Null
精彩评论