开发者

updating similar fields in sql server

开发者 https://www.devze.com 2023-01-10 05:00 出处:网络
Is there a way to search for fields that contain similar values in a sql db? For example I have atable of over a million records where one column contains url values and is associated with a country c

Is there a way to search for fields that contain similar values in a sql db? For example I have a table of over a million records where one column contains url values and is associated with a country column. Previously I tried to match urls that are equal where it contained a null value for the country as was able to update it using the following:

UPDATE t1
SET t1.country = t2.country
FROM Sources AS t1
JOIN sources AS t2
ON t1.url = t2.url;

Then I altered the query to use the like word as follows:

UPDATE t1
SET t1.country = t2.country
FROM Sources AS t1
JOIN sources AS t2
ON t1.url = t2.url
WHERE t1.url like t2.url;

when I just use the select statement to find the records where urls are like then I get the results but the update does not work. A better example is as follows:

  • http://www.pantagraph.com
  • http://pantagraph.com
  • http://www.pantagraph.com/news

These are all the same domain url and I just want to update the country column f开发者_JS百科or each one to avoid doing it manually because there are around 200000 to do.


How about:

 UPDATE t1
 SET t1.country = t2.country
 FROM Sources AS t1
 JOIN sources AS t2
 ON t1.url LIKE t2.url

See what kind of joins you get when you run that on your dataset...it may make too many bad matches.

At some point you'll probably need to do some matching based on exact portions of the url, but i don't know how to do that in a query like this. See this links for info:

http://www.w3schools.com/SQL/sql_wildcards.asp

Oh and if all the URLs contain the http://www. portion you could always do something like

 WHERE left(t1.url,16) = left(t1.url,16)

That might cut down on your execution time and enfore better joins


First, I don't see how your two queries could return different results. In both cases, you are requiring that the Url match exactly in the ON clause.

Second, what do you suppose will happen if there are two rows in the Sources table with identical Urls but different countries? This is one of the reasons to be careful with using the FROM clause in an UPDATE statement (and one of the reasons it is not officially supported by the ANSI standard). In the case of two rows with identical Urls but different countries, you need to decide which one will win. Let's suppose that the lowest sorted one will win:

Update Sources
Set Country =   (
                    Select Min(T2.Country)
                    From Sources As T2
                    Where T2.Url = Sources.Url
                    )

If you want "similar" urls, we first need to know how you are defining "similar". The use of LIKE with no wildcard is effectively the same as using =. If you use a leading wildcard (e.g. LIKE ('%' + Url)), you end up doing a table scan on your million row table. However, if we use only a trailing wildcard (e.g. LIKE (Url + '%')), then SQL can still use an index:

Update Sources
Set Country =   (
                    Select Min(T2.Country)
                    From Sources As T2
                    Where T2.Url Like ( Sources.Url + '%' )
                    )

EDIT

In a later comment, it was hinted that you only wanted to update rows with NULL Country values. If that is true, it requires a trivial addition of a WHERE clause:

Update Sources
Set Country =   (
                    Select Min(T2.Country)
                    From Sources As T2
                    Where T2.Url = Sources.Url
                    )
Where Country Is Null
0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号