I have a list of 100,000 URLs in a List(Of String), which can contain URLs in forms such as:
yahoo.com
http://yahoo.com
http://www.yahoo.com
I have tried using a combination of regex and the Uri class, but that didn't help, so I dumped the code. I also tried using this code, but it only removes duplicates of the exact same form, since it is not domain-specific.
list = new ArrayList<T>(new HashSet<T>(list))
How can I filter these duplicates and keep just one of these URLs if they contain the same name, e.g. yahoo?
thanks
[EDIT]
Please note that:
All the URLs are of different domains, but they often have duplicates like the example I gave above.
Also, I am using .NET 2.0, so I can't use LINQ.
This worked for me
[TestMethod]
public void TestMethod1()
{
    var sites = new List<string> {"yahoo.com", "http://yahoo.com", "http://www.yahoo.com"};
    var result = sites.Select(
        s =>
        s.StartsWith("http://www.")
            ? s
            : s.StartsWith("http://")
                ? "http://www." + s.Substring(7)
                : "http://www." + s).Distinct();
    Assert.AreEqual(1, result.Count());
}
I think the Uri Class would be able to help in this case. I am not at a VS machine where I can test; however, pass the Uri constructor the string of the Url, and try the Host property for comparison:
List<string> distinctHosts = new List<string>();
foreach (string url in UrlList)
{
    // Uri requires a scheme; a bare "yahoo.com" throws UriFormatException here.
    Uri uri = new Uri(url);
    if (!distinctHosts.Contains(uri.Host))
    {
        distinctHosts.Add(uri.Host);
    }
}
This feels a bit primitive and could probably be more elegant, possibly without a foreach; but like I said, I'm not at a development machine where I could work with it.
I think this would be able to handle any variation of a valid Url. Building an ArrayList is not a good idea; in my opinion, Regex would require that you maintain some sort of custom 'MatchList' that could get unwieldy.
As @Damokles points out, you should have some form of validation. The Uri class does require a protocol: 'http://' or 'ftp://'. You do not want to assume 'badurl.com' is actually invalid; however:
if (!url.StartsWith("http://")) { /* add protocol */ } // then check Host domain as above
...should be sufficient simply to retrieve a distinct host or domain name. I recommend any option that does not require guessing the index position of any part of the Url as that is tightly bound to specific formats.
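Since the question mentions .NET 2.0, here is a minimal sketch of the approach above without LINQ. The names UrlDeduper, HostKey and DistinctByHost are hypothetical, and two assumptions go beyond the answer: a missing scheme is taken to be "http://", and a leading "www." is stripped so that yahoo.com and www.yahoo.com compare as equal.

```csharp
using System;
using System.Collections.Generic;

public static class UrlDeduper
{
    // Hypothetical helper: reduces a URL to a comparison key
    // (lower-cased host with any leading "www." removed).
    public static string HostKey(string url)
    {
        string candidate = url.Trim();

        // Uri needs a scheme; assume http when none is present.
        if (candidate.IndexOf("://", StringComparison.Ordinal) < 0)
        {
            candidate = "http://" + candidate;
        }

        Uri uri;
        if (!Uri.TryCreate(candidate, UriKind.Absolute, out uri))
        {
            return null; // not parseable as an absolute URI
        }

        string host = uri.Host.ToLowerInvariant();
        if (host.StartsWith("www.", StringComparison.Ordinal))
        {
            host = host.Substring(4);
        }
        return host;
    }

    // Keeps the first URL seen for each host key; unparseable entries are skipped.
    public static List<string> DistinctByHost(IEnumerable<string> urls)
    {
        // Dictionary stands in for HashSet<T>, which is not available in .NET 2.0.
        Dictionary<string, bool> seen = new Dictionary<string, bool>();
        List<string> result = new List<string>();
        foreach (string url in urls)
        {
            string key = HostKey(url);
            if (key == null || seen.ContainsKey(key))
            {
                continue;
            }
            seen.Add(key, true);
            result.Add(url);
        }
        return result;
    }
}
```

Calling UrlDeduper.DistinctByHost on the three sample URLs from the question should leave a single entry, the first one seen. The dictionary lookup also avoids the linear List.Contains scan, which may matter with 100,000 entries.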
You can do this with the Uri class and Linq/extension methods. The trick is to normalize the Url before using it with the Uri class. Also note that the Uri class requires the scheme, so that will have to be added for ones where it's not present. You can use a different property of the Uri class to achieve different results. The example below returns all unique Urls and treats yahoo.com differently than www.yahoo.com.
string[] urls = new[] {
    "yahoo.com",
    "http://yahoo.com",
    "http://www.yahoo.com" };
var unique = urls.
    Select(url => new System.Uri(
        url.StartsWith("http") ? url : "http://" + url).Host).
    Distinct();
(Edited to clean up formatting and to make the scheme addition part support both "http://" and "https://")
Try a Regex then .*?(\w+\.\w+)$
assuming you don't have anything after the tld.
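A rough sketch of how that pattern could be applied in C#, using only Regex.Match from System.Text.RegularExpressions. The class and method names (DomainRegex, Extract) are hypothetical. As the answer says, it only works when nothing follows the TLD, and for a domain such as bbc.co.uk the capture would be "co.uk", so treat it as an illustration rather than a general solution.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DomainRegex
{
    // The pattern from the answer: capture the last two dot-separated
    // word groups at the end of the string.
    private static readonly Regex LastTwoLabels = new Regex(@".*?(\w+\.\w+)$");

    // Hypothetical helper: returns the captured "name.tld" part,
    // or null when the input does not end in that shape.
    public static string Extract(string url)
    {
        Match m = LastTwoLabels.Match(url);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

The captured value can then serve as the key for duplicate detection, for example in a Dictionary<string, bool>.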