
Remove duplicate domains from list with regular expressions

Developer https://www.devze.com 2022-12-20 05:24 Source: Internet

I'd like to use PCRE to take a list of URI's and distill it.

Start:

http://abcd.tld/products/widget1
http://abcd.tld/products/widget2
http://abcd.tld/products/review
http://1234.tld/

Finish:

http://abcd.tld/products/widget1
http://1234.tld/

Any ideas, dear members of StackOverflow?


You can use simple tools like uniq.

See kobi's example in the comments:

grep -o "^[^/]*//[^/]*/" urls.txt | sort | uniq
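For anyone without grep at hand, here is a rough Python sketch of the same pipeline. It reuses the pattern from the command above; the function name unique_prefixes and the sample list are only illustrative. Like the pipeline, it yields the scheme-and-host prefix of each URL rather than the first full URL, so it is not quite the "Finish" list from the question.

```python
import re

# Same pattern as the grep command: everything up to and including the
# first slash after the host, e.g. "http://abcd.tld/".
PREFIX = re.compile(r"^[^/]*//[^/]*/")


def unique_prefixes(lines):
    """Return the sorted, de-duplicated scheme-and-host prefixes."""
    found = set()
    for line in lines:
        match = PREFIX.match(line.strip())
        if match:  # lines without a "scheme://host/" prefix are skipped
            found.add(match.group(0))
    return sorted(found)


# Sample input taken from the question.
urls = [
    "http://abcd.tld/products/widget1",
    "http://abcd.tld/products/widget2",
    "http://abcd.tld/products/review",
    "http://1234.tld/",
]
print(unique_prefixes(urls))
```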


While it's INSANELY inefficient, it can be done...

(?<!^http://\2/.*?$.*)^(http://(.*?)/.*?$)

Please don't use this


Parse out the domain using a URI library, then insert it into a hash. You'll write over any URL that exists in that hash already so you'll end up with unique links.

Here's a Ruby example:

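require 'uri'

# Sample input from the question; replace with your own list.
links = %w[
  http://abcd.tld/products/widget1
  http://abcd.tld/products/widget2
  http://abcd.tld/products/review
  http://1234.tld/
]

unique_links = {}

links.each do |l|
  u = URI.parse(l)
  # A later link with the same host overwrites the earlier one.
  unique_links[u.host] = l
end

unique_links.values # returns an Array with one link per host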
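Because the Ruby loop above overwrites, it keeps the last link seen for each host, whereas the question's "Finish" list keeps the first. A minimal Python sketch of the first-seen variant follows, assuming Python 3.7+ so that dicts preserve insertion order; first_link_per_host is a made-up name for illustration.

```python
from urllib.parse import urlsplit


def first_link_per_host(links):
    """Keep only the first link seen for each host, in input order."""
    seen = {}
    for link in links:
        host = urlsplit(link).hostname
        # setdefault stores the link only if this host is not present yet.
        seen.setdefault(host, link)
    return list(seen.values())


# Sample input taken from the question.
links = [
    "http://abcd.tld/products/widget1",
    "http://abcd.tld/products/widget2",
    "http://abcd.tld/products/review",
    "http://1234.tld/",
]
print(first_link_per_host(links))
```

Swapping setdefault for a plain assignment would reproduce the overwrite behaviour of the Ruby version.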


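If you can work with the whole file as a single string, rather than line by line, then something like this should work. It collapses each run of consecutive lines that share everything up to the last slash into the first line of the run. (I'm not sure about the character ranges.)

s!(\w+://[a-zA-Z0-9.]+/\S+/)([^ /]+)\n(\1[^ /]+\n)+!\1\2\n!g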
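A variant of the same whole-string idea, sketched in Python and keyed on scheme and host rather than on the path prefix. This is only a sketch under some assumptions: every line ends with a newline, there is no trailing whitespace, and duplicates of a host sit on consecutive lines (sort the file first if they do not). The names RUN and collapse_adjacent are invented for the example.

```python
import re

# One URL line, followed by one or more consecutive lines that start with
# the same scheme and host (group 2). Only the first line (group 1) is kept.
RUN = re.compile(
    r"^((\w+://[^/\s]+)(?:/\S*)?)\n(?:\2(?:/\S*)?\n)+",
    re.MULTILINE,
)


def collapse_adjacent(text):
    """Collapse runs of adjacent same-host URL lines to their first line."""
    return RUN.sub(r"\1\n", text)


# Sample input taken from the question, as one newline-terminated string.
text = (
    "http://abcd.tld/products/widget1\n"
    "http://abcd.tld/products/widget2\n"
    "http://abcd.tld/products/review\n"
    "http://1234.tld/\n"
)
print(collapse_adjacent(text), end="")
```

Hosts that reappear later in the file, after a different host in between, are left alone by this pattern.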


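If you have (g)awk on your system, you can key each URL on everything before its last path segment and keep the first entry seen for each key: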

awk -F"/" '{
 s=$1
 for(i=2;i<NF;i++){ s=s"/"$i }
 if( !(s in a) ){ a[s]=$NF }
}
END{
    for(i in a) print i"/"a[i]
} ' file

output

$ ./shell.sh
http://abcd.tld/products/widget1
http://1234.tld/