So I'm looking to scrape rapidshare.com links from websites. I have the following regular expressions to find links:
<a href=\"(http://rapidshare.com/files/(\\d+)/(.+)\\.(\\w{3,4}))\"
http://rapidshare.com/files/(\\d+)/(.+)\\.(\\w{3,4})
How can I write a regex that excludes the text embedded in an <a href="..."> tag and captures only the text in >here</a>?
I also have to bear in mind that not all links are embedded in href attributes; some are just displayed as plain text.
Basically, is there a way to exclude patterns in a regex?
Thanks.
To capture the inner text of an anchor tag while ignoring all of the tag's attribute text, you'd use the pattern:
<a href="http://rapidshare.com/files/(\d+)/(.+)\.(\w{3,4})[^>]*>(.*?)</a>
The [^>]* part matches everything else in your tag up until the end of the start tag. The (.*?) performs a non-greedy capture of the inner text, which ends up in capture group 4.
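As a rough illustration of how that pattern could be applied, here is a minimal Python sketch. The choice of Python, the helper name anchor_texts, and the sample markup are assumptions made for the example; only the pattern itself comes from the answer.

```python
import re

# Pattern taken verbatim from the answer above.
# Groups 1-3 are the file id, name and extension; group 4 is the inner text.
ANCHOR_RE = re.compile(
    r'<a href="http://rapidshare.com/files/(\d+)/(.+)\.(\w{3,4})[^>]*>(.*?)</a>'
)

def anchor_texts(html):
    """Return the inner text of every matching rapidshare anchor (hypothetical helper)."""
    return [m.group(4) for m in ANCHOR_RE.finditer(html)]

# Made-up sample markup with an extra attribute after the href.
sample = '<a href="http://rapidshare.com/files/123/movie.avi" class="x">Download here</a>'
print(anchor_texts(sample))
```

Because (.+) is greedy, a line that contains several links can be swallowed as one match, so this sketch is probably safest on markup with one anchor per line.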
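If you want to capture both anchor-tag links and plain-text links, those are really two separate problems. A single regex could probably handle both, but it would be terribly complicated. You're better off looking for the plain-text links separately with a simple regex. The leading [^'"] requires the character before the URL to be something other than a quote, which skips URLs sitting inside an attribute value: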
[^'"]http://rapidshare.com/files/(\d+)/(.+)\.(\w{3,4})
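A small Python sketch of the plain-text search follows. One thing worth knowing is that the leading character class has to consume a character, so a URL at the very start of the input is not matched. The negative lookbehind variant shown next to it is an assumption added for comparison, not part of the answer, and the names plain_links and LOOKBEHIND_RE are hypothetical.

```python
import re

URL = r"http://rapidshare.com/files/(\d+)/(.+)\.(\w{3,4})"

# Pattern from the answer: one non-quote character, then the URL.
PLAIN_RE = re.compile(r"""[^'"]""" + URL)

# Assumed alternative: a lookbehind checks the previous character
# without consuming it, so a URL at position 0 still matches.
LOOKBEHIND_RE = re.compile(r"""(?<!['"])""" + URL)

def plain_links(text, pattern=PLAIN_RE):
    """Return (id, name, extension) tuples for links not preceded by a quote."""
    return [m.groups() for m in pattern.finditer(text)]

# Made-up sample: one plain-text link, one link inside an href.
page = ('see http://rapidshare.com/files/42/notes.pdf\n'
        '<a href="http://rapidshare.com/files/7/a.zip">x</a>')
print(plain_links(page))
print(plain_links(page, LOOKBEHIND_RE))
```

Both calls skip the quoted href and report only the plain-text link on the first line.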
How about this? The last part matches any run of characters other than ', ", > or a space:
http://rapidshare.com/files/(\d+)/([^'"> ]+)
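Here is a minimal Python sketch of that pattern picking up links both inside an href and in plain text. The function name find_files and the sample page are made up for illustration.

```python
import re

# Pattern from the answer above: the file name runs until a quote, '>' or space.
LINK_RE = re.compile(r"""http://rapidshare.com/files/(\d+)/([^'"> ]+)""")

def find_files(html):
    """Return (file id, file name) pairs for every link found (hypothetical helper)."""
    return [(m.group(1), m.group(2)) for m in LINK_RE.finditer(html)]

# Made-up sample: one link in an href, one in plain text.
page = ('<a href="http://rapidshare.com/files/7/a.zip">get it</a> '
        'or http://rapidshare.com/files/42/notes.pdf here')
print(find_files(page))
```

Note that the class as written does not exclude <, so a plain-text URL immediately followed by a closing tag would pull part of that tag into the file name.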
How about something like:
/http:\/\/rapidshare.com\/files\/\d+\/[^<&"\s]+\.\w{3,4}/
I got rid of the capturing groups because I suspect you only had them in there to make sure the different parts matched. You can add them back in if you really want the pieces parsed out.
You can expand upon the [^<&"\s] class. As written it excludes only whitespace; the < character, which could be the start of a tag; the & character, which begins HTML entities; and the " character, which would be the end of the href= value. You could exclude any character that is not valid in a URI if you wanted. This should cover your inline text as well as the links embedded in an href.
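For what it's worth, here is a short Python sketch of this approach, using the character class as spelled out in the explanation (including the double quote). Python has no /.../ regex literals, so the slashes are left unescaped. The name all_links and the sample page are assumptions made for the example.

```python
import re

# Same idea as the pattern above, with no capturing groups:
# the file name stops at whitespace, '<', '&' or '"'.
LINK_RE = re.compile(r'http://rapidshare.com/files/\d+/[^<&"\s]+\.\w{3,4}')

def all_links(html):
    """Return every rapidshare URL found, whether inline or in an href (hypothetical helper)."""
    return LINK_RE.findall(html)

# Made-up sample: an inline link followed by a tag, and a link in an href.
page = ('<p>Mirror: http://rapidshare.com/files/42/notes.pdf</p> '
        '<a href="http://rapidshare.com/files/7/a.zip">zip</a>')
print(all_links(page))
```

With no groups in the pattern, findall returns the full matched URLs, which seems to be what the scraper needs.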