
Regex to retrieve download link

Developer https://www.devze.com 2023-01-31 02:10 Source: Internet

I've been trying to get my regex to match a wide variety of download links and have narrowed it down to the following.

About 90% of download links start with either " or ' or http and end at " or ' or .exe. Three examples of this

Now the annoying part: I whipped up two regexes that cover this 90%, but there has to be a way to do it with only one line of code. The only thing the user should need to change is the file extension they are looking for.

I tried $ anchoring, but I'm not a regex expert and couldn't get it to work. The idea was to start the match at the first .exe occurrence and then work back to the very first " or ' or http that comes before it. Yes, the links do start with href= followed by " or ', however you can get href= and I don't know how to account for that. PLUS, for some download links you don't want the match to start from the href=, and not all of them start with http.

Example

href="/bouncer?t=http%3A%2F%2Fdownload.portableapps.com%2Fportableapps%2Ffoxitreaderportable%2FFoxitReaderPortable_4.2.paf.exe">

The two regexes I have that cover the 90% of situations are

["']([^"']+(\.zip|\.rar|\.7z))

and

(http[^"']+(\.zip|\.rar|\.7z))
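For what it's worth, the two patterns above can probably be folded into one with an alternation at the start: either consume a quote, or use a zero-width lookahead so the match begins right at http. The following is only a rough sketch in Python, not Ketarin itself; `build_link_pattern` is a made-up helper name, the sample links are invented, and whether Ketarin accepts the same pattern unchanged is an assumption.

```python
import re

def build_link_pattern(extensions):
    """Hypothetical helper: one pattern for quoted links and bare http links."""
    ext = "|".join(re.escape(e) for e in extensions)
    # (?:["']|(?=http))  start after a quote, or right at "http" (zero-width lookahead)
    # ([^"']+\.(?:ext))  capture a quote-free run that ends in one of the extensions
    return re.compile(
        r"""(?:["']|(?=http))([^"']+\.(?:%s))""" % ext,
        re.IGNORECASE,  # Python analogue of the IgnoreCase option
    )

# The extension list is the only thing a user would need to change.
pattern = build_link_pattern(["zip", "rar", "7z", "exe"])

quoted = pattern.findall('<a href="/files/tool.ZIP">')
bare = pattern.findall("<a href=http://example.com/a.rar>")
print(quoted, bare)
```

Since the pattern contains no unescaped dot, the Singleline option should make no difference to it.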

EDIT: This is used in a program called Ketarin, which parses the HTML for me and returns the page source, which I can then run the regex on. I have found that Ketarin processes regexes with the Singleline and IgnoreCase options.

This flavor of regex treats the entire block of text as a single line, so the . character also matches \r\n.
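To illustrate what those two options change, here is a small Python sketch. It assumes that Python's `re.DOTALL` and `re.IGNORECASE` behave like Singleline and IgnoreCase for this purpose; the sample text is invented.

```python
import re

# An invented link with a line break in the middle.
text = 'href="a\nb.exe"'

# Without the single-line behaviour, "." stops at the newline: no match.
plain = re.search(r'href="(.+?\.exe)', text)

# With it, "." also matches the newline, so the match spans both lines.
single_line = re.search(r'href="(.+?\.exe)', text, re.DOTALL)

# Ignoring case lets ".exe" match ".EXE" as well.
any_case = re.search(r'\.exe', 'SETUP.EXE', re.IGNORECASE)

print(plain, single_line.group(1), any_case.group(0))
```

The practical consequence is that any `.` in a pattern can run across line breaks, so a negated class such as `[^"']` is usually the safer way to stay inside one link.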

This aside, does anyone know how to start the regex match from the end of the string and work its way back to the first " or ' or http that it finds? The closest I got was

$?[^"']*.exe

But I'm not sure how to include http as an alternative (an OR) in that.


/href[\=][\"]((.*)([.]exe))[\"]/ Try this using a group match (or the scan method if you are using Ruby).


EDIT: Sorry, I based this on something that did work, hoping it would work here too. Anyway:

(?<=href=").+?\.(your|extensions|here)

Hope this one helps. Put your desired extensions in the group, separated by |, like (exe|rar|zip...).

Good Luck
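A quick way to try the lookbehind pattern outside Ketarin is a short Python sketch like the one below. `find_links` is a made-up helper, the flags are assumed stand-ins for Singleline and IgnoreCase, and the second sample string is invented; the first is the example link from the question.

```python
import re

def find_links(html, extensions):
    """Hypothetical helper around the lookbehind pattern from the answer."""
    ext = "|".join(re.escape(e) for e in extensions)
    # (?<=href=")  only start right after href="  (the prefix is not part of the match)
    # .+?\.(ext)   lazily extend to the first wanted extension
    pattern = re.compile(r'(?<=href=").+?\.(?:%s)' % ext,
                         re.IGNORECASE | re.DOTALL)
    return [m.group(0) for m in pattern.finditer(html)]

sample = ('href="/bouncer?t=http%3A%2F%2Fdownload.portableapps.com%2Fportableapps'
          '%2Ffoxitreaderportable%2FFoxitReaderPortable_4.2.paf.exe">')
links = find_links(sample, ["exe", "rar", "zip"])

# Caveat: if the first href has no wanted extension, the lazy dot keeps going.
overshoot = find_links('href="index.html">x<a href="setup.exe">', ["exe"])
print(links, overshoot)
```

On the second input the match starts at index.html and runs on into the next link, because nothing stops `.+?` at the closing quote. Swapping `.+?` for `[^"]+?` should keep each match inside its own attribute, and note that the pattern as written only covers double-quoted href values.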
