Regular expression to add base domain to directory

Developer https://www.devze.com 2023-01-08 23:21 Source: Internet
Ten websites need to be cached. In the cached copies, photos, CSS, JS, etc. are not displayed properly because the base domain isn't attached to the directory. I need a regex to add the base domain to the directory. Examples below.

Base domain: http://www.example.com

The problem occurs when reading cached pages with img src="thumb/123.jpg" or src="/inc/123.js".

They would display correctly if it were img src="http://www.example.com/thumb/123.jpg" or src="http://www.example.com/inc/123.js".

The regex would be something like: if src=" isn't followed by the base domain, then add the base domain.


Without knowing the language, you can use the (maybe most portable) substitution syntax:

s/(src=")\/?([^"]+")/$1http:\/\/www.example.com\/$2/g

This should do the following:
1. Matches the string 'src="' (and captures it in variable $1).
2. Skips an optional leading slash, then matches one or more non-double-quote (") characters followed by " (and captures them in variable $2).
3. Inserts 'http://www.example.com/' in between the two capture groups.
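As a rough illustration, the same capture-and-insert substitution could look like this in Python. The choice of Python, the `BASE` constant, and the `add_base` helper name are assumptions made for this sketch; only the base domain comes from the question.

```python
import re

BASE = "http://www.example.com/"  # base domain from the question (constant name is assumed)

# Group 1 captures 'src="'; an optional leading slash is dropped;
# group 2 captures the rest of the value plus the closing quote.
SRC_RE = re.compile(r'(src=")/?([^"]+")')

def add_base(html: str) -> str:
    """Insert BASE between the two captured groups (hypothetical helper)."""
    return SRC_RE.sub(lambda m: m.group(1) + BASE + m.group(2), html)

print(add_base('<img src="thumb/123.jpg"><script src="/inc/123.js"></script>'))
```

Note that this version prefixes every src value it finds, including values that are already absolute, so it is only safe on pages where all src attributes are relative.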

Depending on the language, you can wrap this in a conditional that checks for the existence of the domain and substitutes if it isn't found.

To check for the domain, /www\.example\.com/i should do.
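One possible way to wrap the substitution in such a conditional, sketched in Python. The helper names `fix_src` and `add_base_if_missing` are hypothetical, and the check is applied per attribute value rather than to the whole page, which is an assumption about what is wanted.

```python
import re

BASE = "http://www.example.com/"  # base domain from the question
DOMAIN_RE = re.compile(r"www\.example\.com", re.IGNORECASE)  # the domain check
ATTR_RE = re.compile(r'(src=")([^"]+)(")')

def fix_src(match):
    """Prefix BASE only when the attribute value lacks the domain."""
    prefix, value, quote = match.groups()
    if DOMAIN_RE.search(value):
        return match.group(0)  # domain already present: leave untouched
    return prefix + BASE + value.lstrip("/") + quote

def add_base_if_missing(html: str) -> str:
    # re.sub calls fix_src once for every src="..." found in the page.
    return ATTR_RE.sub(fix_src, html)
```

Running the check per attribute matters when a page mixes relative and absolute links; a single page-wide check would skip the whole page as soon as one absolute link appeared.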

EDIT: See comments:

For PHP, I would do this a bit differently. I would probably use SimpleXML. I don't think that will translate well, though, so here's a regex one...

$html = file_get_contents('/path/to/file.html');
// Match src=" or href=" values that do not already start with the base domain.
$regex_match = '~(src="|href=")(?!http://www\.example\.com/)/?([^"]+")~i';
$regex_substitute = '$1http://www.example.com/$2';
// preg_replace replaces every match and returns the modified string.
$html = preg_replace($regex_match, $regex_substitute, $html);

Note: I haven't actually run this to debug it; it's just off the cuff. I would be concerned about three things. First, the / character: it has to be escaped whenever / is also the pattern delimiter, which is why the pattern uses ~ as the delimiter instead. I don't think you're concerned with this, though, unless VB has a similar problem. Second, if there's a chance that line breaks would get in the way, I might change the regex. Third, I added the (?!http://www\.example\.com/) bit, a negative lookahead. This should restrict the match to any src or href that doesn't already start with http://www.example.com/, but this depends on the type of regex being used (PCRE supports lookaheads, POSIX does not).

The rest of the changes should be fine: I added href=" and also made it case-insensitive (the i modifier). The replacement also has to be global, otherwise it will just replace the first match; preg_replace is global by default, while a Perl-style substitution needs the g modifier.
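For comparison, here is a minimal Python sketch of "match src or href only when the base domain is not already there", using a negative lookahead. The names `LINK_RE` and `rebase` are made up for this example, and Python is an assumption since the thread does not settle on a language.

```python
import re

BASE = "http://www.example.com/"  # base domain from the question

# (?!...) is a negative lookahead: the match fails when the attribute value
# already starts with BASE. re.IGNORECASE plays the role of the i modifier,
# and re.sub replaces every occurrence by default.
LINK_RE = re.compile(
    r'((?:src|href)=")(?!' + re.escape(BASE) + r')/?([^"]+")',
    re.IGNORECASE,
)

def rebase(html: str) -> str:
    """Prefix BASE to src/href values that do not already start with it."""
    return LINK_RE.sub(lambda m: m.group(1) + BASE + m.group(2), html)
```

A caveat with this sketch: links to other hosts (for example an absolute URL on a different domain) do not start with the base domain either, so they would also be prefixed. Excluding those would need a broader lookahead.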

I hope that helps.


Matching regular expression:

(?:src|href)="(http://www\.example\.com/)?.+
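A matching expression like this can drive the fix by checking whether the optional domain group took part in the match. The Python sketch below is one way to do that; it narrows the trailing ".+" to '[^"]*' so a match stops at the closing quote, which is an adjustment made for this example, and `rebase_links` is a hypothetical helper name.

```python
import re

# Group 1: the attribute opener; group 2: the optional base domain;
# group 3: the remainder of the value up to (not including) the closing quote.
MATCH_RE = re.compile(r'((?:src|href)=")(http://www\.example\.com/)?([^"]*)')

def rebase_links(html: str, base: str = "http://www.example.com/") -> str:
    def repl(m):
        attr, domain, rest = m.groups()
        if domain or not rest:
            # Domain already present, or the value is empty: keep as-is.
            return m.group(0)
        return attr + base + rest.lstrip("/")
    return MATCH_RE.sub(repl, html)
```

When group 2 is None the value was relative, so the base is added; otherwise the original text is returned unchanged.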