10 websites need to be cached. When caching: photos, css, js, etc are not displayed properly because the base domain isn't attached to the directory. I need a regex to add the base domain to the directory. examples below
base domain: http://www.exampl开发者_Go百科e.com
the problem occurs when reading cached pages with img src="thumb/123.jpg" or src="/inc/123.js".
they would display correctly if it was img src="http://www.example.com/thumb/123.jpg" or src="http://www.example.com/inc/123.js".
regex something like: if (src=") isn't followed by the base domain then add the base domain
without knowing the language, you can use the (maybe most portable) substitute modifier:
s/^(src=")([^"]+")$/$1www\.example\.com\/$2/
This should do the following: 1. the string 'src="' (and capture it in variable $1) 2. one or more non-double-quote (") character followed by " (and capture it in variable $2) 3. Substitutes 'www.example.com/' in between the two capture groups.
Depending on the language, you can wrap this in a conditional that checks for the existence of the domain and substitutes if it isn't found.
to check for domain: /www\.example\.com/i
should do.
EDIT: See comments:
For PHP, I would do this a bit differently. I would probably use simplexml. I don't think that will translate well, though, so here's a regex one...
$html = file_get_contents('/path/to/file.html');
$regex_match = '/(src="|href=")[^(?:www.example.com\/)]([^"]+")/gi';
$regex_substitute = '$1www.example.com/$2';
preg_replace($regex_match, $regex_substitute, $html);
Note: I haven't actually run this to debug it, it's just off the cuff. I would be concerned about 3 things. first, I am unsure how preg_replace will handle the / character. I don't think you're concerned with this, though, unless VB has a similar problem. Second, If there's a chance that line breaks would get in the way, I might change the regex. Third, I added the [^(?:www\.example\.com)]
bit. This should change the match to any src or href that doesn't have www.example.com/ there, but this depends on the type of regex being used (POSIX/PCRE).
The rest of the changes should be fine (I added href=" and also made it case-insensitive (\i) and there's a requirement to make it global (\g) otherwise, it will just match once).
I hope that helps.
Matching regular expression:
(?:src|href)="(http://www\.example\.com/)?.+
精彩评论