Regular expression to parse robots.txt

Developer https://www.devze.com 2023-04-06 16:18 Source: Internet

I have the following robots.txt as an example:

User-agent: googlebot
User-agent: slurp
User-agent: msnbot
User-agent: teoma
User-agent: W3C-checklink
User-agent: WDG_SiteValidator
Disallow: /
Disallow: /js/
Disallow: /Web_References/
Disallow: /webresource.axd
Disallow: /scriptresource.axd

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /webresource.axd
Disallow: /scriptresource.axd
Disallow: /js/
Disallow: /Web_References/

I may be asking too much of regex, but I want to write an expression that returns matches grouped and ordered as follows:

Matches
 - [0]
   - [UserAgents]
      - "googlebot"
      - "slurp"
      - "msnbot"
      - "teoma"
      - "W3C-checklink"
      - "WDG_SiteValidator"
   - [Routes]
      - [0]
        - [Permission] "Disallow"
        - [Url] "/"
      - [1]
        - [Permission] "Disallow"
        - [Url] "/js/"
      - [2]
        - [Permission] "Disallow"
        - [Url] "/Web_References/"

...

etc

...

I've written individual expressions that match elements of the document, but I can't get them to work when pieced together. Can someone point out where I'm going wrong?

Patterns

User agents: (?:user-agent:\s*)(?<UserAgent>[a-z_0-9-*]*)

Permissions: (?<Permission>(?:allow|disallow))(?:\s*:\s*)(?<Url>[/0-9_a-z.]*)

My attempt

((?<UserAgents>(?:user-agent:\s*)(?<UserAgent>[a-z_0-9-*]*))+(?<Routes>(?<Permission>(?:allow|disallow))(?:\s*:\s*)(?<Url>[/0-9_a-z.]*))+)+

FYI, I'm using Expresso to debug these expressions, with the following options checked: Multiline, Compiled and Ignore Case.


Try this:

^(?:User-agent:[ \t]*(?<UserAgent>.*?)|(?<Permission>Allow|Disallow):[ \t]*(?<Url>.*?))\r?$

I'm not sure about the exact format you want, but with the Multiline and IgnoreCase options this regex matches one line at a time and names the parts you are interested in, so you may be able to build on top of it. I hardly do C#, but something like this might work:

using System;
using System.Text.RegularExpressions;

// subjectString is assumed to hold the contents of robots.txt.
try {
    Regex regexObj = new Regex(
        @"^(?:User-agent:[ \t]*(?<UserAgent>.*?)|(?<Permission>Allow|Disallow):[ \t]*(?<Url>.*?))\r?$",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);
    Match matchResults = regexObj.Match(subjectString);
    while (matchResults.Success) {
        // Groups[0] is the whole line; the named groups start at index 1.
        for (int i = 1; i < matchResults.Groups.Count; i++) {
            Group groupObj = matchResults.Groups[i];
            // Only the groups on the branch that matched this line succeed.
            if (groupObj.Success) {
                // matched text: groupObj.Value
                // match start: groupObj.Index
                // match length: groupObj.Length
            }
        }
        matchResults = matchResults.NextMatch();
    }
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}
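
To get from per-line matches to the grouped shape asked for in the question, one option is to walk the matches in order and start a new block whenever a User-agent line follows a rule. Below is a minimal sketch of that idea, not a complete robots.txt parser. The names RobotsParser, RobotsBlock and RobotsRule are made up for illustration. The line pattern is an assumption in the spirit of the regex above, written with [ \t]* rather than \s* so that a match cannot run on into the next line. It does not strip # comments, and it skips any other directives (Sitemap, Crawl-delay and so on).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container types, introduced only for this sketch.
public sealed class RobotsRule
{
    public string Permission = "";
    public string Url = "";
}

public sealed class RobotsBlock
{
    public List<string> UserAgents = new List<string>();
    public List<RobotsRule> Routes = new List<RobotsRule>();
}

public static class RobotsParser
{
    // Matches one "User-agent:", "Allow:" or "Disallow:" line at a time.
    // [ \t]* (not \s*) keeps every match on a single line; \r? tolerates CRLF.
    private static readonly Regex LineRegex = new Regex(
        @"^[ \t]*(?:User-agent[ \t]*:[ \t]*(?<UserAgent>[^\r\n]*?)" +
        @"|(?<Permission>Allow|Disallow)[ \t]*:[ \t]*(?<Url>[^\r\n]*?))[ \t]*\r?$",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    public static List<RobotsBlock> Parse(string text)
    {
        var blocks = new List<RobotsBlock>();
        RobotsBlock current = null;

        foreach (Match m in LineRegex.Matches(text))
        {
            if (m.Groups["UserAgent"].Success)
            {
                // A User-agent line that follows at least one rule opens a new block;
                // consecutive User-agent lines share the same block.
                if (current == null || current.Routes.Count > 0)
                {
                    current = new RobotsBlock();
                    blocks.Add(current);
                }
                current.UserAgents.Add(m.Groups["UserAgent"].Value);
            }
            else if (current != null)
            {
                // Rules that appear before any User-agent line are ignored.
                current.Routes.Add(new RobotsRule
                {
                    Permission = m.Groups["Permission"].Value,
                    Url = m.Groups["Url"].Value
                });
            }
        }
        return blocks;
    }
}
```

Calling RobotsParser.Parse(subjectString) on the sample file from the question gives three blocks: the first with six user agents and five routes, the second for Mediapartners-Google* with a single Disallow whose Url is empty, and the third for * with four routes.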