.NET Regex parse markup for repeated values in certain section but not others

Developer https://www.devze.com 2023-03-31 01:18 Source: web
I need to use .NET regular expressions to scrape some values between <value> tags of a markup file such as this (copy/pasted excerpt):

<Title>Section1</Title>

<attributeArray><name>Name1</name><value>Value1</value></attributeArray>

<attributeArray><name>Name2</name><value>Value2</value></attributeArray>

<attributeArray><name>Name3</name><value>Value3</value></attributeArray>

<attributeArray><name>Name4</name><value>Value4</value></attributeArray>

<Title>Section2</Title>

<attributeArray><name>Name1</name><value>Value1</value></attributeArray>

<attributeArray><name>Name2</name><value>Value2</value></attributeArray>

<attributeArray><name>Name3</name><value>Value3</value></attributeArray>

<attributeArray><name>Name4</name><value>Value4</value></attributeArray>

</node>

The actual text goes on to include 6 sections. The problem I have is that the tag names in every section are identical, and I only need to extract the values from, say, Section2 (so not including 1, 3, 4, 5, 6).

I have struggled with this for a couple of days and tried various conditional expressions, which were new to me, like this:

(?(<node>Section2)(.*?<value>(?<Value>.*?)<\/value>.*?))

The idea is: if this is Section 2, then parse the value keys. But it only extracts the first value; it does not iterate through each <value> of the markup, and the markup usually has around 10 values that I need to extract (abbreviated in the example above).

This is not being done in code so I don't have the liberty of using an XML parser.

Any suggestions would be greatly appreciated - or if I can clarify further let me know.

An afterthought: if there is a way to include the text of the title with each value match, then I could parse all 6 sections and later filter the result based on the section I am after. That would also work.

example:

match1
group1 = Section2
group2 = Value1

match2
group1 = Section2
group2 = Value2

match3
group1 = Section2
group2 = Value3

match4
group1 = Section2
group2 = Value4

Thanks!


Here's one option:

(?:
   <Title>Section2</Title>    # Match the header
   |                          # or
   \G(?!\A)                   # Match where the previous match ended
)\s*
<attributeArray>
    <name>(?<name>[^<]*)</name>
    <value>(?<value>[^<]*)</value>
</attributeArray>

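The first match includes the header, and each following match must start where the previous one ended. The pattern is laid out with whitespace and # comments, so it needs RegexOptions.IgnorePatternWhitespace.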
Working example: http://regexhero.net/tester/?id=321ce843-923d-4556-9b99-dbb72175929a
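In case it helps to see the \G pattern driven from code, here is a minimal C# sketch. The sample input is made up (shortened to two values per section), and the variable names input, regex and values are only illustrative; the pattern itself is the one above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sample input: three sections with identical tag names.
string input = string.Join("\n",
    "<Title>Section1</Title>",
    "<attributeArray><name>Name1</name><value>Value1-1</value></attributeArray>",
    "<attributeArray><name>Name2</name><value>Value1-2</value></attributeArray>",
    "<Title>Section2</Title>",
    "<attributeArray><name>Name1</name><value>Value2-1</value></attributeArray>",
    "<attributeArray><name>Name2</name><value>Value2-2</value></attributeArray>",
    "<Title>Section3</Title>",
    "<attributeArray><name>Name1</name><value>Value3-1</value></attributeArray>");

// IgnorePatternWhitespace lets the pattern keep its layout and # comments.
var regex = new Regex(@"
    (?:
       <Title>Section2</Title>    # Match the header
       |                          # or
       \G(?!\A)                   # continue where the previous match ended
    )\s*
    <attributeArray>
        <name>(?<name>[^<]*)</name>
        <value>(?<value>[^<]*)</value>
    </attributeArray>",
    RegexOptions.IgnorePatternWhitespace);

// Each match is one name/value pair; only the Section2 pairs are chained.
var values = regex.Matches(input)
    .Cast<Match>()
    .Select(m => m.Groups["value"].Value)
    .ToList();

foreach (string value in values)
    Console.WriteLine(value);
```

The chain breaks as soon as something other than whitespace plus <attributeArray> follows the previous match, which is why the Section3 pair is not picked up.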


Note that the above will fail if there are other elements you didn't mention between the values or after the title. You can get around that with a probably less efficient pattern, using the fact that .NET regexes support variable-length lookbehinds (if the markup spans several lines, the . must also match newlines, i.e. RegexOptions.Singleline):

(?<=                          # lookbehind - check that before the current position
   <Title>Section2</Title>    #  we can see the wanted title,
   (?:(?!<Title>).)*          #  followed by no more title between it and here.
)
<attributeArray>
    <name>(?<name>[^<]*)</name>
    <value>(?<value>[^<]*)</value>
</attributeArray>

Example: http://regexhero.net/tester/?id=743c4de6-1b8a-48a4-a69b-63f3624de594
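A rough C# sketch of the lookbehind version follows. The <note> elements in the sample input are invented stand-ins for "other elements", and the variable names are illustrative. It assumes the markup spans several lines, hence RegexOptions.Singleline.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical input; the <note> elements stand in for unmentioned markup.
string input = string.Join("\n",
    "<Title>Section1</Title>",
    "<attributeArray><name>Name1</name><value>Value1-1</value></attributeArray>",
    "<Title>Section2</Title>",
    "<note>not mentioned in the question</note>",
    "<attributeArray><name>Name1</name><value>Value2-1</value></attributeArray>",
    "<note>another one</note>",
    "<attributeArray><name>Name2</name><value>Value2-2</value></attributeArray>",
    "<Title>Section3</Title>",
    "<attributeArray><name>Name1</name><value>Value3-1</value></attributeArray>");

// Singleline makes '.' match newlines, so the lookbehind can reach back
// across lines to the nearest preceding title.
var regex = new Regex(@"
    (?<=                          # before the current position
       <Title>Section2</Title>    #  we can see the wanted title,
       (?:(?!<Title>).)*          #  with no other title in between
    )
    <attributeArray>
        <name>(?<name>[^<]*)</name>
        <value>(?<value>[^<]*)</value>
    </attributeArray>",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

var values = regex.Matches(input)
    .Cast<Match>()
    .Select(m => m.Groups["value"].Value)
    .ToList();

foreach (string value in values)
    Console.WriteLine(value);
```

Unlike the \G version, the matches here do not have to be adjacent, so the extra elements between the pairs are simply skipped.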

If you want to, you can change the title to <Title>(?<title>[^<]*)</Title>, capture all values in the file, and filter by the wanted title - it will be added to each match.
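As a sketch of that variation (again with made-up sample input and illustrative variable names), the title group sits inside the lookbehind and is read from each match afterwards. This relies on .NET keeping captures made inside a positive lookbehind.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sample input with two sections.
string input = string.Join("\n",
    "<Title>Section1</Title>",
    "<attributeArray><name>Name1</name><value>Value1-1</value></attributeArray>",
    "<attributeArray><name>Name2</name><value>Value1-2</value></attributeArray>",
    "<Title>Section2</Title>",
    "<attributeArray><name>Name1</name><value>Value2-1</value></attributeArray>",
    "<attributeArray><name>Name2</name><value>Value2-2</value></attributeArray>");

var regex = new Regex(@"
    (?<=
       <Title>(?<title>[^<]*)</Title>   # nearest preceding title, captured
       (?:(?!<Title>).)*                # no other title in between
    )
    <attributeArray>
        <name>(?<name>[^<]*)</name>
        <value>(?<value>[^<]*)</value>
    </attributeArray>",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

// Capture every pair in the input together with its section title.
var rows = regex.Matches(input)
    .Cast<Match>()
    .Select(m => new
    {
        Title = m.Groups["title"].Value,
        Value = m.Groups["value"].Value
    })
    .ToList();

// Filter afterwards by the section of interest.
var section2 = rows.Where(r => r.Title == "Section2")
                   .Select(r => r.Value)
                   .ToList();

foreach (var row in rows)
    Console.WriteLine(row.Title + ": " + row.Value);
```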


Lastly, here's a similar approach that will also work in other flavors: it captures the name/value pairs that come right before the title Section3, assuming the sections are well ordered (so those pairs belong to Section2):

<attributeArray>
    <name>(?<name>[^<]*)</name>
    <value>(?<value>[^<]*)</value>
</attributeArray>
(?=
   (?:(?!<Title>).)*
   <Title>Section3</Title>
)

Example: http://regexhero.net/tester/?id=8d8ae0e8-5f10-439f-a5a5-50d0b4e73bd2
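The lookahead version can be tried the same way. This is only a sketch on invented sample input with illustrative variable names. Note that it depends on a Section3 title actually following the wanted section, so as written it would not work for the last section of a file.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sample input: three sections in order.
string input = string.Join("\n",
    "<Title>Section1</Title>",
    "<attributeArray><name>Name1</name><value>Value1-1</value></attributeArray>",
    "<Title>Section2</Title>",
    "<attributeArray><name>Name1</name><value>Value2-1</value></attributeArray>",
    "<attributeArray><name>Name2</name><value>Value2-2</value></attributeArray>",
    "<Title>Section3</Title>",
    "<attributeArray><name>Name1</name><value>Value3-1</value></attributeArray>");

var regex = new Regex(@"
    <attributeArray>
        <name>(?<name>[^<]*)</name>
        <value>(?<value>[^<]*)</value>
    </attributeArray>
    (?=                           # after the pair,
       (?:(?!<Title>).)*          #  the next title we reach
       <Title>Section3</Title>    #  must be Section3
    )",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

var values = regex.Matches(input)
    .Cast<Match>()
    .Select(m => m.Groups["value"].Value)
    .ToList();

foreach (string value in values)
    Console.WriteLine(value);
```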


I recommend using a CaptureCollection:

using System;
using System.Text.RegularExpressions;

string s = @"<Title>Section1</Title>
<attributeArray><name>Name1</name><value>Value1-1</value></attributeArray>
<attributeArray><name>Name2</name><value>Value1-2</value></attributeArray>
<attributeArray><name>Name3</name><value>Value1-3</value></attributeArray>
<attributeArray><name>Name4</name><value>Value1-4</value></attributeArray>

<Title>Section2</Title>
<attributeArray><name>Name1</name><value>Value2-1</value></attributeArray>
<attributeArray><name>Name2</name><value>Value2-2</value></attributeArray>
<attributeArray><name>Name3</name><value>Value2-3</value></attributeArray>
<attributeArray><name>Name4</name><value>Value2-4</value></attributeArray>

<Title>Section3</Title>
<attributeArray><name>Name1</name><value>Value3-1</value></attributeArray>
<attributeArray><name>Name2</name><value>Value3-2</value></attributeArray>
<attributeArray><name>Name3</name><value>Value3-3</value></attributeArray>
<attributeArray><name>Name4</name><value>Value3-4</value></attributeArray>";

Regex r = new Regex(
  @"<Title>(Section2)</Title>(?:\s*<attributeArray>.*?<value>(.*?)</value></attributeArray>)+");
Match m = r.Match(s);
if (m.Success)
{
  string section = m.Groups[1].Value;
  int i = 0;
  foreach (Capture c in m.Groups[2].Captures)
  {
    Console.WriteLine("match{0}\ngroup1 = {1}\ngroup2 = {2}\n",
                      ++i, section, c.Value);
  }
}

m.Groups[2].Value would return Value2-4, the last thing to be captured in group #2. But all the intermediate captures are retained, and can be accessed through the Captures property.
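If the names are needed as well, the same idea extends to two repeated groups. The sketch below is a variation on the code above, not part of the original answer: the named groups name and value and the pairs list are my own additions. Both groups capture exactly once per repetition, so their Captures collections line up by index.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sample input with two sections.
string input = string.Join("\n",
    "<Title>Section1</Title>",
    "<attributeArray><name>Name1</name><value>Value1-1</value></attributeArray>",
    "<Title>Section2</Title>",
    "<attributeArray><name>Name1</name><value>Value2-1</value></attributeArray>",
    "<attributeArray><name>Name2</name><value>Value2-2</value></attributeArray>");

// One overall match; the repeated group records a capture per iteration.
var regex = new Regex(
    @"<Title>(?<title>Section2)</Title>" +
    @"(?:\s*<attributeArray><name>(?<name>[^<]*)</name>" +
    @"<value>(?<value>[^<]*)</value></attributeArray>)+");

var pairs = new List<string>();
Match m = regex.Match(input);
if (m.Success)
{
    CaptureCollection names = m.Groups["name"].Captures;
    CaptureCollection vals = m.Groups["value"].Captures;
    // The i-th name and the i-th value come from the same <attributeArray>.
    for (int i = 0; i < names.Count; i++)
        pairs.Add(names[i].Value + " = " + vals[i].Value);
}

foreach (string pair in pairs)
    Console.WriteLine(pair);
```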
