Regular expression to extract bracketed and end string

Developer https://www.devze.com 2023-02-14 01:56 Source: web
I'm working on a web app that uses scraping to harvest its data. I have run into a roadblock: I'm unsure how to write a regular expression to extract the data I need.

I need to extract the distance and grade from a string like the following.

"The Bet with the Tote 525 (A6) 525y"

The grade is the "A6" and the distance is the "525y".

Every now and again, the string has another set of brackets in it that needs to be ruled out. For example, in this string:

"The Bet with the Tote (Starter race) Some more info (A6) 525y"

I will need the second set of brackets. The grade and distance are always appended to the end of the description so will always be at the end of the string.

I have tried simply using substr() to get the number of characters from the end of the string but every now and again, the distance is set to something like "525yH" which completely throws it out. For that reason, I would guess that a regular expression would be the best option.

Any help greatly appreciated.

Dan

Extended Information

  • The grade is always a minimum of 2 characters. Maximum of 3.
  • The grade does not always consist of a letter and a number.
  • Examples of grades:
    • "A1" through to "A10"
    • "T1" through to "T10"
    • "OR"
    • A number of other letter/number combinations
  • Distance can be in either metres or yards.
  • Distance is always a 3-character integer followed by either "y" or "m", except:
    • Sometimes the distance has an "H" on the end, which should be omitted.
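
The constraints listed above translate fairly directly into a pattern. The following is only a sketch: the function name and return shape are made up for illustration, and it assumes that grades contain only letters and digits, which the list above suggests but does not guarantee.

```php
<?php
// Hypothetical helper; the name and return shape are assumptions for illustration.
function parseGradeAndDistance(string $description): ?array
{
    // \(([A-Za-z0-9]{2,3})\)  grade: 2 to 3 alphanumeric characters inside brackets (assumed)
    // \s+                     whitespace between grade and distance
    // (\d{3}[ym])             distance: three digits followed by y or m
    // H?$                     optional trailing H (not captured), then end of string
    $pattern = '/\(([A-Za-z0-9]{2,3})\)\s+(\d{3}[ym])H?$/';

    if (preg_match($pattern, $description, $m) !== 1) {
        return null; // the string does not end with "(grade) distance"
    }

    return ['grade' => $m[1], 'distance' => $m[2]];
}

var_dump(parseGradeAndDistance('The Bet with the Tote 525 (A6) 525y'));
```

Because the pattern is anchored to the end of the string, an earlier bracketed phrase such as "(Starter race)" is ignored, and the trailing "H" is matched but left out of the captured distance.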


If the data pattern is fixed, why not use explode()?

<?php

$str = "The Bet with the Tote 525 (A6) 525y";
$strArr = explode(" ",$str);
$arrCount = count($strArr);
$data1 = $strArr[$arrCount - 1];
$data2 = $strArr[$arrCount - 2];
echo $data1," , ",$data2;

?>


Since

The grade and distance are always appended to the end of the description so will always be at the end of the string.

Something like the following, without regex, might work. That is, assuming your above statement is correct.

$text = "The Bet with the Tote (Starter race) Some more info (A6) 525y";
print_r(array_slice(explode(" ", $text), -2, 2));

// prints
Array
(
    [0] => (A6)
    [1] => 525y
)


You could try:

\(([^)]+)\) (\d+[ym])H?$

which is a little more specific


$str = 'The Bet with the Tote 525 (A6) 525y';

preg_match_all('/.*\((?P<grade>.+?)\)\s(?P<distance>.+?)$/', $str, $matches);

var_dump($matches);

Output

array(5) {
  [0]=>
  array(1) {
    [0]=>
    string(35) "The Bet with the Tote 525 (A6) 525y"
  }
  ["grade"]=>
  array(1) {
    [0]=>
    string(2) "A6"
  }
  [1]=>
  array(1) {
    [0]=>
    string(2) "A6"
  }
  ["distance"]=>
  array(1) {
    [0]=>
    string(4) "525y"
  }
  [2]=>
  array(1) {
    [0]=>
    string(4) "525y"
  }
}

So you can access the grade and distance via $matches['grade'][0] and $matches['distance'][0] (preg_match_all() stores each group as an array of matches).
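
Since there is only ever one grade and one distance per string, a single-match call may be more convenient than preg_match_all(). A rough sketch with the same pattern; the rtrim() step for the trailing "H" is an addition based on the question, not part of the original answer:

```php
<?php
$str = 'The Bet with the Tote (Starter race) Some more info (A6) 525y';

// preg_match() stops after the first match and stores each named group
// as a plain string instead of a nested array.
if (preg_match('/.*\((?P<grade>.+?)\)\s(?P<distance>.+?)$/', $str, $m) === 1) {
    $grade = $m['grade'];
    $distance = rtrim($m['distance'], 'H'); // drop the optional trailing H
    echo $grade, ' ', $distance, PHP_EOL;
}
```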

Update

Your second string...

The Bet with the Tote (Starter race) Some more info (A6) 525y

Output

array(5) {
  [0]=>
  array(1) {
    [0]=>
    string(61) "The Bet with the Tote (Starter race) Some more info (A6) 525y"
  }
  ["grade"]=>
  array(1) {
    [0]=>
    string(2) "A6"
  }
  [1]=>
  array(1) {
    [0]=>
    string(2) "A6"
  }
  ["distance"]=>
  array(1) {
    [0]=>
    string(4) "525y"
  }
  [2]=>
  array(1) {
    [0]=>
    string(4) "525y"
  }
}


Thanks to the updated question, it's as simple as:

preg_match('/(\(\w+\)) (\d+[ym])H?$/', $str, $matches);

Usage:

$str = "The Bet with the Tote 525 (A6) 525y";

print_r($matches);

outputs:

Array
(
    [0] => (A6) 525y
    [1] => (A6)
    [2] => 525y
)

or:

$str = "The Bet with the Tote (Starter race) Some more info (A6) 525y";

print_r($matches);

outputs:

Array
(
    [0] => (A6) 525y
    [1] => (A6)
    [2] => 525y
)

Although I personally prefer the elegance of the explode method, it would then require an extra condition and possibly an extra operation to remove the trailing H.
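
For comparison, the extra clean-up for the explode approach could look roughly like this. The helper name is hypothetical, and the sketch assumes the grade and distance are always the last two space-separated tokens:

```php
<?php
// Hypothetical helper illustrating the extra clean-up step.
function splitGradeAndDistance(string $description): ?array
{
    $parts = explode(' ', trim($description));
    if (count($parts) < 2) {
        return null; // not enough tokens to hold a grade and a distance
    }

    $distance = array_pop($parts); // last token, e.g. "525yH"
    $grade = array_pop($parts);    // second-to-last token, e.g. "(A6)"

    return [
        'grade' => trim($grade, '()'),       // strip the surrounding brackets
        'distance' => rtrim($distance, 'H'), // strip the optional trailing H
    ];
}

print_r(splitGradeAndDistance('The Bet with the Tote 525 (A6) 525yH'));
```

Unlike the regex versions, this does not validate the shape of either token, so a malformed description would be returned as-is.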


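Try with the following, where the leading .* is greedy so that the last bracketed group is captured rather than the first:

/.*\((.*?)\)\W+(.*)$/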
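
Whether the prefix is lazy or greedy matters for strings that contain two bracketed groups. A small comparison, offered as a sketch using the second sample string from the question:

```php
<?php
$str = 'The Bet with the Tote (Starter race) Some more info (A6) 525y';

// Lazy prefix: stops at the first "(", so the first bracketed group is captured.
preg_match('/.*?\((.*?)\)\W+(.*)$/', $str, $lazy);

// Greedy prefix: backtracks from the end of the string, so the last "(" wins.
preg_match('/.*\((.*?)\)\W+(.*)$/', $str, $greedy);

echo $lazy[1], ' | ', $lazy[2], PHP_EOL;     // Starter race | Some more info (A6) 525y
echo $greedy[1], ' | ', $greedy[2], PHP_EOL; // A6 | 525y
```

Neither variant strips a trailing "H" from the distance, so that would still need handling separately.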