I have HTML that contains the weight of an item.
<div><b>Item Weight (0.51 lbs in Warehouse 3)</b></div>
I need a regex to get the weight and unit of measure.
So in the above HTML, I need 0.51
and lbs
I am using Java and already have a helper method; I just need to get the regex right.
String regexPattern = "";
String result = "";
Pattern p = Pattern.compile(regexPattern);
Matcher m = p.matcher(text);
if(m.find())
result = m.group(1).trim();
This should do the trick
(\d*\.?\d+)\s?(\w+)
The first capture group will be the weight and the second will be the unit of measure.
If you know the units beforehand, specifying a list of units may give better results:
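As a minimal sketch of how that pattern could be plugged into Java: the class and method names below (WeightAndUnit, extract) are made up for illustration, and the helper from the question would need to read both group(1) and group(2) rather than group(1) alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WeightAndUnit {
    // Backslashes are doubled because the regex lives in a Java string literal.
    private static final Pattern WEIGHT = Pattern.compile("(\\d*\\.?\\d+)\\s?(\\w+)");

    /** Returns {weight, unit} for the first match, or null when nothing matches. */
    static String[] extract(String text) {
        Matcher m = WEIGHT.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String html = "<div><b>Item Weight (0.51 lbs in Warehouse 3)</b></div>";
        String[] parts = extract(html);
        System.out.println(parts[0] + " " + parts[1]); // prints "0.51 lbs"
    }
}
```

Note that find() stops at the first number followed by a word, so this only works as long as nothing numeric appears before the weight.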
([\d.]+)\s+(lbs?|oz|g|kg)
I think the pattern you want is:
(\d*\.?\d+)\s*(lbs?|kg)
This will get the numbers right, and you should anchor it with actual measurements, as Jimmy pointed out, to restrict your matches to measures of weight (or whatever other measures you care about).
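To see what anchoring on real units buys you, here is a small comparison sketch. The input string is a hypothetical variation with an unrelated number in front of the weight, and the names UnitRestrictedWeight, LOOSE, STRICT and firstMatch are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnitRestrictedWeight {
    // Generic pattern: any word may follow the number.
    static final Pattern LOOSE = Pattern.compile("(\\d*\\.?\\d+)\\s?(\\w+)");
    // Restricted pattern: only the listed units may follow the number.
    static final Pattern STRICT = Pattern.compile("(\\d*\\.?\\d+)\\s*(lbs?|kg)");

    /** Returns "weight unit" for the first match, or null if there is none. */
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) + " " + m.group(2) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input with an unrelated number before the weight.
        String text = "Shelf 3 holds: Item Weight (0.51 lbs in Warehouse 3)";
        System.out.println(firstMatch(LOOSE, text));  // prints "3 holds"
        System.out.println(firstMatch(STRICT, text)); // prints "0.51 lbs"
    }
}
```

The unit list is only as good as your knowledge of the data, so extend the alternation (oz, g, and so on) to whatever units actually occur.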
This is what I came up with:
\((?<Weight>\d*\.?\d+)\s(?<Unit>\w+)
This will return the weight in group "Weight" and the unit of measure in group "Unit". And this will work with or without a decimal.
There are a couple of assumptions I made:
- The weight must be listed immediately after the first parenthesis.
- There must be a space between the weight and the unit of measure.
If those assumptions aren't always accurate then the regular expression will need some more tweaking.
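A rough sketch of reading those named groups from Java, assuming Java 7 or later (older versions do not support the (?<name>...) syntax). The class and method names are placeholders.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupWeight {
    // Requires Java 7+ for named capture groups.
    private static final Pattern WEIGHT =
            Pattern.compile("\\((?<Weight>\\d*\\.?\\d+)\\s(?<Unit>\\w+)");

    /** Returns {weight, unit} read by group name, or null when nothing matches. */
    static String[] extract(String text) {
        Matcher m = WEIGHT.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group("Weight"), m.group("Unit") };
    }

    public static void main(String[] args) {
        String html = "<div><b>Item Weight (0.51 lbs in Warehouse 3)</b></div>";
        String[] parts = extract(html);
        System.out.println(parts[0] + " " + parts[1]); // prints "0.51 lbs"
    }
}
```

Because the pattern starts at the opening parenthesis, a whole number such as "(12 kg)" is picked up too, which matches the "with or without a decimal" claim above.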
What about:
((?:\d+\.)?\d+ \w{3})
Will "Weight" always be in the string? If so, a better regex would be:
Weight.*?(\d+(?:\.\d+)?)\s+(\w+)
I assume this is valid in Java regex, as it works in Perl. The above assumes weights < 1 will be formatted as 0.X. If they can begin with a decimal point (for example .51), use this:
Weight.*?(\d*\.?\d+)\s+(\w+)
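For what it's worth, here is a sketch of the "Weight"-anchored idea in Java. The leading-decimal variant used here, Weight.*?(\d*\.?\d+)\s+(\w+), is my reading of the intent rather than a confirmed pattern, and the class, field and method names are invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WeightAnchored {
    // Expects a digit before any decimal point (0.51, 12).
    static final Pattern LEADING_DIGIT =
            Pattern.compile("Weight.*?(\\d+(?:\\.\\d+)?)\\s+(\\w+)");
    // Assumed variant that also accepts a bare leading decimal point (.51).
    static final Pattern LEADING_DOT =
            Pattern.compile("Weight.*?(\\d*\\.?\\d+)\\s+(\\w+)");

    /** Returns "weight unit" for the first match, or null if there is none. */
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) + " " + m.group(2) : null;
    }

    public static void main(String[] args) {
        String html = "<div><b>Item Weight (0.51 lbs in Warehouse 3)</b></div>";
        String dotted = "<div><b>Item Weight (.51 lbs in Warehouse 3)</b></div>";
        System.out.println(firstMatch(LEADING_DIGIT, html));   // prints "0.51 lbs"
        System.out.println(firstMatch(LEADING_DIGIT, dotted)); // prints "51 lbs"
        System.out.println(firstMatch(LEADING_DOT, dotted));   // prints ".51 lbs"
    }
}
```

The middle line shows why the distinction matters: the leading-digit pattern silently drops the decimal point and reports 51 instead of .51.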
Why use regex? Since you are relying on a fixed format anyway, you can also assume that the last pair of parentheses holds the weight and location, and that the weight and unit of measure are always separated by spaces like that.
@Test
public void testParseWeight() throws Exception {
String input = "<div><b>Item Weight (0.51 lbs in Warehouse 3)</b></div>";
int startPos = input.lastIndexOf('(');
int space = input.indexOf(' ', startPos);
String weight = input.substring(startPos + 1, space);
String uom = input.substring(space + 1, input.indexOf(' ', space + 1));
Number parse = NumberFormat.getNumberInstance(Locale.US).parse(weight);
assertEquals(0.51d, parse.doubleValue(), 0.0d);
assertEquals("lbs", uom);
}
You shouldn't use regexp for HTML. A better approach would be to use a parser (like NekoHTML) with XPath (through Jaxen, for example).
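As a rough illustration of the parser-plus-XPath idea, the sketch below uses the XML parser and XPath engine that ship with the JDK instead of NekoHTML and Jaxen. That swap only works because the sample fragment happens to be well-formed XML; real-world HTML usually is not, which is where a tolerant parser such as NekoHTML comes in. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class BoldTextViaXPath {
    /** Returns the text of the first <b> element; assumes well-formed markup. */
    static String boldText(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            // Evaluating to a String yields the text of the first matching node.
            return XPathFactory.newInstance().newXPath().evaluate("//b", doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String html = "<div><b>Item Weight (0.51 lbs in Warehouse 3)</b></div>";
        System.out.println(boldText(html)); // prints "Item Weight (0.51 lbs in Warehouse 3)"
    }
}
```

This only isolates the element text; the weight and unit still have to be pulled out of that string, by regex or by plain string handling.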