
Parse numbers from large text, possibly without regex (performance critical)

Developer https://www.devze.com 2023-04-04 16:50 Source: web

Before you all start answering with variations of \d+: I'm extremely familiar with regex.

I want to know if there are alternatives to regex for parsing numbers out of a large text file.

I'm parsing tons of huge files and need to do some group/location analysis on the positions of keywords. I'm now at the point where I also need to find groups of numbers nested close to my content of interest. I want to avoid regex if at all possible because this needs to be a speedy process.

It is possible to take chunks of a file and inspect only those for the numbers of interest. That, however, would require more work and add hard-coded limits on the search, which I'd like to avoid.

I'm open to any suggestions.

UPDATE

Sorry for the lack of sample data. For HIPAA reasons I'd rather not even consider scrambling the text and posting it.

A great substitute would be the HTML source of any stackoverflow.com question page. Imagine I needed to grab the reputation (score) of everyone who posted an answer to a question. This also means that the comma (,) has to be captured as part of the number. I can't strip the HTML to simplify the content because I'm using some density analysis to weed out unrelated content, and removing the HTML would push content too close together.


Unless the file is some sort of SGML, I don't know of any existing method (which is not to say there isn't one, I just don't know of it).

That's not to say you can't create your own parser, though; you could eliminate some of the overhead of the .NET regex library by writing something that only finds runs of digits.

Fundamentally, I guess that's all any library would do at the most basic level.
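To make the "only finds runs of digits" idea concrete, here is a minimal sketch of such a scanner. It is written in Java because its structure maps closely onto C#; the class name NumberScanner and the method findNumberSpans are made up for illustration. It assumes ASCII digits only and treats a comma as part of a number only when a digit follows it, as in the reputation example. Whether this actually beats a compiled regex on your files is something you'd have to measure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a single-pass scanner that records where digit runs occur.
final class NumberScanner {

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9'; // ASCII digits only (assumption)
    }

    /**
     * Returns one int[]{start, endExclusive} per number found in the text.
     * A comma is kept inside a number only if it sits between two digits,
     * so "12,345" is one span. No check is made that digits are grouped
     * in threes, so "1,2,3" is also reported as a single span.
     */
    static List<int[]> findNumberSpans(CharSequence text) {
        List<int[]> spans = new ArrayList<>();
        int n = text.length();
        int i = 0;
        while (i < n) {
            if (!isDigit(text.charAt(i))) {
                i++;            // not a digit: keep scanning
                continue;
            }
            int start = i;      // first digit of a new number
            i++;
            while (i < n) {
                char c = text.charAt(i);
                if (isDigit(c)) {
                    i++;
                } else if (c == ',' && i + 1 < n && isDigit(text.charAt(i + 1))) {
                    i += 2;     // consume the comma and the digit after it
                } else {
                    break;      // anything else ends the number
                }
            }
            spans.add(new int[] {start, i});
        }
        return spans;
    }
}
```

Each span gives you the offsets directly, so the positions can feed into the same location analysis you already run on keywords, and text.subSequence(start, end) recovers the digits when you need them.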

Might help if you can post a sample of the sort of data you'll be processing?
