Java Regular Expressions

Developer (https://www.devze.com), 2022-12-15 20:18. Source: the web

Hello,

I have the following syntax:

@AAAA{tralala10aa,
  author = {Some Author},
  title = {Some Title},
  booktitle = {Some Booktitle},
  year = {2010},
  month = {March},
  booktitle_short = {CC 2010},
  conference_url = {http://www.mmmm.com},
  projects = {projects}
}

....

I've made the following regular expression:

@[A-Z]*[{][a-z0-9]*[,]

but I need the whole text block. How can I do it?


It seems like you would be much better off using a context-free grammar instead of a regular expression in this case. Consider using a parser generator, such as CUP or ANTLR.
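To illustrate why a grammar-style approach copes with nesting where a plain regex struggles, here is a minimal hand-written sketch. It is not CUP or ANTLR, only a brace-depth counter, and the class and method names (BraceScanner, extractBlocks) are made up for this example. It assumes braces inside a block are always balanced.

```java
import java.util.ArrayList;
import java.util.List;

class BraceScanner {
    /** Collects each "@NAME{ ... }" block by tracking how many braces are open. */
    static List<String> extractBlocks(String input) {
        List<String> blocks = new ArrayList<>();
        int start = -1; // index of the '@' that opened the current block, or -1
        int depth = 0;  // number of currently open braces
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '@' && depth == 0) {
                start = i;                      // a new block may begin here
            } else if (c == '{' && start >= 0) {
                depth++;
            } else if (c == '}' && depth > 0) {
                depth--;
                if (depth == 0) {               // outermost brace closed: block is complete
                    blocks.add(input.substring(start, i + 1));
                    start = -1;
                }
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        String input = "@AAAA{key, title = {Some {Nested} Title}}\n@BBBB{other}";
        for (String block : extractBlocks(input)) {
            System.out.println(block);
        }
    }
}
```

Because it only counts depth, this handles any nesting level, but it does no validation. A real grammar would also reject malformed entries.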


If the "block" always ends with a lone closing brace, then this may do it:

"(?ms)@[A-Z]+\\{.+?^\\}$"

Where (?ms) sets the expression to "multiline" and "dotall" (so the .+ can also match newlines), and the stuff at the end matches a closing brace on a line by itself.

The question mark in the middle makes the .+ match non-greedy so it won't match all blocks up to and including the last block in the file.
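A minimal sketch of how that pattern might be used from Java to pull out every block. The class and method names (BlockFinder, findBlocks) are invented for illustration, and it assumes each block really does end with a closing brace alone on its line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BlockFinder {
    // (?m): ^ and $ match at line boundaries; (?s): '.' also matches newlines.
    private static final Pattern BLOCK = Pattern.compile("(?ms)@[A-Z]+\\{.+?^\\}$");

    /** Returns every block, from "@NAME{" through the closing brace on its own line. */
    static List<String> findBlocks(String input) {
        List<String> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(input);
        while (m.find()) {
            blocks.add(m.group());
        }
        return blocks;
    }

    public static void main(String[] args) {
        String input = "@AAAA{tralala10aa,\n  author = {Some Author},\n  year = {2010}\n}\n\n"
                     + "@BBBB{other11bb,\n  title = {Some Title}\n}\n";
        for (String block : findBlocks(input)) {
            System.out.println(block);
            System.out.println("----");
        }
    }
}
```

The closing braces of the individual fields do not end the match early, because they never sit at the start of a line.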


If the nesting on braces is only allowed one-deep:

/@[A-Z]*{([^{}]*+|{[^{}]*+})*}/

Note the use of the possessive quantifier *+ - without it, this can take quite a long time on failed matches.

Java's java.util.regex does support possessive quantifiers, so it can stay as written. If your engine lacks them, remove it, but keep in mind the poor failure behaviour.
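For what it's worth, here is a sketch of the same pattern as a Java string. The literal braces are escaped, since java.util.regex rejects an unescaped { outside a character class. The class and method names (OneDeepBlocks, firstBlock) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OneDeepBlocks {
    // Same pattern as above with literal braces escaped; *+ is the possessive quantifier.
    private static final Pattern BLOCK =
        Pattern.compile("@[A-Z]*\\{([^{}]*+|\\{[^{}]*+\\})*\\}");

    /** Returns the first block with at most one level of nested braces, or null if none. */
    static String firstBlock(String input) {
        Matcher m = BLOCK.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "@AAAA{tralala10aa,\n  author = {Some Author},\n  year = {2010}\n}";
        System.out.println(firstBlock(input));
    }
}
```

Unlike the line-anchored version, this does not care where the closing brace sits, but a value nested two levels deep will make the match fail.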


I would not use regex; I would tokenize the string and build up a dictionary. Sorry, this is a Python implementation (not Java):

>>> s ="""@AAAA{tralala10aa,
  author = {Some Author},
  title = {Some Title},
  booktitle = {Some Booktitle},
  year = {2010},
  month = {March},
  booktitle_short = {CC 2010},
  conference_url = {http://www.mmmm.com},
  projects = {projects}
}"""
>>> 
>>> s
'@AAAA{tralala10aa,\n  author = {Some Author},\n  title = {Some Title},\n  booktitle = {Some Booktitle},\n  year = {2010},\n  month = {March},\n  booktitle_short = {CC 2010},\n  conference_url = {http://www.mmmm.com},\n  projects = {projects}\n}'
>>> 
>>> 
>>> lst = s.replace('@AAAA', '').replace('{', '').replace('}', '').split(',\n')
>>> lst
['tralala10aa', '  author = Some Author', '  title = Some Title', '  booktitle = Some Booktitle', '  year = 2010', '  month = March', '  booktitle_short = CC 2010', '  conference_url = http://www.mmmm.com', '  projects = projects\n']
>>> dct = dict((x[0].strip(), x[1].strip()) for x in (y.split('=', 1) for y in lst[1:]))
>>> dct
{'booktitle_short': 'CC 2010', 'title': 'Some Title', 'booktitle': 'Some Booktitle', 'author': 'Some Author', 'month': 'March', 'conference_url': 'http://www.mmmm.com', 'year': '2010', 'projects': 'projects'}
>>> 
>>> dct['title']
'Some Title'
>>> 

Hopefully the code above is self-explanatory.
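Since the question is about Java, here is a rough Java equivalent of the same idea. It is a sketch, not a drop-in parser: the names FieldTokenizer and parseFields are made up, and it assumes one "name = {value}" pair per line with no commas followed by a newline inside values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FieldTokenizer {
    /** Turns one "@AAAA{key, name = {value}, ...}" block into a name -> value map. */
    static Map<String, String> parseFields(String block) {
        // Drop everything up to the first '{' (the "@AAAA" tag), then strip all braces.
        String body = block.substring(block.indexOf('{') + 1)
                           .replace("{", "")
                           .replace("}", "");
        // String.split takes a regex; this splits on a comma followed by a line break.
        String[] parts = body.split(",\\r?\\n");
        Map<String, String> fields = new LinkedHashMap<>();
        // parts[0] is the entry key (e.g. "tralala10aa"); the rest are "name = value".
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].split("=", 2); // limit 2 keeps any '=' inside the value
            if (kv.length == 2) {
                fields.put(kv[0].trim(), kv[1].trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String s = "@AAAA{tralala10aa,\n  author = {Some Author},\n  title = {Some Title},\n"
                 + "  conference_url = {http://www.mmmm.com}\n}";
        System.out.println(parseFields(s).get("title"));
    }
}
```

A LinkedHashMap keeps the fields in the order they appear in the block, which the Python dict above does not.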
