Split a string on the caret character in Python

开发者 (Developer) https://www.devze.com 2023-03-14 05:53 Source: the web
I have a huge text file in which each line looks like this:

Some sort of general menu^a_sub_menu_title^^pagNumber

Notice that the first part (the "general menu") contains spaces, in the second part (a subtitle) the words are separated by "_" characters, and the last part is a number (a page number). I want to split each line into these 3 (obvious) parts, because I want to create some sort of directory in Python.

I tried the re module, but since the caret character has a special meaning there, I couldn't figure out how to do it.

Could someone please help me?


>>> "Some sort of general menu^a_sub_menu_title^^pagNumber".split("^")
['Some sort of general menu', 'a_sub_menu_title', '', 'pagNumber']


If you only want three pieces, you can drop the empty string with a list comprehension:

line = 'Some sort of general menu^a_sub_menu_title^^pagNumber'
pieces = [x for x in line.split('^') if x]
# pieces => ['Some sort of general menu', 'a_sub_menu_title', 'pagNumber']


What you need to do is "escape" the special character, as in r'\^'. But in this case something better than regular expressions would be:

line = "Some sort of general menu^a_sub_menu_title^^pagNumber"
(menu, title, dummy, page) = line.split('^')

That gives you the components in a much more straightforward fashion.
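As a rough sketch of how that unpacking might feed the kind of lookup structure the question mentions, the following groups a few sample lines by menu. The function name parse_line, the sample data, and the choice of a dict keyed by menu are all assumptions made for illustration; none of them come from the question.

```python
def parse_line(line):
    """Split one 'menu^title^^page' line into its three meaningful fields."""
    # The doubled caret yields one empty field, which is discarded here.
    menu, title, _empty, page = line.rstrip("\n").split("^")
    return menu, title, page


# Hypothetical sample lines in the format described in the question.
sample_lines = [
    "Some sort of general menu^a_sub_menu_title^^12\n",
    "Some sort of general menu^another_title^^15\n",
    "A different menu^first_title^^3\n",
]

index = {}
for raw in sample_lines:
    menu, title, page = parse_line(raw)
    # Group (title, page) pairs under their menu.
    index.setdefault(menu, []).append((title, page))
```

A line with a different number of carets raises ValueError during unpacking, which can be a handy way to spot malformed lines.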


You could just call your_string.split("^") to divide the string into a list containing each segment. The only caveat is that two consecutive carets produce an empty string between them. You could guard against this either by collapsing consecutive carets into a single one, or by filtering empty strings out of the resulting list.
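Both safeguards can be sketched in a couple of lines. This is only an illustration using the sample line from the question; the variable names are made up here.

```python
import re

line = "Some sort of general menu^a_sub_menu_title^^pagNumber"

# Option 1: treat any run of one or more carets as a single delimiter.
collapsed = re.split(r"\^+", line)

# Option 2: split on every caret, then drop the empty strings.
filtered = [part for part in line.split("^") if part]
```

For this line both options give the same three pieces. They differ on a line that starts or ends with a caret: option 1 keeps an empty string at that end, while option 2 drops it.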

For more information see http://docs.python.org/library/stdtypes.html

Does that help?


It's also possible that your file uses a format compatible with the csv module. That is worth looking into, especially if the format allows quoting, because line.split would then break on a delimiter inside a quoted field. If the format doesn't use quoting and is just delimiters and text, line.split is probably the best choice.
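A minimal sketch of the csv route, assuming the file uses "^" as the delimiter and double quotes for quoting. The second sample line, with a caret inside a quoted field, is invented to show the difference; nothing in the question says the real file contains such lines.

```python
import csv
import io

# Stand-in for the real file. With a file on disk, pass the open file object
# to csv.reader instead (opened with newline="").
data = io.StringIO(
    "Some sort of general menu^a_sub_menu_title^^12\n"
    '"A menu with a ^ in it"^another_title^^15\n'
)

# Each row is a list of fields; the doubled caret still yields an empty field.
rows = list(csv.reader(data, delimiter="^"))
```

Here the quoted caret stays inside the first field of the second row, whereas a plain split("^") would cut that field in two.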

Also, in the re module any special character can be escaped with a backslash, as in r'\^'. Before reaching for re, I'd suggest 1) learning how to write regular expressions, and 2) first looking for a simpler solution to your problem: «Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.»
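For completeness, here is roughly what the escaped version looks like with re.split, using the sample line from the question. The variable names are illustrative only.

```python
import re

line = "Some sort of general menu^a_sub_menu_title^^pagNumber"

# The backslash makes the caret a literal character instead of an anchor.
parts = re.split(r"\^", line)

# re.escape builds the escaped pattern from the raw delimiter.
same_parts = re.split(re.escape("^"), line)
```

Both calls return the same list as line.split("^"), including the empty string produced by the doubled caret.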
