
Splitting on first occurrence

开发者 (Developer) https://www.devze.com 2023-03-25 00:53 Source: web

What would be the best way to split a string on the first occurrence of a delimiter?

For example:

"123mango abcd mango kiwi peach"

splitting on the first mango to get:

"abcd mango kiwi peach"

To split on the last occurrence instead, see "partition string in python and get value of last segment after colon".


From the docs:

str.split([sep[, maxsplit]])

Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).

s.split('mango', 1)[1]


>>> s = "123mango abcd mango kiwi peach"
>>> s.split("mango", 1)
['123', ' abcd mango kiwi peach']
>>> s.split("mango", 1)[1]
' abcd mango kiwi peach'


For me, the better approach is:

s.split('mango', 1)[-1]

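...because if the delimiter does not occur in the string, indexing with [1] raises "IndexError: list index out of range".

Using -1 does no harm, because the number of splits is already limited to one: the last element is the part after the first delimiter, or the whole string when the delimiter is absent.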
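
A minimal sketch of that difference, assuming a delimiter ('grape') that does not occur in the string; the variable names are only illustrative:

```python
s = "123mango abcd mango kiwi peach"

# Delimiter present: [1] and [-1] select the same element.
same = s.split("mango", 1)[1] == s.split("mango", 1)[-1]

# Delimiter absent: the list has a single element, so [1] fails ...
error = None
try:
    s.split("grape", 1)[1]
except IndexError as exc:
    error = str(exc)

# ... while [-1] quietly returns the whole, unsplit string.
whole = s.split("grape", 1)[-1]
```

Note that with [-1] a missing delimiter is indistinguishable from a successful split unless the length of the list is checked as well.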


You can also use str.partition:

>>> text = "123mango abcd mango kiwi peach"

>>> text.partition("mango")
('123', 'mango', ' abcd mango kiwi peach')

>>> text.partition("mango")[-1]
' abcd mango kiwi peach'

>>> text.partition("mango")[-1].lstrip()  # if whitespace stripping is needed
'abcd mango kiwi peach'

The advantage of using str.partition is that it's always gonna return a tuple in the form:

(<pre>, <separator>, <post>)

So this makes unpacking the output really flexible as there's always going to be 3 elements in the resulting tuple.
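
As a small illustration of that unpacking (the names before, sep and after are just placeholders, and 'grape' is an assumed delimiter that is not in the string):

```python
text = "123mango abcd mango kiwi peach"

# The tuple always has three elements, so this unpacking cannot fail.
before, sep, after = text.partition("mango")

# The same unpacking works when the delimiter is missing:
# the whole string lands in the first slot and the other two are empty.
before2, sep2, after2 = text.partition("grape")
```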


Summary

The simplest and best-performing approach is to use the .partition method of the string.

Commonly, people may want to get the part either before or after the delimiter that was found, and may want to find either the first or last occurrence of the delimiter in the string. For most techniques, all of these possibilities are about equally simple, and converting from one to another is straightforward.

For the below examples, we will assume:

>>> import re
>>> s = '123mango abcd mango kiwi peach'

Using .split

>>> s.split('mango', 1)
['123', ' abcd mango kiwi peach']

The second parameter to .split limits the number of times the string will be split. This gives the parts both before and after the delimiter; then we can select what we want.

If the delimiter does not appear, no splitting is done:

>>> s.split('grape', 1)
['123mango abcd mango kiwi peach']

Thus, to check whether the delimiter was present, check the length of the result before working with it.
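
One way to sketch that length check is a small helper; after_first_split is a hypothetical name, not something from the standard library:

```python
s = '123mango abcd mango kiwi peach'

def after_first_split(text, delimiter):
    """Return the text after the first delimiter, or None if it is absent."""
    parts = text.split(delimiter, 1)
    if len(parts) == 2:   # exactly one split happened: the delimiter was found
        return parts[1]
    return None           # no split happened: the delimiter is absent

found = after_first_split(s, 'mango')
missing = after_first_split(s, 'grape')
```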

Using .partition

>>> s.partition('mango')
('123', 'mango', ' abcd mango kiwi peach')

The result is a tuple instead, and the delimiter itself is preserved when found.

When the delimiter is not found, the result will be a tuple of the same length, with two empty strings in the result:

>>> s.partition('grape')
('123mango abcd mango kiwi peach', '', '')

Thus, to check whether the delimiter was present, check the value of the second element.
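
The same check with .partition might look like this; after_first_partition is again a hypothetical helper name. Testing the second element rather than the third matters because the part after the delimiter can legitimately be empty:

```python
s = '123mango abcd mango kiwi peach'

def after_first_partition(text, delimiter):
    """Return the text after the first delimiter, or None if it is absent."""
    _before, found, after = text.partition(delimiter)
    # `found` is the delimiter itself on success and '' on failure.
    return after if found else None

present = after_first_partition(s, 'mango')
absent = after_first_partition(s, 'grape')
# The delimiter is present here, but nothing follows it.
trailing = after_first_partition('kiwi mango', 'mango')
```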

Using regular expressions

>>> # Using the top-level module functionality
>>> re.split(re.escape('mango'), s, maxsplit=1)
['123', ' abcd mango kiwi peach']
>>> # Using an explicitly compiled pattern
>>> mango = re.compile(re.escape('mango'))
>>> mango.split(s, 1)
['123', ' abcd mango kiwi peach']

The .split method of a compiled regular expression accepts the same maxsplit argument as the built-in string .split method, to limit the number of splits. Again, no splitting is done when the delimiter does not appear:

>>> grape = re.compile(re.escape('grape'))
>>> grape.split(s, 1)
['123mango abcd mango kiwi peach']

In these examples, re.escape has no effect, but in the general case it's necessary in order to specify a delimiter as literal text. On the other hand, using the re module opens up the full power of regular expressions:

>>> vowels = re.compile('[aeiou]')
>>> # Split on any vowel, without a limit on the number of splits:
>>> vowels.split(s)
['123m', 'ng', ' ', 'bcd m', 'ng', ' k', 'w', ' p', '', 'ch']

(Note the empty string: that was found between the e and the a of peach.)
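
To see why re.escape matters in the general case, here is a sketch with a delimiter that is a regex metacharacter. The input string '3.14.15' is made up for illustration:

```python
import re

version = '3.14.15'   # hypothetical input; '.' is a regex metacharacter

# Unescaped, '.' matches ANY character, so the split happens at the '3'.
unescaped = re.split('.', version, maxsplit=1)

# Escaped, '.' is treated as literal text and the split happens at the dot.
escaped = re.split(re.escape('.'), version, maxsplit=1)
```

The unescaped version does not raise an error; it silently splits in the wrong place, which is why escaping is worth doing by default for literal delimiters.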

Using indexing and slicing

Use the .index method of the string to find out where the delimiter is, then slice with that:

>>> s[:s.index('mango')] # for everything before the delimiter
'123'
>>> s[s.index('mango')+len('mango'):] # for everything after the delimiter
' abcd mango kiwi peach'

This gives the wanted part directly, without building an intermediate list or tuple. However, if the delimiter is not found, an exception will be raised instead:

>>> s[:s.index('grape')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: substring not found
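
If the exception is unwanted, one option is str.find, which returns -1 instead of raising. A minimal sketch, with after_first_index as a hypothetical helper name:

```python
s = '123mango abcd mango kiwi peach'

def after_first_index(text, delimiter):
    """Return the text after the first delimiter, or None if it is absent."""
    position = text.find(delimiter)   # -1 (not ValueError) when absent
    if position == -1:
        return None
    return text[position + len(delimiter):]

tail = after_first_index(s, 'mango')
nothing = after_first_index(s, 'grape')
```

Wrapping the .index call in try/except ValueError would work just as well; which one reads better is largely a matter of taste.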

Everything after the last occurrence, instead

Though it wasn't asked, I include related techniques here for reference.

The .split and .partition techniques have direct counterparts, to get the last part of the string (i.e., everything after the last occurrence of the delimiter). For reference:

>>> '123mango abcd mango kiwi peach'.rsplit('mango', 1)
['123mango abcd ', ' kiwi peach']
>>> '123mango abcd mango kiwi peach'.rpartition('mango')
('123mango abcd ', 'mango', ' kiwi peach')

Similarly, there is a .rindex to match .index, but it still gives the index of the beginning of the last match, not its end. Thus:

>>> s[:s.rindex('mango')] # everything before the last match
'123mango abcd '
>>> s[s.rindex('mango')+len('mango'):] # everything after the last match
' kiwi peach'

For the regular expression approach, we can fall back on the technique of reversing the input, looking for the first appearance of the reversed delimiter, reversing the individual results, and reversing the result list:

>>> ognam = re.compile(re.escape('mango'[::-1]))
>>> [x[::-1] for x in ognam.split('123mango abcd mango kiwi peach'[::-1], 1)][::-1]
['123mango abcd ', ' kiwi peach']

Of course, this is almost certainly more effort than it's worth.

Another way is to use negative lookahead from the delimiter to the end of the string:

>>> literal_mango = re.escape('mango')
>>> last_mango = re.compile(f'{literal_mango}(?!.*{literal_mango})')
>>> last_mango.split('123mango abcd mango kiwi peach', 1)
['123mango abcd ', ' kiwi peach']

Because of the lookahead, this is a worst-case O(n^2) algorithm.
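
One caveat, as far as I can tell: by default '.' does not match a newline, so the lookahead above only looks as far as the end of the current line. For multi-line input, the re.DOTALL flag seems to be needed. A sketch with a made-up multi-line string:

```python
import re

literal_mango = re.escape('mango')
multiline = 'mango one\nmango two\nkiwi'   # hypothetical multi-line input

# Without DOTALL, the lookahead stops at the newline, so the FIRST mango
# already counts as "no mango ahead" and the split happens there.
line_bound = re.compile(f'{literal_mango}(?!.*{literal_mango})')

# With DOTALL, '.' also matches newlines, so only the last mango qualifies.
whole_text = re.compile(f'{literal_mango}(?!.*{literal_mango})', re.DOTALL)

first_split = line_bound.split(multiline, 1)
last_split = whole_text.split(multiline, 1)
```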

Performance testing

$ python -m timeit --setup="s='123mango abcd mango kiwi peach'" "s.partition('mango')[-1]"
2000000 loops, best of 5: 128 nsec per loop
$ python -m timeit --setup="s='123mango abcd mango kiwi peach'" "s.split('mango', 1)[-1]"
2000000 loops, best of 5: 157 nsec per loop
$ python -m timeit --setup="s='123mango abcd mango kiwi peach'" "s[s.index('mango')+len('mango'):]"
1000000 loops, best of 5: 250 nsec per loop
$ python -m timeit --setup="s='123mango abcd mango kiwi peach'; import re; mango=re.compile(re.escape('mango'))" "mango.split(s, 1)[-1]"
1000000 loops, best of 5: 258 nsec per loop

Though more flexible, the regular expression approach is definitely slower. Limiting the number of splits improves performance with both the string method and regular expressions (timings without the limit are not shown, because they are slower and also give a different result), but .partition is still a clear winner.

For this test data, the .index approach was slower even though it only has to create one substring and doesn't have to iterate over text beyond the match (for the purpose of creating the other substrings). Pre-computing the length of the delimiter helps, but this is still slower than the .split and .partition approaches.
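
To repeat the comparison from within Python, including the pre-computed delimiter length, something like the following sketch could be used. The dictionary layout and names are my own; the lambda wrappers add call overhead, so absolute numbers will differ from the command-line results above, and no timings are shown because they vary by machine and Python version:

```python
import timeit

s = '123mango abcd mango kiwi peach'
delimiter = 'mango'
delimiter_length = len(delimiter)   # computed once, outside the timed code

candidates = {
    'partition': lambda: s.partition(delimiter)[-1],
    'split': lambda: s.split(delimiter, 1)[-1],
    'index': lambda: s[s.index(delimiter) + delimiter_length:],
}

# All candidates must agree before their timings are worth comparing.
results = {name: func() for name, func in candidates.items()}

# Total seconds for 100,000 calls of each candidate.
timings = {name: timeit.timeit(func, number=100_000)
           for name, func in candidates.items()}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f'{name}: {seconds:.4f} s')
```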
