
Separating file extensions using python os.path module

Developer https://www.devze.com 2023-03-04 16:52 Source: Internet

I'm working in Python with os.path.splitext() and am curious whether it is possible to separate filenames from extensions that contain multiple "." characters, e.g. "foobar.aux.xml", using splitext. Filenames vary among [foobar, foobar.xml, foobar.aux.xml]. Is there a better way?


Split with os.extsep.

>>> import os
>>> 'filename.ext1.ext2'.split(os.extsep)
['filename', 'ext1', 'ext2']

If you want everything after the first dot:

>>> 'filename.ext1.ext2'.split(os.extsep, 1)
['filename', 'ext1.ext2']

If you are using paths with directories that may contain dots:

>>> def my_splitext(path):
...     """splitext for paths with directories that may contain dots."""
...     li = []
...     path_without_extensions = os.path.join(os.path.dirname(path), os.path.basename(path).split(os.extsep)[0])
...     extensions = os.path.basename(path).split(os.extsep)[1:]
...     li.append(path_without_extensions)
...     # li.append(extensions) if you want extensions in another list inside the list that is returned.
...     li.extend(extensions)
...     return li
... 
>>> my_splitext('/path.with/dots./filename.ext1.ext2')
['/path.with/dots./filename', 'ext1', 'ext2']
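
As an alternative that this answer does not use, the standard library's pathlib exposes a `suffixes` attribute that only looks at the final path component, so dots in directory names are ignored. The following is a minimal sketch under that assumption; `split_suffixes` is a hypothetical helper name, and `PurePosixPath` is chosen here only so the example behaves the same on every platform.

```python
from pathlib import PurePosixPath


def split_suffixes(path):
    """Return (path without suffixes, list of suffixes).

    Hypothetical helper for illustration; not part of the standard library.
    """
    p = PurePosixPath(path)
    suffixes = p.suffixes  # e.g. ['.ext1', '.ext2'], taken from the last component only
    # Cut the combined suffix text off the end of the final component.
    stem = p.name[:len(p.name) - len("".join(suffixes))]
    return str(p.with_name(stem)), suffixes


print(split_suffixes('/path.with/dots./filename.ext1.ext2'))
print(split_suffixes('foobar.aux.xml'))
print(split_suffixes('foobar'))
```

Unlike my_splitext, this sketch keeps the leading dot on each extension and returns the extensions as a separate list.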


You could try:

pathname = 'foobar.aux.xml'
names = pathname.split('.')
filename = names[0]
extensions = names[1:]

If you want to use splitext, you can use something like:

import os

path = 'filename.es.txt'

# Peel off one extension per iteration until none is left.
while True:
    path, ext = os.path.splitext(path)
    if not ext:
        print(path)
        break
    print(ext)

produces:

.txt
.es
filename


From the help of the function:

Extension is everything from the last dot to the end, ignoring leading dots.

So the answer is no, you can't do it with this function.
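
To make the quoted help text concrete, here is a small sketch that runs os.path.splitext on the three filename shapes from the question, plus a dotfile (`.bashrc` is an illustrative name added here, not one from the question) to show the "ignoring leading dots" part.

```python
import os.path

# The three filename shapes from the question, plus a dotfile.
names = ["foobar", "foobar.xml", "foobar.aux.xml", ".bashrc"]

# Each call peels off at most one extension: the text from the last dot onward.
results = {name: os.path.splitext(name) for name in names}

for name, (root, ext) in results.items():
    print(f"{name!r}: root={root!r}, ext={ext!r}")
```

For "foobar.aux.xml" a single call yields ('foobar.aux', '.xml'), so the ".aux" part stays attached to the root.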


If you want to split off any number of extensions at the end, you can create a function like this:

import os

def splitext_recurse(p):
    """Split p into (base, ext1, ext2, ...) by applying splitext repeatedly."""
    base, ext = os.path.splitext(p)
    if ext == '':
        return (base,)
    else:
        return splitext_recurse(base) + (ext,)

and use it like so:

>>> splitext_recurse("foobar.aux.xml")
('foobar', '.aux', '.xml')


import os

def get_extension(filename):
    """Return the file extension, or an empty string if none is found.

    With multiple dots, the extension is the part from the last dot onward.
    """
    result = ""
    if "." in filename:
        result = os.path.splitext(filename)[-1]

    return result


As I mentioned in a comment, this issue has been identified as a bug in Python. See https://bugs.python.org/issue34931

For example, os.path.splitext("St. Thomas.txt") returns ('St. Thomas', '.txt'), which is right. But os.path.splitext("St. Thomas") returns ('St', '. Thomas'), which is the type of error we were trying to avoid; the function below correctly returns ('St. Thomas', ''). os.path.splitext() also, strangely, leaves "....txt" unsplit as ('....txt', ''), while our safe_splitext() correctly splits it into ('...', '.txt').
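
The standard-library behaviour described in this paragraph can be checked directly. This sketch only calls os.path.splitext on the three strings quoted above; it does not include safe_splitext.

```python
import os.path

# Inputs taken from the paragraph above.
cases = ["St. Thomas.txt", "St. Thomas", "....txt"]

# What the standard library returns for each of them.
observed = [os.path.splitext(case) for case in cases]

for case, pair in zip(cases, observed):
    print(f"{case!r} -> {pair}")
```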

For the original question, you can just add the expected double extensions to the list of extensions in the regular expression.

Because I encounter file names that may have periods embedded in them (and may or may not have extensions), I have reluctantly introduced the implementation below, which requires an explicit list of expected extensions. In our case, we mostly know what extensions we are working with. If no listed extension is found, the function looks for an unlisted extension of one to four characters, splits on that, and prints a warning so that the new extension can be added to the list.

import re

def safe_splitext(filepath):
    """ the library os.path.splitext(path)
        can be fooled by periods in the name.
        This function is limited to the extensions we normally work with.
    """

    # Known extensions, matched case-insensitively at the end of the path.
    match = re.search(r'(\.pbm|\.csv|\.jpeg|\.jpg|\.json|\.lst|\.odt|'
                       r'\.pdf|\.png|\.tif|\.txt|\.xlsx|\.zip|\.html|'
                       r'\.htm|\.md|\.sha|\.DVD|\.db|\.yml|\.yaml|\.lock)$',
                    filepath, flags=re.I)
    if match:
        # Slice at the match rather than re.sub, so regex metacharacters
        # in the extension cannot break the split.
        return filepath[:match.start()], match[1]

    # Fallback: any unlisted extension of 1 to 4 non-dot characters.
    match = re.search(r'(\.[^.]{1,4})$', filepath)
    if match:
        extension = match[1]
        print(f"Warning: unusual extension: {extension}")
        return filepath[:match.start()], extension

    return filepath, ''
    