
Parsing srt subtitles

Developer https://www.devze.com 2022-12-26 04:13 Source: web

I want to parse srt subtitles:

    1
    00:00:12,815 --> 00:00:14,509
    Chlapi, jak to jde s
    těma pracovníma světlama?.

    2
    00:00:14,815 --> 00:00:16,498
    Trochu je zesilujeme.

    3
    00:00:16,934 --> 00:00:17,814
    Jo, sleduj.

I want to parse every item into a structure, using one of these regexes:

A:

    RE_ITEM = re.compile(r'(?P<index>\d+).'
        r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
        r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).'
        r'(?P<text>.*?)', re.DOTALL)

B:

    RE_ITEM = re.compile(r'(?P<index>\d+).'
        r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
        r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).'
        r'(?P<text>.*)', re.DOTALL)

And this code:

    for i in Subtitles.RE_ITEM.finditer(text):
        result.append((i.group('index'), i.group('start'),
                       i.group('end'), i.group('text')))

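With pattern B I get only one item in the result list, because the greedy .* swallows everything up to the end of the input. With pattern A the 'text' group is always empty, because the non-greedy .*? at the end of the pattern matches as little as possible, which is nothing.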
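Both behaviours can be reproduced with a small sketch. The shortened two-entry `sample` string and the names `HEAD`, `RE_LAZY` and `RE_GREEDY` are made up for illustration; the patterns themselves are A and B from above.

```python
import re

# Shared prefix of patterns A and B (index, start time, end time).
HEAD = (r'(?P<index>\d+).'
        r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
        r'(?P<end>\d{2}:\d{2}:\d{2},\d{3}).')

RE_LAZY = re.compile(HEAD + r'(?P<text>.*?)', re.DOTALL)   # pattern A
RE_GREEDY = re.compile(HEAD + r'(?P<text>.*)', re.DOTALL)  # pattern B

# Hypothetical two-entry input, shortened from the sample above.
sample = ("1\n00:00:12,815 --> 00:00:14,509\nFirst line\n\n"
          "2\n00:00:14,815 --> 00:00:16,498\nSecond line\n")

# A: nothing follows the lazy group, so it is satisfied with zero characters.
lazy_texts = [m.group('text') for m in RE_LAZY.finditer(sample)]

# B: with DOTALL the greedy group runs to the end of the input,
# so the second entry ends up inside the text of the first.
greedy_texts = [m.group('text') for m in RE_GREEDY.finditer(sample)]

print(lazy_texts)
print(len(greedy_texts))
```

In other words, the text group needs something after it in the pattern that says where an entry ends.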

How can I fix this?

Thanks


Why not use pysrt?


I became quite frustrated with srt libraries available for Python (often because they were heavyweight and eschewed language-standard types in favour of custom classes), so I've spent the last year or so working on my own srt library. You can get it at https://github.com/cdown/srt.

I tried to keep it simple and light on classes (except for the core Subtitle class, which more or less just stores the SRT block data). It can read and write SRT files, and turn noncompliant SRT files into compliant ones.

Here's a usage example with your sample input:

>>> import srt, pprint
>>> gen = srt.parse('''\
... 1
... 00:00:12,815 --> 00:00:14,509
... Chlapi, jak to jde s
... těma pracovníma světlama?.
... 
... 2
... 00:00:14,815 --> 00:00:16,498
... Trochu je zesilujeme.
... 
... 3
... 00:00:16,934 --> 00:00:17,814
... Jo, sleduj.
... 
... ''')
>>> pprint.pprint(list(gen))
[Subtitle(start=datetime.timedelta(0, 12, 815000), end=datetime.timedelta(0, 14, 509000), index=1, proprietary='', content='Chlapi, jak to jde s\ntěma pracovníma světlama?.'),
 Subtitle(start=datetime.timedelta(0, 14, 815000), end=datetime.timedelta(0, 16, 498000), index=2, proprietary='', content='Trochu je zesilujeme.'),
 Subtitle(start=datetime.timedelta(0, 16, 934000), end=datetime.timedelta(0, 17, 814000), index=3, proprietary='', content='Jo, sleduj.')]


The text is followed by an empty line, or the end of file. So you can use:

r' .... (?P<text>.*?)(\n\n|$)'
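Putting that terminator into the pattern from the question might look like the sketch below. The helper name `parse_srt` is made up for illustration, `\Z` is used here in place of `$` to anchor strictly at the end of the input, and Windows line endings are normalised first. Treat it as one possible reading of the suggestion, not the only way to do it.

```python
import re

RE_ITEM = re.compile(
    r'(?P<index>\d+)\n'
    r'(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> '
    r'(?P<end>\d{2}:\d{2}:\d{2},\d{3})\n'
    # Lazy text, terminated by an empty line or the end of the input.
    r'(?P<text>.*?)(?:\n\n|\Z)',
    re.DOTALL)


def parse_srt(text):
    """Return a list of (index, start, end, text) tuples (hypothetical helper)."""
    text = text.replace('\r\n', '\n')  # tolerate Windows line endings
    return [(m.group('index'), m.group('start'), m.group('end'),
             m.group('text').strip())
            for m in RE_ITEM.finditer(text)]
```

Because the lazy group now has to stop in front of an empty line or the end of the input, a multi-line subtitle stays in a single 'text' value and each entry produces its own match.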


Here's some code I had lying around to parse SRT files:

from __future__ import division

import datetime

class Srt_entry(object):
    def __init__(self, lines):
        def parsetime(string):
            hours, minutes, seconds = string.split(u':')
            hours = int(hours)
            minutes = int(minutes)
            seconds = float(u'.'.join(seconds.split(u',')))
            return datetime.timedelta(0, seconds, 0, 0, minutes, hours)
        self.index = int(lines[0])
        start, arrow, end = lines[1].split()
        self.start = parsetime(start)
        if arrow != u"-->":
            raise ValueError
        self.end = parsetime(end)
        self.lines = lines[2:]
        if not self.lines[-1]:
            del self.lines[-1]
    def __unicode__(self):
        def delta_to_string(d):
            hours = (d.days * 24) \
                    + (d.seconds // (60 * 60))
            minutes = (d.seconds // 60) % 60
            seconds = d.seconds % 60 + d.microseconds / 1000000
            return u','.join((u"%02d:%02d:%06.3f"
                              % (hours, minutes, seconds)).split(u'.'))
        return (unicode(self.index) + u'\n'
                + delta_to_string(self.start)
                + ' --> '
                + delta_to_string(self.end) + u'\n'
                + u''.join(self.lines))


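# NOTE: `options` is not defined in this snippet; it has to be supplied by the
# surrounding script (options.decode names the file's encoding, if any).
srt_file = open("foo.srt")
entries = []
entry = []
for line in srt_file:
    if options.decode:
        line = line.decode(options.decode)
    if line == u'\n':
        if entry:  # ignore repeated blank lines
            entries.append(Srt_entry(entry))
        entry = []
    else:
        entry.append(line)
if entry:
    # The last block may not be followed by a blank line.
    entries.append(Srt_entry(entry))
srt_file.close()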
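The code above targets Python 2 (`unicode`, `str.decode`). For readers on Python 3, the same line-by-line idea could be sketched roughly as follows. The names `SrtEntry`, `parse_time` and `iter_entries` are made up for this sketch and are not part of the answer above, and a byte-order mark at the start of the file is not handled.

```python
import datetime
from dataclasses import dataclass
from typing import Iterable, Iterator, List


def parse_time(stamp: str) -> datetime.timedelta:
    """Convert 'HH:MM:SS,mmm' into a timedelta."""
    clock, millis = stamp.split(',')
    hours, minutes, seconds = (int(part) for part in clock.split(':'))
    return datetime.timedelta(hours=hours, minutes=minutes,
                              seconds=seconds, milliseconds=int(millis))


@dataclass
class SrtEntry:
    index: int
    start: datetime.timedelta
    end: datetime.timedelta
    lines: List[str]


def _parse_block(block: List[str]) -> SrtEntry:
    """Parse one block: index line, timing line, then the text lines."""
    start, arrow, end = block[1].split()
    if arrow != '-->':
        raise ValueError('malformed timing line: %r' % block[1])
    return SrtEntry(int(block[0]), parse_time(start), parse_time(end),
                    block[2:])


def iter_entries(lines: Iterable[str]) -> Iterator[SrtEntry]:
    """Group lines into blocks separated by blank lines and parse each one."""
    block: List[str] = []
    for raw in lines:
        line = raw.rstrip('\r\n')
        if line.strip():
            block.append(line)
        elif block:
            yield _parse_block(block)
            block = []
    if block:  # final block without a trailing blank line
        yield _parse_block(block)
```

Since an open text file yields its lines when iterated, something like `list(iter_entries(open("foo.srt", encoding="utf-8")))` should work, assuming the file is UTF-8 encoded.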


import re

# `text` holds the full contents of the SRT file.
splits = [s.strip() for s in re.split(r'\n\s*\n', text) if s.strip()]
regex = re.compile(r'''(?P<index>\d+).*?(?P<start>\d{2}:\d{2}:\d{2},\d{3}) --> (?P<end>\d{2}:\d{2}:\d{2},\d{3})\s*.*?\s*(?P<text>.*)''', re.DOTALL)
for s in splits:
    r = regex.search(s)
    if r:  # skip blocks that do not look like a subtitle entry
        print(r.groups())


Here's a snippet I wrote which converts SRT files into dictionaries:

import re
def srt_time_to_seconds(time):
    # 'HH:MM:SS,mmm' -> seconds as a float
    split_time=time.split(',')
    major, minor = (split_time[0].split(':'), split_time[1])
    # one hour is 3600 seconds, one minute is 60 seconds
    return int(major[0])*3600 + int(major[1])*60 + int(major[2]) + float(minor)/1000

def srt_to_dict(srtText):
    subs=[]
    for s in re.sub('\r\n', '\n', srtText).split('\n\n'):
        st = s.split('\n')
        if len(st)>=3:
            split = st[1].split(' --> ')
            subs.append({'start': srt_time_to_seconds(split[0].strip()),
                         'end': srt_time_to_seconds(split[1].strip()),
                         'text': '<br />'.join(st[2:])
                        })
    return subs

Usage:

# assumes the snippet above is saved as srt_to_dict.py
from srt_to_dict import srt_to_dict
with open('test.srt', "r") as f:
    srtText = f.read()
print(srt_to_dict(srtText))