I'm using Python 2.7's difflib.HtmlDiff.make_table()
function to generate diffs between expected and actual files for an internal test-case runner. The diffs end up in an HTML test report.
This has worked fine so far -- until I added a test case with bigger files (~400 KiB) that contain lots of differences and often no line breaks. Almost all of my test cases execute in less than 2 s, with a few more complex ones going as high as 4 s. This new one is just as fast when it passes, but takes 13 minutes (!) to fail, and all of that time is spent generating the report. I hope you can see how that's a problem.
An attempt to demonstrate this (probably not the best way, I know):
s = """import os, difflib
a = [os.urandom(length)]
b = [os.urandom(length)]
difflib.HtmlDiff().make_table(a, b)"""
import timeit
print 'length 100:', timeit.timeit(s, setup='length = 100', number=1)
print 'length 1000:', timeit.timeit(s, setup='length = 1000', number=1)
print 'length 10000:', timeit.timeit(s, setup='length = 10000', number=1)
print 'length 100000:', timeit.timeit(s, setup='length = 100000', number=1)
print 'length 400000:', timeit.timeit(s, setup='length = 400000', number=1)
And the results:
length 100: 0.022672659081
length 1000: 0.0125987213238
length 10000: 0.479898318086
length 100000: 54.9947423284
length 400000: 1451.59828412
difflib.ndiff() (which make_table() uses internally, as far as I understand it) does not seem to have this problem:
s = """import os, difflib
a = [os.urandom(length)]
b = [os.urandom(length)]
difflib.ndiff(a, b)"""
import timeit
print 'length 100:', timeit.timeit(s, setup='length = 100', number=100)
print 'length 1000:', timeit.timeit(s, setup='length = 1000', number=100)
print 'length 10000:', timeit.timeit(s, setup='length = 10000', number=100)
print 'length 100000:', timeit.timeit(s, setup='length = 100000', number=100)
print 'length 400000:', timeit.timeit(s, setup='length = 400000', number=100)
Gives me this:
length 100: 0.0233492320197
length 1000: 0.00770079984919
length 10000: 0.0672924110913
length 100000: 0.480133018906
length 400000: 1.866792587
Which looks very reasonable, i.e. it's proportional. Four times the size takes four times as long.
Not sure where to go from here. I would guess that the HTML generator does a lot of backtracking when there are differences (although you would think that ndiff() had already handled that). Can I tell it to abort earlier, give up and mark the whole section as 'different'?
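To test that guess, I tried timing SequenceMatcher directly on long random strings -- as far as I can tell, the intra-line markup pass in make_table ends up running a character-level SequenceMatcher over changed lines, which should make this a reasonable proxy for the worst case (that internals claim is my assumption, not documented behaviour):

```python
import difflib
import os
import timeit

def time_char_diff(length):
    # Two random strings of the given length; random data defeats any
    # attempt to resynchronize, which should be close to the worst case.
    a = os.urandom(length)
    b = os.urandom(length)
    matcher = difflib.SequenceMatcher(None, a, b)
    # Time a single full character-level comparison.
    return timeit.timeit(matcher.get_opcodes, number=1)

for length in (250, 500, 1000, 2000):
    print('length %d: %.4f s' % (length, time_char_diff(length)))
```

On my machine this grows much faster than linearly with the line length, which would be consistent with the make_table numbers above.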
I understand that there are a lot of different algorithms for generating diffs. In this case I don't need a very deep analysis that tries to resynchronize everywhere; I just need it to tell me roughly at what position the files start to differ, and to terminate in a reasonable timeframe.
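For that much, a dirt-cheap pre-check (not using difflib at all) would already do -- something along these lines, which I could use to short-circuit the fancy report for pathological inputs:

```python
def first_difference(a_lines, b_lines):
    """Return (line_index, column) of the first mismatch, or None if equal."""
    for i in range(max(len(a_lines), len(b_lines))):
        # Treat a missing line as empty so unequal lengths count as a mismatch.
        la = a_lines[i] if i < len(a_lines) else ''
        lb = b_lines[i] if i < len(b_lines) else ''
        if la != lb:
            # Find the first differing column within the line.
            for col in range(min(len(la), len(lb))):
                if la[col] != lb[col]:
                    return (i, col)
            # One line is a prefix of the other.
            return (i, min(len(la), len(lb)))
    return None
```

This is a single linear pass, so it stays fast no matter how long the lines are.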
Alternatively, are there other HTML-generating Python diff libraries which do not have this worst-case problem?
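Short of switching libraries, the stopgap I'm currently leaning towards (my own hack, not a difflib feature) is to pre-split overlong lines into fixed-size chunks before calling make_table, so that no single intra-line comparison ever sees more than a few hundred characters. The reported line numbers then no longer match the originals, and a difference straddling a chunk boundary may show up as two changed chunks, but it bounds the worst case:

```python
import difflib

def chunk_lines(lines, width=200):
    # Break any line longer than `width` into `width`-sized pieces;
    # empty lines are kept as-is.
    out = []
    for line in lines:
        for i in range(0, max(len(line), 1), width):
            out.append(line[i:i + width])
    return out

def make_table_chunked(a_lines, b_lines, width=200):
    return difflib.HtmlDiff().make_table(chunk_lines(a_lines, width),
                                         chunk_lines(b_lines, width))
```

I haven't measured this exhaustively, but it should keep the per-line cost bounded regardless of input size.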
CPython issues related to this:
- http://bugs.python.org/issue6931: dreadful performance in difflib: ndiff and HtmlDiff
- http://bugs.python.org/issue11740: difflib html diff takes extremely long