Worst-case behaviour of Python's HtmlDiff.make_table()_问答_开发者

Worst-case behaviour of Python's HtmlDiff.make_table()

开发者 https://www.devze.com 2023-03-27 06:47 出处：网络

I\'m using Python 2.7\'s difflib.HtmlDiff.make_table() function to generate diffs between expected and actual files for an internal test case runner. They end up in an HTML test report.

I'm using Python 2.7's difflib.HtmlDiff.make_table() function to generate diffs between expected and actual files for an internal test case runner. They end up in an HTML test report.

This has worked fine so far -- until I added a test case with bigger files (~400 KiB), lots of differences, often containing no line breaks. Almost all of my test cases are executed in less than 2s, with a few more complex ones going as high as 4s. This new one is just as fast when passing, but takes 13 minutes (!) to fail. All of that time is spent generating the report. I hope you can see how that's a problem.

An attempt to demonstrate this (probably not the best way, I know):

s = """import os, difflib
a = [os.urandom(length)]
b = [os.urandom(length)]
difflib.HtmlDiff().make_table(a, b)"""

import timeit
print 'length    100:', timeit.timeit(s, setup='length = 100', number=1)
print 'length   1000:', timeit.timeit(s, setup='length = 1000', number=1)
print 'length  10000:', timeit.timeit(s, setup='length = 10000', number=1)
print 'length 100000:', timeit.timeit(s, setup='length = 100000', number=1)
print 'length 400000:', timeit.timeit(s, setup='length = 400000', number=1)

And the results:

length    100: 0.022672659081
length   1000: 0.0125987213238
length  10000: 0.479898318086
length 100000: 54.9947423284
length 400000: 1451.59828412

difflib.ndiff() (which is used by make_table() internally, as far as I understand it) does not seem to have this problem:

s = """import os, difflib
a = [os.urandom(length)]
b = [os.urandom(length)]
difflib.ndiff(a, b)"""

import timeit
print 'length    100:', timeit.timeit(s, setup='length = 100', number=100)
print 'length   1000:', timeit.timeit(s, setup='length = 1000', number=100)
print 'length  10000:', timeit.timeit(s, setup='length = 10000', number=100)
print 'length 100000:', timeit.timeit(s, setup='length = 100000', number=100)
print 'length 400000:', timeit.timeit(s, setup='length = 400000', number=100)

Gives me this:

length    100: 0.0233492320197
length   1000: 0.00770079984919
length  10000: 0.0672924110913
length 100000: 0.480133018906
length 400000: 1.866792587

Which looks very reasonable, i.e. it's proportional. Four times the size takes four times as long.

Not sure where to go from here. I would guess that the HTML generator does a lot of backtracking when there are differences (although you would think that ndiff() had already handled that). Can I tell it to abort earlier, give up and mark the whole section as 'different'?

I understand that there are a lot of different algorithms for generating diffs. In this case, I don't need it to do a very deep analysis and try to resynchronize everywhere. I just need it to tell me roughly from what position on the file is different and then terminate in a reasonable timeframe.

Alternatively, are there other HTML-generati开发者_运维百科ng Python diff libraries which do not have this worst-case problem?

CPython issues related to this:

http://bugs.python.org/issue6931: dreadful performance in difflib: ndiff and HtmlDiff
http://bugs.python.org/issue11740: difflib html diff takes extremely long