I would like to determine the tab width used in source files indented with spaces. This is not hard for files with particularly regular indentation, where the leading spaces are only used for indentation, always in multiples of the tab width, and with indentation increasing one level at a time. But many files will depart somewhat from this sort of regular indentation, generally for some form of vertical alignment. I'm thus looking for a good heuristic to estimate the tab width that was used, allowing some possibility for irregular indentation.
The motivation for this is writing an extension for the SubEthaEdit editor. SubEthaEdit unfortunately doesn't make the tab width available for scripting, so I'm going to guess at it based on the text.
A suitable heuristic should:
- Perform well enough for interactive use. I don't imagine this will be a problem, and just a portion of the text can be used if need be.
- Be language independent.
- Return the longest suitable tab width. For example, any file with a tab width of four spaces could also be a file with two-space tabs, if every indentation was actually by twice as many levels. Clearly, four spaces would be the right choice.
- Always get it right if the indentation is completely regular.
Some simplifying factors:
- At least one line can be assumed to be indented.
- The tab width can be assumed to be at least two spaces.
- It's safe to assume that indentation is done with spaces only. It's not that I have anything against tabs---quite the contrary, I'll check first if there are any tabs used for indentation and handle it separately. This does mean that indentation mixing tabs and spaces might not be handled properly, but I don't consider it important.
- It may be assumed that there are no lines containing only whitespace.
- Not all languages need to be handled correctly. For example, success or failure with languages like Lisp and Go would be completely irrelevant, since they're not normally indented by hand.
- Perfection is not required. The world isn't going to end if a few lines occasionally need to be manually adjusted.
What approach would you take, and what do you see as its advantages and disadvantages?
If you want to provide working code in your answer, the best approach is probably to use a shell script that reads the source file from stdin and writes the tab width to stdout. Pseudocode or a clear description in words would be just fine, too.
Some Results
To test different strategies, we can apply them to files in the standard libraries of language distributions, as these presumably follow the standard indentation for the language. I'll consider the Python 2.7 and Ruby 1.8 libraries (system framework installs on Mac OS X 10.7), which have expected tab widths of 4 and 2, respectively. Excluded are those files which have lines beginning with tab characters or which have no lines beginning with at least two spaces.
Python:
                   Right  None  Wrong
Mode:               2523     1    102
First:              2169     1    456
No-long (12):       2529     9     88
No-long (8):        2535    16     75
LR (changes):       2509     1    116
LR (indent):        1533     1   1092
Doublecheck (10):   2480    15    130
Doublecheck (20):   2509    15    101
Ruby:
                   Right  None  Wrong
Mode:                594    29     51
First:               578     0     54
No-long (12):        595    29     50
No-long (8):         597    29     48
LR (changes):        585     0     47
LR (indent):         496     0    136
Doublecheck (10):    610     0     22
Doublecheck (20):    609     0     23
In these tables, "Right" should be taken as determination of the language-standard tab width, "Wrong" as a non-zero tab width not equal to the language-standard width, and "None" as a zero tab width or no answer.
- "Mode" is the strategy of selecting the most frequently occurring change in indentation.
- "First" is taking the indentation of the first indented line.
- "No-long" is FastAl's strategy of excluding lines with large indentation changes and taking the mode, with the number indicating the maximum allowed indent change.
- "LR" is Patrick87's strategy based on linear regression, with variants based on the change in indentation between lines and on the absolute indentation of lines.
- "Doublecheck" (couldn't resist the pun!) is Mark's modification of FastAl's strategy, restricting the possible tab width and checking whether half the modal value also occurs frequently, with two different thresholds for selecting the smaller width.
Okay, since you want a language-agnostic solution, we won't be able to use any syntactic hints. Although you said you don't want a perfect solution, here is one that works very well with most languages.
I actually had to solve a similar problem in cryptography to determine the code word length in a polyalphabetic cipher. This kind of encryption is a basic Caesar cipher (each letter of the alphabet is shifted by n letters), where the code word is used to shift the letters differently (the nth letter of the clear text is shifted by the mod(n, length(codeword))th letter of the code word). The weapon of choice is autocorrelation.
The algorithm would be like this:
- Strip all characters after the leading whitespace of each line - leave the line-end markers intact.
- Remove lines with zero leading whitespace (as they are unindented).
- Count the whitespace width of each line and save it in an array, lengths.
- Autocorrelation: loop up to an estimated maximum lag - which may be fairly high, like 32 - calling the current iteration i. For each i, compare each entry with the entry i positions later and count the number of zero distances (the same value at the nth and (n+i)th entries); save that count in an array under key i.
- You now have an array of same-pair occurrences. Calculate the mean of this array, and discard all values near that mean (leaving only the spikes of the autocorrelation). The spikes will be at multiples of the lowest one, which is the number of spaces used for indentation.
Autocorrelation is a very nice tool, usable in every situation where you want to detect repeating values in a stream of data. It is heavily used in signal processing and is very fast (the cost depends on the estimated maximum distance between signal repeats).
And yes, back then I cracked the polyalphabetic ciphertext with autocorrelation. ;)
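The steps above can be sketched in Python as a rough, literal transcription. The function name, the `max_lag` default, and the spike threshold (anything above the mean) are my assumptions; how well the smallest spike matches the actual tab width depends on how regular the file is:

```python
def autocorrelation_width(text, max_lag=32):
    # keep only the leading-space width of each non-blank line
    lengths = [len(l) - len(l.lstrip(' '))
               for l in text.splitlines() if l.strip()]
    lengths = [n for n in lengths if n > 0]  # drop unindented lines
    # autocorrelation: for each lag i, count how often the indentation
    # width repeats exactly i (indented) lines later
    counts = {i: sum(1 for n in range(len(lengths) - i)
                     if lengths[n] == lengths[n + i])
              for i in range(1, max_lag)}
    mean = sum(counts.values()) / len(counts)
    # the spikes are the lags whose match count stands out above the mean;
    # per the recipe above, the smallest spike is the answer
    spikes = sorted(i for i, c in counts.items() if c > mean)
    return spikes[0] if spikes else None
```

On a file whose indentation sequence repeats, the smallest spike gives the period of that repetition, which is what the recipe reads the indentation unit off of.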
- For each line in the file
- If it is indented more than the previous line, add the difference to a list
- discard differences > 12; those are probably line continuations
- Generate a frequency table of the numbers in the list
- The most frequent value is likely your answer.
edit
I have VB.Net open (didn't you? :-) Here's what I mean:
Sub Main()
    Dim lines = IO.File.ReadAllLines("ProveGodExists.c")
    Dim previndent As Integer = 0
    Dim indent As Integer
    Dim diff As Integer
    Dim Diffs As New Dictionary(Of Integer, Integer)
    For Each line In lines
        previndent = indent
        indent = Len(line) - Len(LTrim(line))
        diff = indent - previndent
        If diff > 0 And diff < 13 Then
            If Diffs.ContainsKey(diff) Then
                Diffs(diff) += 1
            Else
                Diffs.Add(diff, 1)
            End If
        End If
    Next
    Dim freqtbl = From p In Diffs Order By p.Value Descending
    Console.WriteLine("Dump of frequency table:")
    For Each item In freqtbl
        Console.WriteLine(item.Key.ToString & " " & item.Value.ToString)
    Next
    Console.WriteLine("My wild guess at tab setting: " & freqtbl.First().Key.ToString)
    Console.ReadLine()
End Sub
Results:
Dump of frequency table:
4 748
8 22
12 12
2 2
9 2
3 1
6 1
My wild guess at tab setting: 4
Hope that helps.
Your choices are (realistically) 2,3,4,5,6,7,8.
I'd scan the first 50-100 lines or so using something like what @FastAl suggested. I'd probably lean toward just blindly pulling the space count from the front of any row with text and counting the length of the whitespace string. Left-trimming lines and running length twice seems like a waste if you have regex available. Also, I'd use System.Math.Abs(indent - previndent) so you get de-indent data. The regex would be this:
row.matches('^( +)[^ ]') # grab all the spaces from line start to non-space.
Once you've got a statistic for which of the 7 options has the highest count, run with it as the first guess. For 8, 6, and 4 you should check to see if there is also a significant count (2nd place or over 10% or some other cheapo heuristic) for 4 and 2, 3, or 2. If there are a lot of 12s (or 9s) that might hint that 4 (or 3) is a better choice than 8 (or 6) as well. Dropping or adding more than 2 levels at a time (usually collapsed ending brackets) is super rare.
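A rough Python sketch of this heuristic (the helper name is mine; the 2-8 range, the absolute-difference idea, and the 10% double-check threshold come from the paragraphs above):

```python
import re
from collections import Counter

def guess_tab_width(lines, threshold=0.10):
    # collect |indent change| between consecutive lines, so that
    # de-indents count as evidence too
    changes = []
    prev = 0
    for line in lines:
        # grab all the spaces from line start to the first non-space
        m = re.match(r'^( +)[^ ]', line)
        indent = len(m.group(1)) if m else 0
        changes.append(abs(indent - prev))
        prev = indent
    # realistic tab widths only: 2 through 8
    counts = Counter(c for c in changes if 2 <= c <= 8)
    if not counts:
        return None
    best, best_n = counts.most_common(1)[0]
    # double-check: if half the modal width also has a significant
    # count, the smaller width is probably the real one
    if best in (4, 6, 8) and counts.get(best // 2, 0) > threshold * best_n:
        best //= 2
    return best
```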
Irrelevant mumbling
The one problem I see is that old .c code in particular has this nasty pattern going on in it:
code level 0
/* Fancy comments get weird spacing because there
 * is an extra space beyond the *
 * looks like one space!
 */
  code indent (2 spaces)
  /* Fancy comments get weird spacing because there
   * is an extra space beyond the *
   * looks like three spaces!
   */
code level 0
  code indent (2 spaces)
  /* comment at indent level 1
     With no stars you wind up with 2 spaces + 3 spaces.
  */
Yuck. I don't know how you deal with comment standards like that. For code that is "C"-like you might have to treat comments specially in version 2.0... but I would just ignore them for now.
Your final issue is dealing with lines that don't match your assumptions. My suggestion would be to "tab" them to depth and then leave the extra spaces in place. If you have to correct, I'd do this: rowtabdepth = ceiling((rowspacecount - (tabwidth/2)) / tabwidth)
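That correction formula can be written as a small helper (a sketch; the function name is mine):

```python
import math

def row_tab_depth(row_space_count, tab_width):
    # snap an irregular space count to a whole tab depth, rounding
    # indents past the half-tab mark up to the next level
    return math.ceil((row_space_count - tab_width / 2) / tab_width)
```

With a tab width of 4, a 5-space indent maps to depth 1 and a 9-space indent to depth 2.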
Maybe do something like...
- get a list of all the tab widths in the file
- remove 50% of the entries which are least frequent
- sort the remaining entries in ascending order
- compute a list of (a, b) pairs where b's are in the list of tab widths and the a's give the rank of that tab width.
- plot a best-fit line
- the slope of the best-fit line is the guess for the tab width. round to the nearest integer.
Example:
- list = [4, 4, 6, 8, 8, 4, 4, 4, 8, 8, 12, 5, 11, 13, 12, 12]
- list = [4, 4, 4, 4, 4, 8, 8, 8]
- already sorted
- [(1, 4), (1, 4), (1, 4), (1, 4), (1, 4), (2, 8), (2, 8), (2, 8)]
- the best fit line is b = 4a + 0 (a perfect fit, with zero residual error)
- slope is 4, so this is probably the tab width.
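A sketch of these steps in Python (the helper name is mine; I keep whole frequency groups when trimming to half, and special-case a single surviving width to avoid a zero-variance fit):

```python
from collections import Counter

def lr_tab_width(widths):
    # 1) drop the least frequent values until about half the entries remain
    counts = Counter(widths)
    kept = []
    for value, n in counts.most_common():
        if len(kept) >= len(widths) // 2:
            break
        kept.extend([value] * n)
    kept.sort()
    if len(set(kept)) == 1:  # completely regular indentation
        return kept[0]
    # 2) rank the distinct widths: smallest -> 1, next -> 2, ...
    rank = {v: i + 1 for i, v in enumerate(sorted(set(kept)))}
    pairs = [(rank[v], v) for v in kept]
    # 3) slope of the least-squares best-fit line of width against rank
    n = len(pairs)
    ma = sum(a for a, _ in pairs) / n
    mb = sum(b for _, b in pairs) / n
    slope = (sum((a - ma) * (b - mb) for a, b in pairs) /
             sum((a - ma) ** 2 for a, _ in pairs))
    return int(round(slope))
```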
For each language you want to support, you'll need to do a bit of parsing:
1) exclude comments (either line-wise or block-wise, maybe also nested?)
2) find the openings of sub-blocks ({ in C-like languages, begin in Pascal, do in shell, etc.)
Then just see how much the number of spaces increases after a sub-block has been opened. Compute some simple statistics - the most frequent value, the maximum, minimum and average values. This way you can also see whether the indentation is regular or not, and by how much.
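As a sketch for C-like languages only (the helper name is mine; it treats a trailing { as the block opener and does no comment stripping):

```python
from collections import Counter

def block_indent_stats(lines):
    # measure the indent increase on the line following each block opener
    def indent(s):
        return len(s) - len(s.lstrip(' '))
    increases = []
    for prev, cur in zip(lines, lines[1:]):
        if prev.rstrip().endswith('{') and cur.strip():
            d = indent(cur) - indent(prev)
            if d > 0:
                increases.append(d)
    if not increases:
        return None
    # simple statistics: mode, min, max, mean
    mode = Counter(increases).most_common(1)[0][0]
    return {'mode': mode, 'min': min(increases),
            'max': max(increases),
            'mean': sum(increases) / len(increases)}
```

If min, max and mode all agree, the indentation is regular and the mode is the tab width.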
As a baseline, one could simply calculate all indentation increases, and take the most frequent increase as the tab width. As a shell script, written to have small actions per pipeline stage, it could look like this:
#!/bin/sh
grep -v -E '^[[:space:]]*$' |
sed 's/^\([[:space:]]*\).*/\1/' |
awk '{ print length($0) }' |
awk '$1 > prev { print $1 - prev } { prev = $1 }' |
sort |
uniq -c |
sort -k1nr |
awk '{ print $2 }' |
head -n 1
This implementation is O(n log(n)), where n is the number of lines in the file, but it could readily be done in O(n).
Heuristic:
- Get a list of all indention changes from a line to its next line which are > 0.
- Make a frequency table of all values in this list.
- Take the value with highest frequency.
Python script, takes filenames or stdin and prints best indent number:
#!/usr/bin/env python
import fileinput, collections

def leadingSpaceLen(line):
    return len(line) - len(line.lstrip())

def indentChange(line1, line2):
    return leadingSpaceLen(line2) - leadingSpaceLen(line1)

def indentChanges(lines):
    return [indentChange(line1, line2)
            for line1, line2 in zip(lines[:-1], lines[1:])]

def bestIndent(lines):
    f = collections.defaultdict(lambda: 0)
    for change in indentChanges(lines):
        if change > 0:
            f[change] += 1
    return max(f.items(), key=lambda x: x[1])[0]

if __name__ == '__main__':
    print(bestIndent(tuple(fileinput.input())))