Looking for articles, documentation, or first-hand knowledge of how different source control systems detect whether a file is binary or text. Of particular interest is how Git does it vs. Mercurial.
Do they look at: File extensions? File signatures or content (i.e. is this file UTF-8)? A mix of things?
SVN:
When you first add or import a file into Subversion, the file is examined to determine if it is a binary file. Currently, Subversion just looks at the first 1024 bytes of the file; if any of the bytes are zero, or if more than 15% are not ASCII printing characters, then Subversion calls the file binary. This heuristic might be improved in the future, however.
http://subversion.apache.org/faq.html#binary-files
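The rule quoted from the FAQ can be sketched roughly as follows. This is not Subversion's code, just an illustration of the description. The function names are made up, and what counts as an "ASCII printing character" is my assumption (0x20..0x7E plus the usual whitespace controls); the real implementation may draw the line differently.

```c
#include <assert.h>
#include <stddef.h>

#define SVN_SAMPLE_BYTES 1024 /* sample size quoted in the FAQ */

/* Assumption: "ASCII printing" means 0x20..0x7E plus the whitespace
 * controls tab, LF, VT, FF and CR. Subversion's own table may differ. */
static int is_ascii_text_byte(unsigned char c)
{
    return (c >= 0x20 && c <= 0x7E) || (c >= 0x09 && c <= 0x0D);
}

/* Hypothetical helper: returns 1 if the sample looks binary under the
 * FAQ's description, 0 otherwise. */
int looks_binary_svn_style(const unsigned char *buf, size_t size)
{
    size_t i, non_text = 0;

    assert(buf != NULL || size == 0);
    if (size > SVN_SAMPLE_BYTES)
        size = SVN_SAMPLE_BYTES; /* only the first 1024 bytes are examined */
    if (size == 0)
        return 0; /* assumption: an empty file is treated as text */

    for (i = 0; i < size; i++) {
        if (buf[i] == 0)
            return 1; /* any zero byte means binary */
        if (!is_ascii_text_byte(buf[i]))
            non_text++;
    }
    /* more than 15% non-text bytes means binary (integer math, no floats) */
    return non_text * 100 > size * 15;
}
```

Note that under this rule a file can be classified as binary without containing a single zero byte, which is where it differs from the NUL-only checks discussed in the other answers.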
Git works in a similar way. Git usually guesses correctly whether a blob contains text or binary data by examining the beginning of the contents: it checks for any occurrence of a zero byte (NUL “character”) in the first 8000 bytes.
http://git-scm.com/docs/gitattributes
And from Git source:
    #define FIRST_FEW_BYTES 8000
    int buffer_is_binary(const char *ptr, unsigned long size)
    {
        if (FIRST_FEW_BYTES < size)
            size = FIRST_FEW_BYTES;
        return !!memchr(ptr, 0, size);
    }
http://git.kernel.org/?p=git/git.git;a=blob;f=xdiff-interface.c;h=0e2c169227ad29b5bf546c6c1b97e1a1d8ed7409;hb=HEAD
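One consequence of the 8000-byte cap is that a NUL byte further into the file goes unnoticed. Here is a small sketch to show that. `buffer_is_binary` is copied from the snippet quoted above; `classify_with_nul_at` is a hypothetical helper I added purely for the demonstration and is not part of Git.

```c
#include <assert.h>
#include <string.h>

#define FIRST_FEW_BYTES 8000

/* Same check as the Git snippet quoted above. */
int buffer_is_binary(const char *ptr, unsigned long size)
{
    if (FIRST_FEW_BYTES < size)
        size = FIRST_FEW_BYTES;
    return !!memchr(ptr, 0, size);
}

/* Hypothetical helper (not in Git): fills a buffer with 'len' bytes of 'x',
 * places one NUL at offset 'nul_at', and returns the classification. */
int classify_with_nul_at(unsigned long len, unsigned long nul_at)
{
    static char buf[16384];

    assert(len <= sizeof buf && nul_at < len);
    memset(buf, 'x', len);
    buf[nul_at] = 0;
    return buffer_is_binary(buf, len);
}
```

With a 10000-byte buffer, a NUL at offset 7999 is still inside the examined window and the buffer is reported as binary, while a NUL at offset 8000 or later is never looked at and the buffer is reported as text.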
And @tonfa makes a good point that "Also note that the only place where it cares about a file being text vs. binary is for displaying diff, and for doing merges. The storage format does not care about it."
Mercurial looks for any occurrence of the null character (\0) in the content of the file. If there is one, the file is considered binary. Otherwise it is considered text, unless explicitly specified otherwise.
I guess Git uses the same approach.
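As a rough sketch of the rule as described in this answer (this is not Mercurial's actual code, which is written in Python; the function name is made up), the check scans the whole content rather than a fixed-size prefix:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if 'data' contains a NUL byte anywhere
 * in its 'size' bytes, 0 otherwise. */
int content_is_binary(const char *data, size_t size)
{
    assert(data != NULL || size == 0);
    if (size == 0)
        return 0; /* no bytes, so no NUL: treated as text */
    return memchr(data, 0, size) != NULL;
}
```

Compared with Git's version, the only difference in this sketch is the missing 8000-byte cap, so a NUL near the end of a large file would still flip the result to binary.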