linux + verify if file is text or binary

How can I verify whether a file is binary or text without opening the file?

Schrödinger's cat, I'm afraid.

There is no way to determine the contents of a file without opening it. The filesystem stores no metadata relating to the contents.
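To illustrate the metadata point, here is a minimal Python sketch. The helper name `metadata_only` and the sample file names are made up for the example. It calls os.stat(), which does not open the file, on two files of the same size whose contents differ, and shows that the fields it reports here cannot tell them apart:

```python
import os
import stat
import tempfile

def metadata_only(path):
    """Return a few fields that stat(2) reports; the file is never opened here."""
    st = os.stat(path)
    return {
        "is_regular_file": stat.S_ISREG(st.st_mode),
        "permissions": oct(stat.S_IMODE(st.st_mode)),
        "size_bytes": st.st_size,
    }

# Hypothetical sample files: one holds text, the other arbitrary bytes.
# (Creating them opens them, of course; only the inspection avoids open().)
workdir = tempfile.mkdtemp()
text_path = os.path.join(workdir, "a.txt")
binary_path = os.path.join(workdir, "b.txt")
with open(text_path, "wb") as f:
    f.write(b"hello\n")
with open(binary_path, "wb") as f:
    f.write(b"\x00\x01\x02\xff\xfe\x00")

text_meta = metadata_only(text_path)
binary_meta = metadata_only(binary_path)
print(text_meta)
print(binary_meta)
```

Both dictionaries describe a regular file of six bytes with the same permissions. Nothing in them says what the bytes are.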

If not opening the file is not a hard requirement, then there are a number of solutions available to you.

Edit:

It has been suggested in a number of comments and answers that file(1) is a good way of determining the contents. Indeed it is. However, file(1) opens the file, which was prohibited in the question. See the penultimate line in the following example:

> echo 'This is not a pipe' > file.jpg && strace file file.jpg 2>&1 | grep file.jpg
execve("/usr/bin/file", ["file", "file.jpg"], [/* 56 vars */]) = 0
lstat64("file.jpg", {st_mode=S_IFREG|0644, st_size=19, ...}) = 0
stat64("file.jpg", {st_mode=S_IFREG|0644, st_size=19, ...}) = 0
open("file.jpg", O_RDONLY|O_LARGEFILE)  = 3
write(1, "file.jpg: ASCII text\n", 21file.jpg: ASCII text

The correct way to determine the type of a file is to use the file(1) command.

You also need to be aware that UTF-8 encoded files are "text" files, but may contain non-ASCII data. Other encodings also have this issue. In the case of text encoded with a code page, it may not be possible to unambiguously determine if a file is text or not.
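A small Python sketch of why this is ambiguous. The sample bytes and helper names are invented for the example. ISO-8859-1 assigns a character to every byte value, so decoding with it never fails, even for bytes that were never meant to be text:

```python
def is_pure_ascii(data):
    """True if every byte is in the 7-bit ASCII range."""
    return all(byte < 0x80 for byte in data)

def is_valid_utf8(data):
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Genuine text, but not ASCII: the accented letters take two bytes each.
utf8_text = "naïve café".encode("utf-8")

# Hypothetical binary-looking bytes (not taken from any real file).
blob = bytes([0x89, 0xE9, 0xFF, 0xD8, 0xC0, 0xB5])

# Decoding as ISO-8859-1 succeeds for any input, so a successful
# decode proves nothing about whether the bytes were meant as text.
as_latin1 = blob.decode("iso-8859-1")

print(is_pure_ascii(utf8_text), is_valid_utf8(utf8_text))
print(is_valid_utf8(blob), len(as_latin1))
```

UTF-8 has enough internal structure that random bytes rarely pass as valid, which is why it can be detected with some confidence. A single-byte code page has no such structure.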

The file(1) command looks at the structure of a file to try to determine what it contains. From the file(1) man page:

The type printed will usually contain one of the words text (the file contains only printing characters and a few common control characters and is probably safe to read on an ASCII terminal), executable (the file contains the result of compiling a program in a form understandable to some UNIX kernel or another), or data meaning anything else (data is usually ‘binary’ or non-printable).

With regard to different character encodings, the file(1) man page has this to say:

If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are identified as ‘text’ because they will be mostly readable on nearly any terminal; UTF-16 and EBCDIC are only ‘character data’ because, while they contain text, it is text that will require translation before it can be read.

So, some text will be identified as text, but some may be identified as character data. You will need to determine yourself if this matters to your application and take appropriate action.
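If you want to make that decision in your own code, a rough sketch in Python might look like the following. This is a deliberately simplified heuristic, not the algorithm file(1) uses. The names `classify_bytes` and `classify_file`, the 8 KiB sample size, and the set of allowed control characters are all assumptions made for the example. Note that it does open and read the file:

```python
import codecs

SAMPLE_SIZE = 8192  # assumption: only the first 8 KiB is inspected

# Control characters treated as normal in text: tab, LF, form feed, CR.
TEXT_CONTROLS = {0x09, 0x0A, 0x0C, 0x0D}

def classify_bytes(sample):
    """Return 'empty', 'ascii text', 'utf-8 text', or 'data' for a byte sample."""
    if not sample:
        return "empty"
    for byte in sample:
        # NUL, DEL and other control characters suggest non-text content.
        if byte == 0x7F or (byte < 0x20 and byte not in TEXT_CONTROLS):
            return "data"
    if all(byte < 0x80 for byte in sample):
        return "ascii text"
    try:
        # final=False tolerates a multi-byte sequence cut off by the sample limit.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return "data"
    return "utf-8 text"

def classify_file(path):
    """Read the start of the file and classify it."""
    with open(path, "rb") as f:
        return classify_bytes(f.read(SAMPLE_SIZE))

print(classify_bytes(b"This is not a pipe\n"))
print(classify_bytes("café\n".encode("utf-8")))
print(classify_bytes(b"\x89PNG\r\n\x1a\n"))
```

Unlike file(1), this sketch reports 8-bit code pages and UTF-16 as "data", because it only recognises ASCII and UTF-8. Whether that is acceptable depends on what your application does with the answer.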

There is no way of being certain without looking inside the file. However, you don't have to open it in an editor and look for yourself to get a clue. You may want to look into the file command: http://linux.die.net/man/1/file

If you are attempting to do this from a command shell then the file command will take a guess at what filetype it is. If it is text then it will generally include the word text in its description.
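If you want to do the same check from a script, here is a hedged Python sketch. The function names `describe` and `looks_like_text` are made up for the example, and it assumes the file command is installed and on PATH. It runs file with --brief, which omits the file name from the output, and looks for the word "text" in the description:

```python
import re
import subprocess

def describe(path):
    """Return file(1)'s description of path, for example 'ASCII text'."""
    result = subprocess.run(
        ["file", "--brief", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

def looks_like_text(description):
    """True if the description contains the standalone word 'text'."""
    return re.search(r"\btext\b", description) is not None

# Usage (requires the file command):
#   looks_like_text(describe("file.jpg"))
print(looks_like_text("ASCII text"))
print(looks_like_text("JPEG image data, JFIF standard 1.01"))
```

Matching on the description is only as reliable as the guess file makes, so treat the result as a hint rather than a guarantee.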

I am not aware of any 100% reliable method of determining this, but the file command is probably the most accurate.

In Unix, a file is just some bytes. So, without opening the file, you cannot be 100% sure whether it is ASCII or binary.

You can just use the tools available to you and dig deeper to make it foolproof.