Detecting if a file is binary or plain text?

Developer https://www.devze.com 2023-01-01 20:58 Source: Internet

How can I detect whether a file is binary or plain text?

Basically, my .NET app is processing batch files and extracting data; however, I don't want to process binary files.

As a solution, I'm thinking about analysing the first X bytes of the file; if there are more unprintable characters than printable ones, the file should be treated as binary.

Is this the right way to do it? Is there any better implementation for this task?


What exactly do you mean by binary? Is the 'Art of War' written in Chinese binary to you? What about a Japanese-English dictionary?

There is no way to do this with 100% reliability.

You would need to use some kind of heuristic.

Some options might be to look at:

  • Byte Order Mark
  • File Signatures (AKA magic numbers)
  • File Extensions

If the above (especially file signatures and extensions) don't help, then try to guess based on the presence or absence of certain bytes (like you are doing).

Note: It is better to check extensions and signatures first, since that only requires reading file metadata or a few bytes, which is far cheaper than reading the whole file.
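
The byte order mark and signature checks above can be sketched as follows. This is only an illustration, written in Python for brevity (the same logic ports directly to .NET). The names classify_header and classify_file are hypothetical, and the signature table is a small, deliberately incomplete sample; a short signature such as "MZ" can also appear at the start of a genuine text file, so treat a match as a hint rather than proof.

```python
import codecs

# Byte order marks. The 4-byte UTF-32 marks are listed first because the
# UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# A few well-known file signatures (illustrative sample, not a full list).
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive"),
    (b"\x1f\x8b", "gzip stream"),
    (b"MZ", "DOS/Windows executable"),
    (b"\x7fELF", "ELF executable"),
]


def classify_header(header: bytes):
    """Return ("text", encoding), ("binary", kind) or ("unknown", None)."""
    for bom, encoding in BOMS:
        if header.startswith(bom):
            return ("text", encoding)
    for magic, kind in MAGIC_NUMBERS:
        if header.startswith(magic):
            return ("binary", kind)
    # Neither a BOM nor a known signature: fall back to a byte heuristic.
    return ("unknown", None)


def classify_file(path: str):
    """Classify a file by reading only its first 16 bytes."""
    with open(path, "rb") as f:
        return classify_header(f.read(16))
```

An "unknown" result is the case where guessing from the byte content becomes necessary.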


The Unix file command does this in a clever way. Of course, it does a lot more, but you can check the algorithm here and then build something specialized.


UPDATE: The link above seems to be broken. Try this.


You could regex the first X number of bytes, and give a valid match if all bytes are in a proper character class. But that might presuppose that you know the encoding.
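
A minimal sketch of the regex idea, in Python, assuming the files are plain ASCII. The character class (tab, line feed, carriage return and printable ASCII) and the names looks_like_ascii_text and read_head are assumptions made for illustration. As noted above, any other encoding would need a different class.

```python
import re

# Assumed character class: tab, LF, CR and printable ASCII (0x20-0x7E).
ASCII_TEXT = re.compile(rb"[\t\n\r\x20-\x7e]*")


def looks_like_ascii_text(chunk: bytes) -> bool:
    """True only if every byte of the chunk is in the character class."""
    return ASCII_TEXT.fullmatch(chunk) is not None


def read_head(path: str, size: int = 512) -> bytes:
    """Read at most the first `size` bytes of a file."""
    with open(path, "rb") as f:
        return f.read(size)
```

Usage would be looks_like_ascii_text(read_head(path)). Note that UTF-8 text containing non-ASCII characters fails this check, which is the encoding caveat in practice.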


I think the best way of doing this is to take at most the first X bytes from the file (X could be 256, 512, etc.) and count the number of characters that are not used by ASCII files (the permitted ASCII codes are 10, 13, and 32-126). If you know for sure that the script is written in English, then no character can be outside of that set. If you are not sure about the language, then you may permit at most Y characters to be outside of the set (if X is 512, I would choose Y to be 8 or 10).
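
The counting approach might look like the sketch below (Python, with hypothetical function names). It uses exactly the set quoted above; note that this set does not include the tab character (code 9), so files that use tabs would need it added or a slightly larger Y.

```python
# Permitted codes from the text above: 10 (LF), 13 (CR) and 32-126.
ALLOWED = frozenset({10, 13}) | frozenset(range(32, 127))


def count_outside_ascii(chunk: bytes) -> int:
    """Count the bytes in the chunk that fall outside the permitted set."""
    return sum(1 for b in chunk if b not in ALLOWED)


def is_probably_text(chunk: bytes, max_outside: int = 0) -> bool:
    """Treat the chunk as text if at most `max_outside` bytes (Y) are outside the set."""
    return count_outside_ascii(chunk) <= max_outside


def file_is_probably_text(path: str, size: int = 512, max_outside: int = 0) -> bool:
    """Apply the check to at most the first `size` bytes (X) of a file."""
    with open(path, "rb") as f:
        return is_probably_text(f.read(size), max_outside)
```

With the values suggested above, the call would be file_is_probably_text(path, size=512, max_outside=8).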

If this is not good enough, you may use more constraints. For example, depending on the syntax of the files, certain keywords should be present (e.g. for your batch files, there should be some echo, for, if, goto, call, exit, etc.).
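
As a rough sketch of the keyword constraint (Python; the function name and the exact keyword list are illustrative assumptions), one could search the first bytes for whole-word, case-insensitive matches:

```python
import re

# Keywords taken from the examples above; extend as needed.
BATCH_KEYWORDS = ("echo", "for", "if", "goto", "call", "exit")

# Whole-word, case-insensitive match on raw bytes.
KEYWORD_RE = re.compile(
    rb"\b(?:" + b"|".join(k.encode("ascii") for k in BATCH_KEYWORDS) + rb")\b",
    re.IGNORECASE,
)


def has_batch_keyword(chunk: bytes) -> bool:
    """True if the chunk contains at least one batch keyword as a whole word."""
    return KEYWORD_RE.search(chunk) is not None
```

Words such as "if" and "for" are common in ordinary English text, so this check is best combined with the byte-counting test rather than used alone.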
