Dumping text files

开发者 https://www.devze.com 2023-01-20 05:01 Source: the web
I'm writing a shell script that will create a textual (i.e. diffable) dump of an archive.

I'd like to detect whether or not each file is printable in some given character set, and if it is printable, I'd like to convert to that character set from whatever one it's in, if this is possible, and make its contents part of the dump.

I've considered using the file utility, but there doesn't seem to be any way to tell it to print only the character encoding, or just "data" when the file isn't text. For example:

$ file -e soft -e tokens -e tar -e apptype -e cdf -e compress -e elf -e tar config.sub
config.sub: Lisp/Scheme program text

config.sub is one of the files distributed with the file source code.

I'm also a bit wary of parsing its rather unpredictable output.

I'd like to keep dependencies for this script to a minimum. I'm already using perl, but would prefer not to have to rely on any perl packages. Presumably iconv would be the best way to do the conversion, and I don't mind making this a dependency.

On the other hand, maybe such a utility as my nascent script is already readily available?

Update: I ended up writing this in Python instead. It can be found in its GitHub repo or on PyPI. The current version doesn't actually do what I described in this question: that turned out to be too time-consuming and not necessary enough to implement.

It might make its way into a later revision, though; if so, I will likely end up using some combination of quick scanning for binary detection (as mentioned in one of the comment threads) and use of the chardet module, as mentioned by Zack. Another option might be to use the Python wrapper for the file C utility, though I'm not sure how portable this is.
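
For reference, the "quick scanning for binary detection" idea could look roughly like the sketch below. This is not the code from the tool mentioned above; the function name `looks_binary`, the 8192-byte sample size and the 30% threshold are all assumptions picked for illustration. It treats a NUL byte, or a high share of control bytes, as a sign of binary data.

```python
# Bytes that commonly occur in text files: a few control characters
# (BEL, BS, TAB, LF, FF, CR, ESC) plus everything from space upward except DEL.
# The exact set is an assumption, not a standard.
TEXT_BYTES = bytes({7, 8, 9, 10, 12, 13, 27} | (set(range(0x20, 0x100)) - {0x7F}))


def looks_binary(data, sample_size=8192, threshold=0.30):
    """Heuristically guess whether a byte string is binary data.

    Only the first `sample_size` bytes are inspected.
    """
    sample = data[:sample_size]
    if not sample:
        return False  # an empty file is treated as text
    if b"\x00" in sample:
        return True  # NUL bytes are rare in most text encodings
    # Delete every text-like byte; whatever remains is "suspicious".
    suspicious = sample.translate(None, TEXT_BYTES)
    return len(suspicious) / len(sample) > threshold
```

One known weakness: UTF-16 and UTF-32 text contains NUL bytes, so this check would classify it as binary unless a byte-order mark is handled separately.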


Have you tried the mime options which give more consistent output?

file --mime-encoding --mime-type -b somefile
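
If the script ends up in Python anyway, the output of that command is easy to split into its two parts. A minimal sketch, assuming the output has the usual `type/subtype; charset=name` shape and that `file` reports `charset=binary` for non-text data; the helper names (`parse_file_mime`, `describe`, `is_text`) are made up for this example:

```python
import subprocess


def parse_file_mime(output):
    """Split 'text/plain; charset=us-ascii' into ('text/plain', 'us-ascii')."""
    mime_type, _, rest = output.strip().partition(";")
    _, _, charset = rest.partition("charset=")
    return mime_type.strip(), charset.strip()


def describe(path):
    """Run file(1) on `path` and return (mime_type, charset)."""
    result = subprocess.run(
        ["file", "--mime-encoding", "--mime-type", "-b", "--", path],
        capture_output=True, text=True, check=True,
    )
    return parse_file_mime(result.stdout)


def is_text(charset):
    """Assumption: anything other than 'binary' (or a missing charset) is text."""
    return charset not in ("binary", "")
```

The charset name returned by `describe` could then be passed to iconv as the source encoding, although the names `file` prints are not guaranteed to match the ones iconv accepts.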


The Universal Encoding Detector does a pretty damn good job of this -- it's not possible to do it perfectly, alas. And it requires Python.
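
If an extra dependency is not wanted, a much cruder stand-in is to try a short list of candidate encodings and keep the first one that decodes to printable text. To be clear, this is not what the Universal Encoding Detector does (it is not used here at all); the function name `to_charset` and the candidate list are assumptions for illustration. It also covers the conversion step from the question, returning the bytes re-encoded in the target character set, or `None` if the data is not printable in it.

```python
def to_charset(data, target="utf-8", candidates=("utf-8", "cp1252", "latin-1")):
    """Return `data` re-encoded in `target`, or None if it isn't printable text.

    Candidates are tried in order, so strict encodings should come first:
    latin-1 accepts every byte sequence and only works as a last resort.
    """
    allowed_whitespace = "\t\n\r\f"
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        if not all(ch.isprintable() or ch in allowed_whitespace for ch in text):
            continue  # decoded, but contains control characters
        try:
            return text.encode(target)
        except UnicodeEncodeError:
            return None  # printable, but not representable in the target set
    return None
```

For example, `to_charset(b"caf\xe9")` fails to decode as UTF-8, succeeds as cp1252, and comes back as UTF-8 bytes. A real detector's guess could simply be put at the front of `candidates`.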
