
Encoding of a file in a shell script

How can I check the file encoding in a shell script? I need to know if a file is encoded in utf-8 or iso-8859-1.

Thanks


I'd just use

file -bi myfile.txt

to determine the character encoding of a particular file.

It is a solution with an external dependency, but I suspect file is very common nowadays among all semi-modern distros.

EDIT:

In response to Laurence Gonsalves' comment: -b is the option to be 'brief' (i.e. not include the filename) and -i is the shorthand for --mime, so the most portable way (including Mac OS X) is probably:

file --mime myfile.txt 
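
To act on the result inside a script, you could branch on the charset name that file reports. A minimal sketch, assuming a file implementation that supports -b and --mime-encoding (the labels utf-8, iso-8859-1 and us-ascii are what GNU file prints; other builds may differ slightly):

#!/bin/bash
# Sketch: classify a file by the charset reported by file.
enc=$(file -b --mime-encoding "$1")

case "$enc" in
  utf-8)      echo "$1: looks like UTF-8" ;;
  iso-8859-1) echo "$1: looks like ISO-8859-1" ;;
  us-ascii)   echo "$1: plain ASCII (valid as both)" ;;
  *)          echo "$1: detected $enc" ;;
esac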


There's no way to be 100% certain (unless you're dealing with a file format that internally states its encoding).

Most tools that attempt to make this distinction will try to decode the file as utf-8 (as that's the stricter encoding), and if that fails, fall back to iso-8859-1. You can do this "by hand" with iconv (see the sketch at the end of this answer), or you can use file:

$ file utf8.txt
utf8.txt: UTF-8 Unicode text
$ file latin1.txt
latin1.txt: ISO-8859 text

Note that ASCII files are both UTF-8 and ISO-8859-1 compatible.

$ file ascii.txt
ascii.txt: ASCII text

Finally: there's no real way to distinguish between ISO-8859-1 and ISO-8859-2, for example, unless you're going to assume it's natural language and use statistical methods. This is probably why file says "ISO-8859".
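
The "by hand" iconv check mentioned above could look like this minimal sketch; it relies on iconv exiting with a non-zero status when the input contains byte sequences that are not valid in the source encoding:

if iconv -f utf-8 -t utf-8 myfile.txt > /dev/null 2>&1; then
  echo "myfile.txt decodes cleanly as UTF-8"
else
  echo "myfile.txt is not valid UTF-8; falling back to ISO-8859-1"
fi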


You can use the file command: file --mime myfile.txt


The file command is not 100% certain. A simple test:

#!/bin/bash

# Start with a single ASCII line ...
echo "a" > /tmp/foo

# ... pad the file with a million more ASCII-only lines ...
for i in {1..1000000}
do
  echo "asdas" >> /tmp/foo
done

# ... and append German umlauts (non-ASCII bytes) at the very end.
echo "üöäÄÜÖß " >> /tmp/foo

file -b --mime-encoding /tmp/foo

This outputs:

us-ascii

ASCII has no German umlauts, yet the file clearly contains them: file only inspects the first part of a large file, so it never sees the bytes appended at the end.
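
A check that has to read every byte does notice the umlauts. A minimal sketch using iconv (encoding names such as US-ASCII may vary slightly between iconv implementations):

# iconv must convert every byte, so the umlauts at the end make it fail:
iconv -f US-ASCII -t US-ASCII /tmp/foo > /dev/null 2>&1 || echo "not pure ASCII after all"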

A file is just a bunch of bytes (a byte sequence). Without trusting metadata (a BOM, which is only recommended for utf-16 and utf-32; a MIME type; or a header in the data) you can't really detect the encoding. A sequence of bytes can be interpreted as utf-8, ISO-8859-1/2, or anything you want; it only depends on whether a mapping exists for that particular sequence in iso-8859-1 or utf-8. What you want is to decode the whole file content with the desired character encoding. If that fails, the desired encoding has no mapping for some byte sequence in the file.

From a shell script you could call Python or Perl, or, as Laurence Gonsalves says, iconv. For text files I use this in Python:

import codecs
f = codecs.open(path, encoding='utf-8', errors='strict')


def valid_string(data):
  # Expects a byte string; returns True if it decodes cleanly as UTF-8.
  try:
    data.decode('utf-8')
    return True
  except UnicodeDecodeError:
    return False

How do you know that a file is a text file? You don't. You decode it line by line with the desired character encoding. OK, you can add a little trust and check whether a BOM exists (then the file is UTF-encoded).
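
Checking for a UTF-8 BOM from a shell script could look like this sketch (the BOM is the byte sequence EF BB BF at the very start of the file):

case "$(head -c 3 myfile.txt | od -An -tx1 | tr -d ' ')" in
  efbbbf) echo "myfile.txt starts with a UTF-8 BOM" ;;
  *)      echo "no UTF-8 BOM found" ;;
esac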
