printf field width doesn't support multibyte characters?

I want printf to recognize multi-byte characters when calculating the field width so that columns line up properly... I can't find an answer to this problem and was wondering if anyone here had any suggestions, or maybe a function/script that takes care of this problem.

Here's a quick and dirty example:

printf "## %5s %5s %5s ##\n## %5s %5s %5s ##\n" '' '*' '' '' "•" ''
>##           *       ##
>##         •       ##

Obviously, I want the result:

>##           *       ##
>##           •       ##

Any way to achieve this?

The best I can think of is:

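function formatwidth
{
  local STR=$1; shift
  local WIDTH=$1; shift
  # printf '%s' avoids echo's option parsing (e.g. a string such as "-n")
  local BYTEWIDTH=$( printf '%s' "$STR" | wc -c )
  local CHARWIDTH=$( printf '%s' "$STR" | wc -m )
  # widen the field by the number of extra bytes in multibyte characters
  echo $(( WIDTH + BYTEWIDTH - CHARWIDTH ))
}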

printf "## %5s %*s %5s ##\n## %5s %*s %5s ##\n" \
    '' $( formatwidth "*" 5 ) '*' '' \
    '' $( formatwidth "•" 5 ) "•" ''

You use the * width specifier to take the width as an argument, and calculate the width you need by adding the number of additional bytes in multibyte characters.

Note that in GNU wc, -c returns bytes, and -m returns (possibly multibyte) characters.
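The reason the adjustment works: printf counts the field width in bytes, and • takes three bytes in UTF-8 but only one column on screen. If you would rather not spawn wc twice per string, here is a minimal sketch of the same arithmetic in pure bash. The name adjusted_width is made up for this example (it is not from the answer above), and it assumes bash running in a UTF-8 locale, where ${#var} counts characters, while under LC_ALL=C it counts bytes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: same arithmetic as formatwidth, but using
# bash's ${#var} instead of two wc processes.
adjusted_width() {
    local str=$1 width=$2
    local chars=${#str}     # length in characters (current locale)
    local LC_ALL=C          # from here on, ${#str} counts bytes
    local bytes=${#str}
    echo $(( width + bytes - chars ))
}

# Pass the adjusted width through the * specifier, as above.
printf '## %*s ##\n' "$(adjusted_width '*' 5)" '*'
printf '## %*s ##\n' "$(adjusted_width '•' 5)" '•'
```

In a UTF-8 locale the second call should yield a width of 7 (5 plus the 2 extra bytes), so both lines end up the same visible width. In a C/POSIX locale the shell cannot tell characters from bytes, and the width comes back unchanged.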

I will probably use GNU awk:

awk 'BEGIN{ printf "## %5s %5s %5s ##\n## %5s %5s %5s ##\n", "", "*", "", "", "•", "" }'
##           *       ##
##           •       ##

You can even write a shell wrapper function called printf on top of awk to keep the same interface:

tr2awk() { 
    FMT="$1"
    echo -n "gawk 'BEGIN{ printf \"$FMT\""
    shift
    for ARG in "$@"
        do echo -n ", \"$ARG\""
    done
    echo " }'"
}

and then override printf with a simple function:

printf() { eval `tr2awk "$@"`; }

Test it:

# buggy printf binary test:
/usr/bin/printf "## %5s %5s %5s ##\n## %5s %5s %5s ##\n" '' '*' '' '' "•" ''
##           *       ##
##         •       ##
# buggy printf shell builtin test:
builtin printf "## %5s %5s %5s ##\n## %5s %5s %5s ##\n" '' '*' '' '' "•" ''
##           *       ##
##         •       ##

# fixed printf function test:
printf "## %5s %5s %5s ##\n## %5s %5s %5s ##\n" '' '*' '' '' "•" ''
##           *       ##
##           •       ##

A language like python will probably solve your problems in a simpler, more controllable way...

#!/usr/bin/env python3

import unicodedata

def width(string):
    # Wide (W) and fullwidth (F) characters take two terminal columns
    return sum(1 + (unicodedata.east_asian_width(c) in "WF")
               for c in string)

a1 = ['する', 'します', 'trazan', 'した', 'しました']
a2 = ['dipsy', 'laa-laa', 'banarne', 'po', 'tinky winky']

for i, j in zip(a1, a2):
    # pad each left-hand word out to 12 display columns
    print('%s %s: %s' % (i, ' ' * (12 - width(i)), j))

A pure shell solution

right_justify() {
        # parameters: field_width string
        local spaces questions
        spaces=''
        questions=''
        while [ "${#questions}" -lt "$1" ]; do
                spaces=$spaces" "
                questions=$questions?
        done
        result=$spaces$2
        result=${result#"${result%$questions}"}
}

Note that this still does not work in dash because dash has no locale support.
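For anyone wondering how to call it: the function leaves its answer in the global variable result. It works because each ? in a shell pattern matches exactly one character (not one byte) in a multibyte locale, so stripping everything except the last N characters gives a field N characters wide. Below is a usage sketch. The function body is repeated with comments only so the block runs on its own; the loop at the bottom is my own illustration.

```shell
#!/usr/bin/env bash
# right_justify as posted above, repeated so this block is self-contained.
right_justify() {
    # parameters: field_width string
    local spaces='' questions=''
    while [ "${#questions}" -lt "$1" ]; do
        spaces="$spaces "
        questions="$questions?"
    done
    result=$spaces$2                          # left-pad with field_width spaces
    result=${result#"${result%$questions}"}   # keep only the last field_width characters
}

# Illustrative usage: the padded string comes back in $result.
for s in '*' '•'; do
    right_justify 5 "$s"
    printf '## %s ##\n' "$result"
done
```

One thing to be aware of: a string longer than the field is cut down to its last N characters rather than left to overflow the way printf would.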

This is kind of late, but I just came across this, and thought I would post it for others coming across the same post. A variation on @ninjalj's answer might be to create a function that returns a string of a given length rather than calculate the required format length:

#!/bin/bash
function sized_string
{
        local STR=$1 WIDTH=$2
        local BYTEWIDTH=$( echo -n "$STR" | wc -c )
        local CHARWIDTH=$( echo -n "$STR" | wc -m )
        local FMT_WIDTH=$(( WIDTH + BYTEWIDTH - CHARWIDTH ))
        # quote the arguments so a string containing spaces stays one field
        printf "%*s" "$FMT_WIDTH" "$STR"
}
printf "[%s]\n" "$(sized_string "abc" 20)"
printf "[%s]\n" "$(sized_string "ab•cd" 20)"

which outputs:

[                 abc]
[               ab•cd]

Here's another solution with (g)awk:

function multibyte_printf {
    begin_rule='BEGIN { printf'
    vars=()
    
    for (( arg_index=1; arg_index<=$#; arg_index++ )); do
        begin_rule+=" arg${arg_index},"
        arg="${!arg_index}"
        vars+=('-v' "arg${arg_index}=${arg}")
    done
    
    # Remove last ','
    begin_rule="${begin_rule:0:${#begin_rule}-1}"
    begin_rule+=' }'
    
    gawk "${vars[@]}" "$begin_rule"
}

It generates and executes commands like this:

gawk -v 'arg1=%10s' -v 'arg2=World' 'BEGIN { printf arg1, arg2 }'

The main advantage of this solution over @Michał Šrajer's is improved security. Using awk variables instead of baking parameters into the rule code eliminates the need to escape special characters. It should be impossible to tamper with execution using malformed arguments.
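To make the security point concrete, here is a small sketch. The hostile string is invented for the demonstration, and plain awk is used because this part does not depend on gawk (only the multibyte-aware padding does). An argument that would end the string literal and run a command if it were pasted into the awk source is treated as ordinary data when it arrives through -v.

```shell
#!/usr/bin/env bash
# Made-up argument that would break out of a double-quoted awk string
# if it were pasted into the program text.
hostile='x"; system("echo INJECTED"); "'

# Passed with -v it is only a value; awk prints it back verbatim.
awk -v 'fmt=[%s]\n' -v "arg=$hostile" 'BEGIN { printf fmt, arg }'
# prints: [x"; system("echo INJECTED"); "]
```

One caveat: awk still interprets backslash escapes in -v values. That is what lets \n in the format string work, but it also means a data argument containing a backslash sequence will be altered, so this approach protects against code injection and not against every surprising input.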

Are these the only way? There's no way to do it with printf alone?

Well with the example from ninjalj (thx btw), I wrote a script to deal with this problem, and saved it as fprintf in /usr/local/bin:

#! /bin/bash

IFS=' '
declare -a Text=("${@}")

## Skip the whole thing if there are no multi-byte characters ##
if (( $(echo "${Text[*]}" | wc -c) > $(echo "${Text[*]}" | wc -m) )); then
    if echo "${Text[*]}" | grep -Eq '%[#0 +-]?[0-9]+(\.[0-9]+)?[sb]'; then
        IFS=$'\n'
        declare -a FormatStrings=($(echo -n "${Text[0]}" | grep -Eo '%[^%]*?[bs]'))
        IFS=$' \t\n'
        declare -i format=0

    ## Check every format string ##
        for fw in "${FormatStrings[@]}"; do
            (( format++ ))
            if [[ "$fw" =~ ^%[#0\ +-]?[1-9][0-9]*(\.[1-9][0-9]*)?[sb]$ ]]; then
                (( Difference = $(echo "${Text[format]}" | wc -c) - $(echo "${Text[format]}" | wc -m) ))

            ## If multi-byte characters ##
                if (( Difference > 0 )); then

                ## If a field width is entered then replace field width value ##
                    if [[ "$fw" =~ ^%[#0\ +-]?[1-9][0-9]* ]]; then
                        (( Width = $(echo -n "$fw" | gsed -re 's|^%[#0 +-]?([1-9][0-9]*).*[bs]|\1|') + Difference ))
                        declare -a Text[0]="$(echo -n "${Text[0]}" | gsed -rne '1h;1!H;${g;y|\n|\x1C|;s|(%[^%])|\n\1|g;p}' | gsed -rne $(( format + 1 ))'s|^(%[#0 +-]?)[1-9][0-9]*|\1'${Width}'|;1h;1!H;${g;s|\n||g;y|\x1C|\n|;p}')"
                    fi

                ## If a precision is entered then replace precision value ##
                    if [[ "$fw" =~ \.[1-9][0-9]*[sb]$ ]]; then
                        (( Precision = $(echo -n "$fw" | gsed -re 's|^%.*\.([1-9][0-9]*)[sb]$|\1|') + Difference ))
                        declare -a Text[0]="$(echo -n "${Text[0]}" | gsed -rne '1h;1!H;${g;y|\n|\x1C|;s|(%[^%])|\n\1|g;p}' | gsed -rne $(( format + 1 ))'s|^(%[#0 +-]?([1-9][0-9]*)?)\.[1-9][0-9]*([bs])|\1.'${Precision}'\3|;1h;1!H;${g;s|\n||g;y|\x1C|\n|;p}')"
                    fi
                fi
            fi
        done
    fi
fi

printf "${Text[@]}"
exit 0

Usage: fprintf "## %5s %5s %5s ##\n## %5s %5s %5s ##\n" '' '*' '' '' '•' ''

A few things to note:

I didn't write this script to deal with * (asterisk) values for formats because I never use them. I wrote this for me and didn't want to over-complicate things.
I wrote this to check only the format strings %s and %b as they seem to be the only ones that are affected by this problem. Thus, if somehow someone manages to get a multi-byte unicode character out of a number, it may not work without minor modification.
The script works well for basic use of printf (it isn't aimed at old-school UNIX hackers); feel free to modify it or use it as is.