Split large string into substrings

I have a huge string like

ABCDEFGHIJKLM...

and I would like to split it into substrings of length 5 in this way:

>1
ABCDE
>2
BCDEF
>3
CDEFG
[...]

${string:position:length}

Extracts $length characters of substring from $string at $position.

stringZ=abcABC123ABCabc
#       0123456789.....
#       0-based indexing.

echo ${stringZ:0}          # abcABC123ABCabc
echo ${stringZ:1}          # bcABC123ABCabc
echo ${stringZ:7}          # 23ABCabc

echo ${stringZ:7:3}        # 23A
                           # Three characters of substring.

-- from Manipulating Strings in the Advanced Bash-Scripting Guide by Mendel Cooper

Then use a loop to go through and add 1 to the position to extract each substring of length 5.

end=$(( ${#stringZ} - 5 ))      # last valid start position
for i in $(seq 0 "$end"); do
    echo "${stringZ:$i:5}"
done
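
The same loop can be wrapped in a function so the width is not hard-coded and no external seq is needed. This is only a sketch in bash: the name windows and its optional second argument are made up for illustration and are not part of the quoted guide.

```bash
# windows STRING [WIDTH]: print every WIDTH-character window of STRING,
# moving one character at a time. (Hypothetical helper name.)
windows() {
    local s=$1 w=${2:-5} i
    # There are len - w + 1 windows; the body never runs if s is shorter than w.
    for (( i = 0; i <= ${#s} - w; i++ )); do
        printf '%s\n' "${s:i:w}"
    done
}

windows ABCDEFG 5
```

For a string shorter than the width it prints nothing instead of a truncated substring.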

fold -w5 should do the trick.

$ echo "ABCDEFGHIJKLMNOPQRSTUVWXYZ" | fold -w5
ABCDE
FGHIJ
KLMNO
PQRST
UVWXY
Z

Cheers!
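
Note that fold produces consecutive, non-overlapping chunks. If that is what you are after and you also want the ">N" headers from the question, something like the following might do. It is a sketch; the function name chunks is invented here for illustration.

```bash
# chunks STRING [WIDTH]: non-overlapping pieces, each preceded by ">N".
# (Hypothetical helper name.)
chunks() {
    # fold breaks the line every WIDTH columns; awk numbers each piece.
    printf '%s\n' "$1" | fold -w "${2:-5}" | awk '{ print ">" NR; print }'
}

chunks ABCDEFGHIJKL 5
```

The last piece is simply shorter when the length is not a multiple of the width.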

sed can do it in one shot:

$ echo "abcdefghijklmnopqr"|sed -r 's/(.{5})/\1 /g'
abcde fghij klmno pqr

Or with newlines, depending on your needs:

$ echo "abcdefghijklmnopqr"|sed -r 's/(.{5})/\1\n/g' 
abcde
fghij
klmno
pqr

Update

I thought this was simply a string-splitting problem and didn't read the question very carefully. This should now give what you need:

Still one shot, but with awk this time:

$ echo "abcdefghijklmnopqr"|awk '{while(length($0)>=5){print substr($0,1,5);gsub(/^./,"")}}'

abcde
bcdef
cdefg
defgh
efghi
fghij
ghijk
hijkl
ijklm
jklmn
klmno
lmnop
mnopq
nopqr
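
The gsub version rewrites the whole record on every pass, which may get slow on a really huge string. An alternative sketch indexes into the record with substr instead and prints the ">N" headers from the question at the same time. The wrapper name sliding and the awk variable w are my own choices for illustration.

```bash
# sliding STRING [WIDTH]: overlapping windows with ">N" headers.
# (Hypothetical helper name.)
sliding() {
    printf '%s\n' "$1" | awk -v w="${2:-5}" '{
        n = length($0) - w + 1          # number of full-width windows
        for (i = 1; i <= n; i++) {      # awk strings are indexed from 1
            print ">" i
            print substr($0, i, w)
        }
    }'
}

sliding ABCDEFG
```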

...or use the split command:

$ ls

$ echo "abcdefghijklmnopqr" | split -b5

$ ls
xaa  xab  xac  xad

$ cat xaa
abcde

split also operates on files...

In bash:

s=ABCDEFGHIJ
for (( i=0; i < ${#s}-4; i++ )); do 
  printf ">%d\n%s\n" $((i+1)) ${s:$i:5}
done

outputs

>1
ABCDE
>2
BCDEF
>3
CDEFG
>4
DEFGH
>5
EFGHI
>6
FGHIJ

Would sed do it?

$ sed 's/\(.....\)/\1\n/g' < filecontaininghugestring

str=ABCDEFGHIJKLM
splitfive(){ echo "${1:$2:5}" ; }
# Stop at the last full window so no substrings shorter than 5 are printed.
for (( i=0 ; i <= ${#str} - 5 ; i++ )) ; do splitfive "$str" $i ; done

Or, perhaps you want to do something more intelligent with the results

#!/usr/bin/env bash

splitstr(){
    printf '%s\n' "${1:$2:$3}"
}

n=$1
offset=$2

declare -a by_fives

while IFS= read -r str ; do
    # Only start positions that leave room for a full n-character substring.
    for (( i=0 ; i <= ${#str} - n ; i++ )) ; do
            by_fives+=("$(splitstr "$str" $i $n)")
    done
done

echo "${by_fives[$offset]}"

And then call it

$ split-by 5 2 <<<"ABCDEFGHIJKLM"
CDEFG

You can adapt it from there.

EDIT: trivial version in C, for performance comparison:

#include <stdio.h>

int main(void){
    FILE* f;
    int n=0;
    char five[6];

    five[5] = '\0';

    f = fopen("inputfile", "r");

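    if(f!=NULL){
            /* Stop at the first short read so only full 5-byte windows are printed. */
            while(fread(five, sizeof(char), 5, f) == 5){
                    printf("%s\n", five);
                    /* Seek to one byte past the start of the previous window. */
                    fseek(f, ++n, SEEK_SET);
            }
            fclose(f);
    }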

    return 0;
}

Forgive my bad C, I really don't know the language.

sed can do it:

 sed -nr ':a;h;s/(.{5}).*/\1/p;g;s/.//;ta;' <<<"ABCDEFGHIJKLM" | # split string
     sed '=' | sed '1~2s/^/>/' # add line numbers and insert '>'

You could use cut and specify characters instead of fields, and then change the output delimiter to whatever you need, such as a newline:

echo "ABCDEFGHIJKLMNOP" | cut --output-delimiter=$'\n' -c1-5,6-10,11-15

output

ABCDE
FGHIJ
KLMNO

echo "ABCDEFGHIJKLMNOP" | cut --output-delimiter=$':' -c1-5,6-10,11-15

output

ABCDE:FGHIJ:KLMNO
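
The range list above is typed out by hand. For a longer string it could be generated instead. Here is a rough bash sketch, assuming GNU cut (for --output-delimiter, as above); the helper name ranges is hypothetical.

```bash
# ranges LENGTH [WIDTH]: build a cut-style list such as 1-5,6-10,11-15,...
# (Hypothetical helper name.)
ranges() {
    local len=$1 w=${2:-5} start list=
    for (( start = 1; start <= len; start += w )); do
        # Prepend a comma to every range except the first.
        list+="${list:+,}${start}-$(( start + w - 1 ))"
    done
    printf '%s\n' "$list"
}

s=ABCDEFGHIJKLMNOP
ranges "${#s}"
printf '%s\n' "$s" | cut --output-delimiter=$'\n' -c "$(ranges "${#s}")"
```

Unlike the hand-written list, the generated one also covers the leftover characters at the end: the last range may run past the end of the line, where cut just stops.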

Thanks to you guys, I was able to find a way to do this fast! This is my solution, combining a few ideas from here:

str="ABCDEFGHIJKLMNOP"
splitfive(){
    echo "$1" | cut -c "$2"- | sed -r 's/(.{5})/\1\n/g'
}
# cut numbers characters from 1, so the offsets run from 1 to 5.
for (( i=1; i <= 5; i++ )); do
    splitfive "$str" $i
done | grep -v "^$"

[The above answer was initially added to the question itself. Here are the relevant comments.]

Your splitfive could be more efficient. There's no need to pipe echo into cut; in bash you could say cut -c "$2"- <<<"$1" | sed etc. and it will be slightly better. -- sorpigal Sep 28 '11 at 11:48
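
Spelled out, the suggestion in that comment might look like the sketch below. It assumes bash (for the here-string) and GNU sed (for -r and \n in the replacement), and it treats the second argument as a start column counted from 1, since that is how cut numbers characters.

```bash
# splitfive STRING START: 5-character pieces of STRING from column START on.
splitfive() {
    # The here-string feeds $1 to cut directly, saving the echo and one pipe.
    cut -c "$2"- <<<"$1" | sed -r 's/(.{5})/\1\n/g'
}

splitfive ABCDEFGHIJKLMNOP 2 | grep -v '^$'
```

The grep is still needed to drop the blank line sed leaves behind when the remaining length is an exact multiple of 5.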