
How do uppercase and lowercase letters differ by only one bit?

I found one example in the book Data Communications and Networking by Behrouz A. Forouzan regarding upper- and lowercase letters which differ by only one bit in the 7-bit code.

For example, character A is 1000001 (0x41) and character a is 1100001 (0x61). The difference is in bit 6, which is 0 in uppercase letters and 1 in lowercase letters. If we know the code for one case, we can easily find the code for the other by adding or subtracting 32 in decimal, or we can just flip the sixth bit.

What does all this mean?

I have found myself very confused with all these things. Could someone provide examples of how these things really work out?


Let's use a case that you'll find more familiar: base 10.

  1. Suppose we have a base 10 computer, where each 10bit stores a value from 0 to 9, and a 10byte is 5 10bits long, so that each 10byte can store 100,000 values (0 through 99,999).

  2. You wish to assign letters to particular positions in a 10byte so that this computer can communicate text data with other computers. One way you could do this would be like so:

     00101 A    00201 a
     00102 B    00202 b
     00103 C    00203 c
     00104 D    00204 d
     00105 E    00205 e
     00106 F    00206 f
     00107 G    00207 g
     00108 H    00208 h
     00109 I    00209 i
     00110 J    00210 j
     00111 K    00211 k
     00112 L    00212 l
     00113 M    00213 m
     00114 N    00214 n
     00115 O    00215 o
     00116 P    00216 p
     00117 Q    00217 q
     00118 R    00218 r
     00119 S    00219 s
     00120 T    00220 t
     00121 U    00221 u
     00122 V    00222 v
     00123 W    00223 w
     00124 X    00224 x
     00125 Y    00225 y
     00126 Z    00226 z
    
  3. Do you see that each lower case letter differs from the upper case letter by only a single 10bit digit, in the 3rd column from the right? It didn't have to be designed this way. It was simply convenient, because then any time we want to adjust the case of a letter we can simply modify one of the digits (10bits) without caring what the rest of the number is or bothering with twenty-six different transformations when we can do one. We couldn't have chosen the second digit because instead of being 100 apart, they'd be only 10 apart and would overlap.

  4. Now, in base 2 it is exactly the same, but instead of each digit representing 0-9, each bit can only represent 0 or 1. Using eight bits gives us 256 possible combinations, 0-255. The ASCII codes for the upper and lower case letters in binary look like this:

     01000001 A        01100001 a
     01000010 B        01100010 b
     01000011 C        01100011 c
     01000100 D        01100100 d
     01000101 E        01100101 e
     01000110 F        01100110 f
     01000111 G        01100111 g
     01001000 H        01101000 h
     01001001 I        01101001 i
     01001010 J        01101010 j
     01001011 K        01101011 k
     01001100 L        01101100 l
     01001101 M        01101101 m
     01001110 N        01101110 n
     01001111 O        01101111 o
     01010000 P        01110000 p
     01010001 Q        01110001 q
     01010010 R        01110010 r
     01010011 S        01110011 s
     01010100 T        01110100 t
     01010101 U        01110101 u
     01010110 V        01110110 v
     01010111 W        01110111 w
     01011000 X        01111000 x
     01011001 Y        01111001 y
     01011010 Z        01111010 z
    

    Just the same as before, they differ by only one 2bit digit, here in the 6th column from the right. We couldn't have used a digit any farther to the right (smaller), because then the two lists would have overlapped: the 26 letters already need the five low bits, positions 0 through 4 (2^5 = 32 is the first power of two big enough to hold 26, while 2^4 = 16 could not cover the 26 letters of the alphabet), so the case bit has to sit at position 5 or higher.

  5. Just to fill things out a little, here's an example of what those binary values mean. Let's take the one for G. To understand what 01000111 means in binary:

      Pos:   7  6  5  4  3  2  1  0
      Bit:   0  1  0  0  0  1  1  1
      Val: 128 64 32 16  8  4  2  1
     Mult:   0 64  0  0  0  4  2  1
      Add: 64 + 4 + 2 + 1 = 71, which is the ASCII code for G.
    

    Doing the same thing for the letter G in the special base 10 system I constructed above:

       Pos:     4    3    2    1    0
     10Bit:     0    0    1    0    7
       Val: 10000 1000  100   10    1
      Mult:     0    0  100    0    7
       Add: 100 + 7 = 107, which is my special 10ASCII code for G.
    

    Look back at the "Val" row for binary. Do you see that starting from the right, each value is double the previous one? Doubling each time we get 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and so on. This is how a binary digit's position determines its value, just like a decimal digit's position determines its value with powers of 10: 1, 10, 100, 1000, 10000, 100000, and so on.

    I realize this seems silly because all I did was convert 107 to 107... but 107 isn't just a number, it is a shorthand form for:

     1 hundreds + 0 tens + 7 ones.
    

    Another way we could represent that is

     0 x 10^4 + 0 x 10^3 + 1 x 10^2 + 0 x 10^1 + 7 x 10^0.
    

    Similarly, 01000111 isn't just a binary number, it is a shorthand form for

     0 x 2^7 + 1 x 2^6 + 0 x 2^5 + 0 x 2^4 + 0 x 2^3 + 1 x 2^2 + 1 x 2^1 + 1 x 2^0
    

    Which is what I already showed you:

     0 + 64 + 0 + 0 + 0 + 4 + 2 + 1
     = 64 + 4 + 2 + 1
     = 71
    

Also, you may have been wondering what 0x41 and 0x61 meant. The 0x part indicates that the digits that follow are to be understood as hexadecimal, which is base 16. There are only 10 digits in our number system, so we need 6 more somehow; hexadecimal therefore uses the digits 0-9 and treats the letters A-F as the remaining digits, where A is 10 and F is 15. Hexadecimal is very convenient for computers because 16 is a power of 2: an 8-bit byte takes exactly two hex digits to encode, and each hex digit encodes exactly four binary digits.

Taking 0x41, expanding 4 to its binary representation 0100 and expanding 1 to its binary representation 0001, you get 01000001, which you can see is the code for A as shown. To convert it to decimal: 4 x 16 + 1 x 1 = 65. We multiply the 4 by 16 because each successive hexadecimal digit leftward is worth 16 times the previous one, following the same pattern I showed you above for base 2 and base 10.
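
If you want to see those three views side by side, here is a small C++ snippet of my own (not from the book) that prints 'A' and 'a' in decimal, hexadecimal, and binary; the only difference between them is the 0x20 (32) bit:

    #include <bitset>
    #include <iostream>

    int main()
    {
        char upper = 'A';
        char lower = 'a';

        // The same two codes viewed in decimal, hexadecimal, and binary.
        std::cout << "A = " << int(upper)                  // 65
                  << " = 0x" << std::hex << int(upper)     // 41
                  << " = " << std::bitset<8>(upper) << '\n';
        std::cout << "a = " << std::dec << int(lower)      // 97
                  << " = 0x" << std::hex << int(lower)     // 61
                  << " = " << std::bitset<8>(lower) << '\n';

        // The difference is exactly one bit: 0x61 - 0x41 = 0x20 = 32.
        std::cout << std::dec << (int(lower) - int(upper)) << '\n';   // 32
    }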

I hope this is sufficient for you to understand a little bit more about binary and ASCII codes.

Note 1: The reason for 8 bits in a byte instead of 2 as you might think is that 8 is a better balance, as a 2-bit "byte" would only encode 4 values, and transmitting the upper and lower case letters of the alphabet alone would require 3 bytes! There is nothing inherent in binary that forces the choice of 8 bits per byte, except that 8 is also a power of 2 which makes a lot of the math involved in working with binary information simpler and things align on edges better.

In the early days of computing, different systems used many different byte lengths, including 7, 9, and other numbers. Nowadays the computing world has settled on 8 as the standard number of bits in a byte (though note that modern text encodings sometimes require 2-4 bytes per character to represent all the possible characters).

I am sure that choosing something like 6 bits per byte would have worked out awkwardly, and would not have made good use of the full range of values available.

Note 2: My system of five 10bits in a 10byte is based on the impracticality of using ten 10bits per 10byte, which yields a really huge number that would waste a lot of storage space. I chose five because ten is evenly divisible by it, which would undoubtedly be useful. (Originally, my answer used ten 10bits per 10byte, but it was too darned big!)


This relationship between the upper case and lower case letters was deliberate. When the ASCII code was formulated, computer hardware was primitive and software needed to conserve every byte. Flipping a single bit takes very little hardware or code to accomplish.


In order to know whether to add or subtract 32, you first must know whether the character is currently uppercase or lowercase (for example, whether its code falls in the 'A'-'Z' range or the 'a'-'z' range).

When this book was written, the programming languages most people were using did not have Strings or .equalsIgnoreCase. This was pre-i18n, and when a business had a server, you would telnet to it (like xterm) and get a command-line menu. What he's describing was typically used to create a nice case-insensitive menu for your users, taking advantage of the numeric layout of the ASCII table.

It can be very fast, because there are bit-wise assembler instructions to do the math in either direction, regardless of whether the characters are already upper or lowercase.

c = c | 32 // to lowercase (sets bit 5, whose value is 32)

c = c & (1+2+4+8+16+ 0 +64+128) // to uppercase (clears bit 5; the mask is 223, i.e. everything except 32)
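
Here is a minimal, self-contained C++ sketch of the same idea (my own illustration, not the book's code; the names toLowerAscii and toUpperAscii are made up), with a range check added so non-letters pass through unchanged:

    #include <cstdio>

    // Setting bit 5 (value 32) forces lowercase; clearing it forces uppercase.
    // The range checks leave digits, punctuation, etc. untouched.
    char toLowerAscii(char c)
    {
        return (c >= 'A' && c <= 'Z') ? (c | 32) : c;
    }

    char toUpperAscii(char c)
    {
        return (c >= 'a' && c <= 'z') ? (c & ~32) : c;
    }

    int main()
    {
        std::printf("%c %c\n", toLowerAscii('G'), toUpperAscii('g'));  // prints: g G
        std::printf("%c %c\n", toLowerAscii('7'), toUpperAscii('!'));  // prints: 7 !
    }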

Say you had a Java-like language, without objects or the standard libs. Your networking author is prompting you to code like this:

    public static void main()
    {
        println("What would you like to do?");
        println("Inventory (inv)");
        println("Reports (rep)");

        char[] ca = readUserInput();        
        for (int i = 0; i < ca.length; i++)
            ca[i] = ca[i] | 32;  // force lowercase, by ensuring bit 5 (value 32) is set

        if (compareInput(ca, "inv") == true)
            doInventory();
    }

Have you ever searched Google and noticed that it doesn't matter whether you capitalize a person's name? That's the same kind of case-insensitive matching at work.


http://asciitable.com/

0x61 is hexadecimal for 97 = a
0x41 is hexadecimal for 65 = A

So subtracting/adding decimal 32 is indeed the way to convert to uppercase/lowercase.

Z is 90 = 0b1011010    = 0x5A
z is 122 = 0b1111010   = 0x7A

Which is a difference of 0b0100000 in binary, or 0x20, or 32 in decimal.

Thus switching the 6th bit changes case.
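
Because the two codes differ in exactly that one bit, XOR-ing with 0x20 toggles the case in either direction. A quick C++ illustration (my own, just to demonstrate the point):

    #include <iostream>

    int main()
    {
        // XOR with 0x20 flips only the case bit, so it works in both directions.
        char a = 'Z' ^ 0x20;   // 0x5A ^ 0x20 = 0x7A = 'z'
        char b = 'z' ^ 0x20;   // 0x7A ^ 0x20 = 0x5A = 'Z'
        std::cout << a << ' ' << b << '\n';   // prints: z Z
    }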


Take a look: the 6th bit = 32, so if you flip it you subtract or add 32.

Bit value
1   1
2   2
3   4
4   8
5   16
6   32 (32 = hex 20)

Now if you look here http://asciitable.com/, you can see the ASCII table for all the characters and will notice that A = 65 and a = 97.
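
If you want to convince yourself that the same +32 offset holds for every letter, a short C++ loop will print the whole alphabet in pairs (my own quick check, not part of the table above):

    #include <iostream>

    int main()
    {
        // Adding 32 to each uppercase code (65..90) gives the lowercase code (97..122).
        for (char c = 'A'; c <= 'Z'; ++c)
            std::cout << c << " (" << int(c) << ")  ->  "
                      << char(c + 32) << " (" << (int(c) + 32) << ")\n";
    }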


I think most of these answers are unnecessarily complicated and occasionally condescending.

The decimal-to-ASCII character mapping is arbitrary and doesn't really have anything to do with understanding how base 2 or base 10 works. It's purely a convenience. If someone mistakenly coded a lowercase character but meant an uppercase one, it's more convenient to flip one bit than to recode the entire byte, and it's less prone to human error. If the output is 'a' but we wanted 'A', at least we know we got most of the bits right, and we just have to flip 2^5 to add or subtract 32. Why pick bit 5 specifically (it's not 6 as some have said; you start counting from 0)? Clearly that's the bit that makes sense to separate two ranges of 26 characters with only one bit flip. If you did this on a lesser-valued bit, you'd have to flip more than one.


// Compile-time description of an inclusive character range [TLBound, TUBound].
template<char TLBound, char TUBound>
struct CharRange
{
    enum 
    {
        LBound = TLBound,
        UBound = TUBound
    };

    static bool InRange(char ch)
    {
        return (ch >= LBound)  && (ch <= UBound);
    };
};

typedef CharRange<'a', 'z'> lcaseLetters;
typedef CharRange<'A', 'Z'> ucaseLetters;

// XOR with bit 5 (1 << 5 == 32) toggles the case bit: 'a' (0x61) becomes 'A' (0x41).
char toUpper(char ch)
{
    if(lcaseLetters::InRange(ch))
    {
        return (ch ^ (0x1 << 5));
    }

    return ch;
}

// The same single-bit flip works in the other direction: 'A' becomes 'a'.
char toLower(char ch)
{
    if(ucaseLetters::InRange(ch))
    {
        return (ch ^ (0x1 << 5));
    }

    return ch;
}