Text Representation

Like numbers, character data is stored in binary, and there are several systems for doing this. Until quite recently (around 2007) the ASCII (American Standard Code for Information Interchange) system dominated; it has since been overtaken by the Unicode family of encodings. Note that a number stored as a character does not have the same binary pattern as the same number stored numerically. For example, the number 4 can be stored as follows using seven bits.

ASCII character pattern   Binary number pattern
0110100                   0000100
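As a quick sketch in Python, the two seven-bit patterns above can be reproduced directly: one from the ASCII code of the character '4', the other from the number 4 itself.

```python
# The character '4' has ASCII code 52; the number 4 is just 4.
char_pattern = format(ord('4'), '07b')  # seven-bit ASCII pattern of the character
num_pattern = format(4, '07b')          # seven-bit pattern of the number

print(char_pattern)  # 0110100
print(num_pattern)   # 0000100
```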

ASCII representation

ASCII uses seven bits to store information on each character. Each character on a computer keyboard has its equivalent ASCII pattern. For example, the letter A has the pattern 1000001 and the text version of the number 4 (as distinct from its number representation) is 0110100.

128 possible bit patterns can be got from seven bits, as 2^7 = 128. Many of the patterns represent what are known as control characters, such as line feed (LF) and carriage return (CR). These show the age of the ASCII system: it was first created in the early 1960s for teletype machines, which needed line feeds and carriage returns. In fact the first 32 bit patterns, from 0000000 to 0011111, represent control characters. The delete character (binary 1111111) is also usually considered a control character. Control characters cannot be seen on the page or screen.
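A few of these control characters survive in modern text as escape sequences. A small Python sketch shows that they map to the low code points just described:

```python
# LF and CR sit in the first 32 ASCII code points; DEL is at 127.
for name, ch in [('LF', '\n'), ('CR', '\r'), ('DEL', '\x7f')]:
    print(name, ord(ch), format(ord(ch), '07b'))
```

Running this prints 10 (0001010) for LF, 13 (0001101) for CR, and 127 (1111111) for DEL.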

An ASCII chart lists each printable character alongside its binary pattern.


Things to note:
  • Each binary pattern has its Octal, Hexadecimal and Decimal equivalent.
  • Numbers and characters are ordered. For example, 1 (decimal 49) comes after 0 (decimal 48), and B (decimal 66) comes after A (decimal 65). This makes sorting easy.
  • The lowercase letters come after the uppercase ones.
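The ordering point can be seen in Python, where sorting characters sorts them by their codes: digits come first, then uppercase letters, then lowercase.

```python
# Sorting characters sorts by code point: digits < uppercase < lowercase.
chars = ['b', 'A', '1', 'a', 'B', '0']
print(sorted(chars))       # ['0', '1', 'A', 'B', 'a', 'b']
print(ord('0'), ord('1'))  # 48 49
print(ord('A'), ord('B'))  # 65 66
print(ord('a'), ord('b'))  # 97 98
```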
Attached to this page is a BYOB v3.1.1 (Build Your Own Blocks version 3.1.1) program that takes keyboard input and converts it to its ASCII binary equivalent. All keyboard characters can be converted.

Limitations and Unicode

A byte (8 bits) is the commonly used unit in computing, whereas seven bits is not. This means that if a byte is used to store a character's binary pattern then the leftmost bit is redundant (and is always 0). Extended ASCII encodings, such as ISO 8859-1, use this extra bit, which allows many more characters to be represented. (EBCDIC, another eight-bit encoding from the same era, is a separate system used on IBM mainframes rather than an extension of ASCII.)
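The use of the eighth bit can be demonstrated in Python: encoding an accented letter with ISO 8859-1 (called latin-1 in Python) produces a single byte whose leftmost bit is 1.

```python
# 'é' is not in seven-bit ASCII, but ISO 8859-1 fits it in one byte
# by using the eighth bit.
b = 'é'.encode('latin-1')
print(b, format(b[0], '08b'))  # b'\xe9' 11101001
```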

Seven-bit ASCII suits the English language, but not others, such as Asian languages that contain many more characters. Accordingly, other encoding schemes have been invented. The most used ones are in the Unicode family: UTF-8 is very common, and there are also UTF-16 and UTF-32. UTF-8 is a variable-length character encoding scheme, meaning the number of bytes used depends on the character being encoded. For instance, English text only needs seven bits per character (the ASCII set) and can be accommodated using one byte, while other characters need up to four bytes. It would be easy to think that four bytes would give 32 bits to use and 2^32 possible bit patterns. However, this is not the case, as control bit patterns are present that indicate whether 1, 2, 3 or 4 (or, in the original design, up to 6) bytes are being used, and where the actual character bits start. The control bits sit at the left-hand end of each byte, so a program reading the first byte can immediately tell how many bytes make up the character.
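The variable length is easy to observe in Python by encoding a few characters and counting the bytes produced:

```python
# UTF-8 byte counts grow with the code point:
# 'A' (U+0041) 1 byte, 'Ā' (U+0100) 2, '€' (U+20AC) 3, '😀' (U+1F600) 4.
for ch in ['A', 'Ā', '€', '😀']:
    encoded = ch.encode('utf-8')
    print(ch, hex(ord(ch)), len(encoded), 'byte(s)')
```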

The patterns for one to six bytes are shown below; they are all unique. In each byte the fixed control bits come first.

Max usable bits   Last code point (hex)   Byte 1     Byte 2     Byte 3     Byte 4     Byte 5     Byte 6
 7                U+007F                  0xxxxxxx
11                U+07FF                  110xxxxx   10xxxxxx
16                U+FFFF                  1110xxxx   10xxxxxx   10xxxxxx
21                U+1FFFFF                11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
26                U+3FFFFFF               111110xx   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx
31                U+7FFFFFFF              1111110x   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx

Note:
  • The fixed control bits, e.g. 110 at the start of the first byte of a 2-byte pattern.
  • The x character is a stand-in for a 0 or a 1.
  • The four-byte pattern has 11 control bits and 21 usable bits. This allows for 2^21, or 2,097,152, possible characters to be represented.
  • U+007F is the last code point of the 1-byte format, i.e. the largest value its seven usable bits can hold. It translates to decimal like so:

     Hex   Working          Decimal
     7     7 × 16^1 = 112       112
     F     15 × 16^0 = 15        15
                       Total    127
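Because the control bits are fixed and unique, a decoder can read the length of a sequence from its first byte alone. A minimal Python sketch of this idea (the function name is my own, and it covers only the one- to four-byte patterns in today's UTF-8):

```python
def utf8_length(first_byte):
    """Number of bytes in a UTF-8 sequence, judged from its first byte."""
    if first_byte >> 7 == 0b0:      # 0xxxxxxx: one-byte (ASCII) pattern
        return 1
    if first_byte >> 5 == 0b110:    # 110xxxxx: start of a two-byte pattern
        return 2
    if first_byte >> 4 == 0b1110:   # 1110xxxx: start of a three-byte pattern
        return 3
    if first_byte >> 3 == 0b11110:  # 11110xxx: start of a four-byte pattern
        return 4
    raise ValueError('continuation byte or invalid leading byte')

print(utf8_length(0b01000001))  # 1
print(utf8_length(0b11000100))  # 2
```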

An example

Given the character below, work out its UTF-8 bit pattern.

Unicode code point   Character   UTF-8 (hex)   Name
U+0100               Ā           c4 80         LATIN CAPITAL LETTER A WITH MACRON
  1. 0100 in hex is 00000001 00000000 in binary (written as two bytes).
  2. The leftmost 1 occupies the 9th position from the right-hand end, so nine usable bits are needed. This is more than the seven of the one-byte pattern, so the two-byte pattern (up to 11 usable bits) is used.
  3. The generic UTF-8 two-byte pattern is 110xxxxx 10xxxxxx.
  4. Working from the right-hand end and replacing the x's with the binary digits of the character, we get 11000100 10000000 (C4 80 in hex).
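The steps above can be checked in Python by building the two bytes by hand and comparing the result with the built-in encoder:

```python
# Encode U+0100 (Ā) by hand, following the two-byte pattern
# 110xxxxx 10xxxxxx, then compare with Python's UTF-8 encoder.
code_point = 0x0100
low6 = code_point & 0b111111          # rightmost six bits -> second byte
high5 = code_point >> 6               # remaining five bits -> first byte
manual = bytes([0b11000000 | high5, 0b10000000 | low6])

print(manual.hex())                   # c480
print('Ā'.encode('utf-8').hex())      # c480
```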

Attachment: alphatobinaryextended.ypr (97k), uploaded by joebloggsnz, 15 Jun 2011.