Representing Text Using ASCII and Unicode

All types of data are stored inside the computer as numbers. Each character in text is stored as a number that represents the character. In personal computers, a common way to do this is to use ASCII (the American Standard Code for Information Interchange) or Unicode. Each of these is a standard that links numbers to characters, e.g. 65 is A, 66 is B, etc. The range of characters that the computer can represent is called the character set. Standard ASCII uses 7 bits, giving codes 0-127, and extended ASCII uses 8 bits, so its codes go up to 255. Unicode codes go up to 1,114,111 (hexadecimal 10FFFF) and can take up to four bytes to store, so there are many more symbols. Characters represented by the lower numbers (0-127) are the same in both. As you get to the higher numbers you'll see that symbols such as emojis are actually just characters, like letters of the alphabet.
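The link between characters and numbers can be tried out directly. The following is a minimal sketch in Python, using the built-in `ord` and `chr` functions; the helper names `char_to_code` and `code_to_char` are made up for this illustration.

```python
def char_to_code(ch: str) -> int:
    """Return the numeric code of a single character."""
    return ord(ch)


def code_to_char(code: int) -> str:
    """Return the character that a numeric code represents."""
    return chr(code)


print(char_to_code("A"))   # 65
print(code_to_char(66))    # B

# Codes 0-127 are the same in ASCII and Unicode, so the stored bytes match.
print("A".encode("ascii") == "A".encode("utf-8"))  # True

# An emoji is just a character with a higher code.
smiley = "\U0001F600"
print(char_to_code(smiley))          # 128512
print(len(smiley.encode("utf-8")))   # 4 (bytes needed to store it in UTF-8)
```

UTF-8 is only one way of storing Unicode codes as bytes; it is used here because it stores the codes 0-127 in a single byte each, exactly as ASCII does.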

[Interactive example: type a character to see the code that represents it. To see characters with higher codes, change the range of the slider.]

Notice that upper and lower case letters have different codes. There are some multi-coloured characters in the range of roughly 9000-10000. Note that not all characters are visible: spaces, carriage returns and line feeds, for example, are invisible on a web page. Not all Unicode values are assigned to characters, which is why there are gaps in the list of codes. For ease of programming, only 16 bits are used for character codes in this example, so codes above 65535 are not shown.
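These observations can be checked with a short script. This is a sketch only, using Python's standard `unicodedata` module; the function name `describe` and the constant `MAX_CODE` are made up for this illustration, and which codes count as unassigned depends on the Unicode version that ships with your Python.

```python
import unicodedata

MAX_CODE = 0xFFFF  # assumption: the same 16-bit limit as the example above


def describe(code: int) -> str:
    """Return a short label for a character code."""
    if not 0 <= code <= MAX_CODE:
        raise ValueError("code outside the 16-bit range")
    ch = chr(code)
    category = unicodedata.category(ch)
    if category == "Cn":   # no character has been assigned to this code
        return "unassigned"
    if category == "Cc":   # control characters such as line feed
        return "control character (invisible)"
    return unicodedata.name(ch, "unnamed")


print(describe(65))      # LATIN CAPITAL LETTER A
print(describe(97))      # LATIN SMALL LETTER A
print(describe(32))      # SPACE
print(describe(10))      # control character (invisible)
print(describe(0x0378))  # unassigned

# Upper and lower case versions of a letter are 32 apart.
print(ord("a") - ord("A"))  # 32
```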

Although text files tend to be relatively small (compared with sounds and images, for example), you can still use compression techniques to make the file smaller.
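As a rough illustration of the saving, the sketch below compresses some repetitive text with Python's standard `zlib` module. The choice of `zlib` is an assumption made for this example, not a technique named above, and the function names `compress_text` and `decompress_text` are made up.

```python
import zlib


def compress_text(text: str) -> bytes:
    """Store the text as UTF-8 bytes, then compress those bytes."""
    return zlib.compress(text.encode("utf-8"))


def decompress_text(data: bytes) -> str:
    """Reverse compress_text and return the original text."""
    return zlib.decompress(data).decode("utf-8")


text = "the quick brown fox jumps over the lazy dog. " * 50
packed = compress_text(text)

print(len(text.encode("utf-8")))  # 2250 bytes before compression
print(len(packed))                # far fewer bytes after compression
print(decompress_text(packed) == text)  # True: nothing is lost
```

Repeated text like this compresses very well; short or varied text will give a much smaller saving.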