Understanding character encoding requires a firm grasp of bits and bytes. In this post I will try to make it as clear as possible how ASCII and UTF-8 work by doing the encoding by hand…
It can be thought of as a follow-up to the excellent “What every programmer absolutely, positively needs to know about encodings and character sets to work with text”.
Let’s play around with the letter “a”:
printf "a" > a.txt
In this case $LANG is set to a UTF-8 locale, so the bytes written to the file follow the rules of UTF-8. In other words, “a” is encoded to bytes according to the UTF-8 standard. When we use cat, we see those bytes interpreted as UTF-8 again:
cat a.txt
a
So that is just a UTF-8 interpretation of the file. But which bytes does the file really contain? With xxd we can make a binary dump:
xxd -b a.txt
00000000: 01100001 a
In UTF-8, “a” is 8 bits (1 byte). Let’s try another kind of dump - the hexadecimal dump - or hexdump:
xxd a.txt
00000000: 61 a
Now you are seeing 61, which is the hexadecimal representation of 01100001.
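The same two views can be reproduced in Python. This is a small sketch that is not part of the shell session above, and the variable names are only illustrative:

```python
# Encode the letter "a" with UTF-8 and look at the single resulting byte.
encoded = "a".encode("utf-8")
byte_value = encoded[0]                  # indexing bytes yields an int

binary_view = format(byte_value, "08b")  # zero-padded 8-bit binary string
hex_view = format(byte_value, "02x")     # two-digit hexadecimal string

print(len(encoded), binary_view, hex_view)
```

It prints the length in bytes followed by the binary and hexadecimal forms that xxd showed.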
You may not know what a hexdump is, how to interpret hexadecimal numbers or how to count with them, but the simplest facts are:
- Hex means 6 (think hexagon)
- Decimal means 10 (think decilitre)
In the context of hexadecimal, decimal means we have the symbols 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. These are 10 symbols. Hexa tells us we have 6 other symbols - A, B, C, D, E and F. Hexadecimal is also called “base 16”. Hexadecimal is a very nice way of counting computer data. I will not explain how it works, because LiveOverflow and David J. Malan have great videos on it already.
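For a quick feel before watching those videos, here is a minimal sketch of the idea that every hexadecimal digit is worth 16 times the digit to its right. The helper name hex_to_int is made up for this example; Python's built-in int(text, 16) does the same job:

```python
# The 16 symbols of hexadecimal, in order of their value (0 to 15).
digits = "0123456789abcdef"

def hex_to_int(text):
    """Convert a hexadecimal string such as '61' to an integer by hand."""
    value = 0
    for ch in text.lower():
        # Shift what we have one hex position to the left, then add the new digit.
        value = value * 16 + digits.index(ch)
    return value

print(hex_to_int("61"))   # 6 * 16 + 1
print(int("61", 16))      # the built-in performs the same conversion
print(hex_to_int("10"))   # one sixteen and zero ones
```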
Let’s try to write a hexdump manually to create a file that will show an “a” if interpreted as UTF-8:
echo "0: 61" | xxd -r
a
Let’s get an xxd-formatted hexdump from the hexdump we wrote ourselves:
echo "0: 61" | xxd -r | xxd
00000000: 61 a
We can convert it to binary like so:
echo "0: 61" | xxd -r | xxd -b
00000000: 01100001 a
Can you write your name using this method? You can use man ascii to figure out which hexadecimal values you have to use. Hey, isn’t it fun being 4 years old again?
printf '41 6e 64 65 72 73' | xxd -revert -plain
Anders
Let’s try some slicing and dicing with offsets and the dd utility. First, make a new file:
printf "cat dog giraffe lion monkey bird" > animals.txt
How does it look?
xxd animals.txt
00000000: 6361 7420 646f 6720 6769 7261 6666 6520 cat dog giraffe
00000010: 6c69 6f6e 206d 6f6e 6b65 7920 6269 7264 lion monkey bird
How can we get lion from this? We know the “l” is at offset 00000010, right? Let’s use dd with a block size of 1 byte and skip the first 16 bytes (00000010 in hexadecimal is 16 in decimal).
dd if=animals.txt bs=1 skip=16
lion monkey bird
Wow! With ASCII, each letter is 1 byte, so we need 4 bytes to catch the lion:
dd if=animals.txt bs=1 skip=16 count=4
lion
Unstoppable!
We could also use xxd in a roundabout way:
xxd -seek 16 -len 4 -plain animals.txt | xxd -revert -plain
lion
If you don’t understand what I am doing here, you should remove parts of the pipeline to reveal the data.
Now that you know how to write bytes by hand, I recommend opening a file and activating hexl-mode in Emacs.
Fun fact: Some people use xxd to get a poor man’s hex editor inside Vim by dumping the whole buffer with %!xxd and reverting it with %!xxd -r.
How about Windows vs. Unix newlines? Those things are annoying. Could you convert them by hand, instead of using dos2unix and unix2dos?
I’ll leave it up to you.
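If you want something to compare your own attempt against, here is one possible sketch in Python. It assumes Windows newlines are the byte pair 0d 0a (CR LF) and Unix newlines are the single byte 0a (LF); the function names are hypothetical and only loosely mirror the command-line tools:

```python
def dos_to_unix(data: bytes) -> bytes:
    """Replace every CR LF pair (0d 0a) with a single LF (0a)."""
    return data.replace(b"\r\n", b"\n")

def unix_to_dos(data: bytes) -> bytes:
    """Normalise to LF first so existing CR LF pairs are not doubled."""
    return dos_to_unix(data).replace(b"\n", b"\r\n")

windows_text = b"cat\r\ndog\r\n"
print(windows_text.hex(" "))               # hex view with Windows newlines
print(dos_to_unix(windows_text).hex(" "))  # hex view with Unix newlines
```

Comparing the two printed hex strings shows each 0d 0a pair shrinking to a single 0a.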
Python bonus round
Writing your name:
print(bytes.fromhex('61 6e 64 65 72 73').decode("UTF-8"))
anders
With the \x escape sequence:
print(b"\x61\x6e\x64\x65\x72\x73".decode("UTF-8"))
anders
Catching the lion:
with open("animals.txt", "rb") as binary_file:
    binary_file.seek(16)
    lion = binary_file.read(4)
    print(lion.decode("UTF-8"))
lion
The Python bytes type can be created from ASCII characters or hex escape sequences, so all of these are the same:
a = b"\x61\x6e\x64\x65\x72\x73"
a2 = b"anders"
a3 = b"a\x6e\x64\x65\x72\x73"
print(a == a2 == a3)
True
The bytes type is immutable, so we need to use the bytearray class to modify sequences of bytes.
a = bytearray(b"anders")
A bytearray is a sequence of integers (0-255), so the bytearray above looks like this:
a[0] | a[1] | a[2] | a[3] | a[4] | a[5] |
---|---|---|---|---|---|
97 | 110 | 100 | 101 | 114 | 115 |
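Both claims are easy to check. The snippet below is a small sketch (the name frozen is only illustrative) that lists the integers inside the bytearray and shows what happens when you try to change a plain bytes object:

```python
a = bytearray(b"anders")
print(list(a))                # the integers stored in the bytearray

frozen = b"anders"
try:
    frozen[0] = 65            # bytes does not support item assignment
except TypeError as error:
    print("bytes is immutable:", error)
```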
To modify a single element in the bytearray we have to assign an integer. We can use ord() to convert “A” to its integer value:
a = bytearray(b"anders")
a[0] = ord(b"A")
print(a)
bytearray(b'Anders')
To go from decimal to hex:
print(hex(65))
0x41
When slicing, we get a bytearray back:
a = bytearray(b'ANDers')
print(a[0:3])
bytearray(b'AND')
So to replace a slice we assign bytes rather than a single integer:
a = bytearray(b"anders")
a[0:3] = b"AND"
print(a)
bytearray(b'ANDers')
How do we write these things to files?
with open("anders.txt", "wb") as f:
    # "and" interpreted as ASCII and the rest is interpreted as hexadecimal
    f.write(b"and\x65\x72\x73")
Let’s read the file again:
with open("anders.txt", "rb") as f:
    print(f.read())
b'anders'
When we open a file in binary mode (wb/rb), we read and write bytes (b"anders") rather than str.
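To see the difference between the two modes, here is a minimal sketch. The word “æble” and the throwaway file in a temporary directory are arbitrary choices for the illustration; the point is that binary mode hands back the stored bytes, while text mode decodes them into a str:

```python
import os
import tempfile

# Write a throwaway file so the example does not depend on anders.txt.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "wb") as f:
    f.write("æble".encode("utf-8"))   # "æ" takes two bytes in UTF-8

with open(path, "rb") as f:
    raw = f.read()                     # bytes, exactly as stored on disk

with open(path, "r", encoding="utf-8") as f:
    text = f.read()                    # str, decoded by Python while reading

print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))
```

The byte count is one higher than the character count because “æ” does not fit in a single UTF-8 byte.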
It’s possible to create a file-like object in memory by using io.BytesIO. You might want to do this when you have some library that wants to write binary data to a file. You could for example generate a plot with matplotlib and add it to a PDF.
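As a sketch of that pattern using only the standard library, the example below lets zipfile write into a BytesIO instead of a file on disk. zipfile is just a stand-in here for any library that expects a binary file object, and the archive member name is made up:

```python
import io
import zipfile

buffer = io.BytesIO()          # in-memory stand-in for a file on disk

# zipfile only needs a file-like object, so nothing is written to the filesystem.
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("animals.txt", "cat dog giraffe lion monkey bird")

zip_bytes = buffer.getvalue()  # everything the library wrote, as bytes
print(len(zip_bytes), zip_bytes[:2])
```

The resulting bytes can then be passed on to whatever needs them, for example an HTTP response or another library.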
To “emulate” the f variable from above, we could do this:
import io
f_in_memory = io.BytesIO()
f_in_memory.write(b"and\x65\x72\x73")
f_in_memory.seek(0)
print(f_in_memory.read())
f_in_memory.close()
b'anders'
If you don’t seek to the beginning (0), the read would return an empty bytes object (b'') here.
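The effect of the position is easy to observe. This sketch reads once without seeking and once after seeking, and also uses getvalue(), which returns the whole buffer regardless of the current position:

```python
import io

f_in_memory = io.BytesIO()
f_in_memory.write(b"anders")

position_after_write = f_in_memory.tell()  # the position now sits at the end
without_seek = f_in_memory.read()          # nothing left to read from here
everything = f_in_memory.getvalue()        # ignores the current position

f_in_memory.seek(0)
after_seek = f_in_memory.read()            # reads from the start again

print(position_after_write, without_seek, everything, after_seek)
```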
Another cool alternative is SpooledTemporaryFile, which keeps the data in memory (using BytesIO or StringIO) up until a certain size and then rolls over to a file on disk:
import tempfile
with tempfile.SpooledTemporaryFile(max_size=100, mode="w+t", encoding="utf-8") as temp:
    print("temp: {!r}".format(temp))
    for i in range(3):
        temp.write("This line is repeated over and over.\n")
        print(temp._rolled, temp._file)
temp: <tempfile.SpooledTemporaryFile object at 0x1065837f0>
False <_io.TextIOWrapper encoding='utf-8'>
False <_io.TextIOWrapper encoding='utf-8'>
True <_io.TextIOWrapper name=3 mode='w+t' encoding='utf-8'>
Resources
- Unicode & Character Encodings in Python: A Painless Guide – Real Python
- Strings, Unicode, and Bytes in Python 3: Everything You Always Wanted to Know
- Python Cookbook, 3rd Edition
- io - Text, Binary, and Raw Stream I/O Tools — PyMOTW 3
- Working with Binary Data in Python | DevDungeon
- The deal with numbers: hexadecimal, binary and decimals - bin 0x0A
- Abstraction by Professor David J. Malan - YouTube
- CS50 Lectures 2018
- Fluent Python - Chapter 4. Text versus Bytes
- DEFCON 28 Safe Mode - PHV - Take Down The Internet! With Scapy - YouTube