Python Can’t Add Arabic Unicode digits

Arithmetic in Python is pretty easy, right? For example, just add numbers:

5 + 3

But just try this with digits from another writing system, like the ones used with the Arabic script (it’s the same operation as above: 5 + 3):

۵ + ۳

  Cell In[3], line 1
    ۵ + ۳
    ^
SyntaxError: invalid character '۵' (U+06F5)

Oh no! Python can’t do it! Why not? Well, we already know that Python 3 handles text from other languages well, because every string (str) is a sequence of Unicode characters, which is very good. So, let’s debug a little by examining the Arabic-script number 5, “۵”, in detail.

How does Python represent ۵ (5) internally?

Exploring Arabic’s Digit: 5

Let’s convert our Arabic digit to bytes since bytes are the format closest to the bare-metal of computers:

# has to be within quotes, else error! TODO: check C code for why...in Part 2?
five = "۵"
five.to_bytes(1)
# argument of 1 means to only convert to 1 byte. We can make it bigger if we want: 2, 3 or more bytes, but it'll just have lots of zeros.

# this won't work, either:
# bytes(["۵"])

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[8], line 3
      1 # has to be within quotes, else error! TODO: check C code for why...in Part 2?
      2 five = "۵"
----> 3 five.to_bytes(1)
      4 # argument of 1 means to only convert to 1 byte. We can make it bigger if we want: 2, 3 or more bytes, but it'll just have lots of zeros.
      5 
      6 # this won't work, either:
      7 # bytes(["۵"])

AttributeError: 'str' object has no attribute 'to_bytes'

Oops, only Python’s int object has the to_bytes() method! Here, Python clues us in: once we wrap the character in quotes it is a string (str), not a number and certainly not an integer. Strings are turned into bytes by encoding them, so let’s try that:

"۵".encode("utf-8", "strict")

b'\xdb\xb5'

Cool! We’re seeing the individual bytes here; in a Python bytes literal, \x introduces one byte written as two hexadecimal digits. The bytes are db and b5, but these two bytes are only how the UTF-8 encoding stores the character; they are not the same thing as its Unicode code point. Let’s write this digit by its code point, U+06F5, and compare it to the individual bytes we saw above:

"\u06f5"

'۵'

Ah, see how large that value is with four hexadecimal digits? Our Jupyter notebook helpfully renders it as a character for us since we quoted it, so we know that this is just another way to write the very same ۵ in Python.

But, why is it different from b'\xdb\xb5'? The clue is in the prefix \u, which starts an escape sequence. In a string literal, \u says that the following four hexadecimal digits are grouped together and name one Unicode character by its code point, here U+06F5. In a bytes literal, \x names a single raw byte, where the x stands for hexadecimal. So 06f5 is the character’s number in the Unicode standard, while db and b5 are the two bytes that UTF-8 uses to store that number.
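To see how 06f5 turns into db b5, here is a minimal sketch of the two-byte UTF-8 layout (110xxxxx 10xxxxxx), which covers code points from U+0080 to U+07FF. The helper name utf8_two_byte is made up for this example; in real code you would just call str.encode().

```python
# Sketch: hand-encode one code point using the two-byte UTF-8 layout.
# utf8_two_byte is a hypothetical helper, not part of Python.

def utf8_two_byte(codepoint: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= codepoint <= 0x7FF:
        raise ValueError("only code points U+0080..U+07FF take two bytes")
    lead = 0b11000000 | (codepoint >> 6)           # 110xxxxx: the upper 5 bits
    trail = 0b10000000 | (codepoint & 0b00111111)  # 10xxxxxx: the lower 6 bits
    return bytes([lead, trail])

five = "\u06f5"                    # the digit from above
codepoint = ord(five)              # its code point as a plain integer
by_hand = utf8_two_byte(codepoint)
by_python = five.encode("utf-8")

print(f"code point = U+{codepoint:04X} = {codepoint:011b}")
print(f"by hand    = {by_hand!r}")
print(f"by Python  = {by_python!r}")
```

Both routes should give the same two bytes: the code point’s 11 bits are simply split 5 + 6 and padded with the UTF-8 marker bits.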

If we output only the raw bytes b'\xdb\xb5', then we’d have to guess on our own what they stand for: a number? Maybe, but then again maybe those bytes are characters grouped together (a sequence). If they are, then how do we know which encoding they are in? (There were many encoding standards for different languages, but Unicode, usually stored as UTF-8, is the modern standard.)

Let’s re-render it from the individual bytes to make sure it’s still the Arabic 5. This is called decoding:

# Convert the bytes back to the screen character '۵'
b"\xdb\xb5".decode("utf-8")

'۵'

Yup, it’s really a five as a reader of Persian or Urdu would see it.
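That only worked because we told Python the bytes were UTF-8. As a small sketch of the guessing problem, here are the same two bytes read three different ways; the variable names are just for illustration.

```python
# The same two bytes, interpreted three different ways.
raw = b"\xdb\xb5"

as_utf8 = raw.decode("utf-8")      # one character: U+06F5
as_latin1 = raw.decode("latin-1")  # two characters: U+00DB and U+00B5
as_number = int.from_bytes(raw, byteorder="big")  # one big-endian integer

print(repr(as_utf8))
print(repr(as_latin1))
print(as_number)
```

None of these readings is wrong as far as the bytes are concerned; the encoding is extra information that has to travel with them.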

Next, let’s examine “5” as a string in English, the plain ASCII digit (just to be fair), and see if Python gives it the same Unicode treatment as “۵”.

Comparing with English’s Digit: “5”

print("5".encode("utf-8", "strict"))

b'5'

So, is this the actual byte value of 5, then? Well, let’s see a raw numeral 5 and compare it with the quoted '5', in binary this time so they are on an equal footing. We need to show a 16-bit binary format, because I’m foreshadowing a bit here :-)

# as raw binary, close to the metal!
print(f" 5  as binary = {5:016b}")
print(f"'5' as binary = {ord('5'):016b}")
print("\n")

# as hex
print(f" 5  as hex = {5:04x}")
print(f"'5' as hex = {ord('5'):04x}")

 5  as binary = 0000000000000101
'5' as binary = 0000000000110101


 5  as hex = 0005
'5' as hex = 0035

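five = 5
five.to_bytes(1, byteorder="big")
# The first argument, 1, means to convert to only 1 byte. We can make it bigger if we want: 2, 3 or more bytes, but it'll just have lots of zeros.
# byteorder is spelled out because it is only optional from Python 3.11 on; with a single byte the order makes no difference.
# Anyways, this is the same as print(f" 5  as hex = {5:04x}"), just formatted differently.

# same thing: bytes([five])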

b'\x05'

They. are. not. the. same. BUT, that’s OK because they really shouldn’t be the same. We’ll get back to that another time, but for now let’s really get to know the Arabic digits as encoded by the Unicode system.
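For completeness, here is a small sketch that puts the integer 5, the ASCII character '5' and our ۵ side by side in the same 16-bit format. The rows list and its labels are only for this example.

```python
# Compare the integer 5 with the code points of two characters that mean "five".
rows = [
    ("int 5", 5),
    ("'5' (U+0035)", ord("5")),
    ("'\u06f5' (U+06F5)", ord("\u06f5")),
]
for label, value in rows:
    print(f"{label:<16} binary = {value:016b}  hex = {value:04x}")

# Each digit sits at a fixed offset from the zero of its own sequence.
ascii_offset = ord("5") - ord("0")
eastern_offset = ord("\u06f5") - ord("\u06f0")
print(ascii_offset, eastern_offset)
```

The three values differ, but both characters are exactly five steps past their own zero, which is a hint about how a digit character can be mapped back to a number.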

Exploring Arabic Digits in Unicode

In Unicode, Arabic characters are assigned code points within what the standard calls a block: from U+0600 to U+06FF, including both the first and last boundaries of the range. These are four-digit hexadecimal values, so each one fits in 16 bits. You can get a PDF chart of all characters used in Arabic. This range of Arabic characters happily includes other languages which also use the Arabic alphabet such as Persian, Urdu, Sindhi, and my favorite Punjabi dialects: Hindko and Saraiki (Majhi is very nice, too!) The clever people in the Unicode organization also included characters from historical Arabic scripts which were used for Central Asian languages, languages in Indonesia and more.

Each of these numbers is called a code point. A code point is an abstract number rather than a fixed group of bytes; how many bytes it takes depends on the encoding, and in UTF-8 every character in this block takes 2 bytes, as we saw with db and b5. Plot twist: There are two sequences of code points for Arabic digits. One is called “Arabic-Indic” (U+0660 to U+0669), though these digits are mostly only used in Arabic-speaking regions not commonly thought of as “Indic”. They’re named this because the characters used for writing these digits came from ancient India:

٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩

The other is termed “Extended Arabic-Indic” (often called Eastern Arabic-Indic) and is assigned the code points U+06F0 to U+06F9. These variants are used in Iran, Afghanistan, Pakistan and India, which are what I think many people these days would consider as the actual “Indic” regions:

۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹
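If you’d like to check which sequence a digit belongs to, Python’s standard unicodedata module can report each character’s official Unicode name. This is a small sketch; the starts and names variables are only for illustration.

```python
import unicodedata

# First code point of each of the two digit sequences in the Arabic block.
starts = (0x0660, 0x06F0)

names = {}
for start in starts:
    for cp in range(start, start + 10):  # ten digits per sequence
        names[cp] = unicodedata.name(chr(cp))

for cp in (0x0665, 0x06F5):  # the two fives
    print(f"U+{cp:04X} {chr(cp)} {names[cp]}")
```

The two fives carry different names, so they are different characters as far as Unicode is concerned, even though they mean the same number.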

We’ll focus today on this second, Eastern sequence, since it contains the ۵ (U+06F5) that we started with. Let’s take a look at their code points:

import pandas as pd
from IPython.display import display  # explicit import; Jupyter also provides display() as a built-in

# The Extended (Eastern) Arabic-Indic digits, U+06F0 through U+06F9
east_digits = "۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹".split(" ")
east_codepoints = [
    f"{cp:04X}" for cp in range(0x06F0, 0x06F9 + 1)
]  # recall that Python's range() function excludes the second arg, thus we add 1 to keep it.

# One row per digit: (character, code point); transposed for a wide display
df_east = pd.DataFrame(zip(east_digits, east_codepoints))
display(df_east.transpose().style.hide(axis="index"))

0	1	2	3	4	5	6	7	8	9
۰	۱	۲	۳	۴	۵	۶	۷	۸	۹
06F0	06F1	06F2	06F3	06F4	06F5	06F6	06F7	06F8	06F9

But the Arabic script is written right-to-left, so let’s be faithful to that and reverse the display direction!

display(df_east.iloc[::-1].transpose().style.hide(axis="index"))

9	8	7	6	5	4	3	2	1	0
۹	۸	۷	۶	۵	۴	۳	۲	۱	۰
06F9	06F8	06F7	06F6	06F5	06F4	06F3	06F2	06F1	06F0

Even though we now see how Unicode classifies Arabic digits as numerals, this doesn’t mean programming languages will follow suit.
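Here is a quick sketch of that classification, using the standard unicodedata module. It checks the ASCII five and both Arabic-script fives; the fives tuple and the info dictionary are just names for this example.

```python
import unicodedata

# ASCII five, Arabic-Indic five, Extended Arabic-Indic five
fives = ("5", "\u0665", "\u06f5")

info = {}
for ch in fives:
    info[ch] = (
        unicodedata.category(ch),  # general category; "Nd" means decimal digit
        unicodedata.decimal(ch),   # the numeric value Unicode assigns to it
        ch.isdecimal(),            # str method built on the same data
    )
    print(f"U+{ord(ch):04X}", info[ch])
```

All three characters land in the same category with the same value, so Unicode itself treats them as equivalent digits.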

Parsing digits in Python

So, back to our original question: why does adding ASCII digits work but Arabic digits do not? As we saw above when we printed the byte values of all three versions of 5, Python only sees the raw, unquoted 5 as a “real” numeral.

Python doesn’t accept an unquoted ۵ as a number, and inside quotes it’s just a string. But, Python does think that 5 is a number… It’s about the parsing… After all, even a raw 5 is just a character in a text file, just without quotes, and not “really” a number. Numbers for arithmetic have to be converted to bytes for the CPU to do its thing with them. Thus, digits in source code are translated to raw bytes first, before operating on them: adding, subtracting and so on (arithmetic!). Let’s see in Part 2…
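As a small preview, here is a sketch of where the line sits: the parser rejects these digits in source code, while the built-in int() accepts any Unicode decimal digit when given a string. The source string and variable names are only for this example.

```python
# The expression from the top of the post, held as text instead of code.
source = "\u06f5 + \u06f3"

# Ask Python to parse it as an expression, without running anything.
try:
    compile(source, "<example>", "eval")
    parsed = True
except SyntaxError as err:
    parsed = False
    print("SyntaxError:", err.msg)

# Converting the quoted digits at runtime is a different story.
total = int("\u06f5") + int("\u06f3")
print(total)
```

So the limitation is in how Python reads numeric literals in source code, not in what it can do with the characters once they are strings.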

Until then, if you want more Unicode, then the Python docs on its Unicode support make for great reading on a Saturday night :-)

Mashq and Machine