The Cryptographic Font
1. What? Why?
One of the first projects I ever worked on was a simple cypher tool written in Python which could encode and decode messages. At the time I thought this was the coolest thing ever, and today I stand by that. When done right, cryptography can seem like magic even if you know how it works.
From National Treasure to the fantasy world of Harry Potter, secret messages are quite prevalent in pop culture for various reasons. My favorite theory is that we love secret messages because they make their contents seem more important. Why would you bother encoding a message if it weren't anything special?
In many cases this can create a sense of intimacy between the communicators since encoding a message intentionally narrows the potential audience considerably. In reality cryptography is something most people use countless times every single day without realizing. So much web traffic and so many digital devices are encrypted or cryptographically signed, but that just adds to the magic of it for me.
For most of high school I had this idea floating around in my head to try to create a cryptographic font. What is a cryptographic font? Fonts dictate the way that different characters such as letters, digits, and punctuation are displayed on our screens. I wanted to figure out a way to alter how different characters looked so that an "L" might appear as a "C" for example.
The result would be that most people would struggle to read something using this font since all of the characters would be wrong and it would look like someone had smashed their forehead on their keyboard.
So why did I want to create something seemingly useless? Why would I create a font that is intentionally hard to read?
If I were to learn how to read such a font, if I put in the hundreds of hours necessary to re-learn how to read, then I would be able to read text displayed by this font while other people would not. Everything I ever read could be a kind of secret message, because no one but me is likely to put in the time to be able to read my cryptographic font!
There are limitations to this concept, and it is by no means perfect, but I just had to give it a go.
2. How Fonts Work
In 1991 Apple released the TrueType font system, which is the basis for most modern fonts. A TrueType font file has the extension ".ttf" and, like many file types, contains a number of tables which store data. Some of the tables describe the general behavior of the font, such as whether it is left-to-right or right-to-left, and other tables tell your computer how to draw the different "glyphs," which are the shapes you see on the screen.
The table I was interested in was the 'cmap' table, which maps decimal character codes to their corresponding glyphs. Without this table your computer would have no easy way to sort out which glyph to choose when trying to display a character such as an "A," which has the decimal character code 65.
Apple publishes documentation for the TrueType file format, but it is incomplete in a lot of places, so I also recommend checking out Microsoft's version. Neither resource is perfect, but I was able to figure everything out by alternating between them. It's also worth mentioning that Microsoft's OpenType font system is nearly identical to TrueType in terms of its general structure.
Don't worry, I'm not just going to restate everything in the specification word-for-word. If you want to learn all the details I would recommend that you start with Apple's documentation and then cross-reference against Microsoft's. This post is meant more to be an overview of how fonts work, as well as how I created my cryptographic font generator.
2.1 Data Types and the Table Directory
Both resources start off by listing the different data types you might encounter in parsing a font file. I started out by writing a readObject() function to read and decode these different data types from the current open file:
import struct

# Assumes `f` is the font file, already opened elsewhere in binary mode.
def readObject(object_type):
    """Read one big-endian value of the given type from the open file f."""
    if object_type == "uint32":
        return struct.unpack(">I", f.read(4))[0]
    elif object_type == "uint16":
        return struct.unpack(">H", f.read(2))[0]
    elif object_type == "uint8":
        return struct.unpack(">B", f.read(1))[0]
    elif object_type == "formal_tag":
        # A tag as a list of four raw byte values
        arr = []
        for i in range(0, 4):
            arr.append(readObject("uint8"))
        return arr
    elif object_type == "informal_tag":
        # A tag as a four-character string such as "cmap"
        tag = ""
        for i in range(0, 4):
            tag += chr(struct.unpack(">B", f.read(1))[0])
        return tag
    else:
        raise ValueError("Type Not Understood")
Note that the ">" symbols denote that all values are stored in big-endian format, meaning the most significant byte comes first. This can be confusing at first, but it's actually a really cool idea and I'm glad that I've now had exposure to it. You can read about how struct.unpack() works in the Python documentation for the struct module.
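To make the byte-order point concrete, here is a minimal sketch that reads the same two bytes both ways. The byte string is made up for illustration and does not come from a real font file.

```python
import struct

raw = b"\x00\x41"  # two example bytes, as they might appear in a font file

big = struct.unpack(">H", raw)[0]     # big-endian: most significant byte first
little = struct.unpack("<H", raw)[0]  # little-endian: least significant byte first

print(big)     # 65, the character code for "A"
print(little)  # 16640, the same bytes read in the wrong order
```

Reading with the wrong byte order does not raise an error; it just quietly produces the wrong number, which is why the ">" matters.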
After this it's just a matter of reading along in the specification and reading the right data types in the right order. Files like these often contain a table or tables at the beginning which state how many other tables can be found in the document, and where to find them. In TrueType fonts this is called the Table Directory.
A few for-loops and readObject() calls later and I had a series of parallel lists containing information about all of the different tables in the font file. From this I narrowed my focus to just the information pertaining to the 'cmap' table.
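As a rough illustration of that step, the sketch below parses the Table Directory and then pulls one table's bytes out by offset and length. It is not the code from the original tool: the function names are hypothetical, it uses a dictionary instead of parallel lists, and it runs against a tiny synthetic "font" built in memory rather than a real .ttf file.

```python
import io
import struct

def read_table_directory(f):
    """Return {tag: (checksum, offset, length)} for an open font file."""
    f.seek(0)
    # Header: sfnt version (uint32), then numTables, searchRange,
    # entrySelector and rangeShift (uint16 each).
    _version, num_tables, _sr, _es, _rs = struct.unpack(">IHHHH", f.read(12))
    tables = {}
    for _ in range(num_tables):
        # Each record: 4-byte tag, checksum, offset, length (uint32 each).
        tag, checksum, offset, length = struct.unpack(">4sIII", f.read(16))
        tables[tag.decode("latin-1")] = (checksum, offset, length)
    return tables

def read_table(f, tables, tag):
    """Seek to a table's offset and read exactly its length in bytes."""
    _checksum, offset, length = tables[tag]
    f.seek(offset)
    return f.read(length)

# Synthetic example: one fake 'cmap' table with a 4-byte payload.
payload = b"\x00\x01\x02\x03"
header = struct.pack(">IHHHH", 0x00010000, 1, 16, 0, 0)
record = struct.pack(">4sIII", b"cmap", 0, 28, len(payload))  # 12 + 16 = 28
fake_font = io.BytesIO(header + record + payload)

tables = read_table_directory(fake_font)
cmap_bytes = read_table(fake_font, tables, "cmap")
print(tables)
print(cmap_bytes)
```

With a real font you would pass a file opened with open(path, "rb") instead of the BytesIO object.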
In order to read a table like the 'cmap' table into memory, you first move to the start of the table using the offset value, and then you read for however many bytes the table is long. OK, so now that we have the 'cmap' table, we can just start editing the mappings, right?
2.2 The 'cmap' Table
Wrong. Nothing could ever be that simple. In order to support many different platforms, the 'cmap' table actually uses several different formats for storing character-to-glyph mappings. The mappings are stored in one or more of these formats in sub-tables within the 'cmap' table.
The font file I was working with (more on that later) was using 'cmap' format 12 to be compatible with the Unicode standard and modern Windows systems, as well as a format 4 sub-table for backwards compatibility with older Windows systems which might not support the newer UCS-4 standard.
Because I was planning on being the sole user of this tool, I only worried about the UCS-4, format 12 sub-table. It wouldn't be too much more work to also edit the format 4 table. For my purposes, however, it would have been an unnecessary step.
Now, you might expect that the character-to-glyph mapping would be a one-to-one mapping, but that's only half of the truth. They're actually defined in blocks (at least in format 12). It can get pretty complicated, but the idea is that if you have a block of consecutive character codes which maps to a block of consecutive glyph codes, it makes more sense to just define the start and endpoints for those ranges, right?
Say character codes 10-20 map to glyph codes 25-35. Saying this is way more efficient than writing:
Character 10 maps to glyph 25
Character 11 maps to glyph 26
Character 12 maps to glyph 27
Character 13 maps to glyph 28
Character 14 maps to glyph 29
Character 15 maps to glyph 30
Character 16 maps to glyph 31
Character 17 maps to glyph 32
Character 18 maps to glyph 33
Character 19 maps to glyph 34
Character 20 maps to glyph 35
We can actually make this even more efficient. There are as many character codes in the range 10-20 as there are glyph codes in the range 25-35, so we only really need to state the start glyph code.
Then we can say that character codes 10-20 map to the block of glyph codes of equal size starting on code 25. In English this takes more words, but the computer only cares about the values 10, 20, and 25, so this is more space efficient.
That's basically how format 12 works! There's a bit more nuance to it, but that's the general idea.
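The block idea can be sketched as a small lookup function. This assumes each group is a (start character code, end character code, start glyph code) triple, as described above; the function name and the example group are made up for illustration.

```python
def lookup_glyph(groups, char_code):
    """Find the glyph code for a character code in a list of groups."""
    for start_char, end_char, start_glyph in groups:
        if start_char <= char_code <= end_char:
            # Offset into the character range equals offset into the glyph range.
            return start_glyph + (char_code - start_char)
    return 0  # glyph 0 is conventionally the "missing glyph"

# The example from the text: characters 10-20 map to glyphs starting at 25.
groups = [(10, 20, 25)]

print(lookup_glyph(groups, 10))  # 25
print(lookup_glyph(groups, 14))  # 29
print(lookup_glyph(groups, 20))  # 35
print(lookup_glyph(groups, 21))  # 0, not covered by any group
```

Only three numbers are stored per group, no matter how many characters the group covers.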
3. Cryptography Time!
Now that we know how the 'cmap' table works, we can start making changes to it. For this I just loaded all of the character codes into two separate lists, one for uppercase and one for lowercase letters. I also created a lookup function to get the existing glyph codes from the character codes, and then I compiled them into a list in the same order as the character codes.
This looks a lot more like the inefficient listing above, with every single code being loaded instead of just the ranges. The font I was using as a base had very small ranges for the alphabetical characters, usually no more than 2 or 3 to a "group." Many of the other types of characters used longer ranges, but the Roman alphabet seemed to be an exception. My guess is that since modern hardware is so much faster than the hardware from 1991, we don't really have to worry about the efficiency of our fonts anymore.
Once I had the glyph codes in arrays, it was simply a matter of shuffling them. I wanted the corresponding uppercase and lowercase letters to still line up, so I created a shuffle key. This was just a list of the numbers 0-25 which I would then use to shuffle the two lists of glyph characters. For instance if 12 ended up at the first index in the shuffle key, then the thirteenth letter of the alphabet ("M") would display wherever the first letter ("A") would display with a normal font.
"""Create shuffle key"""
from random import shuffle
# shuffle() works in place, so it needs a mutable list rather than a range object
shuffle_key = list(range(0, 26))
shuffle(shuffle_key)
"""Shuffle glyphs"""
latinUpperShuffled = []
latinLowerShuffled = []
for pos in shuffle_key:
    latinUpperShuffled.append(latinUppercaseGlyphs[pos])
    latinLowerShuffled.append(latinLowercaseGlyphs[pos])
print(latinUpperShuffled)
print(latinLowerShuffled)
Could I have just used random seeds? Yes, but I also wanted to have the option to manually define my shuffled mapping just in case I decided I wanted to do so in the future.
Some things to note with this manner of shuffling:
- A letter can become itself, and therefore not change under the font.
- This is not a cypher wheel in the traditional sense since not every letter is shifted by the same amount (like if an "A" became a "C" while a "B" became a "D" etc.).
- Since this is a font, there's no easy way to have a code that changes; if an "R" looks like a "K," then every "R" is going to look like a "K."
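The notes above can be checked without touching a font file at all. The sketch below simulates how text would appear under a given shuffle key and lists the letters that map to themselves. The helper names and the hand-written key are hypothetical; the key convention follows the shuffle described earlier, where the value at index i is the letter whose glyph is drawn for letter i.

```python
import string

def preview(text, shuffle_key):
    """Show how text would look if rendered with the shuffled font."""
    upper = string.ascii_uppercase
    lower = string.ascii_lowercase
    table = {}
    for i, pos in enumerate(shuffle_key):
        # Letter i is drawn with the glyph of letter pos, in both cases.
        table[ord(upper[i])] = upper[pos]
        table[ord(lower[i])] = lower[pos]
    return text.translate(table)

def fixed_points(shuffle_key):
    """Letters that still look like themselves under this key."""
    return [string.ascii_uppercase[i]
            for i, pos in enumerate(shuffle_key) if i == pos]

# A manually defined key: swap "A" and "M", leave everything else alone.
manual_key = list(range(26))
manual_key[0], manual_key[12] = manual_key[12], manual_key[0]

print(preview("Aim, Mama!", manual_key))  # Mia, Amam!
print(len(fixed_points(manual_key)))      # 24
```

Because the mapping is fixed, the same input letter always produces the same output letter, which is exactly the limitation described in the last bullet.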
Then it's just a matter of writing the 'cmap' table back into the font file and we're almost done.
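For the writing step, one possible approach is to rebuild the format 12 groups from a per-character mapping and pack them back into bytes. This is a sketch under assumptions, not the original tool's code: the function names and the sample mapping are hypothetical, and it only produces the sub-table bytes. Splicing them into the file and updating the offsets and lengths in the Table Directory is not shown.

```python
import struct

def build_groups(char_to_glyph):
    """Merge a {char_code: glyph_code} dict into consecutive-run groups."""
    groups = []
    for code in sorted(char_to_glyph):
        glyph = char_to_glyph[code]
        if groups:
            start_char, end_char, start_glyph = groups[-1]
            # Extend the last group only if both codes continue the run.
            if code == end_char + 1 and glyph == start_glyph + (code - start_char):
                groups[-1] = (start_char, code, start_glyph)
                continue
        groups.append((code, code, glyph))
    return groups

def pack_format12(groups, language=0):
    """Pack groups into a format 12 sub-table (16-byte header + 12 bytes per group)."""
    length = 16 + 12 * len(groups)
    # Header: format (uint16), reserved (uint16), length, language, numGroups (uint32 each)
    data = struct.pack(">HHIII", 12, 0, length, language, len(groups))
    for start_char, end_char, start_glyph in groups:
        data += struct.pack(">III", start_char, end_char, start_glyph)
    return data

# Made-up mapping: "A" and "B" stay consecutive, "C" points somewhere else.
mapping = {65: 40, 66: 41, 67: 90}
groups = build_groups(mapping)
subtable = pack_format12(groups)

print(groups)         # [(65, 66, 40), (67, 67, 90)]
print(len(subtable))  # 40
```

After a shuffle, most letters end up in single-character groups, since neighboring letters rarely keep neighboring glyph codes.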
4. The 'name' Table, and my Big Mistake.
Now that we have our font "randomized," we might want to give it a name which reflects this, a name which will be displayed inside of programs when we are selecting a font. We can do this by modifying the 'name' table, which is thankfully much simpler than the 'cmap' table.
Or at least it would have been that easy had I chosen to modify a normal font when I started this process. Unfortunately, I decided to just use the MacOS system font thinking it would be the easiest and most standard font. If Apple made it, it should conform to the Apple-created specification, right?
The problem with this is that when I tried to use my randomized font, it kept getting picked up as the system font, and much of the text on my computer became unreadable! Thankfully, I was able to uninstall the font but I was still left with the issue of preventing this problem from recurring.
Eventually I learned that MacOS checks the 'name' table when trying to find the system font, so I decided to just delete the entire table instead of simply overwriting the display-name portion. This way I would start with a fresh name table which would give no indication that at one point this had been a system font.
Unfortunately, whatever I did wasn't completely correct, because it incurred errors when I tried to install the font. At this point I had been working on this project for a couple of days, and I just wanted to be done with it, so I opened the font in a font editing program and then saved it without changing anything. Sure enough, it rebuilt the damaged name table and the font now works exactly as intended!
I should also mention that there's a really obnoxious but perfectly warranted process you must undergo to perform and store checksums on all of the different tables and then on the entire file. The checksums weren't the issue here, though, since the font was being detected; it just had an invalid name table.
"""Calculate Table Checksum"""
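def calcCheckSum(tableData):
    # Pad with zero bytes to a multiple of 4 so the final word unpacks cleanly
    tableData += b"\x00" * (-len(tableData) % 4)
    total = 0
    for i in range(0, len(tableData) // 4):
        total += struct.unpack(">I", tableData[i*4:i*4+4])[0]
    return total % 4294967296  # wrap around to fit a 4-byte unsigned integer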
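The whole-file step can be sketched in the same way. As I understand the specification, the checkSumAdjustment field in the 'head' table is set to zero, the checksum of the entire file is computed, and the stored value is 0xB1B0AFBA minus that sum. The helper names below are hypothetical, and locating and zeroing the field inside a real font is left out.

```python
import struct

def table_checksum(data):
    """Sum the data as big-endian uint32 words, wrapping at 32 bits."""
    data += b"\x00" * (-len(data) % 4)  # pad to a multiple of 4 bytes
    total = 0
    for (word,) in struct.iter_unpack(">I", data):
        total += word
    return total & 0xFFFFFFFF

def checksum_adjustment(font_bytes):
    """Value to store in 'head', assuming the field is already zeroed in font_bytes."""
    return (0xB1B0AFBA - table_checksum(font_bytes)) & 0xFFFFFFFF

print(table_checksum(b"\x00\x00\x00\x01\x00\x00\x00\x02"))  # 3
print(table_checksum(b"\xff\xff\xff\xff\x00\x00\x00\x02"))  # 1, the sum wrapped around
print(hex(checksum_adjustment(b"\x00" * 8)))                # 0xb1b0afba
```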
5. Conclusion
If I had explained every little issue I faced during this project, this post would probably have been three times as long and a lot less interesting. I would have had to talk about byte padding, alignment, the confusing checkSumAdjustment value, and so many other things that just don't make for good blog post material. If you want to see more feel free to check out my repo!
Parsing a file like this manually is usually a bad idea, and you can often find existing tools which will abstract a lot of the process out for you. I would not recommend that anyone try doing this unless the goal is to learn more about file formats, fonts, etc.
That being said, this was one of the most fun problems I've ever solved, and I'm quite happy with the results. While I cannot read anywhere near as fast using my cryptographic font as I can with a normal font, I can now read it without needing a reference, and I've discovered cool tricks like building it into non-DRM protected EPUB files.
I learned so much more doing this project than I would have just reading about file specifications. For me, projects like this will always have value because they give me a deeper reason to care about whatever it is I'm reading and learning. It also doesn't hurt that having a Mission Impossible secret font is really cool, though maybe it's just a tad too nerdy for many people to agree with me.