Language Preservation through Character Encoding

What is Unicode? And how can it help in the quest to preserve the endangered languages of the world?

Before February 15, 2018, I had a partial answer to the first of these two questions. I had heard the name “Unicode” and had a vague idea of what it was. But I had no idea how it could be discussed in the context of preserving languages. So, when I first heard the title of Dr. Deborah Anderson’s talk, “Preserving the World’s Languages and Cultures (Through Character Encoding),” I was both intrigued and, to be honest, somewhat skeptical. The title suggested to me a grand notion of technology’s role in efforts to preserve language, and even culture: was it possible to revive a language on the brink of death, and by extension its accompanying culture and history, simply through encoding? Dr. Anderson’s talk was very enlightening, to say the least. It was also incredibly inspiring in that it showed the resourcefulness required in the arduous task of language revitalization.

On the night of February 15, we welcomed Dr. Anderson for her public lecture, and the following day, a smaller seminar including faculty and graduate students convened to hear her continue the previous night’s talk in more depth and to discuss the topic, “Making the world’s scripts accessible: The Script Encoding Initiative in 2018 and beyond.” What follows is an attempt to distill, to the extent that I can, both of her remarkable talks, along with the ideas that her presentations brought up.

Dr. Anderson introduced us to UC Berkeley’s Script Encoding Initiative (SEI), which she founded to provide a bridge between user communities and the Unicode Technical Committee. But what is Unicode, exactly? Dr. Anderson was quite clear in her explanation (with the use of very helpful slides that can be found here): Unicode is an international character encoding standard. In this context, a character is a unit in a writing system that has meaning. Encoding, therefore, means assigning a unique number, that is, a “code point,” to a character. In Unicode, for example, a = U+0061, b = U+0062, and so on (the numbers are written in hexadecimal). Because every device that follows the standard agrees on these numbers, the Unicode Standard allows us to send text between different electronic devices. One must keep in mind, though, that Unicode encodes characters by script, and not by language.
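To make the idea of a code point concrete, here is a minimal Python sketch. It is my own illustration, not something taken from Dr. Anderson's slides, and the helper name `code_point` is one I made up for this post. It uses only Python's built-in `ord` and the standard `unicodedata` module to print the number and official name that Unicode assigns to a character.

```python
import unicodedata


def code_point(ch: str) -> str:
    """Return the code point of a single character in the usual U+XXXX form.

    Hypothetical helper for illustration; ord() gives the number Unicode
    assigns to the character, and we format it as four-digit hexadecimal.
    """
    return f"U+{ord(ch):04X}"


# 'a' and 'b' are the examples from the talk; the third character is the
# Cherokee letter A, written as an escape so it survives any font.
for ch in ["a", "b", "\u13A0"]:
    print(code_point(ch), unicodedata.name(ch))
```

The Latin letters and the Cherokee letter are handled in exactly the same way: each is just a number with a name attached, which is what lets text travel intact from one device to another.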


What is the role that SEI plays? It turns out that from the time that an entity proposes a script be added to Unicode (this stage consists of writing a thorough, well-researched proposal), it takes a considerable amount of time for the script to be usable on digital platforms and devices. In the case of Cherokee, for example, it took sixteen years from the time of its proposal to get the script onto a device, in other words, to get it up and running. The sooner the scripts of endangered languages become functional on social media, the greater the incentive for the youth within the communities of these languages to engage with and communicate more in their own language. The main role that SEI takes on, therefore, is helping user communities get their proposals to Unicode, and reducing the aforementioned time by funding the proposals so that they will be approved with the fewest revisions possible. SEI thus acts as an intermediary between scholars and linguistic experts on one side and the Unicode Technical Committee on the other. I mention scholars because, as it turns out, scripts of historical significance that are no longer in use are also encoded as part of the SEI project; a current SEI project involves encoding Mayan Hieroglyphs! A secondary goal of SEI is to help make scripts accessible to users by ensuring that fonts, input devices/keyboards, software updates, and so on, are in tune with the scripts that have been added.

Over the past sixteen years, thanks to SEI, over 70 scripts and sets of characters have been added to Unicode. The distinction here between “scripts” and “sets of characters” matters: a script like Arabic may already have been in Unicode and usable by Persian users, and yet some of the characters that Persian needs were not included in the Arabic script. Adding that set of characters would allow Persian users to type on different devices and platforms without having to make compromises. (Note: Persian was not one of the scripts that the SEI was involved in; I am merely using it here as an example of the above-mentioned distinction.)
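The Persian example can also be seen in the character names themselves. The short Python sketch below is again my own illustration; the choice of these four letters and the helper name `describe` are assumptions made for this post, not something from the talk. It looks up four letters used in writing Persian and shows that Unicode files them under the Arabic script, which is what “encoding by script, and not by language” looks like in practice.

```python
import unicodedata

# Four letters used in writing Persian (peh, tcheh, jeh, gaf), given as
# escapes so the example does not depend on the reader's font.
PERSIAN_LETTERS = ["\u067E", "\u0686", "\u0698", "\u06AF"]


def describe(ch: str) -> tuple[str, str]:
    """Return (code point, official Unicode name) for one character.

    Hypothetical helper for illustration only.
    """
    return (f"U+{ord(ch):04X}", unicodedata.name(ch))


for letter in PERSIAN_LETTERS:
    point, name = describe(letter)
    print(point, name)  # every name begins with "ARABIC LETTER"
```

None of the names mentions Persian: the letters belong to the Arabic script, and any language written in that script draws on the same pool of characters.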

In the quest to get scripts encoded, there are challenges facing both SEI and the user communities. For SEI, there’s the issue of access to sufficient written materials as source documents for the script, access to users of the script who have significant knowledge of it (here geography becomes an issue, since some of these users are in hard-to-reach locations), and script politics. For example, in the case of Old Hungarian, the approval of the script took seventeen years because of internal disputes over what to name it in Unicode. For the user communities, as mentioned earlier, SEI tries to make the scripts accessible to them as soon as possible, and while Microsoft, Apple, and Google have supported “lesser-used” scripts with fonts and software, support in social media is lagging behind. For instance, Balinese is not supported on Facebook; when the social media giant was contacted by SEI, the response they gave was that Balinese “was not high on their priority list.” Naturally, politics play a role in this area as well, as certain languages are deemed more important, and in all likelihood, more profitable than others.

Social media is lagging behind in providing support for “lesser-used” scripts. Photo by Tracy Le Blanc.

What is the payoff of all of this in the realm of language and culture preservation? Once again, Dr. Anderson was crystal clear in explaining the results they had gotten from their projects. Adding the scripts of endangered and historic languages to Unicode “promotes ethnic pride and identity; preserves the memory of past cultures and achievements digitally; the scripts are used to create literary materials in various languages; it allows sharing of research with colleagues and publishers; makes it possible to search within documents; and will save text in a long term standardized way.” A comment that was made during Q&A, however, questioned the merits of standardization: doesn’t standardization destroy diversity?

Several months later, as I’m now contemplating this comment in particular and the event in general, I’m reminded of a conversation that took place during the Q&A in our recent Symposium. While I don’t remember the exact questions, I do remember some critiques that were being made regarding public policies in language preservation efforts that were discussed during the first panel. Two responses stood out to me, one of which was given by Dr. Mariam Aboubakrine, the Chair of the United Nations Permanent Forum on Indigenous Issues, and the other by Dr. Dmitrii Harakka-Zaitsev, one of the vice-chairs of said Permanent Forum. Both responses were simple and poignant. Dr. Aboubakrine candidly stated that compromises need to be made, even while acknowledging that the prominent languages we use are the languages of the colonizer: “I wouldn’t be able to speak here if I didn’t speak English.” And Dr. Harakka-Zaitsev emphasized the importance of optimism in the quest to preserve and revive the endangered languages of the world, even when, or better yet especially when, the prospect of this quest seems bleak.

Optimism and compromise: thinking back to the skepticism I expressed at the beginning of this post, I witnessed these two simple yet profound concepts, acknowledged by pioneers and professionals in the work of indigenous issues (including language rights), in the critical work that Dr. Anderson and her colleagues are carrying out at SEI.


By Atefeh Akbari Shahmirzadi, Graduate Research Fellow in Global Language Justice and PhD Candidate in English and Comparative Literature at Columbia University.
