The Importance of Language Representation in Technology

By Erica Veltman

This post was developed as part of the Columbia University course “Multilingual Technologies and Language Diversity” taught by Smaranda Muresan, PhD and Isabelle Zaugg, PhD.  This cross-disciplinary course offering was a joint effort between the Institute for Comparative Literature and Society and the Department of Computer Science, developed through the generous support of the Collaboratory@Columbia.

As a computer science student, I had never heard the phrase “language justice” and had never considered its ties to digital technology before enrolling in “Multilingual Technologies and Language Diversity.” I am lucky to know two widely spoken languages, English and Russian, both of which would be considered thriving according to mathematical linguist András Kornai. I am also lucky that, even though my parents decided not to teach me to read and write in Russian, I was able to use online resources to gain these skills, and that I live in a community where I can use Russian every day. For my family, learning to read and write in English was the priority, and many other families around the world have to make similar decisions.

For a language to survive, it must be passed down from generation to generation. In “Everyone Speaks Text Message,” Tina Rosenberg quotes K. David Harrison, an associate professor of linguistics at Swarthmore College, who noted that, “Whether a language lives or dies…is a choice made by 6-year-olds. And what makes a 6-year-old want to learn a language is being able to use it in everyday life” (Rosenberg, 2011). However, simply learning the language as a child is not enough, because it is not a skill that you automatically retain for the rest of your life. As the saying goes, “You use it, or you lose it,” and this is especially true for languages. Children who learn a language but do not use it will forget it. Ultimately, they will not be able to pass the language on to their own children, and the language will die. In this day and age, the best way for children to be immersed in a language and use it all the time is to have technology that works in that language.

Language vitality is not the only benefit of technological support for underrepresented languages. Anshuman Pandey, a member of the Unicode team, worked on getting Hanifi Rohingya encoded in Unicode. Although this may not mean much to the Rohingya who are oppressed in Myanmar, “the encoding of their language carries considerable symbolic weight because it legitimizes an oppressed minority and their language” (Erard, 2017). In “Global Linguistic Diversity for the Internet,” Deborah Anderson discusses many more reasons why electronic representation is important for languages (Anderson, 2005). The first reason she mentions is the historic significance of languages. She gives the Rosetta Stone as an example: at the time she wrote, the only one of its three scripts in Unicode was Greek, while the other two, Egyptian hieroglyphs and Demotic, remained inaccessible. The second reason is that technology can be crucial to increasing fluency rates. The people of Bali use Balinese in cultural and literary works and it is taught in schools; however, Bahasa Indonesia is the national language of Indonesia and is the main one taught in schools and used in official capacities. The Balinese people themselves want to see Balinese encoded in Unicode so that people will have access to learning materials. Next, she mentions the practical effects of having documents in one’s mother tongue, such as being able to read medical documents where there are no translators and being able to communicate with people over the internet if disaster were to strike. Other reasons include the empowerment of minority groups and the ability to search online text documents.

We tend to think of technology as something that can only help by keeping more languages alive, but the current availability of tools is actually detrimental to speakers of certain languages. In “Language Presence in the Real World and Cyberspace,” Daniel Prado discusses what he calls a “negative feedback loop.” When a language has few resources, its speakers turn to languages that are better equipped. Prado writes, “Low productivity is a key risk faced by languages in cyberspace. It causes their speakers to turn to better equipped languages, triggering a negative feedback loop: less productivity, less audience; less audience, less productivity” (41). In this paper, he also cites statistics that show how the decisions tech companies make are hurting languages. One fact that stuck with me was that in March 2011, Google supported Icelandic, a language spoken by 240,000 people, but did not support languages such as Bengali, Javanese, Tamil, Malay, Hausa, Yoruba, Fulani and Quechua, all of which are spoken by 10 to 200 million people each. Tech companies are actively making decisions like these, which, arguably, hurt more people than they help. The resources that went to building tools to support Icelandic could have gone to a more widely used language that happens not to be European. At the time, Google supported thirty European languages but only one African language and no indigenous American or Pacific languages. These companies are simply going where they think the money is, and that is unfair to many groups of people.

I believe that, as computer scientists, it is our job to provide speakers with tools and let them decide whether to use them. In “Digital Language Death,” Kornai looked at two official varieties of Norwegian, Bokmål and Nynorsk. At one point, they had the same presence online, but at the time his paper was written, the Bokmål Wikipedia page was four times the size of Nynorsk’s. Kornai had assumed the opposite would happen, since those who speak Nynorsk have better access to computers and technology, but instead “the Norwegian population has already voted with their blogs and tweets to take only Bokmål with them to the digital age” (Kornai, 2013). Where support is lacking, some people resort to using other alphabets and transliterating their mother tongues. Rosenberg tells the story of a truck driver in Siberia who uses the Cyrillic alphabet as the writing system for Chulym, an endangered language. It should not be up to the technology which language will take over; it should be up to the people. This connects to Prado’s “negative feedback loop.” If we provide the support, we can break out of the loop.

It seems that the main barrier to providing support for all languages is getting more big tech companies on board. There are plenty of people who are interested in making tools for underrepresented languages. In fact, there are probably hundreds of third-party keyboards available through app stores for languages that operating systems do not natively support. Keyman alone has over a thousand keyboards available for download. Unicode now supports more than 100 different writing systems and is increasing that number with every update. The group working on Unicode consists of volunteers from large tech companies like Apple, Facebook and Google. Michael Erard’s “How the Appetite for Emojis Complicates the Effort to Standardize the World’s Alphabets” describes how Unicode came to be, the important work the organization does now and how emojis became one of its responsibilities. The Unicode Consortium was incorporated in 1991 and consisted of a group of computer scientists brought together by Joe Becker. One of these founding members was Ken Whistler. According to Erard, “Unicode’s idealistic founders intended to bring the personal-computing revolution to everyone on the planet, regardless of language. ‘The people who really got the bug,’ Whistler says, ‘saw themselves at an inflection point in history and their chance to make a difference’” (Erard, 2017). As Whistler’s words suggest, there are people who want to do this kind of work and make a difference; it seems to be a matter of having financial backing. Lamine Dabo ran into the same issue when he spoke to manufacturers about creating a cheap cell phone with N’Ko as the main language. Despite the amount of interest from distributors, manufacturers claimed there wasn’t enough money to be made.

These are unfortunate cases of shortsightedness and of putting profit over people, which may be the largest hurdle in the way of support for many languages. With corporate social responsibility becoming more important to businesses, we can only hope to see change soon. If not, publicizing these issues can prove powerful, as it did in the case of Uber and its mishandling of multiple situations in recent years. #BoycottUber has trended multiple times due to these situations, and Uber saw a decrease in ridership after every scandal (Isaac, 2017). I personally know people who were among the many who stopped using Uber and switched to Lyft after Uber’s 2017 sexual harassment scandal. Public opinion plays a huge role in the success or failure of a business and may be an important tool in getting lower-resource languages the support they deserve. Because social media plays such a large role in society today, a first step may be to launch social media campaigns to raise awareness of the issue. There are people out there who care about the representation of minority groups and just need to be shown that representation issues exist in technology and its development as well.


