Multilingual NLP: Why Working with Languages with Complex Scripts is Challenging
Did you know that over 1.5 billion people in the world speak languages that share a script with other languages? In fact, the majority of our world’s population lives in countries with this kind of linguistic diversity. For example, Indian languages like Hindi, Marathi, Nepali and Sanskrit all use the Devanagari script, even though they have distinct vocabularies and grammars. Interestingly, Devanagari belongs to the larger Brahmic family of scripts, which also includes the ancestors of several other scripts such as Tamil and Gurmukhi.
When it comes to natural language processing (NLP), working with language scripts can be challenging, especially when they have complex systems for representing sounds (orthographies). Different languages may use the same characters to represent different sounds, or different characters to represent the same sound. For example, the letter “a” can be pronounced as a short “a” in English, but as a long “aa” in (Romanized) Hindi. This makes it challenging to create word representations that are accurate and consistent across languages.
SO, WHAT IS A ‘SCRIPT’?
In linguistics, a script is a system of written symbols that represent the sounds of a language. Most of us are familiar with the Latin script, used to write English, French, Spanish, and many other Western languages. For example, the sentence “The quick brown fox jumps over the lazy dog” uses all 26 letters of the Latin alphabet, each of which represents one or more sounds. Other major scripts include:
- The Cyrillic (Кириллица) script, used to write Russian, Ukrainian, and many other languages.
- The Arabic (العربية) script, used to write Arabic, Persian, and many languages in the Middle East and South Asia.
- The Hebrew (אלפבית) script, used to write Hebrew.
- The Chinese character (汉字 / 漢字) script, used to write Chinese.
SCRIPTS ARE COMPLICATED.
When it comes to scripts, what you see isn’t always what you get, and that is a major NLP hurdle. The hidden complexities of scripts can trip up even advanced NLP systems, because they make a crucial stage of any NLP pipeline, text preprocessing, significantly more challenging.
When languages share the same script, they may use the same characters to represent different sounds, or different characters to represent the same sound. For example, in English, the letter “a” covers a range of sounds, from the short, schwa-like “a” in “about” to the longer “aa” in “father”. Similarly, in French, “a” may be a plain “a” or a nasalized “a” as in “chante” (sings). Hindi, written in the Devanagari script, instead has separate characters for the long “aa” (आ) and short “a” (अ) vowel sounds. This added precision, expressed through distinct characters and vowel signs (matras), complicates word representation for NLP models trained primarily on English-like scripts, making it difficult to accurately identify and interpret words.
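To make that contrast concrete, here is a minimal Python sketch using only the standard unicodedata module; the example characters are my own and simply show how the two writing systems encode these vowels.

```python
import unicodedata

# Latin long "a" is a base letter plus a combining macron once decomposed.
for ch in unicodedata.normalize("NFD", "ā"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0061 LATIN SMALL LETTER A
# U+0304 COMBINING MACRON

# Devanagari encodes the short and long vowels as two distinct letters.
for ch in "अआ":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0905 DEVANAGARI LETTER A
# U+0906 DEVANAGARI LETTER AA
```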
Handling diacritics poses a further challenge. Diacritics are marks added to letters to change their pronunciation or meaning, and they are crucial for interpreting a word correctly. For example, the letter “a” with a macron over it (ā) is pronounced differently from the plain letter “a”. Many NLP libraries do not handle diacritics well, which can lead to errors in text preprocessing. These seemingly minor omissions can completely change a word’s meaning, the equivalent of turning “apple” into “ppl”.
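To see how a Latin-centric “strip the accents” step can quietly damage Devanagari, here is a sketch of the common recipe of decomposing text and then dropping combining marks; the Hindi word is my own example, not output from any particular library.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Naive accent stripping: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Acceptable (if lossy) for Latin diacritics:
print(strip_marks("crème brûlée"))   # creme brulee

# Destructive for Devanagari: the virama (्) is a combining mark, so the
# conjunct in हिन्दी is silently broken apart.
print(strip_marks("हिन्दी"))
```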
Scripts can significantly impact how text is tokenized. In NLP, tokenization refers to the process of breaking down text into individual words or “tokens.” The way text is tokenized can have a major impact on the performance of NLP tasks such as machine translation and text summarization. For instance, incorrect tokenization can lead to the wrong words being translated or summarized.
Here’s a real-world example: I processed Hindi text through a typical preprocessing pipeline using scikit-learn’s CountVectorizer and the standard tokenizer from SentenceTransformers (SBERT). Unfortunately, this standard tokenizer removes all diacritics from the words, resulting in the following nonsensical output from a topic model:
['ठक', 'सम', 'रगत', 'सच', 'तर']
To someone familiar with Hindi, it’s clear that these aren’t even real words! The original meaning was lost because the tokenizer likely ignored diacritics or other script features. This shows why specialized tokenization methods tailored for complex scripts are crucial for the advancement of NLP.
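I can’t share the original pipeline here, but the following sketch reproduces the general failure mode with scikit-learn’s CountVectorizer. Its default token pattern is built around \w, and in Python’s re module Devanagari vowel signs and the virama are combining marks rather than letters, so word matches break at them. The Hindi sentence is my own example, and the exact tokens you get may vary with your scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A sample Hindi sentence ("With time, everything will be fine").
docs = ["समय के साथ सब ठीक हो जाएगा"]

# Default token_pattern r"(?u)\b\w\w+\b": matras and viramas are not \w,
# so runs of "word characters" break at them; fragments shorter than two
# characters are dropped, and longer words come out mangled.
print(CountVectorizer().fit(docs).get_feature_names_out())

# Matching on the full Devanagari code-point block keeps vowel signs
# attached to their words.
fixed = CountVectorizer(token_pattern=r"[\u0900-\u097F]+")
print(fixed.fit(docs).get_feature_names_out())
```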
WHAT CAN WE DO?
While the challenges of languages with complex scripts can be significant, we can use linguistically informed approaches to address them.
Use specialized NLP libraries to handle languages with complex orthographies. A number of NLP libraries are designed specifically for such languages. These libraries can:
- Accurately handle and preserve diacritics.
- Work with different writing systems.
- Employ specialized word segmentation algorithms.
For Indian languages, my favorite is the Indic NLP Library, which offers functionality ranging from text normalization and word tokenization to translation and transliteration.
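As a rough sketch of what that looks like in practice, assuming the Indic NLP Library’s normalizer and tokenizer APIs (which may differ slightly across versions), with a sample sentence of my own:

```python
# pip install indic-nlp-library
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
from indicnlp.tokenize import indic_tokenize

text = "समय के साथ सब ठीक हो जाएगा।"

# Canonicalize script-level variation (alternate Unicode spellings of the
# same character) before any downstream processing.
normalizer = IndicNormalizerFactory().get_normalizer("hi")
normalized = normalizer.normalize(text)

# Script-aware tokenization that keeps matras and conjuncts intact and
# splits off the danda (।) as its own punctuation token.
tokens = indic_tokenize.trivial_tokenize(normalized, lang="hi")
print(tokens)
```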
PyArabic and Stanford’s arabic-nlp are popular preprocessing libraries for Arabic text. I also highly recommend reading Neri Van Otten’s article on overcoming challenges in Arabic NLP.
HanziNLP by Sam Shi and Sidney Kung’s article on Chinese Natural Language (Pre)processing are brilliant resources for getting started with Chinese NLP.
Preprocess text manually before it is used in NLP tasks. If you cannot use a specialized NLP library, you can preprocess the text yourself before it enters an NLP pipeline (a minimal sketch follows this list). This can involve:
- Transliterating text to a different writing system
- Segmenting text into words
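Here is a minimal sketch of that kind of manual preprocessing for Devanagari text using only Python’s standard library. It covers the segmentation half (transliteration is best left to a dedicated library), and the sentence is again my own example.

```python
import re
import unicodedata

def preprocess_devanagari(text: str) -> list[str]:
    """Manually normalize and word-segment Devanagari text."""
    # Put the text into one canonical Unicode form so that the same word
    # always has the same code-point sequence.
    text = unicodedata.normalize("NFC", text)
    # Segment on an explicit Devanagari code-point range (excluding the
    # danda marks । and ॥) so vowel signs and viramas stay attached.
    return re.findall(r"[\u0900-\u0963\u0966-\u097F]+", text)

print(preprocess_devanagari("समय के साथ सब ठीक हो जाएगा।"))
# ['समय', 'के', 'साथ', 'सब', 'ठीक', 'हो', 'जाएगा']
```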
While not ideal, these approaches can help you ensure that your text remains coherent and does not lose semantic meaning during an NLP task.
Remember, even if you use a specialized NLP library or preprocess text manually, it is important to be aware of the challenges of working with languages with complex scripts. Always ensure you use a consistent set of rules for tokenizing text and leverage a robust NLP library that can handle errors in text preprocessing.
THANKS FOR READING!
I hope this blog post has been informative and sparked your curiosity about the challenges posed by complex scripts in multilingual natural language processing. Whether you’re an NLP researcher or simply interested in language technologies, understanding these complexities is crucial for building truly global natural language models.
If you are interested in other Indian language NLP resources, I highly recommend you check out the Indic NLP Catalog for lots of great tools and datasets.