Top-level summary: Human expression is remarkably diverse, as exemplified by the sheer number of languages with different semantic and syntactic rules. A predominant part of the knowledge base on the internet is in English, which hinders people from all parts of the world, especially where English is not widely used, from contributing to and consuming scientific and other information. Manual translation efforts, often by governments and non-profit organizations, certainly help make information more accessible, but they fall short of covering the entire corpus of information on the internet. That’s where machine translation can help, and this paper makes a valuable contribution for low-resourced African languages, with a specific focus on the official languages of South Africa. The paper by Laura Martinus and Jade Abbott provides an overview of the challenges these languages face and why there hasn’t been much progress in translation efforts: chiefly the lack of comparable work (code and data are rarely released), the absence of benchmarks and public leaderboards, and small, poor-quality datasets.
The authors use the widely adopted ConvS2S and Transformer architectures with default hyperparameter settings to establish benchmarks that they intend to improve upon through tuning and better datasets in the future. One of the key findings from their analysis was that model performance depended heavily on both the size and quality of the dataset and the morphological typology of the language itself. They utilized the Autshumato datasets, which provide parallel, sentence-aligned corpora for several languages with English equivalents. They found that the Transformer architecture performed better in general for all languages, and that for languages with smaller datasets, fewer byte pair encoding (BPE) tokens led to higher BLEU scores.
Ultimately, this work serves to establish a starting point for future research work which would involve collection of more datasets to cover the other official languages of South Africa and experimenting with unsupervised learning, meta-learning and zero-shot techniques.
The paper highlights how having more translation capabilities available for languages of the African continent will enable people to access larger swathes of the internet and contribute to scientific knowledge, both of which are predominantly English-based.
There are many languages in Africa; South Africa alone has 11 official languages, and only a small subset is available on public tools like Google Translate. In addition, because research on machine translation for African languages is scant, there remain gaps in understanding the extent of the problem and how it might be addressed most effectively. The problems facing the community are many: low resource availability; low discoverability, with language resources often constrained by institution and country; low reproducibility because of limited sharing of code and data; a lack of focus from African society on seeing local languages as primary modes of communication; and a lack of public benchmarks that would allow comparison of machine translation efforts happening in various places.
The research presented here aims to address many of these challenges. The authors also give a brief background on the linguistic characteristics of each of the languages covered, which hints at why some have been better served by commercial tools than others. From the related work, it is evident that few studies have made their code and datasets public, which makes comparison with the results presented in this paper difficult.
Most studies focused on the Autshumato datasets, some relied on government documents as well, and others used monolingual datasets as a supplement. The key takeaway from those studies is that Southern African languages receive little focus, and because all but one study withheld their datasets and code, the BLEU scores they list are incomparable, which further hinders future research efforts.
The Autshumato datasets are parallel, aligned corpora sourced from governmental text. They are available for English to Afrikaans, isiZulu, N. Sotho, Setswana, and Xitsonga translations and were created to build and facilitate open-source translation systems. Their sentence-level alignments were created using both manual and automatic methods. However, the datasets contain many duplicates, which the study in this paper eliminated to avoid leakage between the training and testing phases. Despite these eliminations, some quality issues remain, especially for isiZulu, where the translations don’t line up between source and target sentences.
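The deduplication step matters because an identical sentence pair landing in both the training and test splits would inflate evaluation scores. A minimal sketch of the idea, assuming the corpus is held as a list of (source, target) string pairs (the function name and data are illustrative, not from the paper's code):

```python
def deduplicate_pairs(pairs):
    """Remove duplicate (source, target) sentence pairs, keeping the
    first occurrence. Doing this before the train/test split avoids
    leakage between the two splits."""
    seen = set()
    unique = []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique


# Toy English-Afrikaans corpus with one exact duplicate.
corpus = [
    ("the government announced", "die regering het aangekondig"),
    ("the government announced", "die regering het aangekondig"),
    ("a new policy", "'n nuwe beleid"),
]
print(deduplicate_pairs(corpus))  # two unique pairs remain
```

Note this only catches exact duplicates; near-duplicates (differing in whitespace beyond the edges, casing, or punctuation) would need fuzzier matching.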
From a methodological perspective, the authors used ConvS2S and Transformer models without much hyperparameter tuning, since their goal was to provide a benchmark; tuning is left as future work. Additional details on the libraries, hyperparameter values, and dataset processing are provided in the paper, along with a GitHub link to the code.
In general, the Transformer model outperformed ConvS2S for all the languages, sometimes by as much as 10 BLEU points. Performance on a target language depended on both the number of sentences and the morphological typology of the language. Poor data quality combined with small dataset size plays an important role, as evidenced by the poor performance on the isiZulu translations, where a lowly 3.33 BLEU score was achieved. The morphological complexity of the language also contributed to its performance relative to other target languages.
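For readers unfamiliar with the BLEU scores quoted throughout, the metric is a brevity-penalized geometric mean of clipped n-gram precisions between a candidate translation and a reference. A toy sentence-level version, for illustration only (real evaluations use corpus-level BLEU with smoothing, e.g. via sacrebleu):

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each n-gram's count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU is zero if any precision is zero
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

Scores are usually reported scaled by 100, so the paper's 3.33 for isiZulu corresponds to 0.0333 on this 0-to-1 scale, i.e. very little n-gram overlap with the references.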
For each of the target languages studied, the paper includes some randomly selected sentences as qualitative results, showing how the different structures and rules of each language affect the accuracy and fidelity of the translations. There are also attention visualizations for the different architectures, demonstrating both correct and incorrect translations and thus shedding light on potential areas for dataset and model improvements. The paper also presents ablation studies on the byte pair encodings (BPE) to analyze their impact on BLEU scores; for datasets with a smaller number of samples, such as isiZulu, a smaller number of BPE tokens increased the BLEU scores.
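To make the BPE ablation concrete: BPE builds a subword vocabulary by repeatedly merging the most frequent adjacent symbol pair, and the number of merges directly controls vocabulary size, which is the knob the ablation varies. A simplified sketch of the merge-learning loop in the style of Sennrich et al.'s algorithm (not the paper's actual implementation; the example words are illustrative):

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a word list. Fewer merges yield a
    smaller subword vocabulary, which the paper found helps BLEU on tiny
    datasets such as the isiZulu corpus."""
    # Represent each word as a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


print(learn_bpe(["lower", "lower", "lowest"], num_merges=3))
```

With few merges, rare words fall apart into many short subwords the model sees often; with many merges, whole rare words become single, rarely seen tokens. On a small, morphologically rich corpus the former apparently generalizes better, consistent with the ablation result.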
For future directions, the authors point out the need for more data collection, and they suggest unsupervised learning, meta-learning, and zero-shot techniques as potential options for providing translations for all 11 official languages of South Africa. This work provides a great starting point for others who want to help preserve languages and improve machine translation for low-resource languages. As identified at the beginning of this summary, such efforts are crucial to empowering everyone to access and contribute to the scientific knowledge of the world. Providing code and data in an open-source manner will enable future research to build upon it, and we need more such efforts that capture the diversity of human expression across languages.
Original paper by Laura Martinus and Jade Abbott: https://arxiv.org/abs/1906.05685