Big Data & AI for the sake of endangered language speakers
Revitalizing endangered languages matters for advancing rural land rights and mitigating climate change, especially in the case of historically significant languages that have a large number of speakers but lack a well-defined official status and are on the verge of extinction. Language technology powered by artificial intelligence is part of the solution, but the lack of linguistic resources prevents its use. Artificial intelligence is boosting machine translation, and it is easy to imagine that translation, language learning, and other language services could soon be completely automated, but only for languages with many linguistic resources. To fill the gap, we propose a nationwide media campaign to gather a corpus of one of the native languages of Peru: Southern Quechua. We will face this great challenge through crowdsourcing: a huge group of volunteer speakers, enabled by a smartphone application called HUQARIQ and inspired by a nationwide media campaign called QICHWATHON.
Digital technologies, in particular language technology and content development and dissemination, play a growing role in societal development and can contribute to the intergenerational transmission of indigenous languages from older to younger generations, rather than fostering their disappearance in today’s world. But clearly, the lack of large-scale linguistic corpora is the main barrier that impedes the availability of new technology, electronic commerce and content, which in turn prevents the adequate provision of public services to Quechua speakers.
In 2017, Quechua was spoken by 3,799,780 people in Peru, 13.6% of the country’s population over 5 years old. Quechua is also spoken in Bolivia and Ecuador; according to Ethnologue, there are 8 million speakers altogether. However, benefits such as peaceful development, inclusion and cohesion, which come with respect for linguistic rights, accrue not only to Quechua speakers but to all Andean citizens, some 60 million people.
Looking at the entire world, the situation concerning text resources is comfortable for only 40 of the roughly 350 languages with more than one million speakers; the remaining languages need both corpora and tools. Clearly, the citizens and governments of the places where these 310 languages are spoken could benefit from a positive global impact of our initiative.
Without economic power to exert, and with no political groundswell to demand change, linguistic communities do not even dream of a meaningful presence in the digital sphere. Thus, people either access technology through a language in which it is already well developed or do not access it at all, without demanding or even conceiving of services in a local language. This explains why the market has not been created and demand is dormant. These facts do not scare us; on the contrary, we believe the current situation gives us a first-mover advantage and a great opportunity for growth in our initiative.
We have identified the early adopters. In the short term, the parallel voice-text corpus collected will feed an automatic speech recognition system for the Quechua language, called QILLQAQ. First, QILLQAQ is a tool for literacy and for standardizing the written form of the language, and therefore benefits at least half a million children in intercultural bilingual schools and one million adults in out-of-school programs. Second, QILLQAQ is also a writing tool, and therefore benefits the community of writers and communicators, whose unions group no fewer than thirty thousand people. We have received feedback and advice from these professionals and also from Ministry of Education officials.
Ten thousand hours of aligned and scrupulously reviewed voice and text is the minimum necessary size of a parallel corpus for a single language. A rough calculation shows that USD 6 million is needed to build such a corpus; this number is obtained by multiplying:
(10,000 hours) x (20 man-hours / hour processed) x (USD 30 / man-hour) = USD 6,000,000
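The estimate above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function and parameter names are ours for this example and are not part of HUQARIQ or QILLQAQ.

```python
def corpus_cost_usd(audio_hours, labor_hours_per_audio_hour, usd_per_labor_hour):
    """Estimated labor cost (USD) of aligning and reviewing a parallel corpus."""
    return audio_hours * labor_hours_per_audio_hour * usd_per_labor_hour


# Figures taken from the text: 10,000 hours, 20 man-hours per hour, USD 30 per man-hour.
estimate = corpus_cost_usd(10_000, 20, 30)
print(f"USD {estimate:,}")  # USD 6,000,000
```

Changing any one of the three factors shows how sensitive the total is to the labor needed per hour of audio, which is the factor that automation and crowdsourcing aim to reduce.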
Our hypothesis is that, to build corpora for languages with few linguistic resources in a more feasible and efficient way, we must combine three strategies: automation of the collection, crowdsourcing, and massification.
To automate corpus collection we have created HUQARIQ, a mobile app for recording voices in online and offline mode. The tool allows users (native speakers) to listen to a phrase (prompt), record their voice repeating the phrase, and finally either send the recording to the server or delete it. After a recording is sent, the app automatically moves to the next phrase, and the process repeats until all the phrases have been listened to and recorded. The offline mode is designed for areas with poor or no Internet coverage: it temporarily stores the user’s data (recordings, transcriptions and metadata) on the phone, which automatically sends everything stored to the application server once Internet access is available. HUQARIQ has a friendly and very intuitive interface. In addition, to deter pranksters, HUQARIQ will have an ID verifier linked to official organizations’ databases and an automatic preprocessing step for recordings, which will 1) remove background noise, 2) detect silence-only recordings, 3) cut very long silences and 4) save the audio in a standard file format.
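Preprocessing steps 2 to 4 can be sketched as follows. This is a minimal illustration and not HUQARIQ’s actual code: it assumes mono 16-bit PCM samples held in a Python list, and the function names, the amplitude threshold of 500 and the limit of 8,000 quiet samples (half a second at 16 kHz) are hypothetical values chosen for the example. Background noise removal (step 1) is not covered.

```python
import struct
import wave


def is_silence_only(samples, threshold=500):
    """True if no sample reaches the amplitude threshold (also true for an empty recording)."""
    return all(abs(s) < threshold for s in samples)


def trim_long_silences(samples, threshold=500, max_silence=8000):
    """Shorten every run of quiet samples to at most max_silence samples."""
    kept = []
    quiet_run = 0
    for s in samples:
        if abs(s) < threshold:
            quiet_run += 1
            if quiet_run > max_silence:
                continue  # drop the excess part of a long silence
        else:
            quiet_run = 0
        kept.append(s)
    return kept


def save_wav(path, samples, sample_rate=16000):
    """Write mono 16-bit PCM samples to a WAV file (assumed here as the standard format)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
```

A recording for which `is_silence_only` returns true would be rejected before upload; otherwise it would be passed through `trim_long_silences` and then `save_wav`.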
We believe our strategies, crowdsourcing and massification, are the greatest innovation presented in this field in this century. We propose LINGUATHON (SIMINCHIKKUNARAYKU MARATHON, or QICHWATHON in the Peruvian case), a media campaign that would reach practically an entire nation in order to 1) encourage native speakers of endangered languages (Quechua or Aymara in the Peruvian case) to record their voices and 2) raise awareness and revive the interest of citizens who are not native speakers but whose parents or grandparents very probably were. We plan to promote a central corpus collection date for ten months in advance. The campaign must be both informative and inspiring.
- Provide equitable access to learning and training programs regardless of location, income, or connectivity throughout Latin America and the Caribbean
- Support and build the capacity of formal and informal educators to better prepare Latin American and Caribbean learners of all ages for the jobs of today and tomorrow
- Prototype
Given the number of languages involved and the amount of financial resources and human effort required for the creation, annotation, preservation, and dissemination of transparent records of a language, it is not realistic to expect that the documentary linguistics community will be able to document all these languages without disruptive approaches.
People use digital tools that make sense to them, so our approach is the right one: dealing with the main problem, verbal communication, from the very beginning. Starting with written communication would have no impact, given the high illiteracy rates among Quechua speakers.
As stated, there is no alternative product to compare with. However, we know that the company Nuance charges USD 6,000 annually for a license that enables up to 1,000 people to use its Dragon Professional Group software for automatic transcription of English. It follows that licensing one million users would cost USD 6 million per year. Our strategy is to release QILLQAQ free of charge, relying on indirect income.
As for QILLQAQ, the neural model was created in Python using Google’s TensorFlow library and is based on deep learning algorithms. It follows the model presented by Baidu (2016), which has several convolutional layers, recurrent layers and a fully connected layer.
One of the main barriers to creating Quechua language technology is the lack of linguistic resources. So 1) our first objective is to break down that barrier. After this stage, 2) not only our team but several others will be empowered to develop products and solutions, so that, on the one hand, jobs are created in the orange and blue economies and, on the other, supply in the market increases. 3) The early adopters use these tools and popularize them; 4) public servants use them appropriately and at scale, and the end population feels the benefit; and 5) that population in turn becomes a user and demands more and better products. Finally, 6) non-speakers of Quechua become interested in learning it, closing a virtuous circle of increasing demand and supply.
- Women & Girls
- Children & Adolescents
- Rural Residents
- Low-Income
- Minorities/Previously Excluded Populations
- Peru
- Bolivia
Currently we are serving no one: our beta of QILLQAQ has a word error rate (WER) above 70%, so it is not yet useful. We need to feed QILLQAQ with enough corpora, which is why HUQARIQ and QICHWATHON are so important.
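For readers unfamiliar with the metric, WER is the number of word substitutions, deletions and insertions needed to turn the recognizer’s output into the reference transcription, divided by the number of words in the reference. The sketch below shows the standard computation in plain Python; the function name and the placeholder sentences are ours and do not come from the QILLQAQ code base.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two strings, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens: one substitution and one deletion against four reference words.
print(word_error_rate("w1 w2 w3 w4", "w1 x w3"))  # 0.5
```

A WER above 70% therefore means that, on average, more than seven edits are needed for every ten words of reference text.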
In the short term, we focus on our Andean community, with Quechua and Aymara as target languages. There are around 7,800,000 Quechua speakers in South America (Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, and Peru) and 2,200,000 Aymara speakers in Bolivia and Peru.
We expect that in five years our development of linguistic corpora and language technology tools will improve the socioeconomic inclusion and cohesion of at least all eight million Quechua speakers, through the use of this technology in the provision of public services. Moreover, we expect that the Quechua language will become mainstream and that literacy rates in these languages will have at least doubled.
After one year, following a successful QICHWATHON, we expect to serve at least around half a million children who are students in intercultural bilingual education in Peru.
For the next year, our main goal is to implement the media campaign, QICHWATHON.
After five years of research, we expect full digital processing of the Quechua languages, which will allow the creation of an effective translator with speech recognition and morphological analyzer programs; that means a clear footprint for these languages in the digital world.
Also after five years, we expect that LINGUATHON could be a franchise deployable in countries that have “minority languages” with millions of speakers, such as India, China, Congo, South Africa, Zimbabwe, Senegal, the Russian Federation, Uganda, Tanzania, Pakistan, the Philippines, Paraguay, Bolivia, Nigeria, Kenya, Iran, Indonesia, Mozambique, Morocco, Guinea, Ghana or Ethiopia.
The lack of large linguistic corpora is the main barrier that impedes the availability of new technology; addressing it is the core of our proposal. In turn, the lack of a market is the main barrier to sustainability.
Language technology for Native American languages will only be developed if an enabling environment exists. This environment would be the output of a serious and well-funded language policy, and such a policy will only be issued if decision-makers are convinced of the return on investment in Native American languages. It is sad and unfair, but it is real: “language as a right” has not worked, and a new approach is needed: “language as a resource”.
To assure the viability of our initiative, in partnership with public agencies, we must deal with three connected questions. On the surface, the question is: how large could the contribution of language technology, focused on native languages, be to the economy of a LAC country? Digging deeper, we find the second question: what is the cost-benefit ratio of carrying out a real multilingual language policy? Finally, at the core lies the fundamental question: what is the value proposition of the native languages of LAC?
Then, for the next year, the challenge is to involve important institutions, both public and private, in supporting QICHWATHON and to secure extensive publicity on radio and TV in order to reach a wide audience in a short time.
Over five years, the challenge is finding sustainability beyond public subsidies.
First of all, in addition to seeking funds for our core tasks, we are also looking for funds to assess and make visible the economic return of public investment in language technology for Peruvian native languages.
We are looking for an agreement with Peruvian National Television (TV Perú). The agreement would allow us to get free or very cheap TV publicity. If the agreement is not signed, we will need another sponsor to pay for the publicity; we have already identified some candidates.
Another risk is team members resigning. The risk is low if we secure a flow of funding; in any case, all of us will keep a blog to ease the handover of activities in case someone leaves.
- My solution is already being implemented in Latin America/Caribbean
So far, we have released first versions of two smartphone applications, QILLQAQ and HUQARIQ, and we maintain several open repositories on GitHub.
So far, without publicity, we have collected 55 hours of text-voice parallel corpus of the Quechua language; this data was provided by around 1,000 users. The conclusions were quite clear: HUQARIQ must be enhanced with new features to make it easier to use, handle a large number of requests, prevent misuse by trolls and increase the variety of phoneme combinations.
We are planning to expand our solution to Asia, Africa and Russia.
- Hybrid of for-profit and nonprofit
6 part-time staff
Siminchikkunarayku’s leader, Luis Camacho, has 20 years of experience leading ICT4D (information and communication technologies for human development) projects; he uses tools like PMBOK, TRIZ, DevOps and especially Goldratt’s Theory of Constraints. In December 1998, he founded the Rural Telecommunications Research Group [GTR], and over 20 years he led the deployment of wireless communications for villagers living in extremely remote rural zones of the Amazon rainforest. In 2016, Luis was recognized by the Linux Foundation as a “Developer Do Gooder”.
Luis Camacho has been working on indigenous language documentation for more than five years. He has been documenting Southern Quechua (Cusco, Ayacucho, Puno). He is thoroughly familiar with best-practice principles, from fieldwork to archiving. Luis Mujica is one of the world’s leading experts on the Quechua language and has written on phonology, morphosyntax, and typology. Roger Gonzalo is one of the few phoneticians who have worked on the Quechua language. Rodolfo Zevallos is a computational linguist who has already contributed several programs to this project. Roxana Quispe is one of the few individuals capable of accurately transcribing Quechua. Finally, a key member of our team is Liz Camacho, a native Quechua speaker, YouTuber and community manager in charge of community engagement. Together, we offer complementary skills that have been key to producing a unique set of materials for the Quechua language.
We wish to include everyone, from the public and private sectors to native communities. One of our objectives is to involve the national and subnational governments as allies in the planning and deployment of QICHWATHON, in order to achieve greater coverage, reliability, and assurance.
The government institutions that should be actively involved are the Ministries of Education and Culture. The Ministry of Education promotes quality education for children and adolescents who belong to indigenous communities and who speak native languages as a first or second language. The Ministry of Culture is the entity that manages culture and cultural industries, as well as the ethnic and cultural plurality of the nation; moreover, the national television broadcaster (TV Perú) is an agency within the Ministry of Culture. Officials of both ministries have given verbal assent.
In the private sector, we expect that TV and radio broadcasters and telecommunications companies will play an important role, due to the wide coverage they have at the national level.
Finally, the NGO Chirapaq promotes the affirmation of identity, the recognition of indigenous rights and the strengthening of citizenship. Chirapaq will be our ally in communicating directly with native communities about the importance of revitalizing our languages.
We bet on a fee-for-service or low-income-client model, focused on providing access to those who could not otherwise afford language services because of social, economic and technological barriers. Income should come from selling social services directly to a third-party payer. Creative distribution systems and lower production and marketing costs will enable high operating efficiencies.
Our monetization model is one-pay-all-enjoy-free: we expect to raise USD 300,000 annually from public procurement or private sponsorship in exchange for the right of any Peruvian to download, install and use the QILLQAQ software for free. This would allow us to cover all our expenses in order to sustain and grow this initiative.
At the beginning, QILLQAQ will work only as a transcription tool; as such, it could be used not only for personal purposes but also for business cases. We expect nothing less than the emergence of a new industry, press media in the Quechua language, and of course we expect to obtain additional revenue from it.
Next, in the short term, in order to generate more income, we expect to launch new services, for example automatic dubbing of programs for TV companies, enabling them to broadcast simultaneously in Spanish and Quechua.
Finally, we also bet on turning QICHWATHON into LINGUATHON, a franchise deployable in any country with under-resourced languages spoken by millions of people.
Although the economic prize is far from being sufficient, it helps us a little.
Undoubtedly, the greatest merit of obtaining the award is the prestige it carries: the award is granted by none other than MIT, the best university in the world. It will give us a lot of prestige in the academic world, some in the productive sector, and perhaps it will attract the press and, with much luck, the attention of some public official of the highest level.
- Incubation & Acceleration
- Funding
At the international level, we would like to have the financial support of the Organization of Ibero-American States (OEI) and/or the Andean Community. OEI is leading the creation of an International Institute of Native American Languages (IILI); we would like to be the technological arm of that institute.
We would also appreciate mentorship from organizations like MIT SOLVE or NESST; from them, we expect training that leads us to venture capital.
Project Manager