SIMINCHIKKUNARAYKU
Language barriers still hamper cross-lingual communication, human mobility, and the free flow of knowledge, ideas, and commerce, as well as administrative, cultural, and political exchange, in Latin America. We are also facing a widening technology gap between widely used and lesser-used languages that deepens the digital divide for economically less powerful linguistic communities.
Language technology powered by artificial intelligence is part of the solution, but the lack of linguistic resources prevents us from using it. To fill the gap, we propose automating corpus gathering and running a nationwide media campaign to build a parallel voice/text corpus of the Southern Quechua language.
At both the local and the global scale, the revitalization of endangered languages matters for advancing rural land rights and mitigating climate change. It is important to focus global attention on the significance of these languages for investment, sustainable development, reconciliation, good governance, and peacebuilding.
Digital technologies, in particular language technology and content development and dissemination, play a growing role in influencing societal development and in contributing to the intergenerational transmission of indigenous languages from older to younger generations, rather than fostering their disappearance in today’s world. But clearly, the lack of large-scale linguistic corpora is the main barrier to the availability of new technology, electronic commerce, and content, which in turn prevents the relevant provision of public services to Quechua speakers.
In 2017, the Quechua language was spoken by 3,799,780 people, 13.6% of the total population over 5 years old in Peru. Quechua is also spoken in Bolivia and Ecuador; according to Ethnologue, there are 8 million speakers altogether. However, benefits such as peaceful development, inclusion, and cohesion, which come with respect for linguistic rights, accrue not only to Quechua speakers but to all Andean citizens, some 60 million people.
Looking at the entire world, the situation concerning text resources is comfortable for only 40 of the roughly 350 languages with more than one million speakers; the remaining languages need both corpora and tools. Clearly, the citizens and governments of the places where these 310 languages are spoken could benefit from a positive global impact of our initiative.
Ten thousand hours of voice aligned with text is the minimum necessary size of a parallel corpus for one language. A rough calculation shows that USD 6 million is needed to build such a corpus; this number is obtained by multiplying:
(10000 hours) x (20 man-hours / hour processed) x (USD 30 / man-hour)
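The multiplication above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the three figures come from the text, while the function and constant names are illustrative and not part of HUQARIQ.

```python
# Rough cost model for a manually built parallel corpus.
# The three figures are taken from the text; all names are illustrative.
CORPUS_HOURS = 10_000             # hours of voice aligned with text
MAN_HOURS_PER_HOUR_PROCESSED = 20 # labor needed per hour of audio
USD_PER_MAN_HOUR = 30             # assumed labor rate


def corpus_cost_usd(corpus_hours: int, man_hours_per_hour: int, usd_per_man_hour: int) -> int:
    """Return the total labor cost of processing the whole corpus by hand."""
    return corpus_hours * man_hours_per_hour * usd_per_man_hour


total_cost = corpus_cost_usd(CORPUS_HOURS, MAN_HOURS_PER_HOUR_PROCESSED, USD_PER_MAN_HOUR)
print(f"USD {total_cost:,}")  # USD 6,000,000
```

Changing any one factor, for example lowering the man-hours per processed hour through automation, shows directly how much of the USD 6 million each strategy could save.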
We propose to build corpora for under-resourced languages with much less money by combining three strategies: automation of collection, crowdsourcing, and massification.
To achieve automation in corpus collection, we have created HUQARIQ, a mobile app that records voices in online and offline mode. This tool allows users (native speakers) to listen to a phrase (prompt), record their voices repeating the phrase, and finally send the recording to the server.
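The listen, record, and send flow with offline support can be pictured as a local queue that holds recordings until a connection is available. The sketch below is an assumption about how such a queue might work, not HUQARIQ's actual code: `Recording`, `OfflineUploadQueue`, and the `send` callback are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque


@dataclass(frozen=True)
class Recording:
    """A speaker's repetition of one prompt (hypothetical structure)."""
    prompt_id: str
    audio: bytes


class OfflineUploadQueue:
    """Keep recordings locally and upload them in order once online."""

    def __init__(self, send: Callable[[Recording], bool]) -> None:
        # `send` stands in for the real upload call; it returns True on success.
        self._send = send
        self._pending: Deque[Recording] = deque()

    @property
    def pending(self) -> int:
        return len(self._pending)

    def submit(self, recording: Recording, online: bool) -> int:
        """Store a recording; if online, try to upload everything pending."""
        self._pending.append(recording)
        return self.flush() if online else 0

    def flush(self) -> int:
        """Upload pending recordings oldest first; stop at the first failure."""
        sent = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break
            self._pending.popleft()
            sent += 1
        return sent
```

A recording made without connectivity simply stays in the queue, and the next successful `submit` or `flush` sends it together with any newer ones.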
The next step is LINGUATHON, a media campaign to 1) encourage native speakers of endangered languages (Quechua or Aymara in the Peruvian case) to record their voices and 2) raise awareness among, and revive the interest of, citizens who are not native speakers but whose parents or grandparents were. We are betting on promoting a central corpus-collection date ten months in advance. The campaign will necessarily be informative and inspiring.
One of the most impactful but, at the same time, least developed creative industries in Latin America (LAC) is the set of language services.
Human beings use language to express, store, access, share, manipulate, interpret, and search massive amounts of information. Digital technologies facilitate these interactions; moreover, every digital product uses and depends on language. Language technology is no longer an option but the key enabler of future growth.
Language technology is one of the most important drivers behind the current boom in AI. Currently, however, almost all software development is in English only. Most Latin American citizens will be excluded from these game-changing technologies if we do not ensure that they have multilingual capability.
Without economic power to exert, and with no political groundswell to demand change, linguistic communities of endangered languages do not even dream of a meaningful presence in the digital sphere. Thus, until now, one either does or does not access technology through a language in which it is already well developed, without demanding or conceiving of services in a local language. This explains why the market has not been created and demand is asleep. These facts don’t scare us; on the contrary, we believe the current situation gives us the first-mover advantage and a great opportunity for growth in our initiative.
The right answer is multilingual, not monolingual. Our focus is therefore not the expansion of language technology for Spanish, Portuguese, or English but the hidden market of language technology for the most-spoken native languages of Latin America, which means tackling the entry barrier: the lack of linguistic resources.
We are not external observers; we are immersed. We all identify ourselves as members of the First Nations of Latin America, and we are part of the Quechua linguistic community. At the same time, we are among the early adopters of new technology due to our access to higher education. We continually ask our peers for feedback to make informed decisions.
- Support teachers and educational institutions with teaching and learning methodologies, tools, and resources that help develop future skills for students
We have identified the early adopters: half a million children in Intercultural Bilingual schools and one million adults in out-of-school programs. The corpus collected will feed an automatic speech recognition system for the Quechua language, called QILLQAQ, which is first of all a tool for literacy and for standardizing the written form of the language, and will therefore benefit these groups. QILLQAQ is also a writing tool and will therefore benefit the community of writers and journalists, whose unions group thirty thousand people. We have received feedback and advice from these professionals and also from the Ministry of Education.
- Prototype: A venture or organization building and testing its product, service, or business model.
We already have the main tool, the app HUQARIQ. We have tested it, but not at a large scale, so we consider it still a prototype. If we win the prize, we will move quickly to the pilot stage.
- A new technology
Ten thousand hours of voice and audio, aligned and scrupulously reviewed, is the minimum necessary size of a parallel corpus for a single language. A rough calculation shows that USD 6 million is needed to build such a corpus; this number is obtained by multiplying:
(10000 hours) x (20 man-hours / hour processed) x (USD 30 / man-hour)
To build the corpus of languages with few linguistic resources in a more feasible and efficient way, we propose three strategies: the automation of the collection, crowdsourcing, and massification.
To achieve automation in corpus collection, we have created HUQARIQ, a mobile app that records voices in online and offline mode. Our other two strategies, crowdsourcing and massification, are in our view the greatest innovation presented in this field in this century. We propose LINGUATHON, a media campaign that would reach practically an entire nation to 1) encourage native speakers of endangered languages (Quechua or Aymara in the Peruvian case) to record their voices and 2) raise awareness among, and revive the interest of, citizens who are not native speakers.
On the frontend, HUQARIQ is an Android app. On the backend, HUQARIQ is an API and a server developed in a Linux Ubuntu 16.04 LTS environment using 1) the Python programming language, version 2.7, and 2) the Django web framework, version 2.0.7, for the design of the model-view-controller architecture and the front end. Additionally, the view layer uses HTML5, CSS, and JavaScript, together with the PyWaveSurfer library for the audio interface controls (record, pause, speed, etc.).
The inputs collected by HUQARIQ are automatically fed into a pipeline that supplies our automatic speech recognition system for the Quechua language, called QILLQAQ.
Regarding QILLQAQ, the neural model was created in Python using Google's TensorFlow library and is based on deep learning algorithms. It follows the model presented by Baidu in 2016, which has several convolutional layers, recurrent layers, and a fully connected layer.
To be crystal clear, we will use the prize 1) to finish our app HUQARIQ, moving it from prototype to production, and 2) to run a media campaign that promotes the dissemination of HUQARIQ among the target population.
HUQARIQ is in some ways like https://soundcloud.com/, but HUQARIQ not only plays audio but also records voice and has more specific options. In other words, HUQARIQ is unique but is based on widely used technology.
Beyond the scope of this contest, we will use the corpus collected to feed QILLQAQ, the first speech-to-text system for the Quechua language. This is a more complex and sophisticated technology based on recurrent neural networks, a subset of deep learning.
Our own research is published here:
https://orcid.org/0000-0001-65...
Our code is published here:
https://github.com/tawantinsuy...
- Artificial Intelligence / Machine Learning
- Big Data
- Software and Mobile Applications
There are no risks and no privacy or security concerns. Volunteer users of HUQARIQ will be informed that their voices will be used to build a parallel corpus, which in turn will be used to develop language technology and release future tools for free.
- Women & Girls
- Pregnant Women
- Children & Adolescents
- Rural
- Peri-Urban
- Urban
- Poor
- Low-Income
- Minorities & Previously Excluded Populations
- Peru
- Bolivia
Currently, we are not serving anybody: our beta of QILLQAQ has a WER (word error rate) of over 70%, so it is not yet useful. We need to feed QILLQAQ with enough corpora, which is why HUQARIQ and QICHWATHON are so important.
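For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of words in the reference. The sketch below is a generic textbook implementation using edit distance over words. It is not the evaluation code used for QILLQAQ, and the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens: one substitution plus one deletion over four words.
print(word_error_rate("w1 w2 w3 w4", "w1 wX w3"))  # 0.5
```

A WER above 70% therefore means that more than seven word-level corrections are needed for every ten reference words; with many insertions the value can even exceed 100%.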
In the short term, we focus on our Andean community, with Quechua and Aymara as target languages. There are around 7,800,000 Quechua speakers in South America (Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, and Peru) and 2,200,000 Aymara speakers in Bolivia and Peru.
We expect that in five years our development of linguistic corpora and language technology tools will improve the socioeconomic inclusion and cohesion of all eight million Quechua speakers, through the use of this technology in the provision of public services. Moreover, we expect that the Quechua language will become mainstream and that rates of literacy in these languages will have at least doubled.
After one year, following the success of QICHWATHON, we expect to serve at least half a million children, students of intercultural bilingual education in Peru.
For the next year, our impact goal is a larger corpus, collected thanks to a successful media campaign promoting the use of HUQARIQ.
After five years of duly funded research, development, and innovation, we expect full digital processing of the Quechua language, which will allow the creation of an effective translator with speech recognition and morphological analysis programs. That means a clear footprint of this language in the digital world.
Also after five years, we expect that LINGUATHON could be a franchise deployable in countries that have “minority languages” with millions of speakers, like India, China, Congo, South Africa, Zimbabwe, Senegal, the Russian Federation, Uganda, Tanzania, Pakistan, the Philippines, Paraguay, Bolivia, Nigeria, Kenya, Iran, Indonesia, Mozambique, Morocco, Guinea, Ghana, or Ethiopia.
For this specific contest, the main indicator is clear: the number of hours collected.
Additional indicators are:
- number of downloads of HUQARIQ from Google Play
- number of users of HUQARIQ
- time of use of each HUQARIQ user
- number of spots and impact of each one
- the ratio of hours collected to the number of spots
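The derived indicators in this list are simple ratios, so they can be tracked with a small helper like the one below. This is a hedged sketch: the class, its field names, and the idea of a total usage figure are assumptions made for illustration, not an existing HUQARIQ reporting tool.

```python
from dataclasses import dataclass


@dataclass
class CampaignStats:
    """Raw campaign counts (hypothetical field names)."""
    hours_collected: float   # main indicator: hours of voice collected
    downloads: int           # downloads of HUQARIQ from Google Play
    users: int               # users who recorded at least once
    total_use_hours: float   # app usage summed over all users
    spots: int               # media spots aired during the campaign

    def hours_per_spot(self) -> float:
        """Ratio of hours collected to the number of spots."""
        if self.spots == 0:
            raise ValueError("no spots aired yet")
        return self.hours_collected / self.spots

    def mean_use_hours_per_user(self) -> float:
        """Average time of use per HUQARIQ user."""
        if self.users == 0:
            raise ValueError("no users yet")
        return self.total_use_hours / self.users
```

Raising an error when a denominator is zero, rather than returning zero, keeps an early report with no spots or users from being read as a real measurement.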
- Hybrid of for-profit and nonprofit
Luis Camacho, our team leader, works full time on our initiative. The rest of the team, Luis Mujica, Roger Gonzalo, Roxana Quispe, and Liz Camacho, work part time.
Siminchikkunarayku’s leader, Luis Camacho, has 20 years of experience leading ICT4D (information and communication technologies for human development) projects. He uses tools like PMBOK, TRIZ, DevOps, and especially Goldratt’s Theory of Constraints. In December 1998, he founded the Rural Telecommunications Research Group [GTR], and for over 20 years he has led the deployment of wireless communications for villagers living in extreme rural zones of the Amazon rainforest. In 2016, Luis was recognized by the Linux Foundation as a “Developer Do-Gooder”.
Luis Camacho has been working on indigenous language documentation for more than five years. He has been documenting Southern Quechua (Cusco, Ayacucho, Puno). He is thoroughly familiar with best-practice principles, from fieldwork to archiving. Luis Mujica is one of the world's leading experts on the Quechua language, and he has written on phonology, morphosyntax, and typology. Roger Gonzalo is one of the few phoneticians who have worked on the Quechua language. Roxana Quispe is one of the few scholars capable of accurate transcription of the Quechua language. Finally, a key member of our team is Liz Camacho, a native Quechua speaker, YouTuber, and community manager in charge of engaging the population. Together, we offer complementary skills that have been key to producing a unique set of materials for the Quechua language.
This project itself embodies the values of diversity, equity, and inclusion. It is a social business whose major purpose is the benefit of the First Nations of Latin America. We are not outside observers; we are immersed and totally aligned. We are part of the First Nations of Latin America, and our ethnicity is Native American. We therefore treat our languages with due respect.
We have a good gender balance, and we will always maintain it. We also encourage all team members to develop coding skills.
- Government (B2G)
Although the economic prize is far from being sufficient, it helps us a little.
Undoubtedly, the greatest merit of winning the award is the prestige it carries: the award is given by none other than MIT, the best university in the world. It will give us a lot of prestige in the academic world and some in the productive sector, and perhaps it will attract the press and, with much luck, the attention of some public sector officials of the highest level.
- Business model (e.g. product-market fit, strategy & development)
- Financial (e.g. improving accounting practices, pitching to investors)
- Public Relations (e.g. branding/marketing strategy, social and global media)
- Product / Service Distribution (e.g. expanding client base)
Beyond the scope of this contest, in the short term, we need seed capital to cover the first three stages of our complete initiative, which means covering 178 man-months of salaries, about US$ 600k. If we reach that goal, we would escape the "valley of death", start generating revenue, and reach stability and sustainable growth.
All the help we need right now must therefore be aligned with that goal: we have identified that we need financial support, a clear business model, and public relations.
At the international level, we would like to have the financial support of the Organization of Ibero-American States (OEI) and/or the Andean Community. OEI is leading the creation of an International Institute of Native American Languages (IILI); we would like to be the technological arm of that institute.
We would also appreciate mentorship from organizations like MIT SOLVE or NESST; from them, we expect training that helps us reach venture capital.
Finally, it would be really great to seal a deal with the Mozilla Foundation.
Project Manager