Saving Tongues: Protecting Unwritten Languages
Recording indigenous and minority communities' languages through an online oral dictionary powered by an AI language model.
Every two weeks a language goes extinct. According to the UNESCO Atlas of the World's Languages in Danger, there are 6,700 languages spoken in the world, 40 percent of which are in danger of disappearing. This worldwide phenomenon, fueled by pressure to abandon ancestral languages in exchange for economic growth, destroys indigenous and minority identities and severs a link to our human past, one that helps us understand who we are as well as our species' neurological and psychological capabilities. The extinction of these languages would mean the loss of a massive corpus of ancient knowledge and human genius, containing everything from stories as great as the Iliad to knowledge of the qualities, medicinal or poisonous, of every plant and toadstool in a region, as in Cherokee. Losing such knowledge has potentially devastating consequences, ranging from the loss of medicinal knowledge that could be crucial to identifying life-saving drugs to the loss of engineering secrets as great as Damascus steel, locked in languages we can no longer comprehend. Additionally, according to the Council of Europe, minority languages play a strong role in promoting economic growth and facilitating cross-border trade, a role that disappears when they go extinct. By increasing the use of these languages, officials can give their regions a strong competitive advantage in the economic sphere.
The problem of language extinction is compounded by the fact that once languages die, they often cannot be revived because too little information about them survives. Even written languages often cannot be deciphered without native speakers: text without audio has little value (take Linear A, the Indus script, and, to a lesser extent, Sanskrit). As a result, these languages are consigned to the ash heap of history, never to be understood, much like the language of the ancient Indus Valley. Many languages, such as Gondi in India, various Chinese dialects, and various languages in the Amazon Rainforest, lack a writing system, making it exceedingly difficult for translators to save them. Models such as Google's BERT require text, which makes it difficult for speakers of unwritten languages to use such tools to preserve their ancestral tongues. Without a written script, it is difficult for researchers to perform Natural Language Processing on these languages. Additionally, even if these languages are given a written script, no substantial corpus of work exists for machines to sort through in order to perform language processing. As a result, these languages tend to get ignored in favor of endangered languages with written scripts, such as Cherokee.
This issue exists in nearly every country in the world, exacerbated by a history of colonization, imperialism, and unchecked globalization not balanced by protections against cultural decimation. This global destruction of languages eventually reaches every home as we become poorer in diverse perspectives and our understanding of the natural world declines. This is how languages die. This is how cultures die. This is how peoples die.
The main problem with the current situation is the inability to process audio data from unwritten languages. The solution, thankfully, is simple in principle. Take speech-to-text: sound coming from a person's mouth is a set of vibrations at particular frequencies. A microphone picks up these vibrations, and the machine converts them into a digital signal and breaks that signal down into phonemes, essentially the individual sounds of speech. These phonemes are then mapped against a library of words, sentences, and phrases via a mathematical model, which returns the most likely conversion.
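The first step of that pipeline, turning a recorded signal into frequency features frame by frame, can be sketched roughly as follows. This is a minimal illustration and not the project's implementation: the sample rate, the frame length, the function names, and the use of pure sine tones as a stand-in for speech are all assumptions made for the example, and a real system would use richer spectral features than a single dominant frequency.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed value for this sketch)


def synth_tone(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Generate a pure sine tone as a stand-in for recorded speech."""
    n = round(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


def split_frames(samples, frame_len):
    """Cut the signal into consecutive, non-overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def dominant_frequency(frame, sample_rate=SAMPLE_RATE):
    """Return the strongest frequency in a frame using a naive DFT."""
    n = len(frame)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):  # skip the DC bin, stay below Nyquist
        coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate / n


# Two hypothetical "sounds" spoken back to back: 500 Hz, then 1000 Hz.
signal = synth_tone(500, 0.02) + synth_tone(1000, 0.02)
frames = split_frames(signal, frame_len=160)
features = [dominant_frequency(f) for f in frames]
print(features)
```

With these assumed values each 20 ms frame resolves to one tone, so the two frames come out as 500 Hz and 1000 Hz. Sequences of such per-frame features are what a later model would group into phoneme-like units.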
Next, in examining models such as ChatGPT and Google Translate, we see that these models gather enormous amounts of data from a variety of sources. That data is fed into a machine which analyzes words and compares their use in different contexts to gain an understanding of the language's grammar and of the various contexts in which a word might be used. From that, it is able to infer the meaning of a word. Utilizing these two technologies, namely NLP and speech-to-text, it is possible to crowdsource data from indigenous communities, whose recorded conversations become the data for the machine. The machine can then map the sound patterns of one person's speech to those of another person's speech, building an understanding of grammar and context through an entirely audio-based approach.
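One way to picture mapping one speaker's audio onto another's is dynamic time warping (DTW), a standard technique for comparing sequences spoken at different speeds. The sketch below is an assumption about how such matching could work, not a description of the project's method; the feature sequences are invented numbers, and real recordings would need speaker-normalized spectral features rather than raw values like these.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i items of a
    # with the first j items of b.
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances alone
                                    cost[i][j - 1],      # b advances alone
                                    cost[i - 1][j - 1])  # both advance
    return cost[len(a)][len(b)]


# Hypothetical per-frame feature tracks for illustration only.
speaker_a_word = [200, 220, 260, 240, 210]
speaker_b_same_word = [200, 200, 220, 260, 260, 240, 210]  # same word, slower
speaker_b_other_word = [300, 280, 150, 150, 310]

same = dtw_distance(speaker_a_word, speaker_b_same_word)
other = dtw_distance(speaker_a_word, speaker_b_other_word)
print(same, other)
```

Because DTW lets frames stretch or repeat, the slower rendition of the same word aligns at zero cost here, while the different word stays far away. A low distance is what would let two recordings be treated as the same spoken unit.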
Essentially, an NLP model would be used on audio to create an oral dictionary where the computer returns oral definitions and translations for illiterate communities. Because it does not rely on written words, this model would not create barriers for less-educated and underserved communities, and it would present the language as it is used in its more natural, oral context, as opposed to a potentially correct but less utilized formal, literary version in cases where one exists.
This product primarily serves indigenous and minority communities, mainly in developing countries such as India, Brazil, and the Philippines but also in wealthier countries such as Canada and the United States. These communities are currently underserved, lacking language services, translation, and preservation of their history, which dooms their culture to decline unless genuine intervention protects them against linguistic and cultural assimilation. For example, take the Gondi tribal community in India, who are spread across a variety of states with different state languages. Without a unified political presence or a traditional written language, their language has been in decline, with only one-fifth of Gondis speaking their ancestral language as it has been replaced by larger, more established languages such as Hindi and Telugu. As with the indigenous peoples of Mexico and Peru, along with cultures around the world, they lack the government and private resources required to protect their language against extinction. Very few people are currently working to solve this issue, and those who are must often expend tremendous effort to document these languages manually.
The solution addresses this problem by sharply reducing the manual resources needed to defend these languages. While the status quo's main issue is the lack of resources, the solution lowers the resource threshold needed to stop linguistic decline, allowing for a smaller economic commitment, fewer foreign personnel, and a greater number of languages saved with a smaller time commitment. As such, the solution addresses the current issues in the status quo, which stem primarily from a lack of commitment.
My experience is primarily defined by my Indian heritage. As a native speaker of Tamil, I am acutely aware of the influence of larger languages and cultures on my own. Our language, even in India, has been displaced by English and Hindi, with good education being a primarily English-medium endeavor. Further, all intellectual discussion occurs in English, with Tamil, for all its literary elegance and prominence, reduced to a home language with little true prestige. Indeed, Tamil has been so corrupted that literary Tamil is no longer [...]. My mother's native Bengaluru has been occupied by a new wave of Hindi settlers, with Kannada being reduced to a tertiary language under an English elite and a Hindi settler culture. And these are large languages compared to many of the smaller languages which face extinction in the modern world.
Seeing my own language being conquered by English has heightened my understanding of the very real challenges facing the many indigenous communities of the world who are seeing their culture destroyed. Additionally, a feature of my native Tamil is diglossia: the spoken version of the language is markedly different from the written version. This is combined with the fact that much of my culture and religion, especially with regard to the Vedas, is oral. With this in mind, I decided to develop a method of documentation that does not rely on the written word, because the written word does not fully capture the reality or essence of a language.
I have done research and engaged in conversations with members of my community about the need to conserve languages, especially unwritten ones. Additionally, I know from experience the challenges of maintaining my language, and my connection to it, without an adequate support system or resources in place.
- Improving learning opportunities and outcomes for learners across their lifetimes, from early childhood on (Learning)
- Concept: An idea being explored for its feasibility to build a product, service, or business model based on that idea.
While other forms of language preservation focus primarily on creating written dictionaries with audio components little better than text-to-speech, this project aims to expand translation and dictionary services to unwritten languages that lack the benefit of a written script. Additionally, by performing this process in a purely auditory manner and by crowdsourcing the data, this solution requires far fewer resources, workers, and deliberate effort to record the data. Instead, the information could be obtained by recording people's daily conversations, without any need for precious human hours to be spent on dedicated recording sessions.
Due to the lower resource threshold required for saving a language, the solution has the potential to transform this field as more and more previously unrecorded languages can be saved and documented, preserving past knowledge for generations to come while protecting native cultures from extinction and allowing them to maintain their own ethnic identity.
These are the impact goals:
- The fundamental technology powering Saving Tongues is completed and fully functional.
- Saving Tongues will have at least one local partner allowing us to work with them.
- Our local partner will be provided the recording equipment necessary for the project.
To achieve these impact goals, the following actions will be implemented:
- Work will continue on the technology, aiming for launch by September.
- A variety of local partners will be contacted for work on the project.
- A sponsor for the costs of the recording equipment will be found.
The core technology is, as stated earlier, a combination of Natural Language Processing and audio recording software similar to speech-to-text features. This technology will be fed audio data, which will be converted into a digital representation of the sound's vibrations and their frequencies. These frequency patterns will be stored by the machine as it collects more data from the conversations it records. When the machine has received enough data, it will start processing the language using models similar to ChatGPT and Google Translate, sorting through the data to find similar sound patterns. Having found similar sound patterns and ascertained their position in a sentence, the machine, along with grammatical guidance from those fluent in the target language, would assign definitions to words that correspond with their usage. The use of crowdsourced audio recordings would allow the machine to map the usage of words and phrases between people, accounting for differences in usage. Essentially, this is Google Translate's and ChatGPT's language model used on audio to create an oral dictionary where the computer returns oral definitions and translations for illiterate communities.
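The idea of grouping sound units by their position in a sentence can be sketched with simple context counting. Everything below is hypothetical: the unit labels such as "u7" stand for recurring sound patterns that an earlier clustering step is assumed to have produced, the utterances are invented, and a production system would learn from far more data with a far richer model.

```python
import math
from collections import Counter, defaultdict

# Each utterance is a sequence of hypothetical sound-unit labels.
utterances = [
    ["u1", "u7", "u3"],
    ["u1", "u9", "u3"],
    ["u4", "u7", "u5"],
    ["u4", "u9", "u5"],
    ["u3", "u2", "u1"],
]


def context_vectors(utts):
    """Count which units appear immediately left and right of each unit."""
    vectors = defaultdict(Counter)
    for utt in utts:
        for i, unit in enumerate(utt):
            if i > 0:
                vectors[unit]["L:" + utt[i - 1]] += 1
            if i + 1 < len(utt):
                vectors[unit]["R:" + utt[i + 1]] += 1
    return vectors


def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1)
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0


vectors = context_vectors(utterances)
sim_interchangeable = cosine(vectors["u7"], vectors["u9"])
sim_unrelated = cosine(vectors["u7"], vectors["u2"])
print(sim_interchangeable, sim_unrelated)
```

In this toy data "u7" and "u9" always occur between the same neighbors, so they score as highly similar, while "u2" shares no context with them. Candidates surfaced this way are what fluent speakers would then review and define.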
- Artificial Intelligence / Machine Learning
- Audiovisual Media
- Big Data
- Crowd Sourced Service / Social Networks
- Software and Mobile Applications
- United States
Having not yet launched the service, the solution is targeting 1-2 communities in the next year, with hopefully 20-30 users actively recording conversational data for the project during the 2023-2024 year.
The project foresees the following as potential challenges.
- A lack of willingness by native leaders to partner up with foreign workers.
- Potential distrust of conversation recording technology.
- A perceived lack of time on the part of indigenous people to participate in conservation efforts.
- Fund depletion
- Resurgence of COVID-19
- Discrimination and legal restrictions by countries for whom eradicating languages is a national policy against native groups.
- A lack of funding for the technologies necessary to advance the project.
Currently none
Key Resources:
- Community Participation and Outreach
- Developed oral dictionary
- Social media Marketing
- Community Outreach Marketing
Key Activities:
- Developing the oral dictionary
- Recording conversations among indigenous languages speakers
- Expanding the solution to more communities
Type of Intervention:
- Product: Oral Dictionary
- Service: Language Recording and Community Education
Segments:
- Beneficiaries: Communities whose languages and histories are preserved
- Customers: Governments that recognize the need for translation services for languages whose elderly native speakers might not speak the dominant language
Partners + Key Stakeholders:
- Indigenous Communities
- Local governments who wish to preserve native culture or who require translation services
- Schools who require translation services
- Universities who wish to preserve endangered languages
- United Nations
Channels:
- Local officials
- Community outreach/word of mouth
- Internet
- Local partners
- Local communities
Cost Structure:
- Currently: Tools needed to design the underlying technology
- Future: Partnerships, hiring local language experts, supplying audio-recording technology, maintaining the product
Surplus:
- Hire more staff
- Expand to other communities
- Improving mapping capabilities
- Expanding into educational opportunities for indigenous kids
Revenue:
- Applying for grants (governmental and university): 50%
- Applying for competitions: 15%
- Donations by individuals: 20%
- Local community contributions: 15%
Initially, the project will be sustained via grants and donors who see the value in the project. After that, the project hopes to expand beyond serving native communities alone and to provide translation resources, via contracts with governments, for the education of children in countries where English-language education is low. However, the solution will utilize a Free-for-Service business model for the communities we help, so as to respect their labors and their history.