Language Data Collection for Marginalized Language Communities
- United States
- Nonprofit
The head of Microsoft Research in India said that, while the digital divide is getting smaller, the digital language divide is growing larger. For four billion people who speak marginalized languages, the digital transformation will be meaningless because language technology is not available in their languages. This hinders access to information and the ability of communities to learn and grow within their own context. The lack of access to information is particularly stark in crises, when people, particularly the most marginalized, desperately need information. Call centers, helplines, chatbots, IVR systems - all of the ways that people get information and ask questions - are inaccessible to people who do not speak the right language. They are even less useful when they rely on written text: many of the most marginalized have low literacy, or have had little opportunity to be educated in their own languages.
To bridge this gap, speech language data, the main pillar for developing voice-based language technology, is needed. Ensuring the right approach to the collection of language data is critical to ensuring that the technology is less biased, and less potentially harmful.
While several organizations have initiated efforts to create voice datasets, either through localized community drives or through wide-access platforms, challenges such as cost, lack of engagement, and inability to replicate these efforts have resulted in disjointed and non-scalable initiatives. It has also meant that the speakers of those languages are not granted agency to contribute to and share inclusive language technology.
There is a need to ensure that language data collection can be community-driven and owned by speakers of marginalized languages AND is useful to language technology developers in the global South.
Democratizing access to dataset creation empowers the communities and ensures that AI models and other digital tools truly reflect the diversity of voices and serve the needs of those diverse communities.
As speakers of marginalized languages acquire skills to collect language data, including the ability to build and mobilize communities, manage complex projects, support local initiatives, and learn new technologies, they can thrive in their community and help build solutions that allow their communities to access information, learn, and thrive.
Our project proposes a revolutionary approach by leveraging our Translators without Borders (TWB) network, a community of over 100,000 language informants.
Our solution has two pillars:
Developing the Learning Center: Training and supporting language communities to create language data and ensure diversity and lack of bias in the data collected
Developing LanCo: Providing a tool for collecting consistent, high-quality language data that is easy for marginalized language speakers to use
TWB Learning Center
In many marginalized language communities, linguists can’t access comprehensive training and resources to enhance their professional skills, hindering their ability to effectively contribute to developing language data in their language.
The one-of-a-kind TWB Learning Center already empowers those from marginalized language communities by providing comprehensive training and resources to enhance their professional skills. Through collaboration, we will diversify our courses to encompass a wider range of linguistic skills. In 2023, we significantly enhanced this platform, introducing a revamped interface and adding new courses. This resulted in a substantial increase in engagement, with 20,094 new accounts created: a threefold year-on-year increase. A total of 31,000 users from about 100 countries have benefited from our training courses, including Egypt, Turkey, Argentina, Nigeria, China, and Ukraine, with approximately 250 users each. This underscores the platform's accessibility and the interest of marginalized language speakers in engaging.
With this project, we will introduce courses in text and voice data collection, validation and language technology, equipping community members with practical skills and knowledge essential for success in the language services industry and to engage in developing AI-based language solutions.
Our expansion efforts will prioritize mobilizing communities from marginalized languages, accessibility and inclusivity, ensuring that users from diverse backgrounds and linguistic communities can benefit from our resources. We plan to introduce live webinars and virtual mentorship programs, providing learners with opportunities to interact with experienced professionals and receive personalized guidance and constructive feedback.
The LanCo Platform:
The Language Data Collection Tool (LanCo) is a pioneering open-source platform designed to collect, validate, and utilize language data from a diverse range of languages, especially those underrepresented in the digital domain.
LanCo operates on a simple yet powerful premise: it enables our established community to log in using their existing credentials and participate in linguistic tasks tailored to their language expertise. The process is gamified with recognition points, leaderboards, and incentives, making it engaging and rewarding for participants.
Unique Features of LanCo:
Community-Driven Data Collection: By integrating with the TWB platform, LanCo taps into our network of language informants, who are ready to contribute to language data collection and validation.
Open-Source and Transparent: LanCo is open-source, granting full transparency and control over the data collection process. Communities can set up their projects, manage access levels, and retain ownership of the data.
Inclusivity and Accessibility: The platform is designed to be intuitive and accessible to a wide range of users through browser and mobile, with a clear user interface and no prerequisite technical skills required for participation.
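As an illustration only, the contribute, validate, and reward loop described above could be sketched as follows. This is not LanCo's actual data model: the class name, the two-vote majority rule, the points value, and the sample contributors and sentences are all hypothetical placeholders chosen to make the idea concrete.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Contribution:
    """One item submitted by a community member (hypothetical model)."""
    contributor: str
    language: str  # ISO 639-3 code, e.g. "hau" for Hausa
    text: str
    votes: list = field(default_factory=list)  # True = approve, False = reject

    def is_validated(self, min_votes: int = 2) -> bool:
        # Assumed rule: enough peer votes were cast and a strict majority approved.
        return len(self.votes) >= min_votes and sum(self.votes) > len(self.votes) / 2


def leaderboard(contributions, points_per_item: int = 10):
    """Rank contributors by points earned from validated items only."""
    scores = Counter()
    for c in contributions:
        if c.is_validated():
            scores[c.contributor] += points_per_item
    return scores.most_common()


# Placeholder data: two validated Hausa items and one Kanuri item still awaiting votes.
items = [
    Contribution("amina", "hau", "Ina kwana?", votes=[True, True]),
    Contribution("amina", "hau", "Sannu da zuwa.", votes=[True, False, True]),
    Contribution("bukar", "kau", "placeholder sentence", votes=[True]),
]
print(leaderboard(items))
```

Awarding points only after peer validation is one possible design choice: it ties recognition to data quality rather than volume, which matches the stated aim of consistent, high-quality data.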
Target population
While the project as a whole targets language informants who speak over 200 languages, initially, we will focus on language communities in northeast Nigeria. The TWB community already includes over 400 Hausa speakers and 200 Kanuri speakers, in addition to hundreds of speakers of Fulfulde and Shuwa Arabic.
Our recent experience in developing language data has demonstrated the importance of involving local communities in gathering language data, as it enhances the development of language technologies that truly address the needs of that community. We aim to build on this by empowering more linguists to contribute to language data collection, as part of our training plans. A key aspect of our expansion strategy is providing online training to local communities, beyond our current community of linguists. By doing this, we empower linguists to develop usable, inclusive, equitable, scalable, and adaptable solutions. We piloted a community-based approach to collecting and validating Kinyarwanda language data for tech solutions in education and tourism. Our recent blog post is here.
Through the project, the language informants learn how to use digital tools, understand the importance of language data, and develop new skills that they can also use to become translators, voice artists or other language professionals.
Impact of the Solution
Developing Digital Skills: By providing training and skills development, the project enables language informants to gain knowledge and skills in developing digital language data and language technology, giving them pathways to a career as a language professional.
Democratizing Language Data Collection: By enabling community-driven data collection, the project empowers individuals to contribute to and expand the digital presence of their languages. This grassroots approach ensures that the digital landscape becomes more inclusive.
Fostering Innovation and Accessibility: Developers and researchers gain access to a rich repository of language data, catalyzing the development of AI services and products that cater to a broader range of languages. This innovation will make technologies like speech recognition available in marginalized languages, breaking down barriers to information and services.
Enhancing Digital Inclusivity: By providing the foundational data needed to support technology development in underrepresented languages, the project enables marginalized language speakers to access digital tools and services that increase productivity, improve access to information, and facilitate cross-cultural and cross-linguistic connections. This inclusivity fosters a more equitable digital future, where no one is left behind because of their language.
In essence, this is not just a tool but a movement towards a more linguistically diverse and equitable digital world. By addressing the acute need for language data in marginalized communities, CLEAR Global is laying the groundwork for a future where digital benefits are universally accessible, fostering global connections and empowering all individuals, regardless of their language, to participate fully in the digital age.
Our expertise lies within our in-depth knowledge of the humanitarian and development sectors and our on-the-ground experience in responding to the challenges of multilingual communication. Our language services team has proven experience in community mobilization and training and has built customized online courses to support our global community of over 100,000 linguists. Our team also includes a unique combination of aid professionals, natural language processing experts, computational linguists, data scientists, researchers, design thinking experts, and communicators.
In practice, we have also successfully developed an early stage of this solution as elaborated in the details below on the selection of solution stage.
- Provide the skills that people need to thrive in both their community and a complex world, including social-emotional competencies, problem-solving, and literacy around new technologies such as AI.
- 3. Good Health and Well-Being
- 4. Quality Education
- 5. Gender Equality
- 9. Industry, Innovation, and Infrastructure
- 10. Reduced Inequalities
- Pilot
In a previous project, CLEAR Global was given a gamified tool, Sentence Society, that supports data donation projects for machine translation. The tool had its limitations but we were able to update it and use it in crowdsourcing parallel sentences for a bi-directional machine-translation model in English <-> Kinyarwanda. The project entailed developing the tool, building and mobilizing the community, training linguists to create language data for the use case, and rewarding linguists through an incentive program.
To begin creating language technology like machine translation, we needed language data - in this case parallel sentences in the right languages. Even languages with millions of speakers like Kinyarwanda may not have language datasets that are good enough to create accurate, viable, and domain-specific language technology capacity - yet.
In order to build machine translation capacity in Kinyarwanda, we mobilized speakers from our Translators without Borders Community and partnered with local organizations. We took a collaborative approach by sharing information about the project goals, the tool they would use to collect and validate language data, and the project’s intended impact. We aimed to ensure our community members had full transparency about the project and how their language data would be used. Demonstrating our commitment to transparency and open communication helped strengthen relationships and foster a sense of ownership between the community and our project team. This approach helped build trust in the technology and the overall project.
We used the data created to build a Machine Translation Plugin and integrate it into Moodle, an online learning management system, so users could switch between Kinyarwanda and English as they navigated content on entrepreneurship and digital literacy. The whole initiative was owned by the local communities, for both the tech and language services, and there are plans to replicate it for other use cases independently from CLEAR Global.
The current project aims to expand the tool to overcome its many limitations and to enable crowdsourcing speech data and transcriptions, given that voice-enabled solutions provide better access to information particularly within low-literacy contexts.
More importantly, the current project will tackle new communities and new languages, and will be integrated with our existing community of over 100,000 linguists so that it expands their skills and gives them agency over technology in their languages.
CLEAR Global has been developing language-based solutions for many years, including language technology. As far back as five years ago (an eternity in the language tech world), we recognized that voice-based solutions would be where we needed to focus; but the tech wasn’t advanced enough for low resource languages. However, recently, the ability to develop voice-based solutions for languages with few digital resources has become possible. While our team is up to the task, we could use support, guidance, and mentorship to really take our solutions to the next level.
We believe the MIT Solve community and resources will help give us the solid basis from which we can grow and develop our voice-based solutions.
- Business Model (e.g. product-market fit, strategy & development)
- Product / Service Distribution (e.g. delivery, logistics, expanding client base)
- Technology (e.g. software or hardware, web development/design)
- A new business model or process that relies on technology to be successful
- Artificial Intelligence / Machine Learning
- Crowd Sourced Service / Social Networks
- Software and Mobile Applications