Helen
Helen is a wearable camera (and now an iOS app) that performs automated lip reading using deep learning. It can supplement hearing aids in noisy environments by lipreading and transcribing spoken content, and it enables audio-independent communication where audio-based speech recognition fails.
The hearing impaired have a tough time communicating, either because hearing aids can't isolate voices from crowds, or because such smart hearing aids cost around $3000. A current solution to this problem is sign language, but its relatively small user base limits its proliferation. After verifying these claims with hearing institutes, we realised that we could enhance communication for the hearing impaired by using visual information to capture speech, rather than relying solely on audio. Our research showed that automated lip reading could be achieved using recent AI advances, and we set out to bring this solution out of labs and papers and into the hands of those who would truly benefit from it.
Helen has a simple three-stage workflow: 1. Using a Raspberry Pi, Helen records video of the speaker. 2. It then transmits this video to a system running a visual speech recognition model. The models used to actually perform lipreading are our own adapted implementations of the LipNet and Visual-to-Phonemes models proposed by Oxford, DeepMind and CIFAR. Spatiotemporal convolutional neural networks encode changes in visual information over time, mapping sequences of lip movements to words. Bidirectional gated recurrent units then determine how much information to retain and how much to forget, so that the beginnings and ends of words can be demarcated. Multilayer perceptrons aggregate all of this information to output a transcription of the spoken content. 3. This transcription can then be converted into audio, or even into braille (for those who are visually impaired as well).
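To make the architecture concrete, the following is a minimal sketch of a LipNet-style model of the kind described above. It assumes PyTorch; the layer sizes, vocabulary size and pooling choices are illustrative placeholders rather than our production configuration.

```python
# Minimal sketch of a LipNet-style visual speech recognition model (assumed PyTorch).
# Layer sizes and the vocabulary are illustrative, not our production configuration.
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, vocab_size=28, hidden=256):
        super().__init__()
        # Spatiotemporal (3D) convolutions encode how lip shapes change over time.
        self.stcnn = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # Bidirectional GRUs decide how much context to retain or forget,
        # helping demarcate where words begin and end.
        self.gru = nn.GRU(input_size=64, hidden_size=hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        # A final perceptron layer maps each time step to character probabilities.
        self.classifier = nn.Linear(2 * hidden, vocab_size)

    def forward(self, video):                    # video: (batch, 3, time, height, width)
        feats = self.stcnn(video)                # (batch, 64, time, h', w')
        feats = feats.mean(dim=[3, 4])           # pool away the spatial dimensions
        feats = feats.permute(0, 2, 1)           # (batch, time, 64)
        out, _ = self.gru(feats)                 # (batch, time, 2 * hidden)
        return self.classifier(out)              # per-frame logits
```

In practice, the per-frame logits are decoded into text with a CTC-style decoder to produce the final transcription.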
Every stage of Helen's development was carried out keeping the hearing impaired in mind. The hearing impaired have a tough time communicating, either because hearing aids can't isolate voices from crowds, or because such smart hearing aids cost more than $3000. With Helen, hearing impaired individuals would merely need to point our device (or their iPhone) at a person in order to read a transcription of what is being said. This completely circumvents the need to buy expensive hearing aids (on the part of hearing impaired individuals), or to acquire fluency in sign language (on the part of both the hearing impaired and those conversing with them).
In addition to bridging the communication gap that affects the hearing impaired, Helen also aims to increase their inclusion in the workforce. By opening up an alternate dimension of audio-independent communication, Helen can enable hearing impaired individuals to perform jobs that previously required them to be able to communicate with customers (clerks, tellers and so on).
These claims have been verified by the United Kingdom's Association of Teachers of Lip Reading to Adults (ATLA) and other agencies, all of whom have validated our solution.
- Equip workers with technological and digital literacy as well as the durable skills needed to stay apace with the changing job market
Per Cornell, 60% of hearing impaired individuals are unemployed. This is either because the sign language interpreters assigned to them during interviews are unreliable, or because they are turned away from jobs due to their inability to hear customers. Often, those hearing impaired individuals who are hired are paid below minimum wage (https://www.google.com/amp/s/www.forbes.com/sites/sarahkim/2019/10/24/sub-minimum-wages-disability/amp/).
Since Helen doesn’t rely on audio, the hearing impaired can use it to interview with more ease and success. And since Helen opens up an alternate dimension of visual communication, it also opens up jobs that previously mandated a natural ability to communicate with customers.
- Concept: An idea being explored for its feasibility to build a product, service, or business model based on that idea
- A new technology
Helen opens up an entirely new dimension of communication that is completely independent of audio.
As described in other answers, Helen's primary aim is to make communication easier for hearing impaired individuals. Currently, the hearing impaired face restricted communication in multiple domains, as their hearing aids do not function well in noisy environments, and not everyone they communicate with can be expected to know sign language. Helen removes these dependencies on pristine, noiseless surroundings and on fluency in sign language by facilitating communication based solely on visual information, which is unperturbed by background noise.
The ability of Helen to provide hearing impaired individuals a transcription of spoken content even in the noisiest of environments is not replicated in any other device currently on the market, and is what makes Helen truly unique. We do not have direct competitors who perform automated lipreading; our closest competitors are hearing aid manufacturers. However, we see Helen as a device that can complement hearing aids, rather than compete with them.
Helen has a simple three-stage workflow: 1. Using a Raspberry Pi (or an iPhone camera), Helen records video of the speaker. 2. It then transmits this video to a system running a visual speech recognition model. The models used to actually perform lipreading are our own adapted implementations of the LipNet and Visual-to-Phonemes models proposed by Oxford, DeepMind and CIFAR. Spatiotemporal convolutional neural networks encode changes in visual information over time, mapping sequences of lip movements to words. Bidirectional gated recurrent units then determine how much information to retain and how much to forget, so that the beginnings and ends of words can be demarcated. Multilayer perceptrons aggregate all of this information to output a transcription of the spoken content. 3. This transcription can then be converted into audio, or even into braille (for those who are visually impaired as well).
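As a rough illustration of the first two stages, the sketch below captures a short clip from the device camera and posts it to the transcription service. It assumes OpenCV for frame capture and Python's requests library for transmission; the endpoint URL, payload format and clip length are hypothetical placeholders, not our deployed interface.

```python
# Sketch of the capture-and-transmit stages (assumed OpenCV + requests).
# The endpoint URL and payload format below are hypothetical placeholders.
import cv2
import requests

TRANSCRIBE_URL = "https://example.com/transcribe"  # placeholder endpoint

def capture_clip(num_frames=75, camera_index=0):
    """Grab a short clip of the speaker from the Pi or phone camera."""
    cap = cv2.VideoCapture(camera_index)
    frames = []
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames

def send_for_transcription(frames):
    """JPEG-encode each frame and POST the clip to the lipreading service."""
    files = [
        ("frames", (f"frame_{i}.jpg", cv2.imencode(".jpg", f)[1].tobytes(), "image/jpeg"))
        for i, f in enumerate(frames)
    ]
    response = requests.post(TRANSCRIBE_URL, files=files, timeout=30)
    response.raise_for_status()
    return response.json().get("transcription", "")

if __name__ == "__main__":
    print(send_for_transcription(capture_clip()))
```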
- Artificial Intelligence / Machine Learning
- Audiovisual Media
- Internet of Things
- Software and Mobile Applications
- Persons with Disabilities
- 3. Good Health and Well-Being
- 9. Industry, Innovation, and Infrastructure
- 10. Reduced Inequalities
- Hong Kong SAR, China
- Hong Kong SAR, China
Within the next year:
- Acquire deeper, more diverse datasets in order to increase the accuracy and reliability of our lipreading models.
- Grow the tech team in order to make Helen accessible on more platforms, and decrease the latency in producing transcriptions
Within the next five years:
- Scale up marketing and distribution efforts in order to push Helen into the market
- Collaborate with the following institutions to quickly allow hearing impaired individuals to utilise Helen:
- Hearing and Lipreading Institutions
- Special Needs Schools
- Geriatric Care Centers
- Collaborate with hearing aid manufacturers and/or smartphone manufacturers to build Helen directly into their equipment. This could mean modifying Helen as an add-on to existing hearing aids, or building Helen into the existing accessibility infrastructure on iOS and Android devices.
Technical:
Availability of data is a core technical issue that needs to be rectified. Currently, lipreading datasets aren't voluminous enough to support very large vocabularies.
Financial:
We anticipate heavy expenditure in the following areas:
- Infrastructure for data collection
- Recruitment of technical specialists and developers
- Cloud computing fees
- Marketing and customer acquisition
As undergraduate students in Hong Kong, we currently lack the financial resources to fulfil these undertakings.
We do not anticipate cultural barriers that would hinder our product's progression: lipreading is a fairly self-explanatory concept to grasp (thus avoiding the need to educate the market about our technology), while the lipreading model itself can handle multiple languages, provided sufficient data is available (thus eliminating social and linguistic barriers).
What we need in order to circumvent the aforementioned barriers is relatively simple.
The unavailability of large datasets can be remedied by embarking upon a large-scale data collection exercise to gather crowd-sourced data on which our models can be trained and refined, improving their accuracy and generalisability.
Meanwhile, receiving external investments, grants and endorsements would not only solve our financial issues but also lend credibility to our efforts.
- Not registered as any organization
Currently, the team consists of Amrutavarsh S Kinagi and Padmanabhan Krishnamurthy.
As Computer Science Undergraduate Researchers at the Hong Kong University of Science and Technology, we have spent over a year researching the untapped domain of real-time audio-visual speech recognition, and have iterated over multiple models, use cases and product designs. Our collective expertise in Computer Science, Mathematics, Robotics, Humanities and Linguistics, coupled with our experience since childhood of working with hearing and visually impaired individuals, motivates us to pursue this project to commercial fruition.
An important factor that adds to Helen's viability and gives our team the confidence to undertake further development is the recognition Helen has been fortunate to receive since its debut in April 2019. This includes:
- Winner of the Institute of Engineering and Technology (IET) Present Around the World (PATW) Hong Kong, Asia Pacific and Global Finals (2019)
- Invited talk at Re-Work Applied AI and Deep Learning Summit, San Francisco (2020)
- National Runner Up, Hong Kong, James Dyson Award (2019)
- Semi-Finalist, Finalist and (youngest) Winner of the HKUST President’s Cup - HKUST's largest research and innovation competition (2019)
- Best Innovation Award among 80+ entries across undergraduate, graduate and industrial categories at the IET Young Professionals Exhibition and Competition (YPEC) (2019)
- Undergraduate Champion at IET YPEC (2019)
We currently partner with the Hong Kong University of Science and Technology, which provides various opportunities to showcase our product, and we utilise AWS to power our cloud services.
The product lineup consists of an affordable wearable and a free app; either can be used to capture audiovisual data.
The ML model requires high-compute systems. These would run on separate cloud servers for individual customers, or on an organisation's established servers, in which case the subscription service would run at a reduced cost. Having organisations bear the cost of the service not only makes it affordable to their employees, but also proves more economical in the long run.
A subscription model works well here, as it allows us to promise constant updates to the system's usability, accuracy and accessibility over time, with newer iterations of the service.
- Organizations (B2B)
As mentioned earlier, we are facing technical barriers (unavailability of large datasets), and financial barriers (costs of cloud computing, customer acquisition etc.). While the solutions to the technological barriers are straightforward and apparent to us, we are unable to pursue them at this juncture due to the high costs associated with their implementation. Consequently, the technological barriers can be subsumed under the financial barriers.
Should our solution be selected, the prize money would be invaluable in facilitating the following tasks:
- Paying licensing fees to news agencies and media houses for videos of speakers along with subtitles and transcriptions.
- Setting up the technical infrastructure for crowd sourcing additional data.
- Covering costs of developer acquisition and cloud computing fees.
- Establishing marketing and distribution channels.
- Covering patent and other IP costs.
- Business model
- Funding and revenue model
- Marketing, media, and exposure
One of Helen's primary aims is to increase the proportion of the hearing impaired in the workforce by making it easier for them to interview for jobs and communicate with customers.
Not only would the monetary grant enable us to rapidly scale the development and deployment of Helen, but being able to brainstorm with engineers who have a track record of stellar product development would also help us assess suitable domains of deployment which we had not considered. Being able to potentially collaborate with GM to deploy Helen across its different verticals, right from the factory floor to corporate offices, would not only aid Helen's development, but also set the stage for lipreading technology to be incorporated into industry accessibility guidelines.
Helen's mission is to make communication easier for both the hearing impaired and those communicating with them, by eliminating dependencies on noiseless backgrounds or costly hearing aids.
Not only is Helen opening up an alternate dimension of communication for the hearing impaired, it is doing so at a substantially lower cost than the market-leading hearing aids aiming to solve similar issues. If Helen sees the light of day across global markets, it could have far-reaching implications for society, from increasing the proportion of the hearing impaired in the workforce to being incorporated as a privacy-conscious alternative to voice dictation in smartphones.
The AI for Humanity grant would not only allow us to accelerate our research into machine learning methods for visual speech recognition (as is evident from our papers linked in earlier sections), but would also facilitate setting up the infrastructure needed to make this research generalisable and ubiquitous.
Helen's mission is to make communication easier for both the hearing impaired and those communicating with them, by eliminating dependencies on noiseless backgrounds or costly hearing aids.
Not only is Helen opening up an alternate dimension of communication for the hearing impaired, it is doing so at a substantially lower cost than the market-leading hearing aids aiming to solve similar issues. If Helen sees the light of day across global markets, it could have far-reaching implications for society, from increasing the proportion of the hearing impaired in the workforce to being incorporated as a privacy-conscious alternative to voice dictation in smartphones.
The Future Planet Capital Prize would not only allow us to accelerate our research into machine learning methods for visual speech recognition (as is evident from our papers linked in earlier sections), but would also facilitate setting up the infrastructure needed to make this research generalisable and ubiquitous.
