India turns to AI in ‘a special effort’ to capture its 121 languages
- Few of India’s many languages are covered by natural language processing, the branch of AI that enables computers to understand text and spoken words
- Hundreds of millions of Indians are thus excluded from useful information and many economic opportunities. Governments and start-ups are trying to bridge this gap
For a few weeks this year, villagers in the southwestern Indian state of Karnataka read out dozens of sentences in their native Kannada language into an app as part of a project to build the country’s first AI-based chatbot for tuberculosis.
But few of India’s many languages are covered by natural language processing (NLP), the branch of artificial intelligence that enables computers to understand text and spoken words.
Hundreds of millions of Indians are thus excluded from useful information and many economic opportunities.
“For AI tools to work for everyone, they need to also cater to people who don’t speak English or French or Spanish,” said Kalika Bali, principal researcher at Microsoft Research India. “But if we had to collect as much data in Indian languages as went into a large language model like GPT, we’d be waiting another 10 years. So what we can do is create layers on top of generative AI models such as ChatGPT or Llama.”
The villagers in Karnataka are among thousands of speakers of different Indian languages generating speech data for tech firm Karya, which is building data sets for firms such as Microsoft and Google to use in AI models for education, healthcare and other services.