YU News

Researchers Train AI to Understand the World's Most Overlooked Languages

Left to right, Dr. David Li, director of the M.S. in Data Analytics and Visualization, and Hang Yu and Ruiming Tian, both students in the M.S. in Artificial Intelligence, presented their study in March at IEEE SoutheastCon 2025 in Charlotte, N.C.

By Dave DeFusco

In our globally connected world, technology needs to understand people, no matter what language they speak. Whether you're using voice assistants, translating documents or asking questions online, artificial intelligence is increasingly expected to work across dozens of languages. But while tools like Google Translate and chatbots handle English and other major languages well, they often stumble when dealing with those that aren't widely spoken or studied.

So how can machines learn to understand languages that don't have large libraries of digital text or annotated examples? That's the challenge a team of Katz School researchers led by Dr. David Li, director of the M.S. in Data Analytics and Visualization, set out to solve with a new framework that significantly improves how AI understands "low-resource languages": languages that lack the massive training datasets available for English, Spanish or Mandarin.

The Katz School team presented their study, "Cross-Lingual Text Augmentation: A Contrastive Learning Approach for Low-Resource Languages," in March at IEEE SoutheastCon 2025 in Charlotte, N.C.

"Our work centers on a field called cross-lingual natural language understanding, which involves building systems that can learn from high-resource languages and apply that knowledge to others," said Dr. Li. "Our approach combines clever data techniques and training methods to help machines 'transfer' what they've learned in one language to many others, without needing massive amounts of new information."

At the heart of today's language AI are models like XLM-RoBERTa and mBERT, powerful tools trained on text from dozens of languages. These models are surprisingly good at capturing patterns that are shared across languages, such as sentence structure or word meaning. But their performance drops dramatically when they deal with languages that have little training data, because these models rely heavily on examples.

If a language doesn't have many labeled datasets (sentences paired with their meanings or categories), the model can't learn the nuances it needs to perform well. And it's not just about having enough data. Sometimes the data that is available comes from a narrow field, say medical journals or government documents, so the model can't apply it easily to other domains like news articles or casual speech.

Traditional fixes, like creating synthetic data through back-translation (translating a sentence into another language and back again) or swapping in synonyms, help to a degree. But for truly underrepresented languages, even these strategies fall short, especially if good translation models don't exist for them. That's where this new research takes things a step further.
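
For readers curious what back-translation looks like in practice, here is a minimal sketch using the Hugging Face transformers library and publicly available Helsinki-NLP MarianMT checkpoints. The model names and the French pivot language are illustrative assumptions for the sketch, not details from the Katz School study.

```python
# Minimal back-translation sketch: English -> French -> English.
# The MarianMT checkpoints and pivot language below are illustrative choices,
# not the translation models used in the paper.
from transformers import MarianMTModel, MarianTokenizer

def back_translate(sentences,
                   src_to_pivot="Helsinki-NLP/opus-mt-en-fr",
                   pivot_to_src="Helsinki-NLP/opus-mt-fr-en"):
    """Translate each sentence to a pivot language and back, producing
    paraphrased variants that can serve as extra training examples."""
    def translate(texts, model_name):
        tokenizer = MarianTokenizer.from_pretrained(model_name)
        model = MarianMTModel.from_pretrained(model_name)
        batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**batch, max_length=128)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    pivot = translate(sentences, src_to_pivot)   # English -> French
    return translate(pivot, pivot_to_src)        # French -> English

print(back_translate(["The model struggles with languages it has rarely seen."]))
```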

The researchers designed a multi-pronged strategy that makes language models more flexible, efficient and accurate in multilingual settings. Their approach focuses on four main innovations:

  • Better Data Augmentation: Instead of relying on just one method, the team combined several: back-translation, synonym swapping and even changing sentence structures. This mix of methods helped create more diverse, higher-quality training examples without introducing too much noise or error.
  • Contrastive Learning: The model is trained to recognize when two sentences in different languages mean the same thing and when they don't (a minimal sketch of this idea follows the list). This strengthens the model's ability to match meanings across languages, even if the surface words look nothing alike.
  • Dynamic Weight Adjustment: When learning multiple languages, AI often either overgeneralizes or misses the subtle features of each language. This feature lets the model dynamically balance general knowledge with language-specific quirks, keeping it accurate without losing sensitivity to detail.
  • Adaptation Layers: These are like special filters added to the model that help it tune its responses to a specific task or language. They make the model more flexible and help it perform well even with just a small amount of labeled data.
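
To make the contrastive-learning idea concrete, the sketch below shows one common way to implement it: encode a batch of translation pairs with a multilingual encoder such as XLM-RoBERTa, then apply an InfoNCE-style loss that pulls matching pairs together and pushes mismatched pairs apart. This is a generic illustration of the technique, not the team's exact architecture; the encoder checkpoint, mean pooling and temperature value are assumptions.

```python
# Sketch of cross-lingual contrastive learning with an InfoNCE-style loss.
# Encoder choice, pooling and temperature are illustrative, not the paper's specifics.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def embed(sentences):
    """Mean-pool token embeddings into one L2-normalized vector per sentence."""
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    hidden = encoder(**batch).last_hidden_state              # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # (batch, seq, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)            # masked mean pooling
    return F.normalize(pooled, dim=-1)

def contrastive_loss(src_sentences, tgt_sentences, temperature=0.05):
    """Matching (src[i], tgt[i]) pairs are positives; every other pairing
    in the batch acts as a negative."""
    src, tgt = embed(src_sentences), embed(tgt_sentences)
    logits = src @ tgt.T / temperature                       # scaled cosine similarities
    labels = torch.arange(len(src_sentences))
    return F.cross_entropy(logits, labels)

# Toy English/Swahili translation pairs, purely for illustration.
loss = contrastive_loss(
    ["The weather is nice today.", "Where is the nearest clinic?"],
    ["Hali ya hewa ni nzuri leo.", "Kliniki ya karibu iko wapi?"],
)
```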

To see how their system measured up, the researchers tested it on three large datasets used in multilingual AI research: XNLI, which checks whether a model can understand logical relationships in sentences, like contradiction or agreement, across 15 languages; MLQA, which tests how well models answer questions in seven languages; and XTREME, a mega-benchmark covering 40 languages and a variety of tasks, from classification to structured prediction.
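
Public copies of these benchmarks are distributed through the Hugging Face datasets hub, so evaluating on a low-resource slice can be set up in a few lines. The sketch below assumes the hub's XNLI copy and its Swahili subset; the dataset identifier and language code are assumptions about that copy, not artifacts released with the study.

```python
# Sketch: loading the Swahili slice of XNLI for evaluation.
# Dataset identifier and language code refer to the Hugging Face hub copy (assumed).
from datasets import load_dataset

xnli_sw = load_dataset("xnli", "sw", split="test")
print(xnli_sw[0])                          # premise, hypothesis, label fields
print(xnli_sw.features["label"].names)     # entailment / neutral / contradiction
```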

"In all cases, our new framework outperformed traditional methods, especially in low-resource settings," said Hang Yu, a co-author of the study and student in the M.S. in Artificial Intelligence. "The biggest gains came when contrastive learning and augmentation were combined, showing that giving the model diverse, quality examples and helping it link meanings across languages are both essential."

Even more impressive, the improvements came with only a small increase in computing power and memory use. That makes the framework a practical option for real-world applications, where resources and time are often limited.

To understand what really made the difference, the researchers ran an ablation study: turning off one component at a time to see what impact it had. (A generic sketch of that toggle-one-component pattern appears after the list.) Here's what they found:

  • Removing contrastive learning caused a noticeable drop in performance, confirming it was key to helping the model distinguish between similar and different meanings.
  • Without cross-lingual feature mapping, accuracy dropped the most, proving that directly aligning features across languages is critical.
  • Language-specific adapters and dynamic weight adjustments also played an important role, especially in preserving unique language traits.
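
The mechanics of such an ablation are easy to express in code. The harness below is a generic, hypothetical sketch: the component names mirror the ones discussed above, and train_and_evaluate is a placeholder for a real training and scoring pipeline, not the authors' codebase.

```python
# Generic ablation-study harness: disable one component at a time and compare
# against the full system. Component names and train_and_evaluate() are
# hypothetical placeholders, not the authors' actual code.

FULL_CONFIG = {
    "contrastive_learning": True,
    "cross_lingual_feature_mapping": True,
    "language_adapters": True,
    "dynamic_weight_adjustment": True,
}

def train_and_evaluate(config):
    """Placeholder: train the model under `config` and return dev-set accuracy."""
    raise NotImplementedError("Replace with a real training and evaluation loop.")

def run_ablation():
    baseline = train_and_evaluate(FULL_CONFIG)
    for component in FULL_CONFIG:
        ablated = dict(FULL_CONFIG, **{component: False})   # switch one part off
        score = train_and_evaluate(ablated)
        print(f"without {component}: {score:.3f} (drop of {baseline - score:.3f})")
```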

"This research isn't just academic. In real-world scenarios, such as disaster response, global health communications or inclusive tech development, understanding low-resource languages can have life-saving consequences," said Ruiming Tian, a co-author of the study and student in the M.S. in Artificial Intelligence. "It also matters for cultural preservation, giving digital tools access to languages that might otherwise be ignored in the AI revolution."

The framework developed here offers a scalable, efficient way to close the gap between high- and low-resource languages. It shows that with the right techniques, AI can learn to understand not just the "big" languages but all the voices of the world.

"As AI becomes more deeply embedded in daily life, from phone apps to government services, ensuring it works well for everyone is both a technical and moral challenge," said Dr. Li. "This research moves us one step closer to that goal."
