AI’s Dirty Secret: It Mostly Speaks English


Véronique Özkaya is CEO of DATAmundi.ai, delivering high-quality human data for leading global AI labs and enterprises.



At first glance, AI looks like a global technology. Look at its linguistic foundations, however, and it remains far from global.

Of course, AI writes in dozens of languages, translates instantly and powers products used across continents. The trouble is that most AI systems still think in one language. You guessed it: English.

Despite the frequent claim that today’s models are “multilingual,” the reality is that modern AI has largely been built on English. As highlighted by the World Economic Forum, most AI systems are trained on only about 100 of the approximately 7,000 languages spoken worldwide.

Analyses of the training data behind large language models (LLMs) show a strong dominance of English. For example, Meta’s LLaMA 2 paper reports that roughly 90% of the model’s pretraining data is English, while broader web statistics suggest that English accounts for nearly half of online content. If AI models such as ChatGPT are trained primarily on internet data, this imbalance inevitably shapes and skews how they understand and represent the world.

How Did We Get Here?

Several structural forces have shaped AI’s English-centric trajectory. The early internet was largely built in the U.S., and much of its foundational infrastructure, from domain systems to major content platforms, was developed in English.

Today, many of the frontier AI labs remain U.S.-based, and widely used evaluation benchmarks such as MMLU were originally developed in English. Data pipelines tend to follow the path of least resistance: English content is more abundant, more standardized and more readily accessible at scale for model builders.

Over time, this has created a reinforcing loop. AI systems perform best in English, users adapt by interacting with them in English, and even more English-language data is generated as a result.

Why Multilingual AI Isn't Just Translation

One of the most persistent misconceptions is that multilingual AI capability can be achieved through translation alone. Much of what is labeled “multilingual AI” relies on English as a pivot language, meaning models frequently process information in English internally before translating it into other languages.

When a model primarily reasons in English and then translates its output, it often carries English logic, structure and implicit assumptions into the target language. LLMs typically handle multilingual inputs by converting them into English for task-solving and, as a result, can overlook cultural nuance and locally specific concepts. For example, when responding to legal or compliance queries, models may default to the laws and assumptions of English-speaking jurisdictions, producing answers that are fluent but not grounded in the relevant local context. The result may be grammatically correct, but the model isn't reasoning in the target language or from the perspective of the target country.

Even leading models reflect this imbalance. Many are trained on multilingual datasets but evaluated primarily in English. Benchmarks, leaderboards and performance reports frequently center on English-language tasks, which masks measurable gaps elsewhere: In underrepresented languages, models often show higher error rates, more hallucinations and weaker reasoning accuracy on equivalent tasks.
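To see how a single headline score can hide a gap like this, consider a minimal Python sketch. Everything in it is an assumption for illustration: the function names are hypothetical, and the toy results are made up rather than taken from any real benchmark. It simply compares one aggregate accuracy figure with a per-language breakdown of the same results.

```python
from collections import defaultdict


def overall_accuracy(results):
    """Single aggregate score over all (language, is_correct) pairs."""
    results = list(results)
    return sum(int(ok) for _, ok in results) / len(results)


def accuracy_by_language(results):
    """Accuracy computed separately for each language code."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for lang, ok in results:
        totals[lang] += 1
        correct[lang] += int(ok)
    return {lang: correct[lang] / totals[lang] for lang in totals}


# Made-up toy results (NOT real benchmark data): the test set is
# dominated by English items, as many evaluation suites are.
toy_results = (
    [("en", True)] * 9 + [("en", False)] * 1
    + [("sw", True)] * 1 + [("sw", False)] * 1
)

print("overall:", round(overall_accuracy(toy_results), 2))
for lang, acc in accuracy_by_language(toy_results).items():
    print(lang, round(acc, 2))
```

Because English items dominate the toy set, the aggregate number sits close to the English score, and the much weaker result for the second language only becomes visible in the breakdown.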

Beyond language lies another complex issue: cultural bias and meaning that goes beyond the literal. Tone, humor, idioms, cultural references, politeness norms and embedded stereotypes are all part of everyday communication, and they shape how models interpret prompts and generate responses. The result can be output that is technically accurate yet subtly misaligned. These misalignments can have serious consequences in high-stakes domains such as healthcare, finance and law.

A healthcare chatbot may be medically and linguistically accurate but fail to interpret culturally nuanced symptom descriptions. In Japan, for example, patients may describe symptoms indirectly, downplaying chest pain as “a little heaviness” rather than acute pain. This reflects norms of restraint and of not causing concern that are seen across many East Asian cultures. But heaviness in the chest can signal a life-threatening condition, regardless of where you’re from. If such signals are misinterpreted, the model may underestimate clinical urgency.

True multilingual intelligence requires models that are trained, evaluated and optimized across languages and cultures from the outset.

The Business Opportunity For Multilingual AI

According to the International Telecommunication Union, in 2024, roughly 5.5 billion people were online, with the fastest growth coming from Asia, Africa, Latin America and the Middle East, where English isn't the primary language.

For companies building AI-powered products, language isn't a peripheral feature of product design. It's a core driver of adoption and economic value. Overlooking linguistic and cultural nuance risks optimizing products for only a fraction of their addressable market.

The Role Of Human Data

Real progress in multilingual AI doesn’t come from scaling translated corpora. It comes from natively sourced, culturally grounded human data. This is already playing out in real-world AI systems. In one large-scale client deployment, the DATAmundi team supported automatic speech recognition (ASR) development by building a training data pipeline: Thousands of multilingual audio files in 14 languages were first converted into text using AI transcription. Linguists then validated and annotated the data, ensuring it reflected how people actually speak in real-world, culturally diverse conditions. The result was an ASR model trained not on idealized language but on the complexity of real speech.
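For readers who think in code, the two-stage idea (a machine draft first, human validation second) can be pictured with a small Python sketch. This is a hypothetical illustration only, not a description of DATAmundi's actual pipeline: the names `TranscriptRecord`, `apply_review` and `training_ready` are invented here, and the sample strings are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TranscriptRecord:
    """One audio file and its transcript (hypothetical structure)."""
    audio_id: str
    language: str
    machine_text: str                      # draft from automatic transcription
    reviewed_text: Optional[str] = None    # filled in by a human linguist
    annotations: List[str] = field(default_factory=list)

    @property
    def is_validated(self) -> bool:
        return self.reviewed_text is not None


def apply_review(record, corrected_text, annotations=()):
    """Attach a linguist's corrected transcript and notes to a record."""
    record.reviewed_text = corrected_text
    record.annotations.extend(annotations)
    return record


def training_ready(records):
    """Keep only records that a human has validated."""
    return [r for r in records if r.is_validated]


# Placeholder data for illustration only.
records = [
    TranscriptRecord("a1", "fr", "machine draft one"),
    TranscriptRecord("a2", "ja", "machine draft two"),
]
apply_review(records[0], "corrected transcript one", ["informal register"])
print([r.audio_id for r in training_ready(records)])
```

The design choice the sketch highlights is that the machine transcript is treated as a draft: only records a linguist has reviewed are passed on as training data.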

Original content created by native speakers across registers, domains and contexts provides the semantic richness models need to reason effectively. Human data collected in-language teaches models how tone, politeness and authority function at a local level. Language-specific and human-driven annotation and feedback loops correct subtle but consequential errors.

Building AI For The World

The next generation of AI models will require exponentially more data. But the next frontier isn't just more data. It’s better, more representative data.

By training models on native-language data rather than translated data, and by evaluating them in every language you support, you can unlock scalable performance gains and build AI that genuinely reflects and understands your global users.


Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.

