AI

Tokens are a big reason today’s generative AI falls short

Comment

LLM word with icons as vector illustration. AI concept of Large Language Models
Image Credits: Getty Images

Generative AI models don’t process text the same way humans do. Understanding their “token”-based internal environments may help explain some of their strange behaviors — and stubborn limitations.

Most models, from small on-device ones like Gemma to OpenAI’s industry-leading GPT-4o, are built on an architecture known as the transformer. Due to the way transformers conjure up associations between text and other types of data, they can’t take in or output raw text — at least not without a massive amount of compute.

So, for reasons both pragmatic and technical, today’s transformer models work with text that’s been broken down into smaller, bite-sized pieces called tokens — a process known as tokenization.

Tokens can be words, like “fantastic.” Or they can be syllables, like “fan,” “tas” and “tic.” Depending on the tokenizer — the model that does the tokenizing — they might even be individual characters in words (e.g., “f,” “a,” “n,” “t,” “a,” “s,” “t,” “i,” “c”).

Using this method, transformers can take in more information (in the semantic sense) before they reach an upper limit known as the context window. But tokenization can also introduce biases.

Some tokens have odd spacing, which can derail a transformer. A tokenizer might encode “once upon a time” as “once,” “upon,” “a,” “time,” for example, while encoding “once upon a ” (which has a trailing whitespace) as “once,” “upon,” “a,” ” .” Depending on how a model is prompted — with “once upon a” or “once upon a ,” — the results may be completely different, because the model doesn’t understand (as a person would) that the meaning is the same.

Tokenizers treat case differently, too. “Hello” isn’t necessarily the same as “HELLO” to a model; “hello” is usually one token (depending on the tokenizer), while “HELLO” can be as many as three (“HE,” “El,” and “O”). That’s why many transformers fail the capital letter test.

“It’s kind of hard to get around the question of what exactly a ‘word’ should be for a language model, and even if we got human experts to agree on a perfect token vocabulary, models would probably still find it useful to ‘chunk’ things even further,” Sheridan Feucht, a PhD student studying large language model interpretability at Northeastern University, told TechCrunch. “My guess would be that there’s no such thing as a perfect tokenizer due to this kind of fuzziness.”

This “fuzziness” creates even more problems in languages other than English.

Many tokenization methods assume that a space in a sentence denotes a new word. That’s because they were designed with English in mind. But not all languages use spaces to separate words. Chinese and Japanese don’t — nor do Korean, Thai or Khmer.

A 2023 Oxford study found that, because of differences in the way non-English languages are tokenized, it can take a transformer twice as long to complete a task phrased in a non-English language versus the same task phrased in English. The same study — and another — found that users of less “token-efficient” languages are likely to see worse model performance yet pay more for usage, given that many AI vendors charge per token.

Tokenizers often treat each character in logographic systems of writing — systems in which printed symbols represent words without relating to pronunciation, like Chinese — as a distinct token, leading to high token counts. Similarly, tokenizers processing agglutinative languages — languages where words are made up of small meaningful word elements called morphemes, such as Turkish — tend to turn each morpheme into a token, increasing overall token counts. (The equivalent word for “hello” in Thai, สวัสดี, is six tokens.)

In 2023, Google DeepMind AI researcher Yennie Jun conducted an analysis comparing the tokenization of different languages and its downstream effects. Using a dataset of parallel texts translated into 52 languages, Jun showed that some languages needed up to 10 times more tokens to capture the same meaning in English.

Beyond language inequities, tokenization might explain why today’s models are bad at math.

Rarely are digits tokenized consistently. Because they don’t really know what numbers are, tokenizers might treat “380” as one token, but represent “381” as a pair (“38” and “1”) — effectively destroying the relationships between digits and results in equations and formulas. The result is transformer confusion; a recent paper showed that models struggle to understand repetitive numerical patterns and context, particularly temporal data. (See: GPT-4 thinks 7,735 is greater than 7,926).

That’s also the reason models aren’t great at solving anagram problems or reversing words.

So, tokenization clearly presents challenges for generative AI. Can they be solved?

Maybe.

Feucht points to “byte-level” state space models like MambaByte, which can ingest far more data than transformers without a performance penalty by doing away with tokenization entirely. MambaByte, which works directly with raw bytes representing text and other data, is competitive with some transformer models on language-analyzing tasks while better handling “noise” like words with swapped characters, spacing and capitalized characters.

Models like MambaByte are in the early research stages, however.

“It’s probably best to let models look at characters directly without imposing tokenization, but right now that’s just computationally infeasible for transformers,” Feucht said. “For transformer models in particular, computation scales quadratically with sequence length, and so we really want to use short text representations.”

Barring a tokenization breakthrough, it seems new model architectures will be the key.

More TechCrunch

Amazon is considering shifting its payments offerings in India into a standalone app, three sources familiar with the matter told TechCrunch, as the e-commerce giant aims to boost usage of…

Amazon considers moving Amazon Pay into a standalone app in India

Root helps food and beverage companies collect primary data on their agricultural supply chains. 

As CO2 emissions from supply chains come into focus, this startup is aiming at farms

In May, the African fintech processed up to $70 million in monthly payment volume.

Waza comes out of stealth with $8M to power global trade for African businesses

This post contains spoilers for the movie “Alien: Romulus” In the long-running “Alien” movie franchise, the Weyland-Yutani Corporation can’t seem to let go of a terrible idea: It keeps trying…

Digitally resurrecting actors is still a terrible idea

Thomas Ingenlath is having perhaps a little too much fun in his Polestar 3, silently rocketing away from stop signs and swinging through tightening bends, grinning like a man far…

With the Polestar 3 now “weeks” away, its CEO looks to make company “self-sustaining”

Some parents have reservations about the South Korean government’s plans to bring tablets with AI-powered textbooks into classrooms, according to a report in The Financial Times. The tablets are scheduled…

South Korea’s AI textbook program faces skepticism from parents

Featured Article

How VC Pippa Lamb ended up on ‘Industry’ — one of the hottest shows on TV

Season 3 of “Industry” focuses on the fictional bank Pierpoint blends the worlds — and drama — of tech, media, government, and finance.

How VC Pippa Lamb ended up on ‘Industry’ — one of the hottest shows on TV

Featured Article

Selling a startup in an ‘acqui-hire’ is more lucrative than it seems, founders and VCs say

Selling under such circumstances is often not as poor of an outcome for founders and key staff as it initially seems. 

Selling a startup in an ‘acqui-hire’ is more lucrative than it seems, founders and VCs say

While the rapid pace of funding has slowed, many fintechs are continuing to see growth and expand their teams.

These  fintech companies are hiring, despite a rough market in 2024

This is just one area of leadership where Parker Conrad takes a contrarian approach. He also said he doesn’t believe in top-down management.

Rippling’s Parker Conrad says founders should ‘go all the way to the ground’ to run their companies

Congresswoman Nancy Pelosi issued a statement late yesterday laying out her opposition to SB 1047, a California bill that seeks to regulate AI. “The view of many of us in…

Nancy Pelosi criticizes California AI bill as ‘ill-informed’

Data analytics company Palantir has faced criticism and even protests over its work with the military, police, and U.S. Immigration and Customs Enforcement, but co-founder and CEO Alex Karp isn’t…

Palantir CEO Alex Karp is ‘not going to apologize’ for military work

Timo Resch is basking in the sun. That’s literally true, as we speak on a gloriously clear California day at the Quail, one of Monterey Car Week’s most prestigious events.…

Why Porsche NA CEO Timo Resch is betting on ‘choice’ to survive the turbulent EV market

Made by Google was this week, featuring a full range of reveals from Google’s biggest hardware event. Google unveiled its new lineup of Pixel 9 phones, including the $1,799 Pixel…

Google takes on OpenAI with Gemini Live

I’ve been playing around with OpenAI’s Advanced Voice Mode for the last week, and it’s the most convincing taste I’ve had of an AI-powered future yet. This week, my phone…

OpenAI’s new voice mode let me talk with my phone, not to it

X, the social media platform formerly known as Twitter, said today that it’s ending operations in Brazil, although the service will remain available to users in the country. The announcement…

X says it’s closing operations in Brazil

One of the biggest questions looming over the drone space is how to best use the tech. Inspection has become a key driver, as the autonomous copters are deployed to…

Ikea expands its inventory drone fleet

Brands can use Keychain to look up different products and see who actually manufactures them.

Keychain aims to unlock a new approach to manufacturing consumer goods

In this post, we explain the many Microsoft Copilots available and what they do, and highlight the key differences between each.

Microsoft Copilot: Everything you need to know about Microsoft’s AI

A hack on UnitedHealth-owned tech giant Change Healthcare likely stands as one of the biggest data breaches of U.S. medical data in history.

How the ransomware attack at Change Healthcare went down: A timeline

Gogoro has deferred its India plans over delay in government incentives, but the Taiwanese company has partnered with Rapido for a bike-taxi pilot.

Gogoro delays India plans due to policy uncertainty, launches bike-taxi pilot with Rapido

On Friday, the venture firm Andreessen Horowitz tweeted out a link to its guide on how to “build your social media presence” which features advice for founders.

A16z offers social media tips after its founder’s ‘attack’ tweet goes viral

OpenAI has banned a cluster of ChatGPT accounts linked to an Iranian influence operation that was generating content about the U.S. presidential election, according to a blog post on Friday.…

OpenAI shuts down election influence operation that used ChatGPT

Apple is reportedly shifting into the world of home robots after the wheels came off its electric car. According to a new report from Bloomberg, a team of several hundred…

Apple reportedly has ‘several hundred’ working on a robot arm with attached iPad

Welcome to Startups Weekly — your weekly recap of everything you can’t miss from the world of startups. I’m Anna Heim from TechCrunch’s international team, and I’ll be writing this newsletter…

Another week in the circle of startup life

MIT this week showcased tiny batteries designed specifically for the purpose of power these systems to execute varied tasks.

Researchers develop hair-thin battery to power tiny robots

Rimac revealed Friday during The Quail, a Motorsports Gathering at Monterey Car Week the Nevera R, an all-electric hypercar that’s meant to push the performance bounds of its predecessor.

The Nevera R all-new electric hypercar can hit a top speed of 217 mph, and it only starts at $2.5 million

While the ethics of AI-generated porn are still under debate, using the technology to create nonconsensual sexual imagery of people is, I think we can all agree, reprehensible. One such…

A hellish new AI threat: ‘Undressing’ sites targeted by SF authorities

Almost two weeks ago, TechCrunch reported that African e-commerce giant Jumia was planning to sell 20 million American depositary shares (ADSs) and raise more than $100 million, given its share…

African e-commerce company Jumia completes sale of secondary shares at $99.6M

We’re entering the final week of discounted rates for TechCrunch Disrupt 2024. Save up to $600 on select individual ticket types until August 23. Join a dynamic crowd of over…

Only 7 days left to save on TechCrunch Disrupt 2024 tickets