
The History of LLMs and How They Crossed the Usefulness Threshold

From statistical language models to GPT-4 and beyond, tracing how large language models evolved from research curiosities into indispensable tools that crossed the threshold of genuine usefulness.

Mar 6, 2026 | 10 min read | ai, llms, history

// quick_hits

  • Language models existed for decades before transformers, but they couldn't do anything genuinely useful at scale.
  • The 2017 transformer architecture was the single biggest unlock in modern AI.
  • The capability threshold was crossed in 2022–2023 when models got smart enough. The practical usefulness threshold fell in Sep 2025 when speed, cost, and tooling caught up.
  • Today's LLMs aren't just better autocomplete. They reason, write code, and operate as semi-autonomous agents embedded in real workflows.

Before Transformers: Decades of "Almost Useful"

Language modeling is not new. Researchers have been building statistical models of language since the early 1990s. N-gram models, hidden Markov models, and early recurrent neural networks all attempted the same fundamental task: predict the next word given previous context.

These models worked, technically. They could autocomplete short phrases, power basic spell-checkers, and assist with machine translation. But they had a hard ceiling. They couldn't hold context beyond a few dozen words, they couldn't reason about what they were saying, and they certainly couldn't produce anything a human would mistake for thoughtful writing.

The RNN and LSTM Era

Recurrent Neural Networks and their improved variant, Long Short-Term Memory networks, pushed the boundary further in the 2010s. LSTMs could theoretically maintain context over longer sequences, and they powered real products such as Google's neural machine translation and early voice assistants.

But "theoretically" did a lot of heavy lifting in that sentence. In practice, LSTMs struggled with long documents, trained slowly because they processed tokens sequentially, and hit diminishing returns when you tried to scale them up. You could make them bigger, but they didn't get proportionally better.

The field felt stuck. Language AI was useful in narrow, controlled applications, but nobody was using it to draft emails, write code, or answer complex questions. The technology was a tool for engineers, not for everyone.

Word Embeddings: A Quiet Revolution

One genuinely important advance from this era was word embeddings. Word2Vec (2013) and GloVe (2014) showed that you could represent words as vectors in high-dimensional space, where geometric relationships captured semantic meaning. "King minus man plus woman equals queen" became the canonical example.

This was more than a party trick. Embeddings gave neural networks a way to understand that words have relationships, not just frequencies. Every modern LLM builds on this insight, even if the specific technique has been superseded.
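The famous analogy can be demonstrated with plain vector arithmetic. A minimal sketch, using tiny hand-picked 3-dimensional vectors rather than real learned embeddings (real Word2Vec vectors have hundreds of dimensions fit from billions of words):

```python
import math

# Toy embeddings chosen by hand so the analogy works; illustrative only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
}

def cosine(a, b):
    """Cosine similarity: how aligned two vectors are, ignoring length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman: remove the "male" direction, add the "female" one.
target = [k - m + w for k, m, w in zip(vectors["king"],
                                       vectors["man"],
                                       vectors["woman"])]

# The nearest word to the result is "queen".
nearest = max(vectors, key=lambda word: cosine(vectors[word], target))
```

The point is that direction in the embedding space encodes a relationship (here, gender), so the same offset vector applies to many word pairs at once.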

The Transformer Breakthrough

In 2017, a team at Google published "Attention Is All You Need." The paper introduced the transformer architecture, and it changed everything.

The key innovation was self-attention: a mechanism that lets every token in a sequence attend to every other token simultaneously, rather than processing them one at a time. This solved the sequential bottleneck of RNNs and allowed models to capture long-range dependencies efficiently.
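The mechanism is compact enough to sketch directly. This is scaled dot-product self-attention in NumPy, stripped of the learned query/key/value projections and multiple heads a real transformer layer would have:

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(x):
    """Single-head self-attention with no learned weights (illustration only):
    queries, keys, and values are all the raw token vectors themselves."""
    d = x.shape[-1]
    # Every token scores its relevance to every other token, in parallel.
    scores = x @ x.T / np.sqrt(d)
    # Softmax turns each row of scores into attention weights summing to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted mix of ALL input tokens.
    return weights @ x

tokens = rng.normal(size=(5, 8))  # 5 tokens, 8-dimensional embeddings
out = self_attention(tokens)      # same shape: (5, 8)
```

Note that nothing here is sequential: the score matrix for all token pairs is computed in one matrix multiply, which is exactly the property that let transformers saturate GPUs where RNNs could not.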

Why Transformers Changed the Game

Three properties of transformers made them uniquely powerful:

Parallelism. Unlike RNNs, transformers process all tokens simultaneously during training. This meant you could throw more GPUs at the problem and actually get proportional speedups. Training time went from "months" to "days" for the same model size.

Scalability. Transformers exhibited a remarkable property: they kept getting better as you made them bigger. This wasn't true of previous architectures, which hit diminishing returns. With transformers, researchers discovered scaling laws that predicted performance improvements from adding more parameters, more data, and more compute.

Flexible attention. The self-attention mechanism let transformers learn which parts of the input matter for each part of the output. This made them excellent at capturing the kind of nuanced, context-dependent patterns that language requires.

GPT-1 and GPT-2: Proof of Concept

OpenAI's GPT-1 (2018) demonstrated that you could pre-train a transformer on a large corpus of text and then fine-tune it for specific tasks. The results were good but not earth-shattering.

GPT-2 (2019) was when people started paying attention. With 1.5 billion parameters, it could generate surprisingly coherent paragraphs of text. OpenAI initially withheld the full model over concerns about misuse, which generated enormous publicity and signaled that something qualitatively different was happening.

But GPT-2 was still firmly below the usefulness threshold. It could produce plausible-sounding text, but it hallucinated freely, couldn't follow complex instructions, and had no real understanding of what it was saying. It was a fascinating demo, not a tool.

Scaling Laws and Emergent Abilities

GPT-3 arrived in 2020 with 175 billion parameters, over 100x larger than GPT-2. It demonstrated something the field had theorized but never seen at this scale: emergent abilities.

What "Emergent" Actually Means

Emergent abilities are capabilities that appear in larger models but are essentially absent in smaller ones. They don't improve gradually as you scale up. Instead, there's a sharp transition: below a certain model size, the ability doesn't exist; above it, the ability works reliably.

GPT-3 showed emergent few-shot learning. You could give it a few examples of a task in the prompt, and it would generalize to new instances without any fine-tuning. This was qualitatively different from anything that came before. Previous models needed explicit training for each task. GPT-3 could learn tasks from context alone.
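Concretely, few-shot prompting just means packing worked examples into the input text. A sketch of what such a prompt looks like; the task and word pairs here are illustrative, not taken from any paper:

```python
# Hypothetical few-shot prompt: the model sees a few worked examples in its
# context window and continues the pattern, with no fine-tuning involved.
examples = [
    ("cheese", "fromage"),
    ("house", "maison"),
    ("cat", "chat"),
]

prompt = "Translate English to French.\n\n"
for english, french in examples:
    prompt += f"English: {english}\nFrench: {french}\n\n"

# The model is asked to complete the final line.
prompt += "English: dog\nFrench:"
```

Before GPT-3, making a model translate required a translation training run. After GPT-3, the "training" lived entirely in the prompt, disposable and swappable per request.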

The Scaling Hypothesis

This period established the scaling hypothesis: the idea that simply making models bigger, training them on more data, and using more compute would continue to unlock new capabilities. Research from OpenAI and others showed smooth, predictable power-law relationships between compute and loss (the model's prediction error).

The implications were staggering. If the scaling laws held, you didn't need algorithmic breakthroughs to get dramatically better models. You needed bigger GPU clusters and more training data. AI progress became, in part, an engineering and capital allocation problem.
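The power-law shape is worth seeing in symbols: loss falls as compute raised to a small negative exponent. A sketch with illustrative constants, not the fitted values from any particular paper:

```python
def loss_from_compute(compute, c_crit=1.0, alpha=0.05):
    """Power-law scaling: loss ~ (c_crit / compute) ** alpha.
    c_crit and alpha are illustrative constants, chosen for the example."""
    return (c_crit / compute) ** alpha

# Doubling compute always shrinks loss by the same constant factor,
# 2 ** -alpha, no matter where on the curve you start. That is what makes
# the trend a straight line on a log-log plot, and hence predictable.
ratio = loss_from_compute(2.0) / loss_from_compute(1.0)
```

A 5% exponent sounds tiny, but compounded across many doublings of compute it adds up, and crucially it let labs forecast the returns on a training run before spending the money.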

Crossing the Capability Threshold (2022–2024)

There are really two thresholds in this story, and conflating them is one of the most common mistakes people make when talking about AI progress.

The first is the capability threshold: the point where models became smart enough to do real work. This happened between late 2022 and early 2024. The models could reason, write code, and handle complex tasks. But capability alone wasn't enough.

ChatGPT and the Interface Revolution

When OpenAI released ChatGPT in November 2022, the underlying model (GPT-3.5) wasn't dramatically more capable than GPT-3. What changed was the interface. By wrapping the model in a conversational chat UI and fine-tuning it with Reinforcement Learning from Human Feedback (RLHF), OpenAI made it accessible to anyone who could type a question.

ChatGPT reached 100 million users in two months, the fastest adoption of any consumer application up to that point. People weren't using it because it was perfect. They were using it because it was capable enough to be interesting. It could draft emails, explain code, summarize documents, and brainstorm ideas at a level that hinted at real productivity.

But "capable enough to be interesting" is not the same as "useful enough to depend on."

GPT-4 and the Reliability Jump

GPT-4 (March 2023) crossed a second capability milestone: reliability. While GPT-3.5 was capable but frequently wrong, GPT-4 was capable and often right. It could pass the bar exam, score in the 90th percentile on the SAT, write working code for complex tasks, and reason through multi-step problems.

GPT-4 was the first model that many professionals genuinely wanted to use every day. But there was a gap between "this model is smart" and "this model is deployed in my workflow." GPT-4 was slow, expensive ($60–75/M tokens for the best models), and the tooling to integrate it into real work barely existed. Most people interacted with it through a chat window and copy-pasted results.

The Open Source Surge

Simultaneously, the open-source community exploded. Meta's LLaMA models, Mistral, and dozens of fine-tuned variants proved that you didn't need OpenAI's scale to build capable models. This mattered because it meant AI capability couldn't be bottlenecked by a single company.

By 2024, organizations could run capable language models on their own infrastructure, fine-tuned on their own data, without sending anything to an external API. This addressed the security and privacy concerns that had been blocking enterprise adoption.

Crossing the Practical Usefulness Threshold (Fall 2025)

The capability threshold and the practical usefulness threshold are different things. Capability asks: "Can the model do the task?" Practical usefulness asks: "Can a real person, in a real workflow, actually depend on this model to do the task faster, cheaper, and more reliably than doing it themselves?"

That second threshold requires more than raw intelligence. It requires speed (responses in seconds, not minutes), cost (affordable enough to run thousands of queries daily), and tooling (IDE integrations, agent frameworks, CI/CD pipelines, APIs that just work).

By mid-2025, the models were brilliant but still too slow, too expensive, and too clunky for widespread deployment. o3 posted remarkable benchmark scores but cost a fortune to run. Claude Opus 4 achieved 72.5% on SWE-bench but at $75/M tokens with limited agent tooling. These models were like Formula 1 cars: extraordinarily capable, but impractical for the daily commute.

The practical usefulness threshold was crossed in September 2025. Claude Sonnet 4.5 shipped with mature Claude Code integration, fast inference, and dramatically lower cost. For the first time, a model was simultaneously capable enough, fast enough, cheap enough, and well-tooled enough that teams could embed it into their actual daily workflows. Not as a demo. Not as a side experiment. As infrastructure.

What followed was rapid: Claude Opus 4.5 (67% cheaper than its predecessor), GPT-5.2 (400K context, fast, reliable), and a wave of mature tooling that made deployment straightforward. The ecosystem crossed the threshold together — models, tooling, cost, and speed all converging at once.

Where We Are Now

As of early 2026, LLMs have firmly crossed the usefulness threshold and are deep into the integration phase. The question is no longer "are these useful?" but "how do we build reliable systems around them?"

From Chat to Agents

The most significant shift since 2024 has been the move from chat-based interaction to agentic systems. Modern LLMs don't just answer questions. They use tools, write and execute code, browse the web, manage files, and coordinate multi-step workflows. Claude, GPT-5.2, and other frontier models can now operate as semi-autonomous agents that accomplish complex tasks with minimal human oversight.

This is the real threshold crossing. A chatbot that answers questions is useful. An agent that does work is transformative.

The Reasoning Revolution

Models like Claude and OpenAI's o-series have demonstrated genuine reasoning capabilities: the ability to think through problems step by step, consider alternatives, catch their own errors, and arrive at correct answers for problems that require multi-step logic.

This matters because reasoning is what separates "fancy autocomplete" from "useful thinking partner." When a model can reason, it can handle novel situations it wasn't explicitly trained on. That's the capability that makes AI useful for real work rather than just pattern-matching against training data.

What Hasn't Changed

For all the progress, some fundamental challenges remain. LLMs still hallucinate, though less frequently. They still struggle with tasks that require precise counting, complex math, or perfect factual recall. They work best when humans stay in the loop, reviewing outputs and providing course corrections.

The models are also expensive to run at scale, and the environmental cost of training is significant. These are real constraints, even if they're being actively worked on.

The Two Thresholds Behind Us

The most important thing to understand about LLMs in 2026 is that both thresholds are behind us. The capability threshold fell in 2022–2023 when models got smart enough. The practical usefulness threshold fell in September 2025 when the full stack — capability, speed, cost, and tooling — converged.

We're not debating whether AI is useful anymore. We're debating how to deploy it responsibly, how to integrate it into existing workflows, and how to build organizations that can actually capture the value these models create.

That last question, how to build organizations that succeed with AI, turns out to be harder than building the models themselves. The technology works. The challenge now is everything around it: the people, the processes, the data, the tooling, and the measurement. Getting that right is its own discipline entirely.

// references

  • https://arxiv.org/abs/1706.03762
  • https://openai.com/research/gpt-4
  • https://arxiv.org/abs/2001.08361

// faq

What is the usefulness threshold for LLMs?

There are two thresholds. The capability threshold (2022–2023) is when models got smart enough to do real work. The practical usefulness threshold (Sep 2025) is when models became simultaneously capable, fast, affordable, and well-tooled enough to embed into daily workflows at scale.

When were large language models invented?

Statistical language models date back to the 1990s, but the modern era of large language models began with the transformer architecture in 2017 and the subsequent scaling of models like GPT-2 (2019), GPT-3 (2020), and beyond.

Why did LLMs suddenly get useful?

A combination of the transformer architecture, massive compute scaling, better training data, and techniques like RLHF converged to push models past a capability threshold where emergent abilities appeared that were not present in smaller models.

Are LLMs still improving?

Yes. Models continue to improve in reasoning, tool use, multimodal understanding, and reliability. The rate of improvement shows no clear signs of slowing as of 2026.

// key_takeaway

LLMs crossed two thresholds: the capability threshold (2022–2023) when models got smart enough, and the practical usefulness threshold (Sep 2025) when speed, cost, and tooling converged. The second threshold is what changed everything: capability without deployability was just an impressive demo.