A short history of AI, from Turing to today's LLMs
Seventy years of trying to make machines think — and how a single architecture from 2017 finally cashed the check that 1950s AI wrote.
Why it exists
If you only encountered AI in the last few years, the story looks like a flash: ChatGPT appeared in late 2022 and within a year everyone was talking about it. But the idea of a thinking machine is older than the transistor, and the people who chased it spent most of the intervening decades being wrong in interesting ways.
The history matters because almost every “new” idea you’ll read about — agents, reasoning, alignment, even the fear of superintelligence — was proposed, tried, abandoned, and revived at least once before. Knowing the arc helps you tell the genuinely new ideas from rebrands, and it explains why the current moment feels both inevitable and surprising at the same time.
Why it matters now
The current crop of LLMs inherited their architecture, their training tricks, and most of their vocabulary from work done long before the GPT era. The reason today’s models work isn’t a single breakthrough — it’s the accumulation of about five generations of partial successes plus, finally, enough compute and data to make the whole stack pay off.
Understanding that lineage is the difference between treating AI as magic and treating it as engineering. It also tells you where the next bend is likely: the field has a habit of solving a long-standing problem and then immediately discovering that the solution unlocks a new one.
The short answer
AI history ≈ symbolic logic → statistical learning → scaled neural nets
For seventy years, AI swung between two big ideas: write down rules that encode human reasoning (symbolic AI), or have machines learn patterns from data (statistical / connectionist AI). The second idea kept losing because the hardware and datasets weren’t there. Around 2012 they finally were, and the rest of the story is what happens when one approach starts working faster than anyone can keep up with.
How it works
A rough timeline, with the load-bearing ideas called out.
1950 — Turing’s question. Alan Turing publishes Computing Machinery and Intelligence, asks “can machines think?”, and proposes the imitation game as a way to dodge the philosophy and get to engineering. No machine could play it for decades, but the framing stuck.
1956 — Dartmouth, and the name “AI”. A summer workshop at Dartmouth coins the term artificial intelligence and sets the optimistic tone: participants believed a serious dent in the problem was a few years off. Early systems could prove geometry theorems and play checkers.
1960s–70s — Symbolic AI. The dominant bet was that intelligence = symbol manipulation. If you could express knowledge as rules and run logic over them, you’d get reasoning. This produced expert systems that worked impressively in narrow domains (MYCIN for infections, DENDRAL for chemistry) and failed completely outside them.
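To make the bet concrete: a toy expert system in the spirit of MYCIN, with knowledge written down as if-then rules and inference done by forward chaining. A minimal sketch; the rules and facts here are invented for illustration.

```python
# A toy expert system: knowledge as if-then rules, forward-chained over facts.
# The medical rules below are invented, in the spirit of MYCIN.

rules = [
    ({"fever", "cough"}, "flu_suspected"),
    ({"flu_suspected", "high_risk"}, "recommend_antivirals"),
]

def forward_chain(facts, rules):
    """Keep firing rules whose conditions hold until nothing new is derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

print(forward_chain({"fever", "cough", "high_risk"}, rules))
# Derives 'flu_suspected', then 'recommend_antivirals' (set order may vary).
```

Inside the rule set, this works beautifully. Hand it a fact it has no rule for and it derives nothing, which is the narrow-domain failure mode in miniature.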
Meanwhile — the connectionist underdog. Frank Rosenblatt’s perceptron (1958) showed a single-layer neural network could learn simple patterns from examples. Minsky and Papert’s 1969 book Perceptrons proved it couldn’t learn things as basic as XOR, and funding for neural nets evaporated. The first AI winter followed.
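Minsky's objection is easy to see in code. A minimal sketch of the perceptron rule: it converges on a linearly separable function like AND, and provably never on XOR, because no single line separates XOR's classes.

```python
import numpy as np

def train_perceptron(X, y, epochs=50, lr=0.1):
    """Rosenblatt's rule: nudge the weights toward misclassified points."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0   # threshold activation
            error = target - pred
            w += lr * error * xi
            b += lr * error
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# AND is linearly separable: the perceptron finds a separating line.
w, b = train_perceptron(X, np.array([0, 0, 0, 1]))
print([1 if x @ w + b > 0 else 0 for x in X])   # [0, 0, 0, 1]

# XOR is not: training never converges, so at least one input stays wrong.
w, b = train_perceptron(X, np.array([0, 1, 1, 0]))
print([1 if x @ w + b > 0 else 0 for x in X])   # not [0, 1, 1, 0]
```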
1986 — Backpropagation, take two. Rumelhart, Hinton, and Williams popularize backpropagation, which lets multi-layer networks actually learn. In principle this answered Minsky’s objection. In practice the networks were tiny, slow, and beaten by simpler statistical methods. A second AI winter arrived in the early 90s when expert systems also failed to scale.
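And here is that answer in code: the same XOR problem, now learnable because backpropagation pushes error gradients through a hidden layer. A minimal NumPy sketch with sigmoid units and plain gradient descent; the hidden size, learning rate, and step count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1 / (1 + np.exp(-x))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR

# Two-layer network: 2 inputs -> 4 hidden units -> 1 output.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

for _ in range(10_000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: the chain rule, layer by layer (squared-error loss).
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out
    b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h
    b1 -= 0.5 * d_h.sum(axis=0)

print(out.round(2).ravel())   # should land close to [0, 1, 1, 0]
```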
1990s–2000s — Statistical machine learning takes over. Translation, spam filtering, search ranking, speech recognition — all the practical “AI” wins came from classical machine learning: support vector machines, random forests, hidden Markov models. Quietly useful, not glamorous.
2012 — AlexNet and the GPU moment. A neural network called AlexNet wins the ImageNet image-recognition contest by a huge margin, using GPUs to train a model nobody could have trained on CPUs. Deep learning — neural nets with many layers — suddenly works. Within three years it dominates vision, speech, and translation.
2014–2016 — Sequence models. RNNs and especially LSTMs become the standard for text and speech. Useful, but they process tokens one at a time, which is slow to train and bad at long-range dependencies.
2017 — Attention is all you need. A Google paper introduces the transformer. Instead of stepping through a sequence, every token attends to every other token in parallel. This is the unlock: it trains far faster on GPUs and captures long-range structure that LSTMs missed. Almost every modern AI model — text, image, audio, video — is now a transformer or a close cousin.
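What "every token attends to every other token" means, in roughly ten lines. A sketch of scaled dot-product attention, the paper's core operation, stripped of the learned projections, multiple heads, and masking that the full architecture adds:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # (seq, seq): every token vs. every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                       # each token gets a weighted mix of values

seq_len, d_model = 5, 8
x = np.random.default_rng(1).normal(size=(seq_len, d_model))

# In a real transformer Q, K, V come from learned projections of x;
# passing x directly still shows the shape of the computation.
out = attention(x, x, x)
print(out.shape)   # (5, 8): one updated vector per token, computed in parallel
```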
2018–2020 — GPT and the scaling laws. OpenAI’s GPT-1, GPT-2, GPT-3 demonstrate something quietly radical: if you make the model bigger, feed it more text, and train longer, capabilities improve predictably. GPT-3 (175 billion parameters, 2020) could write essays, code, and pastiches with no task-specific training. The scaling laws turned AI progress from research lottery into capital expenditure.
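"Predictably" is meant literally: Kaplan et al. fit loss to a power law in model size. A sketch of that relationship; the constants below are approximately the paper's fitted values for their setup, so treat the exact numbers as illustrative.

```python
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Kaplan et al.'s model-size law, L(N) = (N_c / N)^alpha.
    n_c and alpha are roughly the paper's fitted constants."""
    return (n_c / n_params) ** alpha

# Loss falls smoothly as parameter count grows, which is what made
# training-run budgets plannable:
for n in [1e8, 1e9, 1e10, 1.75e11]:   # up to GPT-3 scale
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
```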
2022 — ChatGPT and RLHF. GPT-3 was powerful but raw. ChatGPT bolted on RLHF: fine-tune the base model to follow instructions, then use human ratings to shape its behavior into a helpful, mostly-honest assistant. This wasn’t a capability leap so much as a product leap — and it broke containment. A hundred million users in two months.
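The reward-model half of RLHF is surprisingly small. Given two candidate answers and a human preference between them, the model is trained with a Bradley-Terry-style loss to score the chosen one higher. A minimal sketch of that loss; in practice the scores come from a full LLM with a scalar head, not a toy function.

```python
import numpy as np

def reward_model_loss(score_chosen, score_rejected):
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).
    Minimizing it pushes the reward model to rank the human-preferred
    answer above the rejected one."""
    return -np.log(1 / (1 + np.exp(-(score_chosen - score_rejected))))

# Ranking the preferred answer higher gives a low loss...
print(reward_model_loss(2.0, -1.0))   # ~0.05
# ...while preferring the rejected answer is penalized heavily.
print(reward_model_loss(-1.0, 2.0))   # ~3.05
```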
2023–2025 — Multimodality, tools, and agents. Models learn to see images (GPT-4V, Claude 3), generate them (DALL·E, Midjourney, Stable Diffusion), and use external tools by calling them as functions. This is when the agent pattern crystallizes: an LLM in a loop, deciding what to do next and calling tools to do it. See agent.
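The loop itself is almost embarrassingly short. A hedged sketch of the pattern; llm and the tools dictionary are hypothetical stand-ins, not any real model API.

```python
# Minimal agent loop: the LLM decides, tools act, results feed back in.
# `llm` and `tools` are hypothetical stand-ins for a real model API
# and real integrations.

def agent(task, llm, tools, max_steps=10):
    """Run an LLM in a loop: each turn it either answers or requests a tool."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = llm(history, tools=list(tools))   # model chooses the next step
        history.append(reply)
        if reply["type"] == "final_answer":       # model decided it is done
            return reply["content"]
        # The model asked for a tool: run it and feed the result back in.
        result = tools[reply["tool"]](**reply["arguments"])
        history.append({"role": "tool", "name": reply["tool"], "content": result})
    return "Stopped after max_steps without a final answer."
```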
2024–2026 — Reasoning and long horizons. OpenAI’s o-series and DeepSeek-R1 introduce models that think before answering — generating long internal chains of reasoning and being trained, via reinforcement learning, to do this well. Context windows reach a million tokens; agents start completing work that takes hours instead of seconds. Open-weights models catch up with frontier closed ones on most benchmarks. The current frontier is no longer “can the model answer?” but “can it stay coherent across a day of work?”
Famous related terms
- Symbolic AI — symbolic AI = rules + logic engine — the "write down what you know" school. Dominant 1956–~1990, still alive in formal verification.
- Connectionism — connectionism ≈ many tiny units learning together — the neural-net family. Lost the early debates, won the war.
- AI winter — AI winter ≈ funding collapse after over-promising — periodic collapses in funding and interest after the field over-promised. There have been at least two big ones (~1974 and ~1990).
- Perceptron — perceptron = single layer + threshold function — the original learnable neural unit (Rosenblatt, 1958).
- Backpropagation — backprop = chain rule + propagate errors backward through layers — the algorithm that makes deep networks trainable; forgotten, rediscovered, and now the foundation of every model in use.
- AlexNet — AlexNet = deep CNN + GPU training + ImageNet (2012) — the 2012 image classifier that proved GPUs + deep nets + lots of data could beat anything else. The starting gun for the deep learning era.
- Transformer — transformer ≈ stack of (attention + feed-forward) layers — the architecture under every modern frontier model.
- Scaling laws — the empirical observation that bigger model + more data + more compute keeps improving performance, predictably. Why labs spend billions on training runs.
- RLHF — RLHF = supervised fine-tune + reward model + RL loop — the alignment step that turned a raw next-token predictor into something that feels like an assistant.
Going deeper
- Computing Machinery and Intelligence (Turing, 1950) — the founding paper. Shorter and more readable than you’d expect.
- Perceptrons (Minsky & Papert, 1969) — the book that ended the first neural net era. Worth reading for how confidently it was wrong about scale.
- Attention Is All You Need (Vaswani et al., 2017) — the transformer paper.
- Scaling Laws for Neural Language Models (Kaplan et al., 2020) — the paper that turned model size into a budgeting question.
- Rich Sutton’s The Bitter Lesson (2019) — a one-page essay arguing that every time we tried to build in human knowledge, raw scale eventually beat us. The clearest summary of what the last seventy years actually taught.
- Genius Makers (Cade Metz, 2021) — narrative history of the deep learning revival, organized around the people.