The chief technology officer of a robotics startup told me earlier this year, “We thought we’d have to do a lot of work to build ‘ChatGPT for robotics.’ Instead, it turns out that, in a lot of cases, ChatGPT is ChatGPT for robotics.”
Until recently, AI models were specialized tools. Using AI in a particular area, like robotics, meant spending time and money creating AI models specifically and only for that area. For example, Google’s AlphaFold, an AI model for predicting protein folding, was trained using protein structure data and is only useful for working with protein structures.
So this founder thought that to benefit from generative AI, the robotics company would need to create its own specialized generative AI models for robotics. Instead, the team discovered that for many cases, they could use off-the-shelf ChatGPT for controlling their robots without the AI having ever been specifically trained for it.
I’ve heard similar things from technologists working on everything from health insurance to semiconductor design. To create ChatGPT, a chatbot that lets humans use generative AI by simply having a conversation, OpenAI needed to change large language models (LLMs) like GPT3 to become more responsive to human interaction.
But perhaps inadvertently, these same changes let the successors to GPT3, like GPT3.5 and GPT4, be used as powerful, general-purpose information-processing tools—tools that aren’t dependent on the knowledge the AI model was originally trained on or the applications the model was trained for. This requires using the AI models in a completely different way—programming instead of chatting, new data instead of training. But it’s opening the way for AI to become general purpose rather than specialized, more of an “anything tool.”
Now, an important caveat in a time of AI hype: When I say “general purpose” and “anything tool,” I mean in the way CPUs are general purpose versus, say, specialized signal-processing chips. They are tools that can be used for a wide variety of tasks, not all-powerful and all-knowing. And just like good programmers don’t deploy code to production without code review and unit tests, AI output will need its own processes and procedures. The applications I talk about below are tools for multiplying human productivity, not autonomous agents running amok. But the important thing is to recognize what AI can usefully do.
So to that end, how did we get here?
Fundamentals: Probability, gradient descent, and fine-tuning
Let’s take a moment to touch on how the LLMs that power generative AI work and how they’re trained.
LLMs like GPT4 are probabilistic; they take an input and predict the probability of words and phrases relating to that input. They then generate an output that is most likely to be appropriate given the input. It’s like a very sophisticated auto-complete: Take some text, and give me what comes next. Fundamentally, it means that generative AI doesn’t live in a context of “right and wrong” but rather “more and less likely.”
Being probabilistic has strengths and weaknesses. The weaknesses are well-known: Generative AI can be unpredictable and inexact, prone to not just producing bad output but producing it in ways you’d never expect. But it also means the AI can be unpredictably powerful and flexible in ways that traditional, rule-based systems can’t be. We just need to shape that randomness in a useful way.
Here’s an analogy. Before quantum mechanics, physicists thought the universe worked in predictable, deterministic ways. The randomness of the quantum world came as a shock at first, but we learned to embrace quantum weirdness and then use it practically. Quantum tunneling is fundamentally stochastic, but it can be guided so that particles jump in predictable patterns. This is what led to semiconductors and the chips powering the device you’re reading this article on. Don’t just accept that God plays dice with the universe—learn how to load the dice.
The same thing applies to AI. We train the neural networks that LLMs are made of using a technique called “gradient descent.” Gradient descent looks at the outputs a model is producing, compares that with training data, and then calculates a “direction” to adjust the neural network’s parameters so that the outputs become “more” correct—that is, to look more like the training data the AI is given. In the case of our magic auto-complete, a more correct answer means output text that is more likely to follow the input.
Probabilistic math is a great way for computers to deal with words; computing how likely some words are to follow other words is just counting, and “how many” is a lot easier for a computer to work with than “more right or more wrong.” Produce output, compare with the training data, and adjust. Rinse and repeat, making many small, incremental improvements, and eventually you’ll turn a neural network that spits out gibberish into something that produces coherent sentences. And this technique can also be adapted to pictures, DNA sequences, and more.
Gradient descent is a big deal. It means that creating AI models is iterative. You start with a model that produces mostly wrong (less likely) output and train it until it produces mostly right (more likely) output. But gradient descent also lets you take an existing model that’s already been trained and tweak it to your liking.
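The produce-compare-adjust loop can be sketched in a few lines. This is a toy example, not how LLMs are actually trained: it fits a single parameter rather than billions, and a real model’s “more likely” objective is far more complex. But the iterative structure is the same.

```python
# Toy gradient descent: fit y = w * x to training data generated with w = 3.
# Real LLM training adjusts billions of parameters, but the loop is the same:
# produce output, compare with training data, nudge parameters, repeat.

training_data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # (input, target) pairs

w = 0.0             # start with a "wrong" model
learning_rate = 0.02

for step in range(500):
    for x, target in training_data:
        output = w * x                 # produce output
        error = output - target        # compare with the training data
        gradient = 2 * error * x       # "direction" that makes output more correct
        w -= learning_rate * gradient  # small, incremental adjustment

print(round(w, 3))  # converges toward 3.0
```

Each pass makes the model’s output slightly “more correct”; run enough passes and a model that starts out wrong ends up right.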
This is what enables one of the most powerful techniques in modern AI: fine-tuning. Fine-tuning is a way of using gradient descent to take an AI model that’s already been trained and specializing it in a particular way by training it on a curated set of data. The training uses gradient descent from an already working model to make the AI better at managing or producing that specific kind of data.
For example, people have taken the Stable Diffusion model, which creates images from text, and fine-tuned it to be especially good at making anime images or landscapes. And of course there are people fine-tuning language models designed for text to work better with advertising copy, legal documents, and so forth.
But fine-tuning goes well beyond getting AI models to specialize in particular subject areas. It can also be used to train AI models how to respond and produce output. It’s a powerful tool that has played a big role in moving generative AI forward, including the two key innovations that led to the creation of ChatGPT.
Innovation 1: Following commands
When OpenAI released GPT2 back in 2019, it was an exciting curiosity that could write realistic stories based on short prompts. For example, it wrote some cool ones about unicorns in the Andes.
But using GPT2 was like playing a word-association game: output followed from the input, but not in ways you could control. Say you wanted that article about unicorns in the Andes. What if you wanted it as a long-form piece? Or as a light, breezy, bulleted list instead?
Working with GPT2 was like trying to paint with a spray can but no stencils, and with a nozzle that doesn’t let you control the width of the spray. It’s possible to create art, but it’s very difficult to make the specific painting you have in mind.
The problem of getting AI to do what a human user wants is called “AI alignment.” You might have heard people talking about the “Big-A” alignment problems of building AI into society in such a way that it aligns with our ethics and doesn’t kill us all. But there’s also a “small-A” alignment problem: How do you make the output of a generative AI system more controllable by the human user? This is also sometimes called “steerability.”
GPT3 was a step forward from GPT2 in the length and complexity of the text that it could generate. But just as importantly, it featured a breakthrough in alignment: GPT3 could explicitly follow commands. OpenAI’s researchers realized that by fine-tuning GPT3 with examples of commands paired with responses to those commands, they could make GPT3 understand how to explicitly follow commands and answer questions.
This is a natural extension of the “auto-complete” capability—training the AI that the next words to come after a question should be an answer rather than an extension of the question. And the next words after a command like “write me a poem” should be the poem being asked for and not a longer version of the command.
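As a rough sketch, the fine-tuning data is just command-response pairs. The exact format and fields OpenAI used are not public; the structure below is purely illustrative of the idea.

```python
# Illustrative instruction-tuning examples: each pair teaches the model that
# the "next words" after a command should be a response, not more command.
# (The actual data format OpenAI used is internal; this is a sketch.)

instruction_examples = [
    {
        "prompt": "Write me a two-line poem about the sea.",
        "completion": "The tide rolls in, the tide rolls out,\nand carries all my cares about.",
    },
    {
        "prompt": "Summarize: The meeting is moved from Tuesday to Thursday at 3pm.",
        "completion": "The meeting is now Thursday at 3pm.",
    },
]

# During fine-tuning, each pair runs through the same gradient-descent loop,
# nudging the model until prompt -> completion becomes the "most likely" continuation.
for example in instruction_examples:
    print(example["prompt"], "->", example["completion"][:40])
```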
Say you want an article about scientists discovering unicorns in the Andes. Instead of simply writing a summary prompt and letting the AI fill out a few paragraphs, you could tell GPT3 explicitly, “Write a short article about scientists discovering unicorns in the Andes, in the style of a Buzzfeed list.” And it would do so.
Or you could tell it, “Write an article about scientists discovering unicorns in the Andes in the style and voice of a long-form New Yorker article.” And it would write something more appropriate to that prompt.
This was a huge step forward because it meant that controlling the AI could be done simply and directly in human language instead of writing a computer program. And OpenAI didn’t have to explicitly build in the ability to follow commands in the structure of GPT3; instead, it just built a flexible, powerful language model and fine-tuned it with examples of commands and responses.
Innovation 2: Taking feedback
Following commands made GPT3 a lot easier to work with than GPT2. But it was still limited. Working with an AI like this is a bit like playing a slot machine. Pull the lever (give an input), get an output. Maybe it’s good, maybe it’s not. If not, try again. But as every artist, programmer, or writer knows, that’s not how creation works. Creation, like gradient descent for AI training, is best when it’s iterative.
The same is true of working with generative AI. If the AI produces something that’s almost right, you don’t want to start over. You want it to start from the output the AI already gave you so you can guide it from bad to good to great. You need to train the AI how to understand feedback.
There is a way to do this. The input to an AI model is called the context window. You can think of the context window as the text that our magic auto-complete takes in and then continues from. One way to work with an AI is to feed its own output back into the context window so that each input isn’t just a command but a command plus a “history” to apply that command to. This way, you can get the AI to modify its past output into something better. But you need the AI to understand how to take commands to make edits and not just new output.
This is what OpenAI accomplished with a version of GPT3 called InstructGPT. To address the problem of iteration, OpenAI made GPT3 better at following commands to make changes to an existing body of text it was given. This meant training the AI to respond more like a human when receiving feedback. To do this, OpenAI applied a technique called reinforcement learning from human feedback (RLHF). RLHF is a way of training AIs to mimic human preferences based on training examples from a human.
InstructGPT introduced another new pattern to working with GPT. Instead of “here’s a command, give me an output,” it tells the AI, “Here’s the previous output you gave me, here’s my feedback on what to give me next based on what you gave me before, now give me a new output.”
This turned working with an AI into a conversational experience; instead of command and response, it’s a conversation with history. Every time the AI produced an output, it used the history of the entire conversation so far as a basis, not just the latest command it received. It could remember what you told it before and use it as a basis for its output.
Just as with obeying commands, RLHF is an example of fine-tuning. OpenAI created InstructGPT from GPT3 by fine-tuning GPT3 with example data from human testers rating the output of AIs responding to feedback. This was another big step forward in AI alignment. By making the AI good at responding to feedback, RLHF made the workflow of using generative AI iterative.
Analogy time again: Working with earlier AI models was like playing darts. You could adjust your form and positioning, trying to hit the bullseye, but each throw was a separate shot. You might hit the bullseye on the first throw, or you might never hit the bullseye. But with InstructGPT, it’s more like playing golf. You start each stroke from where you left off. It might take a few tries, but you’ll get in the hole eventually, or at least reach the green.
OpenAI combined InstructGPT with GPT3 to create GPT3.5. It then put GPT3.5 behind a web interface that lets anyone communicate with it from a browser, creating ChatGPT. (A pedantic but useful distinction: GPT3, GPT3.5, and GPT4 are models, neural networks that take input from a context window and produce output. ChatGPT is an application, in this case a webpage, that lets a human interact with an AI model under the hood—either GPT3.5 or GPT4—with a chat interface.)
With ChatGPT’s release in November 2022, a phenomenon was born: ChatGPT was supposed to be a research preview to test the chat interface; instead, it reached 100M users faster than any app in history. The chat interface obviously helped, but so did something else. In creating the interactive chat interface for ChatGPT, OpenAI also made GPT3.5 and its successors like GPT4 into powerful general-purpose processing tools.
The new framework: “processing” instead of “chat”
We’ve seen how gradient descent led to fine-tuning, which led to following commands, which then led to feedback and interaction. That ultimately led to an incredibly responsive AI chatbot in the form of ChatGPT. But the breakthrough happening now is the creation of an entirely new and different framework for working with LLMs: using them not as a chatbot that a human is talking to that uses its own knowledge to produce words and answers, but rather as processing tools that can be accessed by other software to work with data the model has never seen.
Here’s how this works.
Consider the simplest, platonic ideal of a computer: a CPU, some working memory, and a program. The CPU takes the data in the working memory, follows the instructions given to it via the program, and processes that data to produce something useful (or not, depending on the skill of the programmer). The key is that the CPU is generic and general-purpose: it doesn’t have to “know” anything in particular about the data it’s working with, so the same CPU can be used with all kinds of data.
Now look at how ChatGPT works. We have an AI model (GPT4) that has the ability to take input via a context window. We also have the ability to treat part of the input as commands and part of the input as history, or memory, that the commands apply to. If we use GPT4 to power ChatGPT, we’ve created a chatbot that uses that memory for a record of the conversation it has been having.
But there’s no reason you couldn’t fill the context window—that short-term, working memory—with other information. What if, instead of a conversation history, you gave the AI some other data to act on? Now, instead of having a conversation, you’ve turned working with the AI model into a data processing operation.
This is incredibly powerful because, unlike a CPU, the AI model can take commands in natural, human language instead of binary code. And it has enough understanding and flexibility to work with all kinds of information without the need for you to specify and explain things in detail.
Let’s translate this into some concrete, real-world examples. How would you use an LLM with working memory and a knack for following commands to do something productive?
Example 1: Text and documents
We’ll start with the most straightforward application for a language model: handling text.
Imagine you’re a health insurance company grappling with a massive amount of policy documents and customer information. You want to build an AI tool that lets users ask questions about that data and get useful answers: Does my insurance cover elective procedures? What’s my deductible? Is tattoo removal covered?
How would you go about building this?
The old approach would be to take an AI model and fine-tune that model over all of your documents and user data so it learns that information. Then a user could just ask questions of the AI directly. But this creates all sorts of difficulties, the most obvious of which is that every time there’s any sort of change, you’d have to retrain and redeploy a whole new language model! Plus, your model would know the policy information for all users. How do you keep it from telling User A about User B’s private information if User A asks?
Fortunately, there’s another way. We don’t have to rely on the model to “memorize” the information. Instead, we could give the LLM the documents and a specific user’s policy information, then command it to answer the user’s question using that specific data. But the context window is limited; we can’t feed the LLM all of the documents we have. How do we give it just the relevant information it needs to produce an answer?
The answer is to use something called “embeddings.”
Remember, LLMs work with text by transforming words into math and probabilities. As a byproduct of training, these models also learn a numerical representation of how words relate to concepts and other words.
These internal numerical representations of words and concepts are called embeddings. It’s like a library filing system for words and concepts: You can look up a concept if you know its embedding, and vice versa. You can modify an LLM so that, instead of producing words, it can report to you its embedding for words and phrases. OpenAI and other AI companies often have special versions of their models to do precisely this.
So back to our insurance example. First, we pre-process our documents and user data, breaking them into small chunks of a few sentences or paragraphs and then assigning each chunk an embedding. The embedding acts as a numerical tag that says what each chunk is about. And because embeddings are created by LLMs, these “tags” can be flexible and comprehensive, crossing between different concepts and taking subtleties of language into account. A chunk of text about a user’s payments, for example, might be relevant to reimbursements, deductibles, and out-of-network coverage alike.
With embeddings, we can use an LLM to build a search engine! Now we can take a user’s question, compute an embedding for that question, and use the question embedding to look up which parts of which documents are relevant to that question and may contain the answer. This way of using LLMs is called a “retriever model” for obvious reasons—we retrieve the data for the AI to use.
So here’s how our Q&A system works. A user asks a question like, “What is my coverage for out-of-network procedures?” The Q&A system first assigns the question an embedding. We then use our database of embeddings to look up which chunks of which policy documents are relevant to this question, as well as any relevant data from the user’s own policy. We compose all of these chunks, along with the user’s original question, into a prompt for our LLM: Answer this question based on this information.
And then the LLM can give us an answer.
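Putting it together, the final prompt is just the retrieved chunks plus the user’s question, wrapped in an instruction to answer from that material. The exact wording below is mine, not a standard; the pattern is what matters.

```python
def build_grounded_prompt(question, retrieved_chunks):
    """Compose a grounded prompt: retrieved policy text first, then an
    instruction to answer the question using only that text."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        "Answer the question using only the information below.\n"
        "If the information is not sufficient, say so.\n\n"
        f"Information:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "Out-of-network procedures are reimbursed at 60 percent after the deductible.",
    "The annual deductible for individual plans is 1,500 dollars.",
]
prompt = build_grounded_prompt(
    "What is my coverage for out-of-network procedures?", retrieved
)
print(prompt)  # this string is what goes into the LLM's context window
```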
This is a very flexible and powerful way of working with LLMs that solves many of the practical problems. You can use the same generic LLM to process questions for insurance paperwork, legal documents, or historical records. New information is easy to handle because you don’t have to update the underlying model, just the inputs. And you can handle a lot of input: context windows keep growing, and AI companies like Anthropic have extended their models to handle entire books!
It’s also much simpler to handle privacy and security; instead of trusting the AI to decide what questions it can answer, you can manage privacy by only giving the AI information for the authorized user in its context window.
Having an LLM base its answers on information fed to it is called “grounding.” This biases the LLM toward trusting the information in the context window more and is a powerful way to reduce the problem of letting the model make up answers.
This framework is able to do a lot more than just answer questions.
Example 2: Robotics
Take an application that’s completely different from text: robotics. We’ve made remarkable strides in low-level control of robots, the mechanical tasks of sensing, moving, and manipulating objects. Look, for example, at the videos Boston Dynamics has produced of robot acrobatics. But these are basic actions. To do something useful, the robot needs context and direction.
Imagine a robot in an office, tasked with picking up and delivering tools and packages, and even making coffee. Say you wanted the robot to pick up a package from Alice and deliver it to Bob’s office on the third floor and drop it off on his desk. Traditionally, a human would need to program the robot with specific instructions: which package to retrieve, which office is Bob’s, which floor Bob’s office is on, and so on. But you want the robot to do many different tasks and be able to take commands simply in human language. “Take a package to Bob” turns into a different set of waypoints and actions than “pick up coffee for me.” Getting a robot to handle all these human commands without having to program it specifically each time has always been a challenge.
LLMs are not so good at the low-level robot tasks of sensing and motion planning, as these are an entirely different paradigm. But LLMs are great at putting together words into meaningful sentences according to the rules of grammar. And what is programming but putting words into sentences for a computer, or in this case, a robot, to understand? We just need to give the LLM a grammar to work with.
So we can specify the set of actions that the robot can undertake, such as moving to a destination, picking up an object, or placing an object. Think of it as a throwback to classic text-based adventure games like Zork, where you could command your character to move north, south, east, and so on.
We can then use this basic robotic grammar to create a set of example programs to accomplish different tasks, like dropping off a package, making coffee, etc. And now we can do the same thing as before. When a human user wants the robot to do something, we can use our retriever model to look up similar example programs and then ask the LLM to write a new program to do what the user wants. It can take example programs of dropping off a package on Bob’s desk and modify them to write new instructions for taking a package to Alice, and so on. LLMs aren’t great at creating from scratch, but given examples, they are adept at modifying, adapting, and interpolating.
The key here isn’t that the programs are complicated. It’s that we now have an easy, automated way to generate new programs without tedious human work. The example above is a toy problem, of course; there are many details to manage before letting a robot loose in an office. But this type of technique is already finding use in more structured robotics applications with less unpredictability—for example, in industrial robotics, where a stationary robot may be tasked with picking and placing parts in a manufacturing process. Or controlling drones by setting and modifying sequences of waypoints.
Example 3: Semiconductors
One last example: semiconductor design. You’d think that silicon is far from anything these language models can handle off the shelf, right? But they’re also powerful here because LLMs can be great at programming.
Most modern chips, like the CPUs and GPUs you may be familiar with, work with digital logic in 1’s and 0’s. This is different from chips that handle analog waveforms, like the chips that manage radio signals or power electronics. These digital chips are designed using a process called logic synthesis. This is a technique where a human designer writes code to specify logical operations like addition, multiplication, moving data in memory, and so on. Then sophisticated software automatically translates the desired logical operations into patterns of components like transistors and capacitors, and then other sophisticated software transforms those components into the patterns of silicon and copper that are etched into a chip.
Logic synthesis was a revolution in chip design. It meant that chip designers could think about “what should this chip do” rather than “how do I build this circuit.” It’s the same breakthrough that happened when computer programmers could write in high-level programming languages instead of low-level binary code. And it turned chip design into writing code.
Generative AI can take logic synthesis and turbocharge it. We can apply the same principle of creating embeddings on this kind of code, allowing us to turn vast libraries of existing chip designs into examples for the AI to remix and modify. This can drastically speed up chip design work. In chip design, as in much of programming, the bulk of new design work is often in taking old work and modifying it. Now, a few designers using generative AI can quickly create many variants of the same core chip design to cover all these cases.
Behind the scenes, there are already many chip companies experimenting with or adopting LLMs into their design workflows. We’re far from an AI being able to design an entire chip, but we’re quickly getting to a point where small teams can equal the output of much larger ones—not by taking away the interesting, innovative chip design work but by speeding up the repetitive yet critical “grunt work” that takes innovation into implementation.
You can see how flexible the approach of turning LLMs designed for chat interaction into general-purpose processing tools can be. This forms a framework that can be applied across many different domains:
- Take a generic LLM and a data set of examples in text form.
- Use embeddings to build a retriever model that can take a user’s input and pull relevant chunks out of that example set.
- Feed the chunks along with a user’s input into the LLM to have the LLM give a useful output based on modifying or drawing information from those examples.
The fact that LLMs can work with programming languages makes this extraordinarily powerful: We’ve spent the last few decades turning everything into writing code.
The applications I mentioned in this article are only the beginning. For example, we have HTML, a text-based language that describes how images and text are arranged on a website. Now imagine creating a scripting language for buildings that describes the layout of doors, windows, walls, and other architectural elements. What could we do with generative AI then? We could create tools that help architects speed up building design by enabling faster iterations and modifications—or, at the very least, speed up permitting reviews, design mark-ups, and modifications.
The magic is that so much can be done without having to customize the AI models. In some ways, generative AI is like the humble spreadsheet. Just as businesses with complex needs develop customized databases, many businesses with specialized needs will still choose to train or fine-tune their own AI models. But just as you can get quite far with a plain spreadsheet, you can get quite far with a generic AI model. And if you’re a small business, or if you’re in a niche domain where tech titans aren’t spending billions building models specifically for your use case, you can use generative AI right away (if you have some programming chops, that is).
AI is undergoing a transformation similar to the one semiconductors made many decades ago, moving from specialized components designed for specific tasks to generalized tools that anyone can use. In the early days, chips were specialized products custom designed for specific purposes, like calculating the trajectory of a rocket or amplifying an audio signal. Now, CPUs and GPUs are generalized processing units that can be applied to everything from streaming video to writing a recipe to running artificial intelligence.
AI is going down a similar path.