Toward a Mathematical Understanding of Foundation Models

ChatGPT, Gemini, and similar AI systems have amazed the public with fluent answers and rapid problem-solving. Under the hood, these systems belong to a broader class called “foundation models,” which are trained on wide-ranging data so they can be adapted to many tasks, from writing and coding to analyzing scientific measurements. As these models spread into research labs and industry, especially in AI for science, where they are used to accelerate simulations, screen materials, and analyze datasets, a practical question grows urgent: What exactly can these models do, and where are their limits?

Fei Lu, an associate professor in the Department of Mathematics, approaches this question with tools from computational mathematics and learning theory. His research aims to build a mathematical understanding of foundation models by clarifying assumptions, deriving error bounds,
and identifying limitations so that scientists know when to trust a model’s output, when to gather more data, and when simpler methods may suffice.

Looking for the limits

Lu’s research is zeroing in on “in-context learning,” the way a pretrained foundation model learns from the prompt context without changing its parameters. In a probabilistic view, pretraining equips the model with a cross-task prior (broad expectations about which patterns are likely), and the prompt context provides new evidence. In effect, the model performs what is known as a Bayesian update, combining prior knowledge distilled from its training data with the context users supply to make a reasoned prediction.
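Schematically, and only as an illustration of the general Bayesian template rather than a formula from the study, the update reads

\[
p(\theta \mid \text{context}) \;\propto\; p(\text{context} \mid \theta)\, p(\theta),
\qquad
\hat y(x_{\mathrm{query}}) \;=\; \mathbb{E}\big[\, f_\theta(x_{\mathrm{query}}) \mid \text{context} \,\big],
\]

where the prior \(p(\theta)\) over task parameters is shaped by pretraining, the likelihood weighs how well each candidate task explains the prompt’s examples, and the prediction averages over the resulting posterior.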

Inverse problems, common across science and engineering, fit naturally into this Bayesian picture. These problems use incomplete or noisy observations to infer hidden signals or governing laws. In a recent study, Lu and a collaborator analyzed a controlled testbed, called inverse linear regression, to explain both successes and failure modes.
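To make the testbed concrete, here is a minimal sketch of this kind of setup under standard Gaussian assumptions; the variable names and numbers are illustrative choices, not taken from the paper. A hidden task vector is drawn from a cross-task prior, the prompt supplies noisy input-output pairs as context, and the Bayes-optimal in-context prediction applies the posterior mean of the task vector to a fresh query.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, noise = 8, 32, 0.1   # task dimension, context length, noise level (illustrative)

# Cross-task prior, standing in for what pretraining distills from many tasks.
prior_mean, prior_cov = np.zeros(d), np.eye(d)
w = rng.multivariate_normal(prior_mean, prior_cov)   # hidden task for this prompt

# Context: noisy input-output pairs supplied in the prompt.
X = rng.normal(size=(n, d))
y = X @ w + noise * rng.normal(size=n)

# Bayes-optimal in-context prediction: posterior mean of w given the context,
# evaluated at a new query input.
A = X.T @ X / noise**2 + np.linalg.inv(prior_cov)
b = X.T @ y / noise**2 + np.linalg.inv(prior_cov) @ prior_mean
w_post = np.linalg.solve(A, b)

x_query = rng.normal(size=d)
print("Bayes prediction:", x_query @ w_post, " true value:", x_query @ w)
```

In a testbed like this, a trained transformer’s in-context prediction can be compared against the Bayes benchmark to judge how close it comes to the optimal update.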

The researchers demonstrated how transformers, the neural-network architecture at the core of most foundation models, handle particularly challenging cases where the prompt’s context offers scant information. In these instances, transformers can still deliver quality answers by falling back on prior knowledge learned from the training data, using it to fill in the gaps with plausible responses.
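A small self-contained illustration of that gap-filling behavior, again under toy Gaussian assumptions rather than the paper’s setup: with only two context examples in a 32-dimensional task, the posterior mean stays at the prior mean (zero here) in every direction the context never observes.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, noise = 32, 2, 0.1   # far fewer context examples than task dimensions

X = rng.normal(size=(n, d))                       # two observed input directions
y = X @ rng.normal(size=d) + noise * rng.normal(size=n)

# Posterior mean under a standard Gaussian prior with zero mean.
w_post = np.linalg.solve(X.T @ X / noise**2 + np.eye(d), X.T @ y / noise**2)

# The estimate lies entirely in the span of the observed directions; in the
# remaining 30 directions it simply repeats the prior mean.
P = X.T @ np.linalg.pinv(X.T)                     # projector onto observed directions
print("norm of the estimate outside the observed directions:",
      np.linalg.norm(w_post - P @ w_post))        # ~ 0, up to numerical precision
```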

To illustrate the concept, Lu offers a medical imaging analogy: “Two brain scans share many features; only a small set of differences matters,” he says. “The transformer captures the
shared structure and then uses the new context to identify those small, low-dimensional
differences.”

New frontiers

However, when the remaining differences are large—that is, when the task dimension is
high relative to the available context length—accurate predictions are out of reach for any model, the new analysis showed. The takeaway, Lu says, is that “low task complexity relative to context is crucial for the success of foundation models. Quantifying that complexity, and the performance limits it implies, is a new research frontier with many important open questions.”
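One classical way to see why, stated here only to illustrate the scaling and not as the paper’s precise bound: in the simplest well-specified linear setting with noise level \(\sigma\), task dimension \(d\), and \(n\) context examples, no estimator can beat the minimax lower bound

\[
\inf_{\hat w}\ \sup_{\|w\|\le R}\ \mathbb{E}\,\|\hat w - w\|^2 \;\gtrsim\; \min\!\Big(R^2,\ \frac{\sigma^2 d}{n}\Big),
\]

so once the effective dimension of the remaining differences outstrips the context length, the error cannot shrink, no matter how large or well trained the model is.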

Lu enjoys delving into the challenging, societally relevant problems posed by modern AI. Earlier in his academic career, he worked in more theoretical, abstract mathematics, but during his postdoctoral years, before joining Johns Hopkins in 2017, an advisor prodded him into giving applied computational mathematics a go.

“I was hesitant at first, and the transition from pure math was difficult,” says Lu. “But I survived and
got lucky to get into this area.”
