
Research & Engineering · May 10, 2025

ICL — The path to generalizable robotics today

By Owen Burns, Hanyang Gu, Aaryan Patil

Over the past few years, Vision-Language-Action (VLA) models have become the new foundation for robot control. Trained on enormous corpora of robot demonstrations spanning dozens of embodiments and hundreds of tasks, these models acquire broad intuitions about how objects behave, how tools should be grasped, and how multi-step workflows unfold. In many ways they are impressive, but two problems surface as soon as they are deployed, and both sharply limit their practical use.

The Problem(s) with Today's VLAs

The first problem is adaptation. VLA models are trained once and then frozen. Ask the robot to do something outside its training distribution (a slightly different object, a new workspace layout, a task it has never seen) and performance drops sharply, often to zero. The standard fix is fine-tuning: collect fresh demonstrations, rent GPU time, fine-tune the model, redeploy. This cycle takes days to weeks, must be repeated for every new task at every new site, and produces only a brittle specialist.

The second problem is memory. Current VLA models decide what to do based only on what they can see right now, with no awareness of what happened earlier in the same task. Ask the robot to remember which drawer it already checked, to notice that its last attempt failed, or to act on information from thirty seconds ago, and it simply cannot.

How We Solve It

We augment a frontier VLA with two complementary memory modules that share a single mechanism: cross-attention conditioning on past observation-action pairs. The In-Context Learning (ICL) module encodes an expert demonstration into a context bank before the episode begins, while the Short-Term Memory (STM) module accumulates the robot's own observations during execution into a bounded memory bank, enabling recall of earlier events when they become relevant. Both modules inject context at the same architectural point and compose without joint training: an ICL module trained on standard manipulation data can be frozen and combined with an STM module trained on an entirely separate memory benchmark. The combination improves performance on the hardest memory tasks by 22 percentage points over models without memory.

Architecture overview showing ICL Bank, STM Bank, VLM Backbone, Enrichment module, and Action Expert
The full system: expert demonstrations are encoded into the ICL bank before the episode; the robot's own observations accumulate into the STM bank during execution. Both enrich the VLM backbone features via cross-attention before they reach the action expert.
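
The shared mechanism can be sketched in a few lines. This is a minimal, dependency-free illustration under stated assumptions, not the system's implementation: `MemoryBank`, `cross_attend`, and `enrich` are hypothetical names, embeddings are plain lists of floats, attention is single-head with no learned projections, and the two banks are simply concatenated before attending.

```python
import math
from collections import deque


class MemoryBank:
    """Holds embeddings of past observation-action pairs (hypothetical helper)."""

    def __init__(self, capacity=None):
        # capacity=None -> unbounded (ICL bank, filled once before the episode);
        # capacity=k    -> bounded (STM bank, oldest entry evicted first).
        self.entries = deque(maxlen=capacity)

    def add(self, pair_embedding):
        self.entries.append(list(pair_embedding))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, entries):
    """Scaled dot-product attention of one query vector over bank entries.

    Assumption: entries act as both keys and values; a real module would
    apply learned projections.
    """
    if not entries:
        return [0.0] * len(query)
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, e)) / scale for e in entries]
    weights = softmax(scores)
    return [sum(w * e[i] for w, e in zip(weights, entries))
            for i in range(len(query))]


def enrich(feature, icl_bank, stm_bank):
    """Add attended context from both banks to a backbone feature (residual)."""
    entries = list(icl_bank.entries) + list(stm_bank.entries)
    context = cross_attend(feature, entries)
    return [f + c for f, c in zip(feature, context)]


# Usage: one demonstration step in the ICL bank, a 2-slot STM bank fed three steps.
icl = MemoryBank()
icl.add([1.0, 0.0])
stm = MemoryBank(capacity=2)
for step in ([0.0, 1.0], [0.5, 0.5], [0.2, 0.8]):
    stm.add(step)
print(len(stm.entries))  # 2: the oldest step was evicted
print(enrich([1.0, 0.0], icl, stm))
```

Because both banks feed the same `enrich` call, either one can be left empty without changing the interface, which is one plausible reading of why the two modules can be trained separately and combined later.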

What's Actually Happening Under the Hood

The key mechanistic insight is that the ICL module doesn't teach the model new capabilities; it sharpens the action distribution the model already has. The pretrained VLA already knows how to perform the task: the motor primitives, the grasping strategies, and the motion trajectories are all present in the weights. The problem is that, without additional context, the flow-matching action expert denoises too inconsistently to execute any single skill reliably. Starting from random noise, the model can converge to many different valid action trajectories, and without a strong enough signal constraining the target distribution, each rollout lands somewhere different in a broad output manifold. Context injection supplies that signal. By conditioning on a demonstration of the target task (or on past steps from the same task), we make the denoising process converge consistently to the desired skill, achieving an effect comparable to fine-tuning without touching a single weight.

[Figure: Action Expert, Output Space] Denoising convergence under different conditioning regimes. Context injection constrains the output distribution comparably to fine-tuning without modifying model weights.
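
A one-dimensional toy makes the intuition concrete. The sketch below is not the action expert: it assumes two equally likely "skills" at -1 and +1, Gaussian starting noise, and the straight-line flow-matching path x_t = (1 - t) * x0 + t * x1, for which the marginal velocity field has a closed form. The function names and the choice of 50 Euler steps are illustrative. Conditioning is modeled crudely, by restricting the set of target modes to the demonstrated one.

```python
import math
import statistics


def posterior_weights(x, t, modes):
    """Posterior over target modes given x_t = x, where
    x_t = (1 - t) * x0 + t * x1, x0 ~ N(0, 1), and modes are equally likely."""
    sigma = 1.0 - t
    logits = [-((x - t * a) ** 2) / (2.0 * sigma ** 2) for a in modes]
    peak = max(logits)  # stabilise before exponentiating
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def velocity(x, t, modes):
    """Marginal flow-matching velocity E[x1 - x0 | x_t = x]."""
    weights = posterior_weights(x, t, modes)
    return sum(w * (a - x) / (1.0 - t) for w, a in zip(weights, modes))


def denoise(x0, modes, steps=50):
    """Euler-integrate the velocity field from noise x0 at t=0 towards t=1."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        x += dt * velocity(x, k * dt, modes)
    return x


noise = [-1.5, -0.5, 0.5, 1.5]  # fixed starting points for repeatability
unconditioned = [denoise(z, modes=[-1.0, 1.0]) for z in noise]
conditioned = [denoise(z, modes=[1.0]) for z in noise]
print(statistics.pstdev(unconditioned))  # broad: rollouts split between both skills
print(statistics.pstdev(conditioned))    # near zero: every rollout reaches +1
```

The same starting noise produces endpoints that split across both modes when nothing constrains the target, and collapse onto one mode once the target is constrained, with no change to the integrator itself.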

One surprising observation from our analysis: fine-tuning tightens convergence for the first few predicted timesteps but actually loosens it for later ones in the action horizon. The model commits harder to the immediate next step while becoming less consistent about where the full trajectory ends up. We did not expect this asymmetry, and it warrants further study of how weight-level adaptation interacts with the temporal structure of flow-matching generation. Context injection, by contrast, constrains convergence uniformly from the first predicted timestep through the last, an encouraging property that translates directly into smoother, more coherent multi-step execution.
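
One simple way to quantify this kind of per-timestep consistency is to sample several rollouts from the same observation and measure their spread at each step of the action horizon. The sketch below assumes that metric (a population standard deviation over a single action dimension); the post does not state which measure was actually used, and `per_step_spread` is a hypothetical helper.

```python
import statistics


def per_step_spread(rollouts):
    """Standard deviation across rollouts at each step of the action horizon.

    rollouts: list of equal-length lists, one predicted value per timestep
    (a single action dimension, for simplicity).
    """
    horizon = len(rollouts[0])
    if any(len(r) != horizon for r in rollouts):
        raise ValueError("all rollouts must share the same horizon")
    return [statistics.pstdev([r[t] for r in rollouts]) for t in range(horizon)]


# Placeholder numbers for illustration only, not measurements.
rollouts = [
    [0.0, 0.0, 0.0],
    [0.0, 2.0, 4.0],
]
print(per_step_spread(rollouts))
```

A flat profile across the horizon would correspond to uniform convergence; a profile that starts low and rises would correspond to the asymmetry described above.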

The Path Forward

We see a future where robots learn from demonstrations the way humans learn from each other: watch once, adapt immediately, remember what matters. Every deployment site becomes a source of new capability rather than a one-time engineering project. New tasks are a single demonstration away, not a retraining cycle. And the robot's own experience during execution feeds back into its behavior in real time, enabling the kind of adaptive, memory-driven performance that distinguishes a skilled worker from a script.

We're building this future and looking for people who want to help. If this resonates, reach out at jobs@engramrobotics.com.