Imitation learning (IL) has proven to be a powerful framework for acquiring robotic manipulation skills from demonstrations. However, standard approaches fall short in long-horizon or unseen scenarios due to compounding errors and a lack of goal-directed reasoning. Crucially, IL does not equip agents with the ability to predict and evaluate future outcomes, a capability essential for robust, generalizable behavior. Recent advances in world modeling suggest that such predictive capabilities can be captured in task-agnostic world models.
We propose ForesightIL, a novel method that brings predictive reasoning to imitation learning at test time. Our approach leverages multimodal IL policies as generative priors over action sequences, combined with latent-space imagination in a learned world model. Instead of directly executing the policy's actions, we sample multiple candidate trajectories, simulate their outcomes using the world model, and select the most promising one for execution. This setup enables goal-conditioned predictive control, allowing adaptive action selection toward specified targets. Furthermore, by simulating trajectories from different diffusion policies, each trained on a single task, our approach enables compositional skill integration via world model planning. ForesightIL achieves a 29% performance gain across three simulated and two real-world robotics tasks, while demonstrating robust generalization to visual disturbances without any online adaptation or additional supervision.
Our approach trains a diffusion policy as a generative prior over action sequences, and uses a shared latent world model to simulate and rank candidate trajectories. Crucially, instead of executing the policy output directly, we sample multiple candidate trajectories, imagine their outcomes in latent space, and select the most promising one for execution. This allows us to reuse generic knowledge from the training set at test time—without additional supervision or interaction.
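The sample, imagine, and select loop described above can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in rather than the actual ForesightIL implementation: `sample_candidates` replaces the diffusion policy with uniform noise, `imagine` replaces the learned world model with a toy one-dimensional latent dynamics, and `score` assumes that "most promising" means closest to a goal latent, which the text does not specify.

```python
import random


def sample_candidates(rng, num_candidates, horizon):
    # Hypothetical stand-in for the policy prior: each candidate is a
    # sequence of scalar actions drawn uniformly from [-1, 1].
    return [[rng.uniform(-1.0, 1.0) for _ in range(horizon)]
            for _ in range(num_candidates)]


def imagine(z0, actions):
    # Hypothetical stand-in for the world model: a toy 1-D latent state
    # that moves by 0.1 * action at every step. No real environment is used.
    z = z0
    for a in actions:
        z = z + 0.1 * a
    return z


def score(z_final, z_goal):
    # Assumed ranking criterion: negative distance between the imagined
    # final latent and the goal latent (higher is better).
    return -abs(z_final - z_goal)


def select_plan(z0, z_goal, candidates):
    # Roll out every candidate in imagination, score it, and return the
    # index of the best one together with all scores.
    scores = [score(imagine(z0, c), z_goal) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores


rng = random.Random(0)
candidates = sample_candidates(rng, num_candidates=16, horizon=8)
best, scores = select_plan(z0=0.0, z_goal=0.5, candidates=candidates)
plan = candidates[best]  # the action sequence that would be executed
```

In this sketch only the selected `plan` would be sent to the robot; the other candidates are discarded after being evaluated in imagination. Under the same assumptions, pooling candidates drawn from several single-task policies into one list before calling `select_plan` would correspond to the compositional setting mentioned in the abstract.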
Visual comparison of planning outcomes between ForesightIL and baseline methods. By leveraging latent-space rollouts and predictive reasoning in a learned world model, ForesightIL consistently selects goal-directed actions, leading to more successful task completions across diverse scenarios.