EgoSim: Egocentric Exploration in Virtual Worlds with Multi-modal Conditioning

Wei Yu, Songheng Yin, Steve Easterbrook, Animesh Garg

Abstract

Recent advances in video diffusion models have established a strong foundation for developing world models with practical applications. The next challenge is to explore how an agent can leverage these foundation models to understand, interact with, and plan within observed environments. This requires making the model more controllable, turning it into a versatile game engine capable of dynamic manipulation and control. To this end, we investigate three key conditioning factors (camera, context frame, and text) and identify limitations in current model designs. Specifically, fusing camera embeddings with video features causes camera control to be influenced by those features. In addition, while textual information compensates for necessary spatiotemporal structures, it often intrudes into already observed parts of the scene. To tackle these issues, we design the Spacetime Epipolar Attention Layer, which uses rigid constraints to ensure that the egomotion generated by the model strictly aligns with the camera’s movement. We also propose the CI2V-adapter, which uses camera information to better determine whether to prioritize textual or visual embeddings, thereby alleviating textual intrusion into observed areas. Through extensive experiments, we demonstrate that our new model, EgoSim, achieves excellent results on both the RealEstate and the newly repurposed Epic-Field datasets.
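To make the idea of a rigid epipolar constraint concrete, the following is a minimal two-frame sketch in NumPy. It is not the authors' Spacetime Epipolar Attention Layer, whose details are not given on this page. It only illustrates the underlying geometry, assuming a pinhole camera with known intrinsics `K` and a relative pose `(R, t)` such that a point `X1` in the first camera frame maps to `X2 = R @ X1 + t` in the second. All function names (`skew`, `fundamental_matrix`, `pixel_grid`, `epipolar_mask`) and the `max_dist` threshold are hypothetical choices made for illustration.

```python
import numpy as np


def skew(t):
    """Return the 3x3 cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty],
                     [tz, 0.0, -tx],
                     [-ty, tx, 0.0]])


def fundamental_matrix(K, R, t):
    """Fundamental matrix F with p2^T F p1 = 0, assuming X2 = R @ X1 + t
    and the same intrinsics K for both frames (an assumption of this sketch)."""
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(t) @ R @ K_inv


def pixel_grid(h, w):
    """Homogeneous pixel centres in row-major order, shape (h * w, 3)."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([xs.ravel() + 0.5, ys.ravel() + 0.5, np.ones(h * w)], axis=1)


def epipolar_mask(K, R, t, h, w, max_dist=1.0):
    """Boolean mask of shape (h * w, h * w).

    Entry [i, j] is True when pixel j of frame 2 lies within max_dist pixels
    of the epipolar line induced by pixel i of frame 1.
    """
    F = fundamental_matrix(K, R, t)
    pix = pixel_grid(h, w)
    lines = pix @ F.T                      # row i is the line l_i = F @ p_i
    norm = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    # Point-to-line distance |l_i . p_j| / ||(a_i, b_i)||; the clamp avoids a
    # division by zero when t = 0 (pure rotation), where the mask becomes all True.
    dist = np.abs(lines @ pix.T) / np.maximum(norm, 1e-8)
    return dist <= max_dist
```

In an attention layer, a mask like this could be applied by setting the attention logits of masked-out query/key pairs to negative infinity before the softmax, so that each query pixel attends only to locations geometrically consistent with the given camera motion. For a purely sideways translation, for example, the mask reduces to "attend only within the same image row".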

3D Camera Control for Dynamic Scenes

Columns: Frame & Caption Input, Reference Trajectory Video, Camera Controlled Generation
Caption: a sailboat sailing in rough seas with a dramatic sunset, waves are surging
Caption: pouring honey onto some slices of bread
Caption: rotating view, small house
Caption: time-lapse of a blooming flower with leaves and a stem
Caption: fireworks display

Precise Camera Control

[Two animated GIF examples]

Interact With World

[Ten animated GIF examples]