Differentiable Robot Rendering

Anonymous Authors

We introduce Differentiable Rendering of Robots (Dr. Robot), a robot self-model that is differentiable from its visual appearance down to its control parameters. With it, we can perform control and planning of robot actions through image gradients provided by visual foundation models.

Abstract

Vision foundation models trained on massive amounts of visual data have shown unprecedented reasoning and planning skills in open-world settings. A key challenge in applying them to robotic tasks is the modality gap between visual data and action data. We introduce differentiable robot rendering, a method that makes the visual appearance of a robot body directly differentiable with respect to its control parameters. Our model integrates a kinematics-aware deformable model with Gaussian Splatting and is compatible with any robot form factor and number of degrees of freedom. We demonstrate its capability and usage in applications including reconstruction of robot poses from images and controlling robots through vision-language models. Quantitative and qualitative results show that our differentiable rendering model provides effective gradients for robotic control directly from pixels, setting the foundation for future applications of vision foundation models in robotics.

Video

Method

Our robot model is composed of three differentiable components: forward kinematics maps a pose vector to a skeleton, implicit LBS projects 3D Gaussians onto the robot surface, and appearance deformation adjusts the appearance of the 3D Gaussians.
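To make the first stage concrete, here is a minimal sketch (not the paper's implementation) of differentiable forward kinematics for a planar 2-link arm: the pose vector maps to skeleton keypoints, and an analytic Jacobian makes those keypoints differentiable in the joint angles. The link lengths and function names are illustrative assumptions.

```python
import numpy as np

def forward_kinematics(theta, lengths=(1.0, 1.0)):
    """Map a pose vector (joint angles) to 2D skeleton keypoint positions."""
    l1, l2 = lengths
    p1 = np.array([l1 * np.cos(theta[0]), l1 * np.sin(theta[0])])
    p2 = p1 + np.array([l2 * np.cos(theta[0] + theta[1]),
                        l2 * np.sin(theta[0] + theta[1])])
    return np.stack([p1, p2])  # (2 joints, xy)

def jacobian(theta, lengths=(1.0, 1.0)):
    """Analytic Jacobian of the end-effector w.r.t. the joint angles;
    this is what makes the skeleton differentiable in the pose vector."""
    l1, l2 = lengths
    s1, c1 = np.sin(theta[0]), np.cos(theta[0])
    s12, c12 = np.sin(theta[0] + theta[1]), np.cos(theta[0] + theta[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])
```

In the full model this differentiable skeleton is the input to implicit LBS, which carries gradients onward to the 3D Gaussians and the rendered image.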

Rendering Quality

Dr. Robot can be trained from the URDF file of any robot, regardless of form factor or number of degrees of freedom.
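Since training starts from a URDF, the kinematic structure the model needs (joints and their limits) can be read directly from that file. A hedged sketch with the standard library, using a made-up toy URDF rather than one from the paper:

```python
import xml.etree.ElementTree as ET

def parse_urdf_joints(urdf_xml):
    """Return {joint_name: (lower, upper)} for revolute joints in a URDF."""
    root = ET.fromstring(urdf_xml)
    joints = {}
    for joint in root.iter("joint"):
        if joint.get("type") != "revolute":
            continue
        limit = joint.find("limit")
        joints[joint.get("name")] = (float(limit.get("lower")),
                                     float(limit.get("upper")))
    return joints

# Illustrative toy URDF, not from the paper.
EXAMPLE_URDF = """
<robot name="toy_arm">
  <joint name="shoulder" type="revolute">
    <limit lower="-1.57" upper="1.57"/>
  </joint>
  <joint name="elbow" type="revolute">
    <limit lower="0.0" upper="2.5"/>
  </joint>
</robot>
"""
```

The parsed joint limits define the valid range of the pose vector that the differentiable renderer takes as input.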

Single View Inverse Dynamics

From an input video (left), we perform optimization to reconstruct the joint angles of the robot and re-render the robot (right).
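The optimization can be sketched as gradient descent on a reconstruction loss between the rendered robot and the observation. In this runnable stand-in (an assumption, not the paper's code), a 2-link planar arm with keypoint L2 loss replaces the full Gaussian-splat renderer and image loss, and finite differences replace automatic differentiation:

```python
import numpy as np

def render_keypoints(theta):
    """Stand-in differentiable 'renderer': joint angles -> 2D keypoints."""
    p1 = np.array([np.cos(theta[0]), np.sin(theta[0])])
    p2 = p1 + np.array([np.cos(theta[0] + theta[1]),
                        np.sin(theta[0] + theta[1])])
    return np.concatenate([p1, p2])

def fit_pose(observed, theta0, lr=0.05, steps=1500, eps=1e-5):
    """Minimize ||render(theta) - observed||^2 by gradient descent."""
    theta = theta0.astype(float).copy()
    for _ in range(steps):
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            d = np.zeros_like(theta); d[i] = eps
            f_plus = np.sum((render_keypoints(theta + d) - observed) ** 2)
            f_minus = np.sum((render_keypoints(theta - d) - observed) ** 2)
            grad[i] = (f_plus - f_minus) / (2 * eps)
        theta -= lr * grad
    return theta
```

With the actual model, the same loop backpropagates a pixel-space loss through the renderer to the joint angles instead of using finite differences.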

Visual MPC

We perform optimization of the joint angles of a Shadow Hand to maximize the CLIP similarity between the rendered image and a text prompt. We show the optimization process (left) as well as the final outputs for different prompts (right).
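The loop can be sketched as gradient ascent on a similarity score over joint angles. The real system scores CLIP(rendered image, text); in this hedged, runnable stand-in, a cosine similarity against a fixed target feature vector plays the role of the CLIP score, and a toy pose-to-feature map replaces the renderer plus image encoder:

```python
import numpy as np

def render_features(theta):
    """Stand-in for renderer + image encoder: pose -> feature vector."""
    p1 = np.array([np.cos(theta[0]), np.sin(theta[0])])
    p2 = p1 + np.array([np.cos(theta[0] + theta[1]),
                        np.sin(theta[0] + theta[1])])
    return np.concatenate([p1, p2])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def maximize_similarity(target_feat, theta0, lr=0.1, steps=500, eps=1e-5):
    """Finite-difference gradient ascent on the similarity score."""
    theta = theta0.astype(float).copy()
    for _ in range(steps):
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            d = np.zeros_like(theta); d[i] = eps
            grad[i] = (cosine(render_features(theta + d), target_feat)
                       - cosine(render_features(theta - d), target_feat)) / (2 * eps)
        theta += lr * grad
    return theta
```

In the actual pipeline the gradient of the CLIP score flows through the image into the pose vector, so no finite differences are needed.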

Planning with Text2Video Model

We show that robot actions can be extracted from videos predicted by a finetuned text-to-video model and executed on the robot, achieving language-conditioned robot planning.
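Once per-frame joint angles have been reconstructed from the predicted video by pose optimization, the pose sequence still has to become executable commands. One simple, hedged way to do this (an illustrative assumption, not necessarily the paper's controller) is finite-difference joint velocities clipped to a velocity limit:

```python
import numpy as np

def poses_to_actions(poses, fps=10.0, max_vel=1.0):
    """Convert a (T, dof) joint-angle sequence into (T-1, dof) velocity
    commands: finite differences between consecutive frames, clipped to
    an assumed per-joint velocity limit (rad/s)."""
    poses = np.asarray(poses, dtype=float)
    vel = (poses[1:] - poses[:-1]) * fps
    return np.clip(vel, -max_vel, max_vel)
```

The frame rate and velocity limit here are placeholders; on a real robot they would come from the video model's output rate and the URDF's joint limits.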

Motion Retargeting

We perform optimization on robot action trajectories to minimize the chamfer distance between the point tracks in the rendered video and a demonstration video, allowing us to transfer motion across the embodiment gap.
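The objective being minimized is a chamfer distance between point sets. A minimal sketch of a symmetric 2D chamfer distance (the real system compares point tracks across the rendered video and the demonstration video):

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric chamfer distance between point sets a (Na, 2) and b (Nb, 2):
    mean nearest-neighbor distance from a to b plus from b to a."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (Na, Nb)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

Because every operation is differentiable almost everywhere, gradients of this loss can flow back through the rendered point tracks to the robot action trajectory.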