We propose TesserAct, the first open-source and generalized 4D World Model for robotics, which takes input images and text instructions to generate RGB, depth, and normal videos, reconstructing a 4D ...
Abstract: 3D visual language multi-modal modeling plays an important role in actual human-computer interaction. However, the inaccessibility of large-scale 3D-language pairs restricts their ...