Overview
In this work, we propose a method to generate a sequence of natural human-scene interaction events in complex real-world scenes, as illustrated in this figure. The human first walks to sit on a stool (yellow to red), then walks to another chair to sit down (red to magenta), and finally walks to and lies on the sofa (magenta to blue).
Method
We formulate synthesizing human behaviors in 3D scenes as a Markov decision process with a latent action space, which is learned from motion capture datasets. We train scene-aware and goal-driven agent policies to synthesize various human behaviors in indoor scenes including wandering in the room, sitting or lying on an object, and sequential combinations of these actions.
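The formulation above can be pictured as a rollout loop: at each step a policy samples a latent action from the current state and goal, and a motion primitive decodes that latent action into the next body state. The snippet below is only a minimal sketch of that loop. The dimensions, the random linear maps standing in for the learned networks, and the names `policy`, `motion_primitive`, and `rollout` are illustrative assumptions, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM = 6    # hypothetical size; stands in for a body-marker state
LATENT_DIM = 4   # hypothetical size of the latent action space

# Random linear maps standing in for the learned policy and motion primitive.
W_policy = rng.normal(scale=0.1, size=(LATENT_DIM, 2 * STATE_DIM))
W_decoder = rng.normal(scale=0.1, size=(STATE_DIM, STATE_DIM + LATENT_DIM))


def policy(state, goal):
    """Sample a latent action z ~ N(mu(state, goal), I)."""
    mu = W_policy @ np.concatenate([state, goal])
    return mu + rng.normal(size=LATENT_DIM)


def motion_primitive(state, z):
    """Decode a latent action into the next state (toy stand-in decoder)."""
    return state + W_decoder @ np.concatenate([state, z])


def rollout(state, goal, num_steps):
    """Roll out the MDP: sample a latent action, decode it, repeat."""
    trajectory = [state]
    for _ in range(num_steps):
        z = policy(state, goal)
        state = motion_primitive(state, z)
        trajectory.append(state)
    return np.stack(trajectory)


trajectory = rollout(np.zeros(STATE_DIM), np.ones(STATE_DIM), num_steps=5)
```

Because the policy acts in the latent space of the motion primitive rather than on raw poses, every sampled action decodes to a motion the primitive model was trained to produce.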
Illustration of our proposed human-scene interaction synthesis framework, which consists of learned motion primitives (actions), alongside locomotion and interaction policies that generate latent actions conditioned on scenes and interaction goals. By integrating navigation mesh-based path-finding and static human-scene interaction generation methods, we can synthesize realistic motion sequences for virtual humans with fine-grained control.
Illustration of the scene-aware locomotion policy network. The locomotion policy state consists of the body markers, the goal-reaching feature of normalized direction vectors from the markers to the goal pelvis, and the interaction feature of a 2D binary map indicating the walkability (red: non-walkable area, blue: walkable area) of the local 1.6 m × 1.6 m square area. The locomotion policy network employs an actor-critic architecture in which the actor and critic share the state encoder.
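The two scene- and goal-related parts of the locomotion state can be sketched as follows. This is a minimal illustration under stated assumptions: the 16 × 16 grid resolution, the axis-aligned map (body orientation is ignored here), the `is_walkable` callable, and the function names are all hypothetical choices made for the example; only the 1.6 m side length comes from the description above.

```python
import numpy as np


def goal_reaching_feature(markers, goal_pelvis):
    """Unit direction vectors from each marker (N, 3) to the goal pelvis (3,)."""
    offsets = goal_pelvis[None, :] - markers
    norms = np.linalg.norm(offsets, axis=1, keepdims=True)
    return offsets / np.maximum(norms, 1e-8)  # guard against zero-length offsets


def local_walkability_map(is_walkable, center_xy, side=1.6, resolution=16):
    """Binary walkability map over a side x side square centered at center_xy.

    is_walkable: callable mapping (M, 2) floor points to (M,) booleans
    (a hypothetical stand-in for a scene query).
    """
    # Cell-center coordinates along one axis, relative to the square's center.
    coords = (np.arange(resolution) + 0.5) / resolution * side - side / 2.0
    xs, ys = np.meshgrid(coords, coords, indexing="xy")
    points = np.stack([xs.ravel(), ys.ravel()], axis=1) + np.asarray(center_xy)
    grid = np.asarray(is_walkable(points)).reshape(resolution, resolution)
    return grid.astype(np.float32)  # 1 = walkable, 0 = non-walkable


# Toy scene: everything with x < 0.5 m is walkable (a wall at x = 0.5 m).
toy_walkable = lambda pts: pts[:, 0] < 0.5

markers = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
goal_dirs = goal_reaching_feature(markers, np.array([2.0, 0.0, 0.0]))
walk_map = local_walkability_map(toy_walkable, center_xy=(0.0, 0.0))
```

In this toy scene, map columns index x and rows index y, so the rightmost columns of `walk_map` are zero where the wall blocks the floor.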
Illustration of the object interaction policy network. The interaction policy state consists of the body markers, the goal-reaching features of both distance and direction from current markers to the goal markers, and the interaction features of the signed distances from each marker to the object surfaces and the signed distance gradient at each marker location. Such interaction features encode the human-object proximity relationship.
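The interaction policy state described above can be assembled roughly as in the sketch below. A sphere with an analytic signed distance function stands in for the object, and the gradient is estimated with central finite differences. The object, the step size, the feature ordering, and all function names are assumptions made for this example.

```python
import numpy as np


def sphere_sdf(points, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(points - center, axis=-1) - radius


def sdf_gradient(sdf, points, eps=1e-4):
    """Central finite-difference gradient of an SDF at each point (N, 3)."""
    grads = np.zeros_like(points, dtype=float)
    for axis in range(3):
        step = np.zeros(3)
        step[axis] = eps
        grads[:, axis] = (sdf(points + step) - sdf(points - step)) / (2 * eps)
    return grads


def interaction_state(markers, goal_markers, sdf):
    """Concatenate per-marker features into an (N, 11) state array."""
    offsets = goal_markers - markers
    dist = np.linalg.norm(offsets, axis=1, keepdims=True)   # goal distance
    direction = offsets / np.maximum(dist, 1e-8)            # goal direction
    signed_dist = sdf(markers)[:, None]                     # marker-to-surface
    grad = sdf_gradient(sdf, markers)                       # SDF gradient
    return np.concatenate([markers, dist, direction, signed_dist, grad], axis=1)


# Toy object: a unit sphere at the origin (hypothetical stand-in for a chair).
toy_sdf = lambda pts: sphere_sdf(pts, center=np.zeros(3), radius=1.0)

markers = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 0.5]])
goal_markers = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
state = interaction_state(markers, goal_markers, toy_sdf)
```

The signed distance tells the policy how far each marker is from the surface (and whether it penetrates it), while the gradient points in the direction in which that distance grows fastest.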
Video
Results
Our method generalizes to novel objects and real-world scenes.
Sitting on a novel chair with unique shape and low height.
Inhabiting reconstructed scenes.
Walking to sit on a bus stop bench in a Paris street point cloud from Paris-CARLA-3D.
Walking to sit on a bed in a reconstructed room from PROX.
Interaction with unusually shaped objects generated by the text-to-3D method Shap-E.
Sitting on avocado-shaped chairs.
Sitting on dog-shaped chairs.
Policy learning process.
Limitations.
Because the method is kinematics-based, human-scene penetration remains observable in the synthesized results.
Insufficient training motion data for lying leads to degraded motion results with unnatural poses.
Comparison with more related works.
Citation
@inproceedings{Zhao:ICCV:2023,
  title = {Synthesizing Diverse Human Motions in 3D Indoor Scenes},
  author = {Zhao, Kaifeng and Zhang, Yan and Wang, Shaofei and Beeler, Thabo and Tang, Siyu},
  booktitle = {International conference on computer vision (ICCV)},
  year = {2023}
}
Related Projects
This project builds on prior work; please also consider citing the following projects:
GAMMA: The Wanderings of Odysseus in 3D Scenes
@inproceedings{zhang2022wanderings,
  title={The Wanderings of Odysseus in 3D Scenes},
  author={Zhang, Yan and Tang, Siyu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={20481--20491},
  year={2022}
}
COINS: Compositional Human-Scene Interaction Synthesis with Semantic Control
@inproceedings{Zhao:ECCV:2022,
  title = {Compositional Human-Scene Interaction Synthesis with Semantic Control},
  author = {Zhao, Kaifeng and Wang, Shaofei and Zhang, Yan and Beeler, Thabo and Tang, Siyu},
  booktitle = {European conference on computer vision (ECCV)},
  year = {2022}
}