COINS

Compositional Human-Scene Interaction Synthesis with Semantic Control


1ETH Zürich     2Google   
European Conference on Computer Vision (ECCV) 2022

Paper      ArXiv      Dataset      Code



Overview

We propose COINS, a method for COmpositional INteraction Synthesis with Semantic Control.

Given an action-object instance pair as the semantic specification, our method generates virtual humans that naturally interact with the object (first row). Furthermore, our method retargets interactions to unseen action-object combinations (second row) and synthesizes composite interactions without requiring any composite training data (third row).

teaser


Method

Given a 3D scene and semantic specifications of actions (e.g., “sit on”, “touch”) paired with object instances (highlighted in color), COINS first generates a plausible human pelvis location and orientation and then generates the detailed body.

pipeline

We leverage transformer-based generative models for human pelvis and body generation. The articulated 3D human body surface points and 3D objects are jointly encoded in a unified latent space, and the semantics of the interaction between the human and objects are embedded via positional encoding.

body_vae
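As one way to make the semantic embedding concrete: a minimal, stdlib-only sketch of encoding a discrete interaction label (e.g., an action ID) with the sinusoidal scheme that positional encoding uses in standard transformers. The label vocabulary, dimension, and function name are illustrative assumptions, not the released COINS implementation.

```python
import math

def semantic_encoding(label_id: int, d_model: int = 64) -> list[float]:
    """Sinusoidal encoding of a discrete semantic label, analogous to
    transformer positional encoding. `label_id` indexes an assumed
    action/object-role vocabulary (e.g., 0 = "sit on", 1 = "touch")."""
    enc = []
    for i in range(0, d_model, 2):
        # Geometric frequency schedule, as in the original transformer PE.
        freq = 1.0 / (10000 ** (i / d_model))
        enc.append(math.sin(label_id * freq))
        enc.append(math.cos(label_id * freq))
    return enc[:d_model]

# Such a vector can be added to the token features of the body/object
# points so the shared encoder sees which interaction is intended.
```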

PROX-S Dataset

teaser

The PROX-S dataset is a human-scene interaction dataset annotated on top of PROX and PROX-E. It contains:
(1) Scene instance segmentation, provided as per-vertex labels of instance ID and mpcat40 category ID.
(2) Interaction semantic annotation, provided as per-frame SMPL-X body fittings with semantic labels in the form of a list of action-object pairs.
The dataset is available here; please refer to the code repository for utility scripts.
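To illustrate the per-frame annotation layout described above, here is a hypothetical loader sketch; the file format, key names (`smplx_params`, `interaction`), and structure are assumptions for illustration, not the released PROX-S schema (see the code repository for the actual utilities).

```python
import json

def load_interaction_labels(path: str):
    """Yield (smplx_params, action_object_pairs) per frame from an
    assumed JSON layout: a list of frames, each holding SMPL-X body
    fitting parameters and a list of [action, object] labels."""
    with open(path) as f:
        frames = json.load(f)
    for frame in frames:
        smplx_params = frame["smplx_params"]              # body fitting (assumed key)
        pairs = [tuple(p) for p in frame["interaction"]]  # e.g. [("sit on", "sofa_1")]
        yield smplx_params, pairs
```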

Results


Our method COINS generates more physically plausible and human-like results than the baselines.

pipeline


Novel interactions of unseen action-object pairs created by COINS.

pipeline


We explicitly control COINS to generate bodies of varying shapes and sizes, including unseen extremely thin and heavy bodies.

pipeline


Synthesized interactions with noisily segmented objects (visualized in red).

pipeline

Citation

@inproceedings{Zhao:ECCV:2022,
   title = {Compositional Human-Scene Interaction Synthesis with Semantic Control},
   author = {Zhao, Kaifeng and Wang, Shaofei and Zhang, Yan and Beeler, Thabo and Tang, Siyu},
   booktitle = {European Conference on Computer Vision (ECCV)},
   month = oct,
   year = {2022}
}

Contact


For questions, please contact Kaifeng Zhao:
kaifeng.zhao@inf.ethz.ch