COINS

Compositional Human-Scene Interaction Synthesis with Semantic Control


1ETH Zürich     2Google   
European Conference on Computer Vision (ECCV) 2022

Paper      arXiv      Dataset      Code



Overview

We propose COINS, for COmpositional INteraction Synthesis with Semantic Control.

Given an action paired with an object instance as the semantic specification, our method generates virtual humans naturally interacting with the object (first row). Furthermore, our method retargets interactions to unseen action-object combinations (second row) and synthesizes composite interactions without requiring any corresponding composite training data (third row).

teaser

Using COINS, we populated a modern loft scene with virtual humans performing diverse interactions.


Demo

Here we show an interactive demo where you can choose from a list of interactions in the format 'action-object category-object id' and browse randomly generated interactions.
The demo may take some time to load the first time. Click the 'visualize' button to show the next generated interaction. You can rotate and zoom the scene with the mouse or with touch gestures, as in most popular 3D viewers.
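
As an illustration of the 'action-object category-object id' format, the sketch below parses one such string into its three fields. The example string "sit on-chair-9", the hyphen delimiter, and the names InteractionSpec and parse_interaction are assumptions made for illustration, not part of the COINS code.

```python
from typing import NamedTuple


class InteractionSpec(NamedTuple):
    action: str            # e.g. "sit on" (may contain spaces)
    object_category: str   # e.g. "chair"
    object_id: int         # instance ID of the object in the scene


def parse_interaction(spec: str) -> InteractionSpec:
    """Parse a string of the assumed form 'action-object category-object id'."""
    # Split from the right so that any hyphen inside the action is preserved.
    parts = spec.rsplit("-", 2)
    if len(parts) != 3:
        raise ValueError(f"expected 'action-category-id', got {spec!r}")
    action, category, object_id = (p.strip() for p in parts)
    if not action or not category:
        raise ValueError(f"empty action or category in {spec!r}")
    # int() raises ValueError if the ID is not numeric.
    return InteractionSpec(action, category, int(object_id))


# Hypothetical example values.
specs = [parse_interaction(s) for s in ["sit on-chair-9", "touch-table-2"]]
```

A list of parsed specifications, as in the last line, is one simple way to represent a composite interaction.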



Method

Given a 3D scene and semantic specifications of actions (e.g., “sit on”, “touch”) paired with object instances (highlighted with color), COINS first generates a plausible human pelvis location and orientation and then generates the detailed body.
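
The two-stage structure can be sketched as follows. This is a toy illustration only: the learned pelvis generator is replaced by a random heuristic that places the pelvis at a fixed distance from the object, and the idea of passing the object to the second stage in pelvis-local coordinates is an assumption. PelvisFrame, sample_pelvis, to_pelvis_frame, the z-up convention, and all numeric values are hypothetical.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class PelvisFrame:
    position: tuple  # (x, y, z) in scene coordinates
    yaw: float       # rotation about the vertical z axis, in radians


def sample_pelvis(object_center, rng, distance=0.5):
    """Stage 1 placeholder: a pelvis at a fixed distance from the object, facing it."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    x = object_center[0] + distance * math.cos(angle)
    y = object_center[1] + distance * math.sin(angle)
    # Orient the pelvis so that its local x axis points at the object.
    yaw = math.atan2(object_center[1] - y, object_center[0] - x)
    return PelvisFrame((x, y, object_center[2]), yaw)


def to_pelvis_frame(point, pelvis):
    """Express a scene-space point in pelvis-local coordinates (input to stage 2)."""
    dx = point[0] - pelvis.position[0]
    dy = point[1] - pelvis.position[1]
    dz = point[2] - pelvis.position[2]
    c, s = math.cos(pelvis.yaw), math.sin(pelvis.yaw)
    # Apply the inverse rotation R(-yaw) about the z axis.
    return (c * dx + s * dy, -s * dx + c * dy, dz)


rng = random.Random(0)
chair_center = (2.0, 1.0, 0.4)           # hypothetical object location
pelvis = sample_pelvis(chair_center, rng)
local = to_pelvis_frame(chair_center, pelvis)
# A body generator (stage 2) would consume object points expressed like `local`.
```

In this toy setup the object always ends up directly ahead of the pelvis at the chosen distance, whatever angle was sampled.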

pipeline

We leverage transformer-based generative models for human pelvis and body generation. The articulated 3D human body surface points and 3D objects are jointly encoded in a unified latent space, and the semantics of the interaction between the human and objects are embedded via positional encoding.
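
One way to picture how semantics could enter the token sequence is an additive per-action vector, applied the way a positional encoding is added to tokens. The sketch below assumes this additive form; the random vectors stand in for embeddings that a real model would learn, and make_action_embeddings, add_semantics, and EMBED_DIM are hypothetical names rather than the COINS implementation.

```python
import random

EMBED_DIM = 8


def make_action_embeddings(actions, dim, seed=0):
    """Stand-in for learned embeddings: one fixed random vector per action."""
    rng = random.Random(seed)
    return {a: [rng.gauss(0.0, 1.0) for _ in range(dim)] for a in actions}


def add_semantics(tokens, token_actions, embeddings):
    """Add the action vector to every token tagged with that action.

    tokens:        list of feature vectors (one per human or object point)
    token_actions: per-token action name, or None for untagged tokens
    """
    out = []
    for features, action in zip(tokens, token_actions):
        if action is None:
            out.append(list(features))
        else:
            out.append([f + e for f, e in zip(features, embeddings[action])])
    return out


embeddings = make_action_embeddings(["sit on", "touch"], EMBED_DIM)
# Three toy tokens: a chair point, a table point, and one untagged point.
tokens = [[0.0] * EMBED_DIM for _ in range(3)]
encoded = add_semantics(tokens, ["sit on", "touch", None], embeddings)
```

Under this assumption a composite interaction needs no special handling: tokens from different objects simply receive different action vectors within the same sequence.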

body_vae

PROX-S Dataset

teaser

The PROX-S dataset is a human-scene interaction dataset annotated on top of PROX and PROX-E. It contains:
(1) Scene instance segmentation, provided as per-vertex mesh labels giving the instance ID and mpcat40 category ID.
(2) Interaction semantic annotation, provided as per-frame SMPL-X body fittings with semantic labels in the format of a list of action-object pairs.
We provide the dataset here; please refer to the code repository for utility scripts.
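
The two annotation types above can be pictured with plain Python structures. The layout below (a flat list of instance IDs per vertex, and one list of (action, instance ID) pairs per frame) is an assumption for illustration and not the actual PROX-S file format; the toy values and the helper names are hypothetical. The utility scripts in the code repository are the reference for loading the real data.

```python
from collections import defaultdict


def vertices_by_instance(instance_ids):
    """Group vertex indices by the instance ID stored for each vertex."""
    groups = defaultdict(list)
    for vertex_index, instance_id in enumerate(instance_ids):
        groups[instance_id].append(vertex_index)
    return dict(groups)


def frames_with(frame_labels, action, instance_id):
    """Indices of frames whose label list contains the given action-object pair."""
    return [i for i, pairs in enumerate(frame_labels)
            if (action, instance_id) in pairs]


# Toy scene: six vertices belonging to three object instances.
instance_ids = [0, 0, 3, 3, 3, 7]
# Toy sequence: three frames, each labeled with a list of action-object pairs.
frame_labels = [
    [("sit on", 3)],
    [("sit on", 3), ("touch", 7)],   # a composite interaction
    [("touch", 7)],
]

chair_vertices = vertices_by_instance(instance_ids)[3]
sit_frames = frames_with(frame_labels, "sit on", 3)
```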

Results


Our method, COINS, generates more physically feasible and human-like results than the baselines.



Novel interactions of unseen action-object pairs created by COINS.



We explicitly control COINS to generate bodies of varying shapes and sizes, including unseen extremely thin and heavy bodies.



Synthesized interactions with noisily segmented objects (visualized in red).


Video

Citation

@inproceedings{Zhao:ECCV:2022,
   title = {Compositional Human-Scene Interaction Synthesis with Semantic Control},
   author = {Zhao, Kaifeng and Wang, Shaofei and Zhang, Yan and Beeler, Thabo and Tang, Siyu},
   booktitle = {European Conference on Computer Vision (ECCV)},
   year = {2022}
}

Contact


For questions, please contact Kaifeng Zhao:
kaifeng.zhao@inf.ethz.ch