TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding

Yun Liu1,2,3, Haolin Yang4, Xu Si1, Ling Liu5, Zipeng Li1, Yuxiang Zhang1, Yebin Liu1, Li Yi1,2,3
1Tsinghua University, 2Shanghai Artificial Intelligence Laboratory, 3Shanghai Qi Zhi Institute, 4Beijing University of Posts and Telecommunications, 5Beijing Institute of Technology

CVPR 2024

TACO is a large-scale bimanual hand-object manipulation dataset covering extensive tool-action-object combinations in real-world scenarios. It supports test-time generalization to unseen object geometries and novel behavior triplets, and benchmarks a variety of generalizable research topics, e.g., action recognition, motion forecasting, and cooperative grasp synthesis.

Diversity in Object Shapes

[Image gallery: example object categories, e.g., brushes and plates, each with diverse geometries.]

Diversity in Interaction Tool-Action-Object Triplets

[Image gallery: examples of interaction tool-action-object triplets.]

[Video gallery: example data sequences under two action types. The "dust" panel features a kettle, brush, roller, plate, and box; the "stir" panel features a pan, spatula, spoon, plate, and bowl. Hand-object mesh annotations are rendered in these videos.]

Abstract

Humans commonly work with multiple objects in daily life and can intuitively transfer manipulation skills to novel objects by understanding object functional regularities. However, existing technical approaches for analyzing and synthesizing hand-object manipulation are mostly limited to handling a single hand and a single object, owing to the lack of supporting data.

To address this, we construct TACO, an extensive bimanual hand-object-interaction dataset spanning a large variety of tool-action-object compositions for daily human activities. TACO contains 2.5K motion sequences paired with third-person and egocentric views, precise hand-object 3D meshes, and action labels.
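To make the per-sequence contents concrete, the sketch below shows one way a TACO-style record could be organized in Python. All field names and shapes here are our illustration, not the dataset's actual file layout; consult the official release for the real format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TACOSequence:
    """Hypothetical container for one motion sequence.

    Field names and array shapes are illustrative only; they are not
    the dataset's actual file layout.
    """
    triplet: tuple               # (tool, action, object), e.g. ("brush", "dust", "plate")
    third_person_rgb: list       # per-camera RGB frame arrays, each (T, H, W, 3)
    egocentric_rgb: np.ndarray   # head-mounted camera frames, (T, H, W, 3)
    left_hand_verts: np.ndarray  # per-frame hand mesh vertices, (T, V, 3)
    right_hand_verts: np.ndarray # per-frame hand mesh vertices, (T, V, 3)
    tool_pose: np.ndarray        # per-frame rigid tool pose, (T, 4, 4)
    target_pose: np.ndarray      # per-frame rigid target-object pose, (T, 4, 4)
    action_label: str            # action class, e.g. "dust"
```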

To rapidly expand the data scale, we present a fully automatic data acquisition pipeline combining multi-view sensing with an optical motion capture system.
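One core step in such a mocap-driven pipeline is recovering each rigid object's 6-DoF pose per frame from its tracked optical markers. The snippet below is a minimal sketch of that step using the standard Kabsch algorithm; it illustrates the general technique, not the authors' implementation.

```python
import numpy as np

def rigid_pose_from_markers(ref_markers, obs_markers):
    """Estimate the rotation R and translation t mapping reference marker
    positions (N, 3) to observed positions (N, 3) via the Kabsch algorithm.
    A generic sketch of one mocap step, not the authors' code."""
    ref_c = ref_markers.mean(axis=0)
    obs_c = obs_markers.mean(axis=0)
    # Cross-covariance of the centered marker sets.
    H = (ref_markers - ref_c).T @ (obs_markers - obs_c)
    U, _, Vt = np.linalg.svd(H)
    # Determinant check guards against reflections (e.g. near-planar markers).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = obs_c - R @ ref_c
    return R, t
```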

Building on the broad range of research problems TACO enables, we benchmark three generalizable hand-object-interaction tasks: compositional action recognition, generalizable hand-object motion forecasting, and cooperative grasp synthesis. Extensive experiments reveal new insights, challenges, and opportunities for advancing the study of generalizable hand-object motion analysis and synthesis.
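As a concrete illustration of what "generalization to novel triplets" means in such benchmarks, a compositional split can hold out (tool, action, object) combinations whose individual components all appear in training, just never together. The triplets and split logic below are invented examples, not the dataset's actual vocabulary or splits.

```python
# Hypothetical compositional split: a test triplet recombines a tool,
# action, and object that were each seen during training, but never
# in this exact combination.
train_triplets = {
    ("brush", "dust", "plate"),
    ("roller", "dust", "box"),
    ("spatula", "stir", "pan"),
    ("spoon", "stir", "bowl"),
}

def is_compositional_test_triplet(triplet, train):
    tool, action, obj = triplet
    seen_tool   = any(t == tool   for t, _, _ in train)
    seen_action = any(a == action for _, a, _ in train)
    seen_obj    = any(o == obj    for _, _, o in train)
    return seen_tool and seen_action and seen_obj and triplet not in train

print(is_compositional_test_triplet(("spoon", "stir", "pan"), train_triplets))    # True: novel recombination
print(is_compositional_test_triplet(("brush", "dust", "plate"), train_triplets))  # False: seen in training
```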

Contact Us

If you have any questions or suggestions, please contact Yun Liu (yun-liu22@mails.tsinghua.edu.cn) or Li Yi (ericyi0124@gmail.com).

BibTeX

@article{liu2024taco,
  title={TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding},
  author={Liu, Yun and Yang, Haolin and Si, Xu and Liu, Ling and Li, Zipeng and Zhang, Yuxiang and Liu, Yebin and Yi, Li},
  journal={arXiv preprint arXiv:2401.08399},
  year={2024}
}