SonicAloha: Learning Auditory-enhanced Bimanual Manipulation with Audio-Visual Transformer

Abstract

Sound plays a vital role in daily life, especially when visual information is limited. However, incorporating sound into robotic manipulation remains challenging. To address this, we propose SonicAloha, a system that integrates airborne audio sensing into the Aloha platform to enable multimodal manipulation through end-to-end imitation learning. We collect synchronized audio segments, visual observations, and joint data at each timestep; concatenated audio segments are paired with the corresponding visual and joint inputs for training. To encode the airborne audio, we introduce an Audio Encoding Module built on a pretrained Audio Spectrogram Transformer (AST) fine-tuned with Weight-Decomposed Low-Rank Adaptation (DoRA). To fuse the encoded audio and vision features, we propose a Transformer-based Bidirectional Cross-Attention Module that combines two parallel cross-attention branches, enabling dynamic mutual attention between the modalities, and feeds the fused features into the Action Chunking with Transformers (ACT) policy to generate predicted actions. Experimentally, we introduce three long-horizon, audio-visual bimanual tasks: alarm shutting, box locking, and stapler checking. Our method, which fuses vision with airborne audio, outperforms vision-only and existing audio-visual baselines, and shows advantages over contact-audio methods. Moreover, it remains robust under closed-loop human interaction and audio-visual noise, and exhibits zero-shot generalization.
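The bidirectional cross-attention fusion can be pictured as two parallel multi-head cross-attention branches: one in which vision tokens query audio tokens, and one in which audio tokens query vision tokens. The PyTorch sketch below illustrates one plausible minimal realization; the embedding dimension, head count, residual/normalization layout, and token counts are assumptions for illustration, not the released implementation.

```python
# Minimal sketch of bidirectional cross-attention between vision and audio tokens
# (an illustrative reading of the module described above; dimensions are assumed).
import torch
import torch.nn as nn


class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Branch 1: vision tokens attend to audio tokens.
        self.vis_to_aud = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Branch 2: audio tokens attend to vision tokens.
        self.aud_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, vis_tokens: torch.Tensor, aud_tokens: torch.Tensor):
        # vis_tokens: (B, Nv, dim) tokens from the ResNet image encoders
        # aud_tokens: (B, Na, dim) tokens from the AST audio encoder
        v_fused, _ = self.vis_to_aud(query=vis_tokens, key=aud_tokens, value=aud_tokens)
        a_fused, _ = self.aud_to_vis(query=aud_tokens, key=vis_tokens, value=vis_tokens)
        # Residual connections keep the original unimodal features alongside the fused ones.
        vis_out = self.norm_v(vis_tokens + v_fused)
        aud_out = self.norm_a(aud_tokens + a_fused)
        # Concatenate along the token axis before passing to the ACT policy.
        return torch.cat([vis_out, aud_out], dim=1)


# Example with assumed token counts: 900 vision tokens (3 cameras) and 100 audio tokens.
fusion = BidirectionalCrossAttention(dim=512)
fused = fusion(torch.randn(1, 900, 512), torch.randn(1, 100, 512))
print(fused.shape)  # torch.Size([1, 1000, 512])
```

The residual connections in both branches preserve the unimodal tokens, a common design choice that keeps either modality usable when the other is uninformative.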


Approach Overview

Fig. 1. Approach Overview: We sample observations from the collected dataset. Each observation at time t includes visual images from the top-view, left-arm, and right-arm cameras, the audio recorded over the L seconds preceding t, and the robot's joint states at time t. For the visual input, the three camera observations are encoded by three separate ResNets, each producing feature representations shown as stacked blue feature maps, which are then tokenized into vision tokens, depicted as hollow blue squares. The L-second audio waveform is first downsampled from 48 kHz to 16 kHz and then converted into Mel spectrograms. The Mel spectrograms are encoded with an AST model initialized from AudioSet-pretrained weights. During training, the AST backbone is frozen and only the DoRA modules are fine-tuned on our task-specific data, producing audio feature tokens shown as hollow orange squares. The extracted visual and audio features are fed into a Transformer-based bidirectional cross-attention mechanism, which lets each modality adaptively attend to the other's informative regions. The resulting fused features are represented as blue and orange solid squares. These cross-attention-enhanced features are then combined with the current robot joint values and input into the ACT policy. The ACT policy outputs the predicted robot joint actions for the next m timesteps and computes the reconstruction loss between the predicted actions and the ground-truth joint values.
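The audio branch described above (resample 48 kHz to 16 kHz, compute Mel-spectrogram features, encode with an AudioSet-pretrained AST, and train only DoRA adapters) could be assembled roughly as in the sketch below, using torchaudio, the Hugging Face transformers AST classes, and the PEFT library's DoRA option. The checkpoint name, adapter rank, and target modules are assumptions, not details from the paper.

```python
# Hedged sketch of the audio encoding path: resample, extract Mel features for AST,
# and attach DoRA adapters so only the low-rank modules are trained (assumed config).
import torch
import torchaudio
from transformers import ASTFeatureExtractor, ASTModel
from peft import LoraConfig, get_peft_model

# 1) Resample the L-second waveform from the microphone's 48 kHz to 16 kHz.
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)
waveform_48k = torch.randn(1, 48_000 * 2)        # placeholder 2-second clip
waveform_16k = resampler(waveform_48k)

# 2) Convert the waveform into the Mel-spectrogram features expected by AST.
extractor = ASTFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
inputs = extractor(waveform_16k.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt")

# 3) Load an AudioSet-pretrained AST and wrap it with DoRA adapters (use_dora=True).
ast = ASTModel.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
dora_cfg = LoraConfig(r=8, lora_alpha=16, use_dora=True,
                      target_modules=["query", "key", "value"])
ast = get_peft_model(ast, dora_cfg)              # backbone frozen, adapter weights trainable

audio_tokens = ast(**inputs).last_hidden_state   # (1, num_patches, hidden_dim) audio tokens
```

With this setup, the pretrained AST weights stay frozen and only the low-rank DoRA parameters receive gradients during task-specific fine-tuning, matching the training scheme described in the caption.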

Autonomous Bimanual Skills with Audio Information

🔊 Please turn on your speaker for the full experience

1. Alarm Shutting

Our Method (Audio + Vision)

2. Box Locking

Our Method (Audio + Vision)

3. Stapler Checking

Our Method (Audio + Vision) With No Staple

Our Method (Audio + Vision) With Staple


Robustness

Closed-loop Human Interaction

Our Method (Audio + Vision)

Audio-visual Noise

Our Method (Audio + Vision)

Zero-shot

Our Method (Audio + Vision) With New Staple

Representative Failure Cases

Our Method with 2s Audio Length for Box Locking (The robot arm re-presses the already-locked box.)

Contact-Audio Method (The contact microphone cannot capture the airborne sound of the ringing alarm, so the robot fails to respond.)

Vision-Only ACT for Stapler Checking (The policy fails to learn either of the two branches and falls into the same failure pattern.)