SonicAloha: Learning Auditory-enhanced Bimanual Manipulation with Audio-Visual Transformer

Abstract

Sound plays a vital role in daily life, especially when visual information is limited. However, incorporating sound into robotic manipulation remains challenging. To address this, we propose SonicAloha, a system that integrates airborne audio sensing into the Aloha platform to enable multimodal manipulation through end-to-end imitation learning. We collect synchronized audio segments, visual observations, and joint data at each timestep; concatenated audio segments are paired with the corresponding visual and joint inputs for training. To encode the airborne audio, we introduce an Audio Encoding Module built on a pretrained Audio Spectrogram Transformer (AST) fine-tuned with Weight-Decomposed Low-Rank Adaptation (DoRA). To fuse the encoded audio and vision features, we propose a Transformer-based Bidirectional Cross-Attention Module that combines two parallel cross-attention branches, enabling dynamic mutual attention between the modalities, and feeds the fused features into the Action Chunking with Transformers (ACT) policy to generate predicted actions. Experimentally, we introduce three long-horizon, audio-visual bimanual tasks: alarm shutting, box locking, and stapler checking. Our method, which fuses vision with airborne audio, outperforms vision-only and existing audio-visual baselines, and shows advantages over contact-audio methods. Moreover, it remains robust under closed-loop human interaction and audio-visual noise, and exhibits zero-shot generalization.
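The bidirectional cross-attention fusion can be pictured as two parallel multi-head cross-attention branches: one in which vision tokens query audio tokens, and one in which audio tokens query vision tokens. The PyTorch sketch below illustrates one plausible minimal realization; the embedding dimension, head count, residual/normalization layout, and token counts are assumptions for illustration, not the released implementation.

```python
# Minimal sketch of bidirectional cross-attention between vision and audio tokens
# (an illustrative reading of the module described above; dimensions are assumed).
import torch
import torch.nn as nn


class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Branch 1: vision tokens attend to audio tokens.
        self.vis_to_aud = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Branch 2: audio tokens attend to vision tokens.
        self.aud_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, vis_tokens: torch.Tensor, aud_tokens: torch.Tensor):
        # vis_tokens: (B, Nv, dim) tokens from the ResNet image encoders
        # aud_tokens: (B, Na, dim) tokens from the AST audio encoder
        v_fused, _ = self.vis_to_aud(query=vis_tokens, key=aud_tokens, value=aud_tokens)
        a_fused, _ = self.aud_to_vis(query=aud_tokens, key=vis_tokens, value=vis_tokens)
        # Residual connections keep the original unimodal features alongside the fused ones.
        vis_out = self.norm_v(vis_tokens + v_fused)
        aud_out = self.norm_a(aud_tokens + a_fused)
        # Concatenate along the token axis before passing to the ACT policy.
        return torch.cat([vis_out, aud_out], dim=1)


# Example with assumed token counts: 900 vision tokens (3 cameras) and 100 audio tokens.
fusion = BidirectionalCrossAttention(dim=512)
fused = fusion(torch.randn(1, 900, 512), torch.randn(1, 100, 512))
print(fused.shape)  # torch.Size([1, 1000, 512])
```

The residual connections in both branches preserve the unimodal tokens, a common design choice that keeps either modality usable when the other is uninformative.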


Approach Overview

Fig. 1. Approach Overview: We sample observations from the collected dataset. Each observation at time t includes visual images from the top-view, left-arm, and right-arm cameras, the audio recorded over the L seconds preceding t, and the robot's joint states at time t. For the visual input, the three camera observations are encoded by three separate ResNets, each producing feature representations shown as stacked blue feature maps, which are then tokenized into vision tokens, depicted as hollow blue squares. The L-second audio waveform is first downsampled from 48 kHz to 16 kHz and then converted into Mel spectrograms. The Mel spectrograms are encoded with an AST model initialized from AudioSet-pretrained weights. During training, the AST backbone is frozen and only the DoRA modules are fine-tuned on our task-specific data, producing audio feature tokens shown as hollow orange squares. The extracted visual and audio features are fed into a Transformer-based bidirectional cross-attention mechanism, which lets each modality adaptively attend to the other's informative regions. The resulting fused features are represented as blue and orange solid squares. These cross-attention-enhanced features are then combined with the current robot joint values and input into the ACT policy. The ACT policy outputs the predicted robot joint actions for the next m timesteps and computes the reconstruction loss between the predicted actions and the ground-truth joint values.
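The audio branch described above (resample 48 kHz to 16 kHz, compute Mel-spectrogram features, encode with an AudioSet-pretrained AST, and train only DoRA adapters) could be assembled roughly as in the sketch below, using torchaudio, the Hugging Face transformers AST classes, and the PEFT library's DoRA option. The checkpoint name, adapter rank, and target modules are assumptions, not details from the paper.

```python
# Hedged sketch of the audio encoding path: resample, extract Mel features for AST,
# and attach DoRA adapters so only the low-rank modules are trained (assumed config).
import torch
import torchaudio
from transformers import ASTFeatureExtractor, ASTModel
from peft import LoraConfig, get_peft_model

# 1) Resample the L-second waveform from the microphone's 48 kHz to 16 kHz.
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)
waveform_48k = torch.randn(1, 48_000 * 2)        # placeholder 2-second clip
waveform_16k = resampler(waveform_48k)

# 2) Convert the waveform into the Mel-spectrogram features expected by AST.
extractor = ASTFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
inputs = extractor(waveform_16k.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt")

# 3) Load an AudioSet-pretrained AST and wrap it with DoRA adapters (use_dora=True).
ast = ASTModel.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
dora_cfg = LoraConfig(r=8, lora_alpha=16, use_dora=True,
                      target_modules=["query", "key", "value"])
ast = get_peft_model(ast, dora_cfg)              # backbone frozen, adapter weights trainable

audio_tokens = ast(**inputs).last_hidden_state   # (1, num_patches, hidden_dim) audio tokens
```

With this setup, the pretrained AST weights stay frozen and only the low-rank DoRA parameters receive gradients during task-specific fine-tuning, matching the training scheme described in the caption.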

Autonomous Bimanual Skills with Audio Information

🔊 Please turn on your speaker for the full experience

1. Alarm Shutting

Our Method (Audio + Vision)

2. Box Locking

Our Method (Audio + Vision)

3. Stapler Checking

Our Method (Audio + Vision) With No Staple

Our Method (Audio + Vision) With Staple


Robustness

Closed-loop Human Interaction

Our Method (Audio + Vision)

Audio-visual Noise

Our Method (Audio + Vision)

Zero-shot

Our Method (Audio + Vision) With New Staple

Representative Failure Cases

Our Method with 2s Audio Length for Box Locking (The robot arm re-presses the already-locked box.)

Contact-Audio Method (The contact microphone cannot capture the airborne sound of the ringing alarm, so the robot fails to respond.)

Vision-Only ACT for Stapler Checking (The policy fails to learn either of the two branches and falls into the same failure pattern.)