From Words to Actions: Language-Guided Hierarchical RL for Object Rearrangement

Joe Lin, Allen Luo, Nishant Ray
UCLA
COM SCI 188 - Final Project



Abstract

We explore the use of natural language to guide hierarchical reinforcement learning (HRL) agents in long-horizon indoor rearrangement tasks. Using the Habitat Lab framework and the ReplicaCAD rearrange_easy benchmark, we train a high-level policy to select among pre-trained low-level skills based on both visual observations and language instructions. To fuse multimodal inputs, we evaluate two techniques: FiLM and cross-attention. Language instructions are generated using a large language model and embedded via a pretrained CLIP encoder. Our results show that incorporating language does not yield a clear performance gain: language-conditioned agents perform slightly below the vision-only baseline, and the overall success rate remains low due to compounding errors from unreliable low-level skills. This suggests that while language offers valuable context, its benefits are limited without reliable underlying skills. We discuss key challenges and propose directions for mitigating cascading failures.

Video Demo

Intro

  • Teaching embodied agents to perform everyday, complex tasks in 3D environments is a core challenge in embodied AI
  • Reinforcement learning (RL) shows promise but is difficult to scale to long-horizon tasks due to sparse rewards & large action spaces
  • HRL offers a natural solution by decomposing tasks into reusable low-level skills orchestrated by a high-level policy
  • Multimodal learning research shows that natural language can specify high-level goals for agents
  • We've extended the Habitat Lab framework to incorporate natural language instructions into the high-level controller, which selects low-level skills conditioned on language & visual observations
  • We've explored methods such as feature-wise linear modulation (FiLM) and cross-attention mechanisms to fuse language with vision
  • We've built on ReplicaCAD's "rearrange_easy" benchmark, auto-generating scene descriptions to create paired language-visual tasks
  • We've evaluated how language-conditioned HRL agents perform against purely vision-based baselines

Goal

We aim to create a hierarchical agent that can:

Interpret a natural language goal & use it to guide skill selection

Combine language & visual input to improve task performance & generalization

Train & evaluate on realistic 3D rearrangement tasks with varying scenes

Methods

Dataset

  • ReplicaCAD offers a "FRL apartment" environment for Habitat simulation
  • "rearrange_easy" subset has a training split of 50,000 episodes of rearrangement problems over 63 scenes & validation split with 1,000 episodes from another 21 scenes
  • Language augmentation pipeline (steps 3 & 5 are sketched after the list):
    1. Extract target object and goal locations from each episode
    2. Map provided indices to scene receptacle locations
    3. Generate template sentences for each episode
    4. Reword into natural language via Google Gemini 2.0 Flash
    5. Transform into feature vectors with OpenAI's ViT-B/32 CLIP model
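Below is a minimal sketch of steps 3 & 5, assuming a hypothetical episode record whose target object and receptacle names have already been resolved from the episode indices (step 2); the Gemini rewording step (step 4) is omitted. The CLIP call uses OpenAI's released `clip` package with the ViT-B/32 weights.

```python
# Sketch of template generation (step 3) and CLIP text encoding (step 5).
# The episode fields below are hypothetical stand-ins for the values we
# extract from each rearrange_easy episode.
import clip
import torch


def make_template_sentence(episode: dict) -> str:
    # e.g. {"target_object": "bowl", "start_receptacle": "kitchen counter",
    #       "goal_receptacle": "dining table"}
    return (
        f"Move the {episode['target_object']} from the "
        f"{episode['start_receptacle']} to the {episode['goal_receptacle']}."
    )


@torch.no_grad()
def embed_instructions(sentences: list[str], device: str = "cuda") -> torch.Tensor:
    # ViT-B/32 CLIP produces one 512-d text feature vector per sentence.
    model, _ = clip.load("ViT-B/32", device=device)
    tokens = clip.tokenize(sentences).to(device)
    feats = model.encode_text(tokens).float()
    return feats / feats.norm(dim=-1, keepdim=True)  # L2-normalized embeddings
```

The resulting 512-dimensional embeddings are what the FiLM and cross-attention modules described later consume.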

Low-Level Skills Training

  • Habitat Lab provides a framework for training low-level skills that are composed by a high-level policy in an HRL setup
  • These skills correspond to separate, reusable behaviors (like picking or placing objects) & are trained independently using Proximal Policy Optimization (PPO); the PPO objective is sketched after the list below
  • We use the following 3 low-level skills provided by Habitat Lab:
    1. pick
    2. place
    3. navigate
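All three skills share the same PPO objective. The sketch below shows the clipped-surrogate loss only; the `policy` object with an `evaluate_actions` method and the rollout minibatch dictionary are hypothetical stand-ins, and Habitat Lab's own PPO trainer implements this objective with its own data structures.

```python
import torch


def ppo_loss(policy, batch, clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    """Clipped-surrogate PPO loss for one minibatch of rollout data."""
    new_log_probs, values, entropy = policy.evaluate_actions(
        batch["observations"], batch["actions"]
    )
    ratio = torch.exp(new_log_probs - batch["old_log_probs"])
    adv = batch["advantages"]

    # Clipped surrogate objective: take the pessimistic (min) of the
    # unclipped and clipped policy-gradient terms.
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    policy_loss = -torch.min(surr1, surr2).mean()

    # Value-function regression toward empirical returns.
    value_loss = 0.5 * (batch["returns"] - values).pow(2).mean()

    return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()
```

Taking the minimum of the clipped and unclipped terms bounds how far a single update can move the policy away from the one that collected the rollouts.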

HRL Training

  • We train an HRL agent based on Habitat Lab's benchmark to create a high-level policy that picks between pre-trained low-level skills
  • The high-level policy receives visual observations and language instructions as input, and outputs a sequence of skills with appropriate parameters
  • The model is also given the target object and its position, along with the goal position
  • We experiment with three variants: a baseline with no language input, language-conditioned visual feature modulation through FiLM, and cross-attention fusion of visual & language features
  • Baseline: benchmark provided by Habitat Lab; visual observations transformed into features through the provided ResNet-18 encoder, and then fed into a recurrent policy network trained with PPO (agent is not conditioned on language input)
  • FiLM: we insert FiLM layers into every ResNet residual block that take CLIP language embeddings and produce per-block γ (scale) and β (shift) parameters to modulate the visual feature maps before the ReLU, enabling instruction-conditioned extraction of task-relevant cues
  • Cross-Attention: We integrate a cross-attention block into the ResNet encoder that uses language embeddings as the query and ResNet visual features as key/value pairs, enabling selective attention to spatial regions most relevant to the instruction (both fusion mechanisms are sketched below the figure).
[Figure: Vision-language pipeline]
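The two fusion mechanisms can be summarized with the sketch below. This is not our exact encoder code: the layer sizes (512-d CLIP embeddings, ResNet-18 feature-map channels) and the placement of the modules are assumptions that mirror the description above.

```python
import torch
import torch.nn as nn


class FiLMLayer(nn.Module):
    """FiLM conditioning: the language embedding is mapped to per-channel
    gamma (scale) and beta (shift) parameters that modulate the visual
    feature maps before the ReLU of a residual block."""

    def __init__(self, lang_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(lang_dim, 2 * num_channels)

    def forward(self, feat: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) visual feature maps; lang: (B, lang_dim) CLIP embedding
        gamma, beta = self.to_gamma_beta(lang).chunk(2, dim=-1)
        return gamma[:, :, None, None] * feat + beta[:, :, None, None]


class CrossAttentionFusion(nn.Module):
    """Cross-attention fusion: the language embedding acts as the query and
    the flattened spatial visual features act as keys/values, producing a
    fused vector that emphasizes instruction-relevant regions."""

    def __init__(self, lang_dim: int, vis_channels: int,
                 embed_dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.q_proj = nn.Linear(lang_dim, embed_dim)
        self.kv_proj = nn.Linear(vis_channels, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, feat: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) -> (B, H*W, C) token sequence of spatial features
        tokens = feat.flatten(2).transpose(1, 2)
        kv = self.kv_proj(tokens)                 # (B, H*W, embed_dim)
        q = self.q_proj(lang).unsqueeze(1)        # (B, 1, embed_dim) language query
        fused, _ = self.attn(q, kv, kv)           # attend over spatial locations
        return fused.squeeze(1)                   # (B, embed_dim) fused feature
```

In the FiLM variant one such layer sits inside each residual block, whereas the cross-attention variant produces a single fused vector from spatial feature maps; exactly where that block sits in the encoder and how the fused vector reaches the recurrent policy are implementation details not shown here.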

Reward Function Design

  • Habitat Lab uses "pddl_subgoal_reward" as its reward signal, which is a sparse reward based on the completion of intermediary goals
  • During training we noticed the agent struggled to learn from the sparse reward, so we augmented it with dense terms to build a dense reward signal, "CompositeReward" (sketched after the figure below):
[Figure: Dense reward signal]
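The exact terms and weights live in our training configuration; the sketch below only illustrates the shape of such a composite signal. The per-step measures (subgoal counts, distances, collisions) and the coefficients are hypothetical placeholders, not the precise values we used.

```python
def composite_reward(info: dict, prev_info: dict) -> float:
    """Illustrative dense reward: sparse subgoal bonuses plus dense shaping.

    The `info` fields are hypothetical per-step measures; the actual
    CompositeReward is assembled from Habitat Lab reward terms in the config.
    """
    reward = 0.0
    # Sparse component: bonus whenever a PDDL subgoal is newly satisfied.
    reward += 2.5 * (info["subgoals_done"] - prev_info["subgoals_done"])
    # Dense shaping: reward progress toward the current target object / goal.
    reward += 1.0 * (prev_info["dist_to_target"] - info["dist_to_target"])
    # Penalties discourage collisions and wasted time.
    reward -= 0.1 * info["num_collisions"]
    reward -= 0.002  # small per-step slack penalty
    if info["task_success"]:
        reward += 10.0
    return reward
```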

Results

Low-Level Skills

  • Training low-level skill policies was challenging due to data-hungry policies and limited GPU hardware
  • We achieved success rates of ≈60%, ≈45%, and ≈80% on pick, place, & navigate, respectively
  • Below are our success rate and training curves:
[Figure: Low-level skill success rate & training curves]
  • "action_loss" and "value_loss" were unrepresentative of training progress as expected, unlike "reward"
  • Agent excels at learning "navigate" but struggles with "pick" and "place", often plateauing in success rate
  • Most likely since "navigate" is a high-level locomotion task that requires less degrees of freedom to solve, whereas "pick" & "place" suffer from higher-order complexities

High-Level Neural Policy

  • Compounding errors from low-level policies led to difficulties with training high-level policy
  • We achieved a ≈5% success rate on the full rearrangement task, whereas the baseline achieves ≈5.9%
  • Baseline training success rates can be seen on the left below, while those of our cross-attention method are shown on the right:
[Figure: Rearrangement task success rates (baseline left, cross-attention right)]
  • We infer that the agent learns to prefer inaction to avoid negative rewards: during training it homes in on the "navigate" skill and makes positive reward progress, but that progress halts at a ≈5% success rate and then plummets, while rollout durations and object collisions drop drastically

Discussion

  • Adding language input somewhat decreases task performance, but compounding errors from hierarchical skill chaining make the observed results difficult to interpret
  • Performance differences may be due to the addition of language or to incidental correlations & artifacts in the training data
  • The 2022 Habitat Challenge provides a 30% success rate baseline, but it uses a fixed, non-learned high-level policy, which reduces complexity & potential errors
  • Another implementation we studied uses mobile manipulation skills to reduce compounding errors, achieving a 64% success rate
  • These findings suggest the need not only to improve language grounding but also to address the structural instability caused by skill chaining
  • Compounding errors can be mitigated by improving skill reliability, refining high-level decision-making, and incorporating corrective feedback mechanisms

Challenges

  • We faced numerous bugs when working with the Habitat Lab repository due to version-incompatibility issues, which greatly hindered our progress
  • The majority of these bugs involved tensor shape issues with the recurrent neural network state encoders used in the policy
  • As noted above, training also took significantly longer than expected, with some skills taking multiple days to train
  • Environment setup issues also prevented us from running on multiple devices

Conclusion

  • We integrated natural language into hierarchical RL agents for indoor environment rearrangement, evaluating FiLM and cross-attention as mechanisms to fuse vision and instructions
  • Incorporating language into high-level policy decisions led to modest performance decreases, and the overall success rate remained low for all variants due to compounding errors in multi-step tasks
  • Our findings underscore the need for robust low-level skill execution to support language-conditioned high-level policies
  • Language-conditioned policies by themselves are insufficient; architectural and algorithmic strategies are required to mitigate error accumulation
  • Future work should investigate tighter coupling between high- and low-level components and develop more effective training methodologies
  • Enhancing the richness and realism of language annotations will help determine whether agents genuinely understand instructions rather than exploit dataset-specific biases.

Full Paper