From Words to Actions: Language-Guided Hierarchical RL for Object Rearrangement

Joe Lin, Allen Luo, Nishant Ray
UCLA
COM SCI 188 - Final Project



Abstract

We explore the use of natural language to guide hierarchical reinforcement learning (HRL) agents in long-horizon indoor rearrangement tasks. Using the Habitat Lab framework and the ReplicaCAD rearrange_easy benchmark, we train a high-level policy to select among pre-trained low-level skills based on both visual observations and language instructions. To fuse multimodal inputs, we evaluate two techniques: FiLM and cross-attention. Language instructions are generated using a large language model and embedded via a pretrained CLIP encoder. Our results show that incorporating language does not yield a clear performance gain: language-conditioned agents perform slightly below the vision-only baseline, and the overall success rate remains low due to compounding errors from unreliable low-level skills. This suggests that while language offers valuable context, its benefits are limited without reliable underlying skills. We discuss key challenges and propose directions for mitigating cascading failures.

Video Demo

Intro

  • Teaching embodied agents to perform everyday, complex tasks in 3D environments is a core challenge in embodied AI
  • Reinforcement learning (RL) shows promise but is difficult to scale to long-horizon tasks due to sparse rewards & large action spaces
  • HRL offers a natural solution by decomposing tasks into reusable low-level skills orchestrated by a high-level policy
  • Multimodal learning research shows that natural language can specify high-level goals for agents
  • We've extended the Habitat Lab framework to incorporate natural language instructions into the high-level controller, which selects low-level skills conditioned on language & visual observations
  • We've explored methods such as feature-wise linear modulation (FiLM) and cross-attention mechanisms to fuse language with vision
  • We've built on ReplicaCAD's "rearrange_easy" benchmark, auto-generating scene descriptions to create paired language-visual tasks
  • We've evaluated how language-conditioned HRL agents perform against purely vision-based baselines

Goal

We aim to create a hierarchical agent that can:

Interpret a natural language goal & use it to guide skill selection

Combine language & visual input to improve task performance & generalization

Train & evaluate on realistic 3D rearrangement tasks with varying scenes

Methods

Dataset

  • ReplicaCAD offers a "FRL apartment" environment for Habitat simulation
  • "rearrange_easy" subset has a training split of 50,000 episodes of rearrangement problems over 63 scenes & validation split with 1,000 episodes from another 21 scenes
  • Language augmentation pipeline (steps 3 & 5 are sketched after the list):
    1. Extract target object and goal locations from each episode
    2. Map provided indices to scene receptacle locations
    3. Generate template sentences for each episode
    4. Reword into natural language via Google Gemini 2.0 Flash
    5. Transform into feature vectors with OpenAI's ViT-B/32 CLIP model
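Below is a minimal sketch of steps 3 & 5, assuming a hypothetical episode record whose target object and receptacle names have already been resolved from the episode indices (step 2); the Gemini rewording step (step 4) is omitted. The CLIP call uses OpenAI's released `clip` package with the ViT-B/32 weights.

```python
# Sketch of template generation (step 3) and CLIP text encoding (step 5).
# The episode fields below are hypothetical stand-ins for the values we
# extract from each rearrange_easy episode.
import clip
import torch


def make_template_sentence(episode: dict) -> str:
    # e.g. {"target_object": "bowl", "start_receptacle": "kitchen counter",
    #       "goal_receptacle": "dining table"}
    return (
        f"Move the {episode['target_object']} from the "
        f"{episode['start_receptacle']} to the {episode['goal_receptacle']}."
    )


@torch.no_grad()
def embed_instructions(sentences: list[str], device: str = "cuda") -> torch.Tensor:
    # ViT-B/32 CLIP produces one 512-d text feature vector per sentence.
    model, _ = clip.load("ViT-B/32", device=device)
    tokens = clip.tokenize(sentences).to(device)
    feats = model.encode_text(tokens).float()
    return feats / feats.norm(dim=-1, keepdim=True)  # L2-normalized embeddings
```

The resulting 512-dimensional embeddings are what the FiLM and cross-attention modules described later consume.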

Low-Level Skills Training

  • Habitat Lab provides a framework for training low-level skills that are composed by a high-level policy in an HRL setup
  • These skills correspond to separate, reusable behaviors (like picking or placing objects) & are trained independently using Proximal Policy Optimization (PPO); the PPO objective is sketched after the list below
  • We use the following 3 low-level skills provided by Habitat Lab:
    1. pick
    2. place
    3. navigate
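All three skills share the same PPO objective. The sketch below shows the clipped-surrogate loss only; the `policy` object with an `evaluate_actions` method and the rollout minibatch dictionary are hypothetical stand-ins, and Habitat Lab's own PPO trainer implements this objective with its own data structures.

```python
import torch


def ppo_loss(policy, batch, clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    """Clipped-surrogate PPO loss for one minibatch of rollout data."""
    new_log_probs, values, entropy = policy.evaluate_actions(
        batch["observations"], batch["actions"]
    )
    ratio = torch.exp(new_log_probs - batch["old_log_probs"])
    adv = batch["advantages"]

    # Clipped surrogate objective: take the pessimistic (min) of the
    # unclipped and clipped policy-gradient terms.
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    policy_loss = -torch.min(surr1, surr2).mean()

    # Value-function regression toward empirical returns.
    value_loss = 0.5 * (batch["returns"] - values).pow(2).mean()

    return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()
```

Taking the minimum of the clipped and unclipped terms bounds how far a single update can move the policy away from the one that collected the rollouts.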

HRL Training

  • We train an HRL agent based on Habitat Lab's benchmark to create a high-level policy that picks between pre-trained low-level skills
  • The high-level policy receives visual observations and language instructions as input, and outputs a sequence of skills with appropriate parameters
  • The model is also given the target object and its position, along with the goal position
  • We experiment with three variants: a baseline with no language input, language-conditioned visual feature modulation through FiLM, and cross-attention fusion of visual & language features
  • Baseline: benchmark provided by Habitat Lab; visual observations transformed into features through the provided ResNet-18 encoder, and then fed into a recurrent policy network trained with PPO (agent is not conditioned on language input)
  • FiLM: we insert FiLM layers into every ResNet residual block that take CLIP language embeddings and produce per-block γ (scale) and β (shift) parameters to modulate the visual feature maps before the ReLU, enabling instruction-conditioned extraction of task-relevant cues
  • Cross-Attention: We integrate a cross-attention block into the ResNet encoder that uses language embeddings as the query and ResNet visual features as key/value pairs, enabling selective attention to spatial regions most relevant to the instruction (both fusion mechanisms are sketched below the figure).
[Figure: Vision-language pipeline]
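The two fusion mechanisms can be summarized with the sketch below. This is not our exact encoder code: the layer sizes (512-d CLIP embeddings, ResNet-18 feature-map channels) and the placement of the modules are assumptions that mirror the description above.

```python
import torch
import torch.nn as nn


class FiLMLayer(nn.Module):
    """FiLM conditioning: the language embedding is mapped to per-channel
    gamma (scale) and beta (shift) parameters that modulate the visual
    feature maps before the ReLU of a residual block."""

    def __init__(self, lang_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(lang_dim, 2 * num_channels)

    def forward(self, feat: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) visual feature maps; lang: (B, lang_dim) CLIP embedding
        gamma, beta = self.to_gamma_beta(lang).chunk(2, dim=-1)
        return gamma[:, :, None, None] * feat + beta[:, :, None, None]


class CrossAttentionFusion(nn.Module):
    """Cross-attention fusion: the language embedding acts as the query and
    the flattened spatial visual features act as keys/values, producing a
    fused vector that emphasizes instruction-relevant regions."""

    def __init__(self, lang_dim: int, vis_channels: int,
                 embed_dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.q_proj = nn.Linear(lang_dim, embed_dim)
        self.kv_proj = nn.Linear(vis_channels, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, feat: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) -> (B, H*W, C) token sequence of spatial features
        tokens = feat.flatten(2).transpose(1, 2)
        kv = self.kv_proj(tokens)                 # (B, H*W, embed_dim)
        q = self.q_proj(lang).unsqueeze(1)        # (B, 1, embed_dim) language query
        fused, _ = self.attn(q, kv, kv)           # attend over spatial locations
        return fused.squeeze(1)                   # (B, embed_dim) fused feature
```

In the FiLM variant one such layer sits inside each residual block, whereas the cross-attention variant produces a single fused vector from spatial feature maps; exactly where that block sits in the encoder and how the fused vector reaches the recurrent policy are implementation details not shown here.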

Reward Function Design

  • Habitat Lab uses "pddl_subgoal_reward" as its reward signal, which is a sparse reward based on the completion of intermediary goals
  • During training we noticed the agent struggled to learn from the sparse reward, so we augmented it with dense terms to build a dense reward signal, "CompositeReward" (sketched after the figure below):
[Figure: Dense reward signal]
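The exact terms and weights live in our training configuration; the sketch below only illustrates the shape of such a composite signal. The per-step measures (subgoal counts, distances, collisions) and the coefficients are hypothetical placeholders, not the precise values we used.

```python
def composite_reward(info: dict, prev_info: dict) -> float:
    """Illustrative dense reward: sparse subgoal bonuses plus dense shaping.

    The `info` fields are hypothetical per-step measures; the actual
    CompositeReward is assembled from Habitat Lab reward terms in the config.
    """
    reward = 0.0
    # Sparse component: bonus whenever a PDDL subgoal is newly satisfied.
    reward += 2.5 * (info["subgoals_done"] - prev_info["subgoals_done"])
    # Dense shaping: reward progress toward the current target object / goal.
    reward += 1.0 * (prev_info["dist_to_target"] - info["dist_to_target"])
    # Penalties discourage collisions and wasted time.
    reward -= 0.1 * info["num_collisions"]
    reward -= 0.002  # small per-step slack penalty
    if info["task_success"]:
        reward += 10.0
    return reward
```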

Results

Low-Level Skills

  • Training low-level skill policies was challenging due to data-hungry policies and limited GPU hardware
  • We achieved success rates of ≈60%, ≈45%, and ≈80% on pick, place, & navigate, respectively
  • Below are our success rate and training curves:
[Figure: Low-level skill success rate & training curves]
  • "action_loss" and "value_loss" were unrepresentative of training progress as expected, unlike "reward"
  • Agent excels at learning "navigate" but struggles with "pick" and "place", often plateauing in success rate
  • Most likely since "navigate" is a high-level locomotion task that requires less degrees of freedom to solve, whereas "pick" & "place" suffer from higher-order complexities

High-Level Neural Policy

  • Compounding errors from low-level policies led to difficulties with training high-level policy
  • We achieved a ≈5% success rate on the full rearrangement task, whereas the baseline achieves ≈5.9%
  • Baseline training success rates can be seen on the left below, while those of our cross-attention method are shown on the right:
[Figure: Rearrangement task success rates (baseline left, cross-attention right)]
  • We infer that the agent learns to prefer inaction to avoid negative rewards: during training it homes in on the "navigate" skill and makes positive reward progress, but that progress halts at a ≈5% success rate and then plummets, while rollout durations and object collisions drop drastically

Discussion

  • Adding language input somewhat decreases task performance, but compounding errors from hierarchical skill chaining make the observed results difficult to interpret
  • Performance differences may be due to the addition of language or to incidental correlations & artifacts in the training data
  • The 2022 Habitat Challenge provides a 30% success rate baseline, but it uses a fixed, non-learned high-level policy, which reduces complexity & potential errors
  • Another implementation we studied uses mobile manipulation skills to reduce compounding errors, achieving a 64% success rate
  • These findings suggest the need not only to improve language grounding but also to address the structural instability caused by skill chaining
  • Compounding errors can be mitigated by improving skill reliability, refining high-level decision-making, and incorporating corrective feedback mechanisms

Challenges

  • We faced numerous bugs when working with the Habitat Lab repository due to version-incompatibility issues, which greatly hindered our progress
  • The majority of these bugs involved tensor shape issues with the recurrent neural network state encoders used in the policy
  • As noted above, training also took significantly longer than expected, with some skills taking multiple days to train
  • Environment setup issues also prevented us from running on multiple devices

Conclusion

  • We integrated natural language into hierarchical RL agents for indoor environment rearrangement, evaluating FiLM and cross-attention as mechanisms to fuse vision and instructions
  • Incorporating language into high-level policy decisions led to modest performance decreases, and the overall success rate remained low for all variants due to compounding errors in multi-step tasks
  • Our findings underscore the need for robust low-level skill execution to support language-conditioned high-level policies
  • Language-conditioned policies by themselves are insufficient; architectural and algorithmic strategies are required to mitigate error accumulation
  • Future work should investigate tighter coupling between high- and low-level components and develop more effective training methodologies
  • Enhancing the richness and realism of language annotations will help determine whether agents genuinely understand instructions rather than exploit dataset-specific biases.

Full Paper