We explore the use of natural language to guide hierarchical reinforcement learning (HRL) agents in long-horizon indoor rearrangement tasks. Using the Habitat Lab framework and the ReplicaCAD rearrange_easy benchmark, we train a high-level policy to select among pre-trained low-level skills based on both visual observations and language instructions. To fuse multimodal inputs, we evaluate two techniques: FiLM and cross-attention. Language instructions are generated using a large language model and embedded via a pretrained CLIP encoder. Our results show that incorporating language slightly improves task performance; however, the overall success rate remains low due to compounding errors from unreliable low-level skills. This suggests that while language offers valuable context, its benefits are limited without reliable underlying skills. We discuss key challenges and propose directions for mitigating cascading failures.
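Of the two fusion techniques compared, FiLM conditions visual features on language by predicting a per-channel scale and shift from the language embedding. The following sketch illustrates the mechanism in NumPy; all shapes, names, and projection matrices are illustrative, not the paper's actual implementation:

```python
import numpy as np

def film(visual_feats, lang_emb, W_gamma, W_beta):
    """Feature-wise Linear Modulation (FiLM) of a visual feature map.

    visual_feats: (C, H, W) feature map from the visual encoder
    lang_emb:     (D,) language embedding (e.g. from a CLIP text encoder)
    W_gamma, W_beta: (C, D) projections predicting the per-channel
                     scale (gamma) and shift (beta) from language
    """
    gamma = W_gamma @ lang_emb   # (C,) per-channel scale
    beta = W_beta @ lang_emb     # (C,) per-channel shift
    # Broadcast over the spatial dimensions: every channel of the
    # visual map is scaled and shifted by its language-derived pair.
    return gamma[:, None, None] * visual_feats + beta[:, None, None]

rng = np.random.default_rng(0)
C, H, W, D = 8, 4, 4, 16
feats = rng.standard_normal((C, H, W))
emb = rng.standard_normal(D)
out = film(feats, emb,
           rng.standard_normal((C, D)),
           rng.standard_normal((C, D)))
```

In practice the projections would be learned layers inside the high-level policy; cross-attention instead lets visual tokens attend to language tokens rather than applying a global per-channel modulation.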
We aim to create a hierarchical agent that can: