
Vision-and-Language Navigation (VLN) agents that rely on local actions and sequential memory explore inefficiently when following high-level instructions. The Dual-scale Graph Transformer (DUET) addresses these limitations with a topological mapping module and a global action planning module that together enable more effective environment exploration. By explicitly building a map, the agent can take global actions, computing the shortest path to any newly selected location instead of backtracking step by step. The system employs a dual-scale encoder that balances coarse-grained graph reasoning for global navigation with fine-grained representations for precise local actions and stopping decisions. Evaluated on datasets such as REVERIE and SOON, DUET achieved absolute success rate gains of over 20% and secured first place in the ICCV 2021 VLN challenge, indicating that combining global mapping with dynamic fusion of the two scales significantly outperforms traditional approaches based on recurrent state.
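
To make the mapping idea concrete, the following is a minimal sketch of a topological map that grows as the agent explores and supports shortest-path planning to any known node. It is an illustration under stated assumptions, not DUET's implementation: the `TopoMap` class, its method names, the node labels, and the edge distances are all hypothetical, and the choice of target node is hard-coded here, whereas DUET scores candidate nodes with its learned encoder. Shortest paths are computed with plain Dijkstra, which is one reasonable choice for a small weighted graph.

```python
import heapq


class TopoMap:
    """Hypothetical topological map: visited nodes plus observed-but-unvisited ones."""

    def __init__(self):
        self.edges = {}        # node -> {neighbor: distance}
        self.visited = set()   # nodes the agent has actually stood on

    def observe(self, node, neighbors):
        """Mark `node` as visited and add undirected edges to its navigable neighbors."""
        self.visited.add(node)
        self.edges.setdefault(node, {})
        for nb, dist in neighbors.items():
            self.edges[node][nb] = dist
            self.edges.setdefault(nb, {})[node] = dist

    def frontier(self):
        """Nodes seen from a visited node but not yet visited (candidate global actions)."""
        return sorted(n for n in self.edges if n not in self.visited)

    def shortest_path(self, start, goal):
        """Dijkstra over the current map; returns a node list or None if unreachable."""
        dist = {start: 0.0}
        prev = {}
        heap = [(0.0, start)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == goal:
                break
            if d > dist.get(u, float("inf")):
                continue  # stale heap entry
            for v, w in self.edges.get(u, {}).items():
                nd = d + w
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    prev[v] = u
                    heapq.heappush(heap, (nd, v))
        if goal not in dist:
            return None
        path = [goal]
        while path[-1] != start:
            path.append(prev[path[-1]])
        return path[::-1]


# Usage: the agent starts at A, moves to B, then decides to go to the frontier node C.
topo = TopoMap()
topo.observe("A", {"B": 1.0, "C": 2.5})
topo.observe("B", {"D": 1.0})
candidates = topo.frontier()           # unvisited nodes reachable on the map
route = topo.shortest_path("B", "C")   # one global action, executed as a planned route
print(candidates, route)
```

In this sketch a single global action (go to `C`) is expanded into a route over already-mapped nodes, so the agent does not need to issue a sequence of separately predicted local steps to return through `A`.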