VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation

Naoki Yokoyama

Sehoon Ha

Dhruv Batra

Jiuguang Wang

Bernadette Bucher


Best Paper in Cognitive Robotics at the International Conference on Robotics and Automation (ICRA), 2024
Workshop on Language and Robot Learning @ CoRL, 2023

Main Video

Real-World Demonstrations

(all videos 5x speed)

Target object: "microwave"

Text prompt: "Seems like there's a microwave ahead."

Target object: "potted plant"

Text prompt: "Potted plant in the sun by the stairs."

Target object: "toilet"

Text prompt: "Seems like there is a toilet ahead."

Simulation success examples

Target object: "toilet"

Text prompt: "Seems like there's a toilet ahead."

Target object: "bed"

Text prompt: "Seems like there's a bed ahead."

Target object: "chair"

Text prompt: "Seems like there's a chair ahead."
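
The simulation text prompts listed on this page all follow the same pattern, with only the target object name changing. A minimal sketch of how such prompts could be generated is shown below; the names `PROMPT_TEMPLATE` and `make_prompt` are hypothetical and chosen for illustration, not taken from the VLFM code.

```python
# Hypothetical helper (not from the VLFM codebase): builds a text prompt
# matching the pattern of the simulation prompts shown on this page.
PROMPT_TEMPLATE = "Seems like there's a {target} ahead."


def make_prompt(target: str) -> str:
    """Return the text prompt for a given target object name."""
    return PROMPT_TEMPLATE.format(target=target)


if __name__ == "__main__":
    # Target objects taken from the simulation examples on this page.
    for target in ["toilet", "bed", "chair", "tv"]:
        print(make_prompt(target))
```

Note that the real-world demonstrations do not always use this pattern (for example, "Potted plant in the sun by the stairs."), so a fixed template is only one possible choice.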

Simulation failure examples

VLFM does not yet filter its detections using other visual cues from the environment, so it remains sensitive to false positives produced by the detector.

Target object: "tv"

Text prompt: "Seems like there's a tv ahead."

Target object: "bed"

Text prompt: "Seems like there's a bed ahead."

We have found that VLFM can also struggle in environments with few visual semantic cues related to the target object, such as homogeneous office environments, where a toilet may be difficult to locate from far away.

BibTeX:

@inproceedings{yokoyama2024vlfm,
    title={VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation},
    author={Naoki Yokoyama and Sehoon Ha and Dhruv Batra and Jiuguang Wang and Bernadette Bucher},
    booktitle={International Conference on Robotics and Automation (ICRA)},
    year={2024},
}