[1] Iro Armeni, Zhi-Yang He, JunYoung Gwak, Amir R. Zamir, Martin Fischer, Jitendra Malik, and Silvio Savarese. 3D scene graph: A structure for unified semantics, 3D space, and camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019. 2
[2] Junwei Bao, Nan Duan, Ming Zhou, and Tiejun Zhao. Knowledge-based question answering as machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2014. 3
[3] Dhruv Batra, Aaron Gokaslan, Aniruddha Kembhavi, Oleksandr Maksymets, Roozbeh Mottaghi, Manolis Savva, Alexander Toshev, and Erik Wijmans. ObjectNav revisited: On evaluation of embodied agents navigating to objects. arXiv preprint arXiv:2006.13171, 2020. 1
[4] Shaked Brody, Uri Alon, and Eran Yahav. How attentive are graph attention networks? arXiv preprint arXiv:2105.14491, 2021. 7, 8
[5] Devendra Singh Chaplot, Dhiraj Gandhi, Abhinav Gupta, and Ruslan Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. In Proceedings of Neural Information Processing Systems (NeurIPS), 2020. 3
[6] Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020. 8
[7] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 2, 5
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019. 3
[9] Helisa Dhamo, Fabian Manhardt, Nassir Navab, and Federico Tombari. Graph-to-3D: End-to-end generation and manipulation of 3D scenes using scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 2
[10] Wafa Elmannai and Khaled Elleithy. Sensor-based assistive devices for visually-impaired people: current status, challenges, and future directions. Sensors, 17(3):565, 2017. 1
[11] Keyur Faldu, Amit Sheth, Prashant Kikani, and Hemang Akabari. KI-BERT: Infusing knowledge context for better language and domain understanding. arXiv preprint arXiv:2104.08145, 2021. 3
[12] Paul Gay, James Stuart, and Alessio Del Bue. Visual graphs from motion (VGfM): Scene understanding with object geometry reasoning. In Proceedings of the Asian Conference on Computer Vision (ACCV), 2018. 1, 2
[13] Francesco Giuliari, Alberto Castellini, Riccardo Berra, Alessio Del Bue, Alessandro Farinelli, Marco Cristani, Francesco Setti, and Yiming Wang. POMP++: POMCP-based active visual search in unknown indoor environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021. 3
[14] Jiuxiang Gu, Shafiq Joty, Jianfei Cai, Handong Zhao, Xu Yang, and Gang Wang. Unpaired image captioning via scene graph alignments. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019. 2
[15] Jiuxiang Gu, Handong Zhao, Zhe L. Lin, Sheng Li, Jianfei Cai, and Mingyang Ling. Scene graph generation with external knowledge and image reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 3
[16] Kamal Gupta, Justin Lazarow, Alessandro Achille, Larry S. Davis, Vijay Mahadevan, and Abhinav Shrivastava. LayoutTransformer: Layout generation and completion with self-attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 6
[17] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2019. 7
[18] Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 2
[19] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2016. 1
[20] Soohyeong Lee, Ju-Whan Kim, Youngmin Oh, and Joo Hyuk Jeon. Visual question answering over scene graph. In Proceedings of the First International Conference on Graph Computing (GC), 2019. 2
[21] Guohao Li, Hang Su, and Wenwu Zhu. Incorporating external knowledge to answer open-domain visual questions with dynamic memory networks. arXiv preprint arXiv:1712.00733, 2017. 3
[22] Manyi Li, Akshay Gadi Patil, Kai Xu, Siddhartha Chaudhuri, Owais Khan, Ariel Shamir, Changhe Tu, Baoquan Chen, Daniel Cohen-Or, and Hao Zhang. GRAINS: Generative recursive autoencoders for indoor scenes. ACM Transactions on Graphics (TOG), 38(2):1–16, 2019. 3
[23] Andrew Luo, Zhoutong Zhang, Jiajun Wu, and Joshua B. Tenenbaum. End-to-end optimization of scene layout. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 1, 2
[24] J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.