In this work, we focus on the task of procedure planning from instructio...
As a de facto solution, vanilla Vision Transformers (ViTs) are encou...
Deep neural networks only learn to map in-distribution inputs to their
When there is a discrepancy between in-distribution (ID) samples and
An explicit discriminator trained on observable in-distribution (ID) sam...