Deep Gated Multi-modal Learning: In-hand Object Pose Estimation with Tactile and Image
In robot manipulation tasks, especially in-hand manipulation, estimating the position and orientation of an object is an essential skill for manipulating objects freely. However, since in-hand manipulation tends to cause occlusion by the hand itself, image information alone is not sufficient. One approach to this challenge is to incorporate tactile sensors. The advantage of using multiple sensors (modalities) is that the other modalities can compensate for occlusion, noise, and sensor malfunctions. Although deciding the reliability of each modality according to the situation is important, manually designing such a model makes it difficult to cope with diverse situations. Therefore, in this study, we propose deep gated multi-modal learning, an end-to-end deep learning method in which the network itself determines the reliability of each modality. In our experiments, an RGB camera and a GelSight tactile sensor were attached to the gripper of a Sawyer robot, and object poses were estimated during grasping. A total of 15 objects were used. In the proposed model, the reliability of each modality was adjusted according to its noise and failures, and we confirmed that the pose was estimated even for unknown objects.
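To make the gating idea concrete, the sketch below shows one common way such a fusion layer can be built, in the style of a gated multimodal unit: a learned gate weighs the image and tactile feature vectors before a pose-regression head. This is a minimal illustrative example, not the paper's implementation; the feature extractors, layer sizes, and pose parameterization (here 3-D position plus quaternion) are assumptions.

```python
import torch
import torch.nn as nn

class GatedMultimodalFusion(nn.Module):
    """Illustrative gated fusion of image and tactile features.

    The gate outputs a per-sample reliability weight z in [0, 1];
    the fused representation is a convex combination of the two
    projected modality features. All dimensions are hypothetical.
    """

    def __init__(self, img_dim=256, tac_dim=256, hidden=128, pose_dim=7):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.tac_proj = nn.Linear(tac_dim, hidden)
        # Gate sees both modalities, so it can down-weight a noisy
        # or occluded one in favor of the other.
        self.gate = nn.Sequential(
            nn.Linear(img_dim + tac_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),
        )
        # Pose head: e.g., 3-D position + quaternion orientation.
        self.pose_head = nn.Linear(hidden, pose_dim)

    def forward(self, img_feat, tac_feat):
        z = self.gate(torch.cat([img_feat, tac_feat], dim=-1))
        h = z * torch.tanh(self.img_proj(img_feat)) \
            + (1 - z) * torch.tanh(self.tac_proj(tac_feat))
        # Return the gate value too, so modality reliability can be inspected.
        return self.pose_head(h), z

# Usage with random batched features standing in for CNN outputs:
model = GatedMultimodalFusion()
pose, gate = model(torch.randn(4, 256), torch.randn(4, 256))
print(pose.shape, gate.shape)  # torch.Size([4, 7]) torch.Size([4, 1])
```

Because the gate is trained end-to-end with the pose loss, the network learns on its own when each modality is trustworthy, rather than relying on hand-designed reliability rules.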