IROS 2019 International Conference Proceedings, 1339
Generating an image of an object's appearance from somatosensory information during haptic exploration

Kento Sekiya¹, Yoshiyuki Ohmura and Yasuo Kuniyoshi²

¹ Kento Sekiya is with the Faculty of Engineering, the University of Tokyo, Japan. sekiya@isi.imi.i.u-tokyo.ac.jp
² Yoshiyuki Ohmura and Yasuo Kuniyoshi are with the Graduate School of Information Science and Technology, the University of Tokyo, Japan. {ohmura,kuniyoshi}@isi.imi.i.u-tokyo.ac.jp

Abstract

Visual occlusions caused by the environment or by the robot itself can be a problem for object recognition during manipulation by a robot hand. Under such conditions, tactile and somatosensory information are useful for object recognition during manipulation. Humans can visualize the appearance of invisible objects from only the somatosensory information provided by their hands. In this paper, we propose a method to generate an image of an invisible object's posture from the joint angles and touch information provided by robot fingers while touching the object. We show that the object's posture can be estimated from the time series of the joint angles of the robot hand via regression analysis. In addition, conditional generative adversarial networks can generate an image that shows the appearance of the invisible objects from their estimated postures. Our approach enables user-friendly visualization of somatosensory information in remote control applications.

I. INTRODUCTION

Object and environmental recognition are crucial processes for object handling in the real world. Progress in computer vision has enabled robots to detect and recognize objects. Computer vision also helps with shape recognition and pose estimation for robotic manipulation. However, computer vision is often useless during object manipulation because the robot hand or the surrounding environment hides part or all of the object. In such situations, it becomes difficult to visually track changes in the position and pose of an object that the robot has touched.

Humans can recognize and manipulate objects in situations where visual information is unavailable, e.g., in the dark, or when the object is in a pocket. Klatzky et al. showed that humans can recognize the type of an object with only a few touches [1]. Furthermore, humans seem to be able to visualize an individual object's information during haptic exploration [2]. While somatosensory information mainly consists of self-motion and posture-related information, humans frequently pay attention to the object's posture and pose rather than to the pose of their own hand. Because the object's posture and pose are more important than self-motion during manipulation, this attention bias is reasonable. However, the method used to extract the object's information from the somatosensory information is poorly understood. We believe that this ability is crucial for effective object manipulation.

Fig. 1: System used to match an image to somatosensory information via an object's posture.

In this paper, we show that the postures of several known objects can be estimated from time-series somatosensory information, and we provide a model that generates an image of the appearance of sample objects during haptic exploration. We propose a method that combines regression networks with conditional generative adversarial networks (cGANs) [3]. The regression networks estimate an object's pose from somatosensory information, and we evaluate how much of the time-series hand data contains the object pose information. The cGAN is a generative model that generates an image of an object corresponding to that object's pose. We also evaluate whether the generated image shows the object's pose correctly. Our proposed approach can be used to complement the visual information of objects when they are covered by the surrounding environment.
The robot can present the somatosensory information as an image that a human can understand easily, and our approach enables user-friendly visualization of somatosensory information in remote control applications.

2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, November 4-8, 2019. 978-1-7281-4003-2/19/$31.00 ©2019 IEEE

II. RELATED WORK

A. Object recognition

In the computer vision field, high-level object recognition has been achieved. Through the use of deep neural networks, techniques for feature extraction from images have improved, and faster processing has enabled real-time object recognition [4]-[8].

In the neuroscience field, the ability of humans to recognize objects by touch has often been discussed. Hernández-Pérez et al. showed that tactile object recognition generates patterns of activity in a multisensory area that is known to encode objects [2]. Monaco et al. also showed that the area of the brain related to visual recognition is activated during haptic exploration of shapes [9]. Furthermore, the relationship between visual perception and object manipulation has also been discussed in recent years [10]. Therefore, it is believed that humans can imagine visual information from the tactile information acquired during haptic exploration.

Fig. 2: Model composed of regression nets and a cGAN. The conditional labels of the cGAN are constructed from the object's pose estimated by regression. "mlp" means multi-layer perceptron and the numbers are layer sizes. (Diagram labels: hand data, n × 145; regression nets, mlp(145, 50, 10, 2), with (cos, sin) outputs; generator, mlp(100, 256, 512, 1024, 16384), producing 128 × 128 images; discriminator, mlp(16384, 512, 512, 512, 1), deciding real or fake.)

B. Image generation

Generative modeling has been studied in both the computer vision and natural language processing fields. Recently, deep neural networks have made a major contribution to image generation using generative modeling. Examples of the deep generative models that have been developed include the variational autoencoder (VAE) [11] and generative adversarial networks (GANs) [12]. GANs include two networks, known as the generator and the discriminator. The generator can produce high-resolution images that cannot be distinguished from real images, but GANs suffer from training instability. To solve this problem, various studies have proposed improved GAN models [3], [13], [14]. In this paper, we focus on cGANs [3], which can control the generated images using a conditional vector. In cGANs, conditional vectors are merged into the inputs of both the generator and the discriminator, so the generator can learn weights that represent images corresponding to the conditional vectors.

III. METHODS

A. Overview

To generate an image of an object's appearance from somatosensory information during haptic exploration with supervised learning, it is necessary to collect pairs consisting of an image of the object's appearance and the somatosensory information recorded during haptic exploration. However, in the real world, the robotic hand generally covers the object during haptic exploration, so it is difficult to collect the object image and the somatosensory information simultaneously. We propose a system that matches images to the somatosensory information via the object's posture, which is measured using a rotation sensor. We collect the somatosensory information and the object posture data simultaneously, and separately collect the object posture and image data simultaneously. Finally, we match the images to the somatosensory information, as shown in Fig. 1.

Fig. 2 shows the model used to generate an image of an object's appearance from somatosensory information during haptic exploration.
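The matching step in Fig. 1 can be pictured as a join on the measured posture. The following is a minimal sketch only. The paper does not specify the matching rule, so the nearest-posture pairing, the function names `cyclic_distance` and `match_by_posture`, and the toy records are all assumptions made for illustration.

```python
def cyclic_distance(a_deg, b_deg, period_deg=360.0):
    """Smallest absolute difference between two angles on a circle."""
    diff = abs(a_deg - b_deg) % period_deg
    return min(diff, period_deg - diff)


def match_by_posture(hand_records, image_records, period_deg=360.0):
    """Pair each hand sample with the image whose posture is nearest.

    hand_records:  list of (posture_deg, hand_vector) from the touch session
    image_records: list of (posture_deg, image_id) from the camera session
    Nearest-posture pairing is an assumption, not the paper's stated rule.
    """
    pairs = []
    for hand_posture, hand_vector in hand_records:
        _, image_id = min(
            image_records,
            key=lambda rec: cyclic_distance(rec[0], hand_posture, period_deg),
        )
        pairs.append((hand_vector, image_id))
    return pairs


# Hypothetical toy data: postures in degrees from the rotary encoder.
hands = [(2.0, "hand_a"), (178.0, "hand_b"), (359.0, "hand_c")]
images = [(0.0, "img_000"), (90.0, "img_090"), (180.0, "img_180")]
matched = match_by_posture(hands, images)
```

In this toy case the sample recorded at 359° is paired with the image taken at 0°, because the distance is measured around the circle rather than along the number line.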
To determine whether an object's information can be extracted from somatosensory information alone during this exploration, we used regression nets that estimate an object's pose from the somatosensory information. The cGAN trains a generator that generates images from noise and from conditional vectors constructed from the estimated pose of the object.

B. Regression nets

We used regression nets to extract the object's pose from the somatosensory information. The regression nets were trained using a set of object postures and the somatosensory information, and estimated the object's pose. Posture data are cyclic: the object returns to the same posture after rotating through 360°. Therefore, when regression nets are trained, raw posture data cannot be used to calculate the minimum square error, since postures near 0° and near 360° are almost identical but numerically far apart. We thus used the cosine and the sine of the posture as the outputs of the regression nets.

Fig. 3: Experimental setup. The robotic hand mounted on the fixed robot arm touches the object at random. The stereo camera captures images of the object.

Fig. 4: Degrees of freedom of the robotic hand. The fingers have 22 degrees of freedom and the wrist has two degrees of freedom. The robotic hand has five fingers that are equipped with touch sensors on the fingertips. (Diagram labels, joints + touch sensors: THUMB 5+1, FF 4+1, MF 4+1, RF 4+1, LF 5+1, WRIST 2+0.)

C. Conditional generative adversarial networks

A cGAN is composed of a generator network and a discriminator network. In our experiment, a conditional label is a number in the range 0 to 9 that assigns the object's posture within one rotation to one of 10 discrete classes. If there are too many classes, we believe that the small quantity of training data per class makes the cGAN's learning unstable. If there are too few classes, we believe that a single class contains images of widely varying object poses, so a conditional label cannot be used to control the image of the object's pose correctly.

Fig. 5: Three objects used in the experiments. The left object is a regular square prism, the middle object is an elliptical cylinder, and the right object is a regular triangular prism.

L_G is the loss function of the generator and L_D is the loss function of the discriminator, described by (1) and (2). x represents real images, y is a conditional label constructed from the estimated object poses, and z is random noise. The generator minimizes log(D(x|y)), which means that the discriminator discriminates the real images from the generated images correctly, and maximizes log(D(G(z|y))), which means that the discriminator recognizes the generated images as real images. In contrast, the discriminator minimizes log(D(G(z|y))).

L_G = \mathbb{E}_{x \sim p_{data}}[\log D(x|y)] + \mathbb{E}_{z \sim p_z}[1 - \log D(G(z|y))]    (1)

L_D = \mathbb{E}_{z \sim p_z}[\log D(G(z|y))]    (2)

D. Evaluation of the generated images

To evaluate whether the generated images express the object's appearance correctly, we compare the pixels of the generated images with those of the real images. The cGAN generates an image that corresponds to a conditional label, and we calculate the pixel loss between this image and the real images for each of the 10 classes. If the class with the smallest loss corresponds to the input label or to either adjacent label, the generated image is judged to express the object's appearance correctly.

E. Implementation

We implemented the regression nets and the cGAN using Keras [15], which is a neural network library in Python. The cGAN was trained using the DGX-1 system (NVIDIA), which contains eight Pascal P100 graphics processing units (GPUs).

IV. DATA COLLECTION

A. Hardware setup

Fig. 3 shows the experimental hardware setup. We used a robotic arm (LBR iiwa 14 R820, KUKA) that has seven degrees of freedom and a robotic hand (Shadow Dexterous Hand E Series, Shadow Robot Company) that has 24 degrees of freedom (Fig. 4). The joint angles of the robotic arm are all set at fixed positions.
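The evaluation rule of Section III-D can be sketched as follows. This is only an illustration under stated assumptions: images are flat lists of pixel intensities, the pixel loss is taken to be the mean squared difference (the paper does not name the exact loss), each class is represented by a single reference image, and class 0 and class 9 are treated as neighbors because the posture is cyclic. The names `pixel_loss`, `nearest_class`, and `is_correct` are hypothetical.

```python
def pixel_loss(image_a, image_b):
    """Mean squared pixel difference between two equally sized images."""
    return sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)


def nearest_class(generated, real_by_class):
    """Index of the class whose real image has the smallest pixel loss."""
    losses = [pixel_loss(generated, real) for real in real_by_class]
    return losses.index(min(losses))


def is_correct(input_label, best_class, num_classes=10):
    """True if best_class equals the input label or a cyclic neighbor."""
    gap = abs(input_label - best_class) % num_classes
    return min(gap, num_classes - gap) <= 1


# Hypothetical 4-pixel "images": class k is a flat image of intensity k / 10.
real_by_class = [[k / 10.0] * 4 for k in range(10)]
generated = [0.32, 0.28, 0.31, 0.30]  # closest to the class 3 reference
best = nearest_class(generated, real_by_class)
```

Under these assumptions, an image generated for label 2, 3, or 4 would be accepted here, since the best-matching class is within one step of each of those labels.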
The robotic hand has five fingers that are equipped with touch sensors on the fingertips. The touch sensors are Pressure Sensor Tactiles (PSTs), which are single-region sensors.

Fig. 6: Haptic exploration with the robotic hand.

Fig. 7: Comparison of the transitions of the minimum square error when changing the span of time-series somatosensory information in the case of a square prism. (Plot of regression loss against epoch; legend: loss:1, loss:3, loss:5, loss:7, loss:9.)

The test objects are set on a horizontal table and their positions are fixed. They rotate around a single pivot, and their angular positions are measured using a rotary encoder (MAS-14-262144N1, Micro Tech Laboratory). A stereo camera (ZED, Stereolabs) is then used to take photographs of the objects. The object images are grayscale images with a size of 128 × 128. We used three objects: a regular square prism, an elliptical cylinder, and a regular triangular prism (Fig. 5). The regular square prism returns to the same pose after rotating through 90°, the elliptical cylinder after 180°, and the regular triangular prism after 120°.

TABLE I: Accuracy of the estimated posture with regression analysis

Shape              Accuracy
Square prism       89.9%
Elliptic cylinder  92.3%
Triangular prism   88.7%

B. Haptic exploration using the robotic hand

We controlled the robotic hand remotely using a glove (CyberGlove II, CyberGlove Systems), and the robotic hand touched the objects at random (Fig. 6). We collected five tactile readings and 24 joint angles of the robotic hand at 10 Hz, and then merged the tactile and joint-angle data into 29-dimensional hand data. We extracted the hand data recorded while the hand was touching the object at two or more points. To use the time-series information of the hand data, we merged each extracted sample with the hand data from several steps before and after it was acquired.
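The windowing of the hand data can be sketched as follows. This is a minimal illustration: the helper name `merge_window`, the list-of-lists layout, and the decision to skip touches too close to the start or end of a recording are assumptions, not details given in the paper. Only the frame size (29 values) and the 10 Hz sampling come from the text.

```python
FRAME_DIM = 29  # 5 touch readings + 24 joint angles per 10 Hz sample


def merge_window(frames, touch_index, steps):
    """Concatenate `steps` consecutive frames centered on a touch frame.

    frames:      list of 29-value frames sampled at 10 Hz
    touch_index: index of the frame where the object was touched
    steps:       odd window size (1, 3, 5, ...)
    Returns a flat list of 29 * steps values, or None when the window
    would run past the start or end of the recording.
    """
    if steps % 2 == 0:
        raise ValueError("steps must be odd so the window is symmetric")
    half = steps // 2
    start, stop = touch_index - half, touch_index + half + 1
    if start < 0 or stop > len(frames):
        return None
    merged = []
    for frame in frames[start:stop]:
        merged.extend(frame)
    return merged


# Hypothetical recording: 10 frames whose values encode the frame index.
recording = [[float(t)] * FRAME_DIM for t in range(10)]
window = merge_window(recording, touch_index=4, steps=5)
```

With `steps=5` the merged vector has 145 values, which matches the input width of the regression nets shown in Fig. 2.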
We collected 3000 extracted somatosensory samples for each object.

V. EXPERIMENTS

A. Pose estimation

We trained the regression nets on the somatosensory information and the estimated object poses. The somatosensory information was split into two sets: we used 1500 samples to train the regression nets and the other 1500 samples to test the regression model and estimate the object poses.

To determine how many steps of the hand data should be merged with the extracted hand data, we evaluated the minimum square error of the regression for time windows of five different sizes. Fig. 7 shows the minimum square error for windows of 1, 3, 5, 7, and 9 steps, taken before and after the extracted hand data, for the square prism. One step means the 29-dimensional hand data recorded while touching the object. Three steps means 87-dimensional hand data composed of the touching data and the hand data from 0.1 s before and after the touch, and the dimensionality continues to grow with the number of steps. With the shorter time-series hand data, the minimum square error did not decrease. In contrast, with the longer time-series hand data, the weights overfitted the training data, so the minimum square error increased as the number of learning epochs increased. With five steps (145 dimensions), which merge the touching data with the hand data from up to 0.2 s before and after the touch, the minimum square error gradually decreased.

The postures estimated from the somatosensory information were classified into 10 classes, each covering one tenth of the object's rotational period. The square prism was classified every 9°, the elliptical cylinder every 18°, and the triangular prism every 12°.
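The cyclic encoding and the 10-class binning can be sketched together. This is an illustration under assumptions: the helper names are hypothetical, the angle is scaled by the object's symmetry period before taking the cosine and sine (the paper only states that the cosine and sine of the posture are used), and classes are numbered from 0 starting at 0°.

```python
import math

# Rotational symmetry period of each test object, in degrees.
PERIOD_DEG = {"square": 90.0, "ellipse": 180.0, "triangle": 120.0}
NUM_CLASSES = 10


def encode_posture(angle_deg, period_deg):
    """Map an angle to (cos, sin) so that one period is one full circle."""
    phase = 2.0 * math.pi * (angle_deg % period_deg) / period_deg
    return math.cos(phase), math.sin(phase)


def decode_posture(cos_value, sin_value, period_deg):
    """Recover an angle within one period from a (cos, sin) pair."""
    phase = math.atan2(sin_value, cos_value) % (2.0 * math.pi)
    return phase / (2.0 * math.pi) * period_deg


def posture_class(angle_deg, period_deg, num_classes=NUM_CLASSES):
    """Bin an angle into equal-width classes over one symmetry period."""
    width = period_deg / num_classes  # 9, 18, or 12 degrees
    # The final modulo guards against floating-point results at the seam.
    return int((angle_deg % period_deg) // width) % num_classes


# 100 degrees on the square prism is the same pose as 10 degrees.
cos_v, sin_v = encode_posture(100.0, PERIOD_DEG["square"])
recovered = decode_posture(cos_v, sin_v, PERIOD_DEG["square"])
label = posture_class(recovered, PERIOD_DEG["square"])
```

Decoding with `atan2` keeps the seam between the last and first class continuous, which is the property the raw angle lacks.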
Table I shows the accuracy, where an estimate counts as correct when its class is the correct class or one of the two adjacent classes. In the elliptic cylinder case, the accuracy was 92.3%, which was the highest score. The accuracies for the other objects were also high, demonstrating that the object pose information can be extracted from five steps of time-series somatosensory information with regression analysis.

Fig. 8: Results for 10 generated images corresponding to the conditional labels that were classified from estimated postures. Fig. 8a shows the square prism, Fig. 8b shows the elliptical cylinder, and Fig. 8c shows the triangular prism. (Each panel shows classes 0 to 9.)

Fig. 9: Loss transitions of the generator and the discriminator. Fig. 9a shows the results for the square prism, Fig. 9b shows the results for the elliptical cylinder, and Fig. 9c shows the results for the triangular prism. (Plots of discriminator loss and generator loss against epoch/10.)

B. Image generation

We matched the object images to the somatosensory information using the estimated postures. We trained the cGAN on 1500 samples of object images and conditional labels that were classified based on the estimated postures.
Training ran for 20000 epochs with a batch size of 32. Fig. 8 shows the image generation results for each object shape. The cGAN was able to generate visual images of the objects. The results also show that changes in the class label corresponded visually to changes in the object pose. Fig. 9 shows the loss transitions of the generator and the discriminator. The loss
