Place: CVC Conference room & Streaming
Thesis Directors:
Dr. Antonio Lopez (Centre de Visió per Computador, Universitat Autònoma de Barcelona)
Dr. Germán Ros (Intel Intelligent Systems Lab)
Thesis Committee:
Dr. Ernest Valveny (Centre de Visió per Computador, Universitat Autònoma de Barcelona)
Dr. Francesc Moreno (Institut de Robòtica i Informàtica Industrial CSIC-UPC)
Dr. José Manuel Álvarez (AI-Infra, NVIDIA)
Abstract:
Manually annotating images to develop vision models has been a major bottleneck ever since computer vision and machine learning began to advance hand in hand. This thesis focuses on leveraging synthetic data to alleviate manual annotation for three perception tasks related to driving assistance and autonomous driving. In all cases, we assume the use of deep convolutional neural networks (CNNs) to develop our perception models.
The first task addresses traffic sign recognition (TSR), a multi-class classification problem. We assume that the number of sign classes to be recognized suddenly increases without annotated samples being available to re-train the corresponding TSR CNN. We show that, by leveraging synthetic samples of the new classes and transforming them with a generative adversarial network (GAN) trained only on the known classes (i.e., without using samples from the new classes), it is possible to re-train the TSR CNN to correctly classify all the signs for a ~1/4 ratio of new-to-known sign classes.

The second task addresses on-board 2D object detection, focusing on vehicles and pedestrians. In this case, we assume that we receive a set of images without the annotations required to train an object detector, i.e., without object bounding boxes. Our goal is therefore to self-annotate these images so that they can later be used to train the desired object detector. To reach this goal, we leverage synthetic data and propose a semi-supervised learning approach based on the co-training idea, using a GAN to reduce the synth-to-real domain shift before applying co-training (a schematic sketch of this self-annotation loop is given below). Our quantitative results show that co-training and GAN-based image-to-image translation complement each other, allowing object detectors to be trained without manual annotation while almost reaching the upper-bound performance of detectors trained on human annotations.

While the previous tasks focus on vision-based perception, the third task addresses LiDAR pointclouds. Our initial goal was to develop a 3D object detector trained on synthetic LiDAR-style pointclouds. While for images we may expect a synth-to-real domain shift due to differences in appearance (e.g., when source and target images come from different camera sensors), we did not expect this for LiDAR pointclouds, since these active sensors factor out appearance and provide sampled shapes. In practice, however, we have found that there can be domain shift even among real-world LiDAR pointclouds: factors such as the sampling parameters of the LiDARs, the sensor-suite configuration on-board the ego-vehicle, and the human annotation of 3D bounding boxes all induce a domain shift. We show this through comprehensive experiments with different publicly available datasets and 3D detectors. This finding redirected our goal towards the design of a GAN for pointcloud-to-pointcloud translation, a relatively unexplored topic.
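To make the second task's self-annotation idea more concrete, the following is a minimal, hypothetical sketch of how GAN-based synth-to-real translation and co-training could be combined to pseudo-label unlabeled real images. It is not the thesis implementation: every function name here (translate_synth_to_real, train_detector, detect) is an illustrative placeholder, and in a real setup the detectors would be detection CNNs and the translator a trained image-to-image GAN.

# Illustrative sketch only, NOT the thesis implementation. Every function here
# is a hypothetical placeholder: in practice the detectors would be CNN-based
# object detectors and the translator a trained image-to-image GAN.

def translate_synth_to_real(image):
    # Placeholder for GAN-based synth-to-real image translation.
    return image

def train_detector(images, boxes):
    # Placeholder: a real implementation would (re-)train a detection CNN.
    return {"trained_on": len(images)}

def detect(model, image):
    # Placeholder: a real detector returns (bounding_box, score) pairs.
    return [("box", 0.9)]

def co_train_self_annotation(synth_images, synth_boxes, real_images,
                             cycles=3, confidence_thr=0.8):
    """Return pseudo bounding-box labels for the unlabeled real images."""
    # 1) Reduce the synth-to-real appearance gap before any training.
    translated = [translate_synth_to_real(im) for im in synth_images]

    # 2) Two detectors trained on copies of the translated synthetic set
    #    act as the two "views" of co-training.
    labeled_a = list(zip(translated, synth_boxes))
    labeled_b = list(zip(translated, synth_boxes))
    pseudo = {}

    for _ in range(cycles):
        det_a = train_detector([im for im, _ in labeled_a],
                               [bx for _, bx in labeled_a])
        det_b = train_detector([im for im, _ in labeled_b],
                               [bx for _, bx in labeled_b])

        # 3) Confident detections of one model become training samples for the
        #    other; the accepted boxes are the self-annotations we keep.
        for idx, im in enumerate(real_images):
            conf_a = [d for d in detect(det_a, im) if d[1] >= confidence_thr]
            conf_b = [d for d in detect(det_b, im) if d[1] >= confidence_thr]
            if conf_a:
                labeled_b.append((im, conf_a))
                pseudo.setdefault(idx, []).extend(conf_a)
            if conf_b:
                labeled_a.append((im, conf_b))
                pseudo.setdefault(idx, []).extend(conf_b)

    # The pseudo-labels can now be used to train the final object detector.
    return pseudo

# Toy usage with dummy data:
labels = co_train_self_annotation(["synth_img"] * 4, [["gt_box"]] * 4,
                                  ["real_img"] * 2)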
Finally, it is worth mentioning that all the synthetic datasets used for these three tasks have been designed and generated in the context of this PhD work and will be publicly released. Overall, we believe this PhD takes several steps forward in encouraging the use of synthetic data for developing deep perception models in the field of driving assistance and autonomous driving.