The CRV program includes a number of invited speakers from around the world who present their research programs in computer vision and robotics. The CRV 2020 speakers are:
Keynote Speakers
Jonathan How
Massachusetts Institute of Technology

Talk Title: Navigation and Mapping for Robot Teams in Uncertain Environments


Abstract
Many robotic tasks require robot teams to autonomously operate in challenging, partially observable, dynamic environments with limited field-of-view sensors. In such scenarios, individual robots need to be able to plan and execute safe paths on short timescales to avoid imminent collisions. Robots can leverage high-level semantic descriptions of the environment to plan beyond their immediate sensing horizon. For mapping on longer timescales, the agents must also be able to align and fuse imperfect and partial observations to construct a consistent and unified representation of the environment. Furthermore, these tasks must be done autonomously onboard, which typically adds significant complexity to the system. This talk will highlight three recently developed solutions to these challenges that have been implemented to (1) robustly plan paths and demonstrate high-speed agile flight of a quadrotor in unknown, cluttered environments; (2) plan beyond the line of sight by utilizing the learned context within the local vicinity, with applications in last-mile delivery; and (3) correctly synchronize partial and noisy representations and fuse maps acquired by (single or multiple) robots using a multi-way data association algorithm, showcased on a simultaneous localization and mapping (SLAM) application.
Bio
Jonathan P. How is the Richard C. Maclaurin Professor of Aeronautics and Astronautics at the Massachusetts Institute of Technology. He received a B.A.Sc. from the University of Toronto in 1987, and his S.M. and Ph.D. in Aeronautics and Astronautics from MIT in 1990 and 1993, respectively. Prior to joining MIT in 2000, he was an assistant professor in the Department of Aeronautics and Astronautics at Stanford University. He was the editor-in-chief of the IEEE Control Systems Magazine (2015-19) and was elected to the Board of Governors of the IEEE Control Systems Society (CSS) in 2019. His research focuses on robust planning and learning under uncertainty with an emphasis on multiagent systems. His work has been recognized with multiple awards, including the 2020 AIAA Intelligent Systems Award, the 2002 Institute of Navigation Burka Award, the 2011 IFAC Automatica award for best applications paper, the 2015 AeroLion Technologies Outstanding Paper Award for Unmanned Systems, the 2015 IEEE Control Systems Society Video Clip Contest, the IROS Best Paper Award on Cognitive Robotics (2017 and 2019), and three AIAA Best Paper in Conference Awards (2011-2013). He was awarded the Air Force Commander's Public Service Award (2017). He is a Fellow of IEEE and AIAA.
Simon Lucey
Carnegie Mellon University

Talk Title: Geometric reasoning in machine vision using only 2D supervision


Abstract
Machine vision has made tremendous progress over the last decade with respect to perception. Much of this progress can be attributed to two factors: the ability of deep neural networks (DNNs) to reliably learn a direct relationship between images and labels; and access to a plentiful number of images with corresponding labels. We often refer to these labels as supervision – because they directly supervise what we want the vision algorithm to predict when presented with an input image. 2D labels are relatively easy for the computer vision community to come by: human annotators are hired to draw – literally through mouse clicks on a computer screen – boxes, points, or regions for a few cents per image. But how to obtain 3D labels is an open problem for the robotics and vision community. Rendering through computer-generated imagery (CGI) is problematic, since the synthetic images seldom match the appearance and geometry of the objects we encounter in the real world. Hand annotation by humans is preferable, but current strategies rely on the tedious process of associating the natural images with a corresponding external 3D shape – something we refer to as “3D supervision”. In this talk, I will discuss recent efforts my group has been taking to train a geometric reasoning system using solely 2D supervision. By inferring the 3D shape solely from 2D labels we can ensure that all geometric variation in the training images is learned. This innovation sets the groundwork for training the next generation of reliable geometric reasoning AIs needed to meet emerging needs in autonomous transport, disaster relief, and endangered species preservation.
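To make the idea of 2D-only supervision concrete, the sketch below is a hypothetical illustration (not the speaker's specific architecture): a network predicts 3D keypoints plus a weak-perspective camera, the predicted 3D shape is projected into the image, and the loss compares the projection against annotated 2D keypoints only.

```python
# Minimal sketch, assuming 2D keypoint annotations and a weak-perspective camera.
# All module names and shapes here are illustrative placeholders.
import torch
import torch.nn as nn

class Lifter(nn.Module):
    """Predicts 3D keypoints plus weak-perspective camera (scale, 2D translation)."""
    def __init__(self, feat_dim=512, n_kpts=12):
        super().__init__()
        self.n_kpts = n_kpts
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 3 * n_kpts + 3),  # XYZ per keypoint + (s, tx, ty)
        )

    def forward(self, feats):
        out = self.head(feats)
        kpts3d = out[:, : 3 * self.n_kpts].view(-1, self.n_kpts, 3)
        scale = out[:, -3].view(-1, 1, 1)
        trans = out[:, -2:].view(-1, 1, 2)
        return kpts3d, scale, trans

def project_weak_perspective(kpts3d, scale, trans):
    """Drop Z and apply scale/translation: x2d = s * X[:, :2] + t."""
    return scale * kpts3d[..., :2] + trans

# 2D-only supervision: the only ground truth entering the loss is 2D.
model = Lifter()
feats = torch.randn(8, 512)        # image features from any backbone (placeholder)
kpts2d_gt = torch.rand(8, 12, 2)   # hand-annotated 2D keypoints (placeholder)
kpts3d, s, t = model(feats)
loss = nn.functional.mse_loss(project_weak_perspective(kpts3d, s, t), kpts2d_gt)
loss.backward()
```

The point of the sketch is that no 3D shape ever appears in the loss; the 3D structure is recovered implicitly through the projection constraint across many training images.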
Bio
Simon Lucey (Ph.D.) is an associate research professor within the Robotics Institute at Carnegie Mellon University, where he is part of the Computer Vision Group and leader of the CI2CV Laboratory. Since 2017, he has also been a principal scientist at Argo AI. Before this, he was an Australian Research Council Future Fellow at the CSIRO (Australia's premier government science organization) for five years. Simon's research interests span computer vision, robotics, and machine learning. He enjoys drawing inspiration from vision researchers of the past in an attempt to unlock the computational and mathematical models that underlie the processes of visual perception.

Symposium Speakers
Animesh Garg
University of Toronto

Talk Title: Generalizable Autonomy for Robot Manipulation


Abstract
Data-driven methods in robotics circumvent hand-tuned feature engineering, but they lack guarantees and often incur a massive computational expense. My research aims to bridge this gap and enable generalizable imitation for robot autonomy. We need to build systems that can capture semantic task structures that promote sample efficiency and can generalize to new task instances across visual, dynamical, or semantic variations. This involves designing algorithms that unify learning with perception, control, and planning. In this talk, I will discuss how inductive biases and priors help with Generalizable Autonomy. First, I will talk about the choice of action representations in RL and imitation from ensembles of suboptimal supervisors. Then I will talk about latent variable models in self-supervised learning. Finally, I will talk about meta-learning for multi-task learning and data gathering in robotics.
Minglun Gong
University of Guelph

Talk Title: Novel network architectures for arbitrary image style transfer


Abstract
Style transfer has been an important topic in both computer vision and graphics. Since the pioneering work of Gatys et al. demonstrated the power of stylization through optimization in deep feature space, a number of approaches have been developed for real-time arbitrary style transfer. However, even the state-of-the-art approaches may generate insufficiently stylized results under challenging cases. Two novel network architectures are discussed in this talk for addressing these issues and delivering better performance. We first observe that only considering features in the input style image for global deep feature statistics matching or local patch swapping may not always ensure a satisfactory style transfer. Hence, we propose a novel transfer framework that aims to jointly analyze and better align exchangeable features extracted from the content and style image pair. This allows the style features used for transfer to be more compatible with content information in the content image, leading to more structured stylization results. Another observation is that existing methods try to generate the stylized result in a single shot, making it difficult to satisfy constraints on semantic structures in the content images and style patterns in the style images. Inspired by work on error correction, we propose a self-correcting model that predicts what is wrong with the current stylization and refines it accordingly in an iterative manner. For each refinement, we propagate the error features across both the spatial and scale domains and invert the processed features into a residual image.
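For readers unfamiliar with the Gatys et al. formulation the abstract builds on, the following minimal sketch (an illustration only, not the speaker's proposed architectures) shows the standard Gram-matrix style loss and stylization by optimization in deep feature space; the feature extractor (e.g., a pretrained VGG) is left as a placeholder.

```python
# Sketch of the Gatys-style loss: style is matched via Gram matrices of deep
# features, content via feature distance. Feature maps here are random stand-ins
# for features a pretrained CNN would produce.
import torch

def gram_matrix(feats):
    """feats: (B, C, H, W) -> (B, C, C) normalized channel-correlation (Gram) matrix."""
    b, c, h, w = feats.shape
    f = feats.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_content_loss(feat_out, feat_content, feat_style, style_weight=1e4):
    """Single-layer version; the original formulation sums this over several CNN layers."""
    content_loss = torch.nn.functional.mse_loss(feat_out, feat_content)
    style_loss = torch.nn.functional.mse_loss(gram_matrix(feat_out),
                                              gram_matrix(feat_style))
    return content_loss + style_weight * style_loss

# Stylization by optimization: treat the output (here, its features) as the
# variable being optimized; real pipelines optimize pixels and re-extract features.
feat_content = torch.randn(1, 64, 32, 32)
feat_style = torch.randn(1, 64, 32, 32)
feat_out = feat_content.clone().requires_grad_(True)
opt = torch.optim.Adam([feat_out], lr=0.05)
for _ in range(100):
    opt.zero_grad()
    style_content_loss(feat_out, feat_content, feat_style).backward()
    opt.step()
```

Real-time arbitrary style transfer methods typically replace this per-image optimization with a feed-forward network trained against losses of this kind.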
Helge Rhodin
University of British Columbia

Talk Title: Computer Vision for Interactive Computer Graphics


Abstract
I work at the intersection of computer graphics and machine learning-based computer vision. I will be talking about my work on human and animal motion capture and its impact on gesture-driven character animation, VR telepresence, and automation in neuroscience. Moreover, I will outline my ongoing work on replacing hand-crafted CG and CV models with learned ones using self-supervision through multi-view and other geometric and physical constraints, including gravity.
Ismail Ben Ayed
ETS Montreal

Talk Title: Constrained Deep Networks


Abstract
Embedding constraints on the outputs of deep networks has wide applicability in learning, vision and medical imaging. For instance, in weakly supervised learning, constraints can mitigate the lack of full and laborious annotations, leveraging unlabeled data and guiding training with domain-specific knowledge. Also, adversarial robustness, which currently attracts substantial interest in the field, amounts to imposing constraints on network outputs. In this talk, I will discuss some recent developments in those research directions, emphasize how more attention should be paid to optimization methods, and include various illustrations, applications and experimental results.
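As a concrete, deliberately simplified example of constraining network outputs in the weakly supervised setting described above (not the speaker's exact formulation), the sketch below adds a differentiable penalty that keeps the predicted foreground size of a segmentation network within a known range, a typical way to inject domain knowledge when full masks are unavailable.

```python
# Illustrative only: a soft size constraint on segmentation outputs.
# The bounds, shapes, and probabilities below are placeholders.
import torch

def size_constraint_penalty(probs, low, high):
    """probs: (B, H, W) foreground probabilities. Quadratic penalty outside [low, high]."""
    size = probs.sum(dim=(1, 2))                  # soft predicted region size per image
    below = torch.clamp(low - size, min=0.0) ** 2
    above = torch.clamp(size - high, min=0.0) ** 2
    return (below + above).mean()

logits = torch.randn(4, 64, 64, requires_grad=True)   # stand-in for network outputs
probs = torch.sigmoid(logits)
penalty = size_constraint_penalty(probs, low=200.0, high=800.0)
penalty.backward()  # in training, this term is added to the partial supervised loss
```

In practice such a penalty is combined with whatever partial supervision is available (e.g., a handful of labelled pixels), and stricter variants, such as Lagrangian or log-barrier formulations, can enforce the constraint more tightly.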
François Pomerleau
Université Laval

Talk Title: From subterranean to subarctic autonomous exploration


Abstract
Autonomous navigation algorithms are now pushed to the extreme, with robotic deployments happening at harsh experimental sites. Gaining robustness against dynamic and unstructured environments, along with managing unforeseen robot dynamics, is mandatory to reach a larger spectrum of autonomous navigation applications. This talk will share lessons learned from two such difficult environments: a subterranean urban circuit and a subarctic forest. The first part of the presentation will cover results from our latest participation in the DARPA Subterranean (SubT) Challenge, for which our lab was the only Canadian participant. In February 2020, we supported the deployment of nine robots in a disaffected nuclear power plant in Elma, Washington. Our team, named CTU-CRAS-NORLAB, finished third against leading research laboratories from around the world. The second part will present our latest research results on lidar-based mapping in winter conditions.
Igor Gilitschenski
MIT

Talk Title: Robust Perception for Autonomous Systems


Abstract
In recent years we have seen an exploding interest in the real-world deployment of autonomous systems, such as autonomous drones or autonomous ground vehicles. This interest was sparked by major advancements in robot perception, planning, and control. However, robust operation in the “wild” remains a challenging goal. Correct consideration of the broad variety of real-world conditions cannot be achieved by merely optimizing algorithms that have been shown to work well in controlled environments. In this talk, I will focus on robust perception for autonomous systems. First, I will discuss the challenges involved in handling dynamic and changing environments in Visual-SLAM. Second, I will discuss autonomous vehicle navigation concepts that do not rely on highly detailed maps or sufficiently good weather for localization. Finally, I will discuss the role of perception in interactive autonomy. Particularly, I will focus on the use of simulators for edge-case generation, learning, and real-world transfer of deep driving policies.
Sajad Saeedi
Ryerson University

Talk Title: Bringing Computer Vision to Robotics


Abstract
Technological advancements in machine learning and robotics have moved robots beyond manufacturing facilities and into households, where they execute simple tasks such as vacuuming and lawn mowing. To extend the capabilities of these systems, robots need to advance beyond just reporting ‘what’ is ‘where’ in an image toward spatial AI systems that can interact usefully with their unstructured and dynamic environments. There is therefore an urgent need for novel perception and control systems that can deal with real-world constraints such as limited resources, dynamic objects, and uncertain information. In this talk, several recent projects related to robotics and machine perception are presented. Recent developments, such as accelerated inference on focal-plane sensor-processor arrays, are introduced. These developments have significant economic and scientific impacts on our society and will open up new possibilities for real-time and reliable use of AI and robotics in real-world, dynamic environments. At the end of the talk, future research directions will be outlined. The main goal for future research will be developing reliable, high-speed, and low-power perception systems that can be deployed in real-world applications. It is hypothesized that while machine learning algorithms will provide the required reliability, data processing in the focal plane will help achieve the desired energy consumption and run-time speed targets. Reliable, fast, and low-power computation for scene understanding and spatial awareness will be of great interest not only to the robotics community but also to other fields, such as the Internet of Things (IoT), Industry 4.0, privacy-aware devices, and networked visual devices. These research directions will help entrepreneurs and academic researchers identify new opportunities in machine learning and its application to robotics in real-world, dynamic environments.
Xiaoming Liu
Michigan State University

Talk Title: Monocular Vision-based 3D Perception for Autonomous Driving


Abstract
Perception in the 3D world is an essential requirement for autonomous driving. Most existing algorithms rely on depth sensors such as LiDAR for 3D perception. In this talk, we will present our recent efforts on 3D perception based solely on monocular RGB images. First, we describe a unified 3D region proposal network (3D-RPN) for 3D detection of vehicles, pedestrians, and bicycles. Second, a novel inverse graphics framework is designed to model the 3D shape and albedo of generic objects, and fitting these models to an image leads to 3D reconstruction of the objects. Finally, we will briefly present the low-level and high-level computer vision efforts for autonomous driving at MSU, including LiDAR and RGB fusion, depth estimation, and semantic segmentation forecasting.
Gregor Miller
Google

Talk Title: OpenVL, a developer-friendly abstraction of computer vision


Abstract
Computer vision is a complicated topic, and the fruits of our efforts often get included in libraries such as OpenCV or released as open-source projects by university labs or companies. Here, the presentation of our work is often aimed at other researchers or those who are well versed in computer vision. However, to encourage widespread or faster adoption of these technologies, it is important that they be accessible to those who are not necessarily experts in the field. This talk is about the principles that underlie our OpenVL framework and that guided us to create a computer vision platform usable by mainstream developers with no specific expertise in computer vision. My hope is that this will be inspirational and potentially guide others in how to present their research or products more effectively to their target users.
Negar Rostamzadeh
Google Brain

Talk Title: On Label Efficient Machine Perception


Abstract
Deep learning methods often require large amounts of labelled data, which can be impractical or expensive to acquire. My talk will cover four categories of work on minimizing the required labelling effort without sacrificing performance: (i) algorithms requiring fewer annotations per instance (point-level annotation for object counting and instance-level semantic segmentation); (ii) active learning to label the most informative part of the data; (iii) domain adaptation when source-domain annotation is cheaper and easier to acquire; and (iv) multi-modal learning for the task of few-shot classification. I will briefly touch on the first three categories and discuss multi-modal learning in depth.
Matthew Walter
Toyota Technological Institute at Chicago

Talk Title: Natural Language Learning for Human-Robot Collaboration


Abstract
Natural language promises an efficient and flexible means for humans to communicate with robots, whether they are assisting the physically or cognitively impaired, or performing disaster mitigation tasks as our surrogates. Recent advancements have given rise to robots that are able to interpret natural language commands that direct object manipulation and spatial navigation. However, most methods require prior knowledge of the metric and semantic properties of the objects and places that comprise the robot's environment. In this talk, I will present our work that enables robots to successfully follow natural language navigation instructions within novel, unknown environments. I will first describe a method that treats language as a sensor, exploiting information implicit and explicit in the user's command to learn distributions over the latent spatial and semantic properties of the environment and over the robot's intended behavior. The method then learns a belief space policy that reasons over these distributions to identify suitable navigation actions. In the second part of the talk, I will present an alternative formulation that represents language understanding as a multi-view sequence-to-sequence learning problem. I will introduce an alignment-based neural encoder-decoder architecture that translates free-form instructions to action sequences based on images of the observable world. Unlike previous methods, this architecture uses no specialized linguistic resources and can be trained in a weakly supervised, end-to-end fashion, which allows for generalization to new domains. Time permitting, I will then describe how we can effectively invert this model to enable robots to generate natural language utterances. I will evaluate the efficacy of these methods on a combination of benchmark navigation datasets and through demonstrations on a voice-commandable wheelchair.