Publications
Computer Vision and Robotics Conferences
Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite [pdf] [dataset]
Today, visual recognition systems are still rarely employed in robotics applications. Perhaps one of the main reasons for this is the lack of demanding benchmarks that mimic such scenarios. In this paper, we take advantage of our autonomous driving platform to develop novel challenging benchmarks for the tasks of stereo, optical flow, visual odometry / SLAM and 3D object detection. Our recording platform is equipped with four high resolution video cameras, a Velodyne laser scanner and a state-of-the-art localization system. Our benchmarks comprise 389 stereo and optical flow image pairs, stereo visual odometry sequences of 39.2 km length, and more than 200k 3D object annotations captured in cluttered scenarios (up to 15 cars and 30 pedestrians are visible per image). Results from state-of-the-art algorithms reveal that methods ranking high on established datasets such as Middlebury perform below average when being moved outside the laboratory to the real world. Our goal is to reduce this bias by providing challenging benchmarks with novel difficulties to the computer vision community.
A Toolbox for Automatic Calibration of Range and Camera Sensors using a single Shot [pdf] [toolbox]
As a core robotic and vision problem, camera and range sensor calibration has been researched intensely over the last decades. However, robotic research efforts still often get heavily delayed by the requirement of setting up a calibrated system consisting of multiple cameras and range measurement units. With regard to removing this burden, we present an online toolbox for fully automatic camera-to-camera and camera-to-range calibration. Our system is easy to set up and recovers intrinsic and extrinsic camera parameters as well as the transformation between cameras and range sensors within less than one minute. In contrast to existing calibration approaches, which often require user intervention, the proposed method is robust to varying imaging conditions, fully automatic, and easy to use since a single image and range scan proves sufficient for most calibration scenarios. Experiments using a variety of sensors such as greyscale and color cameras, the Kinect 3D sensor and the Velodyne HDL-64 laser scanner show the robustness of our method in different indoor and outdoor settings and under various lighting conditions.
Joint 3D Estimation of Objects and Scene Layout [pdf] [supp] [poster] [data]
We propose a novel generative model that is able to reason jointly about the 3D scene layout as well as the 3D location and orientation of objects in the scene. In particular, we infer the scene topology, geometry as well as traffic activities from a short video sequence acquired with a single camera mounted on a moving car. Our generative model takes advantage of dynamic information in the form of vehicle tracklets as well as static information coming from semantic labels and geometry (i.e., vanishing points). Experiments show that our approach outperforms a discriminative baseline based on multiple kernel learning (MKL) which has access to the same image information. Furthermore, as we reason about objects in 3D, we are able to significantly increase the performance of state-of-the-art object detectors in their ability to estimate object orientation.
A Generative Model for 3D Urban Scene Understanding from Movable Platforms [pdf] [supp] [talk] [slides] [data]
3D scene understanding is key for the success of applications such as autonomous driving and robot navigation. However, existing approaches either produce a mild level of understanding, e.g., segmentation, object detection, or are not accurate enough for these applications, e.g., 3D pop-ups. In this paper we propose a principled generative model of 3D urban scenes that takes into account dependencies between static and dynamic features. We derive a reversible jump MCMC scheme that is able to infer the geometric (e.g., street orientation) and topological (e.g., number of intersecting streets) properties of the scene layout, as well as the semantic activities occurring in the scene, e.g., traffic situations at an intersection. Furthermore, we show that this global level of understanding provides the context necessary to disambiguate current state-of-the-art detectors. We demonstrate the effectiveness of our approach on a dataset composed of short stereo video sequences of 113 different scenes captured by a car driving around a mid-size city.
Visual SLAM for Autonomous Ground Vehicles [pdf]
In this paper we propose a dense stereo V-SLAM algorithm that estimates a dense 3D map representation which is more accurate than raw stereo measurements. To this end, we run a sparse V-SLAM system and use the resulting pose estimates to compute a locally dense representation from dense stereo correspondences. This dense representation is expressed in local coordinate systems which are tracked as part of the SLAM estimate. This allows the dense part to be continuously updated. Our system is driven by visual odometry priors to achieve high robustness when tracking landmarks. Moreover, the sparse part of the SLAM system uses recently published submapping techniques to achieve constant runtime complexity most of the time. The improved accuracy over raw stereo measurements is shown in a Monte Carlo simulation. Finally, we demonstrate the feasibility of our method by presenting outdoor experiments with a car-like robot.
Efficient Large-Scale Stereo Matching [pdf] [slides] [software]
In this paper we propose a novel approach to binocular stereo for fast matching of high-resolution images. Our approach builds a prior on the disparities by forming a triangulation on a set of support points which can be robustly matched, reducing the matching ambiguities of the remaining points. This allows for efficient exploitation of the disparity search space, yielding accurate dense reconstruction without the need for global optimization. Moreover, our method automatically determines the disparity range and can be easily parallelized. We demonstrate the effectiveness of our approach on the large-scale Middlebury benchmark, and show that state-of-the-art performance can be achieved with significant speedups. Computing the left and right disparity maps for a one Megapixel image pair takes about one second on a single CPU core.
Rank Priors for Continuous Non-Linear Dimensionality Reduction [pdf]
Discovering the underlying low-dimensional latent structure in high-dimensional perceptual observations (e.g., images, video) can, in many cases, greatly improve performance in recognition and tracking. However, non-linear dimensionality reduction methods are often susceptible to local minima and perform poorly when initialized far from the global optimum, even when the intrinsic dimensionality is known a priori. In this work we introduce a prior over the dimensionality of the latent space that penalizes high dimensional spaces, and simultaneously optimize both the latent space and its intrinsic dimensionality in a continuous fashion. Ad-hoc initialization schemes are unnecessary with our approach; we initialize the latent space to the observation space and automatically infer the latent dimensionality. We report results applying our prior to various probabilistic non-linear dimensionality reduction tasks, and show that our method can outperform graph-based dimensionality reduction techniques as well as previously suggested initialization strategies. We demonstrate the effectiveness of our approach when tracking and classifying human motion.
Topologically-Constrained Latent Variable Models [pdf]
In dimensionality reduction approaches, the data are typically embedded in a Euclidean latent space. However, for some data sets this is inappropriate. For example, in human motion data we expect latent spaces that are cylindrical or toroidal, which are poorly captured by a Euclidean space. In this paper, we present a range of approaches for embedding data in a non-Euclidean latent space. Our focus is the Gaussian Process latent variable model. In the context of human motion modeling this allows us to (a) learn models with interpretable latent directions enabling, for example, style/content separation, and (b) generalise beyond the data set enabling us to learn transitions between motion styles even though such transitions are not present in the data.
An All-In-One Solution to Geometric and Photometric Calibration [pdf]
We propose a fully automated approach to calibrating multiple cameras whose fields of view may not all overlap. Our technique only requires waving an arbitrary textured planar pattern in front of the cameras, which is the only manual intervention that is required. The pattern is then automatically detected in the frames where it is visible and used to simultaneously recover geometric and photometric camera calibration parameters. In other words, even a novice user can use our system to extract all the information required to add virtual 3D objects into the scene and light them convincingly. This makes it ideal for Augmented Reality applications and we distribute the code under a GPL license.
Intelligent Vehicle Conferences
Motion-without-Structure: Real-time Multipose Optimization for Accurate Visual Odometry [pdf]
State-of-the-art visual odometry systems use methods similar to bundle adjustment (BA) to jointly optimize motion and scene structure. Fusing measurements from multiple time steps and optimizing an error criterion in a batch fashion seems to deliver the most accurate results. However, the scene structure is often of no interest and is a mere auxiliary quantity, although it contributes heavily to the complexity of the problem. Here we propose to use a recently developed incremental motion estimator which delivers relative pose displacements between every pair of frames within a sliding window, inducing a pose graph. Moreover, we introduce a method to learn the uncertainty associated with each of the pose displacements. The pose graph is adjusted by non-linear least squares optimization while incorporating a motion model. Thereby we fuse measurements from multiple time steps much in the same sense as BA does. However, we obviate the need to estimate the scene structure, yielding a very efficient estimator: solving the non-linear least squares problem by a Gauss-Newton method takes approximately 1 ms. We show the effectiveness of our method on simulated and real-world data and demonstrate substantial improvements over incremental methods.
StereoScan: Dense 3d Reconstruction in Real-time [pdf] [supp] [slides] [IV demo] [software]
This paper proposes a novel approach to build 3d maps from high-resolution stereo sequences in real-time. Inspired by recent progress in stereo matching, we propose a sparse feature matcher in conjunction with an efficient and robust visual odometry algorithm. Our reconstruction pipeline combines both techniques with efficient stereo matching and a multi-view linking scheme for generating consistent 3d point clouds. In our experiments we show that the proposed odometry method achieves state-of-the-art accuracy. Including feature matching, the visual odometry part of our algorithm runs at 25 frames per second, while - at the same time - we obtain new depth maps at 3-4 fps, sufficient for online 3d reconstructions.
Sparse Scene Flow Segmentation for Moving Object Detection [pdf]
This paper presents an approach for object detection utilizing sparse scene flow. For consecutive stereo images taken from a moving vehicle, corresponding interest points are extracted. Thus, for every interest point, disparity and optical flow values are known and consequently, scene flow can be calculated. Adjacent interest points describing similar scene flow are considered to belong to one rigid object. The proposed method does not rely on object classes and allows for a robust detection of dynamic objects in traffic scenes. Leading vehicles are continuously detected for several frames. Oncoming objects are detected within five frames after their appearance.
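The step from disparity and optical flow to scene flow can be illustrated with a minimal sketch. It assumes a rectified pinhole stereo rig; the function names and the parameters `f` (focal length in pixels), `cu`, `cv` (principal point) and `base` (baseline in metres) are hypothetical and not taken from the paper. The result is expressed in the camera frame and is not compensated for the vehicle's own motion.

```python
def triangulate(u, v, d, f, cu, cv, base):
    """Back-project pixel (u, v) with disparity d into camera coordinates (metres)."""
    if d <= 0:
        raise ValueError("disparity must be positive")
    z = f * base / d          # depth from the stereo disparity
    x = (u - cu) * z / f      # lateral offset
    y = (v - cv) * z / f      # vertical offset
    return (x, y, z)


def scene_flow(u, v, d, du, dv, d_next, f, cu, cv, base):
    """3D displacement of one interest point between two consecutive stereo frames.

    (du, dv) is the optical flow of the point and d_next its disparity in the
    second frame. The displacement is relative to the camera, so it still
    contains the camera's own motion.
    """
    p0 = triangulate(u, v, d, f, cu, cv, base)
    p1 = triangulate(u + du, v + dv, d_next, f, cu, cv, base)
    return tuple(b - a for a, b in zip(p0, p1))


# A point on the optical axis whose disparity doubles moves from 10 m to 5 m depth.
flow = scene_flow(320, 240, 25, 0, 0, 50, f=500.0, cu=320.0, cv=240.0, base=0.5)
```

Grouping adjacent interest points with similar flow vectors, as the abstract describes, would operate on the output of such a function.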
ObjectFlow: A Descriptor for Classifying Traffic Motion [pdf]
We present and evaluate a novel scene descriptor for classifying urban traffic by object motion. Atomic 3D flow vectors are extracted and compensated for the vehicle's egomotion, using stereo video sequences. Votes cast by each flow vector are accumulated in a bird's eye view histogram grid. Since we are directly using low-level object flow, no prior object detection or tracking is needed. We demonstrate the effectiveness of the proposed descriptor by comparing it to two simpler baselines on the task of classifying more than 100 challenging video sequences into intersection and non-intersection scenarios. Our experiments reveal good classification performance in busy traffic situations, making our method a valuable complement to traditional approaches based on lane markings.
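The accumulation of votes in a bird's eye view grid might look roughly like the sketch below. The grid extent, cell size, number of direction bins and the one-vote-per-vector rule are assumptions made for illustration; the paper's actual voting scheme may differ.

```python
import math


def birdseye_histogram(flows, x_range, z_range, cell, n_dirs=4):
    """Accumulate flow vectors in a ground-plane grid with direction bins.

    flows: iterable of (x, z, vx, vz), the ground-plane position and the
    egomotion-compensated velocity of each flow vector (hypothetical layout).
    Returns grid[iz][ix][k], the number of votes per cell and direction bin.
    """
    nx = int((x_range[1] - x_range[0]) / cell)
    nz = int((z_range[1] - z_range[0]) / cell)
    grid = [[[0] * n_dirs for _ in range(nx)] for _ in range(nz)]
    for x, z, vx, vz in flows:
        ix = int((x - x_range[0]) // cell)
        iz = int((z - z_range[0]) // cell)
        if not (0 <= ix < nx and 0 <= iz < nz):
            continue  # vector lies outside the grid
        angle = math.atan2(vz, vx) % (2 * math.pi)
        k = int(angle / (2 * math.pi) * n_dirs) % n_dirs
        grid[iz][ix][k] += 1
    return grid


flows = [
    (1.0, 5.0, 1.0, 0.5),    # moving mostly to the right
    (1.5, 5.5, -1.0, 1.0),   # same cell, moving away and to the left
    (50.0, 5.0, 1.0, 0.0),   # outside the grid, ignored
]
grid = birdseye_histogram(flows, x_range=(-10.0, 10.0), z_range=(0.0, 20.0), cell=2.0)
```

Flattening `grid` into a single vector would give a fixed-length descriptor that a standard classifier could consume.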
Visual Odometry based on Stereo Image Sequences [pdf] [software]
A common prerequisite for many vision-based driver assistance systems is the knowledge of the vehicle's own movement. In this paper we propose a novel approach for estimating the egomotion of the vehicle from a sequence of stereo images. Our method is directly based on the trifocal geometry between image triples, so no time-consuming recovery of the 3-dimensional scene structure is needed. The only assumption we make is a known camera geometry, where the calibration may also vary over time. We employ an Iterated Sigma Point Kalman Filter in combination with a RANSAC-based outlier rejection scheme which yields robust frame-to-frame motion estimation even in dynamic environments. A high-accuracy inertial navigation system is used to evaluate our results on challenging real-world video sequences. Experiments show that our approach is clearly superior to other filtering techniques in terms of both accuracy and run-time.
Monocular Road Mosaicing for Urban Environments [pdf]
Marking-based lane recognition requires an unobstructed view onto the road. In practice, however, heavy traffic often constrains the visual field, especially in urban scenarios such as crossroads. In this paper we present a novel approach to road mosaicing for dynamic environments. Our method is based on a multistage registration procedure and uses blending techniques. We show that under modest assumptions accurate registration is possible from monocular image sequences. We further demonstrate that fusing visual information from previous frames into the current view can greatly extend the camera's field of view.
Workshops
Realistic Modeling of Water Droplets for Monocular Adherent Raindrop Recognition [pdf]
In this paper, we propose a novel raindrop shape model for the detection of view-disturbing, adherent raindrops on inclined surfaces. Whereas state-of-the-art techniques do not consider inclined surfaces because they model the droplets as sphere sections with equal contact angles, our model incorporates cubic Bézier curves that provide a low dimensional and physically interpretable representation of a raindrop surface. The parameters are empirically deduced from numerous observations of different raindrop sizes and surface inclination angles. The model can be easily integrated into a probabilistic framework for raindrop recognition, using geometrical optics to simulate the visual raindrop appearance. In comparison to a sphere section model, the proposed model improves droplet surface accuracy by up to three orders of magnitude.
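For readers unfamiliar with the curve family mentioned above, a cubic Bézier curve is defined by four control points and can be evaluated with the Bernstein basis. The sketch below shows only this generic evaluation; the control points are made up for illustration and are not the empirically deduced raindrop parameters from the paper.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1].

    Each control point is a tuple of coordinates; the result has the same length.
    """
    s = 1.0 - t
    # Bernstein basis polynomials of degree 3
    b0, b1, b2, b3 = s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3
    return tuple(
        b0 * a + b1 * b + b2 * c + b3 * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


# Hypothetical 2D profile: the curve starts at p0, ends at p3, and the two inner
# control points set the tangent directions at either end.
profile = [cubic_bezier((0, 0), (0, 1), (1, 1), (1, 0), i / 10) for i in range(11)]
```

Because the inner control points fix the end tangents independently, such a curve can represent a profile with different slopes at its two ends, which a sphere section with equal contact angles cannot.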
Video-based raindrop detection for improved image registration [pdf]
In this paper we present a novel approach to improved image registration in rainy weather situations. To this end, we perform monocular raindrop detection in single images based on a photometric raindrop model. Our method is capable of detecting raindrops precisely, even in front of complex backgrounds. The effectiveness is demonstrated by a significant increase in image registration accuracy which also allows for successful image restoration. Experiments on video sequences taken from within a moving vehicle prove the applicability to real-world scenarios.
Journal Articles
Team AnnieWAY's entry to the Grand Cooperative Driving Challenge 2011 [pdf]
In this paper we present the concepts and methods developed for the autonomous vehicle AnnieWAY, our winning entry to the Grand Cooperative Driving Challenge of 2011. We describe algorithms for sensor fusion, vehicle-to-vehicle communication and cooperative control. Furthermore, we analyze the performance of the proposed methods and compare them to those of competing teams. We close with our results from the competition and lessons learned.
Diploma thesis
Human Body Tracking with Rank Priors for Non-Linear Dimensionality Reduction [pdf]
Student Research Project
Automatic Multiple Camera Calibration [pdf]