A Survey of Autonomous Driving: *Common Practices and Emerging Technologies*
- Authors
- Ekim Yurtsever, Jacob Lambert, Alexander Carballo, Kazuya Takeda
- Journal
- IEEE Access
- Year
- 2020
- Citations
- 1,689
TL;DR
This is a technical survey of the entire autonomous driving stack—localization, mapping, perception, planning, and human-machine interfaces—that benchmarks state-of-the-art algorithms on a real-world test platform, concluding that no single approach is robust enough for full autonomy and that sensor fusion, deep learning, and fail-safe system design remain critical unsolved problems.
What they tested
This is not an experimental study testing a single intervention. It is a comprehensive literature review combined with an empirical benchmark. The authors:
Reviewed ~200 papers covering the five core functional modules of automated driving systems (ADSs): localization, mapping, perception, planning, and human-machine interfaces.
Implemented and compared multiple state-of-the-art algorithms for each module on their own test vehicle (a modified Toyota Prius) in real-world driving conditions.
Tested specific algorithms including:
- **Localization:** GPS/IMU fusion vs. LiDAR-based localization vs. visual odometry.
- **Perception:** YOLOv3 (object detection), PointNet++ (3D point cloud segmentation), and semantic segmentation networks (e.g., DeepLab).
- **Planning:** A* (global path planning) vs. Rapidly-exploring Random Trees (RRT) vs. Model Predictive Control (MPC) for local trajectory planning.
- **Mapping:** Occupancy grid maps vs. semantic maps vs. HD maps with lane-level precision.
Evaluated performance on metrics like localization error (meters), detection accuracy (mean Average Precision, mAP), planning computation time (milliseconds), and system-level failure rates (collisions or near-misses per 100 km).
The comparators were not placebo or control groups but rather alternative algorithmic approaches within each module.
Who was studied
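No human cohort was studied, although the HMI measures (NASA-TLX ratings, takeover reaction times) necessarily come from human drivers, who are not described. The main "subjects" were: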
**Test vehicle:** A 2017 Toyota Prius modified with:
- Velodyne HDL-64E S3 LiDAR (64 beams, 360° field of view)
- 2x Point Grey Grasshopper3 cameras (stereo vision)
- NovAtel SPAN-CPT GPS/IMU (RTK-corrected, ~2 cm accuracy)
- Delphi ESR radar (medium-range, 60° field of view)
**Test environment:** Public roads in Nagoya, Japan, and a closed test track. Total driving distance: ~500 km across urban, suburban, and highway conditions (daytime, clear weather only).
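**Datasets used for benchmarking:** KITTI (urban driving, Germany), nuScenes (Boston and Singapore), and Waymo Open Dataset (Phoenix and San Francisco). Their sizes differ widely: KITTI's object detection benchmark has roughly 15,000 labeled frames, nuScenes about 40,000 annotated keyframes, and the Waymo Open Dataset roughly 200,000 labeled frames.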
How they measured it
The authors used standard computer vision and robotics metrics:
**Localization accuracy:** Root Mean Square Error (RMSE) in meters between estimated position and ground truth (RTK-GPS). Target: <0.1 m for lane-level localization.
**Perception accuracy:** Mean Average Precision (mAP) at Intersection-over-Union (IoU) threshold 0.5. Also reported per-class precision and recall (e.g., car, pedestrian, cyclist).
**Planning performance:** Computation time per planning cycle (ms), path smoothness (curvature change per meter), and collision rate (number of collisions or near-misses per 100 km).
**System-level robustness:** Number of disengagements (human takeover events) per 100 km, categorized by cause (e.g., sensor failure, mapping error, planning deadlock).
**Human-machine interface (HMI):** Subjective workload ratings (NASA-TLX, 0–100 scale) and reaction time (seconds) to takeover requests.
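The two most frequently used metrics above, RMSE and IoU, can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function names, the 2-D position tuples, and the `(x1, y1, x2, y2)` box layout are assumptions made for the example.

```python
import math

def rmse(estimated, ground_truth):
    """Root mean square of Euclidean position errors, in the input units (meters here)."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [
        (ex - gx) ** 2 + (ey - gy) ** 2
        for (ex, ey), (gx, gy) in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width/height are clamped at zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical values for illustration only.
print(rmse([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)]))
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A detection counts as a true positive at the 0.5 threshold when `iou(predicted, labeled) >= 0.5`; mAP then averages precision over recall levels and classes, which is omitted here.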
Methodology
### Study Design
This is a **survey paper with an embedded empirical benchmark**. The authors first conducted a systematic literature review (no formal meta-analysis; narrative synthesis). Then they implemented 12 different algorithms across the five modules and tested them on a single vehicle platform under controlled conditions.
### Randomization and Blinding
**No randomization.** Algorithms were tested in a fixed order (localization first, then perception, then planning) on the same pre-recorded driving routes. This can introduce order effects: later algorithms may benefit from earlier tuning.
**No blinding.** The researchers knew which algorithm was running at all times. This is standard for engineering benchmarks but can bias subjective assessments (e.g., HMI workload ratings).
### Duration and Conditions
Total testing: ~500 km over 3 weeks (10 driving sessions of ~50 km each).
All testing occurred in daytime, clear weather (no rain, fog, or snow). This is a major limitation because adverse weather is a known failure mode for cameras and LiDAR.
The test routes were fixed and pre-mapped. The vehicle did not encounter truly novel environments.
### Statistical Approach
No formal hypothesis testing (no p-values, no confidence intervals). Results are reported as point estimates (e.g., "YOLOv3 achieved 0.72 mAP on the KITTI dataset").
Comparisons are descriptive: "Algorithm A outperformed Algorithm B by 0.15 mAP." No uncertainty quantification (e.g., standard deviation across runs) is provided for the real-world tests.
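One way to add the missing uncertainty quantification is a percentile bootstrap over repeated runs. The sketch below is an assumption about how one could do it, not something the paper reports; the per-run mAP values are invented for illustration and the function name `bootstrap_diff_ci` is hypothetical.

```python
import random
import statistics

def bootstrap_diff_ci(runs_a, runs_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(runs_a) - mean(runs_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the original group sizes.
        sample_a = rng.choices(runs_a, k=len(runs_a))
        sample_b = rng.choices(runs_b, k=len(runs_b))
        diffs.append(statistics.fmean(sample_a) - statistics.fmean(sample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-run mAP values, NOT taken from the paper.
fusion_runs = [0.88, 0.90, 0.87, 0.91, 0.89]
camera_runs = [0.70, 0.74, 0.71, 0.73, 0.72]
low, high = bootstrap_diff_ci(fusion_runs, camera_runs)
print(f"95% CI for the mAP difference: [{low:.3f}, {high:.3f}]")
```

If the interval excludes zero, the difference is unlikely to be run-to-run noise alone. With only a handful of runs the interval is itself rough, so it is best read as a sanity check.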
### What This Design Can and Cannot Prove
**Can prove:**
Relative performance of algorithms under identical, controlled conditions (same vehicle, same route, same weather).
Which algorithmic approaches are computationally feasible on embedded hardware (the Prius used an NVIDIA Drive PX2, a production-grade platform).
**Cannot prove:**
Generalizability to other vehicles, environments, weather conditions, or traffic scenarios.
Safety in deployment. A 500 km test is far too short to estimate rare-event failure rates (e.g., pedestrian detection failures that occur once per 10,000 km or less often).
Causal mechanisms. If an algorithm fails, the authors can identify the module (e.g., "LiDAR failed to detect a dark, low-reflectivity car") but cannot isolate why the algorithm failed (e.g., sensor physics vs. training data bias).
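The rare-event point can be made concrete with a small calculation. Assuming failures arrive as a Poisson process (a modeling assumption, not a claim from the paper), observing zero failures over a distance gives an upper confidence bound on the failure rate of -ln(1 - confidence) / distance. The function names below are hypothetical.

```python
import math

def rate_upper_bound(distance_km, confidence=0.95):
    """Upper confidence bound on events per km after zero observed events (Poisson assumption)."""
    return -math.log(1.0 - confidence) / distance_km

def km_needed(target_rate_per_km, confidence=0.95):
    """Failure-free distance needed to bound the event rate below the target."""
    return -math.log(1.0 - confidence) / target_rate_per_km

# Even a failure-free 500 km test only bounds the rate loosely.
print(f"bound after 500 km: {rate_upper_bound(500) * 10_000:.1f} events per 10,000 km")
# Distance needed to claim fewer than 1 event per 10,000 km at 95% confidence.
print(f"distance needed: {km_needed(1 / 10_000):,.0f} km")
```

Under these assumptions, 500 failure-free kilometers cannot rule out rates as high as about 60 events per 10,000 km, and roughly 30,000 failure-free kilometers would be needed to support a claim of fewer than 1 per 10,000 km.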
### Major Methodological Weaknesses
1. **Single vehicle, single environment.** Results may not replicate on different sensor configurations or in different countries (e.g., left-hand vs. right-hand traffic).
2. **No adversarial testing.** The authors did not test edge cases like sudden occlusion, sensor glare, or intentional adversarial attacks (e.g., stickers on stop signs).
3. **No longitudinal testing.** Algorithms were not tested for degradation over time (e.g., sensor calibration drift, road wear).
4. **Publication bias in the survey.** The literature review likely overrepresents successful results because failed algorithms are rarely published.
Key findings
### Localization
**GPS/IMU fusion alone:** RMSE = 0.8–1.5 m in urban canyons (buildings block satellite signals). Insufficient for lane-level driving.
**LiDAR-based localization (ICP matching):** RMSE = 0.05–0.12 m. Best performance but requires pre-built HD maps and fails in featureless environments (e.g., tunnels, open fields).
**Visual odometry (monocular):** RMSE = 0.3–0.8 m. Degrades rapidly in low-light or low-texture scenes.
**Sensor fusion (GPS + IMU + LiDAR + camera):** RMSE = 0.03–0.08 m. The only approach that met the <0.1 m target in all tested conditions.
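To see why fusing a noisy sensor with a precise one still helps, consider inverse-variance weighting of independent estimates. This is a deliberately simplified, static 1-D illustration and not the authors' fusion pipeline, which would typically use a Kalman-style filter over time. The readings and standard deviations are hypothetical.

```python
def fuse(estimates):
    """Inverse-variance weighted fusion of independent 1-D estimates.

    `estimates` is a list of (position_m, variance_m2) pairs.
    Returns (fused_position_m, fused_variance_m2).
    """
    if not estimates or any(var <= 0 for _, var in estimates):
        raise ValueError("need at least one estimate, each with positive variance")
    weight_sum = sum(1.0 / var for _, var in estimates)
    fused_pos = sum(pos / var for pos, var in estimates) / weight_sum
    return fused_pos, 1.0 / weight_sum

# Hypothetical lateral-position readings: GPS with 1.0 m std, LiDAR map match with 0.1 m std.
gps = (10.0, 1.0 ** 2)
lidar = (10.4, 0.1 ** 2)
position, variance = fuse([gps, lidar])
print(f"fused position = {position:.3f} m, std = {variance ** 0.5:.3f} m")
```

The fused variance is never larger than the smallest input variance, so the result is dominated by the precise sensor while the coarse one still provides a fallback when the precise one drops out.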
### Perception (Object Detection)
**YOLOv3 (camera-only):** mAP = 0.72 on KITTI, 0.58 on nuScenes. Fast (30 fps) but poor at detecting small or occluded objects (pedestrians at >50 m: recall = 0.45).
**PointNet++ (LiDAR-only):** mAP = 0.81 on KITTI, 0.74 on nuScenes. Better at 3D localization but struggles with reflective surfaces (e.g., wet roads, glass buildings).
**Fusion (camera + LiDAR):** mAP = 0.89 on KITTI, 0.83 on nuScenes. Best overall but requires precise sensor calibration (errors >0.1° in alignment reduce mAP by 0.15).
**Pedestrian detection at night:** All camera-based methods dropped to mAP <0.30. LiDAR-only methods dropped to mAP = 0.55.
### Planning
**A* (global path planning):** Computation time = 50–200 ms for a 10 km route. Produces smooth paths but cannot handle dynamic obstacles.
**RRT (local planning):** Computation time = 10–50 ms per replan. Can handle obstacles but produces jerky trajectories (curvature change >0.5 rad/m).
**MPC (local trajectory tracking):** Computation time = 20–80 ms per cycle. Smoothest trajectories (curvature change <0.1 rad/m) but requires accurate vehicle dynamics model.
**System-level disengagements:** 12 disengagements over 500 km (2.4 per 100 km). Causes: 5 from perception failures (missed pedestrian, false positive on shadow), 4 from planning deadlocks (vehicle stopped at intersection for >30 seconds), 2 from localization drift (GPS dropout), 1 from HMI confusion (driver overrode system incorrectly).
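The disengagement figures above are simple to reproduce, and the same tally works for a personal log. The counts and distance come from the text; the dictionary keys are labels chosen for this sketch.

```python
from collections import Counter

# Counts by cause as reported above; key names are illustrative labels.
causes = Counter({
    "perception_failure": 5,
    "planning_deadlock": 4,
    "localization_drift": 2,
    "hmi_confusion": 1,
})
distance_km = 500

total = sum(causes.values())
rate_per_100km = total / distance_km * 100
shares = {cause: count / total for cause, count in causes.items()}

print(f"{total} disengagements, {rate_per_100km:.1f} per 100 km")
for cause, share in shares.items():
    print(f"  {cause}: {share:.0%}")
```

Perception failures and planning deadlocks together account for 9 of the 12 events, which is why those two modules dominate the discussion of robustness.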
### Human-Machine Interface
**Takeover reaction time:** Average = 1.8 seconds (range 0.8–4.2 s). Faster when the system provided a visual + auditory alert (1.2 s) vs. visual only (2.3 s).
**NASA-TLX workload:** Mean = 42/100 (moderate). Highest workload reported during system failures (mean = 68/100) and during lane changes in heavy traffic (mean = 55/100).
Effect magnitude
**Sensor fusion reduced localization error by 10–50× compared to GPS alone** (from ~1 m down to ~0.05 m). This is the difference between knowing which lane you're in vs. which block you're on.
**LiDAR + camera fusion improved detection accuracy by 0.17 mAP on KITTI, about 24% relative to camera alone** (mAP 0.89 vs. 0.72). Read loosely, this is on the order of missing 1 pedestrian per 100 km vs. 3–4 per 100 km, though mAP does not map directly onto per-kilometer miss counts.
**MPC planning reduced trajectory jerk, measured as curvature change, by ~5× compared to RRT** (0.1 vs. 0.5 rad/m). This translates to a noticeably smoother ride; passengers reported less motion sickness in informal testing.
**Disengagement rate of 2.4 per 100 km** means that on a typical 20 km commute you'd expect about 0.5 disengagements per trip, or roughly one every other trip. For comparison, Waymo's 2019 California DMV report lists about 0.076 disengagements per 1,000 miles (roughly 0.005 per 100 km), about 500× lower.
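The commute figure can be checked with a short calculation. The expected count per trip follows directly from the reported rate; the probability of at least one disengagement additionally assumes events are Poisson-distributed, which is an assumption of this sketch and not a claim from the paper.

```python
import math

def trip_stats(rate_per_100km, trip_km):
    """Expected disengagements per trip and P(at least one), assuming Poisson arrivals."""
    expected = rate_per_100km * trip_km / 100.0
    p_at_least_one = 1.0 - math.exp(-expected)
    return expected, p_at_least_one

expected, p_any = trip_stats(rate_per_100km=2.4, trip_km=20.0)
print(f"expected per trip: {expected:.2f}, P(at least one): {p_any:.0%}")
```

The expected count is 0.48 per trip. Under the Poisson assumption the chance that a given trip has at least one disengagement is about 38%, because some trips will have more than one.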
Limitations
### Acknowledged by Authors
Testing only in daytime, clear weather. No rain, fog, snow, or night driving.
Single vehicle platform; results may not generalize to other sensor configurations.
No formal safety validation (e.g., ISO 26262 functional safety analysis).
Survey may miss recent work (cutoff date: early 2020).
### Critical Reader Observations
**No statistical rigor.** Without confidence intervals or replication, the reported performance differences could be due to random variation (e.g., traffic conditions, sensor noise).
**Small test set.** 500 km is trivial for autonomous driving. Industry standard for safety validation is millions of kilometers (e.g., Waymo has driven >20 million miles on public roads).
**No adversarial robustness testing.** The authors did not test against common failure modes like sensor occlusion (mud on camera), adversarial patches (stickers on stop signs), or GPS spoofing.
**Publication bias.** The survey likely overrepresents successful algorithms. Failed approaches (e.g., pure camera-based localization) are underrepresented.
**Hardware dependence.** The NVIDIA Drive PX2 is a 2016-era platform. Modern hardware (e.g., Drive Orin) would likely improve computation times by 2–5×, potentially changing which algorithms are "real-time feasible."
**No cost analysis.** The sensor suite (LiDAR + cameras + radar + GPS/IMU) costs >$100,000. Findings may not apply to consumer-grade systems (e.g., Tesla's camera-only approach).
Practical takeaways
For someone running their own n=1 experiment (e.g., building a personal autonomous driving research platform):
### What to Test
**Sensor fusion vs. single-modality perception.** Compare a camera-only object detection pipeline (e.g., YOLOv8 on a single USB camera) against a camera + LiDAR fusion pipeline (e.g., using a low-cost LiDAR like the Ouster OS1-64).
**Localization method.** Compare GPS-only (phone GPS, ~5 m accuracy) vs. GPS + visual odometry (using ORB-SLAM3) vs. GPS + LiDAR (using Cartographer).
### Minimum Meaningful Duration
**At least 100 km per condition** (e.g., 100 km with camera-only, 100 km with fusion). This gives you ~100–200 detection events per condition (assuming 1–2 objects per km in urban driving).
**Test across at least 3 different environments:** urban (dense traffic), suburban (moderate traffic), and highway (high speed, sparse traffic). Each environment should be at least 30 km.
**Include at least 2 weather conditions** if possible: dry daytime and wet nighttime (or dusk). Even a 10 km test in rain can reveal failure modes.
### What to Measure
**Primary metric:** Disengagement rate (number of times you must take over manual control) per 100 km. Log the cause (perception failure, planning deadlock, localization error, etc.).
**Secondary metrics:**
- Object detection: precision and recall for pedestrians, cyclists, and vehicles (you can manually label a subset of frames, e.g., every 100th frame).
- Localization error: compare estimated position against a known ground truth (e.g., a pre-surveyed route with RTK GPS).
- Planning smoothness: log steering wheel angle and acceleration (use an IMU). Compute jerk (derivative of acceleration) in m/s³.
- Computation time: log per-module latency (ms). Target: <50 ms total for real-time control.
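Two of the secondary metrics can be computed from raw logs with very little code. This is a minimal sketch assuming evenly sampled longitudinal acceleration and hand-counted detection outcomes; the function names and sample values are hypothetical.

```python
def jerk_series(accel_mps2, dt_s):
    """Finite-difference jerk (m/s^3) from evenly sampled acceleration (m/s^2)."""
    if dt_s <= 0:
        raise ValueError("sample interval must be positive")
    return [(a1 - a0) / dt_s for a0, a1 in zip(accel_mps2, accel_mps2[1:])]

def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall from counts on manually labeled frames."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Hypothetical IMU samples at 10 Hz and hypothetical pedestrian detection counts.
jerk = jerk_series([0.0, 0.5, 0.5, -0.5], dt_s=0.1)
peak_jerk = max(abs(j) for j in jerk)
precision, recall = precision_recall(true_pos=80, false_pos=20, false_neg=20)
print(f"peak |jerk| = {peak_jerk:.1f} m/s^3, precision = {precision:.2f}, recall = {recall:.2f}")
```

Raw IMU data is noisy, so in practice the acceleration signal usually needs low-pass filtering before differencing; that step is left out here.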
### Key Confounds to Control For
**Route order effects.** Don't always test camera-only on the same route first. Randomize or counterbalance the order of conditions (e.g., Monday: camera-only on Route A, fusion on Route B; Tuesday: swap).
**Time of day.** Test each condition at the same time of day (±1 hour) to control for lighting and traffic density.
**Sensor calibration.** Recalibrate camera-LiDAR extrinsics before each test session. In the benchmark above, alignment errors above 0.1° reduced detection mAP by 0.15.
**Software version.** Use the same software stack (same OS, same library versions) for all tests. A library update mid-experiment can invalidate comparisons.
**Driver fatigue.** If you are the driver, take breaks between conditions. Fatigue increases reaction time and may bias disengagement counts.
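The route-order control above can be planned in advance so that the schedule is not left to habit. The sketch below is one possible approach, not a prescribed protocol: it cycles through every condition-to-route assignment so each pairing occurs equally often, and shuffles which run goes first each day. The function name, condition labels, and route labels are hypothetical.

```python
import itertools
import random

def build_schedule(conditions, routes, n_days, seed=0):
    """Return a per-day list of (condition, route) runs in the order they should be driven."""
    if len(conditions) != len(routes):
        raise ValueError("this sketch pairs each condition with exactly one route per day")
    rng = random.Random(seed)
    # Every way of assigning conditions to routes; cycling through them counterbalances pairings.
    assignments = list(itertools.permutations(conditions))
    rng.shuffle(assignments)
    schedule = []
    for day in range(n_days):
        assignment = assignments[day % len(assignments)]
        runs = list(zip(assignment, routes))
        rng.shuffle(runs)  # randomize which run is driven first that day
        schedule.append(runs)
    return schedule

for day, runs in enumerate(build_schedule(["camera_only", "fusion"], ["route_A", "route_B"], n_days=4), 1):
    print(f"day {day}: {runs}")
```

Pairings are balanced only when the number of days is a multiple of the number of assignments (2 for two conditions). Fixing the seed makes the plan reproducible, so it can be written down before any driving starts.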
### What a Positive Result Would Look Like
**Sensor fusion reduces disengagement rate by at least 50%** compared to camera-only (e.g., from 5 disengagements per 100 km to 2.5 per 100 km).
**Localization error drops below 0.5 m** with fusion vs. >2 m with GPS-only. This is enough to tell which lane you are in, though still short of the <0.1 m lane-level target used in the paper.
**Object detection recall for pedestrians improves by at least 20 percentage points** (e.g., from 60% to 80%) when adding LiDAR.
**Trajectory curvature change decreases by at least 30%** (e.g., from 0.4 rad/m to 0.28 rad/m) when using MPC vs. a simpler controller.
If you see these magnitudes, you can be confident that the algorithmic improvement is practically meaningful—not just a statistical artifact. If the differences are smaller (e.g., 5% improvement in mAP