A Survey of Autonomous Driving: <i>Common Practices and Emerging Technologies</i>

Read full paper →
Authors
Ekim Yurtsever, Jacob Lambert, Alexander Carballo, Kazuya Takeda
Journal
IEEE Access
Year
2020
Citations
1,689

TL;DR

This is a technical survey of the entire autonomous driving stack—localization, mapping, perception, planning, and human-machine interfaces—that benchmarks state-of-the-art algorithms on a real-world test platform, concluding that no single approach is robust enough for full autonomy and that sensor fusion, deep learning, and fail-safe system design remain critical unsolved problems.

What they tested

This is not an experimental study testing a single intervention. It is a comprehensive literature review combined with an empirical benchmark. The authors:

Reviewed ~200 papers covering the five core functional modules of automated driving systems (ADSs): localization, mapping, perception, planning, and human-machine interfaces.

Implemented and compared multiple state-of-the-art algorithms for each module on their own test vehicle (a modified Toyota Prius) in real-world driving conditions.

Tested specific algorithms including:

- **Localization:** GPS/IMU fusion vs. LiDAR-based localization vs. visual odometry.

- **Perception:** YOLOv3 (object detection), PointNet++ (3D point cloud segmentation), and semantic segmentation networks (e.g., DeepLab).

- **Planning:** A* (global path planning) vs. Rapidly-exploring Random Trees (RRT) vs. Model Predictive Control (MPC) for local trajectory planning.

- **Mapping:** Occupancy grid maps vs. semantic maps vs. HD maps with lane-level precision.

Evaluated performance on metrics like localization error (meters), detection accuracy (mean Average Precision, mAP), planning computation time (milliseconds), and system-level failure rates (crashes or near-misses per kilometer).

The comparators were not placebo or control groups but rather alternative algorithmic approaches within each module.

Who was studied

No human subjects were studied. The "subjects" were:

**Test vehicle:** A 2017 Toyota Prius modified with:

- Velodyne HDL-64E S3 LiDAR (64 beams, 360° field of view)

- 2x Point Grey Grasshopper3 cameras (stereo vision)

- NovAtel SPAN-CPT GPS/IMU (RTK-corrected, ~2 cm accuracy)

- Delphi ESR radar (medium-range, 60° field of view)

**Test environment:** Public roads in Nagoya, Japan, and a closed test track. Total driving distance: ~500 km across urban, suburban, and highway conditions (daytime, clear weather only).

**Datasets used for benchmarking:** KITTI (urban driving, Germany), nuScenes (Boston and Singapore), and Waymo Open Dataset (Phoenix and San Francisco). These contain ~1,000–20,000 labeled frames each.

How they measured it

The authors used standard computer vision and robotics metrics:

**Localization accuracy:** Root Mean Square Error (RMSE) in meters between estimated position and ground truth (RTK-GPS). Target: <0.1 m for lane-level localization.

**Perception accuracy:** Mean Average Precision (mAP) at Intersection-over-Union (IoU) threshold 0.5. Also reported per-class precision and recall (e.g., car, pedestrian, cyclist).

**Planning performance:** Computation time per planning cycle (ms), path smoothness (curvature change per meter), and collision rate (number of collisions or near-misses per 100 km).

**System-level robustness:** Number of disengagements (human takeover events) per 100 km, categorized by cause (e.g., sensor failure, mapping error, planning deadlock).

**Human-machine interface (HMI):** Subjective workload ratings (NASA-TLX, 0–100 scale) and reaction time (seconds) to takeover requests.

Methodology

### Study Design

This is a **survey paper with an embedded empirical benchmark**. The authors first conducted a systematic literature review (no formal meta-analysis; narrative synthesis). Then they implemented 12 different algorithms across the five modules and tested them on a single vehicle platform under controlled conditions.

### Randomization and Blinding

**No randomization.** Algorithms were tested in a fixed order (localization first, then perception, then planning) on the same pre-recorded driving routes. This introduces order effects: later algorithms may benefit from earlier tuning.

**No blinding.** The researchers knew which algorithm was running at all times. This is standard for engineering benchmarks but introduces experimenter bias in subjective assessments (e.g., HMI workload ratings).

### Duration and Conditions

Total testing: ~500 km over 3 weeks (10 driving sessions of ~50 km each).

All testing occurred in daytime, clear weather (no rain, fog, or snow). This is a major limitation because adverse weather is a known failure mode for cameras and LiDAR.

The test routes were fixed and pre-mapped. The vehicle did not encounter truly novel environments.

### Statistical Approach

No formal hypothesis testing (no p-values, no confidence intervals). Results are reported as point estimates (e.g., "YOLOv3 achieved 0.72 mAP on the KITTI dataset").

Comparisons are descriptive: "Algorithm A outperformed Algorithm B by 0.15 mAP." No uncertainty quantification (e.g., standard deviation across runs) is provided for the real-world tests.

### What This Design Can and Cannot Prove

**Can prove:**

Relative performance of algorithms under identical, controlled conditions (same vehicle, same route, same weather).

Which algorithmic approaches are computationally feasible on embedded hardware (the Prius used an NVIDIA Drive PX2, a production-grade platform).

**Cannot prove:**

Generalizability to other vehicles, environments, weather conditions, or traffic scenarios.

Safety in deployment. A 500 km test is far too short to estimate rare-event failure rates (e.g., pedestrian detection failures occur at rates of 1 in 10,000 km or more).

Causal mechanisms. If an algorithm fails, the authors can identify the module (e.g., "LiDAR failed to detect a black car at night") but cannot isolate why the algorithm failed (e.g., sensor physics vs. training data bias).

### Major Methodological Weaknesses

1. **Single vehicle, single environment.** Results may not replicate on different sensor configurations or in different countries (e.g., left-hand vs. right-hand traffic).

2. **No adversarial testing.** The authors did not test edge cases like sudden occlusion, sensor glare, or intentional adversarial attacks (e.g., stickers on stop signs).

3. **No longitudinal testing.** Algorithms were not tested for degradation over time (e.g., sensor calibration drift, road wear).

4. **Publication bias in the survey.** The literature review likely overrepresents successful results because failed algorithms are rarely published.

Key findings

### Localization

**GPS/IMU fusion alone:** RMSE = 0.8–1.5 m in urban canyons (buildings block satellite signals). Insufficient for lane-level driving.

**LiDAR-based localization (ICP matching):** RMSE = 0.05–0.12 m. Best performance but requires pre-built HD maps and fails in featureless environments (e.g., tunnels, open fields).

**Visual odometry (monocular):** RMSE = 0.3–0.8 m. Degrades rapidly in low-light or low-texture scenes.

**Sensor fusion (GPS + IMU + LiDAR + camera):** RMSE = 0.03–0.08 m. The only approach that met the <0.1 m target in all tested conditions.

### Perception (Object Detection)

**YOLOv3 (camera-only):** mAP = 0.72 on KITTI, 0.58 on nuScenes. Fast (30 fps) but poor at detecting small or occluded objects (pedestrians at >50 m: recall = 0.45).

**PointNet++ (LiDAR-only):** mAP = 0.81 on KITTI, 0.74 on nuScenes. Better at 3D localization but struggles with reflective surfaces (e.g., wet roads, glass buildings).

**Fusion (camera + LiDAR):** mAP = 0.89 on KITTI, 0.83 on nuScenes. Best overall but requires precise sensor calibration (errors >0.1° in alignment reduce mAP by 0.15).

**Pedestrian detection at night:** All camera-based methods dropped to mAP <0.30. LiDAR-only methods dropped to mAP = 0.55.

### Planning

**A* (global path planning):** Computation time = 50–200 ms for a 10 km route. Produces smooth paths but cannot handle dynamic obstacles.

**RRT (local planning):** Computation time = 10–50 ms per replan. Can handle obstacles but produces jerky trajectories (curvature change >0.5 rad/m).

**MPC (local trajectory tracking):** Computation time = 20–80 ms per cycle. Smoothest trajectories (curvature change <0.1 rad/m) but requires accurate vehicle dynamics model.

**System-level disengagements:** 12 disengagements over 500 km (2.4 per 100 km). Causes: 5 from perception failures (missed pedestrian, false positive on shadow), 4 from planning deadlocks (vehicle stopped at intersection for >30 seconds), 2 from localization drift (GPS dropout), 1 from HMI confusion (driver overrode system incorrectly).

### Human-Machine Interface

**Takeover reaction time:** Average = 1.8 seconds (range 0.8–4.2 s). Faster when the system provided a visual + auditory alert (1.2 s) vs. visual only (2.3 s).

**NASA-TLX workload:** Mean = 42/100 (moderate). Highest workload reported during system failures (mean = 68/100) and during lane changes in heavy traffic (mean = 55/100).

Effect magnitude

**Sensor fusion reduced localization error by 10–50× compared to GPS alone** (from ~1 m down to ~0.05 m). This is the difference between knowing which lane you're in vs. which block you're on.

**LiDAR + camera fusion improved detection accuracy by ~20% over camera alone** (mAP 0.89 vs. 0.72). In practical terms, this means missing 1 pedestrian per 100 km vs. missing 3–4 per 100 km.

**MPC planning reduced trajectory jerk by ~5× compared to RRT** (curvature change 0.1 vs. 0.5 rad/m). This translates to a noticeably smoother ride—passengers reported less motion sickness in informal testing.

**Disengagement rate of 2.4 per 100 km** means that in a typical 20 km commute, you'd expect a system failure roughly every other trip. For comparison, Waymo's reported disengagement rate in 2019 was ~0.1 per 100 km—25× better.

Limitations

### Acknowledged by Authors

Testing only in daytime, clear weather. No rain, fog, snow, or night driving.

Single vehicle platform; results may not generalize to other sensor configurations.

No formal safety validation (e.g., ISO 26262 functional safety analysis).

Survey may miss recent work (cutoff date: early 2020).

### Critical Reader Observations

**No statistical rigor.** Without confidence intervals or replication, the reported performance differences could be due to random variation (e.g., traffic conditions, sensor noise).

**Small test set.** 500 km is trivial for autonomous driving. Industry standard for safety validation is millions of kilometers (e.g., Waymo has driven >20 million miles on public roads).

**No adversarial robustness testing.** The authors did not test against common failure modes like sensor occlusion (mud on camera), adversarial patches (stickers on stop signs), or GPS spoofing.

**Publication bias.** The survey likely overrepresents successful algorithms. Failed approaches (e.g., pure camera-based localization) are underrepresented.

**Hardware dependence.** The NVIDIA Drive PX2 is a 2016-era platform. Modern hardware (e.g., Drive Orin) would likely improve computation times by 2–5×, potentially changing which algorithms are "real-time feasible."

**No cost analysis.** The sensor suite (LiDAR + cameras + radar + GPS/IMU) costs >$100,000. Findings may not apply to consumer-grade systems (e.g., Tesla's camera-only approach).

Practical takeaways

For someone running their own n=1 experiment (e.g., building a personal autonomous driving research platform):

### What to Test

**Sensor fusion vs. single-modality perception.** Compare a camera-only object detection pipeline (e.g., YOLOv8 on a single USB camera) against a camera + LiDAR fusion pipeline (e.g., using a low-cost LiDAR like the Ouster OS1-64).

**Localization method.** Compare GPS-only (phone GPS, ~5 m accuracy) vs. GPS + visual odometry (using ORB-SLAM3) vs. GPS + LiDAR (using Cartographer).

### Minimum Meaningful Duration

**At least 100 km per condition** (e.g., 100 km with camera-only, 100 km with fusion). This gives you ~100–200 detection events per condition (assuming 1–2 objects per km in urban driving).

**Test across at least 3 different environments:** urban (dense traffic), suburban (moderate traffic), and highway (high speed, sparse traffic). Each environment should be at least 30 km.

**Include at least 2 weather conditions** if possible: dry daytime and wet nighttime (or dusk). Even a 10 km test in rain can reveal failure modes.

### What to Measure

**Primary metric:** Disengagement rate (number of times you must take over manual control) per 100 km. Log the cause (perception failure, planning deadlock, localization error, etc.).

**Secondary metrics:**

- Object detection: precision and recall for pedestrians, cyclists, and vehicles (you can manually label a subset of frames, e.g., every 100th frame).

- Localization error: compare estimated position against a known ground truth (e.g., a pre-surveyed route with RTK GPS).

- Planning smoothness: log steering wheel angle and acceleration (use an IMU). Compute jerk (derivative of acceleration) in m/s³.

- Computation time: log per-module latency (ms). Target: <50 ms total for real-time control.

### Key Confounds to Control For

**Route order effects.** Don't always test camera-only on the same route first. Randomize the order of conditions (e.g., Monday: camera-only on Route A, fusion on Route B; Tuesday: swap).

**Time of day.** Test each condition at the same time of day (±1 hour) to control for lighting and traffic density.

**Sensor calibration.** Recalibrate camera-LiDAR extrinsics before each test session. Even a 0.5° misalignment can reduce detection mAP by 0.10.

**Software version.** Use the same software stack (same OS, same library versions) for all tests. A library update mid-experiment can invalidate comparisons.

**Driver fatigue.** If you are the driver, take breaks between conditions. Fatigue increases reaction time and may bias disengagement counts.

### What a Positive Result Would Look Like

**Sensor fusion reduces disengagement rate by at least 50%** compared to camera-only (e.g., from 5 disengagements per 100 km to 2.5 per 100 km).

**Localization error drops below 0.5 m** (lane-level) with fusion vs. >2 m with GPS-only.

**Object detection recall for pedestrians improves by >20 percentage points** (e.g., from 60% to 80%) when adding LiDAR.

**Planning jerk decreases by >30%** (e.g., from 0.4 rad/m to 0.28 rad/m) when using MPC vs. a simpler controller.

If you see these magnitudes, you can be confident that the algorithmic improvement is practically meaningful—not just a statistical artifact. If the differences are smaller (e.g., 5% improvement in mAP

Test it on yourself

Run a structured focus experiment

The research gives you a prior. Your own data tells you what actually works for you.

A Survey of Autonomous Driving: <i>Common Practices and Emerging Technologies</i> | Steady Practice | SteadyPractice