# VadModel Design

- code: `vad_model.hpp`

## Responsibilities

- Responsible for CUDA-based processing and TensorRT inference
- Receives `VadInputData`, executes inference, and returns `VadOutputData`
- Preprocesses multi-camera images using CUDA via `MultiCameraPreprocessor`
- Manages two-stage inference: backbone network and head network (with/without `prev_bev`)
- Transfers data between Host (CPU) and Device (GPU) using the `Tensor` class
- Post-processes outputs using CUDA:
  - Object detection via `ObjectPostprocessor`
  - Map polylines via `MapPostprocessor`
- Manages `prev_bev` features across frames for temporal context
- Dynamically loads/unloads networks (releases `head_no_prev` after the first frame)
## Processing Flowchart

```mermaid
flowchart TD
    Start([VadNode calls VadModel::infer]) --> VadInputDataBox((VadInputData))
    VadInputDataBox --> LoadInputs
    subgraph InferScope["VadModel::infer()"]
        LoadInputs[load_inputs: Preprocess input data and transfer from CPU to GPU]
        LoadInputs --> Enqueue[enqueue: Execute TensorRT inference]
        Enqueue --> SavePrevBev[save_prev_bev: Transfer prev_bev from GPU to CPU for next frame]
        SavePrevBev --> Postprocess[postprocess: Postprocess output and transfer from GPU to CPU]
        Postprocess --> CheckFirstFrame{Is first frame?}
        CheckFirstFrame -->|Yes| ReleaseNetwork[release_network: Release first-frame-only ONNX]
        ReleaseNetwork --> LoadHead[load_head: Load head ONNX that uses previous frame's BEV features]
        LoadHead --> ReturnText[return]
        CheckFirstFrame -->|No| ReturnText
    end
    ReturnText --> VadOutputDataBox((VadOutputData))
    style InferScope fill:#e1f5fe,stroke:#0288d1,stroke-width:2px,color:#000000
    style CheckFirstFrame fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#000000
    style ReleaseNetwork fill:#ffebee,stroke:#d32f2f,stroke-width:2px,color:#000000
    style LoadHead fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#000000
    %% Links to source code files
    click Start "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/vad_node.hpp" "VadNode header file"
    click VadInputDataBox "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/data_types.hpp" "VadInputData definition"
    click LoadInputs "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/vad_model.hpp" "VadModel implementation"
    click Enqueue "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/vad_model.hpp" "VadModel implementation"
    click SavePrevBev "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/vad_model.hpp" "VadModel implementation"
    click Postprocess "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/vad_model.hpp" "VadModel implementation"
    click ReleaseNetwork "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/vad_model.hpp" "VadModel implementation"
    click LoadHead "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/vad_model.hpp" "VadModel implementation"
    click VadOutputDataBox "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/data_types.hpp" "VadOutputData definition"
```
## Function Roles

### API functions (public)

- `infer(const VadInputData&)`: Main inference pipeline
  - Selects the head network (`head_no_prev` for the first frame, `head` for subsequent frames)
  - Calls `load_inputs()` to preprocess and transfer data to the GPU
  - Calls `enqueue()` to execute the backbone and head networks
  - Calls `save_prev_bev()` to preserve BEV features for the next frame
  - Calls `postprocess()` to extract outputs and transfer them to the CPU
  - On the first frame: releases the `head_no_prev` network and calls `load_head()` to initialize the `head` network
  - Returns `VadOutputData` containing trajectories, objects, and map polylines
### Internal functions (private)

- `init_engines()`: Initializes TensorRT networks at construction
  - Creates `Backbone`, `Head`, and `Head_no_prev` network instances
  - Sets up external input bindings for memory sharing between networks
  - Logs binding connections between networks
- `load_inputs()`: Prepares inputs for inference
  - Preprocesses multi-camera images using `MultiCameraPreprocessor::preprocess_images()`
  - Loads metadata: `shift`, `vad_base2img` (lidar2img), `can_bus`
  - Sets the `prev_bev` binding for the `head` network (not needed for `head_no_prev`)
- `enqueue()`: Executes inference
  - Enqueues the backbone network
  - Enqueues the selected head network
  - Synchronizes the CUDA stream to wait for completion
- `save_prev_bev()`: Preserves BEV features for the next frame
  - Copies `out.bev_embed` from the head output to the `prev_bev` tensor
  - Uses a `cudaMemcpyAsync` Device-to-Device transfer
  - Returns a shared pointer to the `Tensor` stored in the `saved_prev_bev_` member
- `postprocess()`: Extracts and processes outputs
  - Retrieves `out.ego_fut_preds` (trajectory predictions) from the GPU
  - Processes detected objects via `ObjectPostprocessor::postprocess_objects()`
  - Processes map polylines via `MapPostprocessor::postprocess_map_preds()`
  - Performs a cumulative sum on trajectory points to convert deltas to absolute positions
  - Extracts trajectories for all commands and the selected command
  - Returns `VadOutputData` with all processed outputs
- `release_network()`: Frees network resources
  - Clears the bindings map
  - Resets and erases the network from `nets_map`
  - Called to release `head_no_prev` after the first frame
- `load_head()`: Initializes the head network with `prev_bev` support
  - VAD requires the previous frame's BEV features as input
  - The first frame uses `head_no_prev` (no previous BEV available)
  - After the first frame, switches to `head`, which uses `prev_bev` from the previous inference
  - Sets up external bindings to connect with backbone outputs
## Design concepts

### Logger Pattern

- Goal: `VadModel` should not depend on ROS, but needs logging capabilities
- Solution: Template-based dependency injection with an abstract logger interface
  - `VadLogger`: Abstract base class defining the logging interface
  - `RosVadLogger`: Concrete implementation using ROS 2 logging macros (`RCLCPP_INFO_THROTTLE`, etc.)
  - `VadModel<LoggerType>`: Template class parameterized by the logger type
    - Enforces that `LoggerType` inherits from `VadLogger` via `static_assert`
    - Enables unit testing with mock loggers
    - Maintains consistent logging behavior across the ROS and CUDA domains
- Trade-off: The template implementation requires all code to live in the header file (`.hpp`), per C++ template constraints
Network Architecture#
- Each ONNX file corresponds to one
Netclass Netclass usesautoware_tensorrt_commonto build and execute TensorRT engines- Three network types:
- Backbone: Processes multi-camera images and extracts features
- Head_no_prev: First-frame head without previous BEV features
- Head: Subsequent-frame head using
prev_bevfrom previous inference
- Memory management via
Tensorclass- Handles
cudaMallocfor GPU memory allocation - Handles
cudaMemcpyAsyncfor Host↔Device transfers
- Handles
- Bindings sharing: Outputs from one network can be directly used as inputs to another (e.g., backbone outputs → head inputs)
```mermaid
flowchart TD
    VadModel[VadModel]
    VadModel --> VadModelInit[VadModel::init_engines]
    VadModel --> VadModelEnqueue[VadModel::enqueue]
    VadModelInit --> NetConstructor[Net Constructor]
    VadModelInit --> NetSetInputTensor[Net::set_input_tensor]
    VadModelEnqueue --> NetEnqueue[Net::enqueue]
    subgraph NetClass["Net Class"]
        subgraph InitProcess["Net::init_tensorrt"]
            SetupIO[setup_network_io]
            BuildEngine[build_engine]
        end
        subgraph TensorSubgraph["Tensor Class"]
            CudaMalloc[cudaMalloc]
        end
        subgraph TrtCommonSubgraph["TrtCommon Class"]
            EnqueueV3[enqueueV3]
        end
        NetConstructor --> InitProcess
        NetSetInputTensor --> TensorSubgraph
        NetEnqueue --> TrtCommonSubgraph
    end
    style NetClass fill:#e1f5fe,stroke:#0288d1,stroke-width:2px,color:#000000
    style InitProcess fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000000
    style TensorSubgraph fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#000000
    style TrtCommonSubgraph fill:#ffebee,stroke:#d32f2f,stroke-width:2px,color:#000000
    style VadModel fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#000000
    style VadModelInit fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#000000
    style VadModelEnqueue fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#000000
    %% Links to source code files
    click CudaMalloc "https://github.com/autowarefoundation/autoware_universe/tree/main/e2e/autoware_tensorrt_vad/lib/networks/tensor.cpp" "Tensor class implementation"
    click BuildEngine "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/networks/net.hpp" "Net class implementation"
    click NetEnqueue "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/networks/net.hpp" "Net class implementation"
    click NetSetInputTensor "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/networks/net.hpp" "Net class implementation"
    click NetConstructor "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/networks/net.hpp" "Net class implementation"
    click EnqueueV3 "https://github.com/autowarefoundation/autoware_universe/blob/main/perception/autoware_tensorrt_common/src/tensorrt_common.cpp" "enqueueV3 implementation"
    click VadModelInit "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/vad_model.hpp" "VadModel implementation"
    click VadModelEnqueue "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/vad_model.hpp" "VadModel implementation"
    click InitProcess "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/networks/net.hpp" "Net class implementation"
    click SetupIO "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/networks/net.hpp" "Net class implementation"
    click TensorSubgraph "https://github.com/autowarefoundation/autoware_universe/tree/main/e2e/autoware_tensorrt_vad/lib/networks/tensor.cpp" "Tensor class implementation"
```
### Network classes: API functions

- Constructor → `init_tensorrt`
- `set_input_tensor`: Called from `VadModel::init_engines` and `VadModel::load_head`
  - Accepts a map of external bindings (tensors from other networks)
  - Allocates GPU memory for new tensors via the `Tensor` constructor
  - Sets up memory sharing for external inputs (avoids redundant copies)
- `enqueue`: Called from `VadModel::enqueue`
  - Executes TensorRT inference via `TrtCommon::enqueueV3`
  - Runs asynchronously on the CUDA stream
  - Note: Uses `enqueueV3` for the current TensorRT version; update if the TensorRT API changes
### CUDA Preprocessor and Postprocessor Architecture

The preprocessor and postprocessor classes wrap CUDA kernels and provide clean C++ interfaces to `VadModel`.

Preprocessor/Postprocessor classes (CPU-side wrappers):

- `MultiCameraPreprocessor`: Multi-camera image preprocessing
- `ObjectPostprocessor`: 3D object detection postprocessing
- `MapPostprocessor`: Map polyline postprocessing

Processing flow:

1. `VadModel` calls `preprocess_*()` or `postprocess_*()` methods
2. These methods call `launch_*_kernel()` functions
3. The kernel launch functions calculate CUDA grid/block dimensions
4. The CUDA kernels execute on the GPU

CUDA kernel launch functions:

- `launch_multi_camera_resize_kernel`: Resize images to the target resolution
- `launch_multi_camera_normalize_kernel`: Normalize pixel values
- `launch_object_postprocess_kernel`: Filter and decode object detections
- `launch_map_postprocess_kernel`: Decode map polylines

Benefits:

- Separation of concerns: C++ wrapper logic vs. CUDA kernel logic
- Testability: Preprocessors/postprocessors can be mocked
- Performance: Kernels are optimized for parallel execution on the GPU
```mermaid
flowchart TD
    VadModel[VadModel] --> PreprocessCall[Call preprocess_* functions]
    VadModel --> PostprocessCall[Call postprocess_* functions]
    subgraph PreprocessorClasses["Preprocessor (CPU, Host)"]
        MultiCamera[MultiCameraPreprocessor]
    end
    subgraph PostprocessorClasses["Postprocessor (CPU, Host)"]
        ObjectPost[ObjectPostprocessor]
        MapPost[MapPostprocessor]
    end
    PreprocessCall --> MultiCamera
    PostprocessCall --> ObjectPost
    PostprocessCall --> MapPost
    MultiCamera --> LaunchResize[launch_multi_camera_resize_kernel]
    MultiCamera --> LaunchNormalize[launch_multi_camera_normalize_kernel]
    ObjectPost --> LaunchObject[launch_object_postprocess_kernel]
    MapPost --> LaunchMap[launch_map_postprocess_kernel]
    subgraph CudaKernels["Kernels (GPU, Device)"]
        ResizeKernel[multi_camera_resize_kernel]
        NormalizeKernel[multi_camera_normalize_kernel]
        ObjectKernel[object_postprocess_kernel]
        MapKernel[map_postprocess_kernel]
    end
    LaunchResize --> ResizeKernel
    LaunchNormalize --> NormalizeKernel
    LaunchObject --> ObjectKernel
    LaunchMap --> MapKernel
    style PreprocessorClasses fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000000
    style PostprocessorClasses fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000000
    style CudaKernels fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#000000
    style VadModel fill:#e8f5e8,stroke:#388e3c,stroke-width:2px,color:#000000
    %% Links to source code files
    click VadModel "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/vad_model.hpp" "VadModel implementation"
    click MultiCamera "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/networks/preprocess/multi_camera_preprocess.hpp" "MultiCameraPreprocessor"
    click ObjectPost "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/networks/postprocess/object_postprocess.hpp" "ObjectPostprocessor"
    click MapPost "https://github.com/autowarefoundation/autoware_universe/tree/main/planning/autoware_tensorrt_v../src/networks/postprocess/map_postprocess.hpp" "MapPostprocessor"
    click LaunchResize "https://github.com/autowarefoundation/autoware_universe/tree/main/e2e/autoware_tensorrt_vad/lib/networks/preprocess/multi_camera_preprocess_kernel.cu" "Resize kernel implementation"
    click LaunchNormalize "https://github.com/autowarefoundation/autoware_universe/tree/main/e2e/autoware_tensorrt_vad/lib/networks/preprocess/multi_camera_preprocess_kernel.cu" "Normalize kernel implementation"
    click LaunchObject "https://github.com/autowarefoundation/autoware_universe/tree/main/e2e/autoware_tensorrt_vad/lib/networks/postprocess/object_postprocess_kernel.cu" "Object postprocess kernel implementation"
    click LaunchMap "https://github.com/autowarefoundation/autoware_universe/tree/main/e2e/autoware_tensorrt_vad/lib/networks/postprocess/map_postprocess_kernel.cu" "Map postprocess kernel implementation"
```
## Key Design Details

### Two-Stage Network Loading

- First frame: Uses `head_no_prev` (no temporal context required)
- Subsequent frames: Use `head` (incorporates `prev_bev` from the previous frame)
- Memory optimization: `head_no_prev` is released after the first frame to free GPU memory

### Memory Management Strategy

- Bindings sharing: Network outputs can be connected directly as inputs to other networks
- `Tensor` class: Encapsulates CUDA memory operations (malloc, copy, free)
- Saved BEV: `prev_bev` is preserved on the GPU between frames (Device-to-Device copy)

### Asynchronous Execution

- All CUDA operations use a single `cudaStream_t stream_` member
- Operations are asynchronous but synchronized before postprocessing
- Enables overlapping computation when possible
## TODO

- Optimize the `prev_bev` transfer: currently a Device→Device copy; it could be kept as an output binding to avoid the copy
- Quantization: currently FP32; INT8 quantization would improve speed and memory efficiency
- Multi-stream execution: backbone and head execution could be pipelined across multiple streams