Performance Troubleshooting#
Overall symptoms:
- Autoware is running slower than expected
- Messages show up late in RViz2
- Point clouds are lagging
- Camera images are lagging behind
- Point clouds or markers flicker on RViz2
- When multiple subscribers use the same publishers, the message rate drops
Diagnostic Steps#
Check if multicast is enabled#
Target symptoms#
- When multiple subscribers use the same publishers, the message rate drops
Diagnosis#
Make sure that multicast is enabled for your network interface.
For example, run the following:
source /opt/ros/humble/setup.bash
ros2 run demo_nodes_cpp talker
If you get the error message selected interface "{your-interface-name}" is not multicast-capable: disabling multicast, then multicast is disabled on that interface and this should be fixed.
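The interface flags can also be inspected directly, without starting a ROS 2 node. The following is a minimal sketch, assuming the iproute2 `ip` tool is installed; the helper name `link_has_multicast` is hypothetical and only parses the flag list printed by `ip -o link show`.

```shell
#!/usr/bin/env bash
# Return success if the `ip link` output line passed as $1 lists the MULTICAST flag.
link_has_multicast() {
    local line="$1"
    # Flags appear between angle brackets, e.g. <LOOPBACK,MULTICAST,UP,LOWER_UP>.
    local flags="${line#*<}"
    flags="${flags%%>*}"
    case ",${flags}," in
        *,MULTICAST,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (`lo` is only an example interface name):
#   if link_has_multicast "$(ip -o link show lo)"; then
#       echo "multicast is enabled on lo"
#   else
#       echo "multicast is disabled on lo"
#   fi
```

Replace `lo` with the interface that DDS actually uses on your machine.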
Solution#
Run the following command to allow multicast:
sudo ip link set multicast on {your-interface-name}
This way DDS will function as intended and multiple subscribers can receive data from a single publisher without any significant degradation in performance.
This is a temporary solution and will be reverted when the computer restarts.
To make it permanent, either:
- Create a service to run this command on startup (recommended)
- OR add the following lines to the ~/.bashrc file:
if [ ! -e /tmp/multicast_is_set ]; then
    sudo ip link set lo multicast on
    touch /tmp/multicast_is_set
fi
  - This will probably ask for your password in the terminal every time you restart the computer.
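For the recommended service option, one possible shape is a oneshot systemd unit. The sketch below only prints such a unit; the function name `print_multicast_unit`, the unit file name `multicast-lo.service`, and the `/usr/sbin/ip` path are assumptions (check the path with `command -v ip`), not something prescribed by this guide.

```shell
#!/usr/bin/env bash
# Print a oneshot systemd unit that enables multicast on interface $1 (default: lo).
# Assumption: the `ip` binary lives at /usr/sbin/ip on this system.
print_multicast_unit() {
    local iface="${1:-lo}"
    cat <<EOF
[Unit]
Description=Enable multicast on ${iface}
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ip link set ${iface} multicast on

[Install]
WantedBy=multi-user.target
EOF
}

# Usage (hypothetical unit name):
#   print_multicast_unit lo | sudo tee /etc/systemd/system/multicast-lo.service
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now multicast-lo.service
```

Unlike the ~/.bashrc approach, a unit like this runs as root at boot, so no password prompt should appear in the terminal.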
Check the compilation flags#
Target symptoms#
- Autoware is running slower than expected
- Point clouds are lagging
- When multiple subscribers use the same publishers, the message rate drops even further
Diagnosis#
Check the ~/.bash_history file to see if there are any colcon build commands without the -DCMAKE_BUILD_TYPE=Release or -DCMAKE_BUILD_TYPE=RelWithDebInfo flag.
Even if a workspace was first built with one of these flags, compiling the same workspace again without them will still result in a slow build.
As a result, the nodes will run slowly in general, especially the pointcloud_preprocessor nodes.
Example issue: issue2597
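The history check can be scripted. This is a minimal sketch, assuming your builds were started from bash and that the history file has been written to disk; the function name `find_unoptimized_builds` is hypothetical.

```shell
#!/usr/bin/env bash
# Print every `colcon build` line in the history file $1 (default: ~/.bash_history)
# that does not request the Release or RelWithDebInfo build type.
find_unoptimized_builds() {
    local history_file="${1:-$HOME/.bash_history}"
    # `-e` is needed because the pattern itself starts with a dash.
    # `|| true` keeps the exit status at 0 when nothing matches.
    grep -E 'colcon build' "$history_file" \
        | grep -Ev -e '-DCMAKE_BUILD_TYPE=(Release|RelWithDebInfo)' || true
}

# Usage:
#   find_unoptimized_builds            # scan ~/.bash_history
#   find_unoptimized_builds /some/file # scan another history file
```

Any line printed by this function is a build that may have left the workspace without optimizations.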
Solution#
- Remove the build, install and, optionally, log folders in the main autoware folder.
- Compile Autoware with either the Release or RelWithDebInfo build type:
colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release
# Or build with debug symbols too (comparable performance, but you can also debug)
colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=RelWithDebInfo
Check the DDS settings#
Target symptoms#
- Autoware is running slower than expected
- Messages show up late in RViz2
- Point clouds are lagging
- Camera images are lagging behind
- When multiple subscribers use the same publishers, the message rate drops
Check the RMW (ROS Middleware) implementation#
Diagnosis#
Run the following to check the middleware used:
echo $RMW_IMPLEMENTATION
The output should be rmw_cyclonedds_cpp. If not, apply the solution.
If you are using a different DDS middleware, it might not be officially supported yet.
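The same check can be wrapped in a small function for use in a setup or health-check script. This is a sketch only; the function name `check_rmw` and its messages are hypothetical.

```shell
#!/usr/bin/env bash
# Compare the RMW implementation ($1, or $RMW_IMPLEMENTATION if no argument
# is given) against the middleware expected by this guide.
check_rmw() {
    local expected="rmw_cyclonedds_cpp"
    local actual="${1-${RMW_IMPLEMENTATION:-}}"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $expected"
    else
        echo "MISMATCH: got '${actual:-<unset>}', expected $expected"
        return 1
    fi
}

# Usage:
#   check_rmw || echo "apply the solution below"
```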
Solution#
Add export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp as a separate line in your ~/.bashrc file.
Check if the CycloneDDS is configured correctly#
Diagnosis#
Run the following to check the path of the CycloneDDS configuration .xml file:
echo $CYCLONEDDS_URI
The output should be a valid path pointing to an .xml file with the CycloneDDS configuration.
Also check that the file is configured correctly:
cat "${CYCLONEDDS_URI#file://}"
This should print the contents of the .xml file to the terminal. The #file:// expansion strips a leading file:// prefix, if the variable has one.
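Both checks can be combined into one function. The sketch below assumes that CYCLONEDDS_URI holds a single file path, optionally prefixed with `file://`, and that the configuration uses `<CycloneDDS>` as its root element; the function name `check_cyclonedds_uri` is hypothetical. It does not validate the XML itself.

```shell
#!/usr/bin/env bash
# Verify that $1 (or $CYCLONEDDS_URI if no argument is given) points to a
# readable file that looks like a CycloneDDS configuration.
check_cyclonedds_uri() {
    local uri="${1-${CYCLONEDDS_URI:-}}"
    # Strip an optional file:// prefix to obtain a plain path.
    local path="${uri#file://}"
    if [ -z "$path" ]; then
        echo "CYCLONEDDS_URI is not set"
        return 1
    fi
    if [ ! -r "$path" ]; then
        echo "not readable: $path"
        return 1
    fi
    if ! grep -q '<CycloneDDS' "$path"; then
        echo "no <CycloneDDS> element found in $path"
        return 1
    fi
    echo "OK: $path"
}

# Usage:
#   check_cyclonedds_uri || echo "apply the solution below"
```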
Solution#
Follow the DDS settings: Tuning DDS documentation and make sure:
- you have export CYCLONEDDS_URI=/absolute_path_to_your/cyclonedds_config.xml as a line in your ~/.bashrc file.
- you have the cyclonedds_config.xml file with the configuration provided in the documentation.
Check the Linux kernel maximum buffer size#
Diagnosis#
- Run: sysctl net.core.rmem_max, it should return at least net.core.rmem_max = 2147483647.
  - This parameter specifies the maximum size, in bytes, of the receive buffer that a socket may request. With a larger maximum buffer size, the operating system can absorb larger bursts of data, which helps reduce packet loss and results in more reliable data transfers.
- Run: sysctl net.ipv4.ipfrag_time, it should return around net.ipv4.ipfrag_time = 3.
  - The net.ipv4.ipfrag_time parameter specifies the maximum time, in seconds, that the kernel keeps the fragments of an incomplete IP packet before discarding them. The default value is usually 30 seconds, but it may vary depending on the operating system and configuration.
  - With a lower value, such as 3 seconds, the kernel frees memory more quickly by discarding fragments of packets that will never be completed, which can help improve the overall performance and stability of the system.
- Run: sysctl net.ipv4.ipfrag_high_thresh, it should return around net.ipv4.ipfrag_high_thresh = 134217728.
  - The net.ipv4.ipfrag_high_thresh parameter specifies the maximum amount of memory, in bytes, that the kernel uses to reassemble fragmented IP packets. When this limit is reached, the kernel drops incoming fragments until memory is freed.
  - With a higher value, such as 134217728 (128 MB), the kernel can hold more fragments while they wait for reassembly, which helps applications that send large messages, such as point clouds and camera images.
More info on these values: Cross-vendor tuning
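The three checks above can be run together. This is a minimal sketch; the helper name `check_value` is hypothetical, and treating "around" as an exact match for the two ipfrag parameters is an assumption made here for simplicity.

```shell
#!/usr/bin/env bash
# Compare a kernel parameter value against a recommended value.
#   $1 name, $2 mode ("min": actual >= recommended, anything else: exact match),
#   $3 recommended value, $4 actual value
check_value() {
    local name="$1" mode="$2" recommended="$3" actual="$4"
    local op="-eq"
    [ "$mode" = "min" ] && op="-ge"
    if [ "$actual" "$op" "$recommended" ]; then
        echo "OK   $name = $actual"
    else
        echo "FAIL $name = $actual (recommended: $recommended)"
        return 1
    fi
}

# Usage (`sysctl -n` prints only the value):
#   check_value net.core.rmem_max min 2147483647 "$(sysctl -n net.core.rmem_max)"
#   check_value net.ipv4.ipfrag_time eq 3 "$(sysctl -n net.ipv4.ipfrag_time)"
#   check_value net.ipv4.ipfrag_high_thresh eq 134217728 "$(sysctl -n net.ipv4.ipfrag_high_thresh)"
```

Any FAIL line indicates a parameter that the solution below should correct.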
Solution#
Either:
- Create the following file: sudo touch /etc/sysctl.d/10-cyclone-max.conf (recommended)
  - Edit the file (sudo gedit /etc/sysctl.d/10-cyclone-max.conf) to contain:
net.core.rmem_max=2147483647
net.ipv4.ipfrag_time=3
# 128 MB
net.ipv4.ipfrag_high_thresh=134217728
  - Either restart the computer or run the following to apply the changes:
sudo sysctl -w net.core.rmem_max=2147483647
sudo sysctl -w net.ipv4.ipfrag_time=3
sudo sysctl -w net.ipv4.ipfrag_high_thresh=134217728
- OR add the following lines to the ~/.bashrc file:
if [ ! -e /tmp/kernel_network_conf_is_set ]; then
    sudo sysctl -w net.core.rmem_max=2147483647
    sudo sysctl -w net.ipv4.ipfrag_time=3
    sudo sysctl -w net.ipv4.ipfrag_high_thresh=134217728 # (128 MB)
    touch /tmp/kernel_network_conf_is_set
fi
  - This will probably ask for your password in the terminal every time you restart the computer.
Check if ROS localhost only communication is enabled#
- If you are using a multi-computer setup, please skip this check.
- Enabling ROS localhost only communication can help improve the performance of ROS by reducing network traffic and avoiding potential conflicts with other devices on the network.
- Also check Enable localhost-only communication
Target symptoms#
- You see topics that shouldn't exist
- You see point clouds that don't belong to your machine
- They might be from another computer running ROS 2 on your network
- Point clouds or markers flicker on RViz2
- Another publisher (on another machine) may be publishing on the same topic as your node, causing the flickering.
Diagnosis#
Run the following to check it:
echo $ROS_LOCALHOST_ONLY
The output should be 1. If not, apply the solution.
Solution#
- Add export ROS_LOCALHOST_ONLY=1 as a separate line in your ~/.bashrc file.
  - This environment variable tells ROS to only use the loopback network interface (i.e., localhost) for communication, rather than the network interface card (NIC) for Ethernet or Wi-Fi. This can reduce network traffic and potential conflicts with other devices on the network, resulting in better performance and stability.