Performance Overview

IPC, Context-Switch and Syscall Performance

Because L4Re is a microkernel-based system and hypervisor, its IPC, syscall, and context-switch performance is of particular interest. IPC is a base-level communication mechanism that exchanges a limited amount of payload data between two threads. A context switch is the switch from one executing thread to another, which is exactly what sending a message does. The fastest IPC is between two threads running in the same address space (task) on the same CPU core (Intra). Inter is IPC between threads in two different address spaces. A syscall is also an IPC, but it communicates only with the kernel.

The following table lists the cost of a single IPC on various popular platforms, averaged over several tens of thousands of calls. For the measurements, the L4Re microkernel was built in its performance configuration (CONFIG_PERFORMANCE=y), i.e., without assertions.

The source code of the benchmark program can be found here. The images used for the measurements are linked in the table below.

All numbers are measured with hardware performance counters: the cycle counter on Arm and the fixed-function counters on x86.
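The averaging described above can be sketched as follows. This is a minimal illustration in Python, not the actual benchmark program, which runs natively on L4Re. The names `read_counter` and `operation` are hypothetical stand-ins for reading the hardware cycle counter and for performing one IPC or syscall.

```python
from typing import Callable


def average_cycles(read_counter: Callable[[], int],
                   operation: Callable[[], None],
                   iterations: int) -> float:
    """Return the mean number of counter ticks per call of `operation`.

    `read_counter` is a hypothetical callable returning the current
    value of a monotonically increasing cycle counter.
    """
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    start = read_counter()          # counter value before the loop
    for _ in range(iterations):
        operation()                 # one IPC or syscall in a real benchmark
    end = read_counter()            # counter value after the loop
    return (end - start) / iterations
```

Reading the counter only once before and once after the loop keeps the cost of the counter access itself out of the per-call average, at the price of including the loop overhead.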

Platform	Processor	Cfg	IPC Intra (CPU cycles)	IPC Inter (CPU cycles)	Syscall	Image
Raspberry Pi 5 64bit	Arm Cortex-A76	EL1	247	384	138	▶️
Raspberry Pi 5 64bit	Arm Cortex-A76	EL2	300	401	202	▶️
Raspberry Pi 4 64bit	Arm Cortex-A72	EL1	505	702	305	▶️
Raspberry Pi 4 64bit	Arm Cortex-A72	EL2	1388 [4]	1600 [4]	567 [4]	▶️
NXP S32G2 64bit	Arm Cortex-A53	EL1	562	691	230	▶️
NXP S32G2 64bit	Arm Cortex-A53	EL2	661	770	228	▶️
Ampere Altra (32/80 Cores) 64bit	Arm Neoverse-N1	EL1	257	399	142	▶️
Ampere Altra (32/80 Cores) 64bit	Arm Neoverse-N1	EL2	299	440	148	▶️
amd64 / x86_64	Intel N100		173/622/557 [3]	390/1388/613 [3]	52/188/147 [3]	▶️
amd64 / x86_64	Intel Core i9-13900K		516/687/557 [3]	930/1240/613 [3]	118/158/147 [3]	▶️
amd64 / x86_64	Intel Xeon Platinum 8352S		511/649/543 [3]	934/1128/587 [3]	222/160/148 [3]	▶️
amd64 / x86_64	Intel Xeon Gold 6248R		664/663/557 [3]	1172/1172/613 [3]	146/146/147 [3]	▶️

On x86, you can boot the image directly from GRUB2, e.g.: multiboot2 (http,l4re.org)/download/ipcbench/amd64/l4re_ipcbench-20250602.elf

For the Raspberry Pi boards, convert the uimage to a raw image for firmware boot as follows: dd if=l4re_ipcbench_rpi5-elX.uimage of=l4re.raw bs=64 skip=1
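For hosts without dd, the same conversion can be sketched in Python. The dd invocation with bs=64 skip=1 drops the first 64 bytes of the input file and copies the rest. The sketch below does the same, assuming those 64 bytes are the uimage header. The function names are illustrative only and are not part of any L4Re tooling.

```python
from pathlib import Path

# Number of leading bytes to drop; matches "bs=64 skip=1" in the dd command.
UIMAGE_HEADER_SIZE = 64


def strip_uimage_header(data: bytes,
                        header_size: int = UIMAGE_HEADER_SIZE) -> bytes:
    """Return `data` without its first `header_size` bytes."""
    if len(data) < header_size:
        raise ValueError("input is shorter than the header")
    return data[header_size:]


def convert_uimage(src: str, dst: str) -> None:
    """Write the payload of the uimage file `src` to the raw file `dst`."""
    Path(dst).write_bytes(strip_uimage_header(Path(src).read_bytes()))
```

With these hypothetical helpers, `convert_uimage("l4re_ipcbench_rpi5-elX.uimage", "l4re.raw")` mirrors the dd command above.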

Plots of Parallel Execution of the Benchmark

The following plots show IPC and system call operations executing in parallel on all cores available on the respective platform, which illustrates how the system scales. Ideally, the graphs are horizontal and flat, indicating that IPC and system calls on one core proceed without interference to or from other cores.

The IPC operation in the benchmark is IPC between two threads in the same task (address space), running on the same core. One such pair of threads runs on each core.

System calls are executed on each core in parallel.
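One way to quantify how flat such a plot is would be the relative spread of the per-core results. The following is a minimal sketch, not part of the benchmark. The function name and the chosen metric are assumptions made for illustration.

```python
from statistics import mean
from typing import Sequence


def relative_spread(cycles_per_core: Sequence[float]) -> float:
    """Return (max - min) / mean of the per-core measurements.

    A value of 0.0 corresponds to a perfectly flat graph; larger
    values indicate that some cores are slower than others.
    """
    if not cycles_per_core:
        raise ValueError("need at least one measurement")
    avg = mean(cycles_per_core)
    return (max(cycles_per_core) - min(cycles_per_core)) / avg
```

For made-up inputs, `relative_spread([250, 250, 250])` yields 0.0, while a list with one outlier core yields a correspondingly larger value.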

Processor: Ampere Altra / Arm Neoverse N1 80-core (EL1)
Cores: 80

Processor: Ampere Altra / Arm Neoverse N1 80-core (EL2)
Cores: 80

Processor: Xeon Gold 6248R (Without SMT/HT, efficient mode)
Cores: 48

Processor: Xeon Gold 6248R (With SMT/HT, efficient mode)
Cores: 96

Processor: Intel(R) N100 at 806MHz
Cores: 4
