Tracking down a bug in the Linux kernel

#linux

#io_uring

#cgroup

#containers

About two months ago, I stumbled upon a critical issue that allows a process to allocate kernel memory beyond its resource limits, potentially opening the door to Denial of Service attacks. This post summarizes my journey from finding the bug to submitting a patch to the Linux kernel.

A little introduction

In October 2025, I started building a toy event loop library to play with io_uring.

For those unfamiliar with io_uring, it is a modern Linux API for asynchronous input/output. Unlike traditional methods, it leverages ring buffers for efficient communication between user and kernel space. Here’s a brief overview diagram:

io_uring usage overview

And for code aficionados, here is some imperative pseudo-code:

# Create an io_uring file descriptor.
ring = io_uring_setup()

# Add two entries to the submission queue (SQ)
sqe = io_uring_get_sqe(ring)
io_uring_prepare_timeout(sqe, 2 * ns_per_sec)
sqe = io_uring_get_sqe(ring)
io_uring_prepare_timeout(sqe, 1 * ns_per_sec)

# Submit 2 entries from submission queue and wait for 1 operation to complete.
io_uring_enter(ring, 2, 1)

# Read an entry from the completion queue (CQ)
cqe = io_uring_get_cqe(ring)

# Process completion...

Once you've set up io_uring, a single system call is used to submit work and wait for completions. I/O-heavy programs tend to perform better with io_uring because they make far fewer system calls and less data is copied between kernel and user space.

That being said, io_uring, released in 2019, is not as battle-tested as other Linux system calls. In June 2023, Google's security team reported that 60% of the exploits submitted to their bug bounty program in 2022 were exploits of io_uring vulnerabilities. This is why Android, ChromeOS, and Google servers disable it by default.

The issue

io_uring uses two ring buffers that serve as fixed-size queues, one for submission (SQ) and one for completion (CQ).

As always when working with queues, you must consider what happens when the consumer is slower than the producer. What happens if a program keeps adding entries to the submission queue faster than the kernel reads them? What happens if the kernel produces completions faster than the program consumes them?

Let's answer those questions.

First, you can't produce faster than the kernel consumes; this is by design. If the SQ is full, you won't be able to get a new SQE without submitting first. Submission is done via a system call that doesn't return until the kernel has started work on each submitted entry.
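To make the "full queue" condition concrete, here is a minimal sketch of a fixed-size ring in plain C. It is only a model: the names (toy_ring, toy_ring_get_slot, toy_ring_consume_all) and the size of 4 are made up for illustration and are not part of io_uring or liburing.

```c
#include <assert.h>
#include <stddef.h>

#define RING_ENTRIES 4u /* hypothetical size, must be a power of two */

struct toy_ring {
    unsigned head;            /* advanced by the consumer */
    unsigned tail;            /* advanced by the producer */
    int slots[RING_ENTRIES];
};

/* Producer side: return the next free slot, or NULL when the ring is full. */
static int *toy_ring_get_slot(struct toy_ring *r)
{
    assert(r != NULL);
    if (r->tail - r->head == RING_ENTRIES)
        return NULL;
    return &r->slots[r->tail++ & (RING_ENTRIES - 1)];
}

/* Consumer side: drain every pending slot and return how many were taken. */
static unsigned toy_ring_consume_all(struct toy_ring *r)
{
    assert(r != NULL);
    unsigned pending = r->tail - r->head;
    r->head = r->tail;
    return pending;
}
```

In this model, a NULL slot plays the role of a NULL SQE: the producer cannot make progress until the consumer has drained the ring, which is the role a submission plays for the SQ.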

However, there is no limit on the total number of in-flight submitted entries, as long as you perform multiple submissions. This makes it possible for a (bad) program to keep pushing I/O work onto the kernel while the completion queue is full. What happens when the kernel completes an I/O operation and the completion queue is full?

I couldn't find the answer in the man pages, so I built a small C program to find out:

#include <liburing.h>

int submit_loop(struct io_uring *ring, long entries) {
  struct io_uring_sqe *sqe;
  int s;

  // Submit tasks in a loop.
  for (;;) {
    sqe = io_uring_get_sqe(ring);

    // Submission queue is full (or nothing is left to queue): submit it.
    if (sqe == NULL || entries <= 0) {
      // io_uring_submit returns the number of submitted entries,
      // or a negative errno value on failure.
      s = io_uring_submit(ring);
      if (s < 0) {
        return s;
      }

      if (s >= entries || entries <= 0)
        break;

      entries -= s;

      continue;
    }

    // Prepare a no-op operation that completes immediately.
    io_uring_prep_nop(sqe);
  }

  return 0;
}

If you run ./kmemleak 1000000000 and observe memory usage in htop, you will see that total memory usage increases (the Mem[||||] bar in the top section) while the process's memory usage stays at 0.0%.

It looks like memory is allocated on the kernel side and the process isn't charged for it; at least, htop doesn't report it. If that's really the case, this is bad and could be used to perform a Denial of Service. If the kernel doesn't track the kernel memory associated with a process, one could allocate tons of kernel memory and trigger the Out-of-Memory killer so that it kills another process that uses more user space memory.

We can verify this by running kmemleak in a control group with a memory controller and a defined maximum memory usage limit. Control groups are a kernel mechanism that underpins containers. You can attach controllers to a control group to monitor various types of system resources, such as memory, CPU, threads, and more.

Since version 211, systemd-run can spawn a process in a control group. Let's run kmemleak and limit its memory usage to 10MiB:

$ systemd-run --scope --user -p MemoryMax=10M ./kmemleak 1000000

Again, if you observe memory usage you will see that memory grows beyond the limit. If you try the same using memleak, a program that allocates user space memory, the process is instantly killed as soon as it tries to exceed the limit. So it really looks like there is a problem here...

Before jumping to conclusions, let's check the memory usage reported by the memory controller.

Control groups expose configuration, metrics, and more as files in the cgroup2 filesystem. Here is a set of commands to print peak memory usage:

$ PID="$(ps a -o pid,command | awk ' /kmemleak/ { print $1; exit }')"
$ cat "/sys/fs/cgroup/$(cat /proc/$PID/cgroup | awk -v FS=: '{ print $3 }')/memory.peak"
41799680

According to the cgroup-v2 documentation, the value in memory.peak corresponds to:

The max memory usage (in bytes) recorded for the cgroup and its descendants since the creation of the cgroup.

This is roughly 40MiB of memory, way beyond the 10MiB limit.
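The shell one-liner above takes the third colon-separated field of /proc/$PID/cgroup and appends it to the cgroup2 mount point. The same derivation can be sketched in C. The function name memory_peak_path and the example scope name are hypothetical, and the sketch assumes a cgroup v2 system with a single "0::<path>" line, mounted at /sys/fs/cgroup.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the memory.peak path from one cgroup v2 line of /proc/<pid>/cgroup,
 * e.g. "0::/user.slice/demo.scope" (hypothetical scope name).
 * Returns 0 on success, -1 on malformed input or if `out` is too small.
 */
static int memory_peak_path(const char *cgroup_line, char *out, size_t out_len)
{
    assert(cgroup_line != NULL && out != NULL);

    /* Skip the first two colon-separated fields (hierarchy ID, controllers). */
    const char *p = strchr(cgroup_line, ':');
    if (p == NULL)
        return -1;
    p = strchr(p + 1, ':');
    if (p == NULL)
        return -1;
    p++;

    /* The cgroup path runs up to the end of the line. */
    size_t len = strcspn(p, "\n");
    int n = snprintf(out, out_len, "/sys/fs/cgroup%.*s/memory.peak",
                     (int)len, p);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

Reading the resulting file with fopen and fscanf then gives the same number that cat prints.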

We are now 100% sure that there is a problem here. Kernel memory is tracked, but for whatever reason the process isn't killed. Let's track that motherfucking bug 🦟.

The bug

Before spending time on tracking the bug and fixing it, it's always wise to check no one else is doing the same.

After a quick search, I've found this issue in the liburing repository: io_uring can use unlimited memory, DoS the system: no backpressure on non-completing operations #293

It's marked as resolved with the following conclusion:

it's now restricted by a memory cgroup

https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.14/io_uring&id=91f245d5d5de0802428a478802ec051f7de2f5d6

My kernel is more recent than this issue, so this must be a different problem. Still, this confirms that kernel memory is accounted.

I found nothing on Bugzilla, so... let the hunt begin!

But where do we start? The kernel is literally millions of lines of code, the memory subsystem is more than 100 thousand lines of code and io_uring is around 20 thousand lines of code.

We have to narrow down our search. We must trace the kernel.

The Linux kernel comes with various tracing mechanisms that can be used for debugging.

One of them is Kernel Probes, or Kprobes, which enable you to dynamically break into any kernel routine and collect debugging and performance information non-disruptively.

Kprobes require building and loading a kernel module that registers a group of probes. It looks like a lot of work for something we will throw in the trash as soon as we're done. If only there was an easy way to trace the kernel without building a kernel module...

> eBPF enters the chat.

Extended Berkeley Packet Filter, or eBPF, is a Linux technology that can run sandboxed programs within the kernel itself. It is used to safely and efficiently extend the capabilities of the kernel without changing kernel source code or loading kernel modules.

However, to use eBPF you must write a program, compile it to eBPF bytecode and load it into the kernel. If only there was an easy way to create and load an eBPF program...

> bpftrace enters the chat.

bpftrace is a high-level tracing language for Linux that provides a quick and easy way to write eBPF programs. It is exactly what we need.

bpftrace takes inspiration from awk and is ideal for short, single use programs.

Now, let's search for a Kprobe that we can use. We can list available Kprobes:

$ sudo bpftrace -l 'kprobe:*' | wc -l
79187

Hmm, that's a lot. Maybe we can list only the probes with the infix mem:

$ sudo bpftrace -l 'kprobe:*mem*' | wc -l
1523

That's better. After a quick skim, it seems we are only interested in probes with the memcg infix. memcg stands for memory control group.

$ sudo bpftrace -l 'kprobe:*memcg*' | wc -l
59

Perfect! Let's trace all those kernel probes and count their invocations.

$ sudo bpftrace \
    -e 'kprobe:*memcg* { printf("%s\n", probe); }'\
    -p $PID \
  | sort | uniq -c
      35 kprobe:flush_memcg_stats_dwork
     163 kprobe:__get_obj_cgroup_from_memcg
     186 kprobe:memcg_charge_kernel_stack
    1033 kprobe:memcg_page_state
    1450 kprobe:__memcg_kmem_uncharge_page
    3006 kprobe:memcg_list_lru_alloc
    4248 kprobe:__memcg_kmem_charge_page
   38475 kprobe:count_memcg_events
  104512 kprobe:charge_memcg
  135263 kprobe:mod_memcg_state
  154792 kprobe:try_charge_memcg
  531742 kprobe:mod_memcg_lruvec_state
 6028668 kprobe:__memcg_slab_post_alloc_hook
25649294 kprobe:__memcg_slab_free_hook

Only these 14 probes are triggered by kmemleak. Nothing conclusive here, so let's print the kernel stack traces:

$ sudo bpftrace -e 'kprobe:*memcg* { printf("%s\n", kstack()); }' -p $PID
        __memcg_slab_post_alloc_hook+5
        __kmalloc_noprof+1153
        io_alloc_ocqe+111
        __io_submit_flush_completions.cold+61
        io_submit_sqes+555
        __do_sys_io_uring_enter+597
        do_syscall_64+183
        entry_SYSCALL_64_after_hwframe+119
...

Much better! We can see all functions called from the io_uring_enter system call up to an allocation of kernel memory: kmalloc.

It appears that an o?????????? completion queue entry (OCQE) is allocated by io_alloc_ocqe. The manual discusses CQE but there is no mention of OCQE.

At this point, I feel like we have sufficiently narrowed our search. It's time to read kernel code 😃.

$ git clone https://github.com/torvalds/linux
$ cd linux
$ git checkout v6.16 # my kernel version
$ grep -Rn 'io_alloc_ocqe' io_uring/
io_uring.c:747:static struct io_overflow_cqe *io_alloc_ocqe(struct io_ring_ctx *ctx,
io_uring.c:888:	ocqe = io_alloc_ocqe(ctx, cqe, big_cqe, GFP_KERNEL);
io_uring.c:900:	ocqe = io_alloc_ocqe(ctx, cqe, big_cqe, GFP_ATOMIC);

💡 Oooh ! an overflowing completion queue entry (OCQE) is allocated each time an operation completes and the CQ is full.
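To see why this can grow without bound, here is a toy user space model of the behaviour just described: completions go into a fixed-size queue, and once it is full every extra completion gets its own heap allocation. This is a sketch under stated assumptions, not the kernel's code: the names (toy_cq, toy_ocqe, toy_cq_post) and the queue size are invented, and the list ordering is simplified to newest-first.

```c
#include <assert.h>
#include <stdlib.h>

#define CQ_ENTRIES 4u /* hypothetical CQ size, must be a power of two */

/* One overflowed completion, allocated on demand. */
struct toy_ocqe {
    struct toy_ocqe *next;
    int res;
};

struct toy_cq {
    int cqes[CQ_ENTRIES];
    unsigned head, tail;
    struct toy_ocqe *overflow;   /* newest first, a simplification */
    unsigned overflow_count;
};

/* Post one completion. Returns 0 on success, -1 if the allocation failed. */
static int toy_cq_post(struct toy_cq *cq, int res)
{
    assert(cq != NULL);
    if (cq->tail - cq->head < CQ_ENTRIES) {
        cq->cqes[cq->tail++ & (CQ_ENTRIES - 1)] = res;
        return 0;
    }

    /* CQ is full: nothing bounds how many of these we allocate. */
    struct toy_ocqe *o = malloc(sizeof(*o));
    if (o == NULL)
        return -1;
    o->res = res;
    o->next = cq->overflow;
    cq->overflow = o;
    cq->overflow_count++;
    return 0;
}

/* Release every overflow entry. */
static void toy_cq_free_overflow(struct toy_cq *cq)
{
    assert(cq != NULL);
    while (cq->overflow != NULL) {
        struct toy_ocqe *next = cq->overflow->next;
        free(cq->overflow);
        cq->overflow = next;
    }
    cq->overflow_count = 0;
}
```

In this model, a producer that never drains the queue makes overflow_count grow by one per completion, which matches the steady growth in memory seen while running kmemleak.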

By looking at the bpftrace output for our process, all kernel allocations are done by io_alloc_ocqe. However, we don't know which call, the one with GFP_KERNEL or GFP_ATOMIC, is allocating the memory.

Memory allocation in the kernel is well documented. The GFP acronym stands for "get free pages", the underlying memory allocation function. GFP flags exist to express how some memory should be allocated.

For example, kzalloc, a function to allocate zeroed memory, is defined as follows:

static inline void *kzalloc(size_t size, gfp_t gfp)
{
        return kmalloc(size, gfp | __GFP_ZERO);
}

It simply sets the __GFP_ZERO bit flag.

Common sets of GFP flags are defined as GFP_XXX values, while individual bit flags are defined as __GFP_XXX:

GFP_KERNEL, a general purpose flag used for kernel-internal allocations, is defined as __GFP_RECLAIM | __GFP_IO | __GFP_FS.
GFP_ATOMIC, a flag used for allocations that cannot sleep and need to succeed, is defined as __GFP_HIGH | __GFP_KSWAPD_RECLAIM.
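The composition of bit flags into named sets can be sketched in a few lines of user space C. Everything prefixed with TOY_ or toy_ is invented for this example, and the bit positions are arbitrary: they are not the kernel's actual values.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int toy_gfp_t;

/* Illustrative bit positions only: NOT the kernel's actual values. */
#define TOY_GFP_IO      (1u << 0)
#define TOY_GFP_FS      (1u << 1)
#define TOY_GFP_RECLAIM (1u << 2)
#define TOY_GFP_ZERO    (1u << 3)

/* A named set is just several bits OR-ed together. */
#define TOY_GFP_KERNEL (TOY_GFP_RECLAIM | TOY_GFP_IO | TOY_GFP_FS)

/* The named set must not already contain the modifier bit. */
static_assert((TOY_GFP_KERNEL & TOY_GFP_ZERO) == 0,
              "TOY_GFP_KERNEL must not include TOY_GFP_ZERO");

/* What a kzalloc-style wrapper does: add one bit to the caller's flags. */
static toy_gfp_t toy_kzalloc_flags(toy_gfp_t gfp)
{
    return gfp | TOY_GFP_ZERO;
}

/* Test whether every bit of `flag` is set in `gfp`. */
static bool toy_has_flag(toy_gfp_t gfp, toy_gfp_t flag)
{
    return (gfp & flag) == flag;
}
```

The allocator only ever sees the final OR-ed value, so a wrapper can add a behaviour (here, zeroing) without knowing which set the caller picked.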

Allocations that should be accounted by the memory controller must have the __GFP_ACCOUNT bit set. At first, I thought a missing bit was the explanation, but the io_alloc_ocqe function always sets it when calling kzalloc:

static struct io_overflow_cqe *io_alloc_ocqe(struct io_ring_ctx *ctx,
					     struct io_cqe *cqe,
					     struct io_big_cqe *big_cqe, gfp_t gfp)
{
    // ...
	ocqe = kzalloc(ocq_size, gfp | __GFP_ACCOUNT);
    // ...
}

So, why isn't the process killed? When is the OOM killer triggered?

From kernel.org:

It is possible that on a loaded machine memory will be exhausted and the kernel will be unable to reclaim enough memory to continue to operate. In order to save the rest of the system, it invokes the OOM killer.

It seems to be event based. When the kernel needs more memory and can't reclaim it via swap or other means, it invokes the OOM killer.

So, logically, this is done by the memory allocator: kmalloc. That way, when a memory allocation is requested, it can trigger the OOM killer to free some memory instead of returning NULL.

The same goes for memory accounting in control groups: as soon as a control group reaches its limit, a process gets killed.

If we look at stack traces of other Kprobes, notably try_charge_memcg, we can confirm our hypothesis:

try_charge_memcg+1
obj_cgroup_charge_account+209
__memcg_slab_post_alloc_hook+259
__kmalloc_noprof+1153
io_alloc_ocqe+111
__io_submit_flush_completions.cold+61
io_submit_sqes+555
__do_sys_io_uring_enter+597
do_syscall_64+183
entry_SYSCALL_64_after_hwframe+119

After allocating memory, obj_cgroup_charge_account is called to update the control group's statistics and trigger the OOM killer if needed.

try_charge_memcg is pretty complex: it is 194 lines long. I won't go into the details as I don't understand everything, but you can read it here if you're curious.

As expected, it can trigger the OOM killer.
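Stripped of everything else, the idea of "charge the control group, and refuse once the limit is reached" can be modelled in a few lines. This is a deliberately naive sketch and not what try_charge_memcg does: the real function also attempts reclaim, handles batching, and more. All names (toy_memcg, toy_try_charge, toy_uncharge) are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum toy_charge_result { TOY_CHARGE_OK, TOY_CHARGE_OVER_LIMIT };

struct toy_memcg {
    size_t usage; /* bytes currently charged */
    size_t max;   /* hard limit, in the spirit of MemoryMax */
    size_t peak;  /* highest usage seen, in the spirit of memory.peak */
};

/* Charge `bytes` to the group, or report that the limit would be exceeded. */
static enum toy_charge_result toy_try_charge(struct toy_memcg *cg, size_t bytes)
{
    assert(cg != NULL);

    /* Written this way so that usage + bytes can never overflow. */
    if (cg->usage > cg->max || bytes > cg->max - cg->usage)
        return TOY_CHARGE_OVER_LIMIT; /* a real kernel would reclaim or OOM here */

    cg->usage += bytes;
    if (cg->usage > cg->peak)
        cg->peak = cg->usage;
    return TOY_CHARGE_OK;
}

/* Return previously charged bytes to the group. */
static void toy_uncharge(struct toy_memcg *cg, size_t bytes)
{
    assert(cg != NULL && bytes <= cg->usage);
    cg->usage -= bytes;
}
```

In this model, peak can never exceed max, which is exactly the invariant that the 40MiB memory.peak reading under a 10MiB limit appears to violate.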
