DuckDB CPU Attribution: Design Spec
- Status: Proposed (for review)
- Date: 2026-06-26
- Supersedes: the caller-side CPU metric in PR #5642 (kurrentdb.duckdb.cpu.seconds, source=caller)
- Spans two repositories: Kurrent.Quack (executor + native interop) and KurrentDB (call-site migration + metric registration)
1. Background & problem
We want operators to be able to answer: "What fraction of this node's CPU is DuckDB vs the rest of KurrentDB?" Process- and system-level CPU are already exported (kurrentdb_proc_cpu, kurrentdb_sys_cpu); the missing piece is the DuckDB share so the two can be compared.
PR #5642 (now closed, superseded by this design) proposed a first attempt: DuckDBCpuMetrics.Measure(activity) returns a ref struct scope that reads the calling thread's CPU (clock_gettime(CLOCK_THREAD_CPUTIME_ID) on Linux/macOS, GetThreadTimes on Windows) at construction and again at Dispose, recording the delta into a counter. It is wrapped around the synchronous DuckDB sections of commit, checkpoint, index reads, and query setup.
Code review identified three fundamental flaws, all rooted in one assumption ("measure the calling thread across a synchronous span"):
- Parallel work is invisible. DuckDB executes parallelizable queries (scans, filters, aggregations, sorts) on its own morsel-scheduler worker threads. CLOCK_THREAD_CPUTIME_ID on the calling thread never sees those threads, so the metric undercounts most severely exactly when DuckDB is busiest. This is not an edge case; it is the common case for analytical queries.
- It blocks async. The measurement requires start and end on the same thread within one synchronous region. If these operations ever become genuinely asynchronous (await points), the scope cannot span them, so the metric would actively prevent a desirable refactor.
- Thread-affinity of ref struct is incidental, not guaranteed. Today a ref struct cannot cross an await or be captured, so it stays on one thread in practice. The language does not guarantee this, and the C# ref struct rules are actively loosening. If start and end ever run on different threads, the per-thread CPU delta is silently wrong: no error, just bad data.
No refinement of the ref struct rescues this; the measurement mechanism must change.
2. Goals & non-goals
Goals
- A correct total of DuckDB CPU consumed on the node, exported as a standard OpenTelemetry metric on /metrics.
- Correct in the presence of DuckDB's internal parallelism.
- Independent of caller threading and of whether callers are synchronous or asynchronous.
- Free of any "two readings on the same thread" assumption.
Non-goals
- Per-query or per-activity attribution. Measuring at the thread level means we cannot say whether a given worker's morsel belongs to a "commit" or a "query". We deliberately trade the (badly-estimated) activity breakdown of the old metric for a correct total. If per-query attribution is wanted later, DuckDB's own query profiling (CPU_TIME) is the right tool and is out of scope here (see Appendix).
- Changing what counts as "DuckDB work": it remains everything executed through the DuckDB engine.
3. Chosen approach: a dedicated DuckDB executor
Stop measuring the caller. Instead, own the threads DuckDB executes on, and measure those threads.
A new DuckDBExecutor becomes the single place DuckDB runs. KurrentDB submits DuckDB operations to it and awaits results; the executor runs them on threads it creates, names, and measures. Because all DuckDB CPU time, parallel and serial, lands on owned threads, summing those threads' CPU yields the correct total, with no dependence on the caller's thread or sync/async shape.
This is the "own the worker pool" direction; it is the only option that captures parallel work, covers all activities (queries, commits, checkpoints, background), and is inherently async- and thread-safe.
4. Architecture
The executor owns two distinct sets of named threads:
- Workers (N = SET threads): each thread loops duckdb_execute_tasks_state(sharedState), draining DuckDB's task/morsel queue. DuckDB is configured SET threads = N; SET external_threads = N, so these owned threads constitute DuckDB's entire parallel execution pool.
- Dispatchers: a bounded pool that runs the issuing side of an operation, that is, the blocking duckdb_query / chunk-fetch / appender-flush call that drives the root pipeline. A caller submits an operation; a dispatcher executes it.
The separation is the load-bearing invariant. A thread blocked inside a query/flush call cannot also pump the task queue. If every thread were busy issuing operations, no thread would process morsels and DuckDB would deadlock. Workers and dispatchers must therefore be separate sets. The corollary is the deadlock-freedom guarantee in §8: because morsels are always drained by workers (never blocked by dispatchers), in-flight operations always make progress regardless of dispatcher saturation.
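The invariant can be illustrated with a small pthread sketch. It is not DuckDB code: `task_queue`, `worker_main`, `issue_and_wait`, and `run_operation` are hypothetical names, and the queue is a plain counter standing in for DuckDB's task queue. The issuing thread blocks until its tasks are done, so progress depends entirely on the separate worker threads.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for DuckDB's task queue: counters under one lock. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    int  pending;    /* tasks enqueued but not yet taken by a worker */
    int  completed;  /* tasks finished by workers */
    bool finished;   /* set once at shutdown, like duckdb_finish_execution */
} task_queue;

/* Worker: loops draining tasks until the queue is empty and marked finished. */
static void *worker_main(void *arg) {
    task_queue *q = arg;
    pthread_mutex_lock(&q->lock);
    for (;;) {
        while (q->pending == 0 && !q->finished)
            pthread_cond_wait(&q->changed, &q->lock);
        if (q->pending == 0)
            break;                      /* finished and nothing left to do */
        q->pending--;
        pthread_mutex_unlock(&q->lock);
        /* a real worker would execute one morsel here, outside the lock */
        pthread_mutex_lock(&q->lock);
        q->completed++;
        pthread_cond_broadcast(&q->changed);
    }
    pthread_mutex_unlock(&q->lock);
    return NULL;
}

/* Dispatcher side of a single operation: enqueue tasks, then block until all
   of them are completed. While blocked, this thread drains nothing itself. */
static void issue_and_wait(task_queue *q, int tasks) {
    pthread_mutex_lock(&q->lock);
    q->pending += tasks;
    pthread_cond_broadcast(&q->changed);
    while (q->completed < tasks)
        pthread_cond_wait(&q->changed, &q->lock);
    pthread_mutex_unlock(&q->lock);
}

/* Runs one operation against `workers` worker threads and returns the number
   of completed tasks. With zero workers, issue_and_wait would never return. */
static int run_operation(int workers, int tasks) {
    assert(workers > 0 && workers <= 16);
    task_queue q = { .pending = 0, .completed = 0, .finished = false };
    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.changed, NULL);

    pthread_t threads[16];
    for (int i = 0; i < workers; i++) {
        int rc = pthread_create(&threads[i], NULL, worker_main, &q);
        assert(rc == 0);
        (void)rc;
    }

    issue_and_wait(&q, tasks);          /* the calling thread is the dispatcher */

    pthread_mutex_lock(&q.lock);
    q.finished = true;                  /* let workers leave their loop */
    pthread_cond_broadcast(&q.changed);
    pthread_mutex_unlock(&q.lock);
    for (int i = 0; i < workers; i++)
        pthread_join(threads[i], NULL);

    int done = q.completed;
    pthread_cond_destroy(&q.changed);
    pthread_mutex_destroy(&q.lock);
    return done;
}
```

Calling `run_operation(2, 8)` returns 8: the dispatcher never touches a task, yet the operation completes because the workers are never the ones blocked in the issuing call.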
                       submit(op)                            results
KurrentDB call sites ─────────────▶ Dispatcher pool ─────────▶ awaiting callers
(await Execute)                     (issue + drive root        (any thread)
                                    pipeline; owned,
                                    named, measured)
                                           │
                                           │ enqueues tasks
                                           ▼
                                    DuckDB task queue
                                           │
                                           ▼
                                    Worker pool (N threads
                                    looping execute_tasks_state;
                                    owned, named, measured)

metric = Σ CPU(all owned threads), sampled on scrape

5. Execution model & API
API (Quack): a single async entry point, approximately:
ValueTask<T> Execute<T>(DuckDBConnection conn, Func<DuckDBConnection, T> op, CancellationToken ct);

The caller awaits; the op is enqueued; a dispatcher runs op(conn); the result completes the ValueTask; the caller resumes on its own context (which no longer matters for measurement).
Connection affinity is preserved for free. DuckDB forbids concurrent use of one connection, not use from different threads. KurrentDB already gives each unit of work its own connection (the shared write connection is used serially by the index processor; reads use per-request pooled connections), so two concurrent ops are never submitted for the same connection. A dispatcher simply borrows the connection for the duration of the op.
6. Call-site migration (KurrentDB)
The following sites change from inline synchronous DuckDB calls to await executor.Execute(...):
- DefaultIndexProcessor.Commit and UserIndexProcessor.Commit / Checkpoint (appender flush).
- The reader path: SecondaryIndexReaderBase.GetDbRecordsForwards / Backwards and the category / event-type / user readers. These already sit under an async ReadForwards / ReadBackwards, so the async fits naturally.
- QueryEngine.ExecuteAsync and GetArrowSchema.
- The shutdown checkpoint in DuckDBConnectionPoolLifetime.StopAsync.
The one genuinely tricky site is streaming reads. QueryEngine's consumer pulls chunks in a TryRead loop, and each TryFetch is a DuckDB call that must run on an owned thread. Resolution: run the whole consume loop inside one Execute submission (the loop executes on a dispatcher) rather than marshalling each fetch individually, giving one submission per query with every fetch on an owned thread. This is tracked as its own work item.
Most of the blast radius is converting a few void/synchronous DuckDB methods to async on call paths that are already async (ReadForwards, ExecuteAsync, the subscription loop).
7. The metric
- Instrument: an OpenTelemetry observable counter kurrentdb.duckdb.cpu.seconds (monotonic CPU-seconds), with an optional role=worker|dispatcher tag for diagnostics. On a dashboard, rate(kurrentdb_duckdb_cpu_seconds_total[1m]) yields DuckDB CPU in cores; compare it against the process-CPU signal in the same unit. Note kurrentdb_proc_cpu (§1) is a gauge (an ObservableUpDownCounter of instantaneous CPU usage), so it must not be wrapped in rate(); divide by the gauge directly once both are expressed as cores (e.g. rate(kurrentdb_duckdb_cpu_seconds_total[1m]) / kurrentdb_proc_cpu, adjusting for the gauge's scaling).
- Sampling: on each scrape, sum every owned thread's cumulative CPU, read by thread handle (cross-thread, not "current thread"):
  - Linux: pthread_getcpuclockid(thread, &clockid) then clock_gettime(clockid, &ts).
  - Windows: GetThreadTimes(handle, …) (kernel + user time) for any owned thread handle.
  - macOS: thread_info(mach_thread, THREAD_BASIC_INFO, …). Dev-only platform; degrade to no-op if unavailable (macOS does not implement pthread_getcpuclockid).
  This generalizes the existing ThreadCpuTime shim from "the calling thread" to "any owned thread", and, crucially, reads correctly even while a worker is parked inside duckdb_execute_tasks_state, because the kernel keeps accounting that thread's CPU regardless. With roughly 8 to 16 owned threads sampled every 15s, the cost is negligible.
- Ownership split: Quack owns the threads and their handles and exposes their per-thread CPU (e.g., an enumeration of CPU-seconds per owned thread). KurrentDB registers the OTel observable counter that reads it and adds the KurrentDB.DuckDB meter to metricsconfig.json. OTel/config concerns stay in KurrentDB.
- Testability: keep an injectable CPU-time source (as the current implementation has) so the summing and tagging are deterministically unit-testable without relying on real per-thread accounting.
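The Linux sampling path described above can be sketched in C as follows. This is a minimal illustration, not the Quack interop: `thread_cpu_seconds` and `sum_owned_cpu_seconds` are hypothetical names, and the sketch assumes every handle refers to a thread that is still alive and not yet joined.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

/* Reads the cumulative CPU time of any live thread in this process, in
   seconds. Returns 0 on success or a non-zero error code. Linux only. */
static int thread_cpu_seconds(pthread_t thread, double *out_seconds) {
    assert(out_seconds != NULL);
    clockid_t clock_id;
    int rc = pthread_getcpuclockid(thread, &clock_id);
    if (rc != 0)
        return rc;                      /* e.g. ESRCH: no such thread */
    struct timespec ts;
    if (clock_gettime(clock_id, &ts) != 0)
        return -1;
    *out_seconds = (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
    return 0;
}

/* Sums CPU seconds over the owned threads. Threads that cannot be read are
   skipped and counted in *unreadable so the caller can log them. */
static double sum_owned_cpu_seconds(const pthread_t *threads, size_t count,
                                    size_t *unreadable) {
    double total = 0.0;
    *unreadable = 0;
    for (size_t i = 0; i < count; i++) {
        double seconds;
        if (thread_cpu_seconds(threads[i], &seconds) == 0)
            total += seconds;
        else
            (*unreadable)++;
    }
    return total;
}
```

The read takes a thread handle rather than using the calling thread's clock, so the sampler can run on the scrape thread. A production sampler would also need to retain the last reading of any thread that has exited so the exported counter stays monotonic; that bookkeeping is omitted here.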
8. Error handling, cancellation & lifecycle
- Op failures: an exception thrown by op(conn) on a dispatcher is captured and faulted onto the returned ValueTask, surfacing to the awaiting caller exactly as a synchronous throw does today.
- Cancellation: the token wired through Execute triggers connection.Interrupt on the running query (today's InterruptQueryOnCancellation behavior); the op throws, the ValueTask faults, and the dispatcher is freed for the next op. DuckDB's Interrupt → OperationCanceledException mapping is retained.
- No deadlock under load: if every dispatcher is busy, further submissions queue (bounded) and wait. They cannot deadlock, because morsel processing happens on the worker pool, which is never blocked by dispatchers, so in-flight ops always complete and free dispatchers, draining the queue.
- Thread lifecycle: the executor owns its threads and task state (the creator owns teardown). Startup spawns workers (each looping execute_tasks_state) and dispatchers. Shutdown order: stop accepting new ops → drain in-flight → duckdb_finish_execution(state) so workers return from the native loop → join all threads → run the final checkpoint → dispose connections. The executor subsumes today's DuckDBConnectionPoolLifetime shutdown checkpoint.
- Worker loss: an uncaught native failure in a worker is logged and surfaced; remaining workers still drain the queue (degraded parallelism, not a hang). Dead threads are not resurrected, since a crashing DuckDB worker indicates larger trouble.
- Configuration: threads (worker count; default to the existing core/RAM heuristic already used for memory_limit) and dispatcher count, both overridable.
9. Testing
Quack (deterministic unit tests):
- Injectable clock → CPU summing across owned threads is exact; the role tag is correct.
- Headline test proving concern #1 is solved: run a parallelizable query on a multi-core host, then assert total DuckDB CPU exceeds wall-clock elapsed. That is only possible if multiple worker threads are counted, which is precisely what the caller-side approach could never show (it was bounded by single-thread wall time).
- Async/thread-independence (concerns #2, #3): drive an op whose continuation resumes on a different thread; assert the measurement is unaffected.
- No deadlock under saturation: submit more concurrent ops than dispatchers; assert all complete.
- Lifecycle: clean start/stop, finish_execution joins every thread, no leaked threads, shutdown checkpoint runs.
- Failure & cancellation: an op that throws faults the awaiting caller; a cancelled op interrupts DuckDB and frees its dispatcher.
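The injectable-clock test in the list above could look roughly like the following C sketch. All names (`owned_thread`, `cpu_source_fn`, `sum_cpu_by_role`, `fake_cpu_source`) are hypothetical, and readings are integer nanoseconds so that the sums are exact rather than subject to floating-point rounding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ROLE_WORKER = 0, ROLE_DISPATCHER = 1, ROLE_COUNT = 2 } thread_role;

/* Hypothetical descriptor for one owned thread. */
typedef struct {
    int         id;    /* index used by the CPU source to find this thread */
    thread_role role;  /* becomes the role=worker|dispatcher tag */
} owned_thread;

/* Injectable CPU-time source: cumulative CPU nanoseconds for one thread.
   Production code would read the OS; tests pass a deterministic fake. */
typedef uint64_t (*cpu_source_fn)(const owned_thread *thread, void *context);

/* Sums CPU per role through the injected source. `totals` must have
   ROLE_COUNT slots; each slot is reset before summing. */
static void sum_cpu_by_role(const owned_thread *threads, size_t count,
                            cpu_source_fn source, void *context,
                            uint64_t totals[ROLE_COUNT]) {
    assert(source != NULL);
    for (int r = 0; r < ROLE_COUNT; r++)
        totals[r] = 0;
    for (size_t i = 0; i < count; i++)
        totals[threads[i].role] += source(&threads[i], context);
}

/* Test double: returns a fixed reading looked up by thread id in a table
   passed through `context`. */
static uint64_t fake_cpu_source(const owned_thread *thread, void *context) {
    const uint64_t *readings = context;
    return readings[thread->id];
}
```

With two workers reporting 1.5 s and 2.5 s and one dispatcher reporting 0.25 s, the worker total is exactly 4,000,000,000 ns and the dispatcher total 250,000,000 ns, independent of what the real threads did.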
KurrentDB (functional safety net): the existing SecondaryIndexing integration tests (reads, subscriptions, FlightSQL, query engine) must pass unchanged through the executor; that proves the call-site migration did not alter behavior. Plus a smoke check that kurrentdb.duckdb.cpu.seconds appears on /metrics.
10. Repository split, rollout & disposition of PR #5642
- Kurrent.Quack (first): add the DuckDBExecutor (worker + dispatcher pools), the native task-scheduler interop (duckdb_create_task_state / execute_tasks_state / finish_execution / destroy_task_state), the cross-thread per-OS CPU read, and the per-thread CPU enumeration. Ship as a new Quack version.
- KurrentDB (second): consume the new Quack version; migrate the DuckDB call sites (§6) to the executor; register the kurrentdb.duckdb.cpu.seconds observable counter and add the KurrentDB.DuckDB meter to metricsconfig.json; document the metric in docs/server/diagnostics/metrics.md.
- PR #5642: remove the caller-side CPU measurement (DuckDBCpuMetrics, ThreadCpuTime, the scope instrumentation and its tests). Per the review, we will not merge a metric whose headline value is wrong for parallel work. The KurrentDB.DuckDB meter name and the docs scaffolding are retained as the landing point for this design (re-created fresh on this PR). #5642 has been closed in favor of this work.
11. Risks & open questions
- Dispatcher pool sizing. Too few dispatchers throttle concurrent reads; too many add scheduling overhead against a fixed worker pool. Needs a sensible default and load testing. (Does not affect correctness of the metric, only read latency.)
- Worker / external-threads interaction under concurrent queries. Multiple in-flight queries share the single worker pool via DuckDB's global task scheduler. Behavior is expected to be standard, but must be load- and soak-tested before this becomes the default execution path, because it changes DuckDB's execution model from internal to external threads.
- macOS per-thread CPU. pthread_getcpuclockid is unsupported on macOS; the thread_info / mach path must be implemented or the metric degrades to no-op on macOS (acceptable: macOS is a development platform only).
- Magnitude of the old blind spot. The headline test (total CPU > wall-clock) will, for the first time, quantify how much CPU the previous caller-side metric was missing, which is useful validation that the rework was warranted.
- Scope of the async migration. Converting the streaming reader to run its consume loop on a dispatcher is the largest single change; it must preserve current read semantics (ordering, cancellation, snapshot capture).
Appendix: verified facts (DuckDB 1.5, as shipped)
- Task-scheduler C API present in the shipped libduckdb 1.5 binaries (Linux/Windows/macOS): duckdb_create_task_state, duckdb_execute_tasks, duckdb_execute_tasks_state, duckdb_execute_n_tasks_state, duckdb_finish_execution, duckdb_task_state_is_finished, duckdb_destroy_task_state, duckdb_execution_is_finished.
- duckdb_execute_tasks_state semantics (DuckDB C API docs): "Execute DuckDB tasks on this thread. The thread will keep on executing tasks forever, until duckdb_finish_execution is called on the state. Multiple threads can share the same duckdb_task_state." This is exactly the owned-worker-pool primitive.
- external_threads + threads caps total parallelism on DuckDB's global task scheduler; combined with externally-provided threads it replaces the internal pool.
- DuckDB threads do not name themselves, so a sampler cannot reliably identify DuckDB's internal workers from outside, which is why owning (and naming) the threads is necessary, and why a thread-enumeration sampler was rejected.
- DuckDB profiling CPU_TIME (the per-query alternative, deliberately out of scope) "measures the CPU time spent on a query, specifically accumulating operator timings; it does not account for parsing or planning." It is per-query, requires profiling to be enabled (overhead), and does not cover non-query work (appender flush / commit / checkpoint). The C symbol duckdb_get_profiling_info is present in 1.5.
- Kurrent.Quack today exposes neither the task scheduler nor profiling; its Threading namespace is the buffered appender only. Both require new Quack work.