fix(thread_pool): re-read outstanding work under lock before stopping #339

Merged: sgerbino merged 1 commit into cppalliance:develop-2 from sgerbino:fix/thread-pool-join-race on Jun 25, 2026

Conversation

@sgerbino

Collaborator

testJoinDrainsWork could intermittently fail: join() returned with posted tasks still queued and never run.

on_work_finished() decremented outstanding_work_ lock-free and decided to stop from that decrement alone. A worker could observe the count transiently reach zero, get preempted before taking the mutex, and then latch stop_ after more work had been posted and join() had begun waiting; join() woke and abandoned the still-outstanding work. The same hole strands a task that suspends and is resumed after the count briefly hits zero, since its run queue is empty while it is in flight.
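The interleaving described above can be replayed deterministically on a single thread by splitting the old logic into its two steps. This is a minimal sketch, not the library's code: `naive_counter`, `decrement_and_decide`, `latch_if`, and `replay_stranding` are hypothetical names, and the "preemption" is simulated by calling the steps in the problematic order.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the pool's bookkeeping before the fix.
struct naive_counter
{
    std::atomic<std::size_t> outstanding{0};
    bool stop = false; // a real pool would guard this with a mutex

    void on_work_started()
    {
        outstanding.fetch_add(1, std::memory_order_acq_rel);
    }

    // Step 1: decide to stop from the lock-free decrement alone.
    bool decrement_and_decide()
    {
        return outstanding.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }

    // Step 2: latch stop using the (possibly stale) decision, with no re-read.
    void latch_if(bool decided)
    {
        if (decided)
            stop = true;
    }
};

// Replays the race on one thread.
// Returns true if stop was latched while work was still outstanding.
inline bool replay_stranding()
{
    naive_counter c;
    c.on_work_started();                     // task A starts
    bool decided = c.decrement_and_decide(); // A finishes, count hits zero
    assert(decided);
    // The worker is "preempted" here, before it takes the mutex.
    c.on_work_started();                     // task B is started meanwhile
    c.latch_if(decided);                     // worker resumes and latches stop
    return c.stop && c.outstanding.load() == 1;
}
```

In this replay `stop` ends up set while the count is still one, which is the state in which a waiter in `join()` would wake and abandon task B.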

Keep outstanding_work_ atomic and lock-free on the start path, but have the worker that drives the count to zero re-read it under mutex_ before latching stop_. The re-read observes any on_work_started() that landed in the window after the lock-free decrement, so work started before the decision is never stranded; work whose count is raised after the decision is post-drain and abandoned as before. join() still blocks until the count reaches zero.
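The shape of the fix can be sketched as follows. This is an illustration under assumptions rather than the merged implementation: `work_tracker`, `decrement`, `latch_if_still_idle`, `stopped`, and `replay_with_reread` are hypothetical, and this simplified `join()` waits only on `stop_`, so it assumes at least one unit of work is started and finished.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical sketch; member names mirror the PR text, not the library source.
class work_tracker
{
    std::atomic<std::size_t> outstanding_work_{0};
    std::mutex mutex_;
    std::condition_variable cv_;
    bool stop_ = false; // guarded by mutex_

public:
    // The start path stays lock-free.
    void on_work_started() noexcept
    {
        outstanding_work_.fetch_add(1, std::memory_order_acq_rel);
    }

    // Lock-free decrement; true if this call drove the count to zero.
    bool decrement() noexcept
    {
        return outstanding_work_.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }

    // Re-read under the mutex; latch stop_ only if the count is still zero.
    void latch_if_still_idle()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (outstanding_work_.load(std::memory_order_acquire) == 0)
        {
            stop_ = true;
            cv_.notify_all();
        }
    }

    void on_work_finished()
    {
        if (decrement())
            latch_if_still_idle();
    }

    // Simplified: blocks until some finisher has latched stop_.
    void join()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return stop_; });
    }

    bool stopped()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return stop_;
    }
};

// Replays the same interleaving as the race, now with the re-read in place.
inline bool replay_with_reread()
{
    work_tracker t;
    t.on_work_started();              // task A starts
    bool hit_zero = t.decrement();    // A finishes, count hits zero
    assert(hit_zero);
    t.on_work_started();              // task B starts in the window
    t.latch_if_still_idle();          // re-read sees B, so stop_ is not latched
    bool stranded = t.stopped();
    t.on_work_finished();             // B finishes, count is zero on re-read
    t.join();                         // returns at once because stop_ is set
    return !stranded && t.stopped();
}
```

Splitting `on_work_finished()` into `decrement()` and `latch_if_still_idle()` exists only so the window can be exercised deterministically; a real implementation would presumably keep them fused.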

Also correct the class example: a bare post() does not register outstanding work, so join() does not wait for it. Use run_async, which holds a work guard for the operation, and document the contract.
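The contract can be illustrated with a toy, single-threaded model. Everything here is hypothetical (`toy_pool`, `work_guard`, `post_with_guard`, `join_would_wait`) and stands in for the real pool and `run_async` only in spirit: a bare `post()` queues a task without touching the outstanding-work count, while a guarded submission holds the count above zero until the task has run.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <queue>
#include <utility>

// Hypothetical toy executor; not the library's thread_pool.
struct toy_pool
{
    std::size_t outstanding = 0;
    std::queue<std::function<void()>> queue;

    // A bare post only queues the task; it does not register outstanding work.
    void post(std::function<void()> f) { queue.push(std::move(f)); }

    // Models the join() condition: wait only while work is registered.
    bool join_would_wait() const { return outstanding != 0; }

    void run_all()
    {
        while (!queue.empty())
        {
            auto f = std::move(queue.front());
            queue.pop();
            f();
        } // f is destroyed here, releasing anything it captured
    }
};

// Hypothetical RAII guard: registers work on construction, releases on destruction.
class work_guard
{
    toy_pool& pool_;

public:
    explicit work_guard(toy_pool& p) : pool_(p) { ++pool_.outstanding; }
    work_guard(const work_guard&) = delete;
    work_guard& operator=(const work_guard&) = delete;
    ~work_guard() { --pool_.outstanding; }
};

// Hypothetical helper playing the role the PR assigns to run_async:
// the queued task keeps a guard alive for as long as it exists.
inline void post_with_guard(toy_pool& p, std::function<void()> f)
{
    auto g = std::make_shared<work_guard>(p);
    p.post([g, fn = std::move(f)] { (void)g; fn(); });
}

inline bool demo_contract()
{
    toy_pool p;
    int ran = 0;
    p.post([&ran] { ++ran; });
    bool bare_waits = p.join_would_wait();    // false: nothing registered
    post_with_guard(p, [&ran] { ++ran; });
    bool guarded_waits = p.join_would_wait(); // true: the guard is held
    p.run_all();
    return !bare_waits && guarded_waits && ran == 2 && !p.join_would_wait();
}
```

The guard is held through a `std::shared_ptr` only because `std::function` requires a copyable callable; that is a convenience of the sketch, not a claim about how the library stores its guard.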

@codecov

codecov Bot commented Jun 25, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop-2@51116a0).

@@             Coverage Diff              @@
##             develop-2     #339   +/-   ##
============================================
  Coverage             ?   98.39%           
============================================
  Files                ?       83           
  Lines                ?     4234           
  Branches             ?        0           
============================================
  Hits                 ?     4166           
  Misses               ?       68           
  Partials             ?        0           
Flag Coverage Δ
linux 98.39% <ø> (?)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 51116a0...572394a.

@cppalliance-bot


An automated preview of the documentation is available at https://339.capy.prtest3.cppalliance.org/index.html

If more commits are pushed to the pull request, the docs will rebuild at the same URL.

2026-06-25 19:38:01 UTC

sgerbino merged commit 690ab36 into cppalliance:develop-2 on Jun 25, 2026
37 checks passed
sgerbino deleted the fix/thread-pool-join-race branch on June 25, 2026 19:49