The question of scaling - concurrency in Celery jargon - generally comes as an afterthought. And the pool choice is an afterthought of the afterthought.
Why do I start with the scaling question when I write about Celery's prefork pool?
Because the prefork pool is Celery's default. This means you are using the prefork pool even when you are not aware of it. It is easy to get started with, works with any Python package and plays nicely with whatever your code does. It is so pain-free that you do not have to think or worry about it.
Until you do, because you start looking into scaling. That is when the honeymoon period comes to an end and the prefork pool turns into a CPU and memory hungry, database-connection-spawning monster. With a serious impact on the cost of running your app.
This is part three of my seven-part series about Celery execution pools: the prefork pool. I am going to cover what the prefork pool is, how it works, its configuration options and its limitations. The good and the bad. My goal is to help you make an educated decision on whether the prefork pool is a good fit for your application.
What is prefork?
The term forking comes from the Linux and Unix world. fork() is an operating system call that creates a new process by duplicating the calling process.
In server architectures, prefork is a model for handling concurrent workloads. At startup (pre), the server creates a number of child processes (forks). Multiple child processes can handle multiple tasks concurrently.
The child processes stay alive so that they are available for incoming workload. This allows the server to handle a higher workload, faster.
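To illustrate the idea, here is a minimal sketch of the prefork model in plain Python. It uses os.fork() directly and only runs on Unix-like systems; it is a toy illustration, not how Celery implements its pool.

import os

NUM_CHILDREN = 2  # analogous to --concurrency

for _ in range(NUM_CHILDREN):
    pid = os.fork()
    if pid == 0:
        # Child process: a real pool process would loop here, waiting for work.
        print(f"child {os.getpid()} ready for work")
        os._exit(0)

# Parent process: supervises its children, like the Celery worker supervises its pool.
for _ in range(NUM_CHILDREN):
    os.wait()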
Starting a Celery prefork worker
To start a Celery worker using the prefork pool, set the --pool option to prefork or processes, or pass no pool option at all. Use --concurrency to control the number of child processes (forks) within the pool.
celery --app=worker.app worker --concurrency=2 --pool=prefork
celery --app=worker.app worker --concurrency=2 --pool=processes
celery --app=worker.app worker --concurrency=2
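These commands assume an app module along these lines - a sketch of a hypothetical worker.py, reconstructed from the transport and task names shown below:

# worker.py - minimal example app module
from celery import Celery

app = Celery("worker", broker="redis://localhost:6379/0")

@app.task
def task1():
    return "done"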
The worker clones itself and creates two new processes. This gives you three copies of your Celery app: the worker (parent) process and two forked (child) processes.
[config]
.> app: app:0x1027f3d50
.> transport: redis://localhost:6379/0
.> results: disabled://
.> concurrency: 2 (prefork)
.> task events: OFF (enable -E to monitor tasks in this worker)
[queues]
.> celery exchange=celery(direct) key=celery
The child processes do not share state. Nor do they communicate with each other. They do not even know that they have siblings. The parent process (the worker) communicates with the pool processes via pipes. It is the worker's responsibility to pick new tasks from the broker, pass them on to the pool, collect results from the pool and write them to the result backend (if set up).
Because of the process isolation, the prefork pool sidesteps Python's Global Interpreter Lock (GIL) entirely. By being multiprocessing-based, it is pythonic concurrency by design.
This gives the prefork pool its well-behaved characteristics. It does not impose any rules on how you write your code. Nor does it restrict the Python packages you can use. This makes the prefork pool well-understood, stable and reliable. The prefork pool is a rock-solid choice that is as pythonic as it gets.
Configuration
Celery provides a number of settings that control the prefork pool, some of which are specific to it.
Concurrency
The number of concurrent worker processes in the pool. Defaults to the number of CPUs on the host machine if omitted.
Worker command line argument:
--concurrency <n>
Environment variable:
CELERY_WORKER_CONCURRENCY=<n>
Celery app arguments:
Celery(worker_concurrency=<n>)
Celery config:
app.conf.worker_concurrency = <n>
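For example, to pin the pool to four processes from the app instead of the command line (four is an arbitrary value for illustration):

from celery import Celery

# Pass the setting when creating the app...
app = Celery("worker", broker="redis://localhost:6379/0", worker_concurrency=4)

# ...or set it on the configuration afterwards.
app.conf.worker_concurrency = 4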
Maximum number of tasks per child
The maximum number of tasks <n> a pool process is allowed to execute before it is shut down and replaced with a new one. For example, if set to ten, the worker shuts down a pool process after that process has executed ten tasks and replaces it with a new fork. Ugly but useful if you deal with memory leaks you have no control over. If not set, the process simply executes tasks for as long as it is alive.
Worker command line argument:
--max-tasks-per-child <n>
Environment variable:
CELERY_WORKER_MAX_TASKS_PER_CHILD=<n>
Celery app arguments:
Celery(worker_max_tasks_per_child=<n>)
Celery config:
app.conf.worker_max_tasks_per_child = <n>
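To see this in action, the worker for the log output below can be started along these lines (reusing the hypothetical worker.py from earlier):

celery --app=worker.app worker --concurrency=1 --max-tasks-per-child=3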
Running with a max-tasks-per-child setting of three and a concurrency of one, the logs show that the worker shuts down ForkPoolWorker-1 after three tasks and replaces it with ForkPoolWorker-2. Note how tasks queue up while ForkPoolWorker-1 is shutting down and ForkPoolWorker-2 is starting up.
[2023-09-18 14:31:18,924: INFO/MainProcess] Task worker.task1 received
[2023-09-18 14:31:18,925: INFO/ForkPoolWorker-1] Task worker.task1 succeeded
[2023-09-18 14:31:19,433: INFO/MainProcess] Task worker.task1 received
[2023-09-18 14:31:19,434: INFO/ForkPoolWorker-1] Task worker.task1 succeeded
[2023-09-18 14:31:19,942: INFO/MainProcess] Task worker.task1 received
[2023-09-18 14:31:19,944: INFO/ForkPoolWorker-1] Task worker.task1 succeeded
[2023-09-18 14:31:20,453: INFO/MainProcess] Task worker.task1 received
[2023-09-18 14:31:20,965: INFO/MainProcess] Task worker.task1 received
[2023-09-18 14:31:21,471: INFO/MainProcess] Task worker.task1 received
[2023-09-18 14:31:21,979: INFO/MainProcess] Task worker.task1 received
[2023-09-18 14:31:22,126: INFO/ForkPoolWorker-2] Task worker.task1 succeeded
[2023-09-18 14:31:22,130: INFO/ForkPoolWorker-2] Task worker.task1 succeeded
Maximum memory per child
Similar to the maximum number of tasks per child, but based on resident memory usage. When a child process reaches the maximum amount of resident memory <n> (in kilobytes), it is shut down after processing its current task and replaced by a new process.
Worker command line argument:
--max-memory-per-child <n>
Environment variable:
CELERY_WORKER_MAX_MEMORY_PER_CHILD=<n>
Celery app arguments:
Celery(worker_max_memory_per_child=<n>)
Celery config:
app.conf.worker_max_memory_per_child = <n>
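For example, to recycle a pool process once it uses more than roughly 200 MB of resident memory (the value is arbitrary and for illustration only):

celery --app=worker.app worker --concurrency=2 --max-memory-per-child=200000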
Autoscaling pool size
The autoscale option resizes the pool based on load. It adds pool processes, up to a maximum concurrency of <n>, when there is work to do. It removes processes, down to a minimum pool size of <m>, when the workload is low.
Worker command line argument:
--autoscale <m>,<n>
Environment variable:
CELERY_WORKER_AUTOSCALE=<m>,<n>
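The log output below was produced by a worker started along these lines (again reusing the hypothetical worker.py):

celery --app=worker.app worker --autoscale=1,3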
Running with an autoscale setting of 1,3, the worker adds a new process to the pool as tasks start queueing up.
[2023-09-19 11:45:08,708: INFO/MainProcess] Task worker.task1 received
[2023-09-19 11:45:08,711: INFO/MainProcess] Task worker.task1 received
[2023-09-19 11:45:08,712: INFO/MainProcess] Scaling up 1 processes.
[2023-09-19 11:45:08,713: INFO/ForkPoolWorker-1] Task worker.task1 succeeded
[2023-09-19 11:45:08,731: INFO/MainProcess] Task worker.task1 received
[2023-09-19 11:45:08,732: INFO/ForkPoolWorker-1] Task worker.task1 succeeded
[2023-09-19 11:45:08,737: INFO/ForkPoolWorker-2] Task worker.task1 succeeded
When you combine the autoscale option with the concurrency option, inconsistencies can occur. For the minimum pool size, the concurrency setting takes precedence over the minimum autoscale setting. For example, when you run with a concurrency of two and an autoscale of (1, 3), the pool scales up to three but not down below two. When you run with a concurrency of two and an autoscale of (3, 4), the pool scales up to four and down to two.
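On the command line, the two combinations from the example look like this:

celery --app=worker.app worker --concurrency=2 --autoscale=1,3
celery --app=worker.app worker --concurrency=2 --autoscale=3,4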
The good
The prefork pool has a very pythonic look and feel because it works very much like Python's multiprocessing Pool class. Though Celery does not use Python's multiprocessing module but its own fork of it, billiard, the two are very similar.
The prefork pool plays perfectly with Python's Global Interpreter Lock (GIL). In simple terms, the GIL allows only one thread to execute Python code at a time, even on multi-core processors. To take full advantage of multi-core processors, you need to use multiple processes in Python. For that reason, the prefork pool shines when you deal with a CPU-bound workload.
The process isolation also means that there are no rules on how you write your code. Think about how you need to refactor your code when moving from classic sync Python to an eventloop based async/await paradigm. There is nothing like that with the prefork pool. It also does not restrict you on the Python packages you can use.
The bad
You cannot spawn another process from within a child process. This means that you cannot make use of multiprocessing inside a Celery task when using the prefork pool.
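A quick sketch of what this limitation looks like in practice. The task is hypothetical and the exact error depends on your Celery and billiard versions:

from multiprocessing import Process

from worker import app  # hypothetical app module from earlier

@app.task
def spawn_subprocess():
    # Under the prefork pool this task runs in a daemonic child process,
    # so starting another process typically fails with an error along the
    # lines of "daemonic processes are not allowed to have children".
    p = Process(target=print, args=("hello from a grandchild",))
    p.start()
    p.join()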
The biggest issue with the prefork pool is its impact on resources. It can become very hungry. For one, there is memory consumption. Total resident memory consumption is roughly the resident memory footprint of your application x (concurrency + 1), the +1 being the worker (parent) process. For example, an app with a resident footprint of 200 MB running with a concurrency of four occupies roughly 200 MB x 5 = 1 GB.
Then there is the impact on CPU usage. Remember that it is one CPU per child. If you have a concurrency of 10, you need 11 CPUs. The pool size is effectively limited by the number of CPUs at your disposal, or the $$$ you are willing to pay to your cloud provider.
The actual issue is more nuanced though. For example, if your tasks perform heavy calculations and require CPU, you are putting those 11 CPUs to good use. If all you do is hit an external API or read and write large files to disk, you are wasting CPU cycles and ultimately $$$.
Imagine a Celery app that fetches large image files from an external service and writes them to S3. Each transfer takes five seconds to complete and you get one request per second. To avoid any waiting time in the queue, you need to be able to process five requests a second concurrently.
You set your concurrency to five and pay for five Heroku dynos. But think about what your five fork processes do: they just sit there idling, not utilising any CPU resources. They copy a file from one place to another, which is entirely IO bound and not a good use of CPU.
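Such a task is almost pure IO. A hypothetical sketch, assuming the requests and boto3 libraries and the worker.py app module from earlier:

import boto3
import requests

from worker import app  # hypothetical app module from earlier

s3 = boto3.client("s3")

@app.task
def transfer_image(url, bucket, key):
    # Download the image (network IO)...
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # ...and upload it to S3 (more network IO). The CPU sits idle for
    # almost the entire duration of the task.
    s3.put_object(Bucket=bucket, Key=key, Body=response.content)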
Five processes is not a lot, and neither is paying for five Heroku dynos. You can easily scale this example up to a higher throughput and longer running tasks and see that this is not a cost-effective solution. There are pool types that are much better suited for IO bound tasks and I will cover them as part of this series.
Another thing to bear in mind is database connections. Each fork process has its own database connection. If you have a concurrency of 10, you have 10 database connections. This is not a problem if you use a database connection pool. But if you do not, you are going to run into issues with the number of database connections you can open. There are ways to mitigate this, such as using a connection pooling service that sits outside your app, for example pgbouncer in the Postgres case.
Closing words
While the prefork pool is Celery's default and Python's darling, it is not a one-size-fits-all solution. That is why it is important to have at least an idea of your workload's requirements early on in the development process. In short: prefork and CPU, yay; IO, nay.