Celery Execution Pools: What is it all about?

Have you ever asked yourself what happens when you start a Celery worker? Ok, it might not have been on your mind. But you might have come across things like execution pool, concurrency settings, prefork, threads, gevent, eventlet and solo. So, what is it all about? How does it all fit together? And how is it related to the mechanics of a Celery worker?

The Celery worker

When you start a Celery worker on the command line via celery --app=..., you just start a supervisor process. The Celery worker itself does not process any tasks (with one notable exception, which I will get to later).

It spawns child processes (or threads) and deals with all the bookkeeping stuff. The child processes (or threads) execute the actual tasks. These child processes (or threads) are also known as the execution pool.

The size of the execution pool determines the number of tasks your Celery worker can process. The more processes (or threads) the worker spawns, the more tasks it can process concurrently. If you need to process as many tasks as quickly as possible, you need a bigger execution pool. At least, that is the idea.

In reality, it is more complicated. The answer to the question of how big your execution pool should be depends on whether you use processes or threads. And the answer to that question depends what your tasks do.

The --pool option

You can choose between processes or threads, using the --pool command line argument. Use a gevent execution pool, spawning 100 "green threads" or "greenlets" (you need to pip-install gevent first):

# start celery worker with the gevent pool
$ celery worker --app=worker.app --pool=gevent --concurrency=100

Don't worry too much about the details for now (why are threads green?). We will go into more details if you carry on reading. Celery supports four execution pool implementations:

prefork
threads
solo
eventlet
gevent

The --pool command line argument is optional. If not specified, Celery defaults to the prefork execution pool.

Prefork

The prefork pool implementation is based on Python's multiprocessing package. It allows your Celery worker to side-step Python's Global Interpreter Lock and fully leverage multiple processors on a given machine.

You want to use the prefork pool if your tasks are CPU-bound. A task is CPU-bound, if it spends the majority of its time using the CPU (crunching numbers). Your task could only go faster if your CPU were faster.

The number of available cores limits the number of concurrent processes. It only makes sense to run as many CPU-bound tasks in parallel as there are CPUs available.

This is why Celery defaults to the number of CPUs available on the machine if the --concurrency argument is not set. Start a worker using the prefork pool, using as many processes as there are CPUs available:

# start celery worker with the prefork pool
$ celery worker --app=worker.app

Threads

# start celery worker with the threads pool
$ celery worker --app=worker.app --pool=threads

The threads pool is Celery's latest addition to the pool zoo and was introduced back in 2019. It uses Python's ThreadPoolExecutor. These threads are real OS threads, managed directly by the operating system kernel.

The threads pool does what it says on the tin and is suitable for tasks that are I/O bound. A task is I/O bound if it spends the majority of its time waiting for an Input/Output operation to finish. Your task could only go faster if the Input/Output operation were faster.

Just like in the prefork case, Celery defaults to the number of CPUs available on the machine if no --concurrency is specified. This behaviour is slightly misleading because Python's Global Interpreter Lock (GIL) still reigns. Even if you have 4 CPUs and have Celery spawn 4 worker threads, only one thread can execute Python bytecode at any given time.

Solo

The solo pool is a bit of a special execution pool. Strictly speaking, the solo pool is neither threaded nor process-based. More strictly speaking, the solo pool is not even a pool as it is always solo. Even more strictly speaking, the solo pool contradicts the principle that the worker does not process any tasks.

The solo pool runs inside the worker process. It runs inline which means there is no bookkeeping overhead. This makes the solo worker fast. But it also blocks the worker during task execution. Which has some implications when remote-controlling workers.

# start celery worker in solo mode
$ celery worker --app=worker.app --pool=solo

The solo pool is an interesting option when running CPU-intensive tasks in a microservices environment. For example, in a Kubernetes context, managing the worker pool size can be easier than managing multiple execution pools. Instead of managing the execution pool size per worker(s), you manage the total number of workers.

Eventlet and gevent

Let's say you need to execute thousands of HTTP GET requests to fetch data from external REST APIs. The time it takes to complete a single GET request depends almost entirely on the time it takes the server to handle that request. Most of the time, your tasks wait for the server to send the response, not using any CPU.

The bottleneck for this kind of task is not the CPU. The bottleneck is waiting for an Input/Output operation to finish. This is an Input/Output-bound task (I/O bound). The time the task takes to complete is determined by the time spent waiting for an input/output operation to finish.

If you run a single process execution pool, you can only handle one request at a time. It takes a long time to complete those thousands of GET requests. So you spawn more processes.

However, there is a tipping point where adding more processes to the execution pool hurts performance. The overhead of managing the process pool becomes more expensive than the marginal gain for an additional process.

In this scenario, spawning hundreds (or even thousands) of threads is a much more efficient way to increase capacity for I/O-bound tasks. Celery supports two thread-based execution pools: eventlet and gevent. Here, the execution pool runs in the same process as the Celery worker itself. To be precise, both eventlet and gevent use greenlets and not threads.

Greenlets - also known as green threads, cooperative threads or coroutines - give you threads, but without using threads (unlike the threads pool). Threads are managed by the operating system kernel. The operating system uses a general-purpose scheduler to switch between threads. This general-purpose scheduler is not always very efficient.

Greenlets emulate multi-threaded environments without relying on any native operating system capabilities. Greenlets are managed in application space and not in kernel space. There is no scheduler pre-emptively switching between your threads at any given moment. Instead, your greenlets voluntarily or explicitly give up control to one another at specified points in your code.

This makes greenlets excel at running a huge number of non-blocking tasks. Your application can schedule things much more efficiently. For a large number of tasks, this can be a lot more scalable than letting the operating system interrupt and awaken threads arbitrarily.

For us, the benefit of using a gevent or eventlet pool is that our Celery worker can do more work than it could before. This means we do not need as much RAM to scale up. This optimises the utilisation of our workers.

Start a Celery worker using a gevent execution pool with 500 worker threads (you need to pip-install gevent):

# start celery worker using the gevent pool
$ celery worker --app=worker.app --pool=gevent --concurreny=500

Start a Celery worker using a eventlet execution pool with 500 worker threads (you need to pip-install eventlet):

# start celery worker using the eventlet pool
$ celery worker --app=worker.app --pool=eventlet --concurreny=500

Both pool options are based on the same concept: Spawn a greenlet pool. The difference is that --pool=gevent uses the gevent Greenlet pool (gevent.pool.Pool). Whereas --pool=eventlet uses the eventlet Greenlet pool (eventlet.GreenPool).

gevent and eventlet are both packages that you need to pip-install yourself. There are implementation differences between the eventlet and gevent packages. Depending on your circumstances, one can perform better than the other. On the downside, both packages rely on monkey patching. This has an impact on package compatibility and your code base.

The --concurrency option

To choose the best execution pool, you need to understand whether your tasks are CPU- or I/O-bound. CPU-bound tasks are best executed by a prefork execution pool. I/O bound tasks are best executed by a gevent/eventlet execution pool.

The only question that remains is: how many worker processes/threads should you start? The --concurrency command line argument determines the number of processes/threads:

# start celery worker using the prefork pool
$ celery worker --app=worker.app --concurrency=2

This starts a worker with a prefork execution pool which is made up of two processes. For prefork pools, the number of processes should not exceed the number of CPUs.

Spawn a Greenlet based execution pool with 500 worker threads:

# start celery worker using the gevent pool
$ celery worker --app=worker.app --pool=gevent --concurrency=500

If the --concurrency argument is not set, Celery always defaults to the number of CPUs, whatever the execution pool.

This makes the most sense for the prefork execution pool. But you have to take it with a grain of salt. If there are many other processes on the machine, running your Celery worker with as many processes as CPUs available might not be the best idea.

Using the default concurrency setting for a gevent/eventlet pool is almost outright stupid. The number of green threads it makes sense for you to run is unrelated to the number of CPUs you have at your disposal.

Another special case is the solo pool. Even though you can provide the --concurrency command line argument, is meaningless for this execution pool.
For these reasons, it is always a good idea to set the --concurrency command line argument.

Conclusion

Celery supports three concepts for spawning its execution pool: Prefork, OS Threads and Greenlets. Prefork is based on multiprocessing and is the best choice for tasks which make heavy use of CPU resources. Prefork pool sizes are roughly in line with the number of available CPUs on the machine.

Tasks that perform Input/Output operations should run in a thread or greenlet-based execution pool.

The threads pool uses OS threads and plays nicely with any codebase. There is a limit though on how many threads it makes sense to spawn.

Greenlets behave like threads but are much more lightweight and efficient. Greenlet pools can scale to hundreds or even thousands of tasks. Using one of the greenlet implementations will in all likelihood have an impact on your codebase.

What can you do if you have a mix of CPU and I/O-bound tasks? Set up two queues with one worker processing each queue. One queue/worker with a prefork execution pool for CPU-heavy tasks. And another queue/worker with a gevent or eventlet execution pool for I/O tasks. And don't forget to route your tasks to the correct queue.