asyncio performance

Random notes about tuning asyncio for performance. Performance can mean two different things, which may be incompatible:

  • Number of concurrent requests per second
  • Request latency in seconds: min/average/max time to complete a request

Architecture: Worker processes

Because of the GIL, a single CPython process can run Python code on only one CPU at a time. To increase the number of concurrent requests, one solution is to spawn multiple worker processes. See for example:
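As a rough illustration of the worker-process idea (a minimal sketch, not taken from any particular project): the parent opens one listening socket, then forks several workers that each run their own asyncio event loop and accept connections from that shared socket. The names handle, serve, worker, make_listening_socket and start_workers are hypothetical, and the sketch assumes a Unix system where the "fork" start method is available.

```python
import asyncio
import multiprocessing
import socket


async def handle(reader, writer):
    # Echo a single line back to the client, then close the connection.
    data = await reader.readline()
    writer.write(data)
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def serve(sock):
    # Each worker builds its own asyncio server on the inherited socket.
    server = await asyncio.start_server(handle, sock=sock)
    async with server:
        await server.serve_forever()


def worker(sock):
    # One event loop per process, so each worker can use its own CPU.
    asyncio.run(serve(sock))


def make_listening_socket(host="127.0.0.1", port=0):
    # Port 0 asks the kernel for a free port; use a fixed port in practice.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(128)
    return sock


def start_workers(sock, count):
    # Assumption: "fork" lets children inherit the listening socket directly.
    ctx = multiprocessing.get_context("fork")
    procs = [ctx.Process(target=worker, args=(sock,), daemon=True)
             for _ in range(count)]
    for proc in procs:
        proc.start()
    return procs
```

A caller would typically create the socket once, call start_workers(sock, n) with n close to the number of CPUs, and then wait on the returned processes. The kernel decides which worker accepts each incoming connection.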

Stream limits

aiohttp calls set_write_buffer_limits(0) on the transport for backpressure support and implements its own buffering, see:
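To show what a zero write-buffer limit does at the asyncio level, here is a small sketch (it does not reproduce aiohttp's code). It sets the high-water mark to 0 on a transport and checks that the protocol's pause_writing() callback fires as soon as data has to be buffered. TrackingProtocol and demo_zero_limits are hypothetical names, and the sketch assumes a selector-based event loop (the default on Unix) with a peer that never reads.

```python
import asyncio
import socket


class TrackingProtocol(asyncio.Protocol):
    """Hypothetical protocol that records flow-control callbacks."""

    def __init__(self):
        self.paused = False

    def pause_writing(self):
        # Called when the write buffer grows above the high-water mark.
        self.paused = True

    def resume_writing(self):
        # Called when the buffer drains to the low-water mark.
        self.paused = False


async def demo_zero_limits():
    ours, peer = socket.socketpair()
    loop = asyncio.get_running_loop()
    transport, protocol = await loop.create_connection(
        TrackingProtocol, sock=ours)

    # With high=0, the low-water mark defaults to 0 as well.
    transport.set_write_buffer_limits(high=0)
    limits = transport.get_write_buffer_limits()  # (low, high)

    # The peer never reads, so most of this payload cannot be sent at once
    # and the remainder lands in the transport's buffer.
    transport.write(b"x" * (16 * 1024 * 1024))
    paused = protocol.paused

    transport.abort()        # discard the buffered data
    peer.close()
    await asyncio.sleep(0)   # let the transport finish closing
    return limits, paused
```

With the limit at 0, any buffered byte pauses the protocol, so the application (or a library layered on top) is told immediately to stop producing data and can apply its own buffering policy.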

TCP_NODELAY

Since Python 3.6, asyncio sets the TCP_NODELAY option on newly created TCP sockets, which disables the Nagle algorithm for send coalescing. Small segments are no longer held back to be merged, so data is sent to the peer as quickly as possible. This typically reduces latency, at the cost of possibly sending more small packets.

See Nagle’s algorithm.
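The option can be inspected or changed on any TCP socket with the standard socket module. The helper names nodelay_enabled and set_nodelay below are hypothetical; the sketch only uses getsockopt() and setsockopt().

```python
import socket


def nodelay_enabled(sock: socket.socket) -> bool:
    # A non-zero value means the Nagle algorithm is disabled.
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


def set_nodelay(sock: socket.socket, enabled: bool = True) -> None:
    # Pass enabled=False to turn the Nagle algorithm back on.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY,
                    1 if enabled else 0)
```

For an asyncio transport, the underlying socket object is available through transport.get_extra_info("socket") and can be passed to these helpers.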

TCP_QUICKACK

(This option is not used by asyncio by default.)

The TCP_QUICKACK option asks the kernel to send acknowledgements immediately instead of delaying them. The setting is not permanent: it only switches the socket into quick-ACK mode, and later TCP processing (which happens under the hood) may leave that mode again depending on the protocol state, regardless of what the application requested.
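Since asyncio does not set this option, an application that wants it has to set it itself. The sketch below assumes Linux, where socket.TCP_QUICKACK is defined; enable_quickack is a hypothetical helper that returns False on platforms without the option.

```python
import socket


def enable_quickack(sock: socket.socket) -> bool:
    """Try to switch the socket to quick-ACK mode; return True on success."""
    option = getattr(socket, "TCP_QUICKACK", None)
    if option is None:
        # The constant is missing on platforms that lack TCP_QUICKACK.
        return False
    try:
        sock.setsockopt(socket.IPPROTO_TCP, option, 1)
    except OSError:
        return False
    return True
```

Because the mode is not permanent, code that relies on it usually calls such a helper again after each receive rather than once per connection.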

Tune the Linux kernel

Linux TCP sysctls:

  • /proc/sys/net/ipv4/tcp_mem
  • /proc/sys/net/core/rmem_default and /proc/sys/net/core/rmem_max: The default and maximum size of the socket receive buffer, in bytes
  • /proc/sys/net/core/wmem_default and /proc/sys/net/core/wmem_max: The default and maximum size of the socket send buffer, in bytes
  • /proc/sys/net/core/optmem_max: The maximum amount of option memory buffers
  • net.ipv4.tcp_no_metrics_save
  • net.core.netdev_max_backlog: The maximum number of packets queued on the input side when an interface receives packets faster than the kernel can process them.
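The list above mixes /proc/sys paths and dotted sysctl names; both refer to the same files. As a small sketch for inspecting the current values before tuning them (Linux only; sysctl_path and read_sysctl are hypothetical helpers):

```python
from pathlib import Path


def sysctl_path(name: str) -> Path:
    # "net.core.rmem_max" maps to /proc/sys/net/core/rmem_max
    return Path("/proc/sys") / name.replace(".", "/")


def read_sysctl(name: str):
    """Return the value fields as a list of strings, or None if unreadable."""
    try:
        return sysctl_path(name).read_text().split()
    except OSError:
        return None


NAMES = [
    "net.ipv4.tcp_mem",
    "net.core.rmem_default",
    "net.core.rmem_max",
    "net.core.wmem_default",
    "net.core.wmem_max",
    "net.core.optmem_max",
    "net.ipv4.tcp_no_metrics_save",
    "net.core.netdev_max_backlog",
]


def current_settings():
    # Map each sysctl name to its current value (None when not available).
    return {name: read_sysctl(name) for name in NAMES}
```

Reading these files does not need special privileges; changing them does.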