import os
import random
import numpy as np
import torch
import fcntl
import time
import signal
import sys
Experimenting with `os.fork`

Getting some reps with `os.fork`, and observing different behaviors when using `os.fork` in a notebook environment versus the shell.
Background
In Lesson 10 of the fastai course (Part 2) we're introduced to `os.fork`, specifically in the context of random number generation. In this notebook I'll get some more reps working with `os.fork`.

In the Lesson, Jeremy shows how random number generation in different libraries is handled across parent and child processes, as shown below (using `seed` and `rand` as defined in the lesson):
rnd_state = None

def seed(a):
    global rnd_state
    a, x = divmod(a, 30268)
    a, y = divmod(a, 30306)
    a, z = divmod(a, 30322)
    rnd_state = int(x)+1, int(y)+1, int(z)+1

seed(457428938475)
rnd_state

(4976, 20238, 499)
def rand():
    global rnd_state
    x, y, z = rnd_state
    x = (171 * x) % 30269
    y = (172 * y) % 30307
    z = (170 * z) % 30323
    rnd_state = x, y, z
    return (x/30269 + y/30307 + z/30323) % 1.0
The from-scratch `rand` function generates the same random number in both parent and child processes because they share the same random state:
if os.fork(): print(f'In parent: {rand(), rnd_state}')
else:
    print(f'In child: {rand(), rnd_state}')
    os._exit(os.EX_OK)
In parent: (0.7645251082582081, (3364, 25938, 24184))
In child: (0.7645251082582081, (3364, 25938, 24184))
`torch` does the same:
if os.fork(): print(f'In parent: {torch.rand(1).item(), torch.get_rng_state().sum().item()}')
else:
    print(f'In child: {torch.rand(1).item(), torch.get_rng_state().sum().item()}')
    os._exit(os.EX_OK)
In parent: (0.0692816972732544, 325580)
In child: (0.0692816972732544, 325580)
As does NumPy:
if os.fork(): print(f'In parent: {np.random.rand(1)[0], np.random.get_state()[1].sum()}')
else:
    print(f'In child: {np.random.rand(1)[0], np.random.get_state()[1].sum()}')
    os._exit(os.EX_OK)
In child: (0.8234897720205184, 1375830894290)
In parent: (0.8234897720205184, 1375830894290)
The Python standard library generates different random numbers in the parent and the child, indicating that the random state has changed:
if os.fork(): print(f'In parent: {random.random(), sum(random.getstate()[1])}')
else:
    print(f'In child: {random.random(), sum(random.getstate()[1])}')
    os._exit(os.EX_OK)
In parent: (0.7978973512537335, 1327601590235)
In child: (0.5603922565589059, 1333438682830)
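As an aside, if you want guaranteed-divergent streams after forking, regardless of library, a common trick (my own sketch, not from the lesson; `os.fork` is Unix-only) is to reseed explicitly in the child, for example from its PID:

```python
import os
import random

random.seed(42)
pid = os.fork()
if pid:
    # parent keeps the stream seeded with 42
    print(f"parent: {random.random()}")
    os.waitpid(pid, 0)  # don't exit before the child has printed
else:
    # child reseeds from its own PID so its stream diverges from the parent's
    random.seed(os.getpid())
    print(f"child: {random.random()}")
    os._exit(os.EX_OK)
```

Since the child's seed depends on its PID, its first number differs from the parent's on essentially every run.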
Jeremy also mentioned in the video that there used to be a bug in fastai related to this `os.fork` behavior which resulted in incorrectly handling data augmentations across multiple processes. I poked around the fastai repo and found this issue and corresponding PR, which might have been the ones he was referring to. I'm not sure, but it did lead me down an interesting rabbit hole in the fastai repo and I learned a couple of new things that I'll share.
In the PR, they introduce the following line:

self.store = threading.local()

`self.store` is referenced throughout the PR, for example:
def set_state(self):
self.store.rand_r = random.uniform(0, 1)
self.store.rand_c = random.uniform(0, 1)
The corresponding GitHub issue linked to this StackOverflow post which talks about `threading.local()`. I didn't quite follow the post so I copy/pasted its text as a prompt to Claude and asked it to create an example to illustrate the core concepts of `threading.local`. It gave me the following example:
import threading
import multiprocessing
import time
import random
First, `threading.local` is instantiated as a global variable:

# Thread-local storage for threading module
thread_local = threading.local()
Next, we have a function that creates a worker. Claude defines a worker as follows (I found similar definitions with Google searches):
a unit of execution that performs a specific task or job. In the context of concurrent programming, a worker is typically implemented as either a thread or a process, depending on the chosen concurrency model.
`threading_worker` adds a `count` attribute to `thread_local` (if it doesn't have it already) or increments `count` by 1 if it exists.
def threading_worker(worker_id):
    if not hasattr(thread_local, 'count'):
        print(f'\n\tWorker {worker_id}: instantiating `count`')
        thread_local.count = 0
    thread_local.count += 1
    print(f"Threading: Worker {worker_id}, Count: {thread_local.count}\n")
    time.sleep(random.random())
To illustrate, we create 5 threads and pass `threading_worker` to each one. The result is that each worker has its own "private view" of the global `thread_local`, as exhibited by `thread_local.count` having the same value of `1` for every `worker_id`.
Finally, Claude explains that the purpose of `thread.join()` is to let each thread finish its work before control returns to the main thread. Note that the final print statement, `print("Threading example finished.")`, runs only after all threads finish executing.
def run_threading_example():
    threads = []
    for i in range(5):
        thread = threading.Thread(target=threading_worker, args=(i,))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    print("Threading example finished.")
It's interesting to note that each Worker instantiates `count` before adding `1` to it (as expected), but the order in which the threads instantiate `count` (0, 1, 2, 3, 4) is not the same as the order in which they add `1` (0, 1, 3, 4, 2; which I didn't expect).
run_threading_example()
Worker 0: instantiating `count`
Threading: Worker 0, Count: 1
Worker 1: instantiating `count`
Threading: Worker 1, Count: 1
Worker 2: instantiating `count`
Worker 3: instantiating `count`
Threading: Worker 3, Count: 1
Worker 4: instantiating `count`
Threading: Worker 4, Count: 1
Threading: Worker 2, Count: 1
Threading example finished.
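To sharpen the "private view" idea, here's a small contrast of my own (not part of Claude's example): attributes on a plain shared object are visible to all threads, while attributes set on a `threading.local` instance belong to the thread that set them.

```python
import threading

lock = threading.Lock()
shared = {'count': 0}        # one dict shared by every thread
local = threading.local()    # per-thread attribute storage

def worker():
    with lock:
        shared['count'] += 1                      # all threads mutate the same dict
    local.count = getattr(local, 'count', 0) + 1  # each thread starts from its own 0

threads = [threading.Thread(target=worker) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared['count'])          # 5: every thread incremented the shared counter
print(hasattr(local, 'count'))  # False: the main thread never set its own `count`
```

The main thread never sees the workers' `local.count` values at all, which is exactly why each worker printed `Count: 1` above.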
There is much to learn when it comes to threading and multiprocessing, but I’ll exit this rabbit hole for now.
The second thing I learned was this clever way to index into a tuple using a boolean expression:
@property
def multi_processing_context(self): return (None,multiprocessing)[self.num_workers>0]
I commented about this on Twitter and Jeremy replied:
Alternatively you can use `a if pred else b` btw. (Most people seem to hate both options ;) )

— Jeremy Howard (@jeremyphoward) September 28, 2024
Years back when I was getting into web development, one of the patterns in JavaScript I enjoyed was the ternary operator:

a = is_true ? val_if_true : val_if_false

Python doesn't have a dedicated `?:` operator (though, as Jeremy's tweet shows, the conditional expression `a if pred else b` fills the same role), so anytime I come across a concise way to execute logic using a boolean expression, I'm excited to see it.
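For reference, here are the tuple-indexing trick and Jeremy's `a if pred else b` side by side, using a placeholder string instead of the real `multiprocessing` module to keep the sketch self-contained:

```python
num_workers = 2

# tuple indexing: a bool is an int in Python (False == 0, True == 1)
ctx1 = (None, 'multiprocessing')[num_workers > 0]

# conditional expression: Python's equivalent of the ternary operator
ctx2 = 'multiprocessing' if num_workers > 0 else None

print(ctx1, ctx2)  # multiprocessing multiprocessing
```

One difference worth noting: the tuple form always evaluates both alternatives before indexing, while `a if pred else b` evaluates only the branch it takes.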
With that short interlude out of the way, I'll now dig in to `os.fork`.

os.fork Experiments
I prompted Claude to give me some examples using `os.fork` with the following prompt:

I want to better understand what `os.fork` does. What's a good set of experiments I can run to understand its functionality?

Claude responded with four experiments, which I'll run through next.
Basic os.fork() example
I’ll start with a definition from the “fork” Wikipedia page:
In computing, particularly in the context of the Unix operating system and its workalikes, fork is an operation whereby a process creates a copy of itself. It is an interface which is required for compliance with the POSIX and Single UNIX Specification standards. It is usually implemented as a C standard library wrapper to the fork, clone, or other system calls of the kernel. Fork is the primary method of process creation on Unix-like operating systems.
In multitasking operating systems, processes (running programs) need a way to create new processes, e.g. to run other programs. Fork and its variants are typically the only way of doing so in Unix-like systems. For a process to start the execution of a different program, it first forks to create a copy of itself. Then, the copy, called the “child process”, calls the exec system call to overlay itself with the other program: it ceases execution of its former program in favor of the other.
Next, I'll look at the docs before using these functions. `os.getpid` simply returns "the current process id", while `os.getppid` (which I'll use shortly) is defined as:

Return the parent's process id. When the parent process has exited, on Unix the id returned is the one of the init process (1), on Windows it is still the same id, which may be already reused by another process.
print(f"Main process PID: {os.getpid()}")
Main process PID: 436
Next, I'll call `os.fork`:

Fork a child process. Return 0 in the child and the child's process id in the parent. If an error occurs OSError is raised.

Note that some platforms including FreeBSD <= 6.3 and Cygwin have known issues when using fork() from a thread.
if os.fork(): print(f'In parent: {os.getpid()}')
else:
    print(f'In child: {os.getpid()}')
    os._exit(os.EX_OK)
In parent: 436
In child: 580
It's important to note that I took the above code straight from Lesson 10's 01_matmul.ipynb.
When I tried to run the following in Colab, the cell wouldn't execute and would just hang:

pid = os.fork()
When I tried to run that locally on my MacBook, I got the following error:
OSError: [Errno 9] Bad file descriptor
I found this StackOverflow post which talks about similar issues, and that `os.fork` doesn't play nice with Jupyter Notebooks. Claude also seemed to agree, recommending that I either use the `os._exit` approach from Lesson 10, or put my `os.fork`-related code in a separate `.py` script outside the notebook.
I asked Claude to rewrite the `os.fork` experiments using that if/else approach.

When I run the following code block, it's interesting to note that the child process runs before the parent process. I wondered if that means `os.fork` returned `0`? Claude says no:
The reason it might seem like the child process runs first is due to how process scheduling works in operating systems. When os.fork() is called, both the parent and child processes are ready to run, and the operating system’s scheduler decides which one to execute first. In this case, the child process got scheduled to run before the parent continued.
It adds the following context:
This behavior - where the child might run before the parent continues - is normal and expected in multi-process programming. It’s one of the reasons why synchronization mechanisms are often needed when working with multiple processes.
print(f"\nMain process PID: {os.getpid()}")
if os.fork():
print(f"\nIn parent: {os.getpid()}")
else:
print(f"\nIn child: {os.getpid()}, Parent PID: {os.getppid()}")
os._exit(os.EX_OK)
print(f"\nThis will be printed only by the parent process. PID: {os.getpid()}")
Main process PID: 436
In child: 853, Parent PID: 436
Main process PID: 436
In parent: 436
This will be printed only by the parent process. PID: 436
Memory Independence Example
The following example illustrates how “forked processes have independent memory spaces and that changes to variables in one process don’t affect the other process” as Claude states it.
The global `shared_variable` starts out with its global value of `0` in the child process, before `1` is added to it to give it a final value of `1` in the child. Meanwhile, in the parent process, its final value is `2`. This reminds me of the `threading.local` behavior.
shared_variable = 0

if os.fork():
    # Parent process
    shared_variable += 2
    print(f"\nIn parent: {os.getpid()}, shared_variable = {shared_variable}")
else:
    # Child process
    shared_variable += 1
    print(f"\nIn child: {os.getpid()}, shared_variable = {shared_variable}")
    os._exit(os.EX_OK)

print(f"Final shared_variable in parent: {shared_variable}")
In parent: 436, shared_variable = 2
Final shared_variable in parent: 2
In child: 902, shared_variable = 1
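The same independence holds for mutable objects, since fork duplicates the whole address space (copy-on-write on most systems). Here's a quick sketch of my own, using `os.waitpid` so the parent prints after the child has finished:

```python
import os

items = [0]

pid = os.fork()
if pid:
    items.append('parent')
    os.waitpid(pid, 0)              # let the child finish first
    print(f"parent sees: {items}")  # parent sees: [0, 'parent']
else:
    items.append('child')
    print(f"child sees: {items}")   # child sees: [0, 'child']
    os._exit(os.EX_OK)
```

Each process appends to its own copy of the list, so neither sees the other's element.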
File Descriptor Inheritance
Claude then provided the following code to illustrate how the parent and child processes can write different data to the same file. However, this code resulted in only the parent writing to the file:
with open("test.txt", "w") as f:
if os.fork():
# parent process
"Written by parent\n")
f.write(else:
# child process
"Written by child\n")
f.write(
os._exit(os.EX_OK)
# Run this after the script to see the contents:
print(open("test.txt", "r").read())
Written by parent
Claude then suggested using "file locking" and "flushing" to ensure the writes land before process execution ends, but this didn't help. Sometimes it wrote from both processes, sometimes just from one. I've illustrated both outcomes below:
def do_write():
    with open("test.txt", "w") as f:
        if os.fork():
            # parent process
            fcntl.flock(f, fcntl.LOCK_EX)
            f.write("Written by parent\n")
            f.flush()
            fcntl.flock(f, fcntl.LOCK_UN)
        else:
            # child process
            fcntl.flock(f, fcntl.LOCK_EX)
            f.write("Written by child\n")
            f.flush()
            fcntl.flock(f, fcntl.LOCK_UN)
            os._exit(os.EX_OK)

    # Run this after the script to see the contents:
    print(open("test.txt", "r").read())
do_write()
Written by parent
do_write()
Written by child
Written by parent
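One way to make the order deterministic without locks at all (my own sketch, not one of Claude's suggestions) is to fork first and have the parent `os.waitpid` for the child before appending, so the order is fixed by the child's exit rather than by the scheduler:

```python
import os

def do_write_wait():
    pid = os.fork()
    if pid == 0:
        # child: create/truncate the file and write first
        with open("test.txt", "w") as f:
            f.write("Written by child\n")
        os._exit(os.EX_OK)
    # parent: block until the child has exited, then append
    os.waitpid(pid, 0)
    with open("test.txt", "a") as f:
        f.write("Written by parent\n")
    with open("test.txt") as f:
        return f.read()

print(do_write_wait())
```

Each process opens the file itself, so there's no shared file offset, and the `with` blocks flush and close before the other process reads.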
I wanted something deterministic so I prompted Claude again. It responded with the following solution where "the child writes first and then signals the parent". A couple of things to note:

- The child sends a `SIGUSR1` signal to the parent PID. (SIGUSR1 stands for "User-defined signal 1".)
- Inside `parent_process`, the file is opened in "append mode".
def child_process(parent_pid):
    time.sleep(0.1)  # Small delay to ensure parent is waiting
    with open("test.txt", "w") as f:
        f.write("Written by child\n")
        f.flush()
    # this is where the child sends a signal to the parent
    os.kill(parent_pid, signal.SIGUSR1)
    os._exit(os.EX_OK)

def parent_process(signum, frame):
    with open("test.txt", "a") as f:  # notice the "a" for "append mode"
        f.write("Written by parent\n")
        f.flush()

def do_write2():
    signal.signal(signal.SIGUSR1, parent_process)
    parent_pid = os.getpid()

    if os.fork() == 0:
        child_process(parent_pid)
    else:
        # Wait for signal from child
        signal.pause()
        # Read and print the file contents
        with open("test.txt", "r") as f:
            res = f.read()
        return res
This works as expected! At least for the 1000 times that I ran it:
for _ in range(1000):
    res = do_write2()
    assert res == 'Written by child\nWritten by parent\n'
I noticed that `parent_process` is passed `signum` and `frame`. I asked Claude to define these:
- `signum`: This parameter represents the signal number that was caught. In this case, it will be `signal.SIGUSR1`, which is the signal sent by the child process to the parent. The `signum` allows the signal handler to identify which signal triggered it, which can be useful if the same handler is used for multiple signals.
- `frame`: This parameter is a frame object representing the stack frame of the interrupted code when the signal was received. It contains information about the program's execution state at the time the signal was caught, such as the current line number and local variables.
I'll print out `signum` and `frame` to see what they look like here:
def parent_process(signum, frame):
    print(signum, frame)
    with open("test.txt", "a") as f:  # notice the "a" for "append mode"
        f.write("Written by parent\n")
        f.flush()
`signum` has a value of `10` and `frame` has the additional information as Claude described.
do_write2()
10 <frame at 0x56005af01c30, file '<ipython-input-56-0f16beee5172>', line 22, code do_write2>
'Written by child\nWritten by parent\n'
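As a side note, the integer `signum` can be mapped back to a readable name via the `signal.Signals` enum (the value `10` is platform-specific; on macOS, for example, `SIGUSR1` is `30`):

```python
import signal

# signals are an IntEnum, so each one carries both a number and a name
print(int(signal.SIGUSR1))                  # 10 on Linux, 30 on macOS
print(signal.Signals(signal.SIGUSR1).name)  # SIGUSR1
```

This is handy when one handler services several signals and you want readable logs.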
Exit Status
Claude describes the following code as a way to illustrate how “the parent can wait for the child to finish and retrieve its exit status.” I added a couple of print statements to see more clearly that the parent process waits for the child process to exit.
Claude describes the `-1` in `os.waitpid(-1, 0)` as follows:

When `-1` is used as the first argument to `os.waitpid()`, it tells the function to wait for any child process to terminate.
The `0` in `os.waitpid(-1, 0)` is explained in the docs:
The semantics of the call are affected by the value of the integer options, which should be 0 for normal operation.
def do_exit():
    if os.fork():
        # Parent process
        print("Parent waiting...")
        child_pid, status = os.waitpid(-1, 0)
        print("Parent done waiting!")
        print(f"In parent: {os.getpid()}")
        print(f"Child process (PID {child_pid}) exited with status {os.WEXITSTATUS(status)}")
    else:
        # Child process
        print(f"In child: {os.getpid()}, exiting with status 5")
        os._exit(5)  # Use os._exit to avoid affecting the notebook process

    print(f"This will be printed only by the parent process. PID: {os.getpid()}")
However, when I run `do_exit` in the notebook, based on the child PIDs shown, it creates two different child processes (`4475` and `4448`):
do_exit()
In child: 4475, exiting with status 5Parent waiting...
Parent done waiting!
In parent: 436
Child process (PID 4448) exited with status 5
This will be printed only by the parent process. PID: 436
And note that the `do_exit` print statements don't always run in that order, indicating that the child process is not running first even though we have used `waitpid`:
do_exit()
Parent waiting...
Parent done waiting!
In parent: 436
Child process (PID 902) exited with status 0
This will be printed only by the parent process. PID: 436
In child: 1249, exiting with status 5
When I put that code into a `.py` file and run it from the shell, it behaves as expected (there is only one child process created, `5221`, and it runs first while the parent process waits):
!python3 do_exit.py
Parent waiting...
In child: 5221, exiting with status 5
Parent done waiting!
In parent: 5216
Child process (PID 5221) exited with status 5
This will be printed only by the parent process. PID: 5216
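One more detail I picked up from the docs: since Python 3.9, the raw `status` word returned by `os.waitpid` can also be decoded with `os.waitstatus_to_exitcode` instead of `os.WEXITSTATUS`. A minimal sketch:

```python
import os

pid = os.fork()
if pid == 0:
    os._exit(5)  # child exits immediately with status 5

_, status = os.waitpid(pid, 0)            # raw 16-bit status word
print(os.WEXITSTATUS(status))             # 5
print(os.waitstatus_to_exitcode(status))  # 5 (Python 3.9+)
```

Unlike `WEXITSTATUS`, `waitstatus_to_exitcode` also handles the signal-termination case (returning the negative signal number) rather than requiring a separate `WIFSIGNALED` check.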
Final Thoughts
Working with `os.fork` was tougher than I expected. I assumed it would be plug-and-play, but I encountered non-deterministic behavior, which seems to be common when working with multiple processes.
I also learned that `os.fork` behaves (or misbehaves) differently when running inside a notebook cell compared to running in the shell. For instance, executing `pid = os.fork()` in a notebook cell causes the execution to hang when trying to return the child's process ID, or spawns multiple child processes when using the `if os.fork(): ... else: ...` pattern.
There are some ways to make `os.fork` behave in a notebook environment, as we saw when synchronizing work between the child and parent by having the child signal the parent before both wrote to the same file.

Another key concept I observed was memory independence: even in a notebook environment, the parent and child processes have their own private access to global variables, allowing you to assign different values to the same variable in each process.
Future work: I want to run a similar set of experiments with the `multiprocessing` library, as I see it used more often (for example, in the fastai repo).
I hope you enjoyed this blog post. Follow me on Twitter @vishal_learner.