TP2 : batch scheduling

This practical session aims at practicing concepts related to batch scheduling algorithms. Similarily to TP1, it contains a simple simulation engine that replicates the behavior of jobs arriving at different times to run on a cluster or supercomputer. The activities to be made in this section are described in the activities section below.

Firstly, read the below sections carefully.

How To

Decompress the batch_scheduling.tgz archive. If you wish to check and install any missing python modules on your own machine, run python3 setup.py.
The activities require the ANL-Intrepid-2009-1.swf file obtained from the ANL Intrepid Log. The file is already provided here.
To run a simulation, try python3 replay.py fcfs 20000 10000. replay.py takes parameters from the command line and feeds them to the simulation engine.
To learn more about the code in the simulator, try using the help function in your Python3 interpreter. Example:

>>> import simulator.job
>>> help(simulator.job)

To check if the code you downloaded or changed is still working properly, try the following commands:

$ cd unitary_tests
$ ./test_fcfs.py

About the traces and simulation

The ANL Intrepid trace contains information about 68,936 jobs submitted to the Intrepid supercomputer (Argonne National Laboratory, USA) over 240 days of 2009. At the time, the machine of 40,960 quad-core nodes was among the world's 10 fastest supercomputers (https://www.top500.org/lists/2009/06/). About each job, the dataset contains (among other things) its submission time (when an user has asked the system to run it), number of requested nodes, amount of requested time, and its actual execution time.

The simulation engine uses this information to simulate the job scheduler operation: jobs are added to a queue when submitted, and a scheduling algorithm is used to decide when to remove jobs from the queue to run them.

At the end of the simulation, the makespan is reported, together with statistics on the wait times (time spent in the queue by the jobs while waiting to be scheduled) and on the usage of the machine.

To run the simulation, use the following command:

$ python3 replay.py [scheduling algorithm] [number of nodes] [number of jobs]

Activities

Basic steps

Run replay.py with the FCFS simulator with a few variations on the number of nodes and see how the reported metrics behave. Read the code of the FCFS scheduler and try to understand how to write your own algorithms.
Write the First Fit (ff) scheduling algorithm. Check if it passes the tests in the unitary tests file. Compare it to the FCFS scheduler for the different metrics and different numbers of nodes. Try to justify the most interesting changes.
Write the Shortest-Job First (sjf) scheduling algorithm. Check if it passes the tests in the unitary tests file. Compare it to the FF scheduler for the different metrics and different numbers of nodes. Try to understand their differences.

Additional challenge

Write the FCFS with EASY backfilling (fcfs_easy). Check if it passes the tests in the unitary tests file. Compare it to the FCFS and FF schedulers for the different metrics and different numbers of nodes. Try to understand their differences.
Write the FCFS with conservative backfilling (fcfs_conservative). Compare it to the FCFS with EASY backfilling for the different metrics and different numbers of nodes. Try to understand their differences.
Code a version of the backfilling algorithms above where we select the best task (the one that fits the best the hole) instead of the first one fitting as we usually do.
Write your own batch scheduling algorithm and compare it to the aforementioned algorithms. You can try: smallest-job-first, biggest-job-first (with/without backfilling), longest-job-first (with/without backfilling), best-fit that maximizes resource usage. You can also try your own scheduling algorithms, or some different backfilling strategies.

Credits

Francielli Zanon-Boito (INRIA Bordeaux) and Laercio Lima-Pilla (CNRS)