Introduction
https://slurm.schedmd.com/overview.html
Overview
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
Architecture
[Image: Slurm architecture diagram]
Primary/worker architecture: one primary controller (slurmctld) handles job management, and multiple nodes (each running slurmd) execute the compute tasks. The primary can have an optional backup.
Tutorial
https://slurm.schedmd.com/tutorials.html
Go straight to this document: https://www.open-mpi.org/video/slurm/Slurm_EMC_Dec2012.pdf
Concepts:
SLURM Entities
- Jobs: Resource allocation requests
- Job steps: Set of (typically parallel) tasks
- Partitions: Job queues with limits and access controls
- Nodes
  - NUMA boards
    - Sockets
      - Cores
        - Hyperthreads
  - Memory
  - Generic Resources (e.g. GPUs)
- Users submit jobs to a partition (queue)
- Jobs are allocated resources
- Jobs spawn steps, which are allocated resources from within the job's allocation
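The relationship between a job and its steps can be sketched as a small data model. This is only an illustration of the idea that steps draw from the job's own allocation; the `Job` class, its fields, and its method names are hypothetical and are not part of Slurm.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical model of a job: a granted allocation inside one partition."""
    job_id: int
    partition: str
    cpus: int                                   # total CPUs allocated to the job
    steps: dict = field(default_factory=dict)   # step id -> CPUs held by that step
    next_step_id: int = 0

    def free_cpus(self) -> int:
        """CPUs in the job's allocation not currently held by any step."""
        return self.cpus - sum(self.steps.values())

    def launch_step(self, cpus: int) -> int:
        """Carve a step out of the job's own allocation and return its id."""
        if cpus > self.free_cpus():
            raise RuntimeError("step does not fit inside the job's allocation")
        step_id = self.next_step_id
        self.next_step_id += 1
        self.steps[step_id] = cpus
        return step_id

    def finish_step(self, step_id: int) -> None:
        """Return the step's CPUs to the job's allocation."""
        del self.steps[step_id]


job = Job(job_id=1, partition="debug", cpus=8)
first = job.launch_step(4)
second = job.launch_step(4)
```

With both steps running, `job.free_cpus()` is 0 and a third `launch_step` call raises until one of the steps finishes.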
Job States
截屏2019-12-14下午1.57.32.png
Linux Job Launch Sequence
截屏2019-12-14下午3.23.13.png
Operations
Execution modes
- srun: Create a job allocation (if needed) and launch a job step (typically an MPI job)
- salloc: Create job allocation and start a shell to use it (interactive mode)
- sbatch: Submit script for later execution (batch mode)
- sattach: Connect stdin/out/err for an existing job or job step
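As a rough sketch of batch mode, the helper below assembles a script that `sbatch` could accept, with one `srun` line launching the job step inside the allocation. The function names, the job name, the partition `debug`, and the command `./my_mpi_app` are placeholders invented for illustration. The `#SBATCH` options used (`--job-name`, `--partition`, `--nodes`, `--ntasks`, `--time`) are standard, but check them against the local installation.

```python
import subprocess


def build_batch_script(job_name, partition, nodes, ntasks, time_limit, command):
    """Return the text of a batch script; nothing is submitted here."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={time_limit}",
        "",
        # srun launches the job step within the allocation sbatch created
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


def submit(script_text):
    """Pipe the script to sbatch on stdin; requires a working Slurm install."""
    result = subprocess.run(
        ["sbatch"], input=script_text, text=True,
        capture_output=True, check=True,
    )
    return result.stdout.strip()


script = build_batch_script("demo", "debug", 2, 8, "00:10:00", "./my_mpi_app")
```

Only `build_batch_script` runs without a cluster; `submit` is meant to be called on a machine where `sbatch` is on the PATH.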
Other commands
- sinfo
- squeue
- smap
- sbcast
- scancel
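These commands are easy to script around. As a minimal sketch, the code below summarizes queue contents from text in the shape produced by `squeue -h -o "%i %P %T"` (job id, partition, state, no header). The sample text is made up for illustration, and the function names are hypothetical.

```python
from collections import Counter


def parse_squeue(text):
    """Parse lines of 'jobid partition state' into a list of dicts."""
    jobs = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        job_id, partition, state = parts
        jobs.append({"job_id": job_id, "partition": partition, "state": state})
    return jobs


def count_by_state(jobs):
    """Count how many jobs are in each state."""
    return Counter(job["state"] for job in jobs)


# Made-up sample standing in for real squeue output
sample = "101 debug RUNNING\n102 debug PENDING\n103 batch PENDING\n"
jobs = parse_squeue(sample)
summary = count_by_state(jobs)
```

On a real cluster the text would come from running `squeue` through `subprocess`; job ids are kept as strings because array jobs use ids that are not plain integers.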
MPI support
- Many different MPI implementations are supported:
- MPICH1, MPICH2, MVAPICH, OpenMPI, etc.
- Many use srun to launch the tasks directly
- Some use “mpirun” or another tool within an existing SLURM allocation (they reference SLURM environment variables to determine what resources are allocated to the job)
- Details are online:
http://www.schedmd.com/slurmdocs/mpi_guide.html
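To illustrate how a launcher or task can reference SLURM environment variables, here is a minimal sketch that reads a few of them. `SLURM_JOB_ID`, `SLURM_NTASKS`, `SLURM_PROCID`, and `SLURM_JOB_NODELIST` are variables Slurm sets for jobs and tasks; the function name, the returned dictionary layout, and the fallback defaults are assumptions made for this example.

```python
import os


def slurm_task_info(env=None):
    """Collect basic allocation facts from SLURM environment variables.

    Falls back to single-task defaults when run outside Slurm.
    """
    env = os.environ if env is None else env
    return {
        "job_id": env.get("SLURM_JOB_ID"),              # None outside a job
        "ntasks": int(env.get("SLURM_NTASKS", "1")),    # total tasks in the job
        "rank": int(env.get("SLURM_PROCID", "0")),      # this task's index
        "nodelist": env.get("SLURM_JOB_NODELIST", ""),  # compact host list
    }


# Fake environment for illustration; under srun these values come from Slurm
fake_env = {
    "SLURM_JOB_ID": "4242",
    "SLURM_NTASKS": "8",
    "SLURM_PROCID": "3",
    "SLURM_JOB_NODELIST": "node[01-02]",
}
info = slurm_task_info(fake_env)
```

Passing the environment as a parameter keeps the function testable on a machine without Slurm.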
Release cadence worth borrowing
Continuous integration, with usable features released on a regular schedule.
- New minor release about every 9 months
- 2.4.x June 2012
- 2.5.x December 2012
- Micro releases with bug fixes about once each month
Build and installation
Slurm ships with its own test suite, which can be used for regression verification after installation.
2019.12.14: Finished reading the tutorial.


