Introduction
This is a special node (n060) with a separate batch system.
Usage
To log in to the server, you will need to ssh into either of the cluster login nodes (gowonda or gowonda2).
Once logged in, you may ssh to n060 or submit the PBS job from the login node itself.
To list all the applications available for this node, run:
module avail
To load one of them, for example:
module load gcc/8.2.0
There are two queues, namely dljun and dlyao. All other queues are disabled. Use this syntax in your script:
#PBS -q dljun@n060
or
#PBS -q dlyao@n060
Users of queue dljun must satisfy the following constraints:
be listed in the group "deeplearning" and must use this for accounting in their scripts: #PBS -W group_list=deeplearning -A deeplearning
Users of queue dlyao must satisfy the following constraints:
be listed in the group "aspen" and must use this for accounting in their scripts: #PBS -W group_list=aspen -A aspen
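The queue and group pairing above can be wrapped in a small helper so the options always match. This is only a sketch: the function names queue_group, qsub_flags and check_membership are made up for illustration and are not part of PBS or of this cluster's tooling.

```shell
# Hypothetical helpers; the queue-to-group mapping comes from the constraints above.

# queue_group QUEUE: print the group that the given queue requires.
queue_group() {
    case "$1" in
        dljun) echo "deeplearning" ;;
        dlyao) echo "aspen" ;;
        *)     echo "unknown queue: $1" >&2; return 1 ;;
    esac
}

# qsub_flags QUEUE: print the queue and accounting options for qsub.
qsub_flags() {
    local group
    group=$(queue_group "$1") || return 1
    echo "-q $1@n060 -W group_list=$group -A $group"
}

# check_membership QUEUE: succeed only if the current user is in the required group.
check_membership() {
    local group
    group=$(queue_group "$1") || return 1
    if id -nG | tr ' ' '\n' | grep -qx "$group"; then
        echo "ok: $(id -un) is in $group"
    else
        echo "warning: $(id -un) is not in $group" >&2
        return 1
    fi
}

qsub_flags dljun
```

The last line prints the options for dljun; they could then be passed on the command line, for example: qsub $(qsub_flags dljun) myjob.pbs (myjob.pbs being a placeholder script name).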
Special note
There is a space on /lscratch for each user. This is a fast SSD and hence it would be advantageous to copy the data to this folder and run the job from it.
As the home directory is shared across all nodes, you can transfer files to gowonda first and they will appear in your home directory on all nodes, including n060. The local scratch on n060 is not shared with gowonda, so to use it, move the folder or files from your home directory to your /lscratch/snumber directory. For performance, it is best to use this local scratch space for all computation. For example, on n060, run:
mv mydataFolder /lscratch/snumber
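The staging step can also be scripted. The sketch below uses a made-up function name, stage_data, and it copies with cp rather than moving with mv as in the example above, so the original stays in the home directory. Treat both choices as assumptions to adapt.

```shell
# stage_data SRC DEST_ROOT
# Hypothetical helper: copy SRC (file or folder) into DEST_ROOT and print the new path.
stage_data() {
    local src dest_root
    src=$1
    dest_root=$2
    [ -e "$src" ] || { echo "no such file or directory: $src" >&2; return 1; }
    mkdir -p "$dest_root" || return 1
    cp -r "$src" "$dest_root"/ || return 1
    echo "$dest_root/$(basename "$src")"
}

# Example use on n060 (replace snumber with your own user number):
#   workdir=$(stage_data "$HOME/mydataFolder" /lscratch/snumber) && cd "$workdir"
```

Copying costs extra time and disk space compared with mv, but it leaves a copy on the shared home directory if the local scratch is ever cleaned.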
- There are 5 GPUs for the dljun queue and 1 GPU for the dlyao queue. To use them, set the attribute ngpus=1 together with the queue name (see the sample pbs scripts below). All jobs are queued and run on that queue when a resource becomes available.
- There is a space /project/deeplearning for all members of the group "deeplearning", and a space /project/aspen for all members of the group "aspen".
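As a rough illustration of the GPU limits, the sketch below prints the two directives for a GPU job and refuses a request above the per-queue limit (5 for dljun, 1 for dlyao). The function names max_gpus and gpu_request are hypothetical, and the ncpus=1 and mem=12gb values are simply copied from the sample scripts on this page.

```shell
# max_gpus QUEUE: print the per-queue GPU limit quoted on this page.
max_gpus() {
    case "$1" in
        dljun) echo 5 ;;
        dlyao) echo 1 ;;
        *)     return 1 ;;
    esac
}

# gpu_request QUEUE NGPUS: print the PBS directives, or fail if NGPUS is out of range.
gpu_request() {
    local queue ngpus limit
    queue=$1
    ngpus=$2
    limit=$(max_gpus "$queue") || { echo "unknown queue: $queue" >&2; return 1; }
    if [ "$ngpus" -lt 1 ] || [ "$ngpus" -gt "$limit" ]; then
        echo "queue $queue allows 1 to $limit GPUs, requested $ngpus" >&2
        return 1
    fi
    echo "#PBS -q ${queue}@n060"
    echo "#PBS -l select=1:ncpus=1:ngpus=${ngpus}:mem=12gb"
}

gpu_request dljun 2
```

The printed lines can be pasted into a job script. PBS applies its own limits at submission time, so this check is only a convenience.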
qstat -q

server: n060

Queue            Memory CPU Time Walltime Node   Run   Que   Lm  State
---------------- ------ -------- -------- ---- ----- ----- ----  -----
workq              --      --       --     --      0     0   --   D R
dl                 --      --    100:00:0    1     0     0   --   D R
dljun              --      --    300:00:0    1     0     0   --   E R
dlyao              --      --    300:00:0    1     0     0   --   E R
                                               ----- -----
                                                   0     0
Quick run for testing
qsub -q dljun@n060 -W group_list=deeplearning -A deeplearning -- /bin/date
qstat -1an @n060
Sample Interactive PBS run
qsub -I -q dljun@n060 -W group_list=deeplearning -A deeplearning
OR
qsub -I -q dlyao@n060 -W group_list=aspen -A aspen
Sample pbs script: To use in queue dljun
cat sample.pbs.script-dljun

#!/bin/bash -l
#PBS -m abe
## Mail to user
#PBS -M YourEmail@griffith.edu.au
## Job name
#PBS -N JunTest
#PBS -q dljun@n060
## To run on the queue named dlyao instead, use: -q dlyao@n060
## Other options: -q dlyao@n060 or -q workq@n060
#PBS -W group_list=deeplearning -A deeplearning
## Other option: -W group_list=aspen -A aspen
## Number of chunks : CPUs : GPUs : memory, followed by the walltime
#PBS -l select=1:ncpus=1:ngpus=1:mem=12gb,walltime=100:00:00
#PBS -j oe
## Add current shell environment to job (comment out if not needed)
#PBS -V

# The job's working directory
echo "Working directory is $PBS_O_WORKDIR"
cd "$PBS_O_WORKDIR"
source "$HOME/.bashrc"
module list
echo "Starting job"
echo "Running on host $(hostname)"
echo "Time is $(date)"
echo "Directory is $(pwd)"
# Show the GPU status as seen by the job
gpustat
nvidia-smi
echo "Done with job"
Another Sample pbs script: To use in queue dlyao
#!/bin/bash -l
#PBS -m abe
## Mail to user
#PBS -M YOURNAME@griffith.edu.au
## Job name
#PBS -N YaoJobMyName
#PBS -q dlyao@n060
## Other options: -q dlyao@n060 or -q workq@n060
#PBS -W group_list=aspen -A aspen
## Other option: -W group_list=deeplearning -A deeplearning
## Number of chunks : CPUs : GPUs : memory, followed by the walltime
#PBS -l select=1:ncpus=1:ngpus=1:mem=12gb,walltime=100:00:00
#PBS -j oe
## Add current shell environment to job (comment out if not needed)
#PBS -V

# The job's working directory
echo "Working directory is $PBS_O_WORKDIR"
cd "$PBS_O_WORKDIR"
source "$HOME/.bashrc"
module list
echo "Starting job"
echo "Running on host $(hostname)"
echo "Time is $(date)"
echo "Directory is $(pwd)"
# Show the GPU status as seen by the job
gpustat
nvidia-smi
sleep 100
echo "Done with job"
Specifications
Hardware: HPE ProLiant XL270d Gen10 Node CTO server
The OS is CentOS 7.6 and the batch system is PBS 18.2.
This node has 6 NVIDIA GPU cards (HPE NVIDIA Tesla V100-32GB PCIe).
nvidia-smi

Wed Dec 12 08:28:49 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:14:00.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:15:00.0 Off |                    0 |
| N/A   33C    P0    25W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:39:00.0 Off |                    0 |
| N/A   33C    P0    25W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:3A:00.0 Off |                    0 |
| N/A   33C    P0    28W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   34C    P0    27W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   33C    P0    26W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

gpustat

n060.default.domain  Wed Dec 12 08:29:10 2018
[0] Tesla V100-PCIE-32GB | 32'C,   0 % |     0 / 32480 MB |
[1] Tesla V100-PCIE-32GB | 33'C,   0 % |     0 / 32480 MB |
[2] Tesla V100-PCIE-32GB | 33'C,   0 % |     0 / 32480 MB |
[3] Tesla V100-PCIE-32GB | 33'C,   0 % |     0 / 32480 MB |
[4] Tesla V100-PCIE-32GB | 34'C,   0 % |     0 / 32480 MB |
[5] Tesla V100-PCIE-32GB | 33'C,   0 % |     0 / 32480 MB |

These are the specs for each GPU card:

GPU Architecture               NVIDIA Volta
NVIDIA Tensor Cores            640
NVIDIA CUDA® Cores             5,120
Double-Precision Performance   7 TFLOPS
Single-Precision Performance   14 TFLOPS
Tensor Performance             112 TFLOPS
GPU Memory                     32 GB HBM2
Memory Bandwidth               900 GB/sec
ECC                            Yes
Interconnect Bandwidth         32 GB/sec
System Interface               PCIe Gen3
Form Factor                    PCIe Full Height/Length
Max Power Consumption          250 W
Thermal Solution               Passive
Compute APIs                   CUDA, DirectCompute, OpenCL™, OpenACC
Configuration
pbs node configuration
Qmgr: p n n060
#
# Create nodes and set their properties.
#
#
# Create and define node n060
#
create node n060 Mom=n060.default.domain
set node n060 state = free
set node n060 resources_available.arch = linux
set node n060 resources_available.host = n060
set node n060 resources_available.mem = 197554184kb
set node n060 resources_available.ncpus = 70
set node n060 resources_available.ngpus = 6
set node n060 resources_available.vnode = n060
set node n060 resv_enable = True
pbs queue configuration
Qmgr: p q dljun
#
# Create queues and set their attributes.
#
#
# Create and define queue dljun
#
create queue dljun
set queue dljun queue_type = Execution
set queue dljun Priority = 20
set queue dljun acl_user_enable = True
set queue dljun acl_users = redacted
set queue dljun acl_users += redacted
<redacted>
set queue dljun resources_max.ncpus = 56
set queue dljun resources_max.ngpus = 5
set queue dljun resources_max.nodect = 1
set queue dljun resources_max.walltime = 300:00:00
set queue dljun resources_default.ncpus = 1
set queue dljun resources_default.nodect = 1
set queue dljun resources_default.nodes = 1
set queue dljun resources_default.walltime = 100:00:00
set queue dljun acl_group_enable = True
set queue dljun acl_groups = deeplearning
set queue dljun enabled = True
set queue dljun started = True
#
# Create and define queue dlyao
#
create queue dlyao
set queue dlyao queue_type = Execution
set queue dlyao Priority = 20
set queue dlyao acl_user_enable = True
set queue dlyao acl_users = redacted
set queue dlyao acl_users += redacted
<redacted>
set queue dlyao resources_max.ncpus = 12
set queue dlyao resources_max.ngpus = 1
set queue dlyao resources_max.nodect = 1
set queue dlyao resources_max.walltime = 300:00:00
set queue dlyao resources_default.ncpus = 1
set queue dlyao resources_default.nodect = 1
set queue dlyao resources_default.nodes = 1
set queue dlyao resources_default.walltime = 100:00:00
set queue dlyao acl_group_enable = True
set queue dlyao acl_groups = aspen
set queue dlyao enabled = True
set queue dlyao started = True
Installed Applications
"module avail" will list the currently installed applications, e.g.:
module load anaconda/5.3.1py3
conda info --envs
source activate keras
pip install soundfile
Q&A
Qs: It seems some modules like tensorflow, keras, theano and pytorch are missing.
Answer: All these modules are installed. Usage is:
module load anaconda/5.3.1py3
Are you familiar with anaconda usage? Useful commands:
conda info --envs
conda list
conda activate
#
# To activate "accelerate" environment, use:
# source activate accelerate
#
# To deactivate an active environment, use:
# source deactivate
Qs: Regarding the output, there are some print lines in my code that help me monitor how my program is working, such as the model error. Is there any way to see this kind of live output on the terminal or in log files while the job is being processed by the cluster?
Ans: There are a few ways of doing this.
1. You may run an interactive pbs job with the "-I" option. For example:
qsub -I -q dljun@n060 -W group_list=deeplearning -A deeplearning -l select=1:ncpus=1:ngpus=1:mem=12gb,walltime=100:00:00
After this you will be given a shell, where you can run your commands:
module load anaconda/5.3.1py3
module load cuda/10.0
source activate tensorflow-gpu
python3 /export/home/s5108500/lscratch/Nick/DeepModels/keypoints/baseline_main.py
2. Alternatively, submit the job, then run the script named watch_jobs.sh. It will ask for the compute node name and the pbs job number, and basically runs this command:
tail -f /var/spool/pbs/spool/$JOBNO.n060.*
e.g:
sh watch_jobs.sh

n060:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
58.n060         s2594054 dljun    IndyTestDL  45304   1   1   12gb 100:0 R 00:11
   n060/0
===========================
Please enter Node Number e.g: n060
n060
Please enter Job number e.g 9066
58
===========================
|   5  Tesla V100-PCIE...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   33C    P0    26W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
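For readers without access to watch_jobs.sh, the sketch below shows one way the tail command it wraps might be put together. It is not the actual script: the function names spool_glob and watch_job are invented, and the default node and spool directory are just the values quoted above.

```shell
# spool_glob JOBNO [NODE] [SPOOL_DIR]
# Print the glob that matches the spooled output files of a running job.
spool_glob() {
    local jobno node dir
    jobno=$1
    node=${2:-n060}
    dir=${3:-/var/spool/pbs/spool}
    case "$jobno" in
        ''|*[!0-9]*) echo "job number must be numeric: $jobno" >&2; return 1 ;;
    esac
    echo "$dir/$jobno.$node.*"
}

# watch_job JOBNO [NODE] [SPOOL_DIR]
# Follow the job output as it is written; run this on the compute node itself.
watch_job() {
    local pattern
    pattern=$(spool_glob "$@") || return 1
    # Left unquoted on purpose so the shell expands the glob into file names.
    tail -f $pattern
}

# Example: watch_job 58
```

With the defaults, watch_job 58 runs the same tail -f command quoted in the answer for job 58. The spool files only exist while the job is running and may not be readable by every user, depending on site configuration.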