Introduction
This is a special node (n060) with a separate batching system.
Usage
To log in to the server, you will need to ssh into either of the cluster login nodes (gowonda or gowonda2).
Once logged in, you may ssh to n060 or run the pbs job directly from the login nodes.
To list all the applications available for this node: module avail
e.g: module load gcc/8.2.0
There are two queues, namely dljun and dlyao. All other queues are disabled. This syntax should be used: #PBS -q dljun@n060 or #PBS -q dlyao@n060
Users of queue dljun must satisfy the following constraints:
be listed in the group "deeplearning" and must use this for accounting in their scripts: #PBS -W group_list=deeplearning -A deeplearning
Users of queue dlyao must satisfy the following constraints:
be listed in the group "aspen" and must use this for accounting in their scripts: #PBS -W group_list=aspen -A aspen
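The queue and accounting rules above can be captured in a small helper. This is only a sketch: the function name pbs_header_lines and the dictionary are illustrative and not part of any site tooling; the queue and group names are the ones given above.

```python
# Queue -> accounting group, taken from the constraints listed above.
QUEUE_GROUPS = {"dljun": "deeplearning", "dlyao": "aspen"}


def pbs_header_lines(queue, node="n060"):
    """Return the #PBS queue and accounting directives for a queue on n060."""
    if queue not in QUEUE_GROUPS:
        raise ValueError(f"unknown or disabled queue: {queue!r}")
    group = QUEUE_GROUPS[queue]
    return [
        f"#PBS -q {queue}@{node}",
        f"#PBS -W group_list={group} -A {group}",
    ]
```

Calling pbs_header_lines("dlyao") gives the two directive lines to paste near the top of a job script for the aspen group.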
Special note
There is a space on /lscratch for each user. This is a fast SSD and hence it would be advantageous to copy the data to this folder and run the job from it.
As the home directory is shared across all nodes, you can transfer files to gowonda first and they will appear in your home directory on all nodes, including n060. If you need to use the local scratch on n060 (it is not shared with gowonda), move the folder or files from your home directory to /lscratch/snumber. For performance, it is best to use the /lscratch space for all computation.
e.g: on n060, run this: mv mydataFolder /lscratch/snumber
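If you prefer to stage data from inside a Python script, the step can be sketched as follows. It copies rather than moves, so the original stays in the home directory. The function name stage_to_scratch is hypothetical, and the assumption is that your scratch folder is /lscratch/snumber as described above.

```python
import shutil
from pathlib import Path


def stage_to_scratch(src, scratch_dir):
    """Copy the folder `src` into `scratch_dir` and return the new location."""
    src = Path(src)
    dest = Path(scratch_dir) / src.name
    if dest.exists():
        # Refuse to overwrite an earlier copy; remove it or choose another name.
        raise FileExistsError(f"{dest} already exists")
    # copytree creates any missing parent directories of dest.
    shutil.copytree(src, dest)
    return dest


# Example (run on n060; replace snumber with your own login):
#   stage_to_scratch("mydataFolder", "/lscratch/snumber")
```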
- There are 5 GPUs for the dljun queue and 1 GPU for the dlyao queue. To request a GPU, use the attribute ngpus=1 together with the queue name (see the sample pbs scripts below). All jobs are queued and run when a resource becomes available on that queue.
- There is a space /project/deeplearning for all members of the group "deeplearning". And there is a space /project/aspen for all members of group "aspen"
qstat -q

server: n060

Queue            Memory CPU Time Walltime Node   Run   Que   Lm  State
---------------- ------ -------- -------- ---- ----- ----- ----  -----
workq              --      --       --     --      0     0   --   D R
dl                 --      --    100:00:0    1     0     0   --   D R
dljun              --      --    300:00:0    1     0     0   --   E R
dlyao              --      --    300:00:0    1     0     0   --   E R
                                               ----- -----
                                                   0     0
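In the State column, "E R" means the queue is enabled and started, while "D R" means it is disabled. A minimal sketch of reading that listing from a script is shown here; the function name enabled_queues is hypothetical and the parser assumes the column layout of the listing above.

```python
SAMPLE_QSTAT_Q = """\
server: n060

Queue            Memory CPU Time Walltime Node   Run   Que   Lm  State
---------------- ------ -------- -------- ---- ----- ----- ----  -----
workq              --      --       --     --      0     0   --   D R
dl                 --      --    100:00:0    1     0     0   --   D R
dljun              --      --    300:00:0    1     0     0   --   E R
dlyao              --      --    300:00:0    1     0     0   --   E R
                                               ----- -----
                                                   0     0
"""


def enabled_queues(qstat_q_output):
    """Return the names of queues whose state is 'E R' (enabled and started)."""
    names = []
    for line in qstat_q_output.splitlines():
        fields = line.split()
        # Queue rows have 10 whitespace-separated fields; the last two are the state.
        if len(fields) == 10 and fields[-2:] == ["E", "R"]:
            names.append(fields[0])
    return names


print(enabled_queues(SAMPLE_QSTAT_Q))
```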
Quick run for testing
qsub -q dljun@n060 -W group_list=deeplearning -A deeplearning -- /bin/date
qstat -1an @n060
Sample Interactive PBS run
qsub -I -q dljun@n060 -W group_list=deeplearning -A deeplearning
OR
qsub -I -q dlyao@n060 -W group_list=aspen -A aspen
Sample pbs script: To use in queue dljun
cat sample.pbs.script-dljun
#!/bin/bash -l
#PBS -m abe
## Mail to user
#PBS -M YourEmail@griffith.edu.au
## Job name
#PBS -N JunTest
#PBS -q dljun@n060
#### Other queue options: #PBS -q dlyao@n060 or #PBS -q workq@n060
#PBS -W group_list=deeplearning -A deeplearning
### Other accounting option: group_list=aspen -A aspen
### Number of nodes:Number of CPUs:Number of GPUs:Memory, then walltime
#PBS -l select=1:ncpus=1:ngpus=1:mem=12gb,walltime=100:00:00
#PBS -j oe
### Add current shell environment to job (comment out if not needed)
#PBS -V
# The job's working directory
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
source $HOME/.bashrc
module list
echo "Starting job"
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
gpustat
nvidia-smi
echo "Done with job"
Another Sample pbs script: To use in queue dlyao
#!/bin/bash -l
#PBS -m abe
## Mail to user
#PBS -M YOURNAME@griffith.edu.au
## Job name
#PBS -N YaoJobMyName
#PBS -q dlyao@n060
#### Other queue options: #PBS -q dljun@n060 or #PBS -q workq@n060
#PBS -W group_list=aspen -A aspen
### Other accounting option: #PBS -W group_list=deeplearning -A deeplearning
### Number of nodes:Number of CPUs:Number of GPUs:Memory, then walltime
#PBS -l select=1:ncpus=1:ngpus=1:mem=12gb,walltime=100:00:00
#PBS -j oe
### Add current shell environment to job (comment out if not needed)
#PBS -V
# The job's working directory
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
source $HOME/.bashrc
module list
echo "Starting job"
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
gpustat
nvidia-smi
sleep 100
echo "Done with job"
Specifications
Hardware: HPE ProLiant XL270d Gen10 Node CTO server
The OS is CentOS 7.6 and the batching system is PBS 18.2.
This node has 6 NVIDIA GPU cards (HPE NVIDIA Tesla V100-32GB PCIe).
nvidia-smi
Wed Dec 12 08:28:49 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:14:00.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:15:00.0 Off |                    0 |
| N/A   33C    P0    25W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:39:00.0 Off |                    0 |
| N/A   33C    P0    25W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:3A:00.0 Off |                    0 |
| N/A   33C    P0    28W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   34C    P0    27W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   33C    P0    26W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

gpustat
n060.default.domain  Wed Dec 12 08:29:10 2018
[0] Tesla V100-PCIE-32GB | 32'C,   0 % |     0 / 32480 MB |
[1] Tesla V100-PCIE-32GB | 33'C,   0 % |     0 / 32480 MB |
[2] Tesla V100-PCIE-32GB | 33'C,   0 % |     0 / 32480 MB |
[3] Tesla V100-PCIE-32GB | 33'C,   0 % |     0 / 32480 MB |
[4] Tesla V100-PCIE-32GB | 34'C,   0 % |     0 / 32480 MB |
[5] Tesla V100-PCIE-32GB | 33'C,   0 % |     0 / 32480 MB |

These are the specs for each GPU card:
GPU Architecture: NVIDIA Volta
NVIDIA Tensor Cores: 640
NVIDIA CUDA® Cores: 5,120
Double-Precision Performance: 7 TFLOPS
Single-Precision Performance: 14 TFLOPS
Tensor Performance: 112 TFLOPS
GPU Memory: 32 GB HBM2
Memory Bandwidth: 900 GB/sec
ECC: Yes
Interconnect Bandwidth*: 32 GB/sec
System Interface: PCIe Gen3
Form Factor: PCIe Full Height/Length
Max Power Consumption: 250 W
Thermal Solution: Passive
Compute APIs: CUDA, DirectCompute, OpenCL™, OpenACC
Configuration
pbs node configuration
Qmgr: p n n060
#
# Create nodes and set their properties.
#
#
# Create and define node n060
#
create node n060 Mom=n060.default.domain
set node n060 state = free
set node n060 resources_available.arch = linux
set node n060 resources_available.host = n060
set node n060 resources_available.mem = 197554184kb
set node n060 resources_available.ncpus = 70
set node n060 resources_available.ngpus = 6
set node n060 resources_available.vnode = n060
set node n060 resv_enable = True
pbs queue configuration
Qmgr: p q dljun
#
# Create queues and set their attributes.
#
#
# Create and define queue dljun
#
create queue dljun
set queue dljun queue_type = Execution
set queue dljun Priority = 20
set queue dljun acl_user_enable = True
set queue dljun acl_users = redacted
set queue dljun acl_users += redacted <redacted>
set queue dljun resources_max.ncpus = 56
set queue dljun resources_max.ngpus = 5
set queue dljun resources_max.nodect = 1
set queue dljun resources_max.walltime = 300:00:00
set queue dljun resources_default.ncpus = 1
set queue dljun resources_default.nodect = 1
set queue dljun resources_default.nodes = 1
set queue dljun resources_default.walltime = 100:00:00
set queue dljun acl_group_enable = True
set queue dljun acl_groups = deeplearning
set queue dljun enabled = True
set queue dljun started = True
#
# Create and define queue dlyao
#
create queue dlyao
set queue dlyao queue_type = Execution
set queue dlyao Priority = 20
set queue dlyao acl_user_enable = True
set queue dlyao acl_users = redacted
set queue dlyao acl_users += redacted <redacted>
set queue dlyao resources_max.ncpus = 12
set queue dlyao resources_max.ngpus = 1
set queue dlyao resources_max.nodect = 1
set queue dlyao resources_max.walltime = 300:00:00
set queue dlyao resources_default.ncpus = 1
set queue dlyao resources_default.nodect = 1
set queue dlyao resources_default.nodes = 1
set queue dlyao resources_default.walltime = 100:00:00
set queue dlyao acl_group_enable = True
set queue dlyao acl_groups = aspen
set queue dlyao enabled = True
set queue dlyao started = True
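The resources_max values above cap what a single job may request. The sketch below checks a request against those caps before building the "#PBS -l" line; it is illustrative only, the name select_line is hypothetical, and PBS itself remains the authority on whether a job is accepted.

```python
# Per-queue maxima copied from the resources_max settings above.
QUEUE_LIMITS = {
    "dljun": {"ncpus": 56, "ngpus": 5, "walltime": "300:00:00"},
    "dlyao": {"ncpus": 12, "ngpus": 1, "walltime": "300:00:00"},
}


def _seconds(walltime):
    """Convert an hh:mm:ss walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def select_line(queue, ncpus=1, ngpus=1, mem="12gb", walltime="100:00:00"):
    """Build the '#PBS -l' line, rejecting requests above the queue maxima."""
    limits = QUEUE_LIMITS[queue]
    for name, value in (("ncpus", ncpus), ("ngpus", ngpus)):
        if value > limits[name]:
            raise ValueError(f"{name}={value} exceeds {queue} maximum of {limits[name]}")
    if _seconds(walltime) > _seconds(limits["walltime"]):
        raise ValueError(f"walltime={walltime} exceeds {queue} maximum of {limits['walltime']}")
    return f"#PBS -l select=1:ncpus={ncpus}:ngpus={ngpus}:mem={mem},walltime={walltime}"
```

With the defaults, select_line("dljun") reproduces the resource line used in the sample scripts on this page.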
Installed Applications
"module avail" will list the currently installed applications, e.g:
module load anaconda/5.3.1py3
conda info --envs
source activate keras
pip install soundfile
Q&A
Qs: It seems some modules like tensorflow, keras, theano, pytorch are missing.
Answer: All these modules are installed. Usage is:
module load anaconda/5.3.1py3
Are you familiar with anaconda usage? Useful commands:
conda info --envs
conda list
conda activate
#
# To activate "accelerate" environment, use:
# source activate accelerate
#
# To deactivate an active environment, use:
# source deactivate

Qs: Regarding the output, there are some print lines in my code that help me to monitor how my program is working, like the error of the model and so on. Is there any way to see this kind of online output on the terminal or in log files while the job is being processed by the cluster?
Ans: There are a few ways of doing this.
1. You may run an interactive pbs job with the "-I" option. For example:
qsub -I -q dljun@n060 -W group_list=deeplearning -A deeplearning -l select=1:ncpus=1:ngpus=1:mem=12gb,walltime=100:00:00
After this you will be given a shell and then you can run your commands:
module load anaconda/5.3.1py3
module load cuda/10.0
source activate tensorflow-gpu
python3 /export/home/s5108500/lscratch/Nick/DeepModels/keypoints/baseline_main.py
2. Alternatively, submit the job and then run the script named watch_jobs.sh. It will ask for the compute node name and the pbs job number, and basically runs this command:
tail -f /var/spool/pbs/spool/$JOBNO.n060.*
e.g:
sh watch_jobs.sh

n060:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
58.n060         s2594054 dljun    IndyTestDL  45304   1   1   12gb 100:0 R 00:11
   n060/0
===========================
Please enter Node Number e.g: n060
n060
Please enter Job number e.g 9066
58
===========================
|   5  Tesla V100-PCIE...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   33C    P0    26W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
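The same "tail -f" idea can be sketched in Python. The helper names spool_pattern and follow are hypothetical; the spool path is the one quoted above, and the sketch assumes it is run on the compute node by a user allowed to read that file. Unlike tail -f, it starts from the beginning of the file.

```python
import glob
import time


def spool_pattern(job_no, node="n060"):
    """Glob pattern for the job's spool files, as used by the tail command above."""
    return f"/var/spool/pbs/spool/{job_no}.{node}.*"


def follow(path, poll_seconds=1.0, max_idle_polls=None):
    """Yield lines as they are appended to `path`.

    Stops after `max_idle_polls` consecutive empty reads; None means never stop.
    """
    idle = 0
    with open(path) as handle:
        while max_idle_polls is None or idle < max_idle_polls:
            line = handle.readline()
            if line:
                idle = 0
                yield line
            else:
                idle += 1
                time.sleep(poll_seconds)


# Example for job 58 (stop with Ctrl-C):
#   for path in glob.glob(spool_pattern(58)):
#       for line in follow(path):
#           print(line, end="")
```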
GPU issues - deviceQuery
Check if this returns correctly:
/usr/local/cuda-10.0/samples/bin/x86_64/linux/release/deviceQuery
>>>>>>>
/usr/local/cuda-10.0/samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 6 CUDA Capable device(s)

Device 0: "Tesla V100-PCIE-32GB"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    7.0
  Total amount of global memory:                 32480 MBytes (34058272768 bytes)
  (80) Multiprocessors, ( 64) CUDA Cores/MP:     5120 CUDA Cores
  GPU Max Clock rate:                            1380 MHz (1.38 GHz)
  Memory Clock rate:                             877 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 7 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 20 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Devices 1 to 5 ("Tesla V100-PCIE-32GB") report the same values as Device 0, apart from the PCI bus ID:
  Device 1: Device PCI Domain ID / Bus ID / location ID:   0 / 21 / 0
  Device 2: Device PCI Domain ID / Bus ID / location ID:   0 / 57 / 0
  Device 3: Device PCI Domain ID / Bus ID / location ID:   0 / 58 / 0
  Device 4: Device PCI Domain ID / Bus ID / location ID:   0 / 136 / 0
  Device 5: Device PCI Domain ID / Bus ID / location ID:   0 / 137 / 0

> Peer access from Tesla V100-PCIE-32GB (GPU0) -> Tesla V100-PCIE-32GB (GPU1) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU0) -> Tesla V100-PCIE-32GB (GPU2) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU0) -> Tesla V100-PCIE-32GB (GPU3) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU0) -> Tesla V100-PCIE-32GB (GPU4) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU0) -> Tesla V100-PCIE-32GB (GPU5) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU1) -> Tesla V100-PCIE-32GB (GPU0) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU1) -> Tesla V100-PCIE-32GB (GPU2) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU1) -> Tesla V100-PCIE-32GB (GPU3) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU1) -> Tesla V100-PCIE-32GB (GPU4) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU1) -> Tesla V100-PCIE-32GB (GPU5) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU2) -> Tesla V100-PCIE-32GB (GPU0) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU2) -> Tesla V100-PCIE-32GB (GPU1) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU2) -> Tesla V100-PCIE-32GB (GPU3) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU2) -> Tesla V100-PCIE-32GB (GPU4) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU2) -> Tesla V100-PCIE-32GB (GPU5) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU3) -> Tesla V100-PCIE-32GB (GPU0) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU3) -> Tesla V100-PCIE-32GB (GPU1) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU3) -> Tesla V100-PCIE-32GB (GPU2) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU3) -> Tesla V100-PCIE-32GB (GPU4) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU3) -> Tesla V100-PCIE-32GB (GPU5) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU4) -> Tesla V100-PCIE-32GB (GPU0) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU4) -> Tesla V100-PCIE-32GB (GPU1) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU4) -> Tesla V100-PCIE-32GB (GPU2) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU4) -> Tesla V100-PCIE-32GB (GPU3) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU4) -> Tesla V100-PCIE-32GB (GPU5) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU5) -> Tesla V100-PCIE-32GB (GPU0) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU5) -> Tesla V100-PCIE-32GB (GPU1) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU5) -> Tesla V100-PCIE-32GB (GPU2) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU5) -> Tesla V100-PCIE-32GB (GPU3) : Yes
> Peer access from Tesla V100-PCIE-32GB (GPU5) -> Tesla V100-PCIE-32GB (GPU4) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 6
Result = PASS
>>>>>>>>
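A quick way to decide whether deviceQuery "returns correctly" is to look for the device count and the final "Result = PASS" line. The following is a minimal sketch; the function name device_query_ok is hypothetical, and the default of six devices is taken from the output above.

```python
import re


def device_query_ok(output, expected_devices=6):
    """Return True if deviceQuery passed and detected the expected GPU count."""
    passed = "Result = PASS" in output
    match = re.search(r"Detected (\d+) CUDA Capable device", output)
    found = int(match.group(1)) if match else 0
    return passed and found == expected_devices


# Example (run on n060):
#   import subprocess
#   exe = "/usr/local/cuda-10.0/samples/bin/x86_64/linux/release/deviceQuery"
#   result = subprocess.run([exe], stdout=subprocess.PIPE, universal_newlines=True)
#   print(device_query_ok(result.stdout))
```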
GPU issues - Sample log_device_placement.py
more log_device_placement.py
####https://www.tensorflow.org/guide/using_gpu
import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))
Source: https://www.tensorflow.org/guide/using_gpu
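For PyTorch users, a comparable check might look like the sketch below. The function name torch_gpu_report is hypothetical; it uses only torch.cuda.is_available, torch.cuda.device_count and torch.cuda.get_device_name, and returns None when PyTorch is not importable in the active environment.

```python
def torch_gpu_report():
    """Describe the GPUs PyTorch can see, or return None if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return None
    count = torch.cuda.device_count() if torch.cuda.is_available() else 0
    return {
        "cuda_available": count > 0,
        "device_count": count,
        "device_names": [torch.cuda.get_device_name(i) for i in range(count)],
    }


print(torch_gpu_report())
```

Inside a job that requested ngpus=1, the report would normally be expected to list a single device, but that depends on how the scheduler exposes GPUs to the job.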
gpu issues - Sample Tensorflow script
https://www.tensorflow.org/tutorials
cat tensorflowTutorial.py
###########################
import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
gpu issues - sample pbs script
Here is a sample pbs script:
cat pbs.tensor.01
#!/bin/bash
#PBS -m abe
#PBS -M Youremail@griffith.edu.au
#PBS -V
#PBS -N testImage
#PBS -q dljun@n060
#PBS -W group_list=deeplearning -A deeplearning
#PBS -l select=1:ncpus=1:ngpus=1:mem=32gb,walltime=300:00:00
#PBS -j oe
module load anaconda/5.3.1py3
#conda info --envs
#source activate deeplearning
source activate tensorflow-gpu
##nvidia-debugdump -l
##nvidia-smi
###python main.py --cfg cfg/config3.yml --gpu 0
cd $PBS_O_WORKDIR
python /export/home/s12345/lpbs/cuda/tensorflowTutorial.py
How do I run multiple tensorflow scripts in the same job
You can background each job as follows. In the example below, there are 3 jobs hitting a single GPU. We background a job by placing an ampersand (&) at the end of each command like this:
python main.py --cfg <PATH>/config5.yml --gpu 0 &
The script should end with "wait" so that the pbs job does not finish before the background commands do.
e.g:
cat pbs.01
>>>>>>>>>>>>>>>>>>>
#!/bin/bash -l
#PBS -m abe
#PBS -M abcde@griffithuni.edu.au
#PBS -V
#PBS -N testImage
#PBS -q dljun@n060
#PBS -W group_list=deeplearning -A deeplearning
#PBS -l select=1:ncpus=1:ngpus=1:mem=32gb,walltime=300:00:00
#PBS -j oe
cd $PBS_O_WORKDIR
module load anaconda/5.3.1py3
source activate tensorflow-gpu
echo $CUDA_VISIBLE_DEVICES
python main.py --cfg cfg/config3.yml --gpu 0 &
python main.py --cfg cfg/config4.yml --gpu 0 &
python main.py --cfg cfg/config5.yml --gpu 0 &
# Wait for all background commands to finish before the job exits
wait
>>>>>>>>>>>>>>>>>>>
Submit the job like this:
qsub pbs.01
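The same pattern can be driven from a single Python launcher instead of shell backgrounding: start every command, then wait for all of them before the job script exits. This is a sketch only; build_commands and run_all are hypothetical names, and main.py with its --cfg and --gpu options is taken from the example above.

```python
import subprocess
import sys


def build_commands(configs, gpu=0):
    """One 'python main.py --cfg <config> --gpu <gpu>' command per config file."""
    return [[sys.executable, "main.py", "--cfg", cfg, "--gpu", str(gpu)] for cfg in configs]


def run_all(commands):
    """Start every command at once, wait for all of them, return the exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


# Example, from inside the PBS job after activating the environment:
#   codes = run_all(build_commands(["cfg/config3.yml", "cfg/config4.yml", "cfg/config5.yml"]))
```

All three processes share the one GPU requested by the job, so their combined GPU memory use has to fit on that card.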
Reference
- https://www.pbsworks.com/pdfs/PBSAdminGuide18.2.pdf
- https://conf-ers.griffith.edu.au/download/attachments/21332198/xl270d_gen10.pdf?api=v2
- https://www.microway.com/hpc-tech-tips/nvidia-smi_control-your-gpus/
- https://weeraman.com/put-that-gpu-to-good-use-with-python-e5a437168c01
- https://stackoverflow.com/questions/48152674/how-to-check-if-pytorch-is-using-the-gpu
- https://discuss.pytorch.org/t/solved-make-sure-that-pytorch-using-gpu-to-compute/4870/14
- https://www.tensorflow.org/guide/using_gpu