Introduction
...
n060 is a special node with its own batching system, separate from the main gowonda cluster.
...
be listed in the group "aspen" and must use the following directive for accounting in their scripts: #PBS -W group_list=aspen -A aspen
Special note
...
Each user has space under /lscratch. This is fast SSD storage, so it is advantageous to copy your data to this folder and run your job from it.
As the home directory is shared across all nodes, you can transfer files to gowonda first and they will appear in your home directory on all nodes, including n060. If you need to use the local scratch on n060 (it is not shared with gowonda), move the folder or files from your home directory to /lscratch/snumber. For performance, it is best to use this local scratch space for all computation. e.g. on n060, run this:

mv mydataFolder /lscratch/snumber
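For example, a job script can stage its input to the local scratch at the start and copy results back at the end. A minimal sketch, assuming your /lscratch directory matches your login name ($USER) and using placeholder folder names:

# Stage input to the fast local SSD ("mydataFolder" and "results" are placeholders)
cd $PBS_O_WORKDIR
cp -r $HOME/mydataFolder /lscratch/$USER/
cd /lscratch/$USER/mydataFolder
# ... run the computation here ...
# Copy results back to the shared home directory before the job ends
cp -r results $PBS_O_WORKDIR/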
- There are 5 GPUs for the dljun queue and 1 GPU for the dlyao queue. To use them, request the attribute ngpus=1 together with the queue name (see the sample PBS scripts below). All jobs are queued, and when a resource becomes available the job runs on that queue.
- There is a space /project/deeplearning for all members of the group "deeplearning", and a space /project/aspen for all members of the group "aspen".
...
qstat -q

server: n060

Queue            Memory CPU Time Walltime Node   Run   Que   Lm  State
---------------- ------ -------- -------- ---- ----- ----- ---- -----
workq              --      --       --     --      0     0   --  D R
dl                 --      --    100:00:0    1      0     0   --  D R
dljun              --      --    300:00:0    1      0     0   --  E R
dlyao              --      --    300:00:0    1      0     0   --  E R
                                                ----- -----
                                                    0     0
...
Quick run for testing
qsub -q dljun@n060 -W group_list=deeplearning -A deeplearning -- /bin/date
qstat -1an @n060
...
Sample Interactive PBS run
Interactive runs have been disabled on all queues except iworkq. Usage is as follows:

qsub -I -q iworkq@n060 -W group_list=deeplearning -A deeplearning
OR
qsub -I -q iworkq@n060 -W group_list=aspen -A aspen
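Once the interactive shell starts, it is worth confirming which GPU PBS allocated before launching anything. PBS exposes the allocated device index via CUDA_VISIBLE_DEVICES (the same variable the batch scripts further down rely on):

echo $CUDA_VISIBLE_DEVICES   # index of the GPU assigned to this job
gpustat                      # quick per-GPU utilisation summary
nvidia-smi                   # full driver-level report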
...
Sample PBS script: to use with queue dljun
cat sample.pbs.script-dljun

#!/bin/bash -l
#PBS -m abe
## Mail to user
#PBS -M YourEmail@griffith.edu.au
## Job name
#PBS -N JunTest
#PBS -q dljun@n060
#### Other options: #PBS -q dlyao@n060 or #PBS -q workq@n060
#PBS -W group_list=deeplearning -A deeplearning
#### Other option: #PBS -W group_list=aspen -A aspen
### Number of nodes:Number of CPUs:Number of threads per node
#PBS -l select=1:ncpus=1:ngpus=1:mem=12gb,walltime=100:00:00
#PBS -j oe
### Add current shell environment to job (comment out if not needed)
#PBS -V
# The job's working directory
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
source $HOME/.bashrc
module list
echo "Starting job"
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
gpustat
nvidia-smi
echo "Done with job"
Another sample PBS script: to use with queue dlyao
...
#!/bin/bash -l
#PBS -m abe
## Mail to user
#PBS -M YOURNAME@griffith.edu.au
## Job name
#PBS -N YaoJobMyName
#PBS -q dlyao@n060
#### Other options: #PBS -q dljun@n060 or #PBS -q workq@n060
#PBS -W group_list=aspen -A aspen
#### Other option: #PBS -W group_list=deeplearning -A deeplearning
### Number of nodes:Number of CPUs:Number of threads per node
#PBS -l select=1:ncpus=1:ngpus=1:mem=12gb,walltime=100:00:00
#PBS -j oe
### Add current shell environment to job (comment out if not needed)
#PBS -V
# The job's working directory
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
source $HOME/.bashrc
module list
echo "Starting job"
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
gpustat
nvidia-smi
sleep 100
echo "Done with job"
...
Specifications
Hardware: HPE ProLiant XL270d Gen10 Node CTO server,
Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
The OS is CentOS 7.6 and the batching system is PBS 18.2.
This node has 6 NVIDIA GPU cards (HPE NVIDIA Tesla V100-32GB PCIe).
nvidia-smi
Wed Dec 12 08:28:49 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:14:00.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:15:00.0 Off |                    0 |
| N/A   33C    P0    25W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:39:00.0 Off |                    0 |
| N/A   33C    P0    25W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:3A:00.0 Off |                    0 |
| N/A   33C    P0    28W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   34C    P0    27W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   33C    P0    26W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

gpustat
n060.default.domain  Wed Dec 12 08:29:10 2018
[0] Tesla V100-PCIE-32GB | 32'C,   0 % |     0 / 32480 MB |
[1] Tesla V100-PCIE-32GB | 33'C,   0 % |     0 / 32480 MB |
[2] Tesla V100-PCIE-32GB | 33'C,   0 % |     0 / 32480 MB |
[3] Tesla V100-PCIE-32GB | 33'C,   0 % |     0 / 32480 MB |
[4] Tesla V100-PCIE-32GB | 34'C,   0 % |     0 / 32480 MB |
[5] Tesla V100-PCIE-32GB | 33'C,   0 % |     0 / 32480 MB |

These are the specs for each GPU card:

GPU Architecture:              NVIDIA Volta
NVIDIA Tensor Cores:           640
NVIDIA CUDA Cores:             5,120
Double-Precision Performance:  7 TFLOPS
Single-Precision Performance:  14 TFLOPS
Tensor Performance:            112 TFLOPS
GPU Memory:                    32 GB HBM2
Memory Bandwidth:              900 GB/sec
ECC:                           Yes
Interconnect Bandwidth:        32 GB/sec
System Interface:              PCIe Gen3
Form Factor:                   PCIe Full Height/Length
Max Power Consumption:         250 W
Thermal Solution:              Passive
Compute APIs:                  CUDA, DirectCompute, OpenCL, OpenACC
...
Configuration
...
PBS node configuration
Qmgr: p n n060
#
# Create nodes and set their properties.
#
#
# Create and define node n060
#
create node n060 Mom=n060.default.domain
set node n060 state = free
set node n060 resources_available.arch = linux
set node n060 resources_available.host = n060
set node n060 resources_available.mem = 197554184kb
set node n060 resources_available.ncpus = 70
set node n060 resources_available.ngpus = 6
set node n060 resources_available.vnode = n060
set node n060 resv_enable = True
...
PBS queue configuration

Qmgr: p q dljun
#
# Create queues and set their attributes.
#
#
# Create and define queue dljun
#
create queue dljun
set queue dljun queue_type = Execution
set queue dljun Priority = 20
set queue dljun acl_user_enable = True
set queue dljun acl_users = redacted
set queue dljun acl_users += redacted <redacted>
set queue dljun resources_max.ncpus = 56
set queue dljun resources_max.ngpus = 5
set queue dljun resources_max.nodect = 1
set queue dljun resources_max.walltime = 300:00:00
set queue dljun resources_default.ncpus = 1
set queue dljun resources_default.nodect = 1
set queue dljun resources_default.nodes = 1
set queue dljun resources_default.walltime = 100:00:00
set queue dljun acl_group_enable = True
set queue dljun acl_groups = deeplearning
set queue dljun enabled = True
set queue dljun started = True
#
# Create and define queue dlyao
#
create queue dlyao
set queue dlyao queue_type = Execution
set queue dlyao Priority = 20
set queue dlyao acl_user_enable = True
set queue dlyao acl_users = redacted
set queue dlyao acl_users += redacted <redacted>
set queue dlyao resources_max.ncpus = 12
set queue dlyao resources_max.ngpus = 1
set queue dlyao resources_max.nodect = 1
set queue dlyao resources_max.walltime = 300:00:00
set queue dlyao resources_default.ncpus = 1
set queue dlyao resources_default.nodect = 1
set queue dlyao resources_default.nodes = 1
set queue dlyao resources_default.walltime = 100:00:00
set queue dlyao acl_group_enable = True
set queue dlyao acl_groups = aspen
set queue dlyao enabled = True
set queue dlyao started = True
...
Installed Applications
"module avail" will list the currently installed applications. e.g:

module load anaconda/5.3.1py3
conda info --envs
source activate keras
pip install soundfile
...
Qs: Regarding the output: there are some print lines in my code that help me to monitor how my program is working, like the error of the model and so on. Is there any way to see this kind of online output on the terminal or in log files while the job is being processed by the cluster?

Ans: There are a few ways of doing this.

1. You may run an interactive PBS job with the "-I" option. For example:

qsub -I -q iworkq@n060 -W group_list=deeplearning -A deeplearning -l select=1:ncpus=1:ngpus=1:mem=12gb,walltime=100:00:00

After this you will be given a shell, and then you can run your commands:

module load anaconda/5.3.1py3
module load cuda/10.0
source activate tensorflow-gpu
python3 /export/home/s5108500/lscratch/Nick/DeepModels/keypoints/baseline_main.py

2. Alternatively, submit the job and then run the script named watch_jobs.sh. It will ask for the compute node name and the PBS job number, and it basically runs this command:

tail -f /var/spool/pbs/spool/$JOBNO.n060.*

e.g:

sh watch_jobs.sh

n060:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
58.n060         s2594054 dljun    IndyTestDL  45304   1   1   12gb 100:0 R 00:11
   n060/0
===========================
Please enter Node Number e.g: n060
n060
Please enter Job number e.g 9066
58
===========================

The job's output then streams to the terminal, e.g. the tail of an nvidia-smi report:

|   5  Tesla V100-PCIE...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   33C    P0    26W / 250W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
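For reference, watch_jobs.sh is essentially a thin wrapper around tail -f on the PBS spool file. A hypothetical reconstruction of an equivalent helper (the real script may differ):

#!/bin/bash
# Follow the live output of a running PBS job on a given node.
read -p "Please enter Node Number e.g: n060 " NODE
read -p "Please enter Job number e.g 9066 " JOBNO
echo "==========================="
# Assumes the spool directory is reachable from this host (e.g. via ssh to the node).
ssh $NODE tail -f /var/spool/pbs/spool/${JOBNO}.${NODE}.*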
...
Check if this returns correctly:

/usr/local/cuda-10.0/samples/bin/x86_64/linux/release/deviceQuery

/usr/local/cuda-10.0/samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 6 CUDA Capable device(s)

Device 0: "Tesla V100-PCIE-32GB"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    7.0
  Total amount of global memory:                 32480 MBytes (34058272768 bytes)
  (80) Multiprocessors, ( 64) CUDA Cores/MP:     5120 CUDA Cores
  GPU Max Clock rate:                            1380 MHz (1.38 GHz)
  Memory Clock rate:                             877 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z):  (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 7 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 20 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Devices 1 to 5 report identical capabilities; only the PCI Bus IDs differ (21, 57, 58, 136 and 137 respectively).

Peer access is reported as "Yes" between every pair of the six GPUs, e.g.:

> Peer access from Tesla V100-PCIE-32GB (GPU0) -> Tesla V100-PCIE-32GB (GPU1) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 6
Result = PASS
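Beyond deviceQuery, the frameworks themselves can be asked whether they see the GPUs. A short check using the anaconda module and the tensorflow-gpu environment referenced elsewhere on this page (a sketch; output format varies by TensorFlow version):

module load anaconda/5.3.1py3
module load cuda/10.0
source activate tensorflow-gpu
# Lists the devices TensorFlow can see; each GPU shows up with device_type "GPU".
python3 -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"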
gpu issues - sample device placement scripts
more log_device_placement.py

#### https://www.tensorflow.org/guide/using_gpu
import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))
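The original heading for this section also mentioned torch; the equivalent placement check in PyTorch, assuming an environment with PyTorch installed (not shown on this page), would look something like this:

python3 - <<'EOF'
import torch
# True only if the CUDA driver and at least one GPU are visible to PyTorch.
print(torch.cuda.is_available())
# Select the GPU when present, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
EOF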
...
https://www.tensorflow.org/tutorials

cat tensorflowTutorial.py
###########################
import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
...
gpu issues - sample pbs script
Here is a sample PBS script:
==================
cat pbs.tensor.01

#!/bin/bash
#PBS -m abe
#PBS -M Youremail@griffith.edu.au
#PBS -V
#PBS -N testImage
#PBS -q dljun@n060
#PBS -W group_list=deeplearning -A deeplearning
#PBS -l select=1:ncpus=1:ngpus=1:mem=32gb,walltime=300:00:00
#PBS -j oe
module load anaconda/5.3.1py3
#conda info --envs
#source activate deeplearning
source activate tensorflow-gpu
##nvidia-debugdump -l
##nvidia-smi
###python main.py --cfg cfg/config3.yml --gpu 0
cd $PBS_O_WORKDIR
python /export/home/s12345/lpbs/cuda/tensorflowTutorial.py
How do I run multiple tensorflow scripts in the same job
cat pbs.01
#!/bin/bash
#PBS -m abe
#PBS -M myemail@griffith.edu.au
#PBS -V
#PBS -N verc235
#PBS -q dljun@n060
#PBS -W group_list=deeplearning -A deeplearning
#PBS -l select=1:ncpus=16:ngpus=1:mem=32gb,walltime=300:00:00
#cd $PBS_O_WORKDIR
GPUNUM=`echo $CUDA_VISIBLE_DEVICES`
module load anaconda/5.3.1py3
module load cuda/10.0
#conda info --envs
#source activate deeplearning
source activate tensorflow-gpu
##nvidia-debugdump -l
##nvidia-smi
MASTERDIR=/export/home/s1234/scratch/home/DeepXi/ver/c2/5
cd $MASTERDIR/5
python3 deepxi.py --train 1 --gpu $GPUNUM &
cd $MASTERDIR/10
python3 deepxi.py --train 1 --gpu $GPUNUM &
cd $MASTERDIR/15
python3 deepxi.py --train 1 --gpu $GPUNUM &
cd $MASTERDIR/20
python3 deepxi.py --train 1 --gpu $GPUNUM &
# Wait for all four background runs to finish; otherwise the job exits
# (and PBS kills the background processes) as soon as this line is reached.
wait
Submit the job like this:
qsub pbs.01
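If the job requests more than one GPU (e.g. ngpus=2 on dljun), PBS lists all allocated device indices in CUDA_VISIBLE_DEVICES, and each concurrent script can be pinned to its own card by overriding that variable per process. A sketch under those assumptions, reusing the paths from the example above (each process then sees its single device as GPU 0):

GPU0=`echo $CUDA_VISIBLE_DEVICES | cut -d, -f1`   # first allocated device
GPU1=`echo $CUDA_VISIBLE_DEVICES | cut -d, -f2`   # second allocated device
cd $MASTERDIR/5
CUDA_VISIBLE_DEVICES=$GPU0 python3 deepxi.py --train 1 --gpu 0 &
cd $MASTERDIR/10
CUDA_VISIBLE_DEVICES=$GPU1 python3 deepxi.py --train 1 --gpu 0 &
wait   # keep the job alive until both runs finish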
Reference
- https://www.pbsworks.com/pdfs/PBSAdminGuide18.2.pdf
- https://conf-ers.griffith.edu.au/download/attachments/21332198/xl270d_gen10.pdf?api=v2
- https://www.microway.com/hpc-tech-tips/nvidia-smi_control-your-gpus/
- https://weeraman.com/put-that-gpu-to-good-use-with-python-e5a437168c01
- https://stackoverflow.com/questions/48152674/how-to-check-if-pytorch-is-using-the-gpu
- https://discuss.pytorch.org/t/solved-make-sure-that-pytorch-using-gpu-to-compute/4870/14
- https://www.tensorflow.org/guide/using_gpu