

*Qs 1: How do I cite or mention the cluster in papers? What is the preferred method?

Ans

Please consult this page: Griffith HPC User Guide#10Acknowledgements

*Qs 2: When using DOCK I'll be searching a fairly large database of molecules (about 1,300,000) but

I will split it into 6 chunks for the search. On the old cluster I used to copy each chunk of the database
to the local machine the calculation was running on. Will I need to do that on the new cluster or can I
simply point to a copy of the database in my local directory ?

Ans

We have a scratch partition for this. There is a link to it in your home directory: ~/scratch 
(a pointer to /scratch/snumber). You can make a copy of the database in the scratch directory 
and point DOCK to it during your run. This helps if you expect a lot of I/O and can speed things 
up a little. I don't think the local home directory is very busy at the moment, but it could 
become busy as more users come on board, so it is a good idea to start using ~/scratch now. 
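
The staging step described above could be scripted along these lines. This is only a minimal sketch: the function name stage_chunk and the example paths are hypothetical, and only ~/scratch comes from this answer. It copies one database chunk into a scratch directory (once) and prints the path that DOCK would then be pointed at.

```shell
# Hypothetical helper: copy one database chunk into a scratch directory
# (only if it is not there yet) and print the path of the staged copy.
stage_chunk() {
    src=$1          # chunk in your home directory (assumed path)
    dest_dir=$2     # e.g. "$HOME/scratch/dock_db" (assumed layout)
    mkdir -p "$dest_dir" || return 1
    dest="$dest_dir/$(basename "$src")"
    # Skip the copy when the chunk has already been staged.
    [ -f "$dest" ] || cp "$src" "$dest" || return 1
    printf '%s\n' "$dest"
}
```

In a PBS script this might be used as DB=$(stage_chunk "$HOME/db/chunk1.mol2" "$HOME/scratch/dock_db"), where the file name is again just an example.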

*Qs 3: How can I monitor the progress of my PBS jobs live?

Ans

You may be able to trace your jobs with this command:

tracejob 

type "man tracejob" for syntax.

e.g: tracejob -n 4 213883

where -n 4 searches the logs of the past 4 days and 
213883 is the job number.

There is a script to watch a job live. You can run it on gowonda as follows:
watch_jobs.sh

It will ask for the node on which your job is running and for the job number. To find the node, type:
qstat -1an 

The last column of the output shows the node.

>>>>>>>
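cat watch_jobs.sh
#!/bin/bash
# Follow the output of a running PBS job on its execution node.
echo "==========================="
echo "Please enter Node Number e.g: n004"
read -r NODE
echo "Please enter Job number  e.g 9066"
read -r JOBNO
echo "==========================="

# The glob is expanded on the remote node, where the spool files live.
ssh "$NODE" "tail -f /var/spool/PBS/spool/*$JOBNO*"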
>>>>> 
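
If you would rather not read the node off the qstat output by hand, the last column can be picked out with awk. The sketch below is an assumption about how one might do it, not part of watch_jobs.sh: the function name first_node is made up, and it expects lines in the "qstat -1an" layout, where the last column looks like "n003/0*0".

```shell
# Print the first execution node from `qstat -1an` lines read on stdin.
# The last column looks like "n003/0*0"; everything from the first "/" on is dropped.
first_node() {
    awk '{ split($NF, parts, "/"); print parts[1] }'
}
```

For example, NODE=$(qstat -1an | grep 213883 | first_node) would leave the node name in NODE. For a job that is still queued the last column is not a node name, so check the result before passing it to ssh.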

Qs. 4 How to run gpu/cuda jobs?

We would like to run some software on the GPUs on the new computer but are having problems figuring out how to do it. Can you give us any info as we are preparing a paper for a conference.

Ans

Check this: https://griffith.atlassian.net/wiki/spaces/GHCD/pages/4030784/cuda

I have put some documentation here:

http://confluence.rcs.griffith.edu.au:8080/display/GHPC/cuda

A sample PBS script is here:
>>>>>>>>>>>>>>
#!/bin/bash

#PBS -N cuda
#PBS -l ngpus=2
#PBS -l walltime=100:00:00
#PBS -q gpu
module load cuda/4.0
echo "Hello from $HOSTNAME: date = `date`"
nvcc --version
echo "Finished at `date`"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


Basically, a queue named gpu and a PBS resource named ngpus have been created. 
Both need to be requested in the PBS script to send batch jobs to the gpu nodes.

I suggest you do an interactive batch run first to sort out any issues. To do that:

qsub -I -l ngpus=2 -l walltime=100:00:00  -q gpu

This will log you into one of the gpu nodes (you may have to wait if the nodes are busy).

Type this on the gpu node:

 module load cuda/4.0

The cuda binaries are located here: /usr/local/cuda/bin/

Qs. 5: Is opencl available?

Ans

Yes indeed! Check this wiki page for details: https://griffith.atlassian.net/wiki/spaces/GHCD/pages/4035461/OpenCL

Qs. 6: What is the advantage of using the /scratch filesystem while running experiments?

Ans

Currently the /scratch filesystem is much less busy than the very busy /export/home filesystem. A benchmark test performed on Nov 24th 2011 illustrates how using it can speed up your calculation.

With the binaries on the /scratch filesystem (/scratch/s2594054/LennardJones directory), we get this:

===========================================
Energy = -1622.25

Number of devices is 2
Passed clCreateProgramWithSource
Compute using the CPU

Energy value is -1622.24
Time taken was : 1450 microseconds    <=========  *Check this

Compute using the GPU

Energy value is -1618.35
Time taken was : 837 microseconds     


With the binaries on the /export/home filesystem, we get the timings below (this result was expected, as /export/home is a busy filesystem):
=================================================
Energy = -1622.25

Number of devices is 2
Passed clCreateProgramWithSource
Compute using the CPU

Energy value is -1622.24
Time taken was : 2005 microseconds    <=========  *Check this

Compute using the GPU

Energy value is -1618.35
Time taken was : 1061 microseconds    
======================================================
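
To put a number on the difference, the two sets of timings above can be divided. The helper name speedup is hypothetical; the four values are the "Time taken" figures from the two runs.

```shell
# Ratio of two timings, printed with two decimals (slow / fast).
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

speedup 2005 1450   # CPU run: /export/home time divided by /scratch time
speedup 1061 837    # GPU run: /export/home time divided by /scratch time
```

This prints 1.38 and 1.27, i.e. this particular test took roughly 27% to 38% longer from /export/home. It is a single measurement, so treat it as an indication only.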

The best approach is to copy the binaries to the /scratch/snumber/ directory and submit a batch job similar to the one below:

#!/bin/bash
#PBS -m abe
#PBS -M YOUREMAIL@griffith.edu.au
#PBS -N openCL_GPU_job
#PBS -l ngpus=1
#PBS -l walltime=00:10:00
module load <WhatEver>
##echo "Hello from $HOSTNAME: date = `date`"
cd /scratch/s123456/<yourBinDir>
./<yourProg>
#You may copy data (if any) back and forth if needed.
##e.g:
##./LennardJones pop256dd.xyz

Qs. 7: What are the Basic Unix commands that one needs to know to use the linux cluster?

Ans

Finding online help

To find a command or library routine that performs a required function, try a “keyword” search.

You may do a keyword search of the online man pages using “man -k keyword”. The command “apropos keyword” performs the same search.
To find details on a command or library

To find details on using a unix command or library use the command “man command_name”

If no man page is found, it is worth trying the “info” tool.
Controlling your unix environment

Using the modules command.

Listing the files and subdirectories in a directory

basic commands: ls, cd, rm.

Use the "ls" command to list the contents of a directory.
Changing the current working directory

The Unix environment (Shell) records a "current working directory" which is initially set to your home directory.

Many programs, (such as ls above) will look in this directory if no other directory requested.

The "current working directory" may be changed using the "cd" command. See the notes below on directory names.
Deleting (Removing) a file

Use the "rm" command to remove a file.
A few notes on Unix directory names.

A Unix full path name is constructed of the directory and subdirectory names separated by slashes "/".

 e.g. /export/home/s123456/work/file1

All Unix full path names start with "/" (there are no drive/volume names as in Windows). Hence any filename starting with "/" is a full path name.

A filename that does not start with "/" is relative to the "current working directory". If it contains one or more slashes, it refers to a file in a subdirectory of that directory.

The current working directory may also be referenced as dot ".".

 e.g. ./subdirectory/file

The directory containing the "current working directory" may be referenced as dot-dot "..".

 e.g. ../peer-directory/file
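
The rule about full and relative names can be tried out in the shell. The two function names below are made up for illustration; the only fact they rely on is the one stated above, that a full path name starts with "/".

```shell
# Succeeds (exit status 0) when the argument is a full path, i.e. starts with "/".
is_full_path() {
    case $1 in
        /*) return 0 ;;
        *)  return 1 ;;
    esac
}

# Turn a relative name into a full path by prefixing the current working directory.
to_full_path() {
    if is_full_path "$1"; then
        printf '%s\n' "$1"
    else
        printf '%s/%s\n' "$(pwd)" "$1"
    fi
}
```

For example, after "cd /export/home/s123456", the call "to_full_path work/file1" prints /export/home/s123456/work/file1.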

Qs. 8: How do I check for "DOS characters" in the PBS script?

Ans This is a common problem if you use Windows editors like Notepad. They can introduce "unwanted characters" (carriage returns, shown as ^M below) into the PBS script file, which can cause problems when the job runs. The difficulty is that these characters are not normally visible.

To reveal them, please do the following:
cat -tv <Filename>

To Remove them, run this command:
dos2unix <filename>

cat -tv  pbs.00

#PBS -m abe^M
#PBS -M Email@griffith.edu.au^M
#PBS -N Pile^M
#PBS -l select=1:ncpus=1:mem=4g^M
#PBS -l walltime=03:00:00^M
source $HOME/.bashrc^M
module load matlab/2008b^M
module load ATLAS/3.9.39^M
echo "Starting job"^M
matlab -nodisplay -nodesktop -nosplash < /export/home/s123456/P1_e5f1k4/SBFEmOnePileVerify_04Aug11.m^M
echo "Done with job"^M


dos2unix   pbs.00
dos2unix: converting file pbs.00 to UNIX format ...

cat -tv  pbs.00
#PBS -m abe
#PBS -M Email@griffith.edu.au
#PBS -N Pile
#PBS -l select=1:ncpus=1:mem=4g
#PBS -l walltime=03:00:00
source $HOME/.bashrc
module load matlab/2008b
module load ATLAS/3.9.39
echo "Starting job"
matlab -nodisplay -nodesktop -nosplash < /export/home/s123456/P1_e5f1k4/SBFEmOnePileVerify_04Aug11.m
echo "Done with job"
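
The check can also be automated before submitting a job. The sketch below assumes nothing beyond standard grep and printf; the function name has_dos_endings is hypothetical. It looks for the carriage return character that "cat -tv" displays as ^M.

```shell
# Succeeds when the file contains at least one carriage return
# (the character that `cat -tv` shows as ^M).
has_dos_endings() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}
```

A script could then run: if has_dos_endings pbs.00; then dos2unix pbs.00; fi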

Qs 9: How do I transfer files to or from the system ?

Qs 10: How can I connect to the systems ?

Qs 11: I need to transfer data from outside of Griffith. Would I be charged for this usage?

Ans

AARNet introduced un-metered Off Peak traffic in the middle of 2009. At that time, Off Peak was defined as 8:00pm to 8:00am each day, including weekends and public holidays. This resulted in an additional 10% of total traffic being reclassified as un-metered.
 
At the beginning of 2011, Off Peak was extended to the period from 5:00pm to 9:00am. The effect was that a further 10% of total traffic was reclassified as un-metered. Year-to-date, some 80% of total traffic is now in this category.
 
The AARNet Board has just approved a further extension of Off Peak to include all of the weekend, so that the period from 5:00pm Friday to 9:00am Monday will then be un-metered. This change will occur from July 1st and will result in another 5% of total traffic becoming un-metered. This means that around 85% of total AARNet traffic will soon be un-metered and subscription based.

You can check: https://ias.griffith.edu.au/griffith
Check under internet usage!

*Qs 12: I am getting a segmentation fault. What can I do?

Ans

Check what your stack size is and try increasing it.

ulimit -a
ulimit -s unlimited

Qs 13: How do I specify a walltime (e.g. 10 minutes) in the PBS script?

Ans

#PBS -l walltime=00:10:00 
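
The value is in hh:mm:ss form. If you think in minutes, a small helper can do the conversion; the function name walltime_from_minutes is hypothetical and the sketch uses only shell arithmetic.

```shell
# Format a whole number of minutes as hh:mm:ss for "#PBS -l walltime=...".
walltime_from_minutes() {
    printf '%02d:%02d:00\n' $(( $1 / 60 )) $(( $1 % 60 ))
}
```

For example, "walltime_from_minutes 10" prints 00:10:00 and "walltime_from_minutes 150" prints 02:30:00.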

Qs 14: How do I specify memory in the PBS script?

Ans

Memory Settings

The queuing system now enforces memory limits on jobs. If you do not specify a "mem" limit for your job, it receives the default mem limit of 600mb. This corresponds to:

#PBS -l mem=600mb

If your job uses more memory per thread than this and you do not explicitly ask for more, your job will be killed by the PBS scheduler. If your job uses less than this and you do not specify a limit, PBS assumes it needs the full 600mb, which makes the unused memory unavailable to other jobs. It is best if your mem setting accurately reflects your job's actual memory requirements.

Qs 15: How do I create a service desk case to report a problem with HPC?

Ans

https://www.griffith.edu.au/eresearch-services/request-help

Please select "eResearch services.HPC" for category. Please see an example below:

Qs 16: mpi job doesn't seem to be running much faster than it does on my laptop

I'm running a RAxML analysis now (job id: 371395). It is supposed to be an mpi job, but it doesn't seem to be running much faster than it does on my laptop. Is there a way to check that an mpi job is actually running on multiple cores? I tried 'qstat -f 371395', but I can't tell if it's reporting the used memory or the reserved memory.

If you have time, could you please take a look at my submit script, to make sure it's OK: /scratch/s2831058/120524_NIMR_RAxML/120524_NIMR_RAxML_submit.sh

Ans

First, check where the job is running:

>>>>>>>>>>>>>>>>>>
qstat -1an|grep s1111111
371395.pbsserve s1111111 mpi      nimr_raxml   6915   1   8    8gb 48:00 R 13:08 n003/0*0
>>>>>>>>>>>>>>>>>>>>

This shows it is running on node n003 and has requested 8 cpus. 
Next check if it is actually using all the cpus. 
To find out, you will need to ssh to the node.

ssh n003

You will need to run a tool like "top" to check if it is utilizing these processors. Running "top" shows it is only using 1 CPU.

>>>>>>>>>
top - 11:29:45 up 81 days,  1:30,  1 user,  load average: 1.00, 1.00, 1.00
Tasks: 608 total,   2 running, 606 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.6%us,  0.1%sy,  0.0%ni, 96.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  12189240k total,  3954752k used,  8234488k free,   187132k buffers
Swap:  8193140k total,    45688k used,  8147452k free,   395396k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6959 s2831058  20   0 2573m 2.3g 2976 R 100.1 19.8 791:59.76 raxmlHPC-MPI
11248 root      20   0 17492 1680  956 R  0.7  0.0   0:00.04 top
10972 root      20   0     0    0    0 S  0.3  0.0   0:00.68 flush-0:17
11182 root      20   0     0    0    0 S  0.3  0.0   0:00.06 flush-0:19

>>>>>>>>>>>>>>
Note that the load average is 1.00 when it should be about 8 if the job were using 8 processors. 
To get a per-CPU usage report, run "top" and press 1, or run "mpstat -P ALL" 
(hyperthreading is enabled, hence you will see 24 CPUs instead of 12). 
Another program, "htop", gives even better visuals of the running processes.

>>>>>>>>>>>>>>>>>
top - 11:42:33 up 81 days,  1:43,  1 user,  load average: 1.04, 1.03, 1.01
Tasks: 608 total,   2 running, 606 sleeping,   0 stopped,   0 zombie
Cpu0  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.7%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.7%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.2%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu16 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu17 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu18 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu19 :  0.4%us,  0.0%sy,  0.0%ni, 99.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu20 :  0.0%us,  0.4%sy,  0.0%ni, 99.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu21 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu22 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu23 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  12189240k total,  3955736k used,  8233504k free,   187276k buffers
Swap:  8193140k total,    45688k used,  8147452k free,   396956k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6959 s2831058  20   0 2573m 2.3g 2976 R 100.1 19.8 804:44.49 raxmlHPC-MPI
11732 root      20   0     0    0    0 S  0.7  0.0   0:00.32 flush-0:17
11680 root      20   0     0    0    0 S  0.3  0.0   0:00.32 flush-0:19
11846 root      20   0 17492 1684  960 R  0.3  0.0   0:00.04 top
24632 ganglia   20   0  143m 1984 1104 S  0.3  0.0   2:05.04 gmond
    1 root      20   0 21400 1276 1044 S  0.0  0.0   0:01.97 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:03.00 kthreadd
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.03 migration/0

>>>>>>>>>>>>>>>>>
From the mpstat output (see last column), it is pretty much idle.
>>>>>>>>>
 mpstat -P ALL
Linux 2.6.32-131.0.15.el6.x86_64 (n003)         05/25/2012      _x86_64_        (24 CPU)

11:38:55 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
11:38:55 AM  all   10.92    0.00    0.61    0.02    0.00    0.00    0.00    0.00   88.45
11:38:55 AM    0   37.80    0.00    1.05    0.19    0.00    0.00    0.00    0.00   60.96
11:38:55 AM    1   35.52    0.00    2.31    0.00    0.00    0.00    0.00    0.00   62.17
11:38:55 AM    2   35.72    0.00    2.28    0.00    0.00    0.00    0.00    0.00   61.99
11:38:55 AM    3   36.64    0.00    1.34    0.00    0.00    0.00    0.00    0.00   62.02
11:38:55 AM    4   33.67    0.00    2.49    0.00    0.00    0.00    0.00    0.00   63.84
11:38:55 AM    5   33.85    0.00    2.51    0.00    0.00    0.00    0.00    0.00   63.64
11:38:55 AM    6   20.58    0.00    0.60    0.18    0.00    0.02    0.00    0.00   78.63
11:38:55 AM    7   18.85    0.00    0.58    0.00    0.00    0.00    0.00    0.00   80.57
11:38:55 AM    8    2.28    0.00    0.25    0.01    0.00    0.00    0.00    0.00   97.47
11:38:55 AM    9    2.43    0.00    0.30    0.02    0.00    0.00    0.00    0.00   97.25
11:38:55 AM   10    2.94    0.00    0.24    0.02    0.00    0.00    0.00    0.00   96.79
11:38:55 AM   11    1.84    0.00    0.25    0.00    0.00    0.00    0.00    0.00   97.91
11:38:55 AM   12    0.01    0.00    0.02    0.00    0.00    0.00    0.00    0.00   99.97
11:38:55 AM   13    0.02    0.00    0.07    0.00    0.00    0.00    0.00    0.00   99.91
11:38:55 AM   14    0.01    0.00    0.02    0.00    0.00    0.00    0.00    0.00   99.98
11:38:55 AM   15    0.00    0.00    0.01    0.00    0.00    0.00    0.00    0.00   99.99
11:38:55 AM   16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.99
11:38:55 AM   17    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.99
11:38:55 AM   18    0.01    0.00    0.03    0.01    0.00    0.00    0.00    0.00   99.95
11:38:55 AM   19    0.01    0.00    0.10    0.00    0.00    0.00    0.00    0.00   99.89
11:38:55 AM   20    0.01    0.00    0.03    0.01    0.00    0.00    0.00    0.00   99.95
11:38:55 AM   21    0.01    0.00    0.02    0.01    0.00    0.00    0.00    0.00   99.97
11:38:55 AM   22    0.01    0.00    0.03    0.00    0.00    0.00    0.00    0.00   99.96
11:38:55 AM   23    0.01    0.00    0.02    0.00    0.00    0.00    0.00    0.00   99.97
>>>>>>>>>>>

There is something wrong with how the job was submitted. Please take a look at it.
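
The same check can be done without reading the top screen by eye. The sketch below is an assumption-laden helper, not a site tool: the name count_busy is made up, and it assumes the default top column layout shown above (%CPU in column 9, COMMAND in column 12).

```shell
# Count processes of one command that use more than a given %CPU.
# Reads `top` process lines on stdin (column 9 is %CPU, column 12 is COMMAND).
count_busy() {
    awk -v cmd="$1" -v min="$2" '$12 == cmd && $9 + 0 > min { n++ } END { print n + 0 }'
}
```

On the node, something like: top -b -n 1 | count_busy raxmlHPC-MPI 50 should print a number close to the ncpus you requested. For the job above it prints 1 instead of 8.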

Qs 17: How do I check if my job is short of memory (swapping memory)?

Ans

Log into the node that is running your job. To find the node:
qstat -1an|grep <jobNumber>

This gives the node (last column). Please log into that node (ssh n???) 
and run "top".

The values under RES show the amount of physical memory (RAM) being used by each process. This should be within the amount of pmem 
in your job request. If you are using mem rather than pmem, add up RES for all processes of the job (1.3g+1.2g+.....=10.2g in the output below).
That total is only a snapshot of the moment when top was run. In this example, the requested memory (mem= ) was not enough. 

Please look at the swap line from the output of "top":
 
Swap:  8193140k total,  5132424k used,  3060716k free

Here you can see that it is using 5132424k of swap mem which ideally should be very low (=~ 0 k). So you are using a lot of swap. 

>>>>>>>>>>>>>>>>>>>>>>>
top - 13:29:10 up 81 days,  3:30,  3 users,  load average: 8.00, 8.00, 7.74
Tasks: 621 total,   9 running, 612 sleeping,   0 stopped,   0 zombie
Cpu(s): 33.8%us,  0.2%sy,  0.0%ni, 66.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  12189240k total, 11812780k used,   376460k free,    14036k buffers
Swap:  8193140k total,  5132424k used,  3060716k free,    38532k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15095 s2831058  20   0 2640m 1.3g 3164 R 100.0 11.2  44:15.23 raxmlHPC-MPI
15096 s2831058  20   0 2640m 1.2g 3196 R 100.0 10.6  44:08.10 raxmlHPC-MPI
15097 s2831058  20   0 2640m 1.3g 3240 R 100.0 11.2  44:08.22 raxmlHPC-MPI
15098 s2831058  20   0 2640m 1.4g 3224 R 100.0 12.3  44:13.74 raxmlHPC-MPI
15100 s2831058  20   0 2640m 1.6g 3212 R 100.0 13.9  44:11.65 raxmlHPC-MPI
15101 s2831058  20   0 2640m 1.1g 3080 R 100.0  9.8  44:04.24 raxmlHPC-MPI
15094 s2831058  20   0 2640m 1.0g 3196 R 99.7  8.9  44:11.99 raxmlHPC-MPI
15099 s2831058  20   0 2641m 1.3g 3184 R 99.7 11.2  44:20.96 raxmlHPC-MPI
17329 root      20   0     0    0    0 S  2.7  0.0   0:00.91 flush-0:17
17313 s2831058  20   0 17728 1544 1032 S  2.0  0.0   0:01.21 htop
 3161 root      20   0 2940m  17m 2332 S  0.3  0.1 257:23.80 DNA
17371 root      20   0  103m 3816 2972 S  0.3  0.0   0:00.01 sshd
17405 root      20   0 17492 1672  956 R  0.3  0.0   0:00.05 top
    1 root      20   0 21400 1144  980 S  0.0  0.0   0:01.97 init


>>>>>>>>>>>>>>>>>>>>>>>>>
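
Adding up the RES column by hand is error-prone, so here is a minimal sketch that does it with awk. The function name sum_res_gb is hypothetical, and the column positions (RES in column 6, COMMAND in column 12) assume the default top layout shown above.

```shell
# Sum the RES column of `top` process lines for one command, in GiB.
# RES is column 6; top prints it in KiB unless it is suffixed with m or g.
sum_res_gb() {
    awk -v cmd="$1" '
        $12 == cmd {
            v = $6
            if (v ~ /g$/)      gb = v + 0              # awk keeps the leading number
            else if (v ~ /m$/) gb = (v + 0) / 1024
            else               gb = (v + 0) / 1048576  # plain value is KiB
            total += gb
        }
        END { printf "%.1f\n", total }'
}
```

Fed with the eight raxmlHPC-MPI lines above (for example via top -b -n 1 | sum_res_gb raxmlHPC-MPI), it prints 10.2, the same total as the manual sum.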
Another tool you can use is "htop". See attached pictures for the swap row. It should be a very short line or no line 
if it is not swapping. If it is swapping heavily, it will have a long line.
 
I would suggest you request more memory. PBS will then dispatch your jobs to the bigger memory nodes. Your jobs may have to wait
a little longer in the queue, but they will then run on the appropriate bigger memory nodes (currently our 4 biggest nodes 
have 96 GB of RAM).

Here htop output shows a heavily swapping system:

Here htop output shows a lightly swapping system:

Another tool you can use is called "free".

Qs 18: How do I modify the number of requested cores, memory , walltime etc after submitting the job?

Ans

qalter -l select=1:ncpus=12:mem=5gb <jobID>
qalter -l walltime=<hh:mm:ss> <jobid>

To release a job from "Hold" status:
qrls -hs <jobid>

Qs 19: How do I delete all my pbs jobs?

Ans

qdel `qselect -u username`

Qs 20: What is the syntax to add group information in a PBS script? (For example, to access software that is restricted to privileged users, such as the vasp group for the VASP software.)

Ans

Please put this line in the pbs script:

#PBS -W group_list=nimrodusers

or simply use it in qsub as follows:

qsub -l walltime=800:00:00,select=1:ncpus=2 -W group_list=nimrodusers  -N bm2-12 -q gpu -- /export/home/s123456/blade2.sh

Qs 21: How do I make advance reservation?

Ans
Please review this documentation.

Qs 22: If running the same job many times, what is the best way to submit it?

Ans

An excellent guide from USyd can be found here

If you run the same job many times, it is best to use job arrays (rather than running qsub multiple times).

So you could do:
qsub -N <jobname> -J 1-10 job_submit.pbs

To run it 10 times. 

#!/bin/bash
#PBS -J 1-100
#PBS -m abe
#PBS -M YOUREMAIL@griffithuni.edu.au
#PBS -N MQworker
#PBS -l select=1:ncpus=1:mem=2g,walltime=2:00:00
cd  $PBS_O_WORKDIR
./myBinary ${PBS_ARRAY_INDEX} 3600 >> output_${PBS_ARRAY_INDEX}

Another example: 

qsub -l walltime=00:00:10 -J 1-4 -- /bin/sleep 3

Another Example

cat array_demo_pbs

#!/bin/bash
#Demonstrate the behaviour of job arrays
#(walltime is set above the 60 second sleep so that the sub-jobs can finish)
#PBS -m abe
#PBS -M EMAILID@griffith.edu.au
#PBS -N TestArray
#PBS -l select=1:ncpus=1:mem=1gb,walltime=00:05:00
#PBS -J 1-10
#PBS -q workq
sleep 60

#Make a new directory for this sub-job inside the submission directory
mkdir -p "$PBS_O_WORKDIR/Demo$PBS_ARRAY_INDEX"

#Change into it
cd "$PBS_O_WORKDIR/Demo$PBS_ARRAY_INDEX"

#Send some output to a file
printf "This is job number %s in the array\n" "$PBS_ARRAY_INDEX" > "job.$PBS_ARRAY_INDEX.log"


Further example

#!/bin/bash
#PBS -m abe
#PBS -M YourEmail@griffith.edu.au
#PBS -N IndyMarxACT_v3
#PBS -q routeq
#PBS -J 1-20
#PBS -l select=1:ncpus=1:mem=1g,walltime=20:00:00
cd $PBS_O_WORKDIR
module load R/4.0.3
R CMD BATCH /export/home/snumber/pbs/array/scripts/script_$PBS_ARRAY_INDEX.R
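
When the inputs do not have the array index in their names, a common pattern is to keep one input name per line in a list file and let each sub-job pick its own line. The sketch below illustrates that idea only; the function name nth_input and the file name inputs.txt are hypothetical, while PBS_ARRAY_INDEX is the variable used in the scripts above.

```shell
# Print line number $1 of the list file $2 (one input name per line).
nth_input() {
    sed -n "${1}p" "$2"
}
```

Inside an array job this could be used as INPUT=$(nth_input "$PBS_ARRAY_INDEX" inputs.txt), followed by ./myBinary "$INPUT".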

Qs 23: How do I use screen to leave a terminal session running even after logging out of the system?

Ans
Screen can be used when you want to leave a terminal session running even after logging out of the system.

For an easy reference, here's a list of the most common screen commands that you'll want to know. This isn't exhaustive, but it should be enough for most users to get started using screen happily for most use cases.

screen -d -m -S shared
screen -ls
screen -x shared



    Start Screen: screen
    Detach Screen: Ctrl-a d
    Re-attach Screen: screen -x or screen -x PID
    Split Horizontally: Ctrl-a S
    Split Vertically: Ctrl-a |
    Move Between Split Regions: Ctrl-a Tab
    Name Window: Ctrl-a A
    Log Window Output to a File: Ctrl-a H
    Save a Hardcopy of the Window: Ctrl-a h



Qs 24: How do I mount a filesystem from a remote server on gowonda?

First ask the system admin on gowonda to do this:
>>>>>>>>>
usermod -a -G fuse <username>
>>>>>>>>>

You need an SSH server installed and running on the computer you want to mount (for most Unix-like OSes, including Mac OS X, it is a matter of installing the openssh-server package). 
Then check that SSH itself is working with:
>>>>>>>>>>>>>>>>>>>>>>
ssh user@remoteIP  
>>>>>>>>>>>>>>>>>>>>>>
This will ask for confirmation the first time, then ask for a password every time you try to connect. Once connected, press Ctrl+d to
disconnect, then run the sshfs command 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
sshfs user@remoteIP:/directory/to/mount /mount/point 

e.g 
On gowonda:
mkdir /tmp/mnt1 
sshfs s12345@132.234.1.1:/opt/import  /tmp/mnt1

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
As sshfs uses FUSE, it will refuse to mount onto a non-empty directory, so make sure there's nothing in the mount point beforehand. Now
the contents of the remote directory should be available to you. When you are finished, you unmount the remote directory with
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
fusermount -u /mount/point
e.g:
fusermount -u  /tmp/mnt1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
After doing this a couple of times, you will find it incredibly annoying having to type your password in each time. Fortunately, SSH can use keys for logging in. First generate your keys on the local machine with 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
ssh-keygen -t rsa
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Don't set a password when asked, just press Enter, otherwise you'll defeat the point of creating the key. This creates a pair of files in ~/.ssh, which are your private and public keys. Your private key should never be shared, but you need to copy the public one to the remote
computer with this command:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
$ ssh-copy-id -i ~/.ssh/id_rsa.pub user@remoteIP
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

This will ask for the password one more time, but after that you can run the above ssh and sshfs commands without being asked for a password again. All of these commands should be run as your normal user. You can omit the user@ part of the ssh and sshfs commands if you have the same username on both computers. 
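
Because sshfs refuses to mount onto a non-empty directory, it can be handy to test the mount point first. This is a minimal sketch with a made-up function name (is_empty_dir); it relies only on the standard "ls -A" option, which also lists hidden files.

```shell
# Succeeds when the directory exists and contains no entries (hidden files included).
is_empty_dir() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}
```

With the example from above: is_empty_dir /tmp/mnt1 && sshfs s12345@132.234.1.1:/opt/import /tmp/mnt1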


Source: Linux Format 5 Apr 2016 p.92

Qs25: PBS environmental variables

Several environment variables are provided to PBS jobs. Some are taken from the user's environment and carried with a job, and others are created by PBS. There are also some environment variables that you can explicitly create for exclusive use by PBS jobs.

All PBS-provided environment variable names start with the characters "PBS_". Some start with "PBS_O_", which indicates that the variable is taken from the job's originating environment (that is, the user's environment).

A few useful PBS environment variables are described in the following list:

PBS_O_WORKDIR - Contains the name of the directory from which the user submitted the PBS job
PBS_O_PATH - Value of PATH from submission environment
PBS_JOBID - Contains the PBS job identifier
PBS_JOBDIR - Pathname of job-specific staging and execution directory
PBS_NODEFILE - Contains a list of vnodes assigned to the job
TMPDIR - The job-specific temporary directory for this job. Defaults to /tmp/pbs.job_id on the vnodes.    
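
To see what these variables contain for a particular job, they can be printed at the start of the PBS script. The sketch below is illustrative only: the function name job_summary and the output format are made up, and the fall-back values are there so that the function also runs outside a PBS job, where the variables are unset.

```shell
# Print a short summary of the PBS environment of the current job.
# Outside a PBS job the variables are unset, so fall-back values are shown.
job_summary() {
    echo "jobid=${PBS_JOBID:-none}"
    echo "workdir=${PBS_O_WORKDIR:-$(pwd)}"
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        # Count the distinct entries listed in the node file.
        echo "vnodes=$(sort -u "$PBS_NODEFILE" | wc -l | tr -d ' ')"
    else
        echo "vnodes=unknown"
    fi
}
```

Calling job_summary right after "cd $PBS_O_WORKDIR" puts these three lines at the top of the job's output file, which helps when comparing runs later.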

Qs26: How can a specific execution node be requested

As the cluster runs a mix of new and old nodes, you may need to request a specific node (e.g. when the code was compiled for the new node architecture). The procedure is described below.

When code compiled for the new node architecture is run on the old nodes, you may get errors like this:

Please verify that both the operating system and the processor support Intel(R) MOVBE, F16C, FMA, BMI, LZCNT, AVX2, AVX512DQ, AVX512F, ADX, AVX512CD, AVX512BW and AVX512VL instructions.

Solution:
========
The new nodes are:
gc-prd-hpcn001
gc-prd-hpcn002
gc-prd-hpcn003
gc-prd-hpcn004
gc-prd-hpcn005
gc-prd-hpcn006

To check if the resource is free, please run this command:
pbsnodes  -aSj|egrep "gc-prd-hpcn002|gc-prd-hpcn001|gc-prd-hpcn003|gc-prd-hpcn004|gc-prd-hpcn005|gc-prd-hpcn006"

Let's say gc-prd-hpcn006 has free memory and free cpus.

>>>>>>>>>Here is a sample pbs script to send a job to gc-prd-hpcn006<<<<<<<<<
#!/bin/bash
#PBS -N SpecificHost
###PBS -m abe
###PBS -M ID@griffithuni.edu.au
#PBS -l select=1:ncpus=1:host=gc-prd-hpcn006:mem=8gb,walltime=24:00:00
echo "Starting job: "
cd $PBS_O_WORKDIR
echo "Hello"
sleep 10
>>>>>>>>>>>>>>>>>>>>>>>

For an interactive run, you can do this:

qsub -I -X -q workq  -l select=1:ncpus=1:host=gc-prd-hpcn006:mem=8gb,walltime=24:00:00

If hosts from a range is acceptable:
qsub -I -X -q workq  -l select=1:ncpus=1:host=gc-prd-hpcn001-006:mem=1gb,walltime=00:01:00

PS: If you need to send the job to a different queue, first find the queue a node belongs to with this command:
qhost|egrep "gc-prd-hpcn002|gc-prd-hpcn001|gc-prd-hpcn003|gc-prd-hpcn004|gc-prd-hpcn005|gc-prd-hpcn006"
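
The long egrep pattern used in the commands above can be generated rather than typed. This is a sketch with a hypothetical function name (node_pattern); it assumes the node names are a fixed prefix followed by a three-digit number, as in the list of new nodes.

```shell
# Build an egrep pattern such as "gc-prd-hpcn001|...|gc-prd-hpcn006"
# from a name prefix and the first and last node numbers.
node_pattern() {
    prefix=$1; i=$2; last=$3; pattern=
    while [ "$i" -le "$last" ]; do
        pattern="${pattern:+$pattern|}$(printf '%s%03d' "$prefix" "$i")"
        i=$(( i + 1 ))
    done
    printf '%s\n' "$pattern"
}
```

For example: pbsnodes -aSj | egrep "$(node_pattern gc-prd-hpcn 1 6)". Pass the numbers without leading zeros (1 and 6, not 001 and 006).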

Qs27: Getting the error "A library with BLAS API not found. Please specify library location"

You could try a trace.
cmake --trace . 2>&1 | tee /tmp/cmakeOut.txt

You may have to explicitly specify the paths. For example:
cmake . -DLAPACK_LIBRARIES=/sw/library/lapack/lapack-3.6.0/3.6.0/lib64/liblapack.so -DBLAS_LIBRARIES=/sw/library/blas/CBLAS/lib/cblas_LINUX.so
 
OR:
 cmake . -DLAPACK_LIBRARIES=/sw/library/lapack/lapack-3.6.0/3.6.0/lib64/liblapack.so -DBLAS_LIBRARIES=/sw/library/blas/CBLAS/lib/cblas_LINUX.so -DCMAKE_INSTALL_PREFIX=/sw/simbody/353


Qs28:  How do I check the number of CPUs my job is using?

You can find out on which compute node your job is running:
qstat -1an|grep snumber
>>>>>>>>>>>>>>>>>>>
e.g:
qstat -1an|grep s2761086
4598354.pbsserv s2761086 workq    DT_k-e_04   21795   1   1   10gb 99999 R 523:2 n010/0
Here you see it is running on n010
Then you can do this:
ssh nodename -t "htop"  or ssh nodename -t "htop -u username" 
e.g. ssh n010 -t "htop -u s2761086"
Press <F2> key, go to "Columns", and add PROCESSOR under "Available Columns".
The currently used CPU ID of each process will appear under "CPU" column.



Qs29:   How to customize an environmental variable using modules

mkdir -p ~/sw/Modules
cd ~/sw/Modules

Create a modules file for the application (here it is named xbeach)
>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
vi xbeach

#%Module######################################################################
##
##      XBeach modulefile
##
proc ModulesHelp { } {
        puts stderr "Sets up shell environment to use XBeach  "
        puts stderr " "
}

module load gnu/4.9.2

set base        /gpfs1/groups/gccmss/project/sw/xbeach
set base_path   $base

##prepend-path    INCLUDE             $base_path/include
prepend-path    LD_LIBRARY_PATH     $base_path/lib
prepend-path    PATH         $base_path/bin
##prepend-path    MANPATH      $base_path/man
##setenv          NCBI         $base_path/local
>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Edit the .bash_profile file and add the local MODULEPATH
>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
vi ~/.bash_profile
export MODULEPATH=$MODULEPATH:~/sw/Modules
>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Log out and log back in.
Check if the module is available
module avail xbeach


Qs 30: Provide an example pbs script that can be used on flashlite HPC located in UQ

#! /bin/bash
### Template PBS script to submit Delft3d job to queue gccm
#PBS -m e
#PBS -N brisb0809
#PBS  -l  walltime=200:00:00
#PBS  -l  nodes=1:ppn=4:intel,mem=30gb,vmem=30gb
#PBS  -A qris-gu
source $HOME/.bashrc
module load delft3d-rev5624

cd /gpfs1/groups/gccmss/project/delft3d/brisb_aug0809/
##input data file
export MDWFILE=fm_wave.mdw
export ARGFILE=config_flow2d3d.xml
## RUN ##
# start flow
echo "=== start Delft3D-FLOW in the background ==="
#$MPIRUNEXEC -np $NHOSTS d_hydro.exe $ARGFILE &
#deltares_hydro.exe $ARGFILE -keepXML &
#deltares_hydro.exe $ARGFILE  &
d_hydro.exe $ARGFILE &
# start wave
echo "=== start Delft3D-WAVE in the background ==="
#$MPIRUNEXEC -np $NHOSTS wave.exe $MDWFILE 1
wave.exe $MDWFILE 1
echo "=== calculation finished ==="
echo "Finished job "`date`


Qs 31: Multiple cores are requested and allocated by PBS but the job runs on only 1 core. Why is that?

This contribution is from Nicholas Dhal and is acknowledged. Nick is an active Griffith HPC user.


>>>>>>>>>

When a job is executed, the processes running the job are divided up amongst the requested number of cores on the HPC compute node. The processes have an associated setting called 'affinity' which is a logical (true or false) test that specifies which CPU core the processes are allowed to run on. In serial cases, this usually makes no difference because everything is on a single CPU only. However, in parallel runs with multiple cores, sometimes the affinity is set incorrectly by the scheduler and this limits performance.
For instance, if a 4 core job is requested, the scheduler will allocate 4 cores to the job (let's call them cores 2, 6, 11 and 12). If the job is running correctly, the four main processes that do the majority of the data processing will each have their own core and use 100% of their respective core. If the job is running incorrectly, the main processes will share a CPU (or combination of CPUs). They could all be on a single CPU (let's say CPU 6 from our example) and thus each process would only be running at 25% (100% / 4 processes). The job would still have additional CPUs allocated to it that would be unused (in our example cores 2, 11 and 12). In these cases the affinity for each process would specify that the processes are only allowed on certain CPUs and wouldn't include the other CPUs which have been allocated to the job. In our example the affinity for each process would be set to core 6 only.
To check the affinity (and the performance of each process on the compute node) of the process running, a command called 'taskset' is used. These steps are used:
  1. Using PuTTY (or similar), login to the HPC headnode.
  2. Identify the compute node the simulation is running on. (Try typing 'qstat -1an | grep s123456' to get a list of running jobs with node numbers).
  3. Login to the node where the process will be. To login type 'ssh nxxx', where xxx is the node number.
  4. Identify the process id. Either by looking at a cleanup file (seen through WinSCP in the directory of execution) or by looking at a list of all running processes on the node (step 5).
  5. To see a list of running processes on the node, run the command 'htop'. Put it into tree mode for best view (F5). You can then see the CPU the processes are on and how much of the CPU the process is using.
  6. Once ready to find the affinity setting, exit 'htop' but stay logged into the node. Run the command 'taskset -p xxxx' where xxxx is the process id.
  7. The value returned is a hexadecimal code that should be converted to binary (just google hex to binary for converter).
  8. The binary code will be a zero where the core isn't allowed and a one where the core is allowed.
  9. To set the affinity, we recommend allowing all cores (the scheduler will automatically set it to the ones you are actually allocated). The hex code for all cores is F (one F for each 4 cores on the node). To set run the command 'taskset -p FFF xxxx' where the number of F's is the number of cores on the node divided by 4 and xxxx is the process id.
  10. You will have to repeat step 9 for all main process IDs that are running in your job (just the main processes that use 100% of CPU). If you have multiple nodes, you will have to repeat for each node.
Applying this to our example above: if the node was n006, the number of cores on the node is 12. Type "cpuinfo" to find the number of cores (typically 12, 16 or 24). So, starting at step 3 it would be:
  • ssh n006
  • htop
  • (look around and find process id, let's say 2167)
  • F10
  • taskset -p 2167
  • the value returned would be '20' for only core 6. This is hexadecimal for binary 000000100000, where the single 1 is at the sixth place (reading right to left)
  • taskset -p fff 2167 (if the number of cores in the node was 24, it would be taskset -p ffffff 2167)
  • taskset -p 2167
  • exit
Now the process 2167 would be allowed on all the cores that the job was allocated. The second to last command checks the affinity after setting it. The value returned should be 'c22', which converted to binary is 110000100010. This is cores 2, 6, 11 and 12, all the cores allocated to the job. The last command exits from the compute node back to the headnode. Then repeat for the remaining processes.
Above, the term 'main processes' was used. In many simulations there are processes that do interface work, message passing and these don't use very much CPU resources. That is why we refer to 'main processes' as the ones that do the actual data processing and the ones that limit performance. When in htop on the node, these main processes should have CPU % above 0.0, the rest can be ignored.
The command 'htop' uses some CPU resources on the node, so limit your time logged into the node (no more than 10 minutes). To exit 'htop' use the key 'F10'. To exit the node use the command 'exit'.
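
The hex to binary conversion in steps 7 and 8 can also be done in the shell instead of an online converter. A minimal sketch, assuming a bash-compatible shell; mask_to_cores is a made-up helper name, and it counts positions from 1 (rightmost bit = 1) to match the worked example above, whereas taskset itself numbers CPUs from 0:

```shell
# Convert a taskset hex affinity mask into the list of allowed core positions.
# Positions are counted from 1 (rightmost bit = position 1), as in the text.
# mask_to_cores is a hypothetical helper name.
mask_to_cores() {
    local hex="${1#0x}"          # accept masks with or without a leading 0x
    local mask=$((0x$hex))       # parse the hex string as a number
    local pos=1 out=""
    while [ "$mask" -gt 0 ]; do
        if [ $((mask & 1)) -eq 1 ]; then
            out="${out:+$out }$pos"   # lowest bit is set: this position is allowed
        fi
        mask=$((mask >> 1))
        pos=$((pos + 1))
    done
    echo "$out"
}

mask_to_cores 20     # prints: 6
mask_to_cores c22    # prints: 2 6 11 12
```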

goo.gl/koYII3


>>>>>>>>>>

Qs 32: How to check the remaining licenses on the license server

/usr/local/bin/lmstat -a -c 27006@gc-prd-erslic.corp.griffith.edu.au
/sw/sysadmin/intelflexlm/lmstat -a -c 27006@gc-prd-erslic.corp.griffith.edu.au
/usr/local/bin/lmstat -a -c 27004@gc-prd-erslic.corp.griffith.edu.au


Qs 33: User defined Job Priority

The queue system uses a scheduling algorithm to maximize the use of resources while prioritizing some jobs over others. Normal users can only lower the priority of their jobs (e.g. by passing a negative value to -p). This lets you shuffle the order of your jobs with respect to each other:
qalter -p value job 
qalter -p  -1020  101
qalter -l prio=1000 <jobID>

qsub -p priority job-script
priority is a number between -1024 and +1023. A higher number means higher priority. The default priority is 0.

Ref: 

Qs 34: How do I request access to external clusters like NCI's Raijin Cluster

Here are some of the HPC resources available to Griffith researchers:

1. Griffith users can use facilities like NCI and Pawsey.
For NCI, project allocations are usually made quarterly and are on a "use it or lose it" basis.

To get access please follow the instructions here:
http://nci.org.au/access/user-registration/new-project-application/

You will need to get a new user account first as specified on the page. When you apply for a project, please select to propose a new project and select QCIF as the funding body. You will need to write a project proposal and request computational time per quarter. You may contact the QCIF representative, Marlies Hankel (m.hankel@uq.edu.au), if you have any questions related to NCI, Pawsey etc.

2. The following resources are available through QCIF/QRISCloud:

a. euramoo cluster
b. flashlite cluster
c. special compute nodes with GPUs, big memory etc
d. virtual machines
e. virtual storage

To request an account, please follow the links here:
https://www.qriscloud.org.au/index.php/services

For example, to request an account on the euramoo cluster, please click the link "Register to use Euramoo". To get further information about a particular resource, please click on the "more" link beside the link for that resource. For example, click on the "more" link beside "Request to use Flashlite" to get more information about the flashlite cluster.


Qs 35: Sample script on the awoonga cluster


#!/bin/bash
#PBS -m abe
#PBS -M YOUREMAIL@griffith.edu.au
#PBS  -A qris-gu
#PBS -l nodes=1:ppn=1,mem=3GB,vmem=3GB,walltime=01:00:00
#PBS -N TestOnly
######################################################################

######################################################################
#### This section is setting up and running your executable or script
######################################################################
module load R/3.2.3
module list
cd  $PBS_O_WORKDIR
#Now do some things
echo -n "What time is it ? "; date
echo -n "Who am I ? " ; whoami
echo -n "Where am I ?"; pwd
echo -n "What's my PBS_O_WORKDIR ?"; echo $PBS_O_WORKDIR
echo -n "What's my TMPDIR ?"; echo $TMPDIR
echo "Sleep for a while"; sleep 1m
echo -n "What time is it now ? "; date

ref: https://www.qriscloud.org.au/support/qriscloud-documentation/92-awoonga-user-guide#batch_system


Qs 36: How to install Local R Packages without root access

Ref: http://www.ceci-hpc.be/r_packages.html

To enable you to download R packages from outside Griffith, you may do this:
source /usr/local/bin/s3proxy.sh


Load the R module 
module load R/4.1.3nopkgs
(Older R modules are also available if needed, e.g. module load R/4.0.3, module load R/3.6.1 or module load anaconda3/2019.07py3; source activate R)
Create file named ~/.Renviron

nano ~/.Renviron
## Linux - check version of R/OS
R_LIBS=~/R/x86_64-pc-linux-gnu-library/3.6
OR R_LIBS=~/R/x86_64-pc-linux-gnu-library/4.0

(Create this directory if it doesn't exist, e.g. mkdir -p ~/R/x86_64-pc-linux-gnu-library/3.6)
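
The last component of the R_LIBS path follows the major.minor version of the R module that is loaded. A small sketch of deriving it from a version string; r_user_lib is a hypothetical helper name, and the platform directory x86_64-pc-linux-gnu-library is simply taken from the example above and may differ on other systems:

```shell
# Build the personal library path for a given R version string.
# r_user_lib is a made-up helper; the platform directory name is an assumption
# copied from the example above.
r_user_lib() {
    local version="$1"                  # e.g. 3.6.1 or 4.1.3nopkgs
    local major_minor="${version%.*}"   # drop the patch level: 3.6.1 -> 3.6
    echo "$HOME/R/x86_64-pc-linux-gnu-library/$major_minor"
}

# Print the line to put in ~/.Renviron for the R/4.1.3nopkgs module
echo "R_LIBS=$(r_user_lib 4.1.3nopkgs)"
```

The directory itself can then be created with mkdir -p "$(r_user_lib 4.1.3nopkgs)".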

The .libPaths() command lists the places where R will 
search for libraries; R uses the first item of the list as the target for 
new package installs.

Try this: module load R/4.1.3nopkgs
R
 > .libPaths()
  [1] "/usr/lib64/R/library" "/usr/share/R/library"

Let's install the dummy package
  > install.packages('dummy')
R tries to install it in the global library
  Installing package(s) into ‘/usr/lib64/R/library’
  (as ‘lib’ is unspecified)
  Warning in install.packages("dummy") :
    'lib = "/usr/lib64/R/library"' is not writable
but quickly notes that it cannot write in the global place, and asks 
whether it should create a local library. Simply answer 'y'.
  Would you like to create a personal library
  '~/R/x86_64-redhat-linux-gnu-library/2.13'
  to install packages into?  (y/n) y

Another example:
install.packages('raster', type='source') 
install.packages('RCurl', type='source')
install.packages('rgdal', type='source',  configure.args="--with-proj-share=~/opt/sw/library/proj/4.8.0/share/proj")


Qs 37: How to install custom packages on flashlite/awoonga clusters without root access

Sometimes, it is possible to install specific versions of software without root access. Please find below an example of how the R package named "rgdal" was installed in a local directory

module load R/3.2.3
cd /tmp
Download the packages you want
>>>>>>>>>>
e.g:
wget https://cran.r-project.org/src/contrib/Archive/rgdal/rgdal_1.2-10.tar.gz
wget https://cran.r-project.org/src/contrib/Archive/rgdal/rgdal_1.2-4.tar.gz
wget https://cran.r-project.org/src/contrib/Archive/sp/sp_1.2-2.tar.gz
>>>>>>>>>>>
R CMD INSTALL /tmp/rgdal_1.2-10.tar.gz 
* installing to library '~/R/x86_64-pc-linux-gnu-library/3.2'
ERROR: this R is version 3.2.3, package 'rgdal' requires R >= 3.3.0

Tried with a lower version of rgdal
>>>>>>>>>
R CMD INSTALL rgdal_1.2-4.tar.gz  
* installing to library '/~/R/x86_64-pc-linux-gnu-library/3.2'
ERROR: dependency 'sp' is not available for package 'rgdal'
* removing '/~/R/x86_64-pc-linux-gnu-library/3.2/rgdal'
It said the sp package was needed, so I tried a lower version of sp:
>>>>>>>>
R CMD INSTALL sp_1.2-2.tar.gz
That worked.
But another error was received:
>>>>>>>>>>>>>>>>>>
R CMD INSTALL rgdal_1.2-4.tar.gz
checking for gdal-config... no
configure: error: gdal-config not found or not executable.
ERROR: configuration failed for package 'rgdal'
>>>>>>>>>>>>>>
Tried to load this package using the command "module load gdal/2.0.2"

Got another error:

checking for proj_api.h... no

configure: error: proj_api.h not found in standard or given locations.
Try this: module load proj/4.9.1
Still got the error "configure: error: proj_api.h not found in standard or given locations."
A cursory look at the proj/4.9.1 module file showed that it was not set up correctly (module display proj/4.9.1)

Fix it by creating a custom module file:
mkdir ~/.moduleshome
cd ~/.moduleshome

vi ~/.bash_profile
export MODULEPATH=$MODULEPATH:~/.moduleshome
source ~/.bash_profile
Create the custom module file for proj: 
>>>>
mkdir -p ~/.moduleshome/proj/
vi ~/.moduleshome/proj/mine-4.9.1
>>>>>>>>
#%Module
setenv         PROJHOME /opt/proj 
prepend-path     PATH /opt/proj/bin 
prepend-path     LD_LIBRARY_PATH /opt/proj/lib 
prepend-path     LDFLAGS -L/opt/proj/lib 
prepend-path     LIBRARY_PATH /opt/proj/lib 
prepend-path     PKG_CONFIG_PATH /opt/proj/lib/pkgconfig 
prepend-path     MANPATH /opt/proj/share/man 
prepend-path     INCLUDE /opt/proj/include 
prepend-path     CPLUS_INCLUDE_PATH /opt/proj/include 
prepend-path     CPATH /opt/proj/include 
prepend-path     C_INCLUDE_PATH /opt/proj/include 
prepend-path     FPATH /opt/proj/include 
prepend-path     CFLAGS -I/opt/proj/include 
prepend-path     CPPFLAGS -I/opt/proj/include 
prepend-path     CXXFLAGS -I/opt/proj/include 
prepend-path     FCFLAGS -I/opt/proj/include 
prepend-path     FFLAGS -I/opt/proj/include 
-------------------------------------------------------------------
>>>>>>>>>
Then:
module unload proj/4.9.1
module load proj/mine-4.9.1
R CMD INSTALL rgdal_1.2-4.tar.gz
SUCCESS!


Qs 38: How do I get started on the awoonga cluster

You may request an account on Awoonga by going to this link:
https://www.qriscloud.org.au/services

Please click the link "Register to use Awoonga"

Usage:


Griffith users have access to directories /sw/GU and /sw/Modules/GU.
Griffith Support team has write permissions via the group "suppport4".
All others have read access to these directories.

The Griffith support team has created modulefiles under /sw/Modules/GU.

All users are asked to add this line into their ~/.bashrc file in their home directory:

module use /sw/Modules/GU

Parallel to the several /sw/{support team} directories

We recommend the compiler versions from the SDSC rolls, so
module load gnu or intel
module load any libraries
Build (configure, make, make install)

module avail will list all the modules.


Qs. 39: Proper way to set up library path on custom installation


vi  ~/.bashrc
module use /export/home/YOURSNUMBER/sw/Modules/GU
>>>>>>>
mkdir -p ~/sw/Modules/GU
cd ;HOMEDIR=`pwd`;echo $HOMEDIR
echo "module use $HOMEDIR/sw/Modules/GU" >>~/.bashrc
source ~/.bashrc
>>>>>>>

Create a module file like this
vi ~/sw/Modules/GU/postgress

>>>>>>>>>>>
#%Module######################################################################
##
##      PostGress  modulefile
##
proc ModulesHelp { } {
        puts stderr "Sets up shell environment to use Postgress"
        puts stderr " "
}

set base_path   /export/home/snumber/sw/PostgresSQL



prepend-path    PATH            $base_path/bin
#prepend-path    MANPATH         $base_path/share/man
prepend-path    MANPATH         $base_path/share
##setenv         BLASTDB         $dbbase_path/blastdb

prepend-path    LIBRARY_PATH            $base_path/lib
prepend-path    LD_LIBRARY_PATH $base_path/lib
prepend-path    LDFLAGS                 -L$base_path/lib
prepend-path    PKG_CONFIG_PATH         $base_path/lib/pkgconfig
prepend-path    INCLUDE             $base_path/include
prepend-path    CPLUS_INCLUDE_PATH      $base_path/include
prepend-path    CPATH                   $base_path/include
prepend-path    C_INCLUDE_PATH          $base_path/include
prepend-path    FPATH                   $base_path/include
prepend-path    CFLAGS                  -I$base_path/include
prepend-path    CPPFLAGS                -I$base_path/include
prepend-path    CXXFLAGS                -I$base_path/include
prepend-path    FCFLAGS                 -I$base_path/include
prepend-path    FFLAGS                  -I$base_path/include


>>>>>>>>>>>>

source ~/.bashrc

check if it works:
module display postgress


Qs40: How to install a python based application while not having root privilege


Here is an example to install a python application named busco locally:

>>>>>>>>>>>>>>>>>>>>>
mkdir -p ~/sw
cd ~/sw
#You may load any version of python we have on the cluster
module load python/3.6.1
cd ~/sw/busco
python setup.py install --prefix=$HOME/sw/busco 2>&1 | tee buscoInstallLog.txt

>>>>>>>>>>>>>>>>>>>>>>
That's all.  Now you need to set the PYTHONPATH variable properly

The easiest way is to set up a module file locally, or to add this to .bashrc:
export PYTHONPATH=/export/home/s12345/sw/busco/lib/python3.6/site-packages:$PYTHONPATH

The preferred option is to add it as a module:
You can do this one time 
>>>>>>>>>>>>>>>setting up custom/local module env<<<<<<<<<<<<<<<<
mkdir -p ~/sw/Modules
cd ;HOMEDIR=`pwd`;echo $HOMEDIR
echo "module use $HOMEDIR/sw/Modules/" >>~/.bashrc
source ~/.bashrc
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Set up the module file
>>>>>>>>>>>>>>>>>>>>>>
Create a module file named custom-busco in /export/home/s12345/sw/Modules/busco

>>>>>>>>>
#%Module######################################################################
##
##      python modulefile
##
proc ModulesHelp { } {
        puts stderr "Sets up paths for busco with python 3.6.1"
}

module-whatis   "adds custom PYTHONPATH directories to PATH etc. "


set python_base         /export/home/s12345/sw/busco
prepend-path    PATH    $python_base/scripts
prepend-path    PYTHONPATH      $python_base/lib/python3.6/site-packages
prepend-path    LD_LIBRARY_PATH $python_base/lib:$python_base/lib/python3.6


>>>>>>>>>
++++++++++++++++
module display busco/custom-busco
-------------------------------------------------------------------
/export/home/s12345/sw/Modules//busco/custom-busco:


module-whatis	 adds custom PYTHONPATH directories to PATH etc.
prepend-path	 PATH /export/home/s12345/sw/busco/scripts
prepend-path	 PYTHONPATH /export/home/s12345/sw/busco/lib/python3.6/site-packages
prepend-path	 LD_LIBRARY_PATH /export/home/s12345/sw/busco/lib:/export/home/s12345/sw/busco/lib/python3.6
-------------------------------------------------------------------

+++++++++++++++++
Simply do this:
module load busco/custom-busco


To run this program:

Now you can do this:
/export/home/s12345/sw/busco/scripts/run_BUSCO.py -i /export/home/s12345/scratch/SRO_trinity.fasta -o SRO_BUSCO -l ~/scratch/metazoa_odb9 -m tran -c 4

Once satisfied, you can write the pbs script and the config.ini (/export/home/s2981868/sw/busco/config/)


>>>>>>>>>>
#!/bin/bash -l
#PBS -m abe
#PBS -M MYEMAIL@griffith.edu.au
#PBS -N BUSCO_SRO
#PBS -l walltime=250:00:00
#PBS -l select=1:ncpus=4:mpiprocs=4
cd $PBS_O_WORKDIR
source $HOME/.bashrc

module load python/3.6.1
module load busco/custom-busco

/export/home/s12345/sw/busco/scripts/run_BUSCO.py -i /export/home/s12345/scratch/SRO_trinity.fasta -o SRO_BUSCO -l ~/scratch/metazoa_odb9 -m tran -c 4


Qs41: How to install python modules without root access

ref: https://stackoverflow.com/questions/7465445/how-to-install-python-modules-without-root-access

In this example, python 2.7.10 is used but this is applicable to any version of python.

module load python/2.7.10

run the netcheck script to log into the internet
sh /sw/sysadmin/netcheck (old cluster)
OR on the new cluster 
source /usr/local/bin/s3proxy.sh

You may use pip/easy_install/python setup.py to install to a local directory

pip
===
pip install --user package_name
pip install --install-option="--prefix=$HOME/scripts" package_name
e.g pip install --install-option="--prefix=/export/home/s2819099/scripts"  nose


easy_install
==========
easy_install --prefix=$HOME/scripts package_name
which will install into $HOME/scripts/lib/pythonX.Y/site-packages
You will need to manually create $HOME/scripts/lib/pythonX.Y/site-packages

and add it to your PYTHONPATH environment variable (otherwise easy_install will complain -- btw run the command above once 
to find the correct value for X.Y).

e.g: 
export PYTHONPATH=/export/home/s2819099/scripts/lib/python2.7/site-packages
easy_install --prefix=/export/home/s2819099/scripts nose

setup.py
========
Source: http://docs.python.org/install/index.html#alternate-installation
python <lxml_distrib_dir>/setup.py install --home=<dir>
e.g:
python setup.py install --home=/export/home/s2819099/scripts

virtualenv
==========
$ curl -O https://raw.github.com/pypa/virtualenv/master/virtualenv.py
$ python virtualenv.py my_new_env
$ . my_new_env/bin/activate
(my_new_env)$ pip install package_name

Source and more info: https://virtualenv.pypa.io/en/latest/installation/

Finish the installation
========================
export PYTHONPATH=$PYTHONPATH:$HOME/scripts/lib/python2.7/site-packages

Please logout of the internet on gowonda, otherwise all internet charges from all users will be charged to your account.
sh /sw/sysadmin/netcheck -logout

 


Qs42 : User defined dependency

Source: http://web.mit.edu/longjobs/www/faq.html

To change a dependency after you have submitted a job, use qalter -W depend=type:argument jobid; to remove the dependency completely, use qalter -W depend=type jobid (i.e., omit :argument). For example:
       
    athena% qstat -f 1175 | grep depend
    depend = afterok:1171.hydrogen.mit.edu@hydrogen.mit.edu
  
To make 1175 wait for 1172 instead of 1171: 

    athena% qalter -W depend=afterok:1172 1175
    athena% qstat -f 1175 | grep depend
    depend = afterok:1172.hydrogen.mit.edu@hydrogen.mit.edu
       
To clear the dependency: 

    athena% qalter -W depend=afterok 1175
    athena% qstat -f 1175 | grep depend
    athena%
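
A dependency can also be set when the job is first submitted rather than with qalter afterwards. A minimal sketch, assuming qsub accepts the same -W depend=afterok:jobid form shown in the qalter examples above; submit_after and the script names step1.pbs and step2.pbs are hypothetical:

```shell
# Submit a job that starts only after another job has finished successfully.
# submit_after is a made-up wrapper; qsub must be available on the login node.
submit_after() {
    local parent_id="$1"   # job ID to wait for, e.g. 1171
    local script="$2"      # PBS script to submit
    qsub -W depend=afterok:"$parent_id" "$script"
}

# Example (not run here): chain two scripts
# first=$(qsub step1.pbs)
# submit_after "$first" step2.pbs
```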
       
 


Qs 43: How do I get access to QCIF share of NCI @Raijin machine? How do I get access to QCIF GPU machine

How do I get access to QCIF share of NCI @Raijin machine?

https://www.qriscloud.org.au/index.php/services/compute#NCI

How do I get access to QCIF GPU machine Wiener (located @UQ)?
Check under "Use Specialised compute"
https://www.qriscloud.org.au/index.php/services/compute#SpecialCompute



Qs 44: How do I use wget on GriffithHPC? How do I get outside access to download for example conda packages?

You can go through a proxy to access external collections using wget. You may create a file named .wgetrc (vi ~/.wgetrc) with the following contents

use_proxy = on
http_proxy = http://s3proxy.itc.griffith.edu.au:3128/
HTTP_PROXY = http://s3proxy.itc.griffith.edu.au:3128/
https_proxy = https://s3proxy.itc.griffith.edu.au:3128/
HTTPS_PROXY = https://s3proxy.itc.griffith.edu.au:3128/



You may create a file named s3proxy.sh (vi ~/s3proxy.sh) with the following contents
export http_proxy=http://s3proxy.itc.griffith.edu.au:3128/
export HTTP_PROXY=http://s3proxy.itc.griffith.edu.au:3128/
export https_proxy=https://s3proxy.itc.griffith.edu.au:3128/
export HTTPS_PROXY=https://s3proxy.itc.griffith.edu.au:3128/

Simply source this file (source ~/s3proxy.sh) to get temporary access to external packages. Please note that all downloads are monitored. Also note that your home directory should not go above 200GB in size.
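
Once sourced, the proxy variables stay set for the rest of the shell session. If you prefer to switch them on only around a download, a pair of functions is one option. This is a sketch only: proxy_on and proxy_off are hypothetical names, and the proxy URLs are the ones listed above.

```shell
# Turn the proxy variables on and off within the current shell session.
# proxy_on / proxy_off are made-up helper names; the URLs are copied from above.
proxy_on() {
    export http_proxy=http://s3proxy.itc.griffith.edu.au:3128/
    export HTTP_PROXY=$http_proxy
    export https_proxy=https://s3proxy.itc.griffith.edu.au:3128/
    export HTTPS_PROXY=$https_proxy
}

proxy_off() {
    unset http_proxy HTTP_PROXY https_proxy HTTPS_PROXY
}
```

For example: proxy_on, then run wget for the file you need, then proxy_off.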


Qs 45: How do I install my own conda environment without root access


#Gain access to the outside: source /usr/local/bin/s3proxy.sh
#Load the anaconda module (e.g. module load anaconda3/2023.09)
#Add the following entry into .condarc (nano ~/.condarc)
channels:
  - defaults

#Create the following folders:

mkdir -p ~/.conda/envs
mkdir -p ~/.conda/pkgs

See if the following commands work!
#conda info
#	conda search flask


conda create --name env1
#e.g conda create --name trinity_env --clone root
#To remove an environment, conda remove -n env --all

source activate env1

#e.g source activate trinity_env

Now you can install your packages:
conda search -c bioconda trinity

For example, to install trinity version 2.9.1:

e.g conda install -n trinity_env -c bioconda trinity=2.9.1
conda create --name rstan --channel conda-forge r-dplyr r-rmapshaper r-sf

To check all versions available of the rstan package:
conda search r-rstan --channel conda-forge


Qs 46: How do I transfer a centrally installed application to my local folder 


You may wish to transfer the app locally due to write issues on the app folder.
This can be easily accomplished. Here is an example to transfer the SWAT-CUP app:

cd ~
mkdir ~/sw
cp -r /sw/misc/swatcup ~/sw
Change permission like this

chown -R yoursnumber:yoursnumber ~/sw/swatcup

Now you would have it locally.

Then set up modules to load it. Follow this link to set up the modules environment:
https://conf-ers.griffith.edu.au/display/GHCD/FAQ+-+Gowonda#FAQ-Gowonda-Qs29:Howtocustomizeanenvironmentalvariableusingmodules


After following the above link and creating the ~/sw/Modules directory, you can simply copy the parent module file and make changes to it where needed.

cp  /sw/Modules/misc/swatcup/swatcup ~/sw/Modules

vi ~/sw/Modules/swatcup
(make the changes)

This module will be available to you on the login. Or if needed immediately, source ~/.bash_profile  


Qs 47: How do you check the current status of the cluster?

On the login node, please run the commands: pnodes, pjobs and pqueues 

press "Q" to quit

"qhost" will list all the nodes. Please note that not all nodes are available to all users. Only a subset is available.

"pbsnodes -aSj" will give an indication of the currently available resources. Again, not all available resources can be used and pbs will determine where and when a job is run.

For example, there are nodes with 72 cores, but if you request all 72 cores your wait could be very long (weeks or months) unless the cluster is nearly idle.

"qstat -q" will give the queue configuration.


To view the current cluster status, you can also use the elinks text browser on the login node to view the status:

elinks http://localhost:3000/nodes

elinks http://localhost:3000/jobs

elinks http://localhost:3000/queues

(You can press "Q" to quit from the above text-based browsers.)

Qs 48: Why isn’t my job running?

There are many reasons that your job may have to wait in the queue longer than you would like. Here are some of them.

1. System load is high. It’s frustrating for everyone! There are peaks and lows. Unfortunately it is not possible to predict when we will experience this. We have noted that near the beginning of the semester, midway through the semester and at the end of the semester, the load is high. Load is low during the semester break. However this may not be the case all the time.
2. A system downtime has been scheduled and jobs are being held. Check the message of the day, which is displayed every time you login, or emails to the LML - GHPC list.
3. You or your group have used a lot of resources in the last few days, causing your job priority to be lowered (“fairness policy”). Priority is a complicated function of many factors, including the processor count and walltime requested, the length of time the job has been waiting, and how much other computing has been done by the user and their group over the last several days.
4. You or your group are at the maximum processor count or running job count and your job is being held.
5. Your job is requesting specialised resources, such as large memory, gpus or certain software licences, that are in high demand.
6. Your job is requesting a lot of resources. It takes time for the resources to become available.
7. Your job is requesting incompatible or nonexistent resources and can never run. For example, if you request 200GB memory, it will never run as the maximum capacity of a node is 170GB distributed among many jobs. 
8. Job is unnecessarily stuck in batch hold because of system problems (very rare!).


Qs 49: How do I copy files and folders from the old cluster to the new cluster

From the new login node (gc-prd-hpclogin1.rcs.griffith.edu.au), you can see your old home and scratch:

ls /ngumbin/home/s123456
You will see tons of files from old home

ls /ngumbin/scratch/s123456

Again you will see a lot of files and folders from your old scratch.

If you wish to copy a folder named FAS-917 located in the old scratch into the new scratch, you can do this on the new login node.

screen
cp -r /ngumbin/scratch/s123456/FAS-917 ~/scratch &
You will see the copy. Just press ENTER and type
screen -d
After this you can even logout and the copy will continue till it is done.

Please note screen is used to run the command even after you logout. 
The mount point to the new home and scratch is not available on the old login nodes gowonda/gowonda2; instead, the old home and scratch are available on the new login node "gc-prd-hpclogin1".
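
After a long copy it can be worth confirming that the new copy matches the old one before relying on it. A minimal sketch using cp and diff; copy_and_verify is a hypothetical helper name, and comparing very large trees with diff -r can itself take a while:

```shell
# Copy a directory tree into a destination directory and confirm that
# source and copy match. copy_and_verify is a made-up helper name.
copy_and_verify() {
    local src="$1" dest_parent="$2"
    cp -r "$src" "$dest_parent"/ || return 1
    # diff -r compares the two trees file by file; exit status 0 means identical
    diff -r "$src" "$dest_parent/$(basename "$src")" > /dev/null
}
```

For example: copy_and_verify /ngumbin/scratch/s123456/FAS-917 ~/scratch && echo "copy verified"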


Qs 50: How do you set up Griffith VPN?

You may follow the instructions on the Griffith vpn site.
https://intranet.secure.griffith.edu.au/computing/remote-access/virtual-private-network


Qs 51: interactive jobs on the gpu queues

You will need to be part of the gpu queues: gpuq, dljun and dlyao group. 
Interactive pbs run
=================
You log into n060 and run this command:
qsub -I -V -q gpuq -l select=1:ncpus=1:ngpus=1:mem=12gb,walltime=10:00:00
(Depending on your need, you may wish to alter the walltime or other resources like memory etc)
It will start a pbs job in interactive mode with a wall time of 10 hours. If there are resources available currently, it will start the job immediately; otherwise you may have to wait.
Other examples:
qsub -V -I -q gpuq -l select=1:ncpus=1:ngpus=1 -l walltime=0:30:00
qsub -V -I -q dljun -l select=1:ncpus=1:ngpus=1 -l walltime=0:30:00
qsub -V -I -q dlyao -l select=1:ncpus=1:ngpus=1 -l walltime=0:30:00


Qs 52: Are there additional storage facilities at Griffith

We offer 3 research storage services at Griffith University for storage of research data. Check out https://research-storage.griffith.edu.au/  for more details.

You can take a short quiz at https://research-storage.griffith.edu.au/compare to see which service is most appropriate for you depending how you would like to use the data. 
Each service has a separate form to request an account or project space. 
Please follow the links on the page above to get to them. On the HPC there is a 200GB quota set for your account. 
So you would need to upload segments of the data that you wish to analyse. 
This quota can be temporarily increased but is done on a case by case basis and needs to be requested through the HPC administrator 

To transfer files to the HPC you can read through the FAQs https://conf-ers.griffith.edu.au/display/GHCD/FAQ+-+Griffith+HPC+Cluster
There is a section on this very question in there.


Qs 53: Sample pbs script to run MPI jobs


openmpi

#!/bin/bash 
#PBS -m abe
#PBS -M YOUREMAIL@griffith.edu.au
#PBS -N Inversion
#PBS -l select=1:ncpus=16:mpiprocs=16:mem=1gb,walltime=300:00:00
##     processor cores
slots=16
cd $PBS_O_WORKDIR
module load mpi/openmpi/4.0.2
module load cmake/3.15.5        
echo "Starting job"
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
make clean
make SolveInversion_NoRot
mpiexec $PBS_O_WORKDIR/SolveInversion_NoRot > $PBS_O_WORKDIR/Inversion_export.dat
#NP=`wc -l < $PBS_NODEFILE`
#mpirun  -hostfile $PBS_NODEFILE -np $NP mdrun_mpi_d  -deffnm md01


intelmpi (MPI_DIR = /sw/intel/ps/2019up5/compilers_and_libraries_2019.5.281/linux/mpi/intel64)


#!/bin/bash
#PBS -m abe
#PBS -M YOUREMAIL@griffith.edu.au
#PBS -N Inversion
#PBS -l select=1:ncpus=16:mpiprocs=16:mem=1gb,walltime=300:00:00
## The number of chunks is given by select=<NUM> above
##$PBS_NODEFILE is a node-list file created with select and ncpus options by PBS
PROCS=16
module load intel/2019up5/mpi
module load cmake/3.15.5        
echo "Starting job"
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
cd $PBS_O_WORKDIR
make clean
make SolveInversion_NoRot
mpiexec -n $PROCS    $PBS_O_WORKDIR/SolveInversion_NoRot > $PBS_O_WORKDIR/Inversion_export.dat


mpich

module load mpi/mpich/3.3.2-gnu
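
The scripts above hard-code the number of MPI processes (slots=16, PROCS=16). As a minimal sketch, the count could instead be derived from $PBS_NODEFILE, as the commented-out NP line in the openmpi script hints. The helper name count_mpi_ranks is hypothetical, and the sketch assumes the node file holds one line per MPI process, which is the case when mpiprocs is set in the select statement.

```shell
#!/bin/bash
# count_mpi_ranks NODEFILE
# Prints the number of lines in a PBS-style node file.
# Assumption: one line per MPI process (mpiprocs set in the select statement).
count_mpi_ranks() {
    wc -l < "$1" | tr -d ' '
}

# Inside a PBS job this could replace the hard-coded PROCS=16:
#   PROCS=$(count_mpi_ranks "$PBS_NODEFILE")
#   mpiexec -n "$PROCS" "$PBS_O_WORKDIR/SolveInversion_NoRot"
```

This keeps the mpiexec line in step with the #PBS -l select line, so only the resource request needs editing when the job size changes.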


Qs 54: How do I limit a python program to a certain number of cpus

A sample pbs script below provides an example

#!/bin/bash 
#PBS -m abe
#PBS -M YourEmail@griffith.edu.au
#PBS -N  PepDock
#PBS -q dljun@n060
#PBS -W group_list=deeplearning -A deeplearning
### Number of nodes:Number of CPUs:Number of threads per node.
#PBS -l select=1:ncpus=3:mem=12gb,walltime=100:00:00
cd $PBS_O_WORKDIR
NSLOTS=3
##module load  galaxyPepDock
module load misc/galaxypepdock/galaxyPepDock
GalaxyPepDock.centos7 -t test -p ACE2.pdb -s RBD-mimic1.fasta
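
The ncpus request above only reserves cores. Many Python programs (for example those using NumPy with a multithreaded BLAS) may still start one thread per core on the node unless told otherwise. The following is a hedged sketch, assuming your program's libraries honour the usual thread-count environment variables. limit_threads is a hypothetical helper, and $NCPUS is the variable used in the OpenMP example elsewhere in this FAQ.

```shell
#!/bin/bash
# limit_threads [N]
# Caps the thread count of common numerical libraries at N.
# Falls back to $NCPUS (set inside a PBS job), then to 1.
limit_threads() {
    local n="${1:-${NCPUS:-1}}"
    export OMP_NUM_THREADS="$n"        # OpenMP runtimes
    export MKL_NUM_THREADS="$n"        # Intel MKL
    export OPENBLAS_NUM_THREADS="$n"   # OpenBLAS
    export NUMEXPR_NUM_THREADS="$n"    # numexpr
}

# In the pbs script, before starting python:
#   limit_threads 3
#   python my_script.py      # my_script.py is a placeholder
```

Programs that start their own worker processes (e.g. via multiprocessing) take their process count from their own options, so these variables would not limit them.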


Qs 55: How to Use Local Scratch Storage (/lscratch)

The cluster is equipped with:

  1. a home file system (your /export/home/snumber). This is common to all compute nodes.
  2. a global scratch file system (your /scratch/snumber, or the link to it at /export/home/snumber/scratch). This is common to all compute nodes.
  3. a local temporary scratch file system (/lscratch/snumber). A local scratch is only visible from within the compute node it belongs to.

Since /lscratch is a local disk mounted on the node, it is faster than network storage (home or global scratch). Most nodes have limited local disk space (<20GB).

Because of this limited disk space, a node may be forced offline if a large directory is created within /lscratch on the node and not deleted once the job ends.

Your pbs job must therefore delete the directory it created once it has finished running.

A job script template for using the local scratch disk on a compute node is shown below (Ack: adapted from the UOW HPC Guide).


#!/bin/bash
#PBS -N jobName
####PBS -m abe
####PBS -M YourEmail@griffith.edu.au
#PBS -q workq
#PBS -l select=1:ncpus=1:mem=2gb,walltime=5:00:00
#======================================================# 
# USER CONFIG
#======================================================#
INPUT_FILE="hello.py"
OUTPUT_FILE="$PBS_JOBNAME.out"
MODULE_NAME="python/3.7.4"
PROGRAM_NAME="python"
# Set as true if you need those /lscratch files.
COPY_SCRATCH_BACK=true
#======================================================#
# MODULE is loaded 
#======================================================#
NP=$(wc -l < $PBS_NODEFILE)
source /etc/profile.d/modules.sh
module load $MODULE_NAME
cat $PBS_NODEFILE
#======================================================#
# SCRATCH directory is created at the local disks 
#======================================================#
SCRDIR=/lscratch/$LOGNAME/$PBS_JOBID
# -p also creates /lscratch/$LOGNAME if it does not exist yet
mkdir -p "$SCRDIR"
#======================================================#
# TRANSFER input files to the scratch directory 
#======================================================#
# just copy input file
cp -r $PBS_O_WORKDIR/$INPUT_FILE $SCRDIR
# copy everything (Option) 
#cp -r $PBS_O_WORKDIR/* $SCRDIR
#======================================================#
# PROGRAM is executed with the output or log file
# direct to the working directory
#======================================================#
echo "START TO RUN WORK"
cd $SCRDIR
# Run a system wide sequential program
##$PROGRAM_NAME < $INPUT_FILE >& $PBS_O_WORKDIR/$OUTPUT_FILE
$PROGRAM_NAME $INPUT_FILE >& $SCRDIR/$OUTPUT_FILE
###$PROGRAM_NAME $INPUT_FILE >& $PBS_O_WORKDIR/$OUTPUT_FILE
# Run a MPI program (Option)
###For openmpi, use the following syntax####
#module load mpi/openmpi/4.0.2
#mpiexec $PROGRAM_NAME < $INPUT_FILE >& $OUTPUT_FILE
####For intel mpi, use the following syntax####
#module load intel/2019up5/mpi
#mpiexec -n $NP  $PROGRAM_NAME < $INPUT_FILE >& $OUTPUT_FILE
# mpirun -np $NP $PROGRAM_NAME < $INPUT_FILE >& $OUTPUT_FILE
# Run an OpenMP program (Option)
# export OMP_NUM_THREADS=$NP
# $PROGRAM_NAME < $INPUT_FILE >& $OUTPUT_FILE
sleep 60
#======================================================# 
# RESULTS are migrated back to the working directory
#======================================================#
if [[ "$COPY_SCRATCH_BACK" == *true* ]]
then
    echo "COPYING SCRATCH FILES TO $PBS_O_WORKDIR"
    cp -rp $SCRDIR/* $PBS_O_WORKDIR
    if [ $? != 0 ]; then
        {
             echo "Sync ERROR: problem copying files from $SCRDIR to $PBS_O_WORKDIR;"
             echo "Contact HPC admin for a solution."
             exit 1
        }
    fi
fi
#======================================================#
# DELETING the local scratch directory 
#======================================================#
cd $PBS_O_WORKDIR
if [[ "$SCRDIR" == *scratch* ]]
then
    echo "DELETING SCRATCH DIRECTORY" $SCRDIR
    rm -rf $SCRDIR
    echo "ALL DONE!"
fi
#======================================================#
# ALL DONE 
#======================================================#
##  End-of-job summary
echo "qstat -H $PBS_JOBID"
echo "qstat -xf $PBS_JOBID"

Another simple example


#!/bin/bash
#PBS -N jobName
###PBS -m abe
###PBS -M Myemail@griffith.edu.au
#PBS -q workq
#PBS -l select=1:ncpus=1:mem=2gb,walltime=5:00:00
##Find the node on which the pbs job is running
PBSCOMPUTENODE=`hostname`
##echo $PBSCOMPUTENODE
##Make the directory to copy into, if needed, on the compute node
mkdir -p /lscratch/$LOGNAME/$PBS_JOBID
cp -rp /export/home/$LOGNAME/Data /lscratch/$LOGNAME/$PBS_JOBID
#Run command to process the data on the lscratch dir. E.g as below
du -kh /lscratch/$LOGNAME/$PBS_JOBID
echo "Hello World" >  /lscratch/$LOGNAME/$PBS_JOBID/welcome.txt
#Copy the output to the shared home dir
cp -rp /lscratch/$LOGNAME/$PBS_JOBID/welcome.txt /export/home/$LOGNAME/Data
###This data will now be available on the shared drive.
##FYI: the .o file (output file) will have the node on which this job was run.
#The /lscratch data must be deleted after the copy
rm -r -f /lscratch/$LOGNAME/$PBS_JOBID
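
Both scripts above delete the /lscratch directory as their last step, so the directory is left behind if an earlier step aborts the script. One possible way to guard against this, shown here only as a sketch, is a shell trap that removes the directory whenever the work finishes or fails. run_in_lscratch is a hypothetical helper name, and the unique sub-directory is created with mktemp rather than named after $PBS_JOBID.

```shell
#!/bin/bash
# run_in_lscratch BASE_DIR COMMAND [ARGS...]
# Creates a unique directory under BASE_DIR, runs COMMAND inside it and
# always removes the directory afterwards, even if COMMAND fails.
run_in_lscratch() {
    local base="$1"
    shift
    mkdir -p "$base" || return 1
    (
        scrdir=$(mktemp -d "$base/job.XXXXXX") || exit 1
        # The trap fires when this subshell exits, for any reason.
        trap 'rm -rf "$scrdir"' EXIT
        cd "$scrdir" || exit 1
        "$@"
    )
}

# In a pbs script (hello.py is the input file from the template above):
#   run_in_lscratch "/lscratch/$LOGNAME" python "$PBS_O_WORKDIR/hello.py"
```

Any results written inside the scratch directory must be copied back to $PBS_O_WORKDIR by the command itself, because the directory is gone once the function returns.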

Manual copy:

To get the content of a remote /lscratch folder
ssh remotehostname "ls -la /lscratch/snumber"
e.g: ssh n061 "ls -la /lscratch/s5284664"


To copy something from a /lscratch folder on a remote node to your scratch folder:
scp -r remotehostname:/lscratch/snumber/FolderName ~/scratch
e.g: scp -r n061:/lscratch/s5284664/AnjuFolder ~/scratch


To copy into a remote host's /lscratch folder:
scp -r ~/folder remotehost:/lscratch/snumber
e.g: scp -r /export/home/s5284664/folder n061:/lscratch/s5284664/

Qs.56: NCMAS process and application

NCMAS facilities overview and who should apply
https://youtu.be/7ZZVk4HtdDY

NCMAS process and application 2021
https://youtu.be/hmV_j5GFgI0


Qs 57: What kind of storage and compute is available on Griffith HPC

The following applies generally, but it would be best to discuss your particular situation with the cluster admin.

Compute:

All compute resources within Griffith HPC are shared with other researchers. We use a job scheduler called PBS to schedule jobs. If there is a sudden surge in jobs from other users, many jobs can be queued, sometimes for several days. It all depends on how busy the cluster is.

What can be done if your research group needs a dedicated resource on the Griffith HPC? We have limited rack space available to add dedicated nodes. This means, if your project can buy the nodes (we can arrange a quote on request), they can be racked up solely for use by your project. Please note that rack space is limited and sometimes this option may not be available.  You may contact the cluster admin if you need information about obtaining dedicated nodes.

Storage:

Currently, storage is at a premium on the Griffith HPC cluster. The recommended best practice is to keep the home directory and scratch directory under 200GB. We recognise the need for a temporary surge in space and allow it with permission from the cluster administrator, on the expectation that usage will be brought back to normal within a reasonable amount of time. If you need the space a little longer than expected, you will need to contact the cluster admin again to make suitable arrangements.

We do have shared /project space where researchers from a research group or project can share files and work space. However, due to resource issues, this space was not upgraded with the cluster upgrade in mid 2019. We are still using the old /project space and it is nearly full. Without an upgrade to the storage subsystem, it is not possible to accommodate new projects in this space.

What can you do if you need to bring in a large amount of data? This will depend on how large the data is.

If it is under 500GB, and subject to space availability, we may consider accommodating it within /scratch space. You would need to ask the cluster admin to create a shared folder on /scratch. The arrangement will be reviewed and renewed yearly. Space management will be the joint responsibility of all members of the research group. An email can be sent automatically when usage reaches 50%, 75% and 90% of capacity.
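
To make the alert levels concrete, here is an illustrative sketch of how usage could be compared against the 50%, 75% and 90% marks. It is not the cluster's actual alerting mechanism. quota_percent and quota_alert are hypothetical helpers, and the used and quota figures are whole numbers of GB that you supply yourself (for example from du -sh on the shared folder).

```shell
#!/bin/bash
# quota_percent USED_GB QUOTA_GB
# Prints the integer percentage of the quota in use (rounded down).
quota_percent() {
    echo $(( $1 * 100 / $2 ))
}

# quota_alert USED_GB QUOTA_GB
# Prints the highest alert level (50, 75 or 90) that usage has reached,
# or nothing if usage is below 50%.
quota_alert() {
    local pct level hit=""
    pct=$(quota_percent "$1" "$2")
    for level in 50 75 90; do
        [ "$pct" -ge "$level" ] && hit="$level"
    done
    [ -n "$hit" ] && echo "$hit"
    return 0
}

quota_alert 380 500   # 76% of a 500GB folder: prints 75
```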

If the data is bigger, there are options, but with some drawbacks that cluster users need to be aware of. The best option (if your project funding allows) is to buy a new storage subsystem that can be hosted within the HPC network. The drawback is that it can be expensive (indicative cost: 30k-50k, but an actual quote can be arranged on request). The advantage is that it will be connected to the fast InfiniBand network within the HPC network, so data transfer will be quick. All backend compute nodes will be able to see the data and there will be no additional overhead (such as copying in and out).

The other option is to put the data on a storage service such as Griffith's research drive and then transfer the data through sftp as it is needed. We are not able to mount the research drive on the cluster head node (gc-prd-hpclogin1/gowonda) at present. A two-hop transfer, via your local desktop to the HPC, is suggested. We are also not able to mount cloud-based storage such as AWS or Azure on the cluster. Mounted data (if any, e.g. through sshfs) cannot be used directly for compute by the compute nodes:

  1. The mount point will not be seen by compute nodes, so the data needs to be copied to /scratch (or the home directory). Once it is copied to /scratch, it can be seen by all compute nodes and used as working data from then onwards. The problem is that, depending on the size of the data, this can take a while. The output then needs to be copied back to the research drive. None of this can be done as part of the pbs script; it needs to be performed manually. There is also the risk of leaving the data behind on /scratch after use and forgetting about it. Over time, /scratch will fill up and disrupt your jobs and other users' HPC operations.
  2. The copy-in and copy-out process will be much slower, as it will use the Griffith network rather than the fast InfiniBand interconnect within the HPC cluster. This copying is additional overhead for the researcher or research group.
  3. The shared folder in /scratch will have a space limit (typically 200GB), so keeping it well managed is critical.

Please contact cluster admin to discuss this and other cluster issues so that we can enable your HPC work.

Indy |Senior Systems Engineer / Griffith HPC Administrator
eResearch Services (eResearch Support Services), Office of Digital Solutions
Griffith University  | Gold Coast Campus | QLD 4222 | G11 Room 4.42
T +61 7 5552 7259 | Mob 0434 600 814| email

griffith.edu.au | HPC User Guide | Submit a Support Ticket


Qs. 58: How can I find the list of nodes and licenses in HPC?


These commands can help:

1. List nodes and usage: pbsnodes -aSj
2. qhost
3. Jobs queued and running: qstat -1an
4. License info:
comsol: /usr/local/bin/lmstat -a -c 27006@gc-prd-erslic.corp.griffith.edu.au
abaqus: /usr/local/bin/lmstat -a -c 27005@gc-prd-erslic.corp.griffith.edu.au
ArcGis: /usr/local/bin/lmstat -a -c 27004@gc-prd-erslic.corp.griffith.edu.au
Matlab: /usr/local/bin/lmstat -a -c 27001@gc-prd-erslic.corp.griffith.edu.au
Ansys: /sw/ansys/2020R2/shared_files/licensing/linx64/ansysli_util -liusage
Ansys: /sw/ansys/2020R2/shared_files/licensing/linx64/ansysli_util -statli 2325@gc-prd-erslic.corp.griffith.edu.au

5. Different queues: qstat -q
6. Additional info: pqueues, pjobs and pnodes


Qs. 59: How do I purchase a licensed software to be installed on the Griffith HPC?


You can start the process by requesting a quote from Griffith software purchasing
https://intranet.secure.griffith.edu.au/computing/software

Click on "Request a Quote"

You can request the software through them, and they can guide you through the legal side as well.


Qs 60: How to obtain the number of abaqus licenses?


Licenses issued:

/usr/local/bin/lmstat -a -c 27005@gc-prd-erslic.corp.griffith.edu.au|grep 'Users of standard'

Users of standard:  (Total of 10 licenses issued;  Total of 10 licenses in use)


Remaining licenses:
/usr/local/bin/lmstat -a -c 27005@gc-prd-erslic.corp.griffith.edu.au|grep 'Users of standard' | awk '{ printf("%d\n", $6-$11); }'
0
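
The awk expression above subtracts the "in use" count (field 11) from the "issued" count (field 6). As a small sketch, the same calculation can be wrapped in a function that reads lmstat output on standard input. remaining_licenses is a hypothetical helper name, and the feature name (standard) is taken from the output shown above.

```shell
#!/bin/bash
# remaining_licenses FEATURE
# Reads lmstat output on stdin and prints issued minus in-use
# for the "Users of FEATURE:" line.
remaining_licenses() {
    awk -v feat="$1" '$1 == "Users" && $2 == "of" && $3 == feat ":" {
        printf("%d\n", $6 - $11)
    }'
}

# Example with the line shown above (prints 0):
echo "Users of standard:  (Total of 10 licenses issued;  Total of 10 licenses in use)" \
    | remaining_licenses standard

# On the cluster:
#   /usr/local/bin/lmstat -a -c 27005@gc-prd-erslic.corp.griffith.edu.au | remaining_licenses standard
```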


Qs61: How to run R code in parallel

Parallelisation using plyr and doParallel

We have doMC, plyr and doParallel in R/4.0.3

Threads vs. cores

There is often a lot of confusion between CPU threads and cores. A CPU core is the actual computation unit. Threads are a way of multi-tasking, and allow multiple simultaneous tasks to share the same CPU core. Multiple threads do not substitute for multiple cores. Because of this, compute-intensive workloads (like R) are typically only focused on the number of CPU cores available, not threads. (Ref: https://jstaf.github.io/hpc-r/parallel/)

Example:
module load R/4.0.3

> library(plyr)
> library(doParallel)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
> cores <- detectCores()
> cores
[1] 72
> registerDoParallel(cores=12)
> fake_func <- function(x) {
+   Sys.sleep(0.1)
+   return(x)
+ }
> 
> library(microbenchmark)
> microbenchmark(
+   serial = llply(1:24, fake_func),
+   parallel = llply(1:24, fake_func, .parallel = TRUE),
+   times = 1
+ )
Unit: milliseconds
     expr       min        lq      mean    median        uq       max neval
   serial 2424.3580 2424.3580 2424.3580 2424.3580 2424.3580 2424.3580     1
 parallel  226.2199  226.2199  226.2199  226.2199  226.2199  226.2199     1
> 




Qs62: My app is not running properly on the gpu node. It is stuck with no errors but with long application start-up times


You may be running into the issues outlined here

Use this variable:  export CUDA_CACHE_DISABLE=1
This can be added to your pbs script or set from the command line (for interactive runs).
Make sure you use the /lscratch directory for all runs.
It appears that our /scratch and /export/home are too slow. Here is an explanation from the link above:

Cache stored on a Slow Network Share
============================
On Linux, the default location of the CUDA JIT cache is in your home directory. On clusters, it is not uncommon to mount home directories
 with relatively poor performance to the compute nodes (by using the Lustre file system for scratch space, but only NFS for the home directory, 
for example). We have seen cases where this relatively slow connection to the home directory (and thus the JIT cache) resulted in very long 
application start-up times when the application was not built with code for the right SM version. Even more confusing, start-up time can vary 
from node to node due to intricacies of the NFS set up.

In this situation, it is best to build the application to avoid JIT entirely, and alternatively, to set CUDA_CACHE_PATH to point to a 
location on a fast file system.
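
As an alternative to disabling the cache, the quoted advice suggests pointing CUDA_CACHE_PATH at a fast file system. The sketch below places the cache under /lscratch, assuming the cache is left enabled (CUDA_CACHE_DISABLE not set to 1) and that the job runs on a node where /lscratch is writable. use_local_cuda_cache is a hypothetical helper name.

```shell
#!/bin/bash
# use_local_cuda_cache [BASE_DIR]
# Points the CUDA JIT cache at BASE_DIR/cuda_cache and creates the directory.
# BASE_DIR defaults to /lscratch/<your login name>.
use_local_cuda_cache() {
    local base="${1:-/lscratch/${LOGNAME:-$USER}}"
    export CUDA_CACHE_PATH="$base/cuda_cache"
    mkdir -p "$CUDA_CACHE_PATH"
}

# In a pbs script, before starting the GPU application:
#   use_local_cuda_cache "/lscratch/$LOGNAME/$PBS_JOBID"
```

Because /lscratch is local to each node, the cache directory should be removed with the rest of the job's /lscratch data when the job ends.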


Qs63: How do I connect remotely to my files on Griffith's G and H drives and network storage

https://intranet.secure.griffith.edu.au/computing/remote-access#network


Qs64: How do I analyse a core dump


If you receive errors like "Program received signal SIGSEGV: Segmentation fault - invalid memory reference.", you can force a core dump to be generated by simply running the program on the login node, outside of pbs,
e.g:
module load quantum-espresso/6.7.0
PRE='Fe'
turbo_eels.x < ${PRE}tddft.in > ${PRE}tddft.out
This generates a core dump. To analyse a core dump:
gdb program coredump
(gdb) where
(gdb)  bt full
Please search for "analysing core dump" for various techniques.

e.g: gdb turbo_eels.x core.36881

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /sw/quantum-espresso/6.7.0/bin/turbo_eels.x...(no debugging symbols found)...done.
[New LWP 36881]
Core was generated by `turbo_eels.x'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000040d7d1 in lr_alloc_init_k.3770 ()
Missing separate debuginfos, use: debuginfo-install blas-3.4.2-8.el7.x86_64 fftw-libs-double-3.3.3-8.el7.x86_64 glibc-2.17-260.el7_6.6.x86_64 lapack-3.4.2-8.el7.x86_64 libgfortran-4.8.5-36.el7_6.2.x86_64 zlib-1.2.7-18.el7.x86_64
 (gdb) where
#0  0x000000000040d7d1 in lr_alloc_init_k.3770 ()
#1  0x0000000000413ea0 in lr_alloc_init_ ()
#2  0x0000000000406bde in MAIN__ ()
#3  0x0000000000408fec in main ()

Other stuff:
===========
readelf -Wa core.36881
objdump -s  core.36881


Qs 65: Why is my GPU job not running when there are free GPU resources available

I have submitted two jobs to the n060 node, but one of my jobs is stuck in the queue. The same has happened in the gpuq. However, I can see that some GPUs are available. What do I need to do?

>>>>>>>
Please run the command qstatt to check what is running on n060:

>>>>>
qstatt

94437.gc-prd-hp s5084400 dljun    KF_RD_F2    37878   1   1  100gb 100:0 R 20:41 n060/1
94493.gc-prd-hp s5084397 dljun    OUR         17110   1   1  100gb 200:0 R 01:41 n060/2
94501.gc-prd-hp s5084400 dljun    KF_PN_F2_3  20741   1   1  100gb 100:0 R 00:18 n060/0
94503.gc-prd-hp s5084397 dljun    OUR_2_32      --    1   1  100gb 200:0 Q   --   -- 
94504.gc-prd-hp s5084397 gpuq     Our_2_32      --    1   1  100gb 200:0 Q   --   -- 

>>>>
The other command to run is:
gpustat

n060  Thu Mar 25 11:09:19 2021

[0] Tesla V100-PCIE-32GB | 54'C,  90 % | 31114 / 32480 MB | s5084400(31103M)
[1] Tesla V100-PCIE-32GB | 51'C,  91 % | 31114 / 32480 MB | s5084397(31103M)
[2] Tesla V100-PCIE-32GB | 33'C,   0 % |     0 / 32480 MB |
[3] Tesla V100-PCIE-32GB | 32'C,   0 % |     0 / 32480 MB |
[4] Tesla V100-PCIE-32GB | 49'C,  92 % | 31114 / 32480 MB | s5084400(31103M)
[5] Tesla V100-PCIE-32GB | 33'C,   0 % |     0 / 32480 MB |
[6] Tesla V100-PCIE-32GB | 31'C,   0 % |     0 / 32480 MB |
[7] Tesla V100-PCIE-32GB | 33'C,   4 % |     0 / 32480 MB |


>>>>>>>>>>>>>>>>
Run the command "free" to find how much total memory is available
free

              total        used        free      shared  buff/cache   available
Mem:      395591556    17993044    40322676      710764   337275836   375287544
Swap:       2097148       17408     2079740



>>>>>>>>>>>>>>>>
gpustat shows that there are free GPUs.
The command qstatt shows what is running.
It shows 3 running jobs, each with a 100GB memory request, which means a total of 300GB is reserved for these jobs (whether or not they actually use that memory).

The problem is now clear. For each queued job, you have requested 100GB. The command "free" tells you the maximum amount of memory available on that node, which is about 375GB (the last column). The running jobs have already reserved 300GB of it.
Any additional job request therefore has to be below 75GB. The queued jobs each request an additional 100GB, which the node cannot provide because it would exceed the 375GB maximum. So you must wait for the running jobs to finish or reduce your memory request (e.g. to less than 75GB in this case).

The bottleneck could be CPUs, walltime, GPUs, memory (as in this example), etc.
I hope this explains why the jobs are still queued.
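
The memory arithmetic in this answer can be reproduced with a few lines of shell. This is only an illustration of the reasoning, not of how PBS itself decides placement. mem_headroom_gb and job_fits are hypothetical helpers, and all figures are whole GB taken from the qstatt and free output above.

```shell
#!/bin/bash
# mem_headroom_gb TOTAL_GB REQUEST_GB...
# Prints the memory left on a node after subtracting the requests
# of the jobs already running there.
mem_headroom_gb() {
    local total="$1" used=0 req
    shift
    for req in "$@"; do
        used=$((used + req))
    done
    echo $((total - used))
}

# job_fits NEW_REQUEST_GB TOTAL_GB REQUEST_GB...
# Succeeds only if the new request is strictly below the headroom.
job_fits() {
    local request="$1"
    shift
    [ "$request" -lt "$(mem_headroom_gb "$@")" ]
}

mem_headroom_gb 375 100 100 100   # prints 75
```

With these figures, "job_fits 100 375 100 100 100" fails, matching the queued jobs above, while a 60GB request would pass.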


There are cloud options (e.g. Microsoft Azure) available to all researchers on a paid basis, if your research group has the funding for them.


Qs66: How do I parallelise my run


Ref: https://www.quantum-espresso.org/Doc/user_guide/node18.html

Understanding Parallelism

Broadly, there are two different parallelization paradigms

  1. Message-Passing (MPI). A copy of the executable runs on each CPU; each copy lives in a different world, with its own private set of data, and communicates with other executables only via calls to MPI libraries. MPI parallelization requires compilation for parallel execution, linking with MPI libraries, execution using a launcher program (depending upon the specific machine). The number of CPUs used is specified at run-time either as an option to the launcher or by the batch queue system.
  2. OpenMP. A single executable spawns subprocesses (threads) that perform specific tasks in parallel. OpenMP can be implemented via compiler directives (explicit OpenMP) or via multithreading libraries (library OpenMP). Explicit OpenMP requires compilation for OpenMP execution; library OpenMP requires only linking to a multithreading version of mathematical libraries, e.g.: ESSLSMP, ACML_MP, MKL (the latter is natively multithreading). The number of threads is specified at run-time in the environment variable OMP_NUM_THREADS.

MPI is the well-established, general-purpose parallelization. In QUANTUM ESPRESSO several parallelization levels, specified at run-time via command-line options to the executable, are implemented with MPI. This is your first choice for execution on a parallel machine.

The support for explicit OpenMP is steadily improving. Explicit OpenMP can be used together with MPI and also together with library OpenMP. Beware conflicts between the various kinds of parallelization! If you don't know how to run MPI processes and OpenMP threads in a controlled manner, forget about mixed OpenMP-MPI parallelization.

A lot of examples have been given for mpi runs (search for mpi in this FAQ).

Ref: https://www2.le.ac.uk/offices/itservices/ithelp/services/hpc/dirac/run-a-computation/job-types

Open MP/Threaded jobs

OpenMP and threaded jobs are parallel in nature, but only scale as far as the resources available within a single node. They cannot take advantage of processors across multiple compute nodes.

For these jobs, you must additionally request the number of processors that the job will require. As the job cannot be spread across more than one node, the number of chunks must equal 1 (select=1), and ncpus can be any value from 1 to the number of physical cores per node (max 72 for gc-prd-hpcn nodes; 16 or fewer for the older n00 nodes). In the example below, 8 cores on a single node have been requested:


#!/bin/bash
#PBS -m abe
#PBS -M i.siva@griffith.edu.au
#PBS -N SimpleTest
#PBS -q workq
#PBS -l select=1:ncpus=8:mem=1gb,walltime=01:01:10
module load intel/intelParallelStudio2019
export OMP_NUM_THREADS=$NCPUS
cd $PBS_O_WORKDIR
./hello_world.exe


Here is the source code for hello_world

hello_world.c

#include <stdio.h>
#include <omp.h>

#define NPROCS 8

int main (int argc, char *argv[]) {

   int nthreads, num_threads=NPROCS, tid;

  /* Set the number of threads */
  omp_set_num_threads(num_threads);

  /* Fork a team of threads giving them their own copies of variables */
#pragma omp parallel private(nthreads, tid)
  {

  /* Each thread obtains its thread number */
  tid = omp_get_thread_num();

  /* Each thread executes this print */
  printf("Hello World from thread = %d\n", tid);

  /* Only the master thread does this */
  if (tid == 0)
     {
      nthreads = omp_get_num_threads();
      printf("Total number of threads = %d\n", nthreads);
     }

   }  /* All threads join master thread and disband */

  return 0;
}


Sample output

./hello_world.exe
Hello World from thread = 4
Hello World from thread = 0
Total number of threads = 8
Hello World from thread = 2
Hello World from thread = 3
Hello World from thread = 5
Hello World from thread = 6
Hello World from thread = 7
Hello World from thread = 1


Qs67: What are the extra benefits of using National Computing Infrastructure over in-house HPC?

https://my.nci.org.au/mancini/ncmas/2022/

"Simple answer, size. Most who put care into NCMAS to get an allocation could not survive or build a research career on inhouse HPC.

The national facilities might also offer access to large ram or GPU, or particular software, that is not available on inhouse HPC.

If inhouse resources are sufficient to do the research required then putting the effort into an NCMAS application is not worth it.

But if research stagnates or projects are put on hold or are not done as there are not enough resources available to do them then getting more from external facilities is one way to go."

Qs68: NCI access to Griffith Researchers

Griffith researchers can get access to NCI through QCIF's NCI share. 

https://www.qriscloud.org.au/index.php/services

Please select QCIF's NCI share from the link above.

(You will see NCI share as an option under QRIScompute in the 2nd column).

Further details can be obtained from QCIF contact person, Marlies Hankel.

Additionally and in parallel, you can also apply directly when the application opens

https://my.nci.org.au/mancini/ncmas/2022/

Please note that projects are given a fixed allocation per quarter on a use-it-or-lose-it basis. Allocations cannot be carried forward or backward into other quarters. Standard disk space per project is 75GB in /scratch; if a project needs more, you will need to contact help@nci.org.au.

Students cannot be a lead CI on an NCI project; for the QCIF share, however, postdocs can be. For NCMAS, the lead CI is required to have an ARC or NHMRC grant or equivalent, which is why larger groups apply for NCMAS. A grant is not required for a project under QCIF. However, the QCIF allocations are small, around 20-50 thousand per quarter. Larger allocations are only available through NCMAS.

Some applications, such as Mathematica and Matlab, are licensed software. Mathematica is only available to ANU researchers on NCI. For Matlab, Griffith will need to get in contact with NCI to set up its institutional license. At the moment this is not in place, so you cannot use Matlab unless you have your own license. Even in that case, you would need to get in touch with NCI first to see whether you can use Matlab on Gadi.

In general, allocations are given in service units (SUs). One core hour is charged at 2 SUs, so a calculation that uses 4 cores for 48 hours is charged 4*48*2=384 SUs.
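
The charge formula above (cores x hours x rate) is easy to script when estimating an allocation request. This is a minimal sketch; su_cost is a hypothetical helper, and the default rate of 2 SUs per core hour is the figure quoted above. Other queues may be charged at a different rate.

```shell
#!/bin/bash
# su_cost CORES HOURS [RATE]
# Prints the service units charged for a job.
# RATE is SUs per core hour and defaults to 2.
su_cost() {
    local cores="$1" hours="$2" rate="${3:-2}"
    echo $((cores * hours * rate))
}

su_cost 4 48   # prints 384
```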

If more disk space (e.g. 300GB) is needed, you would need to ask NCI to increase the space in /scratch to accommodate it. If more RAM (e.g. 400GB) is needed, you would need to make sure you run in a queue that supports that RAM request. Such queues could be charged at more than 2 SUs per core hour, so you would need to factor that in.

But talk to NCI (help@nci.org.au) first to see whether you can use the application (e.g. Matlab) on NCI before you even consider applying for an allocation on NCI.


Qs69: How to install bioinformatics software in your home directory

Requirement: Install fastqc v0.11.9, samtools v1.15.1, bcftools v1.15.1, bwa v0.7.17 (r1188), bwa-mem2 v2.2.1, GATK v4.2.6.1
Please check that these packages are available through conda.
Once that is confirmed, do this:

To get internet access from the cluster, run this command:
source /usr/local/bin/s3proxy.sh

Load the anaconda module so that you can create a virtual environment in which to install the software (if it is available through conda)
module load anaconda3/2022.10 

conda search -c bioconda gatk4
We get the following results. Repeat the same for the other applications if needed.
(fastqc                        0.11.9               0  bioconda )
(samtools                      1.15.1      h1170115_0  bioconda)
(bcftools                      1.15.1      h0ea216a_0  bioconda)
(bwa                           0.7.17      pl5.22.0_2  bioconda )
(bwa-mem2                       2.2.1      he513fc3_0  bioconda)
(gatk                             3.8          py36_4  bioconda)
(gatk4                        4.2.6.1      hdfd78af_0  bioconda  )
As you can see, all are available in the versions required.

If you do not have a virtual environment already, you may create one as follows, specifying the version of python if required.

>>>>>>
mkdir -p ~/.conda/envs;mkdir -p ~/.conda/pkgs

Edit the ~/.condarc file (nano ~/.condarc) and place the following content:

channels:
  - defaults


Then do the following to create the virtual environment
conda create --name environmentName
e.g 
conda create --name javed

Activate this environment by doing this:

source activate javed

>>>>>>>
Once you are in the virtual environment, you simply use the conda command to install the version of the application you need.
e.g
conda install -c bioconda fastqc=0.11.9
conda install -c bioconda samtools=1.15.1
conda install -c bioconda bcftools=1.15.1
conda install -c bioconda bwa=0.7.17
conda install -c bioconda bwa-mem2=2.2.1
conda install -c bioconda gatk4=4.2.6.1

Once installed, these applications will be available in your pbs script if it contains these 2 lines:

module load anaconda3/2021.11 
source activate javed  (or whatever the name of the virtual env)


Sometimes a particular virtual environment may conflict with the install request because applications already installed in that environment conflict with the new requirement. Simply create a new virtual environment in which to install the problem application.

source deactivate javed
conda create --name javed2
source activate javed2
You may have to load further modules to enable it to install (e.g. module load library/zlib/1.2.12) or install updated versions of dependencies through conda (e.g. conda install zlib=1.2.12).

Now that you have installed everything, you can now use it in a pbs script (see below for cat ~/pbs/pbs.01). 
To submit the job, you can do this "qsub pbs.01".


#!/bin/bash
#PBS -N MyTest
#PBS -m abe
#PBS -M myEmail@griffithuni.edu.au
#PBS -q workq
#PBS -l select=1:ncpus=1:mem=12gb,walltime=5:00:00
module load anaconda3/2021.11
source activate javed
echo "Starting job: "
cd  $PBS_O_WORKDIR
samtools --version
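
Before submitting a longer job, it can help to confirm that the tools installed in the steps above are actually visible once the environment is activated. The following is a small sketch. missing_tools is a hypothetical helper, and the executable names listed in the comment (fastqc, samtools, bcftools, bwa, bwa-mem2, gatk) are assumed to be the commands that the bioconda packages provide.

```shell
#!/bin/bash
# missing_tools NAME...
# Prints, one per line, every NAME that is not found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# In the pbs script, after "source activate javed":
#   missing=$(missing_tools fastqc samtools bcftools bwa bwa-mem2 gatk)
#   if [ -n "$missing" ]; then
#       echo "Not found in this environment: $missing"
#       exit 1
#   fi
```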

Qs70: How do I run Jupyter notebook on the HPC cluster

Please have a look at:

Jupyter Notebook


Qs71: Can I do x11 forwarding and run GUI applications on the HPC cluster

Reference: https://kb.hlrs.de/platforms/index.php/Batch_System_PBSPro_(vulcan)#DISPLAY:_X11_applications_on_interactive_batch_jobs

The login or head node of each cluster is a resource that is shared by many users. Running a GUI job on the login node is prohibited and may adversely affect other users. X11 Forwarding is only possible for interactive jobs.

Please note that there is a performance penalty when running a GUI job on the compute nodes using the method outlined below. 

Set up X11 forwarding

To use X11 forwarding from a Windows laptop/desktop, first install the Xming X server, including the Xming fonts package.
See instructions here: https://griffith.atlassian.net/wiki/spaces/GHCD/pages/4035477/xming

Then install an ssh client such as PuTTY by following these instructions:
https://griffith.atlassian.net/wiki/spaces/GHCD/pages/4030965/putty

On a Mac laptop/desktop, install XQuartz (https://www.xquartz.org/).

On a Linux laptop/desktop:

ssh -Y gc-prd-hpclogin1.rcs.griffith.edu.au

Once you are logged on the login node, do the following

Test if X11 is working by typing: xclock

If the clock pops up, X11 forwarding is working. Do not run any other GUI applications on the login node, as it is a shared node.

Set up pbs job with X11 forwarding

X11 Forwarding is only possible for interactive jobs

Sample interactive run
#On the login node which has been sshed into with X11 forwarding, run the following command to submit an interactive pbs job:

qsub -I -X -q workq  -l select=1:ncpus=1:mem=8gb,walltime=24:00:00

If you need a specific node, you can mention the node as follows
qsub -I -X -q workq  -l select=1:ncpus=1:host=gc-prd-hpcn006:mem=8gb,walltime=24:00:00

Start your gui application

Once the interactive job runs on a compute node, please run your GUI application :

e.g:

xclock (to test if X11 forwarding worked)

##module load matlab/2021a

##matlab

  • Because the connection from the PBS job to your local system must be maintained, you cannot exit from either your local system or from the login node window until the job completes.
  • Application performance on remote X11 servers deteriorates with latency, so unless your local system is physically close to the login node, job performance may not be optimal.


Here is a step by step instruction

From your local computer
ssh -Y snumber@10.250.250.3

Once you log in, make sure you are able to open "xclock"

Once you are happy xclock works, 
cd <dataDirectory>
Start pbs with X11 forwarding and in interactive mode (this is only possible in interactive mode)
e.g: 
qsub -X -I -l select=1:ncpus=1:mem=12gb,walltime=5:00:00

Once the job has been placed on a compute node, you will be taken there.
The prompt will change from snumber@gc-prd-hpclogin1 to one of the compute nodes, e.g. snumber@gc-prd-hpcn003

Now, type "xclock" to make sure all is well.

Then do this:
cd $PBS_O_WORKDIR
module load misc/febio/3.6
module load matlab/2021a
module load anaconda3/2021.11
source activate nataliya
cp ~/pathdef.m .
#Run your matlab here
matlab -nodisplay -nosplash -nodesktop -r "run('/export/home/snumber/FEA_scripts/Mat_Visco_PinkySil/C_inverse_FEA_uniaxial_viscoelastic.m');exit;"



Qs72: How do I run Singularity container on the cluster?

We will show you with an example. We have a container named "trinityrnaseq.v2.14.0.simg" located in /sw/Containers/singularity/images/


a. Copy the container to your home directory ~/Containers

mkdir ~/Containers
cp -i /sw/Containers/singularity/images/trinityrnaseq.v2.14.0.simg ~/Containers

b. Create a run file inside the scratch folder

mkdir ~/scratch/jobs
Create a run file inside this folder (nano ~/scratch/jobs/trinity.sh)
cat trinity.sh

dir=/scratch/jobs/sz_ons_totalRNA
out_dir=/scratch/jobs/trinity_out
Trinity --seqType fq --max_memory 19G --normalize_reads --left $dir/100020001_1P.fastq --right $dir/100020001_2P.fastq --SS_lib_type RF --CPU 15 --output $out_dir/10201_r1r2_trinity_out

Make it an executable:
chmod +x  trinity.sh

Make sure data is available inside: /export/home/s123456/scratch/jobs
(or wherever in scratch)

c. Sample pbs script named sin.pbs1

#!/bin/bash 
#PBS -m abe
#PBS -M myEmail@griffithuni.edu.au
#PBS -N TestMitoZ
#PBS -q workq
#PBS -l select=1:ncpus=16:mem=21gb,walltime=19:00:00
cd  $PBS_O_WORKDIR
singularity exec -B /scratch/s123456:/scratch --pwd /scratch /export/home/s123456/Containers/trinityrnaseq.v2.14.0.simg /scratch/jobs/trinity.sh
sleep 2
exit

d. If needed, and as a test, do an interactive run like this:
qsub -I -l select=1:ncpus=16:mem=21gb,walltime=19:00:00

Qs73: How do I make my conda environment available to other users in the team

https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html

Exporting the environment.yml file

Note: If you already have an environment.yml file in your current directory, it will be overwritten during this task.

Activate the environment to export (replace myenv with the name of the environment):

module load anaconda3/2023.09

source activate myenv

Export your active environment to a new file:

conda env export > environment.yml

Note: This file handles both the environment's pip packages and conda packages.

Email or copy the exported environment.yml file to the other person.
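
The file written by conda env export ends with a prefix: line holding the exporter's own install path, which the recipient does not need. Here is a minimal sketch, assuming the file is named environment.yml and sits in the current directory, of removing that line before sharing. The helper name strip_prefix is hypothetical and not part of conda.

```python
from pathlib import Path


def strip_prefix(yaml_text: str) -> str:
    """Return the environment.yml text without its top-level 'prefix:' line."""
    kept = [line for line in yaml_text.splitlines()
            if not line.startswith("prefix:")]
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    env_file = Path("environment.yml")  # assumed file name and location
    if env_file.exists():
        # Rewrite the file in place without the exporter's path.
        env_file.write_text(strip_prefix(env_file.read_text()))
```

Only a line that starts at column 0 with prefix: is dropped, so dependency entries are left untouched.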



To set up this environment from the exported file:
module load anaconda3/2023.09
source /usr/local/bin/s3proxy.sh
conda env create -n ENVNAME --file ENV.yml
e.g
conda env create -n labmcdougall --file /tmp/environment.yml
conda env create -n labmcdougallRoot --file /tmp/environment.yml
#To remove an environment, conda remove -n env --all

If an environment is already available, and you wish to create a copy locally,

mkdir -p ~/.conda/envs
mkdir -p ~/.conda/pkgs
vi ~/.condarc

channels:
  - defaults

module load anaconda3/2022.10
source /usr/local/bin/s3proxy.sh
e.g. if an env called n061 needs to be copied:
cd ~/.conda/envs #This step is important
conda create --prefix=n061  --clone /sw/anaconda3/2022.10/envs/n061
source activate ~/.conda/envs/n061



Qs74: How to use the bastion server (jump host) to log into the cluster

HPC Bastion servers provide Multi-Factor Authentication (MFA) as an additional layer of cybersecurity. You will need to authenticate with one of the methods that Griffith supports (pingID app, YubiKeys, etc). Unfortunately this option is no longer available to HPC users.

ssh -o ProxyCommand="ssh  -W %h:%p s123456@gc-prd-bastion-1.itc.griffith.edu.au" s123456@10.250.250.3

OR
ssh -l s123456 \
      -o 'ProxyCommand ssh -l s123456 %h nc 10.250.250.3 22' \
      -o 'HostKeyAlias 10.250.250.3' \
      gc-prd-bastion-1.itc.griffith.edu.au

You may update your ~/.ssh/config file and make the command simpler.

Host hpclogin1
        Hostname 10.250.250.3
        ProxyCommand ssh s123456@gc-prd-bastion-1.itc.griffith.edu.au -W %h:%p
 
Now you only need to type the following ssh command:
ssh s123456@hpclogin1

Note 1:
=======
OpenSSH version 7.3 or above: if there are multiple jump hosts, you can list them with the ProxyJump option as a comma-separated list, and the servers will be visited in the order listed:

Host hpclogin1
        Hostname 10.250.250.3
        ProxyJump gc-prd-bastion-1.itc.griffith.edu.au,na-prd-bastion-1.itc.griffith.edu.au
        User s123456

Note 2: Multihop transfers
===========================

sftp -o ProxyCommand="ssh  -W %h:%p s123456@gc-prd-bastion-1.itc.griffith.edu.au" s123456@10.250.250.3

scp -o ProxyCommand="ssh -W %h:%p user1@server1" user2@server2:/<remotePath> <localpath>

To copy a file named core.10437 from HPC login node to local machine:
=====================================================================
scp -o ProxyCommand="ssh -W %h:%p s123456@gc-prd-bastion-1.itc.griffith.edu.au" s123456@10.250.250.3:/export/home/s123456/core.10437 . 

To copy a directory named tmp2 from local machine to HPC login node
====================================================================
scp -o ProxyCommand="ssh -W %h:%p s123456@gc-prd-bastion-1.itc.griffith.edu.au" -r tmp2 s123456@10.250.250.3:/export/home/s123456/

Ref1, Ref2 

Qs75: How to use MFA on the QRIScloud cluster bunya (UQ cluster) to transfer files

UQ's bunya uses google authenticator.

You may use Cyberduck for sftp connections. The popular Windows client, WinSCP, works very well with MFA. Unfortunately the popular Windows/Linux client, FileZilla, does not handle the second authentication step, so on Linux clients always work from a terminal with the command-line tools. Mac users can use Cyberduck. Here are some screenshots from Cyberduck:


Mac/Linux users can also use sshfs to mount the remote directory. (We have not tested this, but it may also be available for Windows users.)

Ref 1: https://osxfuse.github.io/

Ref2: https://phoenixnap.com/kb/sshfs

e.g

mkdir -p ~/mnt/bunya

sshfs myusername@bunya.rcc.uq.edu.au:/home/myusername/ ~/mnt/bunya  

One-time password (OATH) for `myusername': 

Use the native file explorer that comes with the OS to browse this folder and transfer files.

Bunya references: https://services.qriscloud.org.au/credential

Qs76: My job is not running. How do I check if resources are available


Look at the available nodes with this command:
qhost|grep gc-prd|grep -v login
vnode           state           OS       hardware host            queue        mem     ncpus   nmics   ngpus  comment
--------------- --------------- -------- -------- --------------- ---------- -------- ------- ------- ------- ---------

gc-prd-hpcn002  job-busy        --       --       gc-prd-hpcn002  --            188gb      72       0       0 --
gc-prd-hpcn003  job-busy        --       --       gc-prd-hpcn003  --            188gb      72       0       0 --
gc-prd-hpcn004  free            --       --       gc-prd-hpcn004  --            188gb      72       0       0 --
gc-prd-hpcn005  job-busy        --       --       gc-prd-hpcn005  --            188gb      72       0       0 --
gc-prd-hpcn006  job-busy        --       --       gc-prd-hpcn006  --            188gb      72       0       0 --
gc-prd-hpcn001  job-busy        --       --       gc-prd-hpcn001  --            188gb      72       0       0 --



These are the available nodes. Next, check whether their resources are all taken. This can be seen from the output of this command:

pbsnodes -aSj|grep gc-prd|grep -v login
                                                        mem       ncpus   nmics   ngpus
vnode           state           njobs   run   susp      f/t        f/t     f/t     f/t   jobs
--------------- --------------- ------ ----- ------ ------------ ------- ------- ------- -------
gc-prd-hpcn002  job-busy             5     5      0   38gb/188gb    0/72     0/0     0/0 214921,214922,214923,214924,215171
gc-prd-hpcn003  job-busy             5     5      0   38gb/188gb    0/72     0/0     0/0 214925,214926,214927,214942,215172
gc-prd-hpcn004  free                 6     6      0    2gb/188gb    6/72     0/0     0/0 170274,214943,214944,215089,215176,215236
gc-prd-hpcn005  job-busy             5     5      0   38gb/188gb    0/72     0/0     0/0 214928,214929,214930,214931,215173
gc-prd-hpcn006  job-busy             5     5      0   38gb/188gb    0/72     0/0     0/0 214932,214933,214934,214935,215174
gc-prd-hpcn001  job-busy             5     5      0   38gb/188gb    0/72     0/0     0/0 214936,214937,214938,214939,215175

You may try sending the job to one of these queues:

bigmem
longbigmem
small
small_long
medium
express

The line in the pbs script should look like this:
#PBS -q small
Please note that not many nodes are assigned to these queues.
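
Reading the free/total columns by eye gets tedious when there are many nodes. The sketch below is a hypothetical helper, not a site-provided tool. It scans text in the pbsnodes -aSj summary layout shown above and lists the nodes that still have the requested cores and memory free. It assumes memory is reported in gb, as in the sample, and skips rows in any other unit. The two demo rows are abridged from the output above.

```python
import re

# Two rows abridged from the sample output above, used as demo input.
SAMPLE = """\
gc-prd-hpcn003  job-busy             5     5      0   38gb/188gb    0/72     0/0     0/0 214925,214926
gc-prd-hpcn004  free                 6     6      0    2gb/188gb    6/72     0/0     0/0 170274,214943
"""


def parse_gb(value):
    """Convert a value such as '38gb' to 38; return None for other units."""
    match = re.fullmatch(r"(\d+)gb", value.lower())
    return int(match.group(1)) if match else None


def nodes_with_room(text, want_cpus, want_mem_gb):
    """Return names of nodes whose free cpus and free memory cover the request."""
    names = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have a numeric njobs column; headers and separators do not.
        if len(fields) < 7 or not fields[2].isdigit():
            continue
        free_mem = parse_gb(fields[5].split("/")[0])
        if free_mem is None:
            continue  # unit other than gb: outside this sketch's assumption
        free_cpus = int(fields[6].split("/")[0])
        if free_cpus >= want_cpus and free_mem >= want_mem_gb:
            names.append(fields[0])
    return names


if __name__ == "__main__":
    print(nodes_with_room(SAMPLE, want_cpus=1, want_mem_gb=2))
```

In the sample, a node in state "free" can still be short on memory: gc-prd-hpcn004 has only 2gb free, so a 12gb request would not fit on any listed node.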

Qs77: Reusing SSH connections

Mac/Linux

To make it more convenient for users who use multiple terminal sessions simultaneously, SSH can reuse an existing connection if connecting from Linux or Mac.  After the initial login, subsequent terminals can use that connection, eliminating the need to enter the username and password each time for every connection.  To enable this feature, add the following lines to your ~/.ssh/config file:

~/.ssh/config

Host 10.250.250.*
ControlMaster auto
ControlPath /tmp/%r@%h:%p
ControlPersist 2h
You may not have an existing ~/.ssh/config. If not, simply create the file and set the permissions appropriately first: touch ~/.ssh/config && chmod 600 ~/.ssh/config
This will enable connection reuse when connecting via SSH or SCP to any host matching 10.250.250.*.
Note that once inside the cluster, it is possible to move laterally to other nodes without additional MFA. e.g ssh n061

Windows (PuTTY)

To enable connection reuse in PuTTY, first select the saved PuTTY config for the cluster and click “Load”.

Then enable the “Share SSH connections if possible” option, found under the SSH option in the Connection category. The same option can be set when configuring a new connection.

As long as your existing connection remains active, you can start new sessions without re-authenticating by using the Duplicate Session command.


sftp transfer: Behaviour noted on Cyberduck and WinSCP: you get a pingID prompt when first logging in and then another one when a file transfer is initiated. However, you can select multiple files to download or upload without additional pingID prompts for each file (unlike FileZilla, which prompts for each file).

Please note that once a batch of transfers is done and you start another batch, one more pingID prompt will occur. Hence, expect a prompt any time a new file transfer is initiated.

Our recommendation is to use WinSCP or Cyberduck for file transfers on Windows or Mac, with WinSCP being the preferred choice. Please avoid FileZilla. If the situation suits, you may consider using YubiKeys instead of the pingID app.


Qs78: winscp connections fail with error


Your WinSCP session to transfer files to/from the HPC may fail with the following error. Curiously, it is still possible to log in to the HPC via the terminal without any issues.
A typical report: "It accepts my s-number password, but does not ask for the authenticator app and throws the error above."

Ans: 

This can be due to any of the following reasons: 
  1. You have more than 2 devices configured with pingID (e.g. apps on multiple phones or desktops).
  2. The pingID app needs to be reinstalled. If you have 2 or fewer devices configured, try reinstalling the pingID app.
  3. You have another authenticator, such as Google Authenticator, installed. This does not work well with WinSCP on the Griffith HPC cluster. You must configure the pingID app.


Contact IT support on 07 373 55555 if you have any issues re-installing the pingID app.

Qs79: List of ssh/sftp clients

ssh clients

======= 

multiplatform: 

  1. https://hyper.is/
  2. https://github.com/alacritty/alacritty

Windows:

  1. http://mobaxterm.mobatek.net/download-home-edition.html
  2. putty
  3. FileZilla
  4. The Windows WSL system lets you run the Linux versions of ssh under Windows:
       wsl --install
     This should give you the command-line tools ssh, scp and sftp.

scp/sftp clients

===========

multiplatform: CyberDuck

Windows: WinSCP (recommended), FileZilla

Linux: In Ubuntu 14.04 it is under Files > Connect to Server in the menu, or Network > Connect to Server in the sidebar.

In Connect to Server, enter: ssh://s123456@10.250.250.3

In Fedora, go to the menu File > Connect To Server, select the appropriate protocol, enter the required details and connect. Just make sure that the SSH server is running on the other side.

Others include Nautilus, Thunar, gFTP, Gigolo, etc.

Or use wine to run windows apps.

  1. Run sudo apt-get install wine (run this one time only, to get 'wine' in your system, if you don’t have it)
  2. Download the latest WinSCP portable package https://winscp.net/eng/download.php
  3. Make a folder and put the content of the ZIP file in this folder
  4. Open a terminal
  5. Type wine WinSCP.exe

Done! WinSCP will run like in a Windows environment!


Usage of the Linux scp command is as follows:

To transfer from remote to local: scp -r remotehostname:/export/home/snumber/FolderName ~/ 

e.g. scp -r s123456@10.250.250.3:/export/home/s123456/FolderName ~/
To transfer from local to remote:
scp -r FolderName s123456@10.250.250.3:/export/home/s123456/
You will get a pingID MFA request for this.

Qs80: Tunnel setup

Tunneling setup:

From command line e.g: 

ssh -N -f -L 8889:gc-prd-hpcn002:8889 s123456@gc-prd-hpclogin1.rcs.griffith.edu.au
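
In the command above, -N tells ssh not to run a remote command, -f sends it to the background after authentication, and -L forwards a local port to a port on the compute node through the login node. Here is a small sketch, using a hypothetical helper name tunnel_command, of building the same command for a different node or port. It only prints the command and does not open a connection.

```python
import shlex


def tunnel_command(user, login_host, node, port):
    """Build the ssh command that forwards local `port` to `port` on `node`."""
    return [
        "ssh",
        "-N",  # no remote command, forwarding only
        "-f",  # go to the background after authentication
        "-L", f"{port}:{node}:{port}",  # local port -> node:port via the login node
        f"{user}@{login_host}",
    ]


if __name__ == "__main__":
    cmd = tunnel_command(
        "s123456", "gc-prd-hpclogin1.rcs.griffith.edu.au", "gc-prd-hpcn002", 8889
    )
    print(shlex.join(cmd))
```

Once the tunnel is up, a service listening on port 8889 of the compute node is reachable on your local machine at localhost:8889.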

Or use an app like:

https://davrodpin.github.io/mole/#windows


Qs81: Sending a large file to external collaborators 


You may use FileSender from AARNet.
It is a very handy tool when you cannot email a file because of its size or file extension. You can even allow external people to send you data (using vouchers).


Qs82: Explain how to improve the chances of a job run

You can investigate which queue has free resources currently with these commands:

pbsnodes -aSj
qhost
qstat -q
qmgr -c 'print queue longbigmem'   #substitute longbigmem with the queue name
However, not all queues can be used by everyone, as some are reserved for researchers who bought dedicated compute nodes from their own funds. Examples of these queues are: dljun, dlyao, sparks2, omero, aspen, gpuq, gpuq2, etc. Please also note there are walltime and memory restrictions on some queues.

The default queue, workq, has the fewest restrictions and is recommended for general purpose runs.

For some queues, there may be just one node configured. Also, some nodes are down for hardware reasons. As an example, let's look at the queue named 'small_long'

>>>>>>>>
 qhost|grep small_long

n017            free            --       --       n017            small_long     47gb      12       0       0 --


This shows one node is configured for this queue. Next, let's check what is available currently for this node


pbsnodes -aSj | egrep "mem|vnode|n017"

                                                        mem       ncpus   nmics   ngpus

vnode           state           njobs   run   susp      f/t        f/t     f/t     f/t   jobs

n017            free                 1     1      0    35gb/47gb   11/12     0/0     0/0 221143



We can see it has 35gb free out of a total of 47gb, and 11 cores free out of 12.

From this, we can see that if your job asks for 35gb or less and 11 or fewer cores, and is sent to this queue, it will probably run. Please note that free resources change dynamically.
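
The comparison above can be written as a short check. This is a minimal sketch with hypothetical function names. It takes the free/total fields as they appear in the n017 row and assumes memory is given in gb.

```python
def free_part(field):
    """Return the 'free' half of a free/total field, e.g. '35gb/47gb' -> '35gb'."""
    return field.split("/")[0]


def job_fits(mem_field, ncpus_field, want_mem_gb, want_cpus):
    """True if the requested memory (gb) and cores are within what is free."""
    free_mem_gb = int(free_part(mem_field).lower().replace("gb", ""))
    free_cpus = int(free_part(ncpus_field))
    return want_mem_gb <= free_mem_gb and want_cpus <= free_cpus


if __name__ == "__main__":
    # Values taken from the n017 row above.
    print(job_fits("35gb/47gb", "11/12", want_mem_gb=35, want_cpus=11))
    print(job_fits("35gb/47gb", "11/12", want_mem_gb=40, want_cpus=4))
```

The first call prints True and the second prints False, because 40gb exceeds the 35gb currently free. A True result only reflects the moment the snapshot was taken.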


Qs83: pytorch is not able to find the cuda device

Ref

On n061 the nvidia driver was 530.30.02 with cuda toolkit 12.1. As this was the latest release (as of March 2023), pytorch was not compatible with it. The nvidia drivers have been downgraded to 520.61.05 and cuda toolkit 11.8. Even after the downgrade, torch still could not detect a cuda device. The workaround was to compile pytorch manually.

Here are the installation notes:
>>>>>>
Edit ~/.condarc (vi ~/.condarc)
>>>>>>
channels:
  - defaults

envs_dirs:
  - /lscratch/s12345/.conda/envs
  - /export/home/s12345/.conda/envs

>>>>>>>
source /usr/local/bin/s3proxy.sh
module load anaconda3/2022.10
module load gcc/11.2.0
module load cmake/3.26.4
module load cuda/11.4
#If you have an existing environment, you can use it. If not create it with: conda create -n myTorch
source activate myTorch
conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools  cffi typing_extensions future six requests dataclasses #cmake
conda install -c pytorch magma-cuda118

mkdir  /tmp/bela  #Any name is fine. Here I named it bela. It is temp dir
cd /tmp/bela
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git checkout v1.13.1 # Or any version you want
git submodule sync
git submodule update --init --recursive
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py install 2>&1 | tee pythonSetupLogs.txt
>>>>>>>

Now you can test your installation:
source activate myTorch
python
Python 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>>

Another test:
python isCuda.py 

CUDA AVAILABLE

>>>>cat isCuda.py<<<<<
import torch
if torch.cuda.is_available():
    print('CUDA AVAILABLE')
else:
    print('NO CUDA')
>>>>>>>>>>>>>>>>>>
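
When the check reports no CUDA, it helps to see which CUDA version the torch build targets and which devices it can see. The sketch below is a slightly longer variant of the script above. The function name describe_cuda is hypothetical; the torch calls it uses (torch.version.cuda, torch.cuda.device_count, torch.cuda.get_device_name) are standard PyTorch attributes.

```python
def describe_cuda():
    """Return a list of report lines about torch and the CUDA devices it sees."""
    try:
        import torch
    except ImportError:
        return ["torch is not installed in this environment"]

    # torch.version.cuda is None for CPU-only builds.
    lines = [f"torch {torch.__version__}, built for CUDA {torch.version.cuda}"]
    if not torch.cuda.is_available():
        lines.append("no CUDA device visible")
        return lines
    for index in range(torch.cuda.device_count()):
        lines.append(f"device {index}: {torch.cuda.get_device_name(index)}")
    return lines


if __name__ == "__main__":
    print("\n".join(describe_cuda()))
```

A build line showing CUDA None points to a CPU-only torch package rather than a driver problem.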
 
To install torchvision from source:
conda install -c conda-forge libjpeg-turbo
git clone https://github.com/uploadcare/pillow-simd
cd pillow-simd
python setup.py install 2>&1 | tee pythonInstallPillowSimd.txt
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.14.1
python setup.py install 2>&1  | tee pythonInstalltorchvision.txt
 For tensorflow:
conda install tensorflow=2.12.0=gpu_py311h65739b5_0  -c pytorch -c nvidia
>>>>>>>>>
Another way:
source /usr/local/bin/s3proxy.sh
module load anaconda3/2023.09
conda create -n mytorchA100 -c pytorch -c nvidia
source activate mytorchA100
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
conda install tensorflow=2.12.0=gpu_py311h65739b5_0  -c pytorch -c nvidia
>>>>>>>>
To ensure that PyTorch was installed correctly, we can verify the installation by running sample PyTorch code. Here we will construct a randomly initialized tensor.

import torch
x = torch.rand(5, 3)
print(x)

The output should be something similar to:

tensor([[0.3380, 0.3845, 0.3217],
        [0.8337, 0.9050, 0.2650],
        [0.2979, 0.7141, 0.9069],
        [0.1449, 0.1132, 0.1375],
        [0.4675, 0.3947, 0.1426]])



Source: https://pytorch.org/get-started/locally/


Qs84: How do I use the tensorflow singularity/docker containers on the A100 gpu node

ssh n061
#Check if they work first
cd /lscratch/sw/Containers
singularity  shell  --nv /lscratch/sw/Containers/tensorflow_23.05-tf2-py3mine.sif
python
import tensorflow as tf
physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs:", len(physical_devices))

#use it in a pbs script
Check: https://griffith.atlassian.net/wiki/spaces/GHCD/pages/4030761/Singularity

# You can create your own containers
#https://catalog.ngc.nvidia.com/containers
singularity pull  docker://nvcr.io/nvidia/tensorflow:22.01-tf2-py3  # Get the version you desire
singularity  shell  --nv <container.sif>


Qs85: What can help if the home, sw and scratch space is slow on the A100 gpu node

You can temporarily use the local scratch (/lscratch) to install applications and to run your data from.
#Please note /lscratch is temporary storage and will be wiped the next time the server is imaged. So always keep a backup.

To install from /lscratch, you can do the following:
mkdir -p /lscratch/s12345/.conda/envs
mkdir -p /lscratch/s12345/.conda/pkgs

Edit ~/.condarc (vi ~/.condarc)
>>>>>>
channels:
  - defaults

envs_dirs:
  - /lscratch/s12345/.conda/envs
  - /export/home/s12345/.conda/envs
>>>>>>

module use /lscratch/sw/Modules  #put it in your ~/.bashrc if you are going to use this often
module load anaconda3/2023.03
source /usr/local/bin/s3proxy.sh
conda create -n A100local   #any name will do
source activate A100local
Now you can install the packages with
conda search -c pytorch pytorch #example of a package 
conda search -c pytorch-nightly  pytorch #example of a package from development site
conda search tensorflow #example of a package 
conda install <package> 
e.g: conda install -c pytorch pytorch=2.0.1=py3.11_cuda11.8_cudnn8.7.0_0
e.g: conda install  -c pytorch-nightly tensorflow=2.12.0=gpu_py310hfda07e1_0
e.g. conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia


Qs86: Useful external video tutorials

  1. https://www.youtube.com/playlist?list=PLmu61dgAX-aYsRsejVfwHVhpPU2381Njg

Qs87: I do not receive email sent to the LML - GHPC group.

Requests of this type should be directed to the IT Service Centre, to be raised through the Digital Solutions service management system (Cherwell / GSM).
The easiest way to do this is to email your request to ithelp@griffith.edu.au. The request will then be redirected to the correct team.
Given the email list is an LML, this will be Identity Services.
(e.g. reference GSM case Incident#1081850)
The resolution provided for that case was:
>>>>>>
Regarding your Service Request 1081850 emails sent to LML email list are not received by me, logged on 7/11/2023 2:49 PM, 
we have the following question or update:

We have noticed a discrepancy in the Group Membership for LML - GHPC and your Staff Azure GroupMembership was missing from this Azure Group used for email.  A re-synchronization of this group has increased the associated Azure Group Membership from 178 to 216 members of which you were one of the people missing. Thank you for notifying us of this anomaly.

You will receive emails for those sent to this group from now on.

None of the other LML - Groups (you have membership for) have Staff Azure/Staff Email as a Target System  and would never generate email.

This Service Request will now be marked resolved on the basis of the action taken and information provided.

>>>>>> 

Qs88: Install perl modules without root access.

Install perl locally

>>>>>>>>installation notes<<<<<<<<<<<<
Download the perl source file, untar it and build:
wget https://www.cpan.org/src/5.0/perl-5.38.2.tar.gz --no-check-certificate
tar -xzf perl-5.38.2.tar.gz
cd perl-5.38.2
./Configure -des -Dprefix=$HOME/sw/perl/5.38.2 2>&1 | tee configureLogs.txt
make 2>&1 | tee makeLogs.txt
make test 2>&1 | tee maketestLogs.txt
make install 2>&1 | tee makeInstallLogs.txt
>>>>>>>>End of perl installation<<<<<

echo "check_certificate = off" >> ~/.wgetrc

Edit the .bash_profile file and add the local MODULEPATH
>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
vi ~/.bash_profile
export MODULEPATH=$MODULEPATH:~/sw/Modules
>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
#copy the main perl module file locally
 cp -r /sw/Modules/lang/perl/ ~/sw/Modules
module load perl/5.38.2

You now have a copy of the perl installation in your ~/sw/perl directory and a module file in ~/sw/Modules/perl/5.38.2

source /usr/local/bin/s3proxy.sh

For example, to install the File::FindLib module:
cpan -i File::FindLib
This will install the module. Simply press return when asked for a username and password for the proxy.

Qs89: n061 gpu node : pytorch usage

module load anaconda3/2022.10

source activate TorchA100

Qs90: How to run the pytorch (container method and conda env method) on the ICT cluster

Container method:

Ref1: Singularity#Convertingadockerimageintosingularityimage

Minimal docker images are kept here:
/sw/Containers/docker

mkdir -p /export/home/snumber/sw/Containers
cd /export/home/snumber/sw/Containers
Convert the tarball to a Singularity image. 
module load singularity/4.1.3 
singularity build --sandbox pytorch docker-archive:///sw/Containers/docker/pytorchnew.tar
(It takes quite some time to create the sandbox. Please be patient. You only have to do this once!)

for testing:
singularity shell -e -B /scratch/snumber:/scratch/snumber -B /export/home/snumber:/export/home/snumber /export/home/snumber/sw/Containers/pytorch

e.g:
Singularity> python
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True

So we see it works!

So now you need to integrate this into a slurm or pbs script. 
PBS examples are here



Slurm would be like this:

create a run script:
mkdir /export/home/snumber/slurm/data
create a file /export/home/snumber/slurm/data/myrun1.sh

cat  /export/home/snumber/slurm/data/myrun1.sh
>>>>>>>>>
cd /export/home/snumber/slurm/data
python /export/home/snumber/slurm/data/isCuda.py
>>>>>>>>>>

make it executable:
chmod +x  /export/home/snumber/slurm/data/myrun1.sh

Here is the content of the slurm.02 file

>>>>cat slurm.02 >>>>>>>>>>>>> 
#!/bin/bash
####SBATCH --account=def-yoursnumber
#SBATCH --job-name=helloWord
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1500MB
#SBATCH --gres=gpu:a100:1
#SBATCH --qos=work
###SBATCH --mem=4000M               # memory per node
#SBATCH --time=0-03:00

module load singularity/4.1.3

##./program                         # you can use 'nvidia-smi' for a test

singularity exec -e -B /scratch:/scratch -B /export/home/snumber:/export/home/snumber /export/home/snumber/sw/Containers/pytorch "data/myrun1.sh"

>>>>>>>>>>>

To submit the slurm job:
sbatch  slurm.02


Alternatively, to run interactively:
srun --export=PATH,TERM,HOME,LANG  --job-name=hello_word --cpus-per-task=1 --mem-per-cpu=1500MB --time=1:00:00 --qos=work --gres=gpu:a100:1 --pty /bin/bash -l


singularity shell -e -B /scratch/snumber:/scratch/snumber -B /export/home/snumber:/export/home/snumber /export/home/snumber/sw/Containers/pytorch
cd data
sh myrun1.sh 






>>>Some errors you can ignore while building the sandbox>>>
INFO:    Starting build...
INFO:    Fetching OCI image...
INFO:    Extracting OCI image...
2024/05/20 12:37:31  warn rootless{usr/local/nvm/versions/node/v16.20.2/bin/corepack} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:37:32  warn rootless{usr/local/nvm/versions/node/v16.20.2/bin/npm} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:37:32  warn rootless{usr/local/nvm/versions/node/v16.20.2/bin/npx} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"

2024/05/20 12:38:17  warn rootless{usr/lib/libarrow.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:38:17  warn rootless{usr/lib/libarrow.so.1400} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:38:17  warn rootless{usr/lib/libarrow_acero.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:38:17  warn rootless{usr/lib/libarrow_acero.so.1400} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:38:17  warn rootless{usr/lib/libarrow_dataset.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:38:17  warn rootless{usr/lib/libarrow_dataset.so.1400} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:38:34  warn rootless{usr/lib/libparquet.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:38:34  warn rootless{usr/lib/libparquet.so.1400} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
INFO:    Inserting Singularity configuration...
INFO:    Creating sandbox directory...
INFO:    Build complete: pytorch
>>>>

The other method is to create a conda environment

#Go into an interactive run first
srun --export=PATH,TERM,HOME,LANG  --job-name=hello_word --cpus-per-task=1 --mem-per-cpu=60GB --time=19:00:00 --qos=work --gres=gpu:a100_1g.10gb:1  --pty /bin/bash -l
source /usr/local/bin/s3proxy.sh
module load  cuda/12.1
module  load anaconda3/2024.02
conda create --name pytorchCuda121 pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
source activate pytorchCuda121

Qs91: Do you recommend installing miniconda : pytorch usage example

We do not recommend installing miniconda as we provide conda as a module.
module load anaconda3/2024.02 

If you have installed it, comment out the miniconda section in ~/.bashrc, as it can interfere with other non-miniconda environments.
After commenting it out, log out and log back in.

Here is how you can test pytorch on your home dir for cuda availability:

To test this, please do this;
cd ~/slurm 
There are a couple of sample scripts to run slurm jobs

We will run an interactive job to troubleshoot this problem:
For example:
srun --export=PATH,TERM,HOME,LANG  --job-name=hello_word --cpus-per-task=1 --mem-per-cpu=50GB --time=1:00:00 --qos=work --gres=gpu:a100:1 --pty /bin/bash -l

it would put you inside a job.

squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            <snip>
              1718    LocalQ    myrun s5284664  R    4:17:59      1 dgxlogin


Once inside a job, check if it detects gpu in pytorch app. (if already installed)

module load anaconda3/2024.02 
conda info --envs
# conda environments:
#
base                  *  /export/home/s5305964/miniconda3
PHDenv                   /export/home/s5305964/miniconda3/envs/PHDenv

In this example, the existing environments were installed under miniconda. Leaving those aside, we build a new env.

To install a new env and packages, do this:

source /usr/local/bin/s3proxy.sh  (needed to access the internet from within a slurm job)
conda create --name pytorchCuda121 pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

(See Qs 90 : https://griffith.atlassian.net/wiki/spaces/GHCD/pages/4030751/FAQ+-+Griffith+HPC+Cluster#FAQ-GriffithHPCCluster-Qs90%3AHowtorunthepytorch(containermethodandcondaenvmethod)ontheICTcluster)

source activate pytorchCuda121
python ~/isCuda.py
CUDA AVAILABLE
>>>>>>>>>isCuda.py>>>>>>>>>>>>>
import torch
if torch.cuda.is_available():
    print('CUDA AVAILABLE')
else:
    print('NO CUDA')

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

PS: Comment out the miniconda install section in ~/.bashrc, as it can interfere with this env.

If you ever need to use the miniconda env, you can do this:

source deactivate
module purge

# Use $HOME here: a quoted ~ is not expanded by the shell
__conda_setup="$("$HOME/miniconda3/bin/conda" 'shell.bash' 'hook' 2> /dev/null)"
eval "$__conda_setup"
source activate PHDenv
python isCuda.py 
CUDA AVAILABLE

