FAQ - Griffith HPC Cluster
Qs 1: How do I cite or acknowledge the cluster in papers? What is the preferred method?
Ans
Please consult this page: Griffith HPC User Guide#10Acknowledgements
Qs 2: When using DOCK I'll be searching a fairly large database of molecules (about 1,300,000), but I will split it into 6 chunks for the search. On the old cluster I used to copy each chunk of the database to the local machine the calculation was running on. Will I need to do that on the new cluster, or can I simply point to a copy of the database in my local directory?
Ans
We have a scratch partition for this. There is a link to it in your home directory, ~/scratch, which points to /scratch/snumber. You can copy the database to the scratch directory and point DOCK to it during your run. This helps if you expect a lot of I/O and can speed things up a little. The home filesystem is not very busy at the moment, but it may become so as more users come on board, so it is a good idea to start using ~/scratch now.
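The staging step can be sketched as a small shell function. This is only a sketch under stated assumptions, not a site-provided tool: the function name stage_db and the example paths are hypothetical; only the ~/scratch link comes from the answer above.

```shell
#!/bin/bash
# stage_db SRC_DIR DEST_DIR (hypothetical helper): copy the contents of
# SRC_DIR into DEST_DIR, creating DEST_DIR if needed, and print DEST_DIR
# so the caller can point the application at the staged copy.
stage_db() {
    local src="$1" dest="$2"
    mkdir -p "$dest" || return 1
    cp -r "$src"/. "$dest"/ || return 1
    printf '%s\n' "$dest"
}
# Example (placeholder paths):
# DB=$(stage_db "$HOME/dock_db" "$HOME/scratch/dock_db")
```

Run the function once before submitting the jobs, then use the printed path as the database location in each of the 6 searches.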
Qs 3: How can I monitor the progress of my PBS jobs live?
Ans
You may be able to trace your jobs with the tracejob command. Type "man tracejob" for the syntax. e.g.:
tracejob -n4 213883
where -n4 means search the logs over the past 4 days and 213883 is the job number.
There is also a script to watch a job live. You can run it on gowonda as follows:
watch_jobs.sh
It will ask for the node on which your job is running. You can get this by typing:
qstat -1an
The last column of the output has the node number. The script also needs the job number.
>>>>>>>
cat watch_jobs.sh
echo "==========================="
echo "Please enter Node Number e.g: n004"
read NODE
echo "Please enter Job number e.g 9066"
read JOBNO
echo "==========================="
ssh $NODE "tail -f /var/spool/PBS/spool/*$JOBNO*"
>>>>>
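Reading the node name off the last column can also be scripted. The following is a minimal sketch, assuming the single-line "qstat -1an" layout shown under Qs 16 in this FAQ; the function name node_from_qstat is hypothetical, and for multi-node jobs it only reports the first node.

```shell
#!/bin/bash
# node_from_qstat (hypothetical helper): read "qstat -1an" job lines on
# stdin and print, for each line, the part of the last column that comes
# before the first "/" (the node name).
node_from_qstat() {
    awk 'NF > 0 { split($NF, part, "/"); print part[1] }'
}
# On the cluster you would pipe real output into it, for example:
# qstat -1an | grep <jobNumber> | node_from_qstat
```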
Qs 4: How do I run GPU/CUDA jobs?
We would like to run some software on the GPUs on the new computer but are having problems figuring out how to do it. Can you give us any info, as we are preparing a paper for a conference?
Ans
Check this: https://griffith.atlassian.net/wiki/spaces/GHCD/pages/4030784/cuda
I have put some documentation here: http://confluence.rcs.griffith.edu.au:8080/display/GHPC/cuda
A sample PBS script is here:
>>>>>>>>>>>>>>
#!/bin/bash
#PBS -N cuda
#PBS -l ngpus=2
#PBS -l walltime=100:00:00
#PBS -q gpu
module load cuda/4.0
echo "Hello from $HOSTNAME: date = `date`"
nvcc --version
echo "Finished at `date`"
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Basically, a queue named gpu and a PBS resource named ngpus have been created. Both need to be specified in the PBS script to send batch jobs to the GPU nodes. I suggest you do an interactive batch run first to sort out any issues. To do that:
qsub -I -l ngpus=2 -l walltime=100:00:00 -q gpu
This will log you into one of the GPU nodes (you may have to wait if the nodes are busy). Type this on the GPU node:
module load cuda/4.0
The CUDA binaries are located here: /usr/local/cuda/bin/
Qs 5: Is OpenCL available?
Ans
Yes indeed! Check this wiki page for details: https://griffith.atlassian.net/wiki/spaces/GHCD/pages/4035461/OpenCL
Qs 6: What is the advantage of using the /scratch filesystem while running experiments?
Ans
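Currently the /scratch filesystem is not as busy as the /export/home filesystem, which is heavily used. Here is a benchmark test, performed on Nov 24th 2011, that illustrates how running from /scratch can speed up your calculation: in the runs below, the CPU computation took 1450 microseconds with the binaries on /scratch and 2005 microseconds with them on /export/home.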
If we run this with binaries on the /scratch filesystem (/scratch/s2594054/LennardJones dir), we get this:
===========================================
Energy = -1622.25
Number of devices is 2
Passed clCreateProgramWithSource
Compute using the CPU
Energy value is -1622.24
Time taken was : 1450 microseconds <========= *Check this
Compute using the GPU
Energy value is -1618.35
Time taken was : 837 microseconds
If we run this with binaries on the /export/home filesystem, this is the timing we get (this result was expected, as /export/home is a busy filesystem):
=================================================
Energy = -1622.25
Number of devices is 2
Passed clCreateProgramWithSource
Compute using the CPU
Energy value is -1622.24
Time taken was : 2005 microseconds <========= *Check this
Compute using the GPU
Energy value is -1618.35
Time taken was : 1061 microseconds
======================================================
The best scenario is to copy the binaries to the /scratch/snumber/ directory and submit a batch job similar to the one below:
#!/bin/bash
#PBS -m abe
#PBS -M YOUREMAIL@griffith.edu.au
#PBS -N openCL_GPU_job
#PBS -l ngpus=1
#PBS -l walltime=00:10:00
module load <WhatEver>
##echo "Hello from $HOSTNAME: date = `date`"
cd /scratch/s123456/<yourBinDir>
./<yourProg>
#You may copy data (if any) back and forth if needed.
##e.g:
##./LennardJones pop256dd.xyz
Qs 7: What are the basic Unix commands that one needs to know to use the Linux cluster?
Ans
Finding online help
To find a command or library routine that performs a required function, try a "keyword" search. You can search the online man pages with "man -k keyword". The command "apropos keyword" performs the same search.
To find details on a command or library
To find details on using a Unix command or library, use the command "man command_name". If no man page is found, it is worth trying the "info" tool.
Controlling your Unix environment
Use the modules command.
Listing the files and subdirectories in a directory
Basic commands: ls, cd, rm. Use the "ls" command to list the contents of a directory.
Changing the current working directory
The Unix environment (shell) records a "current working directory", which is initially set to your home directory. Many programs (such as ls above) will look in this directory if no other directory is requested. The current working directory can be changed with the "cd" command. See the notes below on directory names.
Deleting (removing) a file
Use the "rm" command to remove a file.
A few notes on Unix directory names
A Unix full path name is constructed from the directory and subdirectory names separated by slashes "/", e.g. /export/home/s123456/work/file1
All Unix full path names start with "/" (there are no drive or volume names as in Windows). Hence any filename starting with "/" is a full path name.
A filename that does not start with "/" is relative to the current working directory. If it contains one or more slashes, it refers to a file in a subdirectory of the current working directory.
The current working directory may also be referenced as dot ".", e.g. ./subdirectory/file
The directory containing the current working directory may be referenced as dot-dot "..", e.g. ../peer-directory/file
Qs 8: How do I check for "DOS characters" in the PBS script?
Ans
This is a common problem if you use Windows editors like Notepad. They can introduce unwanted characters (carriage returns, displayed as ^M) in the PBS script file, which can cause the script to fail. The problem is that they are not noticeable in most editors.
To reveal them, please do the following:
cat -tv <filename>
To remove them, run this command:
dos2unix <filename>
For example:
cat -tv pbs.00
#PBS -m abe^M
#PBS -M Email@griffith.edu.au^M
#PBS -N Pile^M
#PBS -l select=1:ncpus=1:mem=4g^M
#PBS -l walltime=03:00:00^M
source $HOME/.bashrc^M
module load matlab/2008b^M
module load ATLAS/3.9.39^M
echo "Starting job"^M
matlab -nodisplay -nodesktop -nosplash < /export/home/s123456/P1_e5f1k4/SBFEmOnePileVerify_04Aug11.m^M
echo "Done with job"^M
dos2unix pbs.00
dos2unix: converting file pbs.00 to UNIX format ...
cat -tv pbs.00
#PBS -m abe
#PBS -M Email@griffith.edu.au
#PBS -N Pile
#PBS -l select=1:ncpus=1:mem=4g
#PBS -l walltime=03:00:00
source $HOME/.bashrc
module load matlab/2008b
module load ATLAS/3.9.39
echo "Starting job"
matlab -nodisplay -nodesktop -nosplash < /export/home/s123456/P1_e5f1k4/SBFEmOnePileVerify_04Aug11.m
echo "Done with job"
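The same check can be scripted, for example to test a script before submitting it. This is a minimal sketch using only standard tools; the function names has_crlf and strip_crlf are hypothetical, and strip_crlf is offered only as a fallback for machines where dos2unix is not installed.

```shell
#!/bin/bash
# has_crlf FILE (hypothetical helper): exit status 0 if FILE contains a
# carriage return (the ^M shown by "cat -tv"), 1 if it does not.
has_crlf() {
    local cr
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}
# strip_crlf FILE (hypothetical helper): write FILE to stdout with all
# carriage returns removed; the original file is left untouched.
strip_crlf() {
    tr -d '\r' < "$1"
}
# Example:
# if has_crlf pbs.00; then strip_crlf pbs.00 > pbs.00.unix; fi
```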
Qs 9: How do I transfer files to or from the system?
Qs 10: How can I connect to the systems?
Qs 11: I need to transfer data from outside of Griffith. Would I be charged for this usage?
Ans
AARNet introduced un-metered Off Peak traffic in the middle of 2009. At that time, Off Peak was defined as 8:00pm to 8:00am each day, including weekends and public holidays. This resulted in an additional 10% of total traffic being reclassified as un-metered. At the beginning of 2011, Off Peak was extended to the period from 5:00pm to 9:00am. The effect was that a further 10% of total traffic was reclassified as un-metered. Year-to-date, some 80% of total traffic is now in this category. The AARNet Board has just approved a further extension of Off Peak to include all of the weekend, so that the period from 5:00pm Friday to 9:00am Monday will then be un-metered. This change will take effect from July 1st and will result in another 5% of total traffic becoming un-metered. That means that around 85% of total AARNet traffic will soon be un-metered and subscription based.
You can check your usage at https://ias.griffith.edu.au/griffith (look under "Internet usage").
Qs 12: I am getting a segmentation fault. What can I do?
Ans
Check what your stack size is and try increasing it:
ulimit -a
ulimit -s unlimited
Qs 13: How do I specify a walltime (e.g. of 10 mins) in the PBS script?
Ans
#PBS -l walltime=00:10:00
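The walltime value uses the hh:mm:ss format. As a small illustration, the sketch below converts a number of minutes into that format; the function name walltime_from_minutes is hypothetical and not part of PBS.

```shell
#!/bin/bash
# walltime_from_minutes MINUTES (hypothetical helper): print MINUTES as an
# hh:mm:ss string suitable for "#PBS -l walltime=...".
walltime_from_minutes() {
    local minutes="$1"
    printf '%02d:%02d:00\n' $((minutes / 60)) $((minutes % 60))
}
# Example:
# qsub -l walltime=$(walltime_from_minutes 10) job_submit.pbs
```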
Qs 14: How do I specify memory in the PBS script?
Ans
Memory Settings
The queuing system now enforces memory limits on jobs. If you do not specify a "mem" limit for your job, you will receive the default mem limit of 600mb. This corresponds to:
#PBS -l mem=600mb
If your job uses more memory per thread than this and you do not explicitly ask for more, your job will be killed by the PBS scheduler. If your job uses less than this and you do not specify a limit, you will be telling PBS that your job needs more memory than it actually uses, which makes that memory unavailable to other jobs. It is best if your mem setting accurately reflects your job's actual memory requirements.
Qs 15: How do I create a service desk case to report a problem with HPC?
Ans
https://www.griffith.edu.au/eresearch-services/request-help
Please select "eResearch services.HPC" for category. Please see an example below:
Qs 16: My MPI job doesn't seem to be running much faster than it does on my laptop
I'm running a RAxML analysis now (job id: 371395). It is supposed to be an mpi job, but it doesn't seem to be running much faster than it does on my laptop. Is there a way to check that an mpi job is actually running on multiple cores? I tried 'qstat -f 371395', but I can't tell if it's reporting the used memory or the reserved memory. If you have time, could you please take a look at my submit script, to make sure it's OK: /scratch/s2831058/120524_NIMR_RAxML/120524_NIMR_RAxML_submit.sh
Ans
First check where it is running:
>>>>>>>>>>>>>>>>>>
qstat -1an|grep s1111111
371395.pbsserve s1111111 mpi nimr_raxml 6915 1 8 8gb 48:00 R 13:08 n003/0*0
>>>>>>>>>>>>>>>>>>>>
This shows it is running on node n003 and has requested 8 CPUs. Next, check whether it is actually using all the CPUs. To find out, you will need to ssh to the node:
ssh n003
Then run a tool like "top" to check whether the job is utilizing these processors. Here "top" shows it is only using 1 CPU:
>>>>>>>>>
top - 11:29:45 up 81 days, 1:30, 1 user, load average: 1.00, 1.00, 1.00
Tasks: 608 total, 2 running, 606 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.6%us, 0.1%sy, 0.0%ni, 96.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 12189240k total, 3954752k used, 8234488k free, 187132k buffers
Swap: 8193140k total, 45688k used, 8147452k free, 395396k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6959 s2831058 20 0 2573m 2.3g 2976 R 100.1 19.8 791:59.76 raxmlHPC-MPI
11248 root 20 0 17492 1680 956 R 0.7 0.0 0:00.04 top
10972 root 20 0 0 0 0 S 0.3 0.0 0:00.68 flush-0:17
11182 root 20 0 0 0 0 S 0.3 0.0 0:00.06 flush-0:19
>>>>>>>>>>>>>>
Note that the load average is 1.00 when it should be about 8 if the job were using 8 processors. To get a detailed per-CPU report, run "top" and press 1, or run "mpstat -P ALL" (we have hyperthreading enabled, hence you will see 24 CPUs instead of 12 CPUs). Another program, "htop", gives even better visuals of the running processes.
>>>>>>>>>>>>>>>>>
top - 11:42:33 up 81 days, 1:43, 1 user, load average: 1.04, 1.03, 1.01
Tasks: 608 total, 2 running, 606 sleeping, 0 stopped, 0 zombie
Cpu0 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.0%us, 0.7%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 0.0%us, 0.2%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu16 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 0.4%us, 0.0%sy, 0.0%ni, 99.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 0.0%us, 0.4%sy, 0.0%ni, 99.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 12189240k total, 3955736k used, 8233504k free, 187276k buffers
Swap: 8193140k total, 45688k used, 8147452k free, 396956k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6959 s2831058 20 0 2573m 2.3g 2976 R 100.1 19.8 804:44.49 raxmlHPC-MPI
11732 root 20 0 0 0 0 S 0.7 0.0 0:00.32 flush-0:17
11680 root 20 0 0 0 0 S 0.3 0.0 0:00.32 flush-0:19
11846 root 20 0 17492 1684 960 R 0.3 0.0 0:00.04 top
24632 ganglia 20 0 143m 1984 1104 S 0.3 0.0 2:05.04 gmond
1 root 20 0 21400 1276 1044 S 0.0 0.0 0:01.97 init
2 root 20 0 0 0 0 S 0.0 0.0 0:03.00 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:00.03 migration/0
>>>>>>>>>>>>>>>>>
From the mpstat output (see last column), it is pretty much idle.
>>>>>>>>>
mpstat -P ALL
Linux 2.6.32-131.0.15.el6.x86_64 (n003) 05/25/2012 _x86_64_ (24 CPU)
11:38:55 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
11:38:55 AM all 10.92 0.00 0.61 0.02 0.00 0.00 0.00 0.00 88.45
11:38:55 AM 0 37.80 0.00 1.05 0.19 0.00 0.00 0.00 0.00 60.96
11:38:55 AM 1 35.52 0.00 2.31 0.00 0.00 0.00 0.00 0.00 62.17
11:38:55 AM 2 35.72 0.00 2.28 0.00 0.00 0.00 0.00 0.00 61.99
11:38:55 AM 3 36.64 0.00 1.34 0.00 0.00 0.00 0.00 0.00 62.02
11:38:55 AM 4 33.67 0.00 2.49 0.00 0.00 0.00 0.00 0.00 63.84
11:38:55 AM 5 33.85 0.00 2.51 0.00 0.00 0.00 0.00 0.00 63.64
11:38:55 AM 6 20.58 0.00 0.60 0.18 0.00 0.02 0.00 0.00 78.63
11:38:55 AM 7 18.85 0.00 0.58 0.00 0.00 0.00 0.00 0.00 80.57
11:38:55 AM 8 2.28 0.00 0.25 0.01 0.00 0.00 0.00 0.00 97.47
11:38:55 AM 9 2.43 0.00 0.30 0.02 0.00 0.00 0.00 0.00 97.25
11:38:55 AM 10 2.94 0.00 0.24 0.02 0.00 0.00 0.00 0.00 96.79
11:38:55 AM 11 1.84 0.00 0.25 0.00 0.00 0.00 0.00 0.00 97.91
11:38:55 AM 12 0.01 0.00 0.02 0.00 0.00 0.00 0.00 0.00 99.97
11:38:55 AM 13 0.02 0.00 0.07 0.00 0.00 0.00 0.00 0.00 99.91
11:38:55 AM 14 0.01 0.00 0.02 0.00 0.00 0.00 0.00 0.00 99.98
11:38:55 AM 15 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 99.99
11:38:55 AM 16 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.99
11:38:55 AM 17 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 99.99
11:38:55 AM 18 0.01 0.00 0.03 0.01 0.00 0.00 0.00 0.00 99.95
11:38:55 AM 19 0.01 0.00 0.10 0.00 0.00 0.00 0.00 0.00 99.89
11:38:55 AM 20 0.01 0.00 0.03 0.01 0.00 0.00 0.00 0.00 99.95
11:38:55 AM 21 0.01 0.00 0.02 0.01 0.00 0.00 0.00 0.00 99.97
11:38:55 AM 22 0.01 0.00 0.03 0.00 0.00 0.00 0.00 0.00 99.96
11:38:55 AM 23 0.01 0.00 0.02 0.00 0.00 0.00 0.00 0.00 99.97
>>>>>>>>>>>
There is something wrong with how the job was submitted. Please take a look at it.
Qs 17: How do I check if my job is short of memory (swapping memory)?
Ans
Log into the node that is running your job. To find the node, run:
qstat -1an|grep <jobNumber>
The node is in the last column. Log into that node (ssh n???) and run "top". The values under RES show the amount of physical memory (RAM) being used by each process. This should correspond to the amount of pmem in your job request. If you are using mem rather than pmem, you will have to add up the values for all processes (1.3g+1.2g+.....=10.2g), and that is only for the instant when top was run. In the example below, the requested memory (mem= ) was not enough. Look at the swap line in the output of "top":
Swap: 8193140k total, 5132424k used, 3060716k free
Here you can see that it is using 5132424k of swap, which ideally should be very low (close to 0k). So the job is using a lot of swap.
>>>>>>>>>>>>>>>>>>>>>>>
top - 13:29:10 up 81 days, 3:30, 3 users, load average: 8.00, 8.00, 7.74
Tasks: 621 total, 9 running, 612 sleeping, 0 stopped, 0 zombie
Cpu(s): 33.8%us, 0.2%sy, 0.0%ni, 66.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 12189240k total, 11812780k used, 376460k free, 14036k buffers
Swap: 8193140k total, 5132424k used, 3060716k free, 38532k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15095 s2831058 20 0 2640m 1.3g 3164 R 100.0 11.2 44:15.23 raxmlHPC-MPI
15096 s2831058 20 0 2640m 1.2g 3196 R 100.0 10.6 44:08.10 raxmlHPC-MPI
15097 s2831058 20 0 2640m 1.3g 3240 R 100.0 11.2 44:08.22 raxmlHPC-MPI
15098 s2831058 20 0 2640m 1.4g 3224 R 100.0 12.3 44:13.74 raxmlHPC-MPI
15100 s2831058 20 0 2640m 1.6g 3212 R 100.0 13.9 44:11.65 raxmlHPC-MPI
15101 s2831058 20 0 2640m 1.1g 3080 R 100.0 9.8 44:04.24 raxmlHPC-MPI
15094 s2831058 20 0 2640m 1.0g 3196 R 99.7 8.9 44:11.99 raxmlHPC-MPI
15099 s2831058 20 0 2641m 1.3g 3184 R 99.7 11.2 44:20.96 raxmlHPC-MPI
17329 root 20 0 0 0 0 S 2.7 0.0 0:00.91 flush-0:17
17313 s2831058 20 0 17728 1544 1032 S 2.0 0.0 0:01.21 htop
3161 root 20 0 2940m 17m 2332 S 0.3 0.1 257:23.80 DNA
17371 root 20 0 103m 3816 2972 S 0.3 0.0 0:00.01 sshd
17405 root 20 0 17492 1672 956 R 0.3 0.0 0:00.05 top
1 root 20 0 21400 1144 980 S 0.0 0.0 0:01.97 init
>>>>>>>>>>>>>>>>>>>>>>>>>
Another tool you can use is "htop". See the attached pictures for the swap row. It should be a very short line, or no line at all, if the node is not swapping. If it is swapping heavily, it will be a long line. I would suggest you request more memory. PBS will then dispatch your jobs to the bigger memory nodes. Your jobs may have to wait a little longer in the queue, but they will then run on the appropriate bigger memory nodes (currently our 4 biggest nodes have 96 GB RAM).
Here htop output shows a heavily swapping system:
Here htop output shows a lightly swapping system:
Another tool you can use is called "free".
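Adding up the RES column by hand is error-prone, so the sum can be sketched in awk. This is a minimal sketch, assuming the "top" process-line layout shown in this answer (RES in the 6th column, command name last, suffixes g and m for GB and MB, plain numbers in KB); the function name sum_res_gb is hypothetical.

```shell
#!/bin/bash
# sum_res_gb COMMAND (hypothetical helper): read "top" process lines on
# stdin and print the total RES of all processes whose command name is
# COMMAND, in GB with one decimal place.
sum_res_gb() {
    awk -v cmd="$1" '
        $NF == cmd {
            res = $6
            unit = substr(res, length(res), 1)
            if (unit == "g")      total += res + 0              # e.g. 1.3g
            else if (unit == "m") total += (res + 0) / 1024     # e.g. 17m
            else                  total += (res + 0) / 1048576  # plain KB
        }
        END { printf "%.1f\n", total }'
}
# Example: one batch-mode snapshot of top, summed for the job binary:
# top -b -n 1 | sum_res_gb raxmlHPC-MPI
```

Compare the printed total with the mem value in your job request; if the total is close to or above it, request more memory.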
Qs 18: How do I modify the number of requested cores, memory, walltime etc. after submitting the job?
Ans
qalter -l select=1:ncpus=12:mem=5gb <jobID>
qalter -l walltime=<hh:mm:ss> <jobID>
To release a job that is in "Hold" status:
qrls -hs <jobID>
Qs 19: How do I delete all my pbs jobs?
Ans
qdel `qselect -u username`
Qs 20: What is the syntax to add group information in the PBS script? (For example, to access software that is restricted to privileged users, such as the vasp group for the VASP software)
Ans
Please put this line in the PBS script:
#PBS -W group_list=nimrodusers
or simply use it in qsub as follows:
qsub -l walltime=800:00:00,select=1:ncpus=2 -W group_list=nimrodusers -N bm2-12 -q gpu -- /export/home/s123456/blade2.sh
Qs 21: How do I make an advance reservation?
Ans
Please review this documentation.
Qs 22: If running the same job many times, what is the best way to submit it?
Ans
An excellent guide from USyd can be found here
If you are running the same job many times, it is best to use job arrays (rather than running qsub multiple times). So, to run it 10 times, you could do:
qsub -N <jobname> -J 1-10 job_submit.pbs
#!/bin/bash
#PBS -J 1-100
#PBS -m abe
#PBS -M YOUREMAIL@griffithuni.edu.au
#PBS -N MQworker
#PBS -l select=1:ncpus=1:mem=2g,walltime=2:00:00
cd $PBS_O_WORKDIR
./myBinary ${PBS_ARRAY_INDEX} 3600 >> output_${PBS_ARRAY_INDEX}
Another example:
qsub -l walltime=00:00:10 -J 1-4 -- /bin/sleep 3
Another Example
cat array_demo_pbs
#!/bin/bash
#Demonstrate the behaviour of job arrays
#PBS -m abe
#PBS -M EMAILID@griffith.edu.au
#PBS -N TestArray
#PBS -l select=1:ncpus=1:mem=1gb,walltime=00:01:00
#PBS -J 1-10
#PBS -q workq
sleep 60
#Make a new directory for this job in the array inside our current directory
mkdir $PBS_O_WORKDIR/Demo$PBS_ARRAY_INDEX
#Change into it
cd $PBS_O_WORKDIR/Demo$PBS_ARRAY_INDEX
#Send some output to a file
printf " This is a job number $PBS_ARRAY_INDEX in the array\n" >job.$PBS_ARRAY_INDEX.log
Further example
#!/bin/bash
#PBS -m abe
#PBS -M YourEmail@griffith.edu.au
#PBS -N IndyMarxACT_v3
#PBS -q routeq
#PBS -J 1-20
#PBS -l select=1:ncpus=1:mem=1g,walltime=20:00:00
cd $PBS_O_WORKDIR
module load R/4.0.3
R CMD BATCH /export/home/snumber/pbs/array/scripts/script_$PBS_ARRAY_INDEX.R
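When the inputs are not conveniently numbered, a common pattern is to keep one input per line in a text file and let each array member pick its own line using PBS_ARRAY_INDEX. The sketch below illustrates this; the function name nth_line and the file name inputs.txt are hypothetical.

```shell
#!/bin/bash
# nth_line FILE N (hypothetical helper): print line N of FILE (1-based).
# Inside a job array, N would be $PBS_ARRAY_INDEX.
nth_line() {
    sed -n "${2}p" "$1"
}
# Example inside an array job script (inputs.txt is a placeholder):
# INPUT=$(nth_line "$PBS_O_WORKDIR/inputs.txt" "$PBS_ARRAY_INDEX")
# ./myBinary "$INPUT"
```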
Qs 23: How do I use screen to leave a terminal session running even after logging out of the system?
Ans
Screen can be used when you want to leave a terminal session running even after logging out of the system.
For easy reference, here is a list of the most common screen commands. It isn't exhaustive, but it should be enough for most users to get started with screen.
screen -d -m -S shared (start a detached session named "shared")
screen -ls (list the existing sessions)
screen -x shared (attach to the session named "shared")
Start Screen: screen
Detach Screen: Ctrl-a d
Re-attach Screen: screen -x or screen -x PID
Split Horizontally: Ctrl-a S
Split Vertically: Ctrl-a |
Move Between Windows: Ctrl-a Tab
Name Session: Ctrl-a A
Log Session: Ctrl-a H
Note Session: Ctrl-a h
Qs 24: How do I mount a filesystem from a remote server on gowonda?
First ask the system admin on gowonda to do this:
>>>>>>>>>
usermod -a -G fuse <username>
>>>>>>>>>
You need an SSH server installed and running on the computer you want to mount (for most Unix-like OSes, including Mac OS X, it is a matter of installing the openssh-server package). Then check that SSH itself is working with:
>>>>>>>>>>>>>>>>>>>>>>
ssh user@remoteiP
>>>>>>>>>>>>>>>>>>>>>>
This will ask for confirmation the first time, then ask for a password every time you try to connect. Once connected, press Ctrl+d to disconnect, then run the sshfs command:
>>>>>>>>>>>>>>>>>>>>>>
sshfs user@remoteiP:/directory/to/mount /mount/point
e.g. on gowonda:
mkdir /tmp/mnt1
sshfs s12345@132.234.1.1:/opt/import /tmp/mnt1
>>>>>>>>>>>>>>>>>>>>>>
As sshfs uses FUSE, it will refuse to mount onto a non-empty directory, so make sure there's nothing in the mount point beforehand. Now the contents of the remote directory should be available to you. When you are finished, unmount the remote directory with:
>>>>>>>>>>>>>>>>>>>>>>
fusermount -u /mount/point
e.g.:
fusermount -u /tmp/mnt1
>>>>>>>>>>>>>>>>>>>>>>
After doing this a couple of times, you will find it incredibly annoying to type your password each time. Fortunately, SSH can use keys for logging in. First generate your keys on the local machine with:
>>>>>>>>>>>>>>>>>>>>>>
ssh-keygen -t rsa
>>>>>>>>>>>>>>>>>>>>>>
Don't set a passphrase when asked, just press Enter, otherwise you'll defeat the point of creating the key. This creates a pair of files in ~/.ssh, which are your private and public keys. Your private key should never be shared, but you need to copy the public one to the remote computer with this command:
>>>>>>>>>>>>>>>>>>>>>>
ssh-copy-id -i ~/.ssh/id_rsa.pub user@remoteiP
>>>>>>>>>>>>>>>>>>>>>>
This will ask for the password one more time, but after that you can run the above ssh and sshfs commands without being asked for a password again. All of these commands should be run as your normal user. You can omit the user@ part of the ssh and sshfs commands if you have the same username on both computers.
Source: Linux Format 5 Apr 2016 p.92
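Because the mount fails on a non-empty directory, it can help to check the mount point first. This is a minimal sketch; the function name is_empty_dir is hypothetical, and the paths in the example are the placeholders used in this answer.

```shell
#!/bin/bash
# is_empty_dir DIR (hypothetical helper): exit status 0 only if DIR exists,
# is a directory, and contains no entries (hidden files included).
is_empty_dir() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}
# Example:
# mkdir -p /tmp/mnt1
# if is_empty_dir /tmp/mnt1; then
#     sshfs s12345@132.234.1.1:/opt/import /tmp/mnt1
# else
#     echo "/tmp/mnt1 is not empty, choose another mount point" >&2
# fi
```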
Qs 25: PBS environment variables
Several environment variables are provided to PBS jobs. Some are taken from the user's environment and carried with a job, and others are created by PBS. There are also some environment variables that you can explicitly create for exclusive use by PBS jobs. All PBS-provided environment variable names start with the characters "PBS_". Some start with "PBS_O_", which indicates that the variable is taken from the job's originating environment (that is, the user's environment). A few useful PBS environment variables are described in the following list:
PBS_O_WORKDIR - Contains the name of the directory from which the user submitted the PBS job
PBS_O_PATH - Value of PATH from the submission environment
PBS_JOBID - Contains the PBS job identifier
PBS_JOBDIR - Pathname of the job-specific staging and execution directory
PBS_NODEFILE - Contains a list of vnodes assigned to the job
TMPDIR - The job-specific temporary directory for this job. Defaults to /tmp/pbs.job_id on the vnodes.
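As an illustration of using these variables inside a job script, the sketch below counts the distinct vnodes listed in the file that PBS_NODEFILE points to. It assumes the file holds one vnode name per line; the function name job_node_count is hypothetical.

```shell
#!/bin/bash
# job_node_count [NODEFILE] (hypothetical helper): print the number of
# distinct vnode names listed in NODEFILE, one name per line.
# NODEFILE defaults to $PBS_NODEFILE, which PBS sets inside a running job.
job_node_count() {
    local nodefile="${1:-$PBS_NODEFILE}"
    sort -u "$nodefile" | wc -l | tr -d ' '
}
# Example inside a job script:
# cd "$PBS_O_WORKDIR"
# echo "Job $PBS_JOBID is using $(job_node_count) node(s)"
```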
Qs 26: How can a specific execution node be requested?
As the cluster runs a mix of new and old nodes, you may need to request a specific node (e.g. when the code was compiled for the new node architecture). Here is the procedure.
When the code was compiled for the new node architecture and the app is run on the old nodes, you may get errors like this:
Please verify that both the operating system and the processor support Intel(R) MOVBE, F16C, FMA, BMI, LZCNT, AVX2, AVX512DQ, AVX512F, ADX, AVX512CD, AVX512BW and AVX512VL instructions.
Solution:
========
The new nodes are:
gc-prd-hpcn001
gc-prd-hpcn002
gc-prd-hpcn003
gc-prd-hpcn004
gc-prd-hpcn005
gc-prd-hpcn006
To check if the resource is free, please run this command:
pbsnodes -aSj|egrep "gc-prd-hpcn002|gc-prd-hpcn001|gc-prd-hpcn003|gc-prd-hpcn004|gc-prd-hpcn005|gc-prd-hpcn006"
Let's say gc-prd-hpcn006 has free memory and free CPUs.
>>>>>>>>>Here is a sample pbs script to send a job to gc-prd-hpcn006<<<<<<<<<
#!/bin/bash
#PBS -N SpecificHost
###PBS -m abe
###PBS -M ID@griffithuni.edu.au
#PBS -l select=1:ncpus=1:host=gc-prd-hpcn006:mem=8gb,walltime=24:00:00
echo "Starting job: "
cd $PBS_O_WORKDIR
echo "Hello"
sleep 10
>>>>>>>>>>>>>>>>>>>>>>>
For an interactive run, you can do this:
qsub -I -X -q workq -l select=1:ncpus=1:host=gc-prd-hpcn006:mem=8gb,walltime=24:00:00
If a host from a range is acceptable:
qsub -I -X -q workq -l select=1:ncpus=1:host=gc-prd-hpcn001-006:mem=1gb,walltime=00:01:00
PS: If you need to send the job to a different queue, first find the queue a node belongs to with this command:
qhost|egrep "gc-prd-hpcn002|gc-prd-hpcn001|gc-prd-hpcn003|gc-prd-hpcn004|gc-prd-hpcn005|gc-prd-hpcn006"
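To see whether the node a job landed on supports a given instruction set, you can inspect the CPU flags that Linux exposes in /proc/cpuinfo (for example avx2 or avx512f). This is a minimal sketch for Linux x86 nodes; the function name cpu_has_flag is hypothetical.

```shell
#!/bin/bash
# cpu_has_flag FLAG [CPUINFO] (hypothetical helper): exit status 0 if the
# first "flags" line of CPUINFO (default /proc/cpuinfo) lists FLAG as a
# whole word, 1 otherwise.
cpu_has_flag() {
    local flag="$1" cpuinfo="${2:-/proc/cpuinfo}"
    grep -m1 '^flags' "$cpuinfo" | grep -qw -- "$flag"
}
# Example at the top of a job script:
# if ! cpu_has_flag avx512f; then
#     echo "$(hostname) lacks AVX512F; request one of the new nodes" >&2
#     exit 1
# fi
```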
Qs 27: Getting the error "A library with BLAS API not found. Please specify library location"
You could try a trace:
cmake --trace . 2>&1 | tee /tmp/cmakeOut.txt
You may have to specify the library paths explicitly. For example:
cmake . -DLAPACK_LIBRARIES=/sw/library/lapack/lapack-3.6.0/3.6.0/lib64/liblapack.so -DBLAS_LIBRARIES=/sw/library/blas/CBLAS/lib/cblas_LINUX.so
OR:
cmake . -DLAPACK_LIBRARIES=/sw/library/lapack/lapack-3.6.0/3.6.0/lib64/liblapack.so -DBLAS_LIBRARIES=/sw/library/blas/CBLAS/lib/cblas_LINUX.so -DCMAKE_INSTALL_PREFIX=/sw/simbody/353
Qs 28: How do I check the number of CPUs my job is using?
You can find out on which compute node your job is running:
qstat -1an|grep snumber
>>>>>>>>>>>>>>>>>>>
e.g.:
qstat -1an|grep s2761086
4598354.pbsserv s2761086 workq DT_k-e_04 21795 1 1 10gb 99999 R 523:2 n010/0
>>>>>>>>>>>>>>>>>>>
Here you see it is running on n010. Then you can do this:
ssh nodename -t "htop"
or
ssh nodename -t "htop -u username"
e.g.:
ssh n010 -t "htop -u s2761086"
Press the <F2> key, go to "Columns", and add PROCESSOR under "Available Columns". The ID of the CPU currently used by each process will appear under the "CPU" column.
Qs 29: How do I customize an environment variable using modules?
mkdir -p ~/sw/Modules
cd ~/sw/Modules
Create a modules file for the application (here it is named xbeach):
>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
vi xbeach
#%Module######################################################################
##
## XBeach modulefile
##
proc ModulesHelp { } {
puts stderr "Sets up shell environment to use XBeach "
puts stderr " "
}
module load gnu/4.9.2
set base /gpfs1/groups/gccmss/project/sw/xbeach
set base_path $base
##prepend-path INCLUDE $base_path/include
prepend-path LD_LIBRARY_PATH $base_path/lib
prepend-path PATH $base_path/bin
##prepend-path MANPATH $base_path/man
##setenv NCBI $base_path/local
>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Edit the .bash_profile file and add the local MODULEPATH:
>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
vi ~/.bash_profile
export MODULEPATH=$MODULEPATH:~/sw/Modules
>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Log out and log back in. Check if the module is available:
module avail xbeach
Qs 30: Provide an example pbs script that can be used on flashlite HPC located in UQ
#! /bin/bash
### Template PBS script to submit Delft3d job to queue gccm
#PBS -m e
#PBS -N brisb0809
#PBS -l walltime=200:00:00
#PBS -l nodes=1:ppn=4:intel,mem=30gb,vmem=30gb
#PBS -A qris-gu
source $HOME/.bashrc
module load delft3d-rev5624
cd /gpfs1/groups/gccmss/project/delft3d/brisb_aug0809/
##input data file
export MDWFILE=fm_wave.mdw
export ARGFILE=config_flow2d3d.xml
## RUN ##
# start flow
echo "=== start Delft3D-FLOW in the background ==="
#$MPIRUNEXEC -np $NHOSTS d_hydro.exe $ARGFILE &
#deltares_hydro.exe $ARGFILE -keepXML &
#deltares_hydro.exe $ARGFILE &
d_hydro.exe $ARGFILE &
# start wave
echo "=== start Delft3D-WAVE in the background ==="
#$MPIRUNEXEC -np $NHOSTS wave.exe $MDWFILE 1
wave.exe $MDWFILE 1
echo "=== calculation finished ==="
echo "Finished job "`date`
Qs 31: Multiple cores are requested and allocated by PBS but the job runs on only 1 core. Why is that?
This contribution is from Nicholas Dhal and is acknowledged. Nick is an active Griffith HPC user.
>>>>>>>>>
- Using PuTTy (or similar), login to the HPC headnode.
- Identify the compute node the simulation is running on. (Try typing 'qstat -1an|grep -u s123456' to get a list of running jobs with node numbers).
- Login to the node where the process will be. To login type 'ssh nxxx', where xxx is the node number.
- Identify the process id. Either by looking at a cleanup file (seen through WinSCP in the directory of execution) or by looking at a list of all running processes on the node (step 5).
- To see a list of running processes on the node, run the command 'htop'. Put it into tree mode for best view (F5). You can then see the CPU the processes are on and how much of the CPU the process is using.
- Once ready to find the affinity setting, exit 'htop' but stay logged into the node. Run the command 'taskset -p xxxx' where xxxx is the process id.
- The value returned is a hexadecimal code that should be converted to binary (just google hex to binary for converter).
- The binary code will be a zero where the core isn't allowed and a one where the core is allowed.
- To set the affinity, we recommend allowing all cores (the scheduler will automatically set it to the ones you are actually allocated). The hex code for all cores is F (one F for each 4 cores on the node). To set run the command 'taskset -p FFF xxxx' where the number of F's is the number of cores on the node divided by 4 and xxxx is the process id.
- You will have to repeat set 9 for all main process id's that are running on your job (just the main processes that use 100% of CPU). If you have multiple nodes, you will have to repeat for each node.
- ssh n006
- htop
- (look around and find process id, lets say 2167)
- F10
- taskset -p 2167
- the value returned would be '20' if only core 6 is allowed. In binary this is 000000100000, where the single 1 is in the sixth place (reading right to left)
- taskset -p fff 2167 (if the number of cores in the node was 24, it would be taskset -p ffffff 2167)
- taskset -p 2167
- exit
>>>>>>>>>>
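The hex arithmetic in the steps above can be done in the shell rather than with an online converter. The following is a minimal sketch, not part of the original contribution: the function names all_cores_mask and mask_to_places are hypothetical, and places are counted from 1, right to left, to match the wording above (taskset itself numbers CPUs from 0).

```shell
#!/bin/bash
# Hex affinity mask that allows every core on a node with $1 cores.
# Relies on 64-bit shell arithmetic, so use it for nodes with up to 62 cores.
all_cores_mask() {
    printf '%x\n' $(( (1 << $1) - 1 ))
}

# List the places (counted from 1, reading right to left) that are set
# to 1 in the hex mask $1, i.e. the cores the process is allowed to use.
mask_to_places() {
    local mask=$(( 16#$1 )) place=1 out=""
    while [ "$mask" -gt 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            out="$out${out:+ }$place"
        fi
        mask=$(( mask >> 1 ))
        place=$(( place + 1 ))
    done
    echo "$out"
}

all_cores_mask 12     # prints fff
all_cores_mask 24     # prints ffffff
mask_to_places 20     # prints 6
```

For example, 'taskset -p "$(all_cores_mask 12)" 2167' would allow process 2167 to use all cores of a 12-core node.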
Qs 32: How to check the remaining licenses on the license server
/usr/local/bin/lmstat -a -c 27006@gc-prd-erslic.corp.griffith.edu.au
/sw/misc/intelflexlm/lmstat -a -c 27006@gc-prd-erslic.corp.griffith.edu.au
/usr/local/bin/lmstat -a -c 27004@gc-prd-erslic.corp.griffith.edu.au
Qs 33: User defined Job Priority
The queue system uses an algorithm to maximize the use of resources while prioritizing some jobs over others. Normal users can only give their jobs a lower priority than normal (e.g. by passing a negative argument to -p). This lets you shuffle the order of your own jobs with respect to each other.
qalter -p value job
qalter -p -1020 101
qalter -l prio=1000 <jobID>
qsub -p priority job-script
priority is a number between -1024 and +1023. A higher number means higher priority. The default priority is 0.
Ref:
Qs 34: How do I request access to external clusters like NCI's Raijin Cluster
Here are some of the HPC resources available to Griffith researchers:
1. Griffith users can use facilities like NCI and Pawsey. For NCI, project allocations are usually made quarterly and are on a "use it or lose it" basis. To get access please follow the instructions here: http://nci.org.au/access/user-registration/new-project-application/
You will need to get a new user account first as specified on the page. When you apply for a project please select to propose a new project and select QCIF as the funding body. You will need to write a project proposal and request computational time per quarter. You may contact the QCIF representative, Marlies Hankel (m.hankel@uq.edu.au), if you have any questions related to NCI, Pawsey etc.
2. The following resources are available through QCIF/QRISCloud:
a. euramoo cluster
b. flashlite cluster
c. special compute nodes with GPUs, big memory etc
d. virtual machines
e. virtual storage
To request an account, please follow the links here: https://www.qriscloud.org.au/index.php/services
For example, to request an account on the euramoo cluster, please click the link "Register to use Euramoo". To get further information about a particular resource, please click on the "more" link beside the link for that resource. For example, click on the "more" link beside "Request to use Flashlite" to get more information about the flashlite cluster.
Qs 35: Sample script on the awoonga cluster
#!/bin/bash
#PBS -m abe
#PBS -M YOUREMAIL@griffith.edu.au
#PBS -A qris-gu
#PBS -l nodes=1:ppn=1,mem=3GB,vmem=3GB,walltime=01:00:00
#PBS -N TestOnly
######################################################################
#### This section is setting up and running your executable or script
######################################################################
module load R/3.2.3
module list
cd $PBS_O_WORKDIR
#Now do some things
echo -n "What time is it ? "; date
echo -n "Who am I ? " ; whoami
echo -n "Where am I ?"; pwd
echo -n "What's my PBS_O_WORKDIR ?"; echo $PBS_O_WORKDIR
echo -n "What's my TMPDIR ?"; echo $TMPDIR
echo "Sleep for a while"; sleep 1m
echo -n "What time is it now ? "; date
ref: https://www.qriscloud.org.au/support/qriscloud-documentation/92-awoonga-user-guide#batch_system
Qs 36: How to install Local R Packages without root access
Ref: http://www.ceci-hpc.be/r_packages.html
To enable you to download R packages from outside Griffith, you may do this:
source /usr/local/bin/s3proxy.sh
Load the R module:
module load R/4.1.3nopkgs
(Older R modules are also available if needed, e.g. module load R/4.0.3, module load R/3.6.1 or module load anaconda3/2019.07py3; source activate R)
Create a file named ~/.Renviron:
nano ~/.Renviron
## Linux - check version of R/OS
R_LIBS=~/R/x86_64-pc-linux-gnu-library/3.6
OR
R_LIBS=~/R/x86_64-pc-linux-gnu-library/4.0
(Create this directory if it doesn't exist, e.g. mkdir -p ~/R/x86_64-pc-linux-gnu-library/3.6)
The .libPaths() command lists the places where R will search for libraries; R uses the first item of the list as the target for new package installs. Try this:
module load R/4.1.3nopkgs
R
> .libPaths()
[1] "/usr/lib64/R/library" "/usr/share/R/library"
Let's install the dummy package:
> install.packages('dummy')
R tries to install it in the global library
Installing package(s) into ‘/usr/lib64/R/library’
(as ‘lib’ is unspecified)
Warning in install.packages("dummy") :
'lib = "/usr/lib64/R/library"' is not writable
but quickly notes that it cannot write in the global place, and asks whether it should create a local library. Simply answer 'y'.
Would you like to create a personal library
'~/R/x86_64-redhat-linux-gnu-library/2.13'
to install packages into? (y/n) y
Another example:
install.packages('raster', type='source')
install.packages('RCurl', type='source')
install.packages('rgdal', type='source', configure.args="--with-proj-share=~/opt/sw/library/proj/4.8.0/share/proj")
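The personal library path above depends on the major.minor version of R in use. A small sketch of deriving it in the shell follows; the helper name r_lib_dir is hypothetical, and the platform string x86_64-pc-linux-gnu is assumed from the paths shown above (your R build may report a different one).

```shell
#!/bin/bash
# Print the personal library directory for an R version such as 4.1.3.
# Assumption: the platform string is x86_64-pc-linux-gnu, as in the
# R_LIBS examples above.
r_lib_dir() {
    local version=$1
    local major_minor=${version%.*}      # 4.1.3 -> 4.1
    echo "$HOME/R/x86_64-pc-linux-gnu-library/$major_minor"
}

# Line that could be placed in ~/.Renviron for R 4.1.3
echo "R_LIBS=$(r_lib_dir 4.1.3)"
```

The directory can then be created with mkdir -p "$(r_lib_dir 4.1.3)" before starting R.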
Qs 37: How to install custom packages on flashlite/awoonga clusters without root access
Sometimes, it is possible to install specific versions of software without root access. Please find below an example of how the R package named "rgdal" was installed in a local directory
module load R/3.2.3
cd /tmp
Download the packages you want
>>>>>>>>>>
e.g:
wget https://cran.r-project.org/src/contrib/Archive/rgdal/rgdal_1.2-10.tar.gz
wget https://cran.r-project.org/src/contrib/Archive/rgdal/rgdal_1.2-4.tar.gz
wget https://cran.r-project.org/src/contrib/Archive/sp/sp_1.2-2.tar.gz
>>>>>>>>>>>
R CMD INSTALL /tmp/rgdal_1.2-10.tar.gz
* installing to library '~/R/x86_64-pc-linux-gnu-library/3.2'
ERROR: this R is version 3.2.3, package 'rgdal' requires R >= 3.3.0
Tried with a lower version of rgdal
>>>>>>>>>
R CMD INSTALL rgdal_1.2-4.tar.gz
* installing to library '/~/R/x86_64-pc-linux-gnu-library/3.2'
ERROR: dependency 'sp' is not available for package 'rgdal'
* removing '/~/R/x86_64-pc-linux-gnu-library/3.2/rgdal'
So I tried:
>>>>>>>>
It said the sp package was needed. Try a lower version of sp
R CMD INSTALL sp_1.2-2.tar.gz
That worked. But another error was received:
>>>>>>>>>>>>>>>>>>
R CMD INSTALL rgdal_1.2-4.tar.gz
checking for gdal-config... no
configure: error: gdal-config not found or not executable.
ERROR: configuration failed for package 'rgdal'
>>>>>>>>>>>>>>
Tried to load this package using the command "module load gdal/2.0.2"
Got another error:
checking for proj_api.h... no
configure: error: proj_api.h not found in standard or given locations.
Try this:
module load proj/4.9.1
Still got the error "configure: error: proj_api.h not found in standard or given locations."
A cursory look at the proj/4.9.1 module file showed that it was not set up correctly (module display proj/4.9.1).
Fix it by creating a custom module file:
mkdir ~/.moduleshome
cd ~/.moduleshome
vi ~/.bash_profile (add the following line)
export MODULEPATH=$MODULEPATH:~/.moduleshome
source ~/.bash_profile
Create the custom module file for proj:
>>>>
mkdir -p ~/.moduleshome/proj/
vi ~/.moduleshome/proj/mine-4.9.1
>>>>>>>>
setenv PROJHOME /opt/proj
prepend-path PATH /opt/proj/bin
prepend-path LD_LIBRARY_PATH /opt/proj/lib
prepend-path LDFLAGS -L/opt/proj/lib
prepend-path LIBRARY_PATH /opt/proj/lib
prepend-path PKG_CONFIG_PATH /opt/proj/lib/pkgconfig
prepend-path MANPATH /opt/proj/share/man
prepend-path INCLUDE /opt/proj/include
prepend-path CPLUS_INCLUDE_PATH /opt/proj/include
prepend-path CPATH /opt/proj/include
prepend-path C_INCLUDE_PATH /opt/proj/include
prepend-path FPATH /opt/proj/include
prepend-path CFLAGS -I/opt/proj/include
prepend-path CPPFLAGS -I/opt/proj/include
prepend-path CXXFLAGS -I/opt/proj/include
prepend-path FCFLAGS -I/opt/proj/include
prepend-path FFLAGS -I/opt/proj/include
-------------------------------------------------------------------
>>>>>>>>>
Then:
module unload proj/4.9.1
module load proj/mine-4.9.1
R CMD INSTALL rgdal_1.2-4.tar.gz
SUCCESS!
Qs 38: How do I get started on the awoonga cluster
You may request an account on Awoonga by going to this link: https://www.qriscloud.org.au/services
Please click the link "Register to use Awoonga".
Usage:
Griffith users have access to the directories /sw/GU and /sw/Modules/GU. The Griffith support team has write permissions via the group "suppport4". All others have read access to these directories.
The Griffith support team has created modulefiles under /sw/Modules/GU. All users are asked to add this line to the ~/.bashrc file in their home directory:
module use /sw/Modules/GU
Parallel to the several /sw/{support team) directories
We recommend the compiler versions from the SDSC rolls, so:
module load gnu or intel
module load any libraries
Build (configure, make, make install)
module avail will list all the modules.
Qs. 39: Proper way to set up library path on custom installation
vi ~/.bashrc
module use /export/home/YOURSNUMBER/sw/Modules/GU
>>>>>>>
mkdir -p ~/sw/Modules/GU
cd ;HOMEDIR=`pwd`;echo $HOMEDIR
echo "module use $HOMEDIR/sw/Modules/GU" >>~/.bashrc
source ~/.bashrc
>>>>>>>
Create a module file like this:
vi ~/sw/Modules/GU/postgress
>>>>>>>>>>>
#%Module######################################################################
##
## PostGress modulefile
##
proc ModulesHelp { } {
    puts stderr "Sets up shell environment to use Postgress"
    puts stderr " "
}
set base_path /export/home/snumber/sw/PostgresSQL
prepend-path PATH $base_path/bin
#prepend-path MANPATH $base_path/share/man
prepend-path MANPATH $base_path/share
##setenv BLASTDB $dbbase_path/blastdb
prepend-path LIBRARY_PATH $base_path/lib
prepend-path LD_LIBRARY_PATH $base_path/lib
prepend-path LDFLAGS -L$base_path/lib
prepend-path PKG_CONFIG_PATH $base_path/lib/pkgconfig
prepend-path INCLUDE $base_path/include
prepend-path CPLUS_INCLUDE_PATH $base_path/include
prepend-path CPATH $base_path/include
prepend-path C_INCLUDE_PATH $base_path/include
prepend-path FPATH $base_path/include
prepend-path CFLAGS -I$base_path/include
prepend-path CPPFLAGS -I$base_path/include
prepend-path CXXFLAGS -I$base_path/include
prepend-path FCFLAGS -I$base_path/include
prepend-path FFLAGS -I$base_path/include
>>>>>>>>>>>>
source ~/.bashrc
Check if it works:
module display postgress
Qs40: How to install a python based application while not having root privilege
Here is an example of installing a python application named busco locally:
>>>>>>>>>>>>>>>>>>>>>
mkdir ~/sw
cd ~/sw
cd /export/home/s2981868/sw/busco
#You may load any version of python we have on the cluster
module load python/3.6.1
cd ~/sw/busco
python setup.py install --prefix=/export/home/s123456/sw/busco 2>&1 | tee buscoInstallLog.txt
>>>>>>>>>>>>>>>>>>>>>>
That's all. Now you need to set the PYTHONPATH variable properly. The easiest way is to set up a module file locally, or to add this line to .bashrc:
export PYTHONPATH=/export/home/s12345/sw/busco/lib/python3.6/site-packages
The preferred option is to add it as a module. You only need to do this once:
>>>>>>>>>>>>>>>setting up custom/local module env<<<<<<<<<<<<<<<<
mkdir ~/sw/Modules
cd ;HOMEDIR=`pwd`;echo $HOMEDIR
echo "module use $HOMEDIR/sw/Modules/" >>~/.bashrc
source ~/.bashrc
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Set up the module file
>>>>>>>>>>>>>>>>>>>>>>
Create a module file named custom-busco in /export/home/s12345/sw/Modules/busco
>>>>>>>>>
#%Module######################################################################
##
## python modulefile
##
proc ModulesHelp { } {
    puts stderr "Sets up paths for busco with python 3.6.1"
}
module-whatis "adds custom PYTHONPATH directories to PATH etc. "
set python_base /export/home/s12345/sw/busco
prepend-path PATH $python_base/scripts
prepend-path PYTHONPATH $python_base/lib/python3.6/site-packages
prepend-path LD_LIBRARY_PATH $python_base/lib:$python_base/lib/python3.6
>>>>>>>>>
++++++++++++++++
module display busco/custom-busco
-------------------------------------------------------------------
/export/home/s12345/sw/Modules//busco/custom-busco:
module-whatis adds custom PYTHONPATH directories to PATH etc.
prepend-path PATH /export/home/s12345/sw/busco/scripts
prepend-path PYTHONPATH /export/home/s12345/sw/busco/lib/python3.6/site-packages
prepend-path LD_LIBRARY_PATH /export/home/s12345/sw/busco/lib:/export/home/s12345/sw/busco/lib/python3.6
-------------------------------------------------------------------
+++++++++++++++++
Simply do this:
module load busco/custom-busco
To run this program, you can now do this:
/export/home/s12345/sw/busco/scripts/run_BUSCO.py -i /export/home/s12345/scratch/SRO_trinity.fasta -o SRO_BUSCO -l ~/scratch/metazoa_odb9 -m tran -c 4
Once satisfied, you can write the pbs script and the config.ini (/export/home/s2981868/sw/busco/config/)
>>>>>>>>>>
#!/bin/bash -l
#PBS -m abe
#PBS -M MYEMAIL@griffith.edu.au
#PBS -N BUSCO_SRO
#PBS -l walltime=250:00:00
#PBS -l select=1:ncpus=4:mpiprocs=4
cd $PBS_O_WORKDIR
source $HOME/.bashrc
module load python/3.6.1
module load busco/custom-busco
/export/home/s12345/sw/busco/scripts/run_BUSCO.py -i /export/home/s12345/scratch/SRO_trinity.fasta -o SRO_BUSCO -l ~/scratch/metazoa_odb9 -m tran -c 4
Qs41: How to install python modules without root access
ref: https://stackoverflow.com/questions/7465445/how-to-install-python-modules-without-root-access
In this example, python 2.7.10 is used but this is applicable to any version of python.
module load python/2.7.10
Run the netcheck script to log into the internet:
sh /sw/sysadmin/netcheck (old cluster)
OR on the new cluster
source /usr/local/bin/s3proxy.sh
You may use pip/easy_install/python setup.py to install to a local directory.

pip
===
pip install --user package_name
pip install --install-option="--prefix=$HOME/scripts" package_name
e.g pip install --install-option="--prefix=/export/home/s2819099/scripts" nose

easy_install
==========
easy_install --prefix=$HOME/scripts package_name
which will install into $HOME/scripts/lib/pythonX.Y/site-packages
You will need to manually create $HOME/scripts/lib/pythonX.Y/site-packages and add it to your PYTHONPATH environment variable (otherwise easy_install will complain; run the command above once to find the correct value for X.Y).
e.g:
export PYTHONPATH=/export/home/s2819099/scripts/lib/python2.7/site-packages
easy_install --prefix=/export/home/s2819099/scripts nose

setup.py
========
Source: http://docs.python.org/install/index.html#alternate-installation
python <lxml_distrib_dir>/setup.py install --home=<dir>
e.g: python setup.py install --home=/export/home/s2819099/scripts

virtualenv
==========
$ curl -O https://raw.github.com/pypa/virtualenv/master/virtualenv.py
$ python virtualenv.py my_new_env
$ . my_new_env/bin/activate
(my_new_env)$ pip install package_name
Source and more info: https://virtualenv.pypa.io/en/latest/installation/

Finish the installation
========================
export PYTHONPATH=$PYTHONPATH:$HOME/scripts/lib/python2.7/site-packages
Please log out of the internet on gowonda, otherwise all internet charges from all users will be charged to your account.
sh /sw/sysadmin/netcheck -logout
Qs42 : User defined dependency
Source: http://web.mit.edu/longjobs/www/faq.html
To change a dependency after you have submitted a job, use qalter -W depend=type:argument jobid; to remove the dependency completely, use qalter -W depend=type jobid (i.e., omit :argument). For example:
athena% qstat -f 1175 | grep depend
depend = afterok:1171.hydrogen.mit.edu@hydrogen.mit.edu
To make 1175 wait for 1172 instead of 1171:
athena% qalter -W depend=afterok:1172 1175
athena% qstat -f 1175 | grep depend
depend = afterok:1172.hydrogen.mit.edu@hydrogen.mit.edu
To clear the dependency:
athena% qalter -W depend=afterok 1175
athena% qstat -f 1175 | grep depend
athena%
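When a job has to wait for several others, the value passed to -W can be assembled by a small helper. This is only a sketch: the function name depend_arg is hypothetical, and it simply joins the dependency type and job ids with colons in the depend=type:jobid[:jobid...] form used above.

```shell
#!/bin/bash
# Build the value for "qalter -W" or "qsub -W" from a dependency type
# and zero or more job ids.
depend_arg() {
    local type=$1; shift
    local arg="depend=$type"
    local id
    for id in "$@"; do
        arg="$arg:$id"
    done
    echo "$arg"
}

depend_arg afterok 1172          # prints depend=afterok:1172
depend_arg afterok 1171 1172     # prints depend=afterok:1171:1172
depend_arg afterok               # prints depend=afterok (clears the dependency)
```

It would be used as: qalter -W "$(depend_arg afterok 1172)" 1175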
Qs 43: How do I get access to QCIF share of NCI @Raijin machine? How do I get access to QCIF GPU machine
How do I get access to QCIF share of NCI @Raijin machine?
https://www.qriscloud.org.au/index.php/services/compute#NCI
How do I get access to QCIF GPU machine Wiener (located @UQ)?
Check under "Use Specialised compute" https://www.qriscloud.org.au/index.php/services/compute#SpecialCompute
Qs 44: How do I use wget on GriffithHPC? How do I get outside access to download for example conda packages?
You can go through a proxy to access external collections using wget. You may create a file named .wgetrc (vi ~/.wgetrc) with the following contents:
use_proxy = on
http_proxy = http://s3proxy.itc.griffith.edu.au:3128/
HTTP_PROXY = http://s3proxy.itc.griffith.edu.au:3128/
https_proxy = https://s3proxy.itc.griffith.edu.au:3128/
HTTPS_PROXY = https://s3proxy.itc.griffith.edu.au:3128/
You may create a file named s3proxy.sh (vi ~/s3proxy.sh) with the following contents (note that the shell does not allow spaces around "="):
export http_proxy=http://s3proxy.itc.griffith.edu.au:3128/
export HTTP_PROXY=http://s3proxy.itc.griffith.edu.au:3128/
export https_proxy=https://s3proxy.itc.griffith.edu.au:3128/
export HTTPS_PROXY=https://s3proxy.itc.griffith.edu.au:3128/
Simply source this file (source ~/s3proxy.sh) to get temporary access to external packages. Please note that all downloads are monitored. Also note that your home directory should not go above 200GB in size.
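Since the access is meant to be temporary, it can be handy to switch the proxy off again once the download is done. A possible sketch, assuming the proxy address given above; the function names proxy_on and proxy_off are hypothetical and could be placed in ~/.bashrc.

```shell
#!/bin/bash
# Proxy address taken from the text above.
PROXY_HOST="s3proxy.itc.griffith.edu.au:3128"

# Export the four proxy variables for the current shell session.
proxy_on() {
    export http_proxy="http://$PROXY_HOST/"   HTTP_PROXY="http://$PROXY_HOST/"
    export https_proxy="https://$PROXY_HOST/" HTTPS_PROXY="https://$PROXY_HOST/"
}

# Remove them again so later commands no longer go through the proxy.
proxy_off() {
    unset http_proxy HTTP_PROXY https_proxy HTTPS_PROXY
}
```

A typical session would be proxy_on, then the wget or conda command, then proxy_off.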
Qs 45: How I install my own conda environment without root access
#Gain access to the outside:
source /usr/local/bin/s3proxy.sh
#Load the anaconda module (e.g module load anaconda3/2023.09)
#Add the following entry into .condarc (nano ~/.condarc)
channels:
  - defaults
#Create the following folders:
mkdir -p ~/.conda/envs
mkdir -p ~/.conda/pkgs
See if the following commands work!
#conda info
# conda search flask
conda create --name env1
#e.g conda create --name trinity_env --clone root
#To remove an environment, conda remove -n env --all
source activate env1
#e.g source activate trinity_env
Now you can install your packages:
conda search -c bioconda trinity
For example, to install trinity version 2.9.1:
e.g conda install -n trinity_env -c bioconda trinity=2.9.1
conda create --name rstan --channel conda-forge r-dplyr r-rmapshaper r-sf
To check all available versions of the rstan package:
conda search r-rstan --channel conda-forge
Qs 46: How do I transfer a centrally installed application to my local folder
You may wish to transfer the app locally due to write issues on the app folder. This can be easily accomplished. Here is an example to transfer the SWAT-CUP app:
cd ~
mkdir ~/sw
cp -r /sw/misc/swatcup ~/sw
Change permission like this:
chown -R yoursnumber:yoursnumber ~/sw/swatcup
Now you have it locally. Then set up modules to load it. Follow this link to set up the modules environment:
https://conf-ers.griffith.edu.au/display/GHCD/FAQ+-+Gowonda#FAQ-Gowonda-Qs29:Howtocustomizeanenvironmentalvariableusingmodules
After following the above link and creating the ~/sw/Modules directory, you can simply copy the parent module file and make changes to it where needed.
cp /sw/Modules/misc/swatcup/swatcup ~/sw/Modules
vi ~/sw/Modules/swatcup (make the changes)
This module will be available to you at your next login. Or, if needed immediately:
source ~/.bash_profile
Qs 47: How do you check the current status of the cluster?
On the login node, please run the commands: pnodes, pjobs and pqueues
Press "Q" to quit.
"qhost" will list all the nodes. Please note that not all nodes are available to all users. Only a subset is available.
"pbsnodes -aSj" will give an indication of the currently available resources. Again, not all available resources can be used, and pbs will determine where and when a job is run. For example, there are nodes with 72 cores, but if 72 cores are requested, your wait time could be long (weeks or months) unless the cluster is not busy at all.
"qstat -q" will give the queue configuration.
To view the current cluster status, you can also use the elinks text browser on the login node:
elinks http://localhost:3000/nodes
elinks http://localhost:3000/jobs
elinks http://localhost:3000/queues
(You can press "Q" to quit from the above text-based browser.)
Qs 48: Why isn’t my job running?
There are many reasons that your job may have to wait in the queue longer than you would like. Here are some of them.
1. System load is high. It's frustrating for everyone! There are peaks and lows. Unfortunately it is not possible to predict when we will experience this. We have noted that near the beginning of the semester, midway through the semester and at the end of the semester, the load is high. Load is low during the semester break. However this may not be the case all the time.
2. A system downtime has been scheduled and jobs are being held. Check the message of the day, which is displayed every time you login, or emails to the LML - GHPC list.
3. You or your group have used a lot of resources in the last few days, causing your job priority to be lowered ("fairness policy"). Priority is a complicated function of many factors, including the processor count and walltime requested, the length of time the job has been waiting, and how much other computing has been done by the user and their group over the last several days.
4. You or your group are at the maximum processor count or running job count and your job is being held.
5. Your job is requesting specialised resources, such as large memory, gpus or certain software licences, that are in high demand.
6. Your job is requesting a lot of resources. It takes time for the resources to become available.
7. Your job is requesting incompatible or nonexistent resources and can never run. For example, if you request 200GB memory, it will never run as the maximum capacity of a node is 170GB distributed among many jobs.
8. Job is unnecessarily stuck in batch hold because of system problems (very rare!).
Qs 49: How do I copy files and folders from the old cluster to the new cluster
From the new login node (gc-prd-hpclogin1.rcs.griffith.edu.au), you can see your old home and scratch:
ls /ngumbin/home/s123456
You will see the files from your old home.
ls /ngumbin/scratch/s123456
Again you will see the files and folders from your old scratch.
If you wish to copy a folder named FAS-917 located in the old scratch into the new scratch, you can do this on the new login node:
screen
cp -r /ngumbin/scratch/s123456/FAS-917 ~/scratch &
You will see the copy. Just press ENTER and type
screen -d
After this you can even logout and the copy will continue till it is done. Please note that screen is used so that the command keeps running even after you logout.
The mount point to the new home and scratch is not available on the old login nodes gowonda/gowonda2; instead, the old home and scratch are available on the new login node "gc-prd-hpclogin1".
Qs 50: How do you set up Griffith VPN?
You may follow the instructions on the Griffith vpn site. https://intranet.secure.griffith.edu.au/computing/remote-access/virtual-private-network
Qs 51: interactive jobs on the gpu queues
You will need to be part of the gpu queue groups: gpuq, dljun and dlyao.
Interactive pbs run
=================
You log into n060 and run this command:
qsub -I -V -q gpuq -l select=1:ncpus=1:ngpus=1:mem=12gb,walltime=10:00:00
(Depending on your needs, you may wish to alter the walltime or other resources like memory etc)
It will start a pbs job in interactive mode with a wall time of 10 hours. If there are resources available currently, the job will start immediately; otherwise you may have to wait.
Other examples:
qsub -V -I -q gpuq -l select=1:ncpus=1:ngpus=1 -l walltime=0:30:00
qsub -V -I -q dljun -l select=1:ncpus=1:ngpus=1 -l walltime=0:30:00
qsub -V -I -q dlyao -l select=1:ncpus=1:ngpus=1 -l walltime=0:30:00
Qs 52: Are there additional storage facilities at Griffith
We offer 3 research storage services at Griffith University for storage of research data. Check out https://research-storage.griffith.edu.au/ for more details. You can take a short quiz at https://research-storage.griffith.edu.au/compare to see which service is most appropriate for you depending on how you would like to use the data. Each service has a separate form to request an account or project space. Please follow the links on the page above to get to them.
On the HPC there is a 200GB quota set for your account, so you would need to upload only the segments of the data that you wish to analyse. This quota can be temporarily increased, but this is done on a case by case basis and needs to be requested through the HPC administrator.
To transfer files to the HPC you can read through the FAQs: https://conf-ers.griffith.edu.au/display/GHCD/FAQ+-+Griffith+HPC+Cluster There is a section on this very question in there.
Qs 53: Sample pbs script to run mpijobs
openmpi
#!/bin/bash
#PBS -m abe
#PBS -M YOUREMAIL@griffith.edu.au
#PBS -N Inversion
#PBS -l select=1:ncpus=16:mpiprocs=16:mem=1gb,walltime=300:00:00
## processor cores
slots=16
cd $PBS_O_WORKDIR
module load mpi/openmpi/4.0.2
module load cmake/3.15.5
echo "Starting job"
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
make clean
make SolveInversion_NoRot
mpiexec $PBS_O_WORKDIR/SolveInversion_NoRot > $PBS_O_WORKDIR/Inversion_export.dat
#NP=`wc -l < $PBS_NODEFILE`
#mpirun -hostfile $PBS_NODEFILE -np $NP mdrun_mpi_d -deffnm md01
intelmpi (MPI_DIR = /sw/intel/ps/2019up5/compilers_and_libraries_2019.5.281/linux/mpi/intel64)
#!/bin/bash
#PBS -m abe
#PBS -M YOUREMAIL@griffith.edu.au
#PBS -N Inversion
#PBS -l select=1:ncpus=16:mpiprocs=16:mem=1gb,walltime=300:00:00
## The number of chunks is given by the select=<NUM> above
## $PBS_NODEFILE is a node-list file created with select and ncpus options by PBS
PROCS=16
module load intel/2019up5/mpi
module load cmake/3.15.5
echo "Starting job"
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
cd $PBS_O_WORKDIR
make clean
make SolveInversion_NoRot
mpiexec -n $PROCS $PBS_O_WORKDIR/SolveInversion_NoRot > $PBS_O_WORKDIR/Inversion_export.dat
mpich
module load mpi/mpich/3.3.2-gnu
Qs 54: How to limit a python program to use a certain number of cpus
A sample pbs script below provides an example
#!/bin/bash
#PBS -m abe
#PBS -M YourEmail@griffith.edu.au
#PBS -N PepDock
#PBS -q dljun@n060
#PBS -W group_list=deeplearning -A deeplearning
### Number of nodes:Number of CPUs:Number of threads per node.
#PBS -l select=1:ncpus=3:mem=12gb,walltime=100:00:00
cd $PBS_O_WORKDIR
NSLOTS=3
##module load galaxyPepDock
module load misc/galaxypepdock/galaxyPepDock
GalaxyPepDock.centos7 -t test -p ACE2.pdb -s RBD-mimic1.fasta
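The script above requests 3 cpus and sets NSLOTS=3, but a program only stays within that limit if it is told how many threads to start. Many Python numerical libraries read environment variables for this. The sketch below is an assumption about what may help, not a confirmed recipe for any particular application: the helper name limit_threads is hypothetical, and whether a given program honours these variables depends on the libraries it is built on.

```shell
#!/bin/bash
# Cap the number of threads started by common numerical back ends
# (OpenMP, Intel MKL, OpenBLAS, numexpr) to the value given in $1.
limit_threads() {
    local n=$1
    export OMP_NUM_THREADS=$n
    export MKL_NUM_THREADS=$n
    export OPENBLAS_NUM_THREADS=$n
    export NUMEXPR_NUM_THREADS=$n
}

NSLOTS=3                      # keep this equal to ncpus in the "#PBS -l select" line
limit_threads "$NSLOTS"
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

In a job script these lines would go before the command that starts the python program.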
Qs 55: How to Use Local Scratch Storage (/lscratch)
The cluster is equipped with
- home file system (your /export/home/snumber). This is common to all compute nodes
- a global scratch file system (your /scratch/snumber or links to it from /export/home/snumber/scratch). This is common to all compute nodes
- a local temporary scratch file system (/lscratch/snumber). A local scratch is only visible from within the compute node it belongs to.
Since /lscratch is a local disk mounted on the node, it is faster than network storage (home or global scratch). Most nodes have limited local disk space (less than 20GB).
Because of this limited disk space, a node may be forced to go offline if a large directory is created within /lscratch on the node and then not deleted once the job ends.
Your pbs job must delete the directory it created after the job has finished running.
A job script template for using the local scratch disk on a compute node is shown below (Ack: Adapted from UOW HPC Guide)
#!/bin/bash
#PBS -N jobName
####PBS -m abe
####PBS -M YourEmail@griffith.edu.au
#PBS -q workq
#PBS -l select=1:ncpus=1:mem=2gb,walltime=5:00:00
#======================================================#
# USER CONFIG
#======================================================#
INPUT_FILE="hello.py"
OUTPUT_FILE="$PBS_JOBNAME.out"
MODULE_NAME="python/3.7.4"
PROGRAM_NAME="python"
# Set as true if you need those /lscratch files.
COPY_SCRATCH_BACK=true
#======================================================#
# MODULE is loaded
#======================================================#
NP=`wc -l < $PBS_NODEFILE`
source /etc/profile.d/modules.sh
module load $MODULE_NAME
cat $PBS_NODEFILE
#======================================================#
# SCRATCH directory is created at the local disks
#======================================================#
SCRDIR=/lscratch/$LOGNAME/$PBS_JOBID
if [ ! -d "$SCRDIR" ]; then
    mkdir -p $SCRDIR
fi
#======================================================#
# TRANSFER input files to the scratch directory
#======================================================#
# just copy input file
cp -r $PBS_O_WORKDIR/$INPUT_FILE $SCRDIR
# copy everything (Option)
#cp -r $PBS_O_WORKDIR/* $SCRDIR
#======================================================#
# PROGRAM is executed with the output or log file
# direct to the working directory
#======================================================#
echo "START TO RUN WORK"
cd $SCRDIR
# Run a system wide sequential program
##$PROGRAM_NAME < $INPUT_FILE >& $PBS_O_WORKDIR/$OUTPUT_FILE
$PROGRAM_NAME $INPUT_FILE >& $SCRDIR/$OUTPUT_FILE
###$PROGRAM_NAME $INPUT_FILE >& $PBS_O_WORKDIR/$OUTPUT_FILE
# Run a MPI program (Option)
###For openmpi, use the following syntax####
#module load mpi/openmpi/4.0.2
#mpiexec $PROGRAM_NAME < $INPUT_FILE >& $OUTPUT_FILE
####For intel mpi, use the following syntax####
#module load intel/2019up5/mpi
#mpiexec -n $NP $PROGRAM_NAME < $INPUT_FILE >& $OUTPUT_FILE
# mpirun -np $NP $PROGRAM_NAME < $INPUT_FILE >& $OUTPUT_FILE
# Run a OpenMP program (Option)
# export OMP_NUM_THREADS=$NP
# $PROGRAM_NAME < $INPUT_FILE >& $OUTPUT_FILE
sleep 60
#======================================================#
# RESULTS are migrated back to the working directory
#======================================================#
if [[ "$COPY_SCRATCH_BACK" == *true* ]]
then
    echo "COPYING SCRATCH FILES TO " $PBS_O_WORKDIR
    cp -rp $SCRDIR/* $PBS_O_WORKDIR
    if [ $? != 0 ]; then
    {
        echo "Sync ERROR: problem copying files from $SCRDIR to $PBS_O_WORKDIR;"
        echo "Contact HPC admin for a solution."
        exit 1
    }
    fi
fi
#======================================================#
# DELETING the local scratch directory
#======================================================#
cd $PBS_O_WORKDIR
if [[ "$SCRDIR" == *scratch* ]]
then
    echo "DELETING SCRATCH DIRECTORY" $SCRDIR
    rm -rf $SCRDIR
    echo "ALL DONE!"
fi
#======================================================#
# ALL DONE
#======================================================#
## End-of-job summary
echo "qstat -H $PBS_JOBID"
echo "qstat -xf $PBS_JOBID"
Another simple example
#!/bin/bash
#PBS -N jobName
###PBS -m abe
###PBS -M Myemail@griffith.edu.au
#PBS -q workq
#PBS -l select=1:ncpus=1:mem=2gb,walltime=5:00:00
##Find the node on which the pbs job is running
PBSCOMPUTENODE=`hostname`
##echo $PBSCOMPUTENODE
##Make the directory, if needed, to copy into on the compute node
mkdir -p /lscratch/$LOGNAME/$PBS_JOBID
cp -rp /export/home/$LOGNAME/Data /lscratch/$LOGNAME/$PBS_JOBID
#Run command to process the data on the lscratch dir. E.g as below
du -kh /lscratch/$LOGNAME/$PBS_JOBID
echo "Hello World" > /lscratch/$LOGNAME/$PBS_JOBID/welcome.txt
#Copy the output to the shared home dir
cp -rp /lscratch/$LOGNAME/$PBS_JOBID/welcome.txt /export/home/$LOGNAME/Data
###This data will now be available on the shared drive.
##FYI: the .o file (output file) will have the node on which this job was run.
#The /lscratch data must be deleted after the copy
rm -r -f /lscratch/$LOGNAME/$PBS_JOBID
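Both scripts above delete the /lscratch directory as their last step, so the directory can be left behind if the script stops early. One way to guard against this is a shell trap that runs the cleanup whenever the script exits. The following is a minimal sketch under stated assumptions: LSCRATCH_ROOT is a hypothetical stand-in for /lscratch (a temporary directory is used when it is not set, so the sketch can be tried outside a job), and the fallback values for LOGNAME and PBS_JOBID exist only for that purpose.

```shell
#!/bin/bash
# Assumption: inside a PBS job LSCRATCH_ROOT would be set to /lscratch and
# PBS_JOBID is provided by PBS; the fallbacks are for trying this elsewhere.
LSCRATCH_ROOT=${LSCRATCH_ROOT:-$(mktemp -d)}
SCRDIR="$LSCRATCH_ROOT/${LOGNAME:-user}/${PBS_JOBID:-testjob}"

# Remove the job's local scratch directory.
cleanup() {
    rm -rf "$SCRDIR"
}
# Run cleanup when the script exits, whether it finished or stopped on an error.
trap cleanup EXIT

mkdir -p "$SCRDIR"
echo "Hello World" > "$SCRDIR/welcome.txt"
ls -l "$SCRDIR"
# ... copy results back to the shared home or scratch here ...
```

Results still have to be copied back before the script ends, because the trap removes the directory on exit.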
Manual copy:
To get the content of a remote /lscratch folder
ssh remotehostname "ls -la /lscratch/snumber"
e.g: ssh n061 "ls -la /lscratch/s5284664"
To copy something from an lscratch folder on a remote host to your scratch folder:
scp -r remotehostname:/lscratch/snumber/FolderName ~/scratch
e.g: scp -r n061:/lscratch/s5284664/AnjuFolder ~/scratch
To copy into a remote host's lscratch folder:
scp -r ~/folder remotehost:/lscratch/snumber
e.g: scp -r /export/home/s5284664/folder n061:/lscratch/s5284664/
Qs.56: NCMAS process and application
NCMAS facilities overview and who should apply
https://youtu.be/7ZZVk4HtdDY
NCMAS process and application 2021
https://youtu.be/hmV_j5GFgI0
Qs 57: What kind of storage and compute is available on Griffith HPC
The following applies generally, but it is best to discuss your particular situation with the cluster admin.
Compute:
All compute resources within Griffith HPC are shared with other researchers. We use a job scheduler called PBS to intelligently schedule jobs. If there is a sudden surge in jobs by others, we can see lots of jobs being queued up, (sometimes for several days). It will all depend on how busy the cluster is.
What can be done if your research group needs a dedicated resource on the Griffith HPC? We have limited rack space available to add dedicated nodes. This means, if your project can buy the nodes (we can arrange a quote on request), they can be racked up solely for use by your project. Please note that rack space is limited and sometimes this option may not be available. You may contact the cluster admin if you need information about obtaining dedicated nodes.
Storage:
Currently, storage is at a premium on the Griffith HPC cluster. The recommended best practice is to keep the home directory and scratch directory under 200GB. We recognise the need for a temporary surge in space and allow it with permission from the cluster administrator, on the expectation that usage will be brought back to normal within a reasonable amount of time. If you need the space a little longer than expected, you will need to contact the cluster admin again to make a suitable arrangement.
We do have shared /project space where a group of researchers from a research group/project can share files and work space. However, due to resource issues, this space was not upgraded with the cluster upgrade in mid 2019. We are still using the old /project space and it is nearly full. Without an upgrade to the storage subsystem, it is not possible to accommodate new projects in this space.
What can you do if you need to bring in a large amount of data? This will depend on how large the data is.
If it is under 500GB, and subject to space availability, we may consider accommodating it within /scratch space. You would need to ask the cluster admin to create a shared folder on /scratch. It will have to be reviewed and renewed yearly. Space management will be the joint responsibility of all members of the research group. An email can be sent automatically when space reaches 50%, 75% and 90% capacity.
If the data is big, there are options, but with some drawbacks that cluster users need to be aware of. The best option (if your project funding allows) is to buy a new storage subsystem that can be hosted within the HPC network. The drawback is that it can be expensive (indicative cost: 30k-50k, but an actual quote can be arranged on request). The advantage is that it will be connected to the fast InfiniBand network within the HPC network and data transfer will be quick. All backend compute nodes will be able to see the data and there will be no additional overhead (like copying in and out).
The other option is to put the data on a storage device like Griffith's research drive and then transfer the data through sftp as it is needed. We are not able to mount the research drive on the cluster head node (gc-prd-hpclogin1/gowonda) at present. A two-hop transfer (research drive to local desktop, then local desktop to HPC) is suggested. We are also not able to mount cloud-based storage like aws, azure, etc. on the cluster. Mounted data (if any, e.g. through sshfs) cannot be used directly for compute by the compute nodes:
- The mount point will not be seen by compute nodes and hence the data needs to be copied to /scratch (or the home directory). Once it is copied to /scratch, it can be seen by all compute nodes and used as working data from then onwards. The problem is that, depending on the size of the data, this can take a while. The output needs to be copied back to the research drive. None of this can be done as part of the pbs script; it needs to be performed manually. Also, there is the risk of leaving the data behind on /scratch space after use and then forgetting about it. Over time, the /scratch space will fill up and disrupt your jobs and other users' HPC operations.
- The copy in and copy out process will be much slower as it will not be using the fast InfiniBand interconnect network within the HPC cluster. It would use the Griffith network to transfer files. This copy in and copy out will be additional overhead for the researcher/research group.
- The shared folder in /scratch will have space limitation (200GB typically). Hence, keeping it well managed is critical.
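The 50%, 75% and 90% notification levels mentioned above can also be checked by hand before a big copy. The following is a minimal sketch only: the function name space_alert is hypothetical, the 200GB limit is the typical figure quoted above, and the real email notification is handled by the cluster admins, not by this script.

```shell
#!/bin/sh
# Hypothetical helper: report the highest alert level (50, 75 or 90)
# that the current usage has reached, or 0 if none has been reached.
# Arguments: used space in GB, space limit in GB (whole numbers).
space_alert() {
    used_gb="$1"
    limit_gb="$2"
    pct=$((used_gb * 100 / limit_gb))   # integer percentage of the limit
    level=0
    for threshold in 50 75 90; do
        if [ "$pct" -ge "$threshold" ]; then
            level="$threshold"
        fi
    done
    echo "$level"
}

# Example: 160GB used out of a 200GB shared folder is 80%, so the 75% level is reached.
space_alert 160 200
```

On the cluster, the first argument could be taken from a command such as du on the shared folder; the exact du options to get a whole number of GB depend on the du version installed, so check "man du" first.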
Data ingress and egress
There is no general internet access from the HPC to the outside world, so large amounts of data cannot be pulled in directly from outside. A limited proxy facility is allowed for application installs. Within the Griffith network, a two-hop transfer via your local desktop to the HPC is suggested.
Please contact cluster admin to discuss this and other cluster issues so that we can enable your HPC work.
Indy |Senior Systems Engineer / Griffith HPC Administrator
eResearch Services (eResearch Support Services), Office of Digital Solutions
Griffith University | Gold Coast Campus | QLD 4222 | G11 Room 4.42
T +61 7 5552 7259 | Mob 0434 600 814| email
griffith.edu.au | HPC User Guide | Submit a Support Ticket
Qs. 58: How can I find the list of nodes and licenses in HPC?
These commands can help:
1. List nodes and usage: pbsnodes -aSj
2. qhost
3. Jobs queued and running: qstat -1an
4. License info:
   Comsol: /usr/local/bin/lmstat -a -c 27006@gc-prd-erslic.corp.griffith.edu.au
   Abaqus: /usr/local/bin/lmstat -a -c 27005@gc-prd-erslic.corp.griffith.edu.au
   ArcGIS: /usr/local/bin/lmstat -a -c 27004@gc-prd-erslic.corp.griffith.edu.au
   Matlab: /usr/local/bin/lmstat -a -c 27001@gc-prd-erslic.corp.griffith.edu.au
   Ansys: /sw/ansys/2020R2/shared_files/licensing/linx64/ansysli_util -liusage
   Ansys: /sw/ansys/2020R2/shared_files/licensing/linx64/ansysli_util -statli 2325@gc-prd-erslic.corp.griffith.edu.au
5. Different queues: qstat -q
6. Additional info: pqueues, pjobs and pnodes
Qs. 59: How do I purchase a licensed software to be installed on the Griffith HPC?
You can start the process by requesting a quote from Griffith software purchasing
https://intranet.secure.griffith.edu.au/computing/software
Click on "Request a Quote"
You can request the software through them and they can guide you through the legal requirements as well.
Qs 60: How to obtain the number of abaqus licenses?
Licenses issued and in use:
/usr/local/bin/lmstat -a -c 27005@gc-prd-erslic.corp.griffith.edu.au | grep 'Users of standard'
Users of standard: (Total of 10 licenses issued; Total of 10 licenses in use)

Remaining licenses:
/usr/local/bin/lmstat -a -c 27005@gc-prd-erslic.corp.griffith.edu.au | grep 'Users of standard' | awk '{ printf("%d\n", $6-$11); }'
0
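The awk arithmetic above subtracts field 11 (licenses in use) from field 6 (licenses issued) of the "Users of ..." line. As a rough sketch, the same idea can be wrapped in a small function so that other feature names can be checked too. The function name remaining_licenses is hypothetical, and it assumes lmstat prints the line in exactly the format shown above.

```shell
#!/bin/sh
# Hypothetical helper: read "lmstat -a" output on stdin and print
# issued minus in-use licenses for the feature named in $1.
# Returns non-zero if no "Users of <feature>:" line is found.
remaining_licenses() {
    awk -v f="$1" '
        # Matches e.g.
        # Users of standard:  (Total of 10 licenses issued;  Total of 10 licenses in use)
        $1 == "Users" && $2 == "of" && $3 == (f ":") { print $6 - $11; found = 1 }
        END { if (!found) exit 1 }
    '
}

# Demo on the sample line shown above (no license server needed).
printf '%s\n' 'Users of standard:  (Total of 10 licenses issued;  Total of 10 licenses in use)' \
    | remaining_licenses standard
```

On the cluster this would be used as: /usr/local/bin/lmstat -a -c 27005@gc-prd-erslic.corp.griffith.edu.au | remaining_licenses standard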
Qs61: How to run R code in parallel
Parallelisation using plyr and doParallel
We have doMC, plyr and doParallel in R/4.0.3
Threads vs. cores
There is often a lot of confusion between CPU threads and cores. A CPU core is the actual computation unit. Threads are a way of multi-tasking, and allow multiple simultaneous tasks to share the same CPU core. Multiple threads do not substitute for multiple cores. Because of this, compute-intensive workloads (like R) are typically only focused on the number of CPU cores available, not threads. (Ref: https://jstaf.github.io/hpc-r/parallel/)
Example:

module load R/4.0.3
R

> library(plyr)
> library(doParallel)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
> cores <- detectCores()
> cores
[1] 72
> registerDoParallel(cores=12)
> fake_func <- function(x) {
+   Sys.sleep(0.1)
+   return(x)
+ }
>
> library(microbenchmark)
> microbenchmark(
+   serial = llply(1:24, fake_func),
+   parallel = llply(1:24, fake_func, .parallel = TRUE),
+   times = 1
+ )
Unit: milliseconds
     expr       min        lq      mean    median        uq       max neval
   serial 2424.3580 2424.3580 2424.3580 2424.3580 2424.3580 2424.3580     1
 parallel  226.2199  226.2199  226.2199  226.2199  226.2199  226.2199     1
>
Qs62: My app is not running properly on the gpu node. It is stuck with no errors but with long application start-up times
You may be running into the issues outlined here

Use this variable:
export CUDA_CACHE_DISABLE=1
This can be added to your pbs script or set from the command line (for interactive runs). Make sure you use the /lscratch directory for all runs. It looks to me that our /scratch and /export/home are too slow.

Here is an explanation from the link above:

Cache stored on a Slow Network Share
============================
On Linux, the default location of the CUDA JIT cache is in your home directory. On clusters, it is not uncommon to mount home directories with relatively poor performance to the compute nodes (by using the Lustre file system for scratch space, but only NFS for the home directory, for example). We have seen cases where this relatively slow connection to the home directory (and thus the JIT cache) resulted in very long application start-up times when the application was not built with code for the right SM version. Even more confusing, start-up time can vary from node to node due to intricacies of the NFS set up. In this situation, it is best to build the application to avoid JIT entirely, and alternatively, to set CUDA_CACHE_PATH to point to a location on a fast file system.
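The quoted explanation also mentions a second option: keep the JIT cache enabled but move it to a fast file system with CUDA_CACHE_PATH. The following is a sketch of that alternative only, not something that has been tested on the cluster. The function name use_local_cuda_cache and the cuda_cache sub-directory are made up for illustration, and it assumes /lscratch/$LOGNAME can be created on the GPU node.

```shell
#!/bin/sh
# Hypothetical helper: point the CUDA JIT cache at a directory on
# node-local storage instead of the (slower) home directory.
# Optional argument: base directory (defaults to /lscratch/$LOGNAME).
use_local_cuda_cache() {
    base="${1:-/lscratch/$LOGNAME}"
    cache_dir="$base/cuda_cache"
    mkdir -p "$cache_dir" || return 1
    export CUDA_CACHE_PATH="$cache_dir"
}

# Demo with a temporary directory so the sketch runs anywhere.
demo_base="$(mktemp -d)"
use_local_cuda_cache "$demo_base"
echo "CUDA_CACHE_PATH=$CUDA_CACHE_PATH"
```

In a pbs script, calling use_local_cuda_cache with no argument before starting the application would place the cache under /lscratch. If CUDA_CACHE_DISABLE=1 is also set, the cache is switched off and the path has no effect, so use one approach or the other.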
Qs63: How do I connect remotely to my files on Griffith's G and H drives and network storage
https://intranet.secure.griffith.edu.au/computing/remote-access#network
Qs64: How do I analyse a core dump
If errors like "Program received signal SIGSEGV: Segmentation fault - invalid memory reference." are received, you may force a core dump to be generated by simply running the program on the login node outside of pbs, e.g:

module load quantum-espresso/6.7.0
PRE='Fe'
turbo_eels.x < ${PRE}tddft.in > ${PRE}tddft.out

This generated a core dump. To analyse a core dump:

gdb program coredump
(gdb) where
(gdb) bt full

Please google "analysing core dump" for various techniques.

e.g:

gdb turbo_eels.x core.36881
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /sw/quantum-espresso/6.7.0/bin/turbo_eels.x...(no debugging symbols found)...done.
[New LWP 36881]
Core was generated by `turbo_eels.x'.
Program terminated with signal 11, Segmentation fault.
#0 0x000000000040d7d1 in lr_alloc_init_k.3770 ()
Missing separate debuginfos, use: debuginfo-install blas-3.4.2-8.el7.x86_64 fftw-libs-double-3.3.3-8.el7.x86_64 glibc-2.17-260.el7_6.6.x86_64 lapack-3.4.2-8.el7.x86_64 libgfortran-4.8.5-36.el7_6.2.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) where
#0 0x000000000040d7d1 in lr_alloc_init_k.3770 ()
#1 0x0000000000413ea0 in lr_alloc_init_ ()
#2 0x0000000000406bde in MAIN__ ()
#3 0x0000000000408fec in main ()

Other stuff:
===========
readelf -Wa core.36881
objdump -s core.36881
Qs 65: Why is my GPU job not running when there are free GPU resources available
Question: I have submitted two jobs to the n060 node, but one of my jobs is stuck in the queue. The same has happened in the gpuq. I can see that some GPUs are available. What do I need to do?

Answer: Please run the command qstatt to check what is running on n060:

qstatt
94437.gc-prd-hp  s5084400  dljun  KF_RD_F2    37878  1  1  100gb  100:0  R  20:41  n060/1
94493.gc-prd-hp  s5084397  dljun  OUR         17110  1  1  100gb  200:0  R  01:41  n060/2
94501.gc-prd-hp  s5084400  dljun  KF_PN_F2_3  20741  1  1  100gb  100:0  R  00:18  n060/0
94503.gc-prd-hp  s5084397  dljun  OUR_2_32    --     1  1  100gb  200:0  Q  --     --
94504.gc-prd-hp  s5084397  gpuq   Our_2_32    --     1  1  100gb  200:0  Q  --     --

The other command to run is:

gpustat
n060 Thu Mar 25 11:09:19 2021
[0] Tesla V100-PCIE-32GB | 54'C, 90 % | 31114 / 32480 MB | s5084400(31103M)
[1] Tesla V100-PCIE-32GB | 51'C, 91 % | 31114 / 32480 MB | s5084397(31103M)
[2] Tesla V100-PCIE-32GB | 33'C,  0 % |     0 / 32480 MB |
[3] Tesla V100-PCIE-32GB | 32'C,  0 % |     0 / 32480 MB |
[4] Tesla V100-PCIE-32GB | 49'C, 92 % | 31114 / 32480 MB | s5084400(31103M)
[5] Tesla V100-PCIE-32GB | 33'C,  0 % |     0 / 32480 MB |
[6] Tesla V100-PCIE-32GB | 31'C,  0 % |     0 / 32480 MB |
[7] Tesla V100-PCIE-32GB | 33'C,  4 % |     0 / 32480 MB |

Run the command "free" to find how much memory is available in total:

free
              total        used        free      shared  buff/cache   available
Mem:      395591556    17993044    40322676      710764   337275836   375287544
Swap:       2097148       17408     2079740

gpustat shows that there are free GPUs. The command qstatt shows what is running: 3 jobs, each with a 100GB memory request, which means a total of 300GB for these jobs (never mind whether they are actually using that memory or not).

The problem is clear. The command "free" tells you the maximum amount of memory available on that node, which is 375GB (the last column). 300GB of that is already committed to the running jobs, so any additional job request has to fit within the remaining 75GB. Each of the queued jobs requests a further 100GB, which the node does not have, as it would go over the 375GB maximum. So you must wait for the running jobs to finish or reduce your memory request (e.g. to less than 75GB in this case).

The bottleneck could be the CPUs, walltime, GPUs, memory (as in this example), etc. I hope this explains why the jobs are still queued. There are also cloud options (e.g. Microsoft Azure) that are available to all researchers on a paid basis if your research group has the funding for it.
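The memory reasoning above is simple arithmetic and can be sketched as follows. This is only an illustration of the calculation, not how PBS itself decides: the function name fits_on_node is hypothetical, and the numbers are the ones from the n060 example (375GB on the node, three running jobs requesting 100GB each).

```shell
#!/bin/sh
# Hypothetical helper: does a new memory request fit on a node?
# Arguments: node memory in GB, new request in GB, then the GB
# requested by each job already running on that node.
fits_on_node() {
    capacity_gb="$1"
    request_gb="$2"
    shift 2
    committed_gb=0
    for job_gb in "$@"; do
        committed_gb=$((committed_gb + job_gb))
    done
    headroom_gb=$((capacity_gb - committed_gb))
    echo "headroom=${headroom_gb}GB request=${request_gb}GB"
    [ "$request_gb" -le "$headroom_gb" ]
}

# The queued job from the example: 100GB does not fit in the remaining 75GB.
fits_on_node 375 100 100 100 100 || echo "100GB request stays queued"
# A smaller request would fit.
fits_on_node 375 70 100 100 100 && echo "70GB request can start"
```

The same check applies to any other resource that can become the bottleneck (CPUs, GPUs): compare what you ask for with what is left after the running jobs' requests are subtracted.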
Qs66: How do I parallelise my run
Ref: https://www.quantum-espresso.org/Doc/user_guide/node18.html
Understanding Parallelism
Broadly, there are two different parallelization paradigms
- Message-Passing (MPI). A copy of the executable runs on each CPU; each copy lives in a different world, with its own private set of data, and communicates with other executables only via calls to MPI libraries. MPI parallelization requires compilation for parallel execution, linking with MPI libraries, execution using a launcher program (depending upon the specific machine). The number of CPUs used is specified at run-time either as an option to the launcher or by the batch queue system.
- OpenMP. A single executable spawns subprocesses (threads) that perform specific tasks in parallel. OpenMP can be implemented via compiler directives (explicit OpenMP) or via multithreading libraries (library OpenMP). Explicit OpenMP requires compilation for OpenMP execution; library OpenMP requires only linking to a multithreading version of mathematical libraries, e.g.: ESSLSMP, ACML_MP, MKL (the latter is natively multithreading). The number of threads is specified at run-time in the environment variable OMP_NUM_THREADS.
MPI is the well-established, general-purpose parallelization. In QUANTUM ESPRESSO several parallelization levels, specified at run-time via command-line options to the executable, are implemented with MPI. This is your first choice for execution on a parallel machine.
The support for explicit OpenMP is steadily improving. Explicit OpenMP can be used together with MPI and also together with library OpenMP. Beware conflicts between the various kinds of parallelization! If you don't know how to run MPI processes and OpenMP threads in a controlled manner, forget about mixed OpenMP-MPI parallelization.
A lot of examples have been given for mpi runs (search for mpi in this FAQ).
Ref: https://www2.le.ac.uk/offices/itservices/ithelp/services/hpc/dirac/run-a-computation/job-types
Open MP/Threaded jobs
OpenMP and threaded jobs are parallel in nature, but only scale as far as the resources available within a single node. They cannot take advantage of processors across multiple compute nodes.
For these jobs, you must additionally request the number of processors that the job will require. As the job cannot be spread across more than one node, the number of chunks must equal 1 (select=1), and ncpus can be any value from 1 to the number of physical cores per node (max 72 for gc-prd-hpcn nodes; 16 or less for the older n00 nodes). In the example below, 8 cores on a single node have been requested:
#!/bin/bash
#PBS -m abe
#PBS -M i.siva@griffith.edu.au
#PBS -N SimpleTest
#PBS -q workq
#PBS -l select=1:ncpus=8:mem=1g,walltime=01:01:10
module load intel/intelParallelStudio2019
export OMP_NUM_THREADS=$NCPUS
cd $PBS_O_WORKDIR
./hello_world.exe
Here is the source code for hello_world
hello_world.c
#include <stdio.h>
#include <omp.h>

#define NPROCS 8

int main (int argc, char *argv[])
{
    int nthreads, num_threads = NPROCS, tid;

    /* Set the number of threads (this takes precedence over OMP_NUM_THREADS) */
    omp_set_num_threads(num_threads);

    /* Fork a team of threads giving them their own copies of variables */
    #pragma omp parallel private(nthreads, tid)
    {
        /* Each thread obtains its thread number */
        tid = omp_get_thread_num();

        /* Each thread executes this print */
        printf("Hello World from thread = %d\n", tid);

        /* Only the master thread does this */
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Total number of threads = %d\n", nthreads);
        }
    } /* All threads join master thread and disband */

    return 0;
}
Sample output
./hello_world.exe
Hello World from thread = 4
Hello World from thread = 0
Total number of threads = 8
Hello World from thread = 2
Hello World from thread = 3
Hello World from thread = 5
Hello World from thread = 6
Hello World from thread = 7
Hello World from thread = 1
Qs67: What are the extra benefits of using National Computing Infrastructure over in-house HPC?
https://my.nci.org.au/mancini/ncmas/2022/
"Simple answer, size. Most who put care into NCMAS to get an allocation could not survive or build a research career on inhouse HPC.
The national facilities might also offer access to large ram or GPU, or particular software, that is not available on inhouse HPC.
If inhouse resources are sufficient to do the research required then putting the effort into an NCMAS application is not worth it.
But if research stagnates or projects are put on hold or are not done as there are not enough resources available
to do them then getting more from external facilities is one way to go."
Qs68: NCI access to Griffith Researchers
Griffith researchers can get access to NCI through QCIF's NCI share.
https://www.qriscloud.org.au/index.php/services
Please select the QCIF's NCI share from the link above.
(You will see NCI share as an option under QRIScompute in the 2nd column).
Further details can be obtained from QCIF contact person, Marlies Hankel.
Additionally and in parallel, you can also apply directly when the application opens.
Please note that projects will be given a fixed allocation which is given per quarter on a use it or lose it basis. Allocations cannot be carried forward or backward into other quarters. Standard disk space per project is 75GB in /scratch and if a project needs more you will need to contact help@nci.org.au.
Students cannot be a lead CI on an NCI project; however, for the QCIF share postdocs can be. For NCMAS the lead CI is required to have an ARC or NHMRC grant or equivalent, which is why larger groups apply for NCMAS. A grant is not required for a project under QCIF. However, the QCIF allocations are small, around 20-50 thousand per quarter. Larger allocations are only available through NCMAS.
Some applications like Mathematica and Matlab are licensed software. Mathematica is only available to ANU researchers on NCI. For Matlab, Griffith will need to get in contact with NCI to set up their institutional license. At the moment this is not available, so one cannot use it unless you have your own license. Even in that case you would need to get in touch with NCI first to see if you can use Matlab on Gadi or not.
In general, allocations are given in service units SUs. 1 core hour is charged at 2 SUs. So if you have a calculation running using 4 cores and taking 48 hours then you will be charged 4*48*2=384 SUs for that calculation.
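The charge calculation above can be sketched as a one-line function, which is handy when estimating how far a quarterly allocation will go. The function name su_cost is hypothetical; the default rate of 2 SUs per core hour is the figure quoted above, and the optional third argument is only there because some queues may charge a different rate.

```shell
#!/bin/sh
# Hypothetical helper: service units charged for one calculation.
# Arguments: cores, hours, and optionally SUs per core hour (default 2).
su_cost() {
    cores="$1"
    hours="$2"
    rate="${3:-2}"
    echo $((cores * hours * rate))
}

# The example above: 4 cores for 48 hours at 2 SUs per core hour.
su_cost 4 48
```

This prints 384, matching the worked example. Check the current charge rate for your queue with NCI before relying on an estimate.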
If a larger disk space (e.g. 300GB) is needed, you would need to talk to NCI to increase the space in /scratch to accommodate this. If more RAM (e.g. 400GB) is needed, you would need to make sure you run in a queue that supports that RAM request. Such queues could be charged at more than 2 SUs per core hour though, so you would need to factor that in.
But talk to NCI, help@nci.org.au, first to see if you could use the application (e.g. Matlab) on NCI before you even consider applying for an allocation on NCI.
Qs69: How to install bioinformatics software in your home directory
Requirement: install fastqc v0.11.9, samtools v1.15.1, bcftools v1.15.1, bwa v0.7.17 (r1188), bwa-mem2 v2.2.1 and GATK v4.2.6.1.

Please check to make sure these packages are available through conda. Once that is confirmed, do this:

To get internet access from the cluster, run this command:
source /usr/local/bin/s3proxy.sh

Load the anaconda module, which is used to create the virtual environment in which the software is installed (if available through conda):
module load anaconda3/2022.10
conda search -c bioconda gatk4

We get the following results. Repeat the same for the other applications if needed.
fastqc    0.11.9   0            bioconda
samtools  1.15.1   h1170115_0   bioconda
bcftools  1.15.1   h0ea216a_0   bioconda
bwa       0.7.17   pl5.22.0_2   bioconda
bwa-mem2  2.2.1    he513fc3_0   bioconda
gatk      3.8      py36_4       bioconda
gatk4     4.2.6.1  hdfd78af_0   bioconda
As you can see, all are available in the version required.

If you do not have a virtual environment already, you may create one like this, specifying the version of python needed if required:
mkdir -p ~/.conda/envs; mkdir -p ~/.conda/pkgs
Edit the ~/.condarc file (nano ~/.condarc) and place the following content in it:
channels:
  - defaults
Then do the following to create the virtual environment:
conda create --name environmentName
e.g: conda create --name javed
Activate this environment by doing this:
source activate javed

Once you are in the virtual environment, you simply use the conda command to install the version of the application you need, e.g:
conda install -c bioconda fastqc=0.11.9
conda install -c bioconda samtools=1.15.1
conda install -c bioconda bcftools=1.15.1
conda install -c bioconda bwa=0.7.17
conda install -c bioconda bwa-mem2=2.2.1
conda install -c bioconda gatk4=4.2.6.1

Once installed, these applications will be available in your pbs script by having these 2 lines in the pbs script:
module load anaconda3/2021.11
source activate javed (or whatever the name of the virtual env)

Sometimes an install request may conflict with other applications already installed in a particular virtual environment. Simply create a new virtual environment in which to install the problem application:
source deactivate javed
conda create --name javed2
source activate javed2

You may have to load further modules to enable the install (e.g: module load library/zlib/1.2.12) or install updated versions of dependencies through conda (e.g: conda install zlib=1.2.12).

Now that you have installed everything, you can use it in a pbs script (see below for cat ~/pbs/pbs.01). To submit the job, you can do this: qsub pbs.01

#!/bin/bash
#PBS -N MyTest
#PBS -m abe
#PBS -M myEmail@griffithuni.edu.au
#PBS -q workq
#PBS -l select=1:ncpus=1:mem=12gb,walltime=5:00:00
module load anaconda3/2021.11
source activate javed
echo "Starting job: "
cd $PBS_O_WORKDIR
tensorboard -help
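The six pinned installs listed earlier in this answer can be driven from a single list, which makes it easier to repeat the install in a second environment. This is a sketch only: the function name install_pinned and the DRY_RUN switch are made up for illustration, and by default it just prints the commands so they can be reviewed first. It assumes the anaconda module is loaded and the target environment is already activated when run for real.

```shell
#!/bin/sh
# Package versions taken from the requirement list in this answer.
PACKAGES="fastqc=0.11.9 samtools=1.15.1 bcftools=1.15.1 bwa=0.7.17 bwa-mem2=2.2.1 gatk4=4.2.6.1"

# Hypothetical helper: with DRY_RUN=1 (the default) only print the
# conda commands; with DRY_RUN=0 run them, stopping at the first failure.
install_pinned() {
    for pkg in $PACKAGES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "conda install -y -c bioconda $pkg"
        else
            conda install -y -c bioconda "$pkg" || return 1
        fi
    done
}

# Review the commands first.
install_pinned
```

To perform the install, run it as: DRY_RUN=0 install_pinned. The -y flag answers conda's confirmation prompt automatically, so leave it out if you prefer to confirm each install by hand.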
Qs70: How do I run Jupyter notebook on the HPC cluster
Please have a look at:
Qs71: Can I do x11 forwarding and run GUI applications on the HPC cluster
The login or head node of each cluster is a resource that is shared by many users. Running a GUI job on the login node is prohibited and may adversely affect other users. X11 Forwarding is only possible for interactive jobs.
Please note that there is a performance penalty when running a GUI job on the compute nodes using the method outlined below.
Set up X11 forwarding
To use X11 forwarding, first install the Xming X Server on your Windows laptop/desktop. Install the Xming fonts package as well.
See instructions here: https://griffith.atlassian.net/wiki/spaces/GHCD/pages/4035477/xming
Install an ssh client, e.g. PuTTY, by following these instructions:
https://griffith.atlassian.net/wiki/spaces/GHCD/pages/4030965/putty
On a Mac desktop/laptop, install XQuartz (https://www.xquartz.org/)
On a linux laptop/desktop:
ssh -Y gc-prd-hpclogin1.rcs.griffith.edu.au
Once you are logged on the login node, do the following
Test if X11 is working by typing: xclock
If the clock pops up, it means X11 forwarding is working. Do not run any other GUI applications on the login node, as it is a shared node.
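If xclock does not appear, a quick first check is whether the DISPLAY variable was set by your ssh session, since X11 programs use it to find your screen. The snippet below is a small sketch of that check; the function name x11_ready is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if DISPLAY is set, i.e. the
# ssh session was started with X11 forwarding enabled.
x11_ready() {
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is not set: log in again with ssh -Y (or enable X11 forwarding in your ssh client)" >&2
        return 1
    fi
    echo "X11 forwarding looks active (DISPLAY=$DISPLAY)"
}

# Report the state of the current session without stopping the script.
x11_ready || true
```

A set DISPLAY does not guarantee that the X server on your desktop is running, so xclock remains the real test.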
Set up pbs job with X11 forwarding
X11 Forwarding is only possible for interactive jobs
Sample interactive run
# On the login node (which you have sshed into with X11 forwarding), run the following command to submit an interactive pbs job:
qsub -I -X -q workq -l select=1:ncpus=1:mem=8gb,walltime=24:00:00

# If you need a specific node, you can name the node as follows:
qsub -I -X -q workq -l select=1:ncpus=1:host=gc-prd-hpcn006:mem=8gb,walltime=24:00:00
Start your gui application
Once the interactive job runs on a compute node, please run your GUI application :
e.g:
xclock (to test if X11 forwarding worked)
##module load matlab/2021a
##matlab
- Because the connection from the PBS job to your local system must be maintained, you cannot exit from either your local system or from the login node window until the job completes.
- Application performance on remote X11 servers deteriorates with latency, so unless your local system is physically close to the login node, job performance may not be optimal.
Here is a step by step instruction
From your local computer:
ssh -Y snumber@10.250.250.3

Once you log in, make sure you are able to open "xclock".

Once you are happy xclock works:
cd <dataDirectory>

Start pbs with X11 forwarding and in interactive mode (this is only possible in interactive mode), e.g:
qsub -X -I -l select=1:ncpus=1:mem=12gb,walltime=5:00:00

Once the job has been placed on a compute node, you will be taken there. The prompt will change from snumber@gc-prd-hpclogin1 to one of the compute nodes, e.g. snumber@gc-prd-hpcn003.

Now type "xclock" to make sure all is well. Then do this:
cd $PBS_O_WORKDIR
module load misc/febio/3.6
module load matlab/2021a
module load anaconda3/2021.11
source activate nataliya
cp ~/pathdef.m .
# Run your matlab here
matlab -nodisplay -nosplash -nodesktop -r "run('/export/home/snumber/FEA_scripts/Mat_Visco_PinkySil/C_inverse_FEA_uniaxial_viscoelastic.m');exit;"
Qs72: How do I run Singularity container on the cluster?
We will show you with an example. We have a container named
"trinityrnaseq.v2.14.0.simg" located in /sw/Containers/singularity/images/
a. Copy the container to your home directory ~/Containers
mkdir ~/Containers
cp -i /sw/Containers/singularity/images/trinityrnaseq.v2.14.0.simg ~/Containers
b. Create a run file inside the scratch folder
mkdir ~/scratch/jobs

Create a run file inside this folder (nano ~/scratch/jobs/trinity.sh):

cat trinity.sh
#!/bin/bash
dir=/scratch/jobs/sz_ons_totalRNA
out_dir=/scratch/jobs/trinity_out
Trinity --seqType fq --max_memory 19G --normalize_reads --left $dir/100020001_1P.fastq --right $dir/100020001_2P.fastq --SS_lib_type RF --CPU 15 --output $out_dir/10201_r1r2_trinity_out

Make it executable:
chmod +x trinity.sh

Make sure the data is available inside /export/home/s123456/scratch/jobs (or wherever in scratch).
c. Sample pbs script named sin.pbs1
#!/bin/bash
#PBS -m abe
#PBS -M caio.damski@griffithuni.edu.au
#PBS -N TestMitoZ
#PBS -q workq
#PBS -l select=1:ncpus=16:mem=21gb,walltime=19:00:00
cd $PBS_O_WORKDIR
singularity exec -B /scratch/s123456:/scratch --pwd /scratch /export/home/s123456/Containers/trinityrnaseq.v2.14.0.simg "/scratch/jobs/trinity.sh"
exit
sleep 2
Qs73: How do I make my conda environment available to other users in the team
https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html

Exporting the environment.yml file
Note: If you already have an environment.yml file in your current directory, it will be overwritten during this task.

Activate the environment to export (replace myenv with the name of the environment):
module load anaconda3/2023.09
source activate myenv

Export your active environment to a new file:
conda env export > environment.yml
Note: This file handles both the environment's pip packages and conda packages.

Email or copy the exported environment.yml file to the other person.

To set up this environment:
module load anaconda3/2023.09
source /usr/local/bin/s3proxy.sh
conda env create -n ENVNAME --file ENV.yml
e.g:
conda env create -n labmcdougall --file /tmp/environment.yml
conda env create -n labmcdougallRoot --file /tmp/environment.yml

To remove an environment:
conda remove -n env --all
If an environment is already available, and you wish to create a copy locally,
mkdir -p ~/.conda/envs
mkdir -p ~/.conda/pkgs

vi ~/.condarc
channels:
  - defaults

module load anaconda3/2022.10
source /usr/local/bin/s3proxy.sh

e.g. if an env called n061 needs to be copied:
cd ~/.conda/envs    # This step is important
conda create --prefix=n061 --clone /sw/anaconda3/2022.10/envs/n061
source activate ~/.conda/envs/n061
Qs74: How to use the bastion server (jump host) to log into the cluster
HPC Bastion servers provide Multi-Factor Authentication (MFA) as an additional layer of cybersecurity. One will need to use appropriate methods that Griffith supports (pingID app, yubi keys, etc) to authenticate. Unfortunately this option is no longer available to HPC users.
ssh -o ProxyCommand="ssh -W %h:%p s123456@gc-prd-bastion-1.itc.griffith.edu.au" s123456@10.250.250.3

OR

ssh -l s123456 \
    -o 'ProxyCommand ssh -l s123456 %h nc 10.250.250.3 22' \
    -o 'HostKeyAlias 10.250.250.3' \
    gc-prd-bastion-1.itc.griffith.edu.au

You may update your ~/.ssh/config file to make the command simpler:

Host hpclogin1
    Hostname 10.250.250.3
    ProxyCommand ssh s123456@gc-prd-bastion-1.itc.griffith.edu.au -W %h:%p

Now all you have to do is type the following ssh command:
ssh s123456@hpclogin1

Note 1: Multiple jump hosts
===========================
OpenSSH version 7.3 or above: if there are multiple jump hosts, you can give them to ProxyJump as a comma-separated list and the servers will be visited in the order listed:

Host hpclogin1
    Hostname 10.250.250.3
    ProxyJump s123456@gc-prd-bastion-1.itc.griffith.edu.au,s123456@na-prd-bastion-1.itc.griffith.edu.au
    User s123456

Note 2: Multihop transfers
===========================
sftp -o ProxyCommand="ssh -W %h:%p s123456@gc-prd-bastion-1.itc.griffith.edu.au" s123456@10.250.250.3
scp -o ProxyCommand="ssh -W %h:%p user1@server1" user2@server2:/<remotePath> <localpath>

To copy a file named core.10437 from the HPC login node to the local machine:
=====================================================================
scp -o ProxyCommand="ssh -W %h:%p s123456@gc-prd-bastion-1.itc.griffith.edu.au" s123456@10.250.250.3:/export/home/s123456/core.10437 .

To copy a directory named tmp2 from the local machine to the HPC login node:
====================================================================
scp -o ProxyCommand="ssh -W %h:%p s123456@gc-prd-bastion-1.itc.griffith.edu.au" -r tmp2 s123456@10.250.250.3:/export/home/s123456/
Qs75: How to use MFA on the QRIScloud cluster bunya (UQ cluster) to transfer files
UQ's bunya uses google authenticator.
You may use CyberDuck for sftp connections. The popular Windows client, WinSCP, works very well with MFA. Unfortunately the popular Windows/Linux client, FileZilla, does not handle the second authentication step, so on Linux clients always work from a terminal (CLI). Mac users can use CyberDuck.
Mac/Linux users can also use sshfs to mount the remote directory. (This may also be available for Windows users, though we have not tested it.)
Ref 1: https://osxfuse.github.io/
Ref2: https://phoenixnap.com/kb/sshfs
e.g
mkdir -p ~/mnt/bunya
sshfs myusername@bunya.rcc.uq.edu.au:/home/myusername/ ~/mnt/bunya
One-time password (OATH) for `myusername':
Use the native file explorer that comes with the OS to browse this folder and transfer files.
Bunya Reference: https://services.qriscloud.org.au/credential
Qs76: My job is not running. How do I check if resources are available
qhost|grep gc-prd|grep -v login
vnode           state           OS       hardware host            queue      mem      ncpus   nmics   ngpus   comment
--------------- --------------- -------- -------- --------------- ---------- -------- ------- ------- ------- ---------
gc-prd-hpcn002  job-busy        --       --       gc-prd-hpcn002  --         188gb    72      0       0       --
gc-prd-hpcn003  job-busy        --       --       gc-prd-hpcn003  --         188gb    72      0       0       --
gc-prd-hpcn004  free            --       --       gc-prd-hpcn004  --         188gb    72      0       0       --
gc-prd-hpcn005  job-busy        --       --       gc-prd-hpcn005  --         188gb    72      0       0       --
gc-prd-hpcn006  job-busy        --       --       gc-prd-hpcn006  --         188gb    72      0       0       --
gc-prd-hpcn001  job-busy        --       --       gc-prd-hpcn001  --         188gb    72      0       0       --
pbsnodes -aSj|grep gc-prd|grep -v login
                                                    mem          ncpus   nmics   ngpus
vnode           state           njobs  run   susp   f/t          f/t     f/t     f/t     jobs
--------------- --------------- ------ ----- ------ ------------ ------- ------- ------- -------
gc-prd-hpcn002  job-busy        5      5     0      38gb/188gb   0/72    0/0     0/0     214921,214922,214923,214924,215171
gc-prd-hpcn003  job-busy        5      5     0      38gb/188gb   0/72    0/0     0/0     214925,214926,214927,214942,215172
gc-prd-hpcn004  free            6      6     0      2gb/188gb    6/72    0/0     0/0     170274,214943,214944,215089,215176,215236
gc-prd-hpcn005  job-busy        5      5     0      38gb/188gb   0/72    0/0     0/0     214928,214929,214930,214931,215173
gc-prd-hpcn006  job-busy        5      5     0      38gb/188gb   0/72    0/0     0/0     214932,214933,214934,214935,215174
gc-prd-hpcn001  job-busy        5      5     0      38gb/188gb   0/72    0/0     0/0     214936,214937,214938,214939,215175
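The listing above can also be filtered by a short script. The following is a minimal Python sketch, assuming the column layout shown above (free/total memory in the sixth column, free/total CPUs in the seventh). The function name free_resources and the embedded SAMPLE text are illustrative only and are not part of PBS.

```python
import re

# Two rows copied from the listing above; in practice this text would be the
# captured output of `pbsnodes -aSj` (job lists shortened here).
SAMPLE = """\
gc-prd-hpcn002  job-busy  5 5 0 38gb/188gb  0/72 0/0 0/0 214921,214922
gc-prd-hpcn004  free      6 6 0  2gb/188gb  6/72 0/0 0/0 170274,214943
"""


def free_resources(text):
    """Return {node: (free_mem_gb, free_cpus)} for nodes with a free CPU."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # too short to be a node row
        mem = re.fullmatch(r"(\d+)gb/(\d+)gb", fields[5])
        cpu = re.fullmatch(r"(\d+)/(\d+)", fields[6])
        if not mem or not cpu:
            continue  # header or separator line
        free_cpus = int(cpu.group(1))
        if free_cpus > 0:
            result[fields[0]] = (int(mem.group(1)), free_cpus)
    return result


if __name__ == "__main__":
    for node, (mem_gb, cpus) in free_resources(SAMPLE).items():
        print(f"{node}: {mem_gb}gb memory and {cpus} cpus free")
```

Only rows whose memory and CPU fields match the free/total pattern are considered, so header and separator lines are skipped without special handling.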
You may try sending the job to queues named:
Qs77: Reusing SSH connections
Mac/Linux
To make it more convenient for users who use multiple terminal sessions simultaneously, SSH can reuse an existing connection if connecting from Linux or Mac. After the initial login, subsequent terminals can use that connection, eliminating the need to enter the username and password for every connection. To enable this feature, add the following lines to your ~/.ssh/config file:
Host 10.250.250.*
  ControlMaster auto
  ControlPath /tmp/%r@%h:%p
  ControlPersist 2h
If you do not yet have a ~/.ssh/config file, simply create it and set the permissions appropriately first:
touch ~/.ssh/config && chmod 600 ~/.ssh/config
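The same setup can be scripted. This is a minimal Python sketch, assuming you want the stanza appended only once and the file kept at mode 600. The helper name add_stanza is hypothetical, and the demonstration writes to a throwaway directory rather than your real ~/.ssh.

```python
import os
import tempfile

STANZA = """\
Host 10.250.250.*
    ControlMaster auto
    ControlPath /tmp/%r@%h:%p
    ControlPersist 2h
"""


def add_stanza(config_path, stanza=STANZA):
    """Create config_path with mode 600 if needed and append stanza once.

    Returns True if the stanza was written, False if it was already there.
    """
    os.makedirs(os.path.dirname(config_path), mode=0o700, exist_ok=True)
    # O_CREAT with mode 0o600 mirrors `touch` followed by `chmod 600`.
    fd = os.open(config_path, os.O_RDWR | os.O_CREAT, 0o600)
    with os.fdopen(fd, "r+") as handle:
        existing = handle.read()
        if stanza in existing:
            return False
        if existing and not existing.endswith("\n"):
            handle.write("\n")
        handle.write(stanza)
    os.chmod(config_path, 0o600)  # also tightens a pre-existing file
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, ".ssh", "config")
        print(add_stanza(path))  # first call appends the stanza
        print(add_stanza(path))  # second call finds it already present
```

To apply it to your own account you would pass os.path.expanduser("~/.ssh/config") as the path.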
Windows (PuTTY)
To enable connection reuse in PuTTY, enable the “Share SSH connections if possible” option under the “SSH” configuration section.
First, select the saved PuTTY config for the cluster and click “Load”. Then enable “Share SSH connections if possible” via the SSH option in the Connection category. The same option can be set when configuring a new connection.
As long as your existing connection remains active, you can start new sessions without re-authenticating by using the Duplicate Session command.
Qs78: winscp connections fail with error
Ans:
- You have more than 2 devices configured with pingID (e.g. apps on multiple phones or a desktop).
- If you have 2 or fewer apps configured, try reinstalling the pingID app.
- You have another authenticator, such as Google Authenticator, installed. This does not work well with WinSCP on the Griffith HPC cluster. You must configure the pingID app.
Contact IT support on 07 373 55555 if you have any issues re-installing the pingID app.
Qs79: List of ssh/sftp clients
ssh clients
=======
multiplatform:
Windows:
- http://mobaxterm.mobatek.net/download-home-edition.html
- putty
- FileZilla
- Windows WSL lets you run the Linux versions of ssh under Windows. Run: wsl --install
  This should give you the command-line tools ssh, scp, and sftp.
scp/sftp clients
===========
multiplatform: CyberDuck
Windows: WinSCP (recommended), FileZilla
In Ubuntu 14.04 it is under Files > Connect to Server in the menu, or Network > Connect to Server in the sidebar. In Connect to Server, enter: ssh://s123456@10.250.250.3
In Fedora, go to menu File → Connect To Server, select the appropriate protocol, enter the required details and simply connect. Just make sure that the SSH server is running on the other side. It works great.
Others include nautilus, thunar, gftp, gigolo, etc.
Or use wine to run windows apps.
- Run: sudo apt-get install wine
  (Run this one time only, to get 'wine' on your system if you don’t have it.)
- Download the latest WinSCP portable package: https://winscp.net/eng/download.php
- Make a folder and put the content of the ZIP file in this folder
- Open a terminal
- Type: wine WinSCP.exe
Done! WinSCP will run like in a Windows environment!
Usage of the Linux scp command is as follows.
To transfer from the remote host to the local machine:
scp -r remotehostname:/export/home/snumber/FolderName ~/
Qs80: Tunnel setup
Tunneling setup:
From command line e.g:
ssh -N -f -L 8889:gc-prd-hpcn002:8889 s123456@gc-prd-hpclogin1.rcs.griffith.edu.au
Or use an app like:
https://davrodpin.github.io/mole/#windows
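If you open tunnels to different nodes or ports often, the command line above can be assembled by a small helper. This is a minimal Python sketch, assuming the same flags as the example (-N runs no remote command, -f sends ssh to the background, -L forwards a local port). The function name tunnel_command is hypothetical.

```python
import shlex


def tunnel_command(local_port, node, remote_port, user, login_host):
    """Build the ssh argument list for a local port forward via the login node."""
    return [
        "ssh", "-N", "-f",
        "-L", f"{local_port}:{node}:{remote_port}",
        f"{user}@{login_host}",
    ]


# Values taken from the example above.
cmd = tunnel_command(8889, "gc-prd-hpcn002", 8889, "s123456",
                     "gc-prd-hpclogin1.rcs.griffith.edu.au")
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run(cmd) without going through a shell, and shlex.join is used only to display it.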
Qs81: Sending a large file to external collaborators
Qs82: Explain how to improve the chances of a job run
You can investigate which queue currently has free resources with these commands:
pbsnodes -aSj
qhost
qstat -q
qmgr -c 'print queue longbigmem'   #substitute longbigmem with the queue name
However, not all queues can be used by everyone, as some are reserved for researchers who bought dedicated compute nodes from their own funds. Examples of these queues are: dljun, dlyao, sparks2, omero, aspen, gpuq, gpuq2, etc. Please also note there are walltime and memory restrictions on some queues. The default queue, workq, has the fewest restrictions and is recommended for general purpose runs. Some queues may have just one node configured, and some nodes may be down for hardware reasons.
As an example, let's look at the queue named 'small_long':
>>>>>>>>
qhost|grep small_long
n017 free -- -- n017 small_long 47gb 12 0 0 --
This shows one node is configured for this queue. Next, let's check what is currently available on this node:
pbsnodes -aSj | egrep "mem|vnode|n017"
                                mem        ncpus  nmics  ngpus
vnode  state  njobs  run  susp  f/t        f/t    f/t    f/t    jobs
n017   free   1      1    0     35gb/47gb  11/12  0/0    0/0    221143
We can see it has 35GB free out of a total of 47GB, and 11 cores free out of 12. So if your job asks for 35GB or less and 11 cores or fewer, and is sent to this queue, it will probably run. Please note that free resources change dynamically.
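The rule of thumb above (request no more than the free memory and free cores on the node) can be written as a small check. This is a minimal Python sketch, assuming the free/total fields are copied by hand from the pbsnodes output. The function names parse_free_total and job_fits are hypothetical.

```python
def parse_free_total(field):
    """Split a 'free/total' field such as '35gb/47gb' or '11/12' into two ints."""
    free, total = field.split("/")
    return int(free.replace("gb", "")), int(total.replace("gb", ""))


def job_fits(mem_field, cpu_field, want_mem_gb, want_cpus):
    """Return True if the request is within the node's currently free resources."""
    free_mem, _ = parse_free_total(mem_field)
    free_cpus, _ = parse_free_total(cpu_field)
    return want_mem_gb <= free_mem and want_cpus <= free_cpus


# Fields taken from the n017 row above.
print(job_fits("35gb/47gb", "11/12", 32, 8))   # request within the free resources
print(job_fits("35gb/47gb", "11/12", 40, 8))   # asks for more memory than is free
```

A True result only reflects the snapshot you pasted in. Another job may claim the resources before yours is scheduled.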
Qs83: pytorch is not able to find the cuda device
Ref:
The n061 nvidia driver was 530.30.02 with cuda toolkit 12.1. As this was the latest (as of March 2023), pytorch was not compatible. The nvidia drivers have been downgraded to 520.61.05 and cuda toolkit 11.8. Even after the downgrade, torch still could not detect a cuda device. The workaround was to compile it manually. Here are the installation notes:
>>>>>> Edit ~/.condarc (vi ~/.condarc) >>>>>>
channels:
  - defaults
envs_dirs:
  - /lscratch/s12345/.conda/envs
  - /export/home/s12345/.conda/envs
>>>>>>>
source /usr/local/bin/s3proxy.sh
module load anaconda3/2022.10
module load gcc/11.2.0
module load cmake/3.26.4
module load cuda/11.4
#If you have an existing environment, you can use it. If not, create it with:
conda create -n myTorch
source activate myTorch
conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cffi typing_extensions future six requests dataclasses #cmake
conda install -c pytorch magma-cuda118
mkdir /tmp/bela   #Any name is fine. Here I named it bela. It is a temp dir
cd /tmp/bela
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git checkout v1.13.1   # Or any version you want
git submodule sync
git submodule update --init --recursive
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py install 2>&1 | tee pythonSetupLogs.txt
>>>>>>>
Now you can test your installation:
source activate myTorch
python
Python 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>>
Another test:
python isCuda.py
CUDA AVAILABLE
>>>>cat isCuda.py<<<<<
import torch
if torch.cuda.is_available():
    print('CUDA AVAILABLE')
else:
    print('NO CUDA')
>>>>>>>>>>>>>>>>>>
To install torchvision from source:
conda install -c conda-forge libjpeg-turbo
git clone https://github.com/uploadcare/pillow-simd
cd pillow-simd
python setup.py install 2>&1 | tee pythonInstallPillowSimd.txt
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.14.1
python setup.py install 2>&1 | tee pythonInstalltorchvision.txt
For tensorflow:
conda install tensorflow=2.12.0=gpu_py311h65739b5_0 -c pytorch -c nvidia
>>>>>>>>>
Another way:
source /usr/local/bin/s3proxy.sh
module load anaconda3/2023.09
conda create -n mytorchA100 -c pytorch -c nvidia
source activate mytorchA100
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
conda install tensorflow=2.12.0=gpu_py311h65739b5_0 -c pytorch -c nvidia
>>>>>>>>
To ensure that PyTorch was installed correctly, we can verify the installation by running sample PyTorch code. Here we will construct a randomly initialized tensor.
import torch
x = torch.rand(5, 3)
print(x)
The output should be something similar to:
tensor([[0.3380, 0.3845, 0.3217],
        [0.8337, 0.9050, 0.2650],
        [0.2979, 0.7141, 0.9069],
        [0.1449, 0.1132, 0.1375],
        [0.4675, 0.3947, 0.1426]])
Source: https://pytorch.org/get-started/locally/
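When torch.cuda.is_available() returns False, it helps to know which CUDA version the PyTorch build targets and which devices it can see. This is a minimal Python sketch of a slightly more detailed check. The function name describe_cuda is hypothetical, and the import is guarded so the script also runs in an environment without PyTorch.

```python
def describe_cuda():
    """Return a short, human-readable summary of what PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"

    built_for = torch.version.cuda  # None for CPU-only builds
    if not torch.cuda.is_available():
        return (f"PyTorch {torch.__version__}: no CUDA device visible "
                f"(built for CUDA {built_for})")

    names = [torch.cuda.get_device_name(i)
             for i in range(torch.cuda.device_count())]
    return (f"PyTorch {torch.__version__} (CUDA {built_for}): "
            + ", ".join(names))


if __name__ == "__main__":
    print(describe_cuda())
```

Comparing the reported CUDA build version with the toolkit and driver on the node is a quick way to spot a mismatch like the one described in this answer.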
Qs84: How do I use the tensorflow singularity/docker containers on the A100 gpu node
ssh n061
#Check if they work first
cd /lscratch/sw/Containers
singularity shell --nv /lscratch/sw/Containers/tensorflow_23.05-tf2-py3mine.sif
python
import tensorflow as tf
physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs:", len(physical_devices))
#To use it in a pbs script, check: https://griffith.atlassian.net/wiki/spaces/GHCD/pages/4030761/Singularity
# You can create your own containers
#https://catalog.ngc.nvidia.com/containers
singularity pull docker://nvcr.io/nvidia/tensorflow:22.01-tf2-py3   # Get the version you desire
singularity shell --nv <container.sif>
Qs85: What can help if the home, sw and scratch space is slow on the A100 gpu node
You can temporarily use the local scratch to install applications and run data from.
#Please note /lscratch is temporary storage and will disappear the next time the server is imaged. So always have a backup.
To install from /lscratch, you can do the following:
mkdir -p /lscratch/s12345/.conda/envs
mkdir -p /lscratch/s12345/.conda/pkgs
Edit ~/.condarc (vi ~/.condarc)
>>>>>>
channels:
  - defaults
envs_dirs:
  - /lscratch/s12345/.conda/envs
  - /export/home/s12345/.conda/envs
>>>>>>
module use /lscratch/sw/Modules   #put it in your ~/.bashrc if you are going to use this often
module load anaconda3/2023.03
source /usr/local/bin/s3proxy.sh
conda create -n A100local   #any name will do
source activate A100local
Now you can search for and install packages with:
conda search -c pytorch pytorch   #example of a package
conda search -c pytorch-nightly pytorch   #example of a package from the development site
conda search tensorflow   #example of a package
conda install <package>
e.g: conda install -c pytorch pytorch=2.0.1=py3.11_cuda11.8_cudnn8.7.0_0
e.g: conda install -c pytorch-nightly tensorflow=2.12.0=gpu_py310hfda07e1_0
e.g: conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
Qs86: Useful external video tutorials
Qs87: I do not receive email sent to the LML - GHPC group.
Requests of this type should be directed to the IT Service Centre, to be raised through the Digital Solutions service management system (Cherwell / GSM). The easiest way to do this is to email your request to ithelp@griffith.edu.au. The request will then be redirected to the correct team. Given the email list is an LML, this will be Identity Services (e.g. reference GSM case Incident#1081850).
The solution provided for that case was:
>>>>>>
Regarding your Service Request 1081850 emails sent to LML email list are not received by me, logged on 7/11/2023 2:49 PM, we have the following question or update:
We have noticed a discrepancy in the Group Membership for LML - GHPC and your Staff Azure GroupMembership was missing from this Azure Group used for email. A re-synchronization of this group has increased the associated Azure Group Membership from 178 to 216 members of which you were one of the people missing. Thank you for notifying us of this anomaly. You will receive emails for those sent to this group from now on.
None of the other LML - Groups (you have membership for) have Staff Azure/Staff Email as a Target System and would never generate email.
This Service Request will now be marked resolved on the basis of the action taken and information provided.
>>>>>>
Qs88: Install perl modules without root access.
Install perl locally
>>>>>>>>installation notes<<<<<<<<<<<<
Download the perl source file and unzip/untar it:
wget https://www.cpan.org/src/5.0/perl-5.38.2.tar.gz --no-check-certificate
tar -xzf perl-5.38.2.tar.gz
cd perl-5.38.2
./Configure -des -Dprefix=$HOME/sw/perl/5.38.2 2>&1 | tee configureLogs.txt
make 2>&1 | tee makeLogs.txt
make test 2>&1 | tee maketestLogs.txt
make install 2>&1 | tee makeInstallLogs.txt
>>>>>>>>End of perl installation<<<<<
echo "check_certificate = off" >> ~/.wgetrc
Edit the .bash_profile file and add the local MODULEPATH:
>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
vi ~/.bash_profile
export MODULEPATH=$MODULEPATH:~/sw/Modules
>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
#copy the main perl module file locally
cp -r /sw/Modules/lang/perl/ ~/sw/Modules
module load perl/5.38.2
You now have a copy of the perl installation in your ~/sw/perl directory and a module file in ~/sw/Modules/perl/5.38.2.
source /usr/local/bin/s3proxy.sh
For example, to install the File::FindLib module:
cpan -i File::FindLib
This will install the module. Simply press return when asked for the proxy username and password.
Qs89: n061 gpu node : pytorch usage
module load anaconda3/2022.10
source activate TorchA100
Qs90: How to run the pytorch (container method and conda env method) on the ICT cluster
Container method:
Ref1: Singularity#Convertingadockerimageintosingularityimage
Minimal docker images are kept here: /sw/Containers/docker
mkdir -p /export/home/snumber/sw/Containers
cd /export/home/snumber/sw/Containers
Convert the tarball to a Singularity image:
module load singularity/4.1.3
singularity build --sandbox pytorch docker-archive:///sw/Containers/docker/pytorchnew.tar
(It takes quite some time to create the sandbox. Please be patient. You only have to do this once!)
For testing:
singularity shell -e -B /scratch/snumber:/scratch/snumber -B /export/home/snumber:/export/home/snumber /export/home/snumber/sw/Containers/pytorch
e.g:
Singularity> python
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
So we see it works! Now you need to integrate this into a slurm or pbs script. PBS examples are here
Slurm would be like this:
Create a run script:
mkdir /export/home/snumber/slurm/data
Create a file /export/home/snumber/slurm/data/myrun1.sh
cat /export/home/snumber/slurm/data/myrun1.sh
>>>>>>>>>
cd /export/home/snumber/slurm/data
python /export/home/snumber/slurm/data/isCuda.py
>>>>>>>>>>
Make it executable:
chmod +x /export/home/snumber/slurm/data/myrun1.sh
Here is the content of the slurm.02 file:
>>>>cat slurm.02 >>>>>>>>>>>>>
#!/bin/bash
####SBATCH --account=def-yoursnumber
#SBATCH --job-name=helloWord
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1500MB
#SBATCH --gres=gpu:a100:1
###SBATCH --qos=work
#SBATCH --qos=work
###SBATCH --mem=4000M   # memory per node
#SBATCH --time=0-03:00
module load singularity/4.1.3
##./program   # you can use 'nvidia-smi' for a test
singularity exec -e -B /scratch:/scratch -B /export/home/snumber:/export/home/snumber /export/home/snumber/sw/Containers/pytorch "data/myrun1.sh"
>>>>>>>>>>>
To submit the slurm job:
sbatch slurm.02
Alternatively, to run interactively:
srun --export=PATH,TERM,HOME,LANG --job-name=hello_word --cpus-per-task=1 --mem-per-cpu=1500MB --time=1:00:00 --qos=work --gres=gpu:a100:1 --pty /bin/bash -l
singularity shell -e -B /scratch/snumber:/scratch/snumber -B /export/home/snumber:/export/home/snumber /export/home/snumber/sw/Containers/pytorch
cd data
sh myrun1.sh
>>>Some errors you can ignore while building the sandbox>>>
INFO: Starting build...
INFO: Fetching OCI image...
INFO: Extracting OCI image...
2024/05/20 12:37:31 warn rootless{usr/local/nvm/versions/node/v16.20.2/bin/corepack} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:37:32 warn rootless{usr/local/nvm/versions/node/v16.20.2/bin/npm} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:37:32 warn rootless{usr/local/nvm/versions/node/v16.20.2/bin/npx} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:38:17 warn rootless{usr/lib/libarrow.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:38:17 warn rootless{usr/lib/libarrow.so.1400} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:38:17 warn rootless{usr/lib/libarrow_acero.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:38:17 warn rootless{usr/lib/libarrow_acero.so.1400} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:38:17 warn rootless{usr/lib/libarrow_dataset.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:38:17 warn rootless{usr/lib/libarrow_dataset.so.1400} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:38:34 warn rootless{usr/lib/libparquet.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2024/05/20 12:38:34 warn rootless{usr/lib/libparquet.so.1400} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
INFO: Inserting Singularity configuration...
INFO: Creating sandbox directory...
INFO: Build complete: pytorch
>>>>
The other method is to create a conda environment
#Go into an interactive run first
srun --export=PATH,TERM,HOME,LANG --job-name=hello_word --cpus-per-task=1 --mem-per-cpu=60GB --time=19:00:00 --qos=work --gres=gpu:a100_1g.10gb:1 --pty /bin/bash -l
source /usr/local/bin/s3proxy.sh
module load cuda/12.1
module load anaconda3/2024.02
conda create --name pytorchCuda121 pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
source activate pytorchCuda121
Qs91: Do you recommend installing miniconda : pytorch usage example
We do not recommend installing miniconda, as we provide conda as a module:
module load anaconda3/2024.02
If you have installed it, comment out the miniconda section in ~/.bashrc, which can interfere with other non-miniconda environments. After commenting it out, log out and log back in.
Here is how you can test pytorch in your home dir for cuda availability:
cd ~/slurm
There are a couple of sample scripts there to run slurm jobs. We will run an interactive job to troubleshoot this problem. For example:
srun --export=PATH,TERM,HOME,LANG --job-name=hello_word --cpus-per-task=1 --mem-per-cpu=50GB --time=1:00:00 --qos=work --gres=gpu:a100:1 --pty /bin/bash -l
This puts you inside a job:
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
<snip>
1718 LocalQ myrun s5284664 R 4:17:59 1 dgxlogin
Once inside a job, check whether pytorch detects the gpu (if already installed):
module load anaconda3/2024.02
conda info --envs
# conda environments:
#
base      *  /export/home/s5305964/miniconda3
PHDenv       /export/home/s5305964/miniconda3/envs/PHDenv
Unfortunately, in this case it was installed in miniconda. Leaving that aside, I would build a new env. To install a new env and packages, do this:
source /usr/local/bin/s3proxy.sh   (needed to access the internet within a slurm job)
conda create --name pytorchCuda121 pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
(See Qs 90 : https://griffith.atlassian.net/wiki/spaces/GHCD/pages/4030751/FAQ+-+Griffith+HPC+Cluster#FAQ-GriffithHPCCluster-Qs90%3AHowtorunthepytorch(containermethodandcondaenvmethod)ontheICTcluster)
source activate pytorchCuda121
python ~/isCuda.py
CUDA AVAILABLE
>>>>>>>>>isCuda.py>>>>>>>>>>>>>
import torch
if torch.cuda.is_available():
    print('CUDA AVAILABLE')
else:
    print('NO CUDA')
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
PS: Comment out the miniconda install in ~/.bashrc, which can interfere with this env.
If you ever need to use the miniconda env, you can do this:
source deactivate
module purge
__conda_setup="$("$HOME/miniconda3/bin/conda" 'shell.bash' 'hook' 2> /dev/null)"
eval "$__conda_setup"
source activate PHDenv
python isCuda.py
CUDA AVAILABLE
Qs92: dgxA100 gpu node : group reservation - dedicated gpus in slurm
A group named "alan" has 2 A100 gpus reserved.
scontrol show res
ReservationName=alan100 StartTime=2024-07-01T10:33:01 EndTime=2024-08-02T10:33:01 Duration=32-00:00:00
   Nodes=dgxlogin NodeCnt=1 CoreCnt=32 Features=(null) PartitionName=LocalQ Flags=FLEX,MAGNETIC
     NodeName=dgxlogin CoreIDs=0-10,13,40-43,64-79
   TRES=cpu=64,gres/gpu:a100=2,gres/gpu=2
   Users=(null) Groups=alan Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)
This reservation is automated (using cron) so that it is deleted every 30 days and recreated, which means these 2 gpus are permanently reserved for the alan group.
Usage is as follows:
For batch runs
==============
Add this inside the slurm script:
#SBATCH --reservation=alan100
OR
sbatch --reservation=alan100 <job script>
e.g: sbatch --reservation=alan1g10gb slurm.res
Make sure the resources requested in the job script are covered by the reservation (this command lists them: scontrol show res).
For interactive runs
=================
srun --reservation=alan100 --export=PATH,TERM,HOME,LANG --job-name=myTestRun --cpus-per-task=1 --mem-per-cpu=15GB --time=00:02:00 --qos=work --gres=gpu:a100:1 --pty /bin/bash -l
Qs93: Apptainer/Singularity Container Usage Example on the gowonda HPC cluster
This will be illustrated using an actual question that came to our support team:
Support Request:
================
I’ve been trying for some time to get gene regulatory analysis to run on a large dataset on Gowonda. Originally I tried doing this using Genie3, which begins running but errors out after about 5 hours due to exceeding memory, even if I run it on bigmem requesting 900GB. There’s a new package – arboreto – that has a better version of Genie3 but also a much more computationally efficient methodology called grnboost2. I’ve made a new conda environment and have installed arboreto, however can’t get grnboost2 to run. The files load fine, but it stalls at the grnboost2 step with the error ‘TypeError: descriptor '__call__' for 'type' objects doesn't apply to a 'property' object’. Help forums indicate that this may be due to an issue with the Python and Dask versions. I’ve tried making different Conda environments using different version combinations, but still can’t get it to work.
Our Answer:
===========
We looked at this page: https://github.com/aertslab/pySCENIC/issues/163 and we think the 1st solution, using singularity containers, is good. Containers are better suited when version mismatches occur (e.g. specific versions of python, dask, etc.).
https://pyscenic.readthedocs.io/en/latest/installation.html#docker-and-singularity-images
We followed the above link. This is a copy and paste of the section on Apptainer (formerly called Singularity):
>>>>>> Clip from the link >>>>>>>>>>>>
Singularity/Apptainer
Singularity/Apptainer images can be build from the Docker Hub image as source:
# pySCENIC CLI version.
singularity build aertslab-pyscenic-0.12.1.sif docker://aertslab/pyscenic:0.12.1
apptainer build aertslab-pyscenic-0.12.1.sif docker://aertslab/pyscenic:0.12.1
# pySCENIC CLI version + ipython kernel + scanpy.
singularity build aertslab-pyscenic-scanpy-0.12.1-1.9.1.sif docker://aertslab/pyscenic_scanpy:0.12.1_1.9.1
apptainer build aertslab-pyscenic-0.12.1-1.9.1.sif docker://aertslab/pyscenic_scanpy:0.12.1_1.9.1
>>>>>>>>>End of the clip from the link >>>>>>>>>>
To download a container from any docker link at Griffith, we have to alter the commands slightly (also documented on the singularity page of our wiki): https://griffith.atlassian.net/wiki/spaces/GHCD/pages/4030761/Singularity
(The change is to replace docker:// with docker://public.docker.itc.griffith.edu.au/)
So the altered commands become:
singularity build aertslab-pyscenic-0.12.1.sif docker://public.docker.itc.griffith.edu.au/aertslab/pyscenic:0.12.1
singularity build aertslab-pyscenic-scanpy-0.12.1-1.9.1.sif docker://public.docker.itc.griffith.edu.au/aertslab/pyscenic_scanpy:0.12.1_1.9.1
After sourcing the proxy file (source /usr/local/bin/s3proxy.sh), you can run the above two commands to download those containers. We have done this for you and placed the containers in this directory: ~/Containers
If this directory does not exist, create it with:
mkdir ~/Containers
To check the functionality of the container, please do this:
singularity shell -B /scratch/snumber:/scratch --pwd /scratch /export/home/snumber/Containers/aertslab-pyscenic-0.12.1.sif
>>>>>>>>>
From inside the container shell, you can run a few commands to check for functionality, e.g:
Apptainer> cd scripts
Apptainer> python grnboost3.py
INFO:root:Initializing Dask client...
DEBUG:asyncio:Using selector: EpollSelector
DEBUG:asyncio:Using selector: EpollSelector
DEBUG:asyncio:Using selector: EpollSelector
DEBUG:asyncio:Using selector: EpollSelector
DEBUG:asyncio:Using selector: EpollSelector
DEBUG:asyncio:Using selector: EpollSelector
DEBUG:asyncio:Using selector: EpollSelector
DEBUG:asyncio:Using selector: EpollSelector
DEBUG:asyncio:Using selector: EpollSelector
DEBUG:asyncio:Using selector: EpollSelector
DEBUG:asyncio:Using selector: EpollSelector
INFO:root:Dask client initialized: <Client: 'tcp://127.0.0.1:41411' processes=9 threads=72, memory=251.23 GiB>
INFO:root:Loading and transposing expression data...
ERROR:root:An error occurred: [Errno 2] No such file or directory: 'FinalCounts.csv'
<snip>
exit
>>>>>>>>>>>>>
This looks ok (you need to supply the missing file FinalCounts.csv). If you are happy with this, go to the next step.
(Please note: We have put all the sample pbs and run scripts in your ~/scratch/scripts folder)
The next part is to create a pbs script and a run script:
mkdir -p /scratch/snumber/scripts
mkdir -p /scratch/snumber/scripts/data
cd /scratch/snumber/scripts
You can copy all the scripts into this folder as well as create a run script, e.g:
cp grnboost3.py ~/scratch/scripts/grnboost3.py
Here is a sample pbs script to use 6 cpus:
>>>>>>>>>>>>>>>
#!/bin/bash
#PBS -m e
#PBS -M YourEmail@griffith.edu.au
#PBS -N GRN_pearl
#PBS -q workq
#PBS -l select=1:ncpus=6:mem=66gb,walltime=30:00:00
cd $PBS_O_WORKDIR
singularity exec -B /scratch/snumber:/scratch --pwd /scratch /export/home/snumber/Containers/aertslab-pyscenic-0.12.1.sif "/scratch/scripts/run1.sh"
#singularity exec -B /scratch/snumber:/scratch --pwd /scratch /export/home/snumber/Containers/aertslab-pyscenic-scanpy-0.12.1-1.9.1.sif "/scratch/scripts/run2.sh"
exit
sleep 2
>>>>>>>>>>>>>>>>
Next, create a run script inside /scratch/snumber/scripts:
cat run1.sh
>>>>>>>>>>>>
cd /scratch/scripts
python /scratch/scripts/grnboost3.py
>>>>>>>>>>>>
Make it executable:
chmod +x run1.sh
Now simply submit the pbs job:
cd /scratch/snumber/scripts
qsub pbs.01
After the run has completed, check through the output and error files and fix any errors. Voilà!