
No Format
You could try a trace.
cmake --trace . 2>&1 | tee /tmp/cmakeOut.txt

You may have to explicitly specify the library paths. For example:
cmake . -DLAPACK_LIBRARIES=/sw/library/lapack/lapack-3.6.0/3.6.0/lib64/liblapack.so -DBLAS_LIBRARIES=/sw/library/blas/CBLAS/lib/cblas_LINUX.so
 
OR:
 cmake . -DLAPACK_LIBRARIES=/sw/library/lapack/lapack-3.6.0/3.6.0/lib64/liblapack.so -DBLAS_LIBRARIES=/sw/library/blas/CBLAS/lib/cblas_LINUX.so -DCMAKE_INSTALL_PREFIX=/sw/simbody/353
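
To confirm which LAPACK/BLAS libraries the configure step actually picked up, you can inspect the CMake cache (a generic CMake check, not specific to this build):
grep -iE 'lapack|blas' CMakeCache.txt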


Qs28: How do I check the number of CPUs my job is using?

No Format
You can find out on which compute node your job is running:
qstat -1an|grep snumber
>>>>>>>>>>>>>>>>>>>
e.g.:
qstat -1an|grep s2761086
4598354.pbsserv s2761086 workq    DT_k-e_04   21795   1   1   10gb 99999 R 523:2 n010/0
Here you see it is running on n010
Then you can do this:
ssh nodename -t "htop"   or   ssh nodename -t "htop -u username"
e.g. ssh n010 -t "htop -u s2761086"
Press <F2> key, go to "Columns", and add PROCESSOR under "Available Columns".
The currently used CPU ID of each process will appear under "CPU" column.
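
As a non-interactive alternative, 'ps' can print the CPU ID each process is currently running on (the PSR column); e.g. for the same example user:
ps -o pid,psr,comm -u s2761086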



Qs29: How to customize an environment variable using modules

...

Qs 31: Multiple cores are requested and allocated by PBS but the job runs only on 1 core. Why is that?

This contribution is from Nicholas Dhal, an active Griffith HPC user, and is gratefully acknowledged.


>>>>>>>>>

When a job is executed, its processes are divided up amongst the requested number of cores on the HPC compute node. Each process has an associated setting called 'affinity': a per-core mask that specifies which CPU cores the process is allowed to run on. In serial jobs this usually makes no difference, because everything runs on a single CPU anyway. However, in parallel runs with multiple cores, the affinity is sometimes set incorrectly by the scheduler, and this limits performance.
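
A quick way to inspect and correct the affinity of a running process is the standard 'taskset' utility. A minimal sketch (PID 12345 is a placeholder; the core list should match what PBS allocated to your job):

No Format
# Show which cores PID 12345 is allowed to run on
# (prints something like: pid 12345's current affinity list: 0)
taskset -cp 12345
# Widen the affinity to cores 0-3
taskset -cp 0-3 12345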

...

The command 'htop' uses some CPU resources on the node, so limit your time logged into the node (no more than 10 minutes). To exit 'htop' use the key 'F10'. To exit the node use the command 'exit'.

goo.gl/koYII3


>>>>>>>>>>

Qs 32: How to check the remaining licenses on the license server

...

No Format
#!/bin/bash
#PBS -N jobName
####PBS -m abe
####PBS -M YourEmail@griffith.edu.au
#PBS -q routeq
#PBS -l select=1:ncpus=1:mem=2gb,walltime=5:00:00
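# (select=1:ncpus=1:mem=2gb => 1 chunk with 1 CPU and 2gb of RAM; walltime capped at 5 hours)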
#======================================================# 
# USER CONFIG
#======================================================#
INPUT_FILE="hello.py"
OUTPUT_FILE="$PBS_JOBNAME.out"
MODULE_NAME="python/3.7.4"
PROGRAM_NAME="python"
# Set as true if you need those /lscratch files.
COPY_SCRATCH_BACK=true
#======================================================#
# MODULE is loaded 
#======================================================#
NP=$(wc -l < $PBS_NODEFILE)   # total number of allocated cores (one line per core)
source /etc/profile.d/modules.sh
module load $MODULE_NAME
cat $PBS_NODEFILE
#======================================================#
# SCRATCH directory is created at the local disks 
#======================================================#
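# $PBS_JOBID makes the scratch path unique to this job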
SCRDIR=/lscratch/$LOGNAME/$PBS_JOBID
if [ ! -d "$SCRDIR" ]; then
mkdir $SCRDIR
fi
#======================================================#
# TRANSFER input files to the scratch directory 
#======================================================#
# just copy input file
cp -r $PBS_O_WORKDIR/$INPUT_FILE $SCRDIR
# copy everything (Option) 
#cp -r $PBS_O_WORKDIR/* $SCRDIR
#======================================================#
# PROGRAM is executed with the output or log file
# direct to the working directory
#======================================================#
echo "START TO RUN WORK"
cd $SCRDIR
# Run a system wide sequential program
##$PROGRAM_NAME < $INPUT_FILE >& $PBS_O_WORKDIR/$OUTPUT_FILE
$PROGRAM_NAME $INPUT_FILE >& $SCRDIR/$OUTPUT_FILE
###$PROGRAM_NAME $INPUT_FILE >& $PBS_O_WORKDIR/$OUTPUT_FILE
# Run a MPI program (Option)
###For openmpi, use the following syntax####
#module load mpi/openmpi/4.0.2
#mpiexec $PROGRAM_NAME < $INPUT_FILE >& $OUTPUT_FILE
####For intel mpi, use the following syntax####
#module load intel/2019up5/mpi
#mpiexec -n $NP $PROGRAM_NAME < $INPUT_FILE >& $OUTPUT_FILE
# mpirun -np $NP $PROGRAM_NAME < $INPUT_FILE >& $OUTPUT_FILE
# Run an OpenMP program (Option)
# export OMP_NUM_THREADS=$NP
# $PROGRAM_NAME < $INPUT_FILE >& $OUTPUT_FILE
sleep 60
#======================================================# 
# RESULTS are migrated back to the working directory
#======================================================#
if [[ "$COPY_SCRATCH_BACK" == *true* ]]
then
    echo "COPYING SCRACH FILES TO " $PBS_O_WORKDIR/$PBS_JOBID 
    cp -rp $SCRDIR/* $PBS_O_WORKDIR
    if [ $? != 0 ]; then
        {
             echo "Sync ERROR: problem copying files from $tdir to $PBS_O_WORKDIR;" 
                     echo "Contact HPC admin for a solution."
             exit 1
        }
    fi
fi
#======================================================#
# DELETING the local scratch directory 
#======================================================#
cd $PBS_O_WORKDIR
if [[ "$SCRDIR" == *scratch* ]]
then
    echo "DELETING SCRATCH DIRECTORY" $SCRDIR
    rm -rf $SCRDIR
    echo "ALL DONE!"
fi
#======================================================#
# ALL DONE 
#======================================================#
##  End-of-job summary (printed so you can run these commands after the job completes)
echo "qstat -H $PBS_JOBID"
echo "qstat -xf $PBS_JOBID"


Qs.56: NCMAS process and application

NCMAS facilities overview and who should apply
https://youtu.be/7ZZVk4HtdDY

NCMAS process and application 2021
https://youtu.be/hmV_j5GFgI0


Qs 57: What kind of storage and compute is available on Griffith HPC

...

Additionally, and in parallel, you can also apply directly when the application round opens:

https://my.nci.org.au/mancini/ncmas/2022/

Please note that projects are given a fixed allocation per quarter on a use-it-or-lose-it basis. Allocations cannot be carried forward or backward into other quarters. Standard disk space per project is 75GB in /scratch; if a project needs more, you will need to contact help@nci.org.au.

Students cannot be a lead CI on an NCI project; however, for the QCIF share, postdocs can be. For NCMAS the lead CI is required to have an ARC or NHMRC grant or equivalent, which is why larger groups apply for NCMAS. A grant is not required for a project under QCIF; however, QCIF allocations are small, around 20-50 thousand SUs per quarter. Larger allocations are only available through NCMAS.

Some applications, such as Mathematica and Matlab, are licensed software. Mathematica is only available to ANU researchers on NCI. For Matlab, Griffith would need to contact NCI to set up its institutional license; at the moment this is not in place, so Matlab cannot be used unless you have your own license, and even then you would need to check with NCI first whether you can use it on Gadi.

In general, allocations are given in service units (SUs). 1 core hour is charged at 2 SUs, so a calculation using 4 cores for 48 hours will be charged 4*48*2 = 384 SUs.
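
As a quick sanity check, the same arithmetic in the shell:

No Format
# cores * hours * SUs per core hour
echo $(( 4 * 48 * 2 ))    # prints 384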

If larger disk space (e.g. 300GB) is needed, you would need to ask NCI to increase your /scratch quota to accommodate it. If larger RAM (e.g. 400GB) is needed, make sure you run in a queue that supports that request; such queues may charge more than 2 SUs per core hour, so factor that in.

But talk to NCI (help@nci.org.au) first to see whether you can use the application (e.g. Matlab) on NCI before you even consider applying for an allocation.


Qs69: How to install bioinformatics software in your home directory

...

Reference: https://kb.hlrs.de/platforms/index.php/Batch_System_PBSPro_(vulcan)#DISPLAY:_X11_applications_on_interactive_batch_jobs

The login or head node of each cluster is a resource that is shared by many users. Running a GUI job on the login node is prohibited and may adversely affect other users. X11 Forwarding is only possible for interactive jobs.

Please note that there is a performance penalty when running a GUI job on the compute nodes using the method outlined below. 

Set up X11 forwarding

To use X11 port forwarding, first install Xming X Server on your Windows laptop/desktop. Install the Xming fonts package as well.
See instructions here: https://griffith.atlassian.net/wiki/spaces/GHCD/pages/4035477/xming
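
Once Xming is running on your desktop, a typical session looks like this (the login hostname is a placeholder; -X requests X11 forwarding, as described in the PBS Pro reference above):

No Format
# Log in with X11 forwarding enabled
ssh -X snumber@<login-node>
# Start an interactive job with X11 forwarding
qsub -I -X -l select=1:ncpus=1:mem=2gb -l walltime=1:00:00
# Once the job starts, test with a simple X11 application
xeyes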

...

  1. http://mobaxterm.mobatek.net/download-home-edition.html
  2. putty
  3. FileZilla
  4. Windows WSL lets you run the Linux versions of ssh under Windows:
       wsl --install
     This gives you the command-line tools ssh, scp, and sftp (usage sketches below).
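
A few usage sketches from a WSL (or any Linux) shell; the username and hostname are placeholders:

No Format
ssh s12345@<cluster-hostname>                      # interactive login
scp results.tar.gz s12345@<cluster-hostname>:~/    # copy a file to the cluster
sftp s12345@<cluster-hostname>                     # interactive file transfer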

...

No Format
n061's nvidia driver was 530.30.02 with cuda toolkit 12.1. As this was the latest (as of March 2023), pytorch was not compatible. The nvidia drivers were downgraded to 520.61.05 and the cuda toolkit to 11.8. Even after the downgrade, torch still could not detect a cuda device. The workaround was to compile it manually.

Here are the installation notes:
>>>>>>
Edit ~/.condarc (vi ~/.condarc)
>>>>>>
channels:
  - defaults

envs_dirs:
  - /lscratch/s12345/.conda/envs
  - /export/home/s12345/.conda/envs

>>>>>>>
source /usr/local/bin/s3proxy.sh
module load anaconda3/2022.10
module load gcc/11.2.0
module load cmake/3.26.4
module load cuda/11.4
#If you have an existing environment, you can use it (an existing env named modelT5b is activated below). If not, create one with: conda create -n myTorch
source activate modelT5b
conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools  cffi typing_extensions future six requests dataclasses #cmake
conda install -c pytorch magma-cuda118

mkdir  /tmp/bela  #Any name is fine. Here I named it bela. It is temp dir
cd /tmp/bela
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git checkout v1.13.1 # Or any version you want
git submodule sync
git submodule update --init --recursive
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py install 2>&1 | tee pythonSetupLogs.txt
>>>>>>>

Now you can test your installation:
source activate myenv
python
Python 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>>

Another test:
python isCuda.py 

CUDA AVAILABLE

>>>>cat isCuda.py<<<<<
import torch
if torch.cuda.is_available():
    print('CUDA AVAILABLE')
else:
    print('NO CUDA')
>>>>>>>>>>>>>>>>>>
 
To install torchvision from source:
conda install -c conda-forge libjpeg-turbo
git clone https://github.com/uploadcare/pillow-simd
cd pillow-simd
python setup.py install 2>&1 | tee pythonInstallPillowSimd.txt
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.14.1
python setup.py install 2>&1  | tee pythonInstalltorchvision.txt
For tensorflow:
conda install -c conda-forge tensorflow=2.12.0=gpu_py310hfda07e1_0
>>>>>>>>>
Another way:
source /usr/local/bin/s3proxy.sh
module load anaconda3/2023.09
conda create -n pytorchGPUmytorchA100 -c pytorch -c nvidia
source activate pytorchGPUmytorchA100
conda search -c pytorch pytorch
conda install pytorch=2.1.0=py3.11_cuda11.8_cudnn8.7.0_0 torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia




Qs84: How do I use the tensorflow singularity/docker containers on the A100 gpu node

...

No Format
You can temporarily use the local scratch to install applications and run data out of.
#Please note /lscratch is temporary storage and will disappear the next time the server is imaged, so always keep a backup.
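
A simple way to keep a copy of anything important in your home directory (s12345 and myapp are placeholder names):
rsync -a /lscratch/s12345/myapp/ ~/lscratch-backup/myapp/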

To install from /lscratch, you can do the following:
mkdir -p /lscratch/s12345/.conda/envs
mkdir -p /lscratch/s12345/.conda/pkgs

Edit ~/.condarc (vi ~/.condarc)
>>>>>>
channels:
  - defaults

envs_dirs:
  - /lscratch/s12345/.conda/envs
  - /export/home/s12345/.conda/envs
>>>>>>

module use /lscratch/sw/Modules  #put it in your ~/.bashrc if you are going to use this often
module load anaconda3/2023.03
source /usr/local/bin/s3proxy.sh
conda create -n A100local   #any name will do
source activate A100local
Now you can install the packages with
conda search -c pytorch pytorch #example of a package 
conda search -c pytorch-nightly  pytorch #example of a package from development site
conda search tensorflow #example of a package 
conda install <package> 
e.g: conda install -c pytorch pytorch=2.0.1=py3.11_cuda11.8_cudnn8.7.0_0
e.g: conda install  -c pytorch-nightly tensorflow=2.12.0=gpu_py310hfda07e1_0
e.g. conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia


Qs86: Useful external video tutorials

...