LEGION support: rc-support@ucl.ac.uk
MYSCRIPT
#!/bin/bash -l
#PBS -N NAME
#PBS -j oe
#PBS -l walltime=00:04:59
#PBS -l pmem=2048M
#PBS -l pvmem=8GB
#PBS -A ucl/BioinfCompBio/mol_evol
export OMP_NUM_THREADS=1
cd ~/Prank_analysis_23Oct
prank -d=ENSG00000000003 -fixedbranches=0.1 -t=Tree_ENSG00000000003 -o=Out_ENSG00000000003 -nopost -noxml -notree
cp Out_ENSG00000000003 MYRESULTS/Out_ENSG00000000003
* About walltime
http://www.ucl.ac.uk/research-computing/information/resource_allocation
* Submit the job to Legion
qsub MYSCRIPT
* Shell script
#!/bin/sh
cd myscripts
qsub script1; sleep 0.1;
qsub script2; sleep 0.1;
....
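When there are thousands of job scripts, listing each qsub line by hand gets unwieldy. A minimal sketch of the same idea as a loop, assuming the job scripts all sit in one directory and their names start with "script" (as in the example above). The function name submit_all and the QSUB override are made up for illustration and are not part of Legion:

```shell
# submit_all DIR: submit every file matching script* in DIR, one at a time.
# QSUB is a hypothetical override (e.g. QSUB=echo for a dry run); default is qsub.
submit_all() {
    (
        cd "$1" || exit 1
        for script in script*; do
            [ -f "$script" ] || continue   # skip if the glob matched nothing
            ${QSUB:-qsub} "$script"
            sleep 0.1                      # brief pause between submissions, as above
        done
    )
}
```

Usage would be something like: submit_all myscripts. Setting QSUB=echo first prints the script names instead of submitting them, which is a cheap way to check the list.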
Log in to a node to run jobs
slogin -X usertest11
* QSTAT
qstat | tr -s " " | awk 'BEGIN {count1=0; count2=0; count3=0; count4=0;}{count1++} $3 == "ucbtiju" {Found = 1;} NR > 2 && Found != 1 {count2++}$3 == "ucbtiju" {count3++} $3 == "ucbtiju" && $4 == "0" {count4++} END{print "\n No. of jobs in queue in total: " count1"\n No. above your 1st job in queue: " count2"\n No. of your jobs in queue: " count3"\n No. of your jobs running: " count3-count4 "\n ";}'; date
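The one-liner above is hard to read, so here is a sketch of the same counting logic spread over several lines with the username passed in as an argument. It assumes the same column layout as the one-liner (username in column 3, time used in column 4 with "0" meaning not started yet) and two header lines at the top of the qstat output. Unlike the one-liner, it leaves the header lines out of the total. The function name queue_summary is made up for illustration:

```shell
# queue_summary USER: read qstat-style output on stdin and print job counts.
# Assumed layout: user in column 3, time used in column 4 ("0" = not started),
# and two header lines before the first job.
queue_summary() {
    awk -v user="$1" '
        NR <= 2                 { next }              # skip the header lines
                                { total++ }           # every job in the queue
        $3 == user              { mine++; seen = 1 }  # your jobs
        !seen                   { ahead++ }           # jobs above your first job
        $3 == user && $4 == "0" { waiting++ }         # your jobs not yet started
        END {
            printf "total=%d ahead=%d mine=%d running=%d\n",
                   total, ahead, mine, mine - waiting
        }'
}
```

Usage would be something like: qstat | queue_summary ucbtiju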
Useful commands
Check the number of files
ls -laR | grep -c '^-'
qstat
To list *every* job in the queue
qstat
To list all of ucbtiju's jobs
qstat | grep ucbtiju
To do either of these one page at a time, which is more useful:
qstat | less
qstat | grep ucbtiju | less
You can check individual jobs using e.g.
checkjob 12467864.qm01
or for a more detailed description:
checkjob -v 12467864.qm01
This will tell you how long the job has been queueing, why it hasn't
started yet, etc.
Check the log file (why did Legion not run all of my 20,000 jobs?)
Each job you submit produces a log file. I forget the exact format but
you can find them using:
find mydirectory -type f -name "*.o*"
They will be something like: NAME.o12313677
where 12313677 is the job number from 12313677.qm01 (that you see in
qstat), and the NAME is whatever you called the job in your script.
Anyway, if there are only 4000 of these output files then LEGION did miss
some of your jobs, but if there are 20,000 of these files then that probably
means that it was you who made a mistake, such as a typing error somewhere
in your job, e.g. a directory missing from the path, or a space where there
shouldn't be one, etc.
Personally I gave up wasting my time looking at what went wrong unless it
happened a lot of times on the same jobs. I think it is better to write a
little script to see which jobs did or did not run, and then re-submit the
jobs that were missed.
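A minimal sketch of such a "little script", under some assumptions that are not from the notes above: there is a text file with one gene ID per line, and each finished job leaves a non-empty file called Out_<ID> in the results directory (following the naming in MYSCRIPT). The function name list_missing and the file names in the usage line are made up for illustration:

```shell
# list_missing IDS_FILE RESULTS_DIR: print each ID whose output file
# RESULTS_DIR/Out_<ID> is absent or empty (assumed naming, see above).
list_missing() {
    ids_file=$1
    results_dir=$2
    while IFS= read -r id; do
        [ -n "$id" ] || continue                            # ignore blank lines
        [ -s "$results_dir/Out_$id" ] || printf '%s\n' "$id"
    done < "$ids_file"
}
```

Usage would be something like: list_missing ids.txt MYRESULTS > missing.txt. The IDs in missing.txt are then the ones whose job scripts need to go through qsub again.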
A quick check would be something like:
find mydirectory/myresults -type f -name "*.phy" | wc -l
this will tell you how many files of a certain type are in a folder. This
can be used to see if you are missing some output files when the jobs seem
to be finished.
The only problem with putting higher walltimes is that your jobs stay in
the queue much longer while they wait for other people. If possible it is
best to keep the walltime low and just rerun the jobs that were missed.
If you know that a job takes 60 mins on tarsier then 90 mins for LEGION
is very safe and 24 hours is overkill!
I find it quicker to run with low walltime and re-run the missed jobs, but
one alternative is to request a large wall time like 24 hours and just run
25-50 prank jobs in each job script. This method also works quite well -
although I still prefer to use thousands of small jobs!
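For the alternative of bundling 25-50 prank runs into one job, a sketch of what the body of such a job script could look like. It assumes a batch file with one gene ID per line and reuses the prank options from MYSCRIPT. The function name run_batch, the PRANK override and the batch file names are made up for illustration:

```shell
# run_batch BATCH_FILE: run prank once per gene ID listed in BATCH_FILE.
# PRANK is a hypothetical override (e.g. PRANK=echo for a dry run).
run_batch() {
    batch_file=$1
    while IFS= read -r id; do
        [ -n "$id" ] || continue              # ignore blank lines
        # stdin is redirected so the command cannot swallow the ID list
        ${PRANK:-prank} -d="$id" -fixedbranches=0.1 -t="Tree_$id" \
            -o="Out_$id" -nopost -noxml -notree < /dev/null
    done < "$batch_file"
}
```

The batch files could be made with something like: split -l 25 ids.txt batch_ and each job script would then call run_batch on one of them after the cd line, with a walltime large enough for the whole batch.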
Kill job
If you want to kill a job in the queue:
qdel 1052353432.qm01
...you get the number 1052353432.qm01 from looking at qstat.
There are other variations, e.g. adding '| wc -l' counts the number of
lines, so these two tell you how many jobs in the queue, and how many you
have in the queue:
qstat | wc -l
qstat | grep ucbpwaf | wc -l
Conditions of use
http://www.ucl.ac.uk/research-computing/information/services/cluster/user_test