====== Working with the AEGIS04-KG Cluster ======
**AEGIS04-KG** is a cluster of 50 processors belonging to the //Center for Scientific Research of SANU and the University of Kragujevac//. Its access point is **cream-ce.csk.kg.ac.rs**, and the protocol used for access is standard **ssh**. The accounts created on the access host may be used **exclusively by master's students of the Institute of Mathematics and Informatics** of the PMF in Kragujevac. Below is a basic list of commands for managing cluster jobs, along with several examples.
{{:mapa_aegis_site.jpg?200|}}
{{:aegis08.jpg?200|}}
===== Introduction =====
Once you have built a binary for the application you wish to run, the next step is to create a PBS script, which is just a plain text file. The PBS script tells the scheduler which resources the job requires. The AEGIS04-KG cluster uses **PBS/Torque with Maui** from //Cluster Resources//. The resource manager is //Torque//, an open source solution from the same company, and the scheduler is Maui. Torque informs Maui of the available resources in the cluster, and Maui decides when to run the queued jobs.
===== PBS/Torque Commands =====
Any line that begins with **#PBS** is a //PBS/Torque// command. Any other line that begins with **#** is an ordinary comment and is ignored by both the shell and Torque.
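To make the distinction concrete, here is a minimal sketch that filters a script down to the lines Torque would read as commands. The helper name ''list_pbs_directives'' and the contents of the throwaway script are assumptions made for illustration; only ''grep'' and ''mktemp'' are standard tools.

```shell
#!/bin/bash
# Print only the lines that start with "#PBS".
# Ordinary comments ("# ...") and shell commands are filtered out.
list_pbs_directives() {
    grep '^#PBS' "$1"
}

# Demo with a throwaway script (its contents are illustrative only).
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
#!/bin/bash
#PBS -l nodes=2:ppn=2
# an ordinary comment, ignored by both the shell and Torque
#PBS -q batch
echo hello
EOF

list_pbs_directives "$demo_script"
rm -f "$demo_script"
```

Running the sketch prints the two ''#PBS'' lines and nothing else.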
==== Mandatory Items ====
Below is a snippet from a valid PBS script. These items must be included in any PBS script.
#!/bin/bash
#PBS -l nodes=2:ppn=2
#PBS -l walltime=30:00
#PBS -q batch
* The first line specifies the shell interpreter I wish to use, in this instance /bin/bash.
* The second line specifies the number of processors I require: two nodes with two processors per node, for a total of four processors.
* The third line specifies the maximum amount of time I believe my job will run, in this instance 30 minutes. The syntax is //days:hours:minutes:seconds//, and leading fields may be omitted.
Walltime is important. Maui is smart enough to backfill and preempt jobs where possible, which greatly increases the efficiency of the cluster. For Maui to do its job, however, the walltime needs to be as close to reality as possible.
* The fourth line specifies the queue I am submitting to, //batch//.
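The arithmetic behind the second and third lines can be checked with a small sketch. Both function names are hypothetical helpers written for this page, not part of Torque, and the walltime parser assumes colon-separated fields read from right to left as seconds, minutes, hours, days.

```shell
#!/bin/bash
# Convert a walltime such as "30:00" or "1:30:00" to seconds.
walltime_to_seconds() {
    local -a fields
    local -a multipliers=(1 60 3600 86400)
    local total=0 i n
    IFS=':' read -r -a fields <<< "$1"
    n=${#fields[@]}
    for (( i = 0; i < n; i++ )); do
        # 10# forces base 10 so that "08" or "00" is not parsed as octal
        total=$(( total + 10#${fields[n-1-i]} * multipliers[i] ))
    done
    echo "$total"
}

# Multiply nodes by processors per node for a request like "nodes=2:ppn=2".
total_procs() {
    local spec=$1 nodes ppn
    nodes=${spec#nodes=}      # drop the leading "nodes="
    nodes=${nodes%%:*}        # keep everything before the first ":"
    ppn=${spec##*ppn=}        # keep everything after "ppn="
    ppn=${ppn%%:*}            # drop any trailing properties
    echo $(( nodes * ppn ))
}

walltime_to_seconds 30:00      # prints 1800
total_procs nodes=2:ppn=2      # prints 4
```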
==== Various Useful PBS Commands ====
#Redirect output and error files
#PBS -o <output_file>
#PBS -e <error_file>
For documentation on other PBS commands, see the references below.
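If you write many job scripts, the header lines can be generated rather than typed by hand. The following is only a sketch: ''pbs_header'' is a hypothetical helper, and its argument order (nodes, processors per node, walltime, queue, output file, error file) is an arbitrary choice made for this example.

```shell
#!/bin/bash
# Emit a PBS header from positional arguments (hypothetical helper):
#   pbs_header NODES PPN WALLTIME QUEUE OUTFILE ERRFILE
pbs_header() {
    printf '%s\n' '#!/bin/bash'
    printf '#PBS -l nodes=%s:ppn=%s\n' "$1" "$2"
    printf '#PBS -l walltime=%s\n' "$3"
    printf '#PBS -q %s\n' "$4"
    printf '#PBS -o %s\n' "$5"
    printf '#PBS -e %s\n' "$6"
}

# Print a header for two nodes, two processors each, 30 minutes, queue "batch".
pbs_header 2 2 30:00 batch out err
```

Redirecting the output to a file and appending your shell commands gives a complete job script.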
===== Shell Commands =====
Once you have filled in all of the PBS commands, it is time to actually launch your job. The PBS script is a basic shell script, so anything you typically do in the shell can be done here. Below I show ways to launch most of the types of jobs I can think of.
#Some Shell commands just to demonstrate
#Change to binaries/ in my home directory
cd ~/binaries
#Echo the date; this will be printed to the output file.
echo `date`
#Launch a job without communication on every requested processor.
mpirun -np <number_of_processes> -mca btl self <binary>
We use the Open MPI implementation installed across the cluster. Open MPI uses **mpirun**. For more information on the flags for the Open MPI mpirun, see the references below.
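Rather than hard-coding the process count, a job script can derive it from the allocation. The sketch below assumes that Torque sets the ''PBS_NODEFILE'' environment variable to a file containing one line per allocated processor slot. ''count_slots'' is a hypothetical helper and ''./my_app'' is a placeholder binary, so the final line only echoes the command it would run.

```shell
#!/bin/bash
# Count the processor slots granted to the job.
# Assumption: PBS_NODEFILE names a file with one line per slot.
count_slots() {
    local nodefile=${1:-${PBS_NODEFILE:-}}
    if [ -r "$nodefile" ]; then
        # tr strips the padding some wc implementations add
        wc -l < "$nodefile" | tr -d ' '
    else
        # Not running under Torque: fall back to a single process
        echo 1
    fi
}

np=$(count_slots)
# ./my_app is a placeholder; inside a real job, drop the leading "echo".
echo mpirun -np "$np" ./my_app
```

Outside a Torque job the variable is unset, so the sketch falls back to one process.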
===== Submitting the Job =====
This is the easiest part of the whole process. If you have done everything correctly up to this point, you may simply **qsub** the job. For example, here is a working script and its submission. The job runs and is numbered 1424.
user@panopticon ~/src $ cat go.sub
#!/bin/bash
#PBS -l nodes=2:ppn=2
#PBS -l walltime=30:00
#PBS -q batch
#PBS -o out
#PBS -e err
#PBS -A systemTest
echo $PBS_JOBID
date
mpirun /bin/hostname
user@panopticon ~/src $ qsub go.sub
1424.cluster1.csk.kg.ac.rs
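The identifier printed by qsub combines the job number and the server name. If a later command needs only the number, it can be cut out with shell parameter expansion. This is a minimal sketch: ''job_number'' is a hypothetical helper, and the identifier is hard-coded here so the example runs without a cluster.

```shell
#!/bin/bash
# Strip the server suffix from a qsub identifier such as
# "1424.cluster1.csk.kg.ac.rs", leaving the job number.
job_number() {
    echo "${1%%.*}"
}

# In a real session the identifier would come from: jobid=$(qsub go.sub)
jobid="1424.cluster1.csk.kg.ac.rs"
job_number "$jobid"    # prints 1424
```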
===== Errors =====
Below is a list of errors and how to fix them. Some of the qsub error messages are rather cryptic.
* You forgot to include the //#PBS -q// line.
user@panopticon ~/src $ qsub go.sub
qsub: No default queue specified MSG=cannot locate queue
* You specified an invalid queue.
user@panopticon ~/src $ qsub go.sub
qsub: Unknown queue MSG=cannot locate queue
* You do not have access to the queue specified
user@panopticon ~/src $ qsub go.sub
1427.cluster1.csk.kg.ac.rs
user@panopticon ~/src $ checkjob 1427
job 1427
AName: go.sub
State: Idle
....
Holds: Batch:PolicyViolation
NOTE: job cannot run (job has hold in place)
NOTE: job hold active - Batch
* Torque is unable to look up your account (email the NOC list to let us know).
user@panopticon ~/src $ qsub go.sub
qsub: Bad UID for job execution
The above is the list of common errors that I can recall at the moment. If you experience any other errors, let me know and I'll update the list.
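The messages listed above can be turned into a quick lookup. The sketch below matches the message text shown in this section against a hint for each case. ''explain_qsub_error'' is a hypothetical helper, and the patterns cover only the errors listed on this page.

```shell
#!/bin/bash
# Map a qsub or checkjob message to a short hint (hypothetical helper).
explain_qsub_error() {
    case "$1" in
        *"No default queue specified"*)
            echo "add a #PBS -q line naming a queue" ;;
        *"Unknown queue"*)
            echo "the queue name is invalid" ;;
        *"PolicyViolation"*)
            echo "you do not have access to the requested queue" ;;
        *"Bad UID for job execution"*)
            echo "Torque cannot look up your account; email the NOC list" ;;
        *)
            echo "unrecognized message" ;;
    esac
}

explain_qsub_error "qsub: Unknown queue MSG=cannot locate queue"
```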
===== References =====
- [[http://www.open-mpi.org/faq/|Open MPI Documentation]]: see the tuning sections for mpiexec commands (replace occurrences of mpirun with mpiexec).
- [[http://www.clusterresources.com/wiki/doku.php?id=torque:torque_wiki|Torque Documentation]]: PBS commands are listed here.