Working with the AEGIS04-KG Cluster

AEGIS04-KG is a 50-processor cluster that belongs to the Center for Scientific Research of SANU and the University of Kragujevac. Its access point is cream-ce.csk.kg.ac.rs, and access is over standard ssh. Accounts have been created on the access host for use exclusively by master's students of the Institute of Mathematics and Informatics at the Faculty of Science (PMF) in Kragujevac. A basic list of commands for managing cluster jobs follows, along with several examples.

Introduction

Once you have built a binary for the application you want to run, the next step is to write a PBS script, which is a plain text file. The PBS script tells the scheduler which resources the job requires. The AEGIS04-KG cluster uses PBS/Torque with the Maui scheduler from Cluster Resources. Torque, an open source resource manager from the same company, informs Maui of the resources available in the cluster, and Maui decides when to run the queued jobs.

PBS/Torque Commands

Any line that begins with #PBS is a PBS/Torque command. Any other line that begins with #, not followed by PBS, is a comment and is ignored by both the shell and Torque.
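To make the distinction concrete, the sketch below writes a small job script to a temporary file and prints only the #PBS lines. The script contents and the temporary file are illustrative assumptions, not taken from the cluster.

```shell
# Write an illustrative job script to a temporary file (hypothetical content).
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/bash
#PBS -l nodes=1:ppn=1
# This line is an ordinary comment: there is no PBS keyword after the '#'.
#PBS -q batch
echo "hello"
EOF

# Lines that start with '#PBS' are the ones Torque reads as commands.
directives=$(grep -c '^#PBS' "$script")
grep '^#PBS' "$script"
echo "PBS lines found: $directives"
rm -f "$script"
```

Only the two #PBS lines are printed; the shebang, the plain comment, and the echo line are skipped.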

Mandatory Items

Below is a snippet from a valid PBS script. These items must be included in any PBS script.

#!/bin/bash
#PBS -l nodes=2:ppn=2
#PBS -l walltime=30:00
#PBS -q batch
  • The first line specifies the shell interpreter I wish to use, in this instance /bin/bash.
  • The second line specifies the number of processors I require. I need two nodes and two processors per node, for a total of four processors.
  • The third line specifies the maximum amount of time I believe my job will run, in this instance 30 minutes. The syntax is <days>:<hours>:<minutes>:<seconds>; leading fields may be omitted, so 30:00 means 30 minutes and 0 seconds.

Walltime is important. Maui is smart enough to backfill and preempt jobs where possible, which greatly increases the efficiency of the cluster. For Maui to do its job, however, the walltime needs to be as close to reality as possible.

  • The fourth line specifies the queue I am submitting to, batch
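One way to sanity-check a walltime request is to convert it to seconds. The function below is a minimal sketch: walltime_to_seconds is a hypothetical helper name, and it assumes the colon-separated form described above, with missing leading fields treated as zero.

```shell
# Convert a colon-separated walltime to seconds (hypothetical helper).
# Accepts S, M:S, H:M:S or D:H:M:S; missing leading fields count as zero.
walltime_to_seconds() {
  local d=0 h=0 m=0 s=0
  local IFS=':'
  # Split the argument on ':' into positional parameters.
  set -- $1
  case $# in
    1) s=$1 ;;
    2) m=$1; s=$2 ;;
    3) h=$1; m=$2; s=$3 ;;
    4) d=$1; h=$2; m=$3; s=$4 ;;
    *) echo "unrecognized walltime" >&2; return 1 ;;
  esac
  # 10# forces base 10 so that fields such as "08" are not parsed as octal.
  echo $(( ((10#$d * 24 + 10#$h) * 60 + 10#$m) * 60 + 10#$s ))
}

walltime_to_seconds 30:00      # the 30-minute request from the snippet above
walltime_to_seconds 1:30:00    # one hour and thirty minutes
```

The two calls print 1800 and 5400, which confirms that 30:00 is read as minutes and seconds rather than hours and minutes.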

Various Useful PBS Commands

#Redirect Output and Error Files
#PBS -o <output file>
#PBS -e <error file>
 
#Define an environment variable
#PBS -v LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/user/lib"
#Import all current environment variables from submitting shell
#PBS -V
 
#Give the Job a Name
#PBS -N <some identifying string>

For documentation on other PBS commands, see the references below.
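The options above can be combined in one script header. The sketch below is an assumption-laden example: the job name and file names are made up, and the fallbacks after ':-' exist only so the script can also be run by hand outside the scheduler. PBS_JOBNAME and PBS_O_WORKDIR are variables that Torque sets inside a running job.

```shell
#!/bin/bash
#PBS -N hostname_test
#PBS -o hostname_test.out
#PBS -e hostname_test.err
#PBS -V

# Inside a job Torque sets these variables; the defaults are used only
# when the script is run directly (hypothetical job name).
job_name="${PBS_JOBNAME:-hostname_test}"
work_dir="${PBS_O_WORKDIR:-$PWD}"

# Return to the directory the job was submitted from before doing any work.
cd "$work_dir" || echo "cannot enter $work_dir" >&2
echo "job '$job_name' running in $(pwd)"
```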

Shell Commands

Once you have filled in all of the PBS commands, it is time to launch your job. The PBS script is an ordinary shell script, so anything you normally do in the shell can be done in it. Below are ways to launch most of the job types I can think of.

#Some Shell commands just to demonstrate
 
#Change to binaries/ in my home directory
cd ~/binaries
 
#Echo the date; this will be printed to the output file.
echo `date`
 
#Launch a job without communication on every requested processor.
mpirun -np <number of processors> -mca btl self <binary>

Open MPI is installed across the cluster, and its launcher is 'mpirun'. For more information on the flags for the Open MPI mpirun, see the references below.
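Rather than hard-coding the processor count, a job can derive it from the node file that Torque exposes through $PBS_NODEFILE, which holds one line per allocated processor. A minimal sketch follows; the node names, the fallback file, and ./my_binary are placeholders so the example can also run outside the scheduler, and the last step only prints the mpirun command instead of executing it.

```shell
#!/bin/bash
# Inside a job, Torque sets PBS_NODEFILE to a file with one line per
# allocated processor. Outside a job we fake one (hypothetical node names)
# that matches the request nodes=2:ppn=2.
nodefile="${PBS_NODEFILE:-}"
fake=""
if [ -z "$nodefile" ]; then
  fake=$(mktemp)
  printf '%s\n' node01 node01 node02 node02 > "$fake"
  nodefile="$fake"
fi

# Count the lines to get the total number of processors.
np=$(wc -l < "$nodefile")
np=$((np))   # normalize: some wc implementations pad the count with spaces

# Print the launch command; ./my_binary is a placeholder for your program.
echo "mpirun -np $np -mca btl self ./my_binary"

# Clean up the fake node file if we created one.
if [ -n "$fake" ]; then
  rm -f "$fake"
fi
```

Deriving the count this way keeps the -np value in step with the #PBS -l nodes=...:ppn=... request when the request changes.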

Submitting the Job

This is the easiest part of the whole process. If you have done everything correctly up to this point, you can simply qsub the job. For example, here is a working script and its submission. The job is accepted and assigned the number 1424.

user@panopticon ~/src $ cat go.sub
#!/bin/bash
#PBS -l nodes=2:ppn=2
#PBS -l walltime=30:00
#PBS -q batch
#PBS -o out
#PBS -e err
#PBS -A systemTest
 
echo $PBS_JOBID
date
mpirun /bin/hostname
 
user@panopticon ~/src $ qsub go.sub
1424.cluster1.csk.kg.ac.rs
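The string qsub prints is the full job ID, and its numeric prefix is what commands such as checkjob take. The sketch below hard-codes the ID from the example so that it runs anywhere; in a real session it would be captured from qsub, as the comment shows. The variable names are illustrative.

```shell
# In a real session the ID would come from qsub itself, for example:
#   job_id=$(qsub go.sub)
# Here the ID from the example above is hard-coded so the sketch runs anywhere.
job_id="1424.cluster1.csk.kg.ac.rs"

# Strip everything from the first dot onward to keep the numeric part.
job_num="${job_id%%.*}"
echo "submitted job $job_num"

# The numeric ID can then be passed on, as in: checkjob "$job_num"
```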

Errors

Below is a list of common errors and how to fix them. Some of the qsub error messages are rather cryptic.

  • You forgot to include the #PBS -q <queue name> line.
user@panopticon ~/src $ qsub go.sub
qsub: No default queue specified MSG=cannot locate queue
  • You specified an invalid queue.
user@panopticon ~/src $ qsub go.sub
qsub: Unknown queue MSG=cannot locate queue
  • You do not have access to the queue specified
user@panopticon ~/src $ qsub go.sub
1427.cluster1.csk.kg.ac.rs
user@panopticon ~/src $ checkjob 1427
job 1427
 
 AName: go.sub
 State: Idle 
 ....
 Holds:          Batch:PolicyViolation
 NOTE:  job cannot run  (job has hold in place)
 NOTE:  job hold active - Batch
  • Torque is unable to look up your account (email the NOC list to let us know).
user@panopticon ~/src $ qsub go.sub
qsub: Bad UID for job execution

The above is the list of common errors that I can recall at the moment. If you experience any other errors, let me know and I'll update the list.
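The missing-queue error can be caught before submission with a quick check of the script. The function below is a minimal sketch; has_queue_directive is a hypothetical helper name, and the sample job script is made up for the demonstration.

```shell
# Return success if the given job script contains a '#PBS -q <queue>' line.
has_queue_directive() {
  grep -Eq '^#PBS[[:space:]]+-q[[:space:]]+[^[:space:]]+' "$1"
}

# Illustrative job script that is missing the queue line.
script=$(mktemp)
printf '%s\n' '#!/bin/bash' '#PBS -l nodes=2:ppn=2' 'date' > "$script"

if has_queue_directive "$script"; then
  queue_status="ok"
else
  queue_status="missing"
  echo "add a '#PBS -q <queue name>' line before running qsub" >&2
fi
echo "queue line: $queue_status"
rm -f "$script"
```

A check like this only confirms that a queue is named; it cannot tell whether the queue exists or whether you have access to it.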

References

 
klaster-komande.txt · Last modified: 2010/11/30 14:24 by milos
 