Create and submit jobs to the Condor cluster
What is this about?
Here you will find an introduction on how to create and submit jobs; one fully working example (in Java) is provided. Much more detail is provided in the official Condor Version 6.6.10 Manual. Also see the local commands section.
Principles
Condor only sends your programs to lots of different machines, where they are executed, and returns their results to your machine. All distributed routines therefore have to be implemented by you. This is not hard to achieve for most problems, as long as there is some "natural" way of segmenting them.
Please note that there is a kind of currency for the usage of cluster resources, to avoid the "tragedy of the commons" problem. The currency is user priority and works as follows: the rate at which machines are assigned to a batch of jobs is proportional to that user's priority rating relative to other users. While your jobs run and use computing power, your priority becomes lower and lower, i.e. the value shown by condor_userprio becomes higher, approaching the number of machines you actually use. The half-life of priority is set to one day, so after one day without using any resources your prio value will have halved. In the long run this assures every user a fair share of the available resources. Please think about your jobs and test them before submitting!
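The behaviour of the prio value described above can be sketched as a simple exponential model. This is only an illustration under the assumptions stated in the text (the value approaches the number of machines in use, with a half-life of one day); it is not Condor's actual implementation, and the class and method names are made up for this example.

```java
// Simplified model of the value shown by condor_userprio.
// Illustration only: names and formula are assumptions, not Condor code.
class PriorityDecay {
    /** Half-life of the priority value in hours (one day, as stated above). */
    static final double HALF_LIFE_HOURS = 24.0;

    /**
     * Moves the prio value towards the number of machines currently in use.
     * With no machines in use the value halves every HALF_LIFE_HOURS.
     */
    static double update(double prio, double machinesInUse, double hours) {
        double beta = Math.pow(0.5, hours / HALF_LIFE_HOURS);
        return machinesInUse + (prio - machinesInUse) * beta;
    }

    public static void main(String[] args) {
        double prio = 40.0; // hypothetical value after heavy usage
        for (int day = 1; day <= 3; day++) {
            prio = update(prio, 0.0, 24.0); // one idle day
            System.out.println("after idle day " + day + ": " + prio);
        }
    }
}
```

Starting from a value of 40 and using no machines, this model gives 20, 10 and 5 after one, two and three idle days.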
There are two reasons for the existence of an ideal job size. First, the above-mentioned priority: if one user claims lots of machines (and thus will have low priority after a while), this might block the whole cluster while jobs from people with higher priority are waiting. This is why, after an hour of tolerance, jobs of a low-priority user might be killed to make room for other jobs. Second, on non-dedicated machines your job might start in the evening when the owner leaves. If the job is still running when the owner returns the next morning and starts working, the job is terminated after a while. In most cases the killed jobs then have to run again from scratch, which is a waste of processing time. An ideal job should complete on average after one to at most four hours.
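As a rough sketch of how a problem might be cut into jobs of the recommended size, the following assumes that the work consists of independent items and that you have an estimate of the total runtime. The class name, the item count and the 60-hour estimate are hypothetical; the job number would be passed in the same way as in the Java example further down.

```java
// Splits a number of independent work items into jobs of a target duration.
// All numbers here are made-up examples.
class JobChunks {
    /** Number of jobs needed so that each runs for roughly targetHours. */
    static int jobCount(double totalHours, double targetHours) {
        return Math.max(1, (int) Math.ceil(totalHours / targetHours));
    }

    /** First work item (inclusive) handled by job number "job" out of "jobs". */
    static long firstItem(long items, int jobs, int job) {
        return items * job / jobs;
    }

    public static void main(String[] args) {
        long items = 1000000L;           // hypothetical number of work items
        int jobs = jobCount(60.0, 2.0);  // 60 CPU hours in total, 2 hours per job
        int job = args.length > 0 ? Integer.parseInt(args[0]) : 0;
        long from = firstItem(items, jobs, job);
        long to = firstItem(items, jobs, job + 1) - 1;
        System.out.println("job " + job + " of " + jobs
                + " handles items " + from + " to " + to);
    }
}
```

Computing the range from the job number alone means every job can work out its own share without talking to the others.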
If your problem does not naturally fall into chunks of the right size, you can still, with a little more effort, save the state of your job and have it transferred back to your machine as if the job had finished. With an additional script (possibly a regularly running cron job) you can then resubmit the task as a new job that starts from the previously saved state. Various Condor mailing lists discuss this and alternative options for handling long jobs.
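A minimal sketch of the save-and-resume idea is shown below. It uses a toy computation (summing the numbers from 1 to a given total) and a hypothetical state file "state.txt". It assumes a Java version that provides java.nio.file (Java 7 or later), so check the versions available in the cluster first. For a resubmitted job the state file would also have to be sent along with the job, for example via transfer_input_files in the job description.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Toy job that works in limited steps and keeps its progress in a state file.
// Class name, file name and numbers are made up for illustration.
class ResumableJob {
    /**
     * Adds up 1..total, doing at most maxSteps additions per run.
     * Progress ("next number" and "sum so far") is read from and written
     * to stateFile. Returns true once the whole computation is finished.
     */
    static boolean run(Path stateFile, long total, long maxSteps) throws IOException {
        long next = 1;
        long sum = 0;
        if (Files.exists(stateFile)) {
            // Resume from the state saved by a previous run.
            String saved = new String(Files.readAllBytes(stateFile), StandardCharsets.UTF_8);
            String[] parts = saved.trim().split(" ");
            next = Long.parseLong(parts[0]);
            sum = Long.parseLong(parts[1]);
        }
        long stop = Math.min(total, next + maxSteps - 1);
        for (; next <= stop; next++) {
            sum += next;
        }
        // Save the state so that a resubmitted job can continue from here.
        Files.write(stateFile, (next + " " + sum).getBytes(StandardCharsets.UTF_8));
        return next > total;
    }

    public static void main(String[] args) throws IOException {
        boolean done = run(Paths.get("state.txt"), 1000000L, 250000L);
        System.out.println(done ? "finished" : "state saved, resubmit to continue");
    }
}
```

In a real job the step limit would more likely be a time budget, and the resubmitting script would check the output for the "finished" message.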
Non-independent jobs
Apart from using the Condor system only as a queuing tool for independent computing jobs, it is possible to realize more complex inter-dependencies between jobs.
If the order of jobs can be represented as a directed acyclic graph (DAG), condor_submit_dag (which hands the jobs to DAGMan) is the command of choice. See http://computing.ee.ethz.ch/sepp/condor-6.3.1-to.SEPP/examples/dagman/ for an example and the corresponding section of the manual.
Parallel Virtual Machine (PVM) and Message Passing Interface (MPI) are also supported to facilitate parallel jobs for C/C++ programs, but have not been used here at UH yet.
Java example
The Java program is very simple: it outputs the job number and the machine it ran on, e.g. "job 0 ran on machine: jknabe" for the command "java test 0". Here is the code:
public class test {
    public static void main(String[] args) {
        System.out.print("job " + args[0] + " ran on machine: ");
        try {
            java.net.InetAddress localMachine = java.net.InetAddress.getLocalHost();
            System.out.println(localMachine.getHostName());
        } catch (java.net.UnknownHostException uhe) {
            System.out.println("ERROR");
        }
    } // end main method
} // end test class
Now save the code as "test.java", compile it and put it into a jar archive. For Linux this script will do, for Windows do:
javac *.java
echo Main-Class: test > Manifest.txt
jar cmf Manifest.txt test.jar *.class

Condor wants jobs described in a special file; the example one should look like this:

# special environment for Java
universe = Java
# main jar file
executable = test.jar
# one might use other jar files
jar_files = test.jar
# we want result files back
should_transfer_files = YES
# only when finished
when_to_transfer_output = ON_EXIT
# standard output (what you usually see on the command line) goes to this file
output = test.out.part$(Process)
# error output (hopefully not needed...)
error = test.err.part$(Process)
log = condor.log
# should work without bothering us
notification = Never
# memory size
image_size = 60000
Rank = JavaMFlops
# NiceUser = True
Hold = False
# start the program like this, i.e. for 30 jobs you would run "test 0", "test 1", ..., "test 29"
arguments = test $(Process)
queue 30

Put this into "run.condor" and submit with the following:
condor_submit run.condor

You might additionally want to run:

condor_reschedule
condor_q

Now be patient (or check "condor_q -run" every 10 seconds :-). When all jobs are finished, have a look at the output files created by Condor. There you go!
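With 30 separate output files it can be handy to look at them in one go. The following sketch assumes the file naming used in the job description above (test.out.part0 to test.out.part29) and, again, a Java version with java.nio.file; the class name is made up for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Gathers the per-job output files written by Condor into one list of lines.
class CollectOutput {
    /**
     * Reads test.out.part0 .. test.out.part(jobs-1) from dir.
     * Jobs whose output file does not exist yet get a placeholder line.
     */
    static List<String> collect(Path dir, int jobs) throws IOException {
        List<String> lines = new ArrayList<String>();
        for (int i = 0; i < jobs; i++) {
            Path part = dir.resolve("test.out.part" + i);
            if (Files.exists(part)) {
                lines.addAll(Files.readAllLines(part, StandardCharsets.UTF_8));
            } else {
                lines.add("job " + i + ": no output yet");
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Run this in the directory from which the jobs were submitted.
        for (String line : collect(Paths.get("."), 30)) {
            System.out.println(line);
        }
    }
}
```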
Make sure that your code is compatible with the Java versions running on the machines in the cluster; to find out which versions are used type: condor_status -java
Another problem you might come across is an apparent bug in Condor that makes it think Java jobs are bigger than they actually are, so that they might get stuck. Periodically running this command will get them going again: condor_qedit $USER ImageSize 60