A question that often arises when approaching Condor from other batch systems is “How does Condor deal with resubmission of failed/preempted/killed jobs?”
The answer requires a slight shift in thinking.
Condor provides more functionality around the resubmission use case than most other schedulers. And the default policy is set up in such a way that most Condor folks don’t ever think about “resubmission.”
Condor will keep your job in the queue (managed by the condor_schedd) until the policy attached to the job says otherwise.
The default policy says a job will be run as many times as necessary for the job to terminate. So if the machine a job is running on crashes (or, more generally, becomes unavailable), the condor_schedd will automatically try to run the job on another machine.
When you start changing the default policy you can control things such as: whether a job should be removed after a period of time, either even if it is running or only if it hasn’t started running; whether a job should run multiple times even if it terminated cleanly; whether a termination with an error should make the job run again, be held in the queue for inspection, or be removed from the queue; whether a job held for inspection should be held forever or only for a specific amount of time; and whether a job should only start running at a specific time in the future, or be run at repeated intervals.
The condor_submit manual page can provide specifics.
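As a rough sketch of what a few of these policies can look like in a submit description file, consider the fragment below. The executable and log file names are placeholders, the time limits are arbitrary, and the expressions are only one way to write each policy. It also assumes the time() ClassAd function is available; older Condor releases commonly used the CurrentTime attribute in its place.

```
# Hypothetical job; my_job.sh and my_job.log are placeholder names.
universe   = vanilla
executable = my_job.sh
log        = my_job.log

# Remove the job if it is still idle (JobStatus == 1) more than a day
# after it was submitted.
periodic_remove = (JobStatus == 1) && (time() - QDate > 24 * 60 * 60)

# Let the job leave the queue only when it exits on its own with code 0.
# Any other termination leaves it in the queue to be run again.
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)

# Alternative to rerunning: hold the job for inspection on a non-zero exit.
# on_exit_hold = (ExitBySignal == False) && (ExitCode != 0)

# Release a held job once it has been held for an hour, instead of
# holding it forever.
# periodic_release = (time() - EnteredCurrentStatus) > 3600

queue
```

The two commented-out lines are alternatives rather than additions: holding a job on a non-zero exit and rerunning it on a non-zero exit are different policies, so a real submit file would normally pick one.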
