
Job Scheduling and Management

Colin Perkins
http://csperkins.org/teaching/2004-2005/gc5/
Lecture Outline

• Job Scheduling and Management Concepts


– Resource Discovery
– Job Allocation
– Job and Data Distribution
– Job Scheduling
– Middleware
• Implementations
– Condor, OpenPBS, Sun Grid Engine, Xgrid
– GRAM, Condor-G
• Tutorial
Job Scheduling and Management in a Grid

• Grid computing applications follow two popular patterns:


– Distributed Computing
• Exploring a large parameter space, repeating a task across a data set
– Manageable amounts of data, enormous need for computational cycles
– Master-worker model
– Embarrassingly parallel applications
• SETI@home, particle physics, bioinformatics, movie rendering
– Remote Resource Access
• Coordinating execution of a small number of tasks, running on
distributed resources managed by diverse organizations
• Remote instrument access (scanning electron microscopes, sensors, etc.),
combining large database queries
• Need grid-aware scheduling algorithms to coordinate execution of
jobs across multiple sites



The Job Scheduling Problem

• Job scheduling and management in a Grid is a wide-area scheduling
and resource management problem
– Discover remote resources
– Allocate jobs to resources
– Distribute the code and data
– Schedule jobs to run
– Collect and collate results
• Desirable to hide details of the local infrastructure at each site
from the wide-area scheduler
– Decouple implementations
• Use an hourglass model with middleware to isolate the wide-area
scheduler from the local schedulers

[Hourglass diagram: diverse applications → wide-area schedulers and
resource brokers → middleware → local schedulers → hardware, software]
Local Scheduling

• How to schedule multiple jobs to run in parallel across a set of hosts?
– Where to run jobs:
• Need for specific resources
• System load
• Batch vs. interactive scheduling
• Real-time vs. non-real-time response
– Co-allocating related jobs
– How to distribute jobs
– Failure tolerance
• Occurs within an organization:
– Can control the system environment
– Can make informed decisions
• Full knowledge of the system
– But often loosely coupled systems; not full control

[Flowchart: job(s) ready to execute → select eligible resources →
make scheduling decision → distribute code and data → execute jobs
(recover from failures) → retrieve results]
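
The flowchart above is essentially the main loop of any local scheduler.
A minimal Python sketch of that loop; the five callables (select, decide,
distribute, execute, retrieve) are hypothetical hooks standing in for the
subsystems a real scheduler would provide:

    import queue

    def scheduler_loop(jobs, select, decide, distribute, execute, retrieve):
        """Skeleton of the local scheduling loop in the flowchart above."""
        ready = queue.Queue()
        for job in jobs:
            ready.put(job)                 # job(s) ready to execute
        while not ready.empty():
            job = ready.get()
            eligible = select(job)         # select eligible resources
            if not eligible:
                ready.put(job)             # nothing suitable now; retry later
                continue
            host = decide(job, eligible)   # make scheduling decision
            distribute(job, host)          # distribute code and data
            if execute(job, host):         # execute job
                retrieve(job, host)        # retrieve results
            else:
                ready.put(job)             # recover from failure: requeue

A production scheduler would block or back off rather than spin when
nothing is eligible, and would run jobs concurrently rather than serially.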
Resource Selection

• Resources (hosts) vary in capacity, configuration, policy


• Jobs have different requirements
• When scheduling a job, need global knowledge to make an
informed choice
⇒ Resource broker architecture

[Diagram: a resource broker mediates between jobs and resources]
Resource Selection

• Resources advertise status and configuration to broker


– Is broker centralised or distributed?
– Do hosts push status information to the broker? Or does it poll periodically?
– How to scale to many resources?
– How to specify resource status? Simple statistics vs. complex specification
languages?

[Diagram: (1) resources advertise status to the broker]
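
In the push model, each resource periodically sends a small status report
to the broker. A minimal sketch, assuming a hypothetical broker listening
for JSON datagrams; the address, port and interval are made up:

    import json
    import os
    import socket
    import time

    BROKER = ("broker.example.org", 5000)   # hypothetical broker endpoint
    INTERVAL = 30                            # seconds between heartbeats

    def advertise_forever():
        """Periodically push this host's status to the broker over UDP."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        while True:
            report = {
                "host": socket.gethostname(),
                "load": os.getloadavg()[0],  # 1-minute load average (Unix)
                "time": time.time(),
            }
            sock.sendto(json.dumps(report).encode(), BROKER)
            time.sleep(INTERVAL)

A polling broker inverts this: it contacts each registered host on a
timer. Push scales better when resources vastly outnumber brokers, but
the broker must then age out stale reports from hosts that die silently.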
Resource Selection

• Job initiators solicit resources from broker


– How to contact the broker? What protocol?
– How to choose between multiple offers from distributed brokers?
– How to specify resource requirements? Domain specific language?
• Fixed (complete?) schema and matching rules
• Variable attributes and matching function
– How to specify dependencies on other jobs?
– Are advance reservations possible?
[Diagram: (1) resources advertise status; (2) a job solicits resources
from the broker]
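
The two specification styles in the bullets above can be contrasted in a
few lines. A sketch with invented attribute names; the second style is in
the spirit of Condor's ClassAds, where each side carries attributes plus
a requirements expression evaluated against the other side:

    # Style 1: fixed schema -- every resource reports the same fields,
    # and the matching rule is baked into the broker.
    job = {"min_mem_mb": 2048, "os": "linux"}

    def match_fixed(job, res):
        return res["mem_mb"] >= job["min_mem_mb"] and res["os"] == job["os"]

    # Style 2: variable attributes plus a matching function; matching is
    # symmetric, so resources can also reject jobs.
    job_ad = {
        "attrs": {"owner": "alice", "image_size_mb": 80},
        "requirements": lambda other: other.get("mem_mb", 0) >= 2048,
    }
    resource_ad = {
        "attrs": {"mem_mb": 4096, "os": "linux"},
        "requirements": lambda other: other.get("image_size_mb", 0) <= 500,
    }

    def match_flexible(a, b):
        return a["requirements"](b["attrs"]) and b["requirements"](a["attrs"])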


The Scheduling Decision

• Broker matches jobs to resources


– Based on most recent knowledge of resource status
• Set of eligible resources constrained by job requirements
– Hardware, software, configuration, policy
– Timing constraints for real-time jobs

[Diagram: (1) resources advertise status; (2) a job solicits resources;
(3) the broker matches the job with an eligible resource]
The Scheduling Decision

• Choice of resource from those eligible is driven by policy goals


– Minimize load on a particular resource
– Minimize execution time
– Minimize communication
– Balance load across resources
– Prefer to use certain resources
– Co-allocation of dependent jobs
– Etc.
[Diagram: (1) resources advertise status; (2) a job solicits resources;
(3) the broker matches the job with an eligible resource]
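
Each policy goal above can be expressed as a rank function over the
eligible set, so the broker's choice reduces to one line. A sketch, with
illustrative resource attributes:

    def choose(eligible, rank):
        """Pick the eligible resource the current policy ranks highest."""
        return max(eligible, key=rank, default=None)

    # Different policy goals become different rank functions:
    balance_load  = lambda r: -r["load"]   # minimise load on the chosen host
    minimise_time = lambda r: r["mips"]    # prefer the fastest eligible host
    prefer_site   = lambda r: (r["site"] == "glasgow", r["mips"])  # prefer certain resources
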
Distributing Jobs and Data

• Jobs need a predictable execution environment


– Hardware differences: virtual machine languages like Java, C#, etc.
– Surprisingly hard to make a binary that’ll execute across different versions
of an operating system; need standard environment
– How to ensure code and data integrity and confidentiality?
• Don’t necessarily trust the host on which a job is running…
• Trusted computing (a.k.a. Palladium) very helpful here!

• Jobs need to access data and store results


– Cannot assume a shared file system
– Does the scheduler automatically distribute input files and collect
results, or is this done manually as part of the job?
– Does the job have access to a file system? All or part of it?
• How does constraining access to the file system affect the ability to
load shared libraries? (e.g. the jail facility on FreeBSD)
Executing Jobs

• Hosts employ a range of scheduling algorithms


– Batch, interactive, real-time
• Hosts need resource reservation, for co-allocation
• Hosts may impose policy considerations
– E.g. only execute jobs when otherwise idle
• Hosts may require authentication, authorisation and accounting

• Execution environment may not trust the job


– Signed code
– Sandboxes and virtual machines
• Complete virtual machine architectures (e.g. Java)
• Partial virtualisation (e.g. virtual system calls in Condor)
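
One lightweight ingredient of sandboxing on Unix systems is to cap the
job's resource consumption before it starts. A sketch using Python's
standard resource module (Unix-only); the limits are arbitrary, and real
sandboxing also needs file system and network restrictions:

    import resource
    import subprocess

    def run_limited(argv, cpu_seconds=600, mem_bytes=512 * 1024 * 1024):
        """Run a job under hard caps on CPU time and address space."""
        def set_limits():
            # Runs in the child process between fork and exec.
            resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
            resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
        return subprocess.run(argv, preexec_fn=set_limits)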


Robustness and Fault Tolerance

• Jobs may fail due to programmer error or environmental issues


• Need to ensure consistent results in presence of failure
– Graceful failure of a system of dependent jobs
– Checkpoint and retry of jobs
• May require modification of applications…
– Atomic operations
– Two-phase commit protocols to ensure agreement (sketched below)
1. The commit manager assembles votes on the result: commit or abort.
Hosts that vote commit guarantee they can complete the action at a
later date
2. The commit manager decides on the basis of the votes, and propagates
commit or abort to the other nodes
– Exception handling
• Critical to understand failure and recovery modes
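
The two numbered steps above translate almost directly into code. A
minimal in-process sketch; a real implementation adds timeouts, durable
logs on every participant, and a recovery protocol for managers and
hosts that crash mid-protocol:

    def two_phase_commit(participants, log):
        """participants: objects with prepare(), commit(), abort() methods.
        prepare() returning True is the 'commit' vote, and promises the
        host can still complete the action at a later date."""
        # Phase 1: assemble the votes.
        votes = [p.prepare() for p in participants]
        decision = "commit" if all(votes) else "abort"
        log.append(decision)   # decision must be durable before phase 2
        # Phase 2: propagate the decision to every participant.
        for p in participants:
            if decision == "commit":
                p.commit()
            else:
                p.abort()
        return decision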


Policy Issues

• User policy issues:


– When to allow use of your workstation?
– How much to trust remote jobs?
– How much state of your host is visible to remote jobs?
– How much control do you have over jobs running on your host?

• Site policy issues:


– When to allow use of resources?
– How much to trust remote jobs?
– Are users allowed to see jobs executing on their host?
– How much of site state to expose to the outside world?
Axioms: distrust and privacy…


Local Scheduling Examples

• Condor
• OpenPBS
• Sun Grid Engine
• Xgrid
Local Scheduling: Condor

• http://www.cs.wisc.edu/condor/
• Network batch queuing system for clusters and cycle-stealing
– Jobs have dispatch priority; facilities for ordered jobs and master-worker
operation
– Scheduling performed using standard workstation scheduler
• Automatically distributes code, data and retrieves results
• Uses ClassAds to match jobs to resources
– Schema-free requirements specification language
• Robustness via checkpoint and recovery
– Requires re-linking against a modified libc
– Generally very robust in practice
• Limited security and sandboxing; trusted environment
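
For flavour, a small submit description of roughly this vintage, passed
to condor_submit. The file names are illustrative; OpSys, Memory and
Mips are standard machine-ad attributes, but consult the Condor manual
for authoritative syntax:

    # analyse.sub -- the standard universe runs the re-linked binary,
    # which enables remote system calls and checkpointing.
    universe     = standard
    executable   = analyse
    arguments    = input.dat
    requirements = (OpSys == "LINUX") && (Memory >= 512)
    # Among eligible machines, prefer faster ones.
    rank         = Mips
    output       = analyse.out
    error        = analyse.err
    log          = analyse.log
    queue

The requirements and rank lines are ClassAd expressions, matched
symmetrically against the machines' own advertisements.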


Local Scheduling: Other Systems

• OpenPBS http://www.openpbs.org/
• Sun Grid Engine http://gridengine.sunsource.net/
• Xgrid http://www.apple.com/acg/xgrid/

More limited systems… they generally perform job scheduling, leaving
job management, data distribution, and fault tolerance to the user.

In general:
• Security, authentication, authorisation and accounting neglected
• No real facilities for real-time jobs, resource reservation


• Limited robustness and distributed coordination facilities
Grid Scheduling

• Planning execution of jobs on particular sites
– Policy choice based on advertised capabilities of sites, contracts
– Not concerned with the management of resources within a particular site
– An issue of planning vs. scheduling

• Sites wish to retain their autonomy and hide details of their operation
– Grid scheduler operates with partial information, abstract system state

[Hourglass diagram: diverse applications → wide-area schedulers and
resource brokers → local schedulers → hardware, software]
Grid Scheduling

• Similar broker-based model to local scheduling


• But operates at a more abstract level
– Sites advertise capabilities and policies
– Brokers assign jobs to sites based on those capabilities and policies
– Brokers and job submitters plan execution, perhaps use contracts to
reserve resources
– Driven by economics and pricing models

[Diagram: a planner matches jobs to resources based on advertised
capabilities and policies]
The Need For Middleware

• Want to support:
– Site autonomy
– Heterogeneous substrates
– Varied and extensible site policies
– Co-allocation of resources across sites
– Online control of jobs
… with a uniform interface that decouples grid applications from
implementation details at a particular site

⇒ Middleware to hide the differences and expose common functionality

[Hourglass diagram: diverse applications → wide-area schedulers and
resource brokers → middleware → local schedulers → hardware, software]
Middleware: GRAM

“The Grid Resource Allocation and Management (GRAM)


service provides a single interface for requesting and using
remote system resources for the execution of jobs. The most
common use of GRAM is remote job submission and control. It is
designed to provide a uniform, flexible interface to job scheduling
systems.” [http://www-unix.globus.org/toolkit/docs/3.2/gram/]

• Abstracts local scheduling functions as middleware, to allow


uniform wide-area scheduling
– Exposes a limited subset of local scheduling functionality, primarily job
submission and management
– Resource brokers are not part of GRAM; must be built above it
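In pre-web-services GRAM, job requests are written in the Resource
Specification Language (RSL) and submitted through a site's gatekeeper.
A small illustrative example; the contact string and paths are made up:

    globusrun -r gatekeeper.example.org/jobmanager-pbs \
        '& (executable = /home/user/analyse)
           (arguments  = input.dat)
           (count      = 4)
           (maxTime    = 60)'

The jobmanager translates these generic attributes into the local
scheduler's own submission format: exactly the uniform-interface role
described in the quote above.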
Middleware: Condor-G

• Condor-G is a version of Condor that uses GRAM to assign jobs to
run queues
• Same interface as Condor, but schedules jobs using GRAM rather
than native Condor protocols
– Use Globus for authentication, remote program execution and data transfer
• Somewhat reduced functionality
– No matchmaking
– No automatic file transfer, staging
– No checkpointing for fault tolerance
Grid Scheduling: Fundamental Issues

• Effects of communication latency


– Embarrassingly parallel jobs in the master-worker model are effective
• Rendering
• Exhaustive search of parameter space
• Analysis of large data space
– Communication and coordination are very expensive in widely distributed
systems; a significant performance bottleneck for some applications
(quantified in the sketch after this slide)
• Co-scheduling jobs across organizational boundaries
– No central authority, so becomes a distributed coordination and resource
reservation problem
– Very difficult to guarantee if resources are scarce
• Effects of failure
– Likelihood of failure increases with scale


– Understand failure and recovery modes
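
The latency point can be made concrete with back-of-the-envelope
arithmetic: a worker's efficiency is roughly compute time divided by
compute plus communication time. A sketch with assumed numbers:

    def efficiency(compute_s, data_mb, bandwidth_mbit_s, rtt_s, round_trips=2):
        """Fraction of a worker's time spent computing, not communicating."""
        comm_s = data_mb * 8 / bandwidth_mbit_s + round_trips * rtt_s
        return compute_s / (compute_s + comm_s)

    # A 60 s task moving 10 MB over a 10 Mbit/s wide-area path, 100 ms RTT:
    print(efficiency(60, 10, 10, 0.1))  # ~0.88: fine for coarse-grained work
    # The same transfer around 1 s of compute:
    print(efficiency(1, 10, 10, 0.1))   # ~0.11: communication dominates

This is why embarrassingly parallel, coarse-grained tasks suit the grid,
while tightly coupled computations usually do not.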
Tutorial and References

• Tutorial on Thursday: discussion of job scheduling papers


– D. Thain, T. Tannenbaum & M. Livny, “Distributed Computing in
Practice: The Condor Experience”, to appear in Concurrency and
Computation: Practice and Experience, 2004.
– K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith
and S. Tuecke, “A Resource Management Architecture for Metacomputing
Systems”, Proceedings of the IPPS/SPDP '98 Workshop on Job Scheduling
Strategies for Parallel Processing, Orlando, FL, USA, March 1998.
– A. Andrieux, D. Berry, J. Garibaldi, S. Jarvis, J. MacLaren, D. Ouelhadj
and D. Snelling, “Open Issues in Grid Scheduling”, Proceedings of the
workshop held at the e-Science Institute, Edinburgh, October 2003.