{note:title=Work in progress}
This site is in the process of being reviewed and updated.
{note}

h3. Pre-Installation

h5. User virtualization (consistent username, UID, and GID values)

The username of the user submitting a job must be recognized on the compute host where the job runs, and each user must have unique and consistent UID/GID values.
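
As a quick sanity check, you can confirm that a user resolves to the same UID/GID everywhere; the hostnames and username below are placeholders:
{code}
# Hostnames and username are examples only -- substitute your own
for h in master node01 node02; do
    ssh $h 'id jdoe'
done
{code}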

h5. Creating the sgeadmin user account

Like ordinary user accounts, the sgeadmin account must exist on every host with the same username and consistent UID/GID values.
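
For example, one way to create the account with a fixed UID/GID on each host (the ID values here are arbitrary placeholders; pick values unused cluster-wide):
{code}
# Run on every host with identical ID values
groupadd -g 500 sgeadmin
useradd -u 500 -g sgeadmin -m -d /home/sgeadmin sgeadmin
{code}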

h5. Home directories

Grid Engine runs jobs in the user's home directory. Every user must have a home directory present on every compute host, containing all the desired dot-file configurations.

h5. Hostnames and DNS

Grid Engine relies on working DNS: both forward and reverse queries must resolve correctly for every host.
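
A quick way to check (the hostname and address are placeholders):
{code}
# Forward and reverse lookups should agree for every cluster host
host node01.example.com
host 192.168.1.101
{code}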

h5. On all hosts, edit /etc/services
{code}
sge_qmaster     536/tcp                         # Sun Grid Engine queue master
sge_execd       537/tcp                         # Sun Grid Engine exec daemon
{code}
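
After editing, a quick check confirms both entries are present:
{code}
grep sge_ /etc/services
{code}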

h5. Creating the SGE root directory and exporting it via NFS to all cluster nodes.

All compute farm members must share a common path to the SGE root, so be careful to ensure that the path to the Grid Engine files is the same on the master node as it is on the other servers and compute elements. This path is what is used globally as the SGE root directory. For example:
{code}
/opt/sge
{code}
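
For instance, the export on the master might look like this in /etc/exports (the subnet is a placeholder; no_root_squash matters, as discussed under shared filesystem options below):
{code}
# /etc/exports on the master -- export the SGE root to the cluster subnet
/opt/sge   192.168.1.0/24(rw,sync,no_root_squash)
{code}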

h5. On Execution Hosts, NFS mount the Grid Engine directory of the Master node, $SGE_ROOT
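
A typical /etc/fstab entry on each execution host might look like this; the server name "master" is a placeholder:
{code}
master:/opt/sge   /opt/sge   nfs   rw,hard,intr   0 0
{code}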

h5. On Submit hosts

Insert the following line into the system-wide or per-user .bashrc files so the Grid Engine environment is configured at login.
{code}
. /opt/sge/default/common/settings.sh
{code}
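
Once the cluster is installed and running, sourcing this file is what puts the Grid Engine commands in your environment; a minimal smoke test from a submit host (illustrative):
{code}
# Submit a trivial job from stdin, then check that it shows up
echo 'hostname' | qsub
qstat
{code}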

h5. Application and data files

The prolog and epilog script feature of Grid Engine provides a generic mechanism for implementing a site-specific stage-in/stage-out facility. Alternatively, these steps could be embedded into job scripts directly.
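
As an illustrative sketch only (the scratch layout and paths are assumptions, not part of this site's setup), a stage-out epilog could look like:
{code}
#!/bin/sh
# Hypothetical stage-out epilog, attached via the queue's epilog attribute
# (qconf -mq <queue>); $JOB_ID and $SGE_O_WORKDIR are set for the job by SGE
cp -r /scratch/$JOB_ID/results "$SGE_O_WORKDIR"/
rm -rf /scratch/$JOB_ID
{code}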

h5. Shared filesystem options

If you plan to install into a shared NFS filesystem, make sure the server is not mounting the filesystem with options that block the root user or remap the root UID value to a non-privileged value. Grid Engine can run as a non-root user, but it needs to be started by root. There are also setuid binaries in the distribution that will break if root-squashing is enabled.
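
One way to double-check from the NFS server (assuming the /opt/sge export shown earlier):
{code}
# no_root_squash should appear among the export options for the SGE root
exportfs -v | grep sge
{code}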

h5. Classic Spooling vs. Berkeley DB Spooling

If you are just starting out with Grid Engine, use classic spooling. If your cluster is less than 20 nodes in size, use classic spooling. Once you have the system up and running for a while, you'll easily be able to tell if your standard sorts of workloads and workflows are being affected by spool performance. By that time, you'll be comfortable enough with Grid Engine that you'll have no trouble backing up your configuration and reinstalling with Berkeley DB spooling enabled.

h5. The automatic install scripts are not worth dealing with on small clusters

For clusters smaller than 30 nodes in size (where I already have passwordless SSH access set up) it is actually quicker to manually log into each node and invoke the "./install_execd" script by hand.

h3. Qmaster Installation

h5. Unpacking and initial setup
{code}
[root@host ~]# SGE_ROOT=/opt/sge; export SGE_ROOT
[root@host ~]# cd ${SGE_ROOT}
[root@host ~]# gzip -dc sge-6.0u8-common.tar.gz | tar xvpf -
[root@host ~]# gzip -dc sge-6.0u8-bin-lx24-x86.tar.gz | tar xvpf -
[root@host ~]# gzip -dc sge-6.0u8-bin-lx24-amd64.tar.gz | tar xvpf -
[root@host ~]# util/setfileperm.sh $SGE_ROOT
{code}

h5. Create a db spool dir and start the installation on the master host
{code}
[root@host ~]# export SGE_ROOT=/opt/sge
[root@host ~]# mkdir -p /var/spool/sge
[root@host ~]# chown -R sgeadmin /var/spool/sge
[root@host ~]# cd $SGE_ROOT
[root@host ~]# ./install_qmaster
{code}

h5. Accept defaults except
* User name to install as sgeadmin
* Grid Engine group id range of 20000-20200
* <administrator_mail> set to sgeadmin@example.com
* Adding admin and submit hosts set to server1 server2 server3
* Do you want to add your shadow host(s) now? (y/n) \[y] >> n

h3. Execution Host Installation

h5. Add execution hosts as administrative hosts

All execution hosts must be administrative hosts during their installation.  You may verify your administrative hosts with the command
{code}
[root@host ~]# qconf -sh
{code}

and you may add new administrative hosts on the master host with the command
{code}
[root@host ~]# qconf -ah <hostname>
{code}

h5. Create spooling directories on each execution host:
{code}
[root@host ~]# mkdir -p /var/spool/sge
[root@host ~]# chown sgeadmin /var/spool/sge
{code}

h5. Run the installer script in auto-install mode

The install_execd script accepts options that install the exec daemon with default settings, without interactive input, and optionally without creating the default queue.
{code}
[root@host ~]# export SGE_ROOT=/opt/sge
[root@host ~]# cd ${SGE_ROOT}
[root@host ~]# ./install_execd -auto -fast [-noqueue]
{code}

h5. Run the installer script in interactive mode
{code}
[root@host ~]# export SGE_ROOT=/opt/sge
[root@host ~]# cd ${SGE_ROOT}
[root@host ~]# ./install_execd
{code}

h5. Accept defaults except
# Do you want to configure a local spool directory for this host (y/n) \[n] >> y
# Enter path /var/spool/sge

When the install script is done, Grid Engine should be installed and running. Run
{code}
[root@host ~]# qstat -f
{code}

and you should see an entry for all.q@hostname. If so, everything is set up.

h3. Troubleshooting

h5. Reinstallation

BEFORE you reinstall the server for any reason, you MUST stop the execution host daemons. Then, after the install, you must reinstall the execution hosts.
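
For example, the exec daemons can be stopped either via the init script on each node or centrally from the qmaster (the init script name may vary by release; these are typical invocations):
{code}
# On each execution host:
/etc/init.d/sgeexecd stop
# ...or, from the qmaster, for each execution host:
qconf -ke <hostname>
{code}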

h5. Grid Engine messages

Grid Engine messages can be found at:
* /tmp/qmaster_messages (during qmaster startup)
* /tmp/execd_messages (during execution daemon startup)

After startup, the daemons log their messages in their spool directories:
* Qmaster: /var/spool/qmaster/messages
* Exec daemon: <execd_spool_dir>/<hostname>/messages

h5. Queue error states

If a queue enters an error state, the queue must be reset before further jobs will be scheduled on that queue. To reset a queue, become sgeadmin on the qmaster and run the command

{code}
[root@host ~]# qmod -cq <queuename>
{code}
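
Before clearing the error it can help to see why the queue instance failed; in SGE 6.x, qstat can report the reason for the (E)rror state:
{code}
[root@host ~]# qstat -f -explain E
{code}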

h5. For NFS-mounted spool dirs, ensure a spool dir exists and permissions are set
{code}
[root@host ~]# mkdir <SGE_CELL>/spool/<HOSTNAME>
[root@host ~]# chown sgeadmin.root <SGE_CELL>/spool/<HOSTNAME>/
{code}

h3. Resources
# [Department-Based Resource Allocation within Grid Engine|http://bioteam.net/dag/sge6-funct-share-dept.html]
# [File-Staging approaches in Grid Engine|http://gridengine.sunsource.net/howto/filestaging/]
# [Delegated File Staging with GridEngine|http://gridengine.sunsource.net/howto/filestaging/filestaging6.html]
# [Sun's Compute Server technology|https://computeserver.developer.network.com/] aims to enable Java developers to easily and efficiently use the Sun Grid Compute Utility as a platform for the distributed execution of parallel computations.
# [GridEngine Documents and Binaries|http://gridengine.sunsource.net/servlets/ProjectDocumentList]
# [DRMAA Java API|http://gridengine.sunsource.net/nonav/source/browse/~checkout~/gridengine/doc/javadocs/index.html?content-type=text/html]
