After installation type Y to agree to append necessary lines into your `.bashrc`.
`conda install jupyterlab`
## Slurm Workload Manager
### Terminal multiplexer / screen / tmux
Tmux (or any other terminal multiplexer) lets you run a shell session on the Cassio (or any other) node that you can detach from and reattach to at any time *without losing state.*
Learn more about it here: [https://github.com/tmux/tmux/wiki](https://github.com/tmux/tmux/wiki)
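For example, a minimal tmux workflow looks like the following (the session name `work` is just an illustration):

```shell
# start a named session on the node
tmux new -s work

# ...run your long-lived commands inside...

# detach: press Ctrl-b, then d
# later, from a fresh ssh connection, list sessions and reattach:
tmux ls
tmux attach -t work
```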
### Slurm quotas
1. Hot season (conference deadlines, dense queue)
2. Cold season (summer, sparse queue)
### Cassio node
`cassio.cs.nyu.edu` is the head node from which you can submit jobs or request resources.

**Do not run any intensive jobs on the Cassio node.**
Popular job management commands:
`sinfo --Node --format="%.8N %.8T %.4c %.10m %.20f %.30G"` - shows all available nodes with their installed GPUs. **Note the features column – it allows you to specify the desired GPU.**

`squeue -u ${USER}` - shows the state of your jobs in the queue.

`scancel <jobid>` - cancels the job with the specified id. You can only cancel your own jobs.

`scancel -u ${USER}` - cancels *all* your current jobs; use this one very carefully.

`scancel --name myJobName` - cancels a job by its name.

`scontrol hold <jobid>` - holds a pending job from being scheduled. This may be helpful if you notice that some data/code/files are not yet ready for that job.

`scontrol release <jobid>` - releases the job from hold.

`scontrol requeue <jobid>` - cancels and resubmits the job.
### Running the interactive job
By an interactive job we mean a shell on a machine (possibly with a GPU) where you can interactively run and debug code, or run software such as JupyterLab or TensorBoard.
To request a machine and connect to it as soon as Slurm assigns it, run:

* `--qos=interactive` gives your job a special QoS labeled `interactive`. In our case this means your time limit will be longer than for a usual job (7 days?), but there is a maximum of 2 jobs per user with this QoS.

* `--mem 16G` sets the upper limit of RAM you expect your job to use. The machine will show all of its RAM, but Slurm kills the job if it exceeds the requested amount. **Do not request the maximum possible RAM here; this may decrease your priority over time.** Instead, try to estimate a reasonable amount.

* `--gres=gpu:1` sets the number of GPUs you will see on the requested instance. You get no GPUs if you omit this argument.

* `--constraint=gpu_12gb` selects nodes by feature; each node is assigned features according to the kind of GPU it has. Use the `sinfo` command above to list all nodes with their features. Features may be combined with the logical OR operator, e.g. `gpu_12gb|gpu_6gb`.

* `--pty bash` means that after connecting to the instance you will be given a bash shell.

If you need to, you may remove the `--qos` argument and run as many interactive jobs as you wish.
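Putting these flags together, an interactive request could look like the following sketch (assuming `srun` is used for interactive allocation, as is common on Slurm clusters; the memory size and feature name are illustrative):

```shell
# request 1 GPU with >=12GB memory, 16G RAM, interactive QoS, and a bash shell
srun --qos=interactive --mem 16G --gres=gpu:1 --constraint=gpu_12gb --pty bash
```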
#### Port forwarding from the client to Cassio node
As an example of port forwarding, we will launch JupyterLab from an interactive GPU job shell and connect to it from the client's browser.
1. Start an interactive job (you may exclude the GPU to get a machine faster if your priority is currently low):
Note the hostname of the machine you got, e.g. `lion4` (you will need it for port forwarding).
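If you are unsure which host you landed on, either of these works (the `hostname` call runs inside the job shell; the `squeue` variant runs from the head node):

```shell
# inside the interactive job:
hostname -s

# or, from the cassio head node (the last column is the node list):
squeue -u ${USER} -o "%.10i %.20j %N"
```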
2. Activate the conda environment with JupyterLab installed:
`conda activate tutorial`
3. Start JupyterLab:
`jupyter lab --no-browser --port <port>`
Explanation:
* `--no-browser` means it will not launch the default OS browser (you don't want a CLI browser on the server).

* `--port <port>` sets the port JupyterLab will listen on for requests. We usually choose a four-digit number to make sure we do not pick a reserved port such as 80 or 443.
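To reduce the chance of colliding with someone else's port on a shared node, you can also pick a random high port, e.g. (a sketch using GNU coreutils `shuf`):

```shell
# pick a random unprivileged port in [10000, 65000]
PORT=$(shuf -i 10000-65000 -n 1)
echo "$PORT"
# then: jupyter lab --no-browser --port "$PORT"
```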
4. Open another tab in your terminal client and run:
`ssh -L <port>:localhost:<port> -J cims <interactive_job_hostname> -N` (the job hostname may be short, e.g. `lion4`)
Explanation:
* `-L <port>:localhost:<port>` specifies that the given port on the local (client) host is to be forwarded to the given host and port on the remote side.

* `-J cims <other_host>` means jump over `cims` to the other host. This uses your ssh config to resolve what `cims` means.

* `-N` means no shell will be given upon connection; only the tunnel will be started.
5. Go to your browser and open `localhost:<port>`. You should see the JupyterLab page. It may ask you for a security token: get it from the stdout of the interactive job instance.
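If the token has scrolled out of view, you can also list running servers and their tokens from the same environment (JupyterLab 3+; older versions use `jupyter notebook list`):

```shell
# prints the URL of each running server, including its ?token=... query string
jupyter server list
```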
**Disclaimer:** there are many other ways to set this up: one may use an ssh SOCKS proxy, initialize the tunnel from the interactive job itself, etc. All of these methods are fine if they work for you.
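For instance, the variant started from the job itself could be sketched as a two-hop tunnel through the gateway (an assumption, not the documented setup; it relies on your ssh config defining `cims` as in the `-J` example above):

```shell
# from the interactive job's shell: publish the port on the cims gateway
ssh -R <port>:localhost:<port> cims -N &

# from your client: forward the same port from cims to your machine
ssh -L <port>:localhost:<port> cims -N
```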