
Commit 75ee1e3

committed
cassio more
1 parent b83898c commit 75ee1e3

File tree

3 files changed: +148 −0 lines changed


cassio/README.md

Lines changed: 102 additions & 0 deletions
@@ -111,3 +111,105 @@ After installation type Y to agree to append necessary lines into your `.bashrc`
`conda install jupyterlab`

## Slurm Workload Manager

### Terminal multiplexer / screen / tmux

Tmux (or any other terminal multiplexer) lets you run a shell session on the Cassio node (or any other node) that you can detach from and reattach to at any later time *without losing its state.*

Learn more about it here: [https://github.com/tmux/tmux/wiki](https://github.com/tmux/tmux/wiki)
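
A minimal tmux workflow looks like this (the session name `train` is just an example):

```bash
# start a named session on the node
tmux new -s train

# ...run your commands, then detach with Ctrl-b followed by d...
# the session keeps running after you disconnect

# later: list sessions and reattach
tmux ls
tmux attach -t train
```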

### Slurm quotas

1. Hot season (conference deadlines, dense queue)

2. Cold season (summer, sparse queue)

### Cassio node

`cassio.cs.nyu.edu` is the head node, from which you can submit a job or request resources.

**Do not run any intensive jobs on the cassio node.**

Popular job management commands:

`sinfo --Node --format="%.8N %.8T %.4c %.10m %.20f %.30G"` - shows all available nodes with the corresponding GPUs installed. **Note the features column – it allows you to specify the desired GPU** (a quick filtering example follows this list).

`squeue -u ${USER}` - shows the state of your jobs in the queue.

`scancel <jobid>` - cancels the job with the specified id. You can only cancel your own jobs.

`scancel -u ${USER}` - cancels *all* your current jobs; use this one very carefully.

`scancel --name myJobName` - cancels a job given its job name.

`scontrol hold <jobid>` - holds a pending job back from being scheduled. This may be helpful if you notice that some data/code/files are not ready yet for that particular job.

`scontrol release <jobid>` - releases the job from hold.

`scontrol requeue <jobid>` - cancels and resubmits the job.
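
For instance, to find the nodes that carry a given feature, you can filter the `sinfo` output (a rough sketch, using the `gpu_12gb` feature from the examples below):

```bash
# node name, state, and features; keep only nodes advertising gpu_12gb
sinfo --Node --format="%.8N %.8T %.20f" | grep gpu_12gb
```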

### Running the interactive job

By an interactive job we mean a shell on a machine (possibly with a GPU) where you can interactively run/debug code or run software such as JupyterLab or TensorBoard.

To request a machine and connect to it as soon as Slurm assigns one, run:

`srun --qos=interactive --mem 16G --gres=gpu:1 --constraint=gpu_12gb --pty bash`

Explanation:

* `--qos=interactive` means your job gets the special QoS labeled 'interactive'. In our case this means the time limit is longer than for a usual job (7 days?), but there is a maximum of 2 jobs per user with this QoS.

* `--mem 16G` is the upper limit of RAM you expect your job to use. The machine will show all of its RAM, but Slurm kills the job if it exceeds the requested amount. **Do not set the maximum possible RAM here; this may decrease your priority over time.** Instead, try to estimate a reasonable amount.

* `--gres=gpu:1` is the number of GPUs you will see in the requested instance. You get no GPUs if you omit this argument.

* `--constraint=gpu_12gb` restricts the job to nodes with this feature; each node is assigned features according to the kind of GPU it has. Check the `sinfo` command above to list all nodes with their features. Features may be combined with the logical OR operator, e.g. `gpu_12gb|gpu_6gb` (see the example right after this list).

* `--pty bash` means that after connecting to the instance you will be given a bash shell.

You may remove the `--qos` argument and run as many interactive jobs as you wish, if you need that.
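
Putting these options together, a hedged example of a request that accepts either GPU type and uses the default QoS (the values are placeholders from this tutorial; adjust them to your needs):

```bash
# quote the constraint so the shell does not treat | as a pipe
srun --mem 16G --gres=gpu:1 --constraint="gpu_12gb|gpu_6gb" --pty bash
```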

#### Port forwarding from the client to Cassio node

As an example of port forwarding, we will launch JupyterLab from an interactive GPU job shell and connect to it from the client's browser.

1. Start an interactive job (you may exclude the GPU to get one faster if your priority is low at the moment):

`srun --qos=interactive --mem 16G --gres=gpu:1 --constraint=gpu_12gb --pty bash`

Note the hostname of the machine you got, e.g. lion4 (it will be needed for port forwarding).

2. Activate the conda environment with JupyterLab installed:

`conda activate tutorial`

3. Start JupyterLab:

`jupyter lab --no-browser --port <port>`

Explanation:

* `--no-browser` means it will not invoke the default OS browser (you don't want a CLI browser).

* `--port <port>` is the port JupyterLab will listen on. We usually choose some 4-digit number to make sure we do not pick a reserved port like 80 or 443.

4. Open another tab in your terminal client and run:

`ssh -L <port>:localhost:<port> -J cims <interactive_job_hostname> -N` (the job hostname may be short, e.g. lion4)

Explanation:

* `-L <port>:localhost:<port>` specifies that the given port on the local (client) host is to be forwarded to the given host and port on the remote side.

* `-J cims <other host>` means jump over cims to the other host. This uses your ssh config to resolve what cims means (an example config sketch is shown after step 5).

* `-N` means no shell will be given upon connection; only the tunnel will be started.

5. Go to your browser and open `localhost:<port>`. You should see the JupyterLab page. It may ask you for a security token: get it from the stdout of the interactive job instance.
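
For reference, a possible `~/.ssh/config` entry on the client that makes the `cims` alias resolve. The gateway host name `access.cims.nyu.edu` and the placeholder user name are assumptions here; adjust them to your own account:

```
# ~/.ssh/config on the client machine (sketch)
# the gateway host name below is an assumption
Host cims
    HostName access.cims.nyu.edu
    User <your_cims_username>
```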

**Disclaimer:** there are many other ways to set this up: one may use an ssh SOCKS proxy, initialize the tunnel from the interactive job itself, etc. Any of these methods is fine if it works for you.

### Submitting a batch job

TODO
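
A minimal sketch, assuming the `gpu_job.slurm` script added in this commit: submit it with `sbatch` and then watch it in the queue with `squeue`.

```bash
# submit the batch script, then check its state in the queue
sbatch gpu_job.slurm
squeue -u ${USER}
```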

cassio/gpu_job.slurm

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
#!/bin/bash
#SBATCH --job-name=job_wgpu
# append to the output files instead of truncating them (useful on requeue)
#SBATCH --open-mode=append
# %j expands to the job id, %x to the job name
#SBATCH --output=./%j_%x.out
#SBATCH --error=./%j_%x.err
# export the submitting shell's environment to the job
#SBATCH --export=ALL
#SBATCH --time=00:10:00
# one GPU on a node with the gpu_12gb feature, 64 GB of RAM, 4 CPU cores
#SBATCH --gres=gpu:1
#SBATCH --constraint=gpu_12gb
#SBATCH --mem=64G
#SBATCH -c 4

python ./test_gpu.py

cassio/test_gpu.py

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
import torch
import time

if __name__ == '__main__':

    print(f"Torch cuda available: {torch.cuda.is_available()}")
    print(f"GPU name: {torch.cuda.get_device_name()}\n\n")

    t1 = torch.randn(100, 1000)
    t2 = torch.randn(1000, 10000)

    # time 100 matrix multiplications on the CPU
    cpu_start = time.time()

    for i in range(100):
        t = t1 @ t2

    cpu_end = time.time()

    print(f"CPU matmul elapsed: {cpu_end - cpu_start} sec.")

    # move the operands to the GPU and repeat the measurement
    t1 = t1.to('cuda')
    t2 = t2.to('cuda')

    gpu_start = time.time()

    for i in range(100):
        t = t1 @ t2

    # CUDA kernels run asynchronously, so wait for them to finish
    # before reading the clock; otherwise the GPU time is underestimated
    torch.cuda.synchronize()
    gpu_end = time.time()

    print(f"GPU matmul elapsed: {gpu_end - gpu_start} sec.")
