
Commit f57c6d2: prince draft
1 parent fc254e1 commit f57c6d2

File tree: 1 file changed, +100 -0 lines changed

prince/README.md

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
# NYU HPC Prince cluster

Troubleshooting: [email protected]

NYU HPC website: [https://sites.google.com/a/nyu.edu/nyu-hpc/](https://sites.google.com/a/nyu.edu/nyu-hpc/)

One can find many useful tips under *DOCUMENTATION / GUIDES*.

In this part we will cover the following aspects of our cluster:

* Connecting to Prince via CIMS Access
* Prince computing nodes
* Prince filesystems
* Software management
* Slurm Workload Manager

## Connecting to Prince via CIMS Access (without bastion HPC nodes)

The HPC login nodes are reachable from the CIMS access node, i.e. there is no need to log in through another firewalled bastion HPC node. One can find the corresponding `prince` block in the ssh config file (check the client folder in this repo).
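
A minimal sketch of what such a `prince` block might look like (the hostnames, the `cims` alias, and the usernames below are placeholders, not the actual values from the client folder):

```
# Hypothetical ~/.ssh/config sketch; replace hostnames and usernames with the real ones.
Host cims
    HostName access1.cims.nyu.edu
    User <cims_username>

Host prince
    HostName prince.hpc.nyu.edu
    User <hpc_username>
    # Jump through the CIMS access node, so no bastion HPC node is needed.
    ProxyJump cims
```
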
## Prince computing nodes

Learn more here: [https://sites.google.com/a/nyu.edu/nyu-hpc/systems/prince](https://sites.google.com/a/nyu.edu/nyu-hpc/systems/prince)

## Prince filesystems

Learn more here: [https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/data-management/prince-data](https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/data-management/prince-data)

## Software management

For general libraries/software there is an environment modules package similar to the one on Cassio.

Managing **conda** on Prince is the same as on Cassio except for the installation path.

Learn more here: [https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/prince/packages/conda-environments](https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/prince/packages/conda-environments)
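
For example, a typical session might look like the sketch below (the module name and the environment path are assumptions; check `module avail` for the real names on Prince):

```bash
# List available modules and load an Anaconda module (name is illustrative).
module avail
module load anaconda3

# Create and activate a conda environment under your own space;
# only the installation path differs from Cassio (path is a placeholder).
conda create --prefix /scratch/<netid>/conda/myenv python=3.8
conda activate /scratch/<netid>/conda/myenv
```
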
## Slurm Workload Manager

In general, Slurm behaves similarly on Cassio and Prince; however, there are differences in quotas and in how GPU-equipped nodes are distributed:

1. **There is no interactive QoS on Prince.** Instead, run `srun --pty bash` with additional args to get an interactive job allocated (see the sketch after the table below).

2. **There is no `--constraint` arg to specify a GPU you want.**

Instead, GPU nodes are separated into different **partitions** w.r.t. GPU type. Right now the following partitions are available: `--partition=k80_4,k80_8,p40_4,p100_4,v100_sxm2_4,v100_pci_2,dgx1,p1080_4`. Slurm will try to allocate nodes in the order given by this line.

|GPUs (count and type)|Memory|
|---------------------|------|
|96 P40|24G|
|32 P100|16G|
|50 K80|24G bridged over 2 GPUs, or 12G each|
|26 V100|16G|
|16 P1080|8G|
|DGX1|16G ?|
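
Putting these two points together, an interactive GPU allocation might look like the following sketch (partition, memory, CPU, and time values are just examples):

```bash
# Request an interactive shell on a GPU node; all resource values are examples.
srun --partition=p40_4,p100_4 \
     --gres=gpu:1 \
     --mem=32G \
     -c 4 \
     --time=04:00:00 \
     --pty bash
```
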
### Port forwarding to interactive job

If one follows exactly the same steps as for Cassio and runs:

`ssh -L <port>:localhost:<port> -J prince <hpc_username>@<interactive_host>`

then the following error may be returned:

```
channel 0: open failed: administratively prohibited: open failed
stdio forwarding failed
kex_exchange_identification: Connection closed by remote host
```

which means that the jump to your instance was not successful. To avoid this jump, we make a tunnel which forwards the connection to the machine itself rather than to localhost:

`ssh -L <port>:<interactive_host>:<port> prince -N`

**Important: you must run JupyterLab, or any other software, so that it accepts requests from all IP addresses rather than from localhost only (which is usually the default).** To make this change in Jupyter, add the `--ip 0.0.0.0` arg:

`jupyter lab --no-browser --port <port> --ip 0.0.0.0`

Now you should be able to open the JupyterLab tab in your browser.
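
To recap the whole flow in one sketch (the port number is an example; `<interactive_host>` is the node where your interactive job runs):

```bash
# 1. On the interactive node: start JupyterLab listening on all interfaces.
jupyter lab --no-browser --port 8888 --ip 0.0.0.0

# 2. On your local machine: forward the port to the interactive node itself.
ssh -L 8888:<interactive_host>:8888 prince -N

# 3. In your local browser, open http://localhost:8888 and use the token
#    printed by JupyterLab.
```
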
### Submitting a batch job

As noted before, one particular difference from Cassio is GPU allocation (note `--partition` below):

```bash
#!/bin/bash
#SBATCH --job-name=<JOB_NAME>
#SBATCH --open-mode=append
#SBATCH --output=<OUTPUT_FILENAME>
#SBATCH --error=<ERR_FILENAME>
#SBATCH --export=ALL
#SBATCH --time=24:00:00
#SBATCH --partition=p40_4,p100_4,v100_sxm2_4,dgx1
#SBATCH --gres=gpu:1
#SBATCH --mem=64G
#SBATCH -c 4
```
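
The header above is only the resource request; your actual commands go after the `#SBATCH` lines. A usage sketch (file, environment, and script names are placeholders; the Slurm commands themselves are standard):

```bash
# Append your workload to the end of the script, e.g.:
#   conda activate /scratch/<netid>/conda/myenv
#   python train.py

# Submit the job and monitor it.
sbatch job.sbatch
squeue -u $USER        # check your queued/running jobs
scancel <job_id>       # cancel a job if needed
```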
