# NYU HPC Prince cluster

Troubleshooting: [email protected]

NYU HPC website: [https://sites.google.com/a/nyu.edu/nyu-hpc/](https://sites.google.com/a/nyu.edu/nyu-hpc/)

One can find many useful tips under *DOCUMENTATION / GUIDES*.

In this part we will cover the following aspects of our own cluster:

* Connecting to Prince via CIMS Access

* Prince computing nodes

* Prince filesystems

* Software management

* Slurm Workload Manager

## Connecting to Prince via CIMS Access (without bastion HPC nodes)

HPC login nodes are reachable directly from the CIMS access node, i.e. there is no need to log in through another firewalled bastion HPC node. One can find the corresponding `prince` block in the SSH config file (see the client folder in this repo).
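
A minimal sketch of what such a block might look like is below; the hostnames and the `cims` jump host are assumptions, so treat the file in the client folder as the authoritative version.

```
# ~/.ssh/config -- illustrative sketch only
Host cims
    HostName access1.cims.nyu.edu
    User <cims_username>

Host prince
    HostName prince.hpc.nyu.edu
    User <hpc_username>
    ProxyJump cims
```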

## Prince computing nodes

Learn more here: [https://sites.google.com/a/nyu.edu/nyu-hpc/systems/prince](https://sites.google.com/a/nyu.edu/nyu-hpc/systems/prince)

## Prince filesystems

Learn more here: [https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/data-management/prince-data](https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/data-management/prince-data)

## Software management

As on Cassio, there is an environment modules package for general libraries/software.
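
For example (the module name and version below are illustrative; check what is actually installed with `module avail`):

```bash
module avail               # list modules available on the cluster
module load cuda/10.1.105  # load a specific module (example name/version)
module list                # show currently loaded modules
module purge               # unload everything when done
```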

There is no difference between managing **conda** on Cassio and on Prince, except for the installation path.

Learn more here: [https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/prince/packages/conda-environments](https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/prince/packages/conda-environments)
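
A hedged sketch of a typical workflow (the module name and the `/scratch` path are assumptions; follow the link above for the actual setup):

```bash
# Load anaconda first (module name is an assumption -- check `module avail anaconda`)
module load anaconda3
# Create the environment under /scratch rather than under the small home quota (path is illustrative)
conda create --prefix /scratch/<net_id>/envs/myenv python=3.7
conda activate /scratch/<net_id>/envs/myenv
```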

## Slurm Workload Manager

In general, Slurm behaves similarly on both Cassio and Prince; however, there are differences in quotas and in how GPU-equipped nodes are distributed:

1. **There is no interactive QoS on Prince.** Instead, run `srun --pty bash` with additional arguments to get an interactive job allocated (see the example after the table below).

2. **There is no `--constraint` arg to specify a GPU you want.**

    Instead, GPU nodes are separated into different **partitions** by GPU type. Right now the following partitions are available: `--partition=k80_4,k80_8,p40_4,p100_4,v100_sxm2_4,v100_pci_2,dgx1,p1080_4`. Slurm will try to allocate nodes in the order given in this list.

    |GPU (count and model)|Memory|
    |---|------|
    |96 P40|24G|
    |32 P100|16G|
    |50 K80|24G bridged over 2 GPUs or 12G each|
    |26 V100|16G|
    |16 P1080|8G|
    |DGX1|16G ?|
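
For example, a minimal interactive GPU request might look like the line below (the partition list, resource sizes and time limit are purely illustrative):

```bash
# Ask for 1 GPU, 4 CPU cores and 16G of RAM for 2 hours; Slurm tries the listed partitions in order
srun --partition=p40_4,p100_4 --gres=gpu:1 -c 4 --mem=16G --time=2:00:00 --pty bash
```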

### Port forwarding to an interactive job

If one follows exactly the same steps as for Cassio and runs:

`ssh -L <port>:localhost:<port> -J prince <hpc_username>@<interactive_host>`

then the following error may be returned:

```
channel 0: open failed: administratively prohibited: open failed
stdio forwarding failed
kex_exchange_identification: Connection closed by remote host
```

which means that the jump to your instance was not successful. To avoid this jump, we make a tunnel that forwards the connection to the interactive host itself rather than to localhost:

`ssh -L <port>:<interactive_host>:<port> prince -N`

**Important: you must run JupyterLab (or any other software) so that it accepts requests from all IP addresses, not only from localhost (which is usually the default).** To make this change in Jupyter, add the `--ip 0.0.0.0` arg:

`jupyter lab --no-browser --port <port> --ip 0.0.0.0`
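
Putting the two halves together (the port number is just an example; the compute node name can be taken from `squeue -u $USER`):

```bash
# On the compute node, inside the interactive job:
jupyter lab --no-browser --port 8888 --ip 0.0.0.0

# On the local machine, in a separate terminal:
ssh -L 8888:<interactive_host>:8888 prince -N
# then point the browser at http://localhost:8888
```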

Now you should be able to open the JupyterLab tab in your browser.

### Submitting a batch job

As noted before, one particular difference from Cassio is GPU allocation (note `--partition` below):

```bash
#!/bin/bash
#SBATCH --job-name=<JOB_NAME>
#SBATCH --open-mode=append
#SBATCH --output=<OUTPUT_FILENAME>
#SBATCH --error=<ERR_FILENAME>
#SBATCH --export=ALL
#SBATCH --time=24:00:00
# GPU partitions are tried in the order they are listed (there is no --constraint on Prince)
#SBATCH --partition=p40_4,p100_4,v100_sxm2_4,dgx1
#SBATCH --gres=gpu:1
#SBATCH --mem=64G
#SBATCH -c 4
```
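
Once the actual commands (e.g. activating an environment and launching a training script) are appended below the `#SBATCH` directives, the script can be submitted and monitored as usual (the file name is a placeholder):

```bash
sbatch <SCRIPT_NAME>.sbatch   # submit the batch job
squeue -u $USER               # check its state in the queue
```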