MSR GPU Servers
Servers
Hardware
Lambda Labs Workstations with:
- AMD Ryzen Threadripper PRO 5995WX with 64 Cores
- 1 TB of RAM
- Two NVIDIA RTX 6000 Ada GPUs with 48 GB of memory each
- Disks: one 4 TB and one 14 TB NVMe drive
Access
- The servers maintained by MSR are accessible only on the Northwestern Network
- NUIT does not maintain and is unable to help you with these servers (even though you initially access them via your netID).
- For off-campus access, users must connect to the Northwestern network via VPN.
- The servers are available at:
  - `lamb.mech.northwestern.edu`
  - `sheep.mech.northwestern.edu`
- You will need your `netid` (e.g., `nuit1337`) and its associated password, as provided by NUIT.
- Use `ssh-copy-id netid@server` to copy the ssh key to the server you want access to.
  - Your password is your netID password.
  - For example, `ssh-copy-id nuit1337@lamb.mech.northwestern.edu` to gain access to `lamb`.
- After running the above command you should be able to `ssh netid@server` and not be prompted for your password (e.g., `ssh nuit1337@lamb.mech.northwestern.edu`). If you are prompted for a password:
  - If your `netID` password grants access, then the previous command failed, likely because you have not set up an ssh key (these were set up during the hackathon).
  - If your `ssh_key` password grants access, then you have not loaded the key into the `ssh-agent`. Either review these instructions or suffer and enter your password every time!
- You should assume that data on the server can be deleted at the end of each quarter.
- There is no redundancy on the server, so back up everything that is important to you.
- In addition to your home directory you have access to `/data/users/netID`, which provides additional storage.
  - Each user is limited to `150GB` in their `/home/netID` directory.
  - Each netID has an additional `850GB` in their `/data/users/netID` directory.
  - Use `quota -vs` to see how much disk space you are using relative to your quotas.
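The access steps above can be collected into a small sketch that you run on your own computer, not on the server. The netID `nuit1337` and the server name are the placeholders used on this page. The key path and key type (`~/.ssh/id_ed25519`) and the function names `setup_access` and `check_access` are assumptions made for illustration; if you already have a key from the hackathon, point `KEY` at it instead.

```shell
# Placeholders: substitute your own netID and whichever server you use.
NETID="nuit1337"
SERVER="lamb.mech.northwestern.edu"
# Assumed key location; this page does not say which key type or path to use.
KEY="$HOME/.ssh/id_ed25519"

# One-time setup, run on your own computer.
setup_access() {
    # Create a key pair only if this one does not exist yet.
    [ -f "$KEY" ] || ssh-keygen -t ed25519 -f "$KEY"

    # ssh-add -l exits with status 2 when no agent can be reached;
    # in that case start an agent for the current shell.
    ssh-add -l >/dev/null 2>&1
    if [ "$?" -eq 2 ]; then
        eval "$(ssh-agent -s)"
    fi
    # Load the key (asks for the key's passphrase, if it has one).
    ssh-add "$KEY"

    # Install the public key on the server (asks for your netID password once).
    ssh-copy-id -i "$KEY.pub" "$NETID@$SERVER"
}

# Afterwards this should log in without a password prompt and print your quotas.
check_access() {
    ssh "$NETID@$SERVER" quota -vs
}
```

Paste the definitions into an interactive shell rather than running them as a script, so that an agent started by `setup_access` stays available to that shell. Then run `setup_access` once, followed by `check_access`.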
Software
Some useful software includes:
- Ubuntu 24.04 LTS (using the Ubuntu Minimal installation).
- ROS 2 Jazzy (and image pipeline, moveit2, simulator, nav2, rtabmap, and others).
- Lambda Stack (provides PyTorch, TensorFlow, etc.)
- Docker
Docker
- If the preinstalled software is not sufficient (wrong versions, etc.), you should make a docker image that does what you need.
- All users can access Docker only in rootless mode. Docker rootless mode is already set up for you.
- When interpreting instructions for docker on the internet:
  - Skip anything related to installing `docker`, nvidia drivers, or the nvidia container toolkit.
  - If the instructions involve `sudo`, omit the `sudo`. Most commands should work, but if they don't that means you can't (and likely do not actually need to) run it.
  - If the instructions involve `systemctl <SOMETHING> docker`, substitute `systemctl --user <SOMETHING> docker` instead.
    - The docker daemon must be running. It should start automatically, but use `systemctl --user start docker` to start it if it is not.
  - If the instructions involve `--privileged`, you can't do it (but there is likely another way to do what you need to do).
- To use the `nvidia` gpus in a docker container, add the following flags to the `docker run` command: `--runtime=nvidia --gpus all`
  - For example, `docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi`
- For more information see Rootless Docker and NVIDIA Docker Rootless Mode
- By default your docker images are stored in `/data/users/<netID>/docker`
  - Use `docker image ls` to see downloaded docker images and the amount of disk space they are using.
  - Use `docker image rm` to delete a docker image.
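The rules for adapting online Docker instructions to rootless mode can be sketched as a small shell helper. `to_rootless` and `gpu_check` are hypothetical names introduced here, not tools installed on the servers. The helper only prints the rewritten command, so you can read it before running it, and it refuses anything containing `--privileged`.

```shell
# Hypothetical helper: rewrite a command copied from an online Docker guide
# into the form that applies to rootless Docker on these servers.
to_rootless() {
    case "$*" in
        *--privileged*)
            echo "not available in rootless mode: --privileged" >&2
            return 1
            ;;
    esac
    # Drop a leading sudo, then add --user to systemctl unless it is already there.
    printf '%s\n' "$*" | sed -E \
        -e 's/^sudo +//' \
        -e '/^systemctl +--user/!s/^systemctl +/systemctl --user /'
}

# Start the user-level daemon if needed, confirm that a container can see
# the GPUs, then list the images currently taking up disk space.
gpu_check() {
    systemctl --user is-active --quiet docker || systemctl --user start docker
    docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
    docker image ls
}
```

For example, `to_rootless sudo systemctl restart docker` prints `systemctl --user restart docker`. The helper joins its arguments with spaces, so treat the output as something to read and retype, not as a safe way to re-quote complex commands.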
Usage Hints
- Use tmux to persist your session across ssh logins
- Otherwise, what you run in an ssh session (e.g., a long ML training) will exit when ssh disconnects
- If doing RL training, run the simulator in `headless` mode on the server.
  - There is no GUI on the server, so everything will be via the command line.
- A useful workflow is to test small batches on your laptop, then run the full job on the server.
- Be sure to save checkpoints so you can train a little, see what is happening, then continue training.
- Use rsync to synchronize files between the server and your computer
  - rsync can operate directly over ssh by specifying `user@server:/path/to/file` as either a source or destination.
  - Be careful, rsync can delete files. I recommend never using it directly on your home directory but rather a sub-directory.
  - If you completely destroy your home directory, the hidden files in `/etc/skel` have the default configuration and you can ssh-copy-id to gain passwordless access again.
- It is possible to run a jupyter notebook on the server and connect to it via your computer's web browser.
- You can use python3 virtual environments for python development
- However, most packages you would need (if starting from scratch) are already pre-installed
- Many C++ and Python packages can be placed directly in a `colcon` workspace and compiled from scratch
- If pre-existing software has dependencies that differ from our system (as is likely) your quickest path to success is a docker container.
- If you are coding your own script, your quickest path to success is to use the same package versions on your laptop as those pre-installed on the server.
- There are no limits on GPU or CPU usage, but be mindful of other students by coordinating when you will use each machine and GPU.
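Several of the hints above (tmux, rsync, and a remote notebook) can be sketched as shell functions defined on your own computer. The function names, the session name `train`, the use of `/data/users/<netID>/<project>` as the rsync destination, and port `8888` are all assumptions chosen for illustration. Reaching the notebook through an ssh port forward is one common approach, not necessarily the one intended by this page.

```shell
# Placeholders: substitute your own netID and whichever server you use.
NETID="nuit1337"
SERVER="lamb.mech.northwestern.edu"

# Build the rsync destination for a project sub-directory under /data/users.
remote_dir() {
    printf '%s@%s:/data/users/%s/%s/' "$NETID" "$SERVER" "$NETID" "$1"
}

# Open a tmux session on the server, or re-attach if it already exists,
# so that long jobs keep running when ssh disconnects.
attach_session() {
    ssh -t "$NETID@$SERVER" tmux new-session -A -s "${1:-train}"
}

# Copy a local project directory to a sub-directory on the server.
# Extra arguments are passed to rsync, e.g. --dry-run to preview the transfer.
push_project() {
    src="$1"
    shift
    rsync -av "$@" "$src/" "$(remote_dir "$(basename "$src")")"
}

# Start a notebook on the server and forward its port to this computer;
# then open http://localhost:8888 in a local browser.
notebook_tunnel() {
    ssh -L 8888:localhost:8888 "$NETID@$SERVER" \
        jupyter notebook --no-browser --port=8888
}
```

A cautious first transfer is `push_project ~/myproject --dry-run`, which lists what would be copied without changing anything. Nothing here passes `--delete`, so files already on the server are left in place unless they are overwritten by newer copies.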