MSR GPU Servers
Servers
Hardware
Lambda Labs Workstations with:
- AMD Ryzen Threadripper PRO 5995WX with 64 Cores
- 1 TB of RAM
- Two NVIDIA RTX 6000 Ada GPUs with 48 GB of memory each
- Disks: a 4 TB and a 14 TB NVMe drive
Access
- The servers maintained by MSR are accessible only on the Northwestern Network
- NUIT does not maintain and is unable to help you with these servers (even though you initially access them via your netID).
- For off-campus access, users must connect to the Northwestern network via VPN.
- The servers are available at:
lamb.mech.northwestern.edu
sheep.mech.northwestern.edu
- You will need your netid (e.g., nuit1337) and its associated password, as provided by NUIT.
- Use ssh-copy-id netid@server to copy your ssh key to the server you want access to.
  - Your password is your netID password
  - For example, ssh-copy-id nuit1337@lamb.mech.northwestern.edu to gain access to lamb
- After running the above command you should be able to ssh netid@server and not be prompted for your password (e.g., ssh nuit1337@lamb.mech.northwestern.edu). If you are prompted for a password:
  - If your netID password grants access, then the previous command failed, likely because you have not set up an ssh key (these were set up during the hackathon).
  - If your ssh_key password grants access, then you have not loaded the key into the ssh-agent. Either review these instructions or suffer and enter your password every time!
- You should assume that data on the server can be deleted at the end of each quarter.
- There is no redundancy on the server, so back up everything that is important to you.
- In addition to your home directory you have access to /data/users/netID, which provides additional storage.
  - Each user is limited to 150GB in their /home/netID directory
  - Each netID has an additional 850GB in their /data/users/netID directory
  - Use quota -vs to see how much disk space you are using relative to your quotas.
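The passwordless-login steps above can be sketched as a small shell helper. This is only a sketch under stated assumptions, not the official setup procedure: the ed25519 key type, the key path ~/.ssh/id_ed25519, and the helper names run and msr_ssh_setup are illustrative choices that do not come from the instructions above, and nuit1337 is the placeholder netID used earlier. It also assumes an ssh-agent is already running in your session. By default the helper only prints the commands it would run; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: set up passwordless ssh to one of the servers.
# Assumptions (not from the instructions above): ed25519 key type,
# default key path, and the helper names used here.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print commands, 0 = also run them

# Print a command, and run it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Usage: msr_ssh_setup <netid> <server>
msr_ssh_setup() {
    netid="$1"
    server="$2"
    key="$HOME/.ssh/id_ed25519"   # assumed key location

    # 1. Create a key only if one does not already exist.
    if [ ! -f "$key" ]; then
        run ssh-keygen -t ed25519 -f "$key"
    fi

    # 2. Load the key into the ssh-agent so its passphrase is asked once.
    run ssh-add "$key"

    # 3. Copy the public key to the server (asks for your netID password).
    run ssh-copy-id -i "$key.pub" "$netid@$server"

    # 4. This login should now succeed without a password prompt.
    run ssh "$netid@$server" hostname
}
```

For example, msr_ssh_setup nuit1337 lamb.mech.northwestern.edu prints the four commands in order, which makes it easy to check them before running anything against the server.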
Software
Some useful software includes:
- Ubuntu 24.04 LTS (using the Ubuntu Minimal installation).
- ROS 2 Jazzy (and image pipeline, moveit2, simulator, nav2, rtabmap, and others).
- Lambda Stack (provides PyTorch, TensorFlow, etc.)
- Docker
Docker
- If the preinstalled software is not sufficient (wrong versions, etc.), then you should make a docker image to do what you need.
- All users can access Docker only in rootless mode. Docker rootless mode is already set up for you.
- When interpreting instructions for docker on the internet:
  - Skip anything related to installing docker, nvidia drivers, or the nvidia container toolkit.
  - If the instructions involve sudo, omit the sudo. Most commands should work, but if they don't, that means you can't (and likely do not actually need to) run it.
  - If the instructions involve systemctl <SOMETHING> docker, substitute systemctl --user <SOMETHING> docker instead.
    - The docker daemon must be running. It should start automatically, but use systemctl --user start docker to start it if it is not.
  - If the instructions involve --privileged you can't do it (but there is likely another way to do what you need to do).
- To use the nvidia gpus in a docker container add the following flags to the docker run command: --runtime=nvidia --gpus all
  - For example: docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
- For more information see Rootless Docker and NVIDIA Docker Rootless Mode
- By default your docker images are stored in /data/users/<netID>/docker
  - Use docker image ls to see downloaded docker images and the amount of disk space they are using
  - Use docker image rm to delete a docker image.
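The GPU flags and image housekeeping described above can be wrapped in a few shell helpers. This is a minimal sketch, not a required workflow: the helper names (run, gpu_run, docker_gpu_check, docker_disk_report) and the choice to mount the current directory at /workspace inside the container are assumptions made for illustration. docker system df is a standard Docker command that is not mentioned above; it is included as one possible way to summarize disk usage. By default the helpers only print the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: run containers with GPU access under rootless Docker.
# Assumptions: helper names and the /workspace mount point are illustrative.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print commands, 0 = also run them

# Print a command, and run it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Usage: gpu_run <image> [command...]
# Runs a throwaway container with both GPUs visible and the current
# directory mounted at /workspace (an assumed convention).
gpu_run() {
    image="$1"
    shift
    run docker run --rm \
        --runtime=nvidia --gpus all \
        -v "$PWD:/workspace" -w /workspace \
        "$image" "$@"
}

# Start the per-user daemon if needed, then confirm the GPUs are visible.
docker_gpu_check() {
    run systemctl --user start docker
    gpu_run ubuntu nvidia-smi
}

# Show downloaded images and a summary of the disk space Docker is using.
docker_disk_report() {
    run docker image ls
    run docker system df
}
```

Note that no command here uses sudo or --privileged, in line with the rootless restrictions listed above.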
Usage Hints
- Use tmux to persist your session across ssh logins
- Otherwise, what you run in an ssh session (e.g., a long ML training) will exit when ssh disconnects
- If doing RL training, run the simulator in headless mode on the server.
  - There is no GUI on the server, so everything will be via the command line.
- A useful workflow is to test small batches on your laptop, then run the full job on the server
- Be sure to save checkpoints so you can train a little, see what is happening, then continue training.
- Use rsync to synchronize files between the server and your computer
  - rsync can operate directly over ssh by specifying user@server:/path/to/file as either a source or destination
  - Be careful, rsync can delete files. I recommend never using it directly on your home directory but rather a sub-directory
  - If you completely destroy your home directory the hidden files in /etc/skel have the default configuration and you can ssh-copy-id to gain passwordless access again.
- It is possible to run a jupyter notebook on the server and connect to it via your computer's web browser.
- You can use python3 virtual environments for python development
- However, most packages you would need (if starting from scratch) are already pre-installed
- Many C++ and Python packages can be placed directly in a colcon workspace and compiled from scratch
- If pre-existing software has dependencies that differ from our system (as is likely) your quickest path to success is a docker container.
- If you are coding your own script, your quickest path to success is to use the same package versions on your laptop as the ones pre-installed on the server.
- There are no limits on GPU or CPU usage, but be mindful of other students by coordinating when you will use each machine and GPU.
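Several of the hints above (tmux, rsync, and Jupyter over ssh) can be sketched as shell helpers meant to be run from your own computer. This is a sketch under assumptions rather than a prescribed workflow: the helper names, the tmux session name train, the port 8888, and the example paths are all illustrative, and nuit1337 is the placeholder netID used earlier. The tunnel assumes a notebook was already started on the server with something like jupyter notebook --no-browser --port=8888. By default the helpers only print the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: common remote workflows, run from your laptop.
# Assumptions: helper names, session name "train", and port 8888.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print commands, 0 = also run them
SERVER="${SERVER:-nuit1337@lamb.mech.northwestern.edu}"

# Print a command, and run it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Open a tmux session named "train" on the server, or re-attach to it
# if it already exists, so long jobs survive an ssh disconnect.
remote_tmux() {
    run ssh -t "$SERVER" tmux new-session -A -s train
}

# Usage: push_project <local_dir> <remote_dir> [go]
# Without "go", rsync only previews what it would transfer (--dry-run).
# Always sync a sub-directory, never your whole home directory.
push_project() {
    local_dir="$1"
    remote_dir="$2"
    mode="${3:-preview}"
    if [ "$mode" = "go" ]; then
        run rsync -av "$local_dir/" "$SERVER:$remote_dir/"
    else
        run rsync -av --dry-run "$local_dir/" "$SERVER:$remote_dir/"
    fi
}

# Forward local port 8888 to the same port on the server, so a notebook
# running there opens at http://localhost:8888 in your browser.
notebook_tunnel() {
    run ssh -N -L 8888:localhost:8888 "$SERVER"
}
```

The preview-by-default design of push_project follows the warning above that rsync can delete or overwrite files; checking the file list first is cheap compared with restoring lost work.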