MSR GPU Servers

Servers

Hardware

Lambda Labs Workstations with:

  1. AMD Ryzen Threadripper PRO 5995WX with 64 Cores
  2. 1 TB of RAM
  3. Two NVIDIA RTX 6000 Ada GPUs, each with 48 GB of VRAM
  4. Disks: one 4 TB and one 14 TB NVMe drive

Access

  1. The servers maintained by MSR are accessible only from the Northwestern network.
    • NUIT does not maintain these servers and cannot help you with them (even though you initially access them with your netID).
  2. For off-campus access, users must connect to the Northwestern network via VPN.
  3. The servers are available at:
    • lamb.mech.northwestern.edu
    • sheep.mech.northwestern.edu
  4. You will need your netID (e.g., nuit1337) and its associated password, as provided by NUIT.
  5. Use ssh-copy-id netid@server to copy your SSH public key to the server you want to access.
    • Your password is your netID password.
    • For example, run ssh-copy-id nuit1337@lamb.mech.northwestern.edu to gain access to lamb.
  6. After running the above command, you should be able to ssh netid@server without being prompted for your password (e.g., ssh nuit1337@lamb.mech.northwestern.edu). If you are prompted for a password:
    1. If your netID password grants access, then the previous command failed, likely because you have not set up an SSH key (these were set up during the hackathon).
    2. If your SSH key passphrase grants access, then you have not loaded the key into the ssh-agent. Either review these instructions or suffer and enter your passphrase every time!
  7. You should assume that data on the server can be deleted at the end of each quarter.
    • There is no redundancy on the server, so back up everything that is important to you.
  8. In addition to your home directory, you have access to /data/users/netID, which provides additional storage.
    • Each user is limited to 150 GB in their /home/netID directory.
    • Each user has an additional 850 GB in their /data/users/netID directory.
    • Use quota -vs to see how much disk space you are using relative to your quotas.
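
The key setup in steps 5 and 6 can be sketched as the shell function below. It is a minimal sketch, not the official procedure: the ed25519 key type, the key path ~/.ssh/id_ed25519, and the function name setup_key_access are assumptions chosen for illustration, and nuit1337 is the example netID used above. It also assumes an ssh-agent is already running in your session.

```shell
# Example values: replace with your own netID and either server.
NETID="nuit1337"
SERVER="lamb.mech.northwestern.edu"
KEY="$HOME/.ssh/id_ed25519"   # assumed key location and type

# Hypothetical helper; nothing runs until you call it.
setup_key_access() {
    mkdir -p "$HOME/.ssh" || return 1
    # Create a key pair only if this one does not exist yet.
    [ -f "$KEY" ] || ssh-keygen -t ed25519 -f "$KEY" || return 1
    # Load the key into the running ssh-agent (asks for the key passphrase once).
    ssh-add "$KEY" || return 1
    # Install the public key on the server (asks for your netID password).
    ssh-copy-id -i "$KEY.pub" "$NETID@$SERVER" || return 1
    # This login should no longer prompt; it prints disk usage against your quotas.
    ssh "$NETID@$SERVER" quota -vs
}
```

Run setup_key_access once from your own computer while connected to the Northwestern network. If the final ssh still asks for a password, the two cases in step 6 describe what went wrong.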

Software

Some useful software includes:

  1. Ubuntu 24.04 LTS (using the Ubuntu Minimal installation).
  2. ROS 2 Jazzy (and image pipeline, moveit2, simulator, nav2, rtabmap, and others).
  3. Lambda Stack (provides PyTorch, TensorFlow, etc.)
  4. Docker

Docker

  • If the preinstalled software is not sufficient (wrong versions, etc.), you should build a Docker image that does what you need.
  • All users can access Docker only in rootless mode. Docker rootless mode is already set up for you.
  • When following instructions for Docker from the internet:
    • Skip anything related to installing Docker, NVIDIA drivers, or the NVIDIA Container Toolkit.
    • If the instructions involve sudo, omit the sudo. Most commands should still work; if one doesn't, you cannot run it (and likely do not actually need to).
    • If the instructions involve systemctl <SOMETHING> docker, use systemctl --user <SOMETHING> docker instead.
      • The Docker daemon must be running. It should start automatically; if it has not, start it with systemctl --user start docker.
    • If the instructions involve --privileged, you can't do it (but there is likely another way to do what you need to do).
  • To use the NVIDIA GPUs in a Docker container, add the following flags to the docker run command: --runtime=nvidia --gpus all
    • For example: docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
  • For more information, see Rootless Docker and NVIDIA Docker Rootless Mode.
  • By default your docker images are stored in /data/users/<netID>/docker
    • Use docker image ls to see downloaded docker images and the amount of disk space they are using
    • Use docker image rm to delete a docker image.
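
The points above can be combined into a quick check that rootless Docker and the GPUs work for your account. This is a minimal sketch that uses the commands already listed in this section plus systemctl --user is-active; the function name check_gpu_docker is hypothetical.

```shell
# Hypothetical helper; run it on the server, not on your laptop.
check_gpu_docker() {
    # Start the per-user Docker daemon only if it is not already active.
    systemctl --user is-active --quiet docker \
        || systemctl --user start docker \
        || return 1
    # Run nvidia-smi in a throwaway container; the server's GPUs should be listed.
    docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi || return 1
    # Show the images now stored on disk and how much space they use.
    docker image ls
}
```

The first run downloads the ubuntu image, which counts against your /data/users quota until you remove it with docker image rm.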

Usage Hints

  1. Use tmux to persist your session across ssh logins
    • Otherwise, what you run in an ssh session (e.g., a long ML training) will exit when ssh disconnects
  2. If doing RL training, run the simulator in headless mode on the server.
    • There is no GUI on the server, so everything will be via the command line.
    • A useful workflow is to test small batches on your laptop, then run the full job on the server.
    • Be sure to save checkpoints so you can train a little, see what is happening, then continue training.
  3. Use rsync to synchronize files between the server and your computer
    • rsync can operate directly over ssh by specifying user@server:/path/to/file as either a source or destination
    • Be careful: rsync can delete files. I recommend never running it directly on your home directory, only on a sub-directory.
    • If you completely destroy your home directory, the hidden files in /etc/skel contain the default configuration, and you can run ssh-copy-id again to regain passwordless access.
  4. It is possible to run a jupyter notebook on the server and connect to it via your computer's web browser.
  5. You can use python3 virtual environments for python development
    • However, most packages you would need (if starting from scratch) are already pre-installed
  6. Many C++ and Python packages can be placed directly in a colcon workspace and compiled from scratch
  7. If pre-existing software has dependencies that differ from our system (as is likely) your quickest path to success is a docker container.
  8. If you are coding your own script, your quickest path to success is to use the same versions of the pre-installed packages on both the server and your laptop.
  9. There are no limits on GPU or CPU usage, but be mindful of other students by coordinating when you will use each machine and GPU.
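
Hints 1, 3, and 4 can be sketched as a few shell helpers. This is a sketch under assumptions, not a prescribed workflow: the function names, the tmux session name train, the project directory, and port 8888 are hypothetical, and it assumes jupyter is available on the server.

```shell
NETID="nuit1337"                      # example netID
SERVER="lamb.mech.northwestern.edu"
PROJECT="project"                     # hypothetical sub-directory of your home directory

# On the server: attach to the tmux session "train", creating it if needed.
# Detach with Ctrl-b d; the session keeps running after ssh disconnects.
train_session() {
    tmux new-session -A -s train
}

# On your computer: preview what rsync would copy to the server.
# Remove --dry-run to perform the copy once the file list looks right.
push_project() {
    rsync -av --dry-run "$HOME/$PROJECT/" "$NETID@$SERVER:$PROJECT/"
}

# On your computer: forward local port 8888 to port 8888 on the server.
# With "jupyter notebook --no-browser --port=8888" running on the server
# (for example inside tmux), open http://localhost:8888 in your browser.
jupyter_tunnel() {
    ssh -N -L 8888:localhost:8888 "$NETID@$SERVER"
}
```

The trailing slash on the rsync source copies the contents of the directory rather than the directory itself, and working in a sub-directory keeps a mistyped command away from the rest of your home directory.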

Author: Matthew Elwin.