NEWTON – FAQ - AI & HPC Site

Problem: Error message “ssh: Could not resolve hostname newton: Name or service not known”.
Solution: This occurs mainly when connecting from outside the campus. Make sure the VPN is connected, then SSH to the IP address 132.68.39.200 instead of the hostname.

Problem: Error message “Remote side unexpectedly closed network connection” when trying to upload or download files to or from the server.
Solution: The SSH session limit is 10. Close other connections and try again.

Problem: Job pending.
Solution: Either the pending job requests more resources than can currently be scheduled, or the cluster is under heavy load. See the Workload management and Job and resource limits sections.
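
To see why a job is pending, Slurm's squeue command can print a reason code per job (typical values include Resources and Priority). The following is a minimal sketch, assuming squeue is on the PATH; the helper names parse_squeue_lines and pending_jobs are hypothetical and not part of any site tooling.

```python
import getpass
import shutil
import subprocess


def parse_squeue_lines(text):
    """Parse 'JOBID|STATE|REASON' lines into a list of dicts."""
    jobs = []
    for line in text.splitlines():
        parts = line.strip().split("|")
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        job_id, state, reason = parts
        jobs.append({"job_id": job_id, "state": state, "reason": reason})
    return jobs


def pending_jobs(user=None):
    """Return the user's pending jobs, or [] if squeue is not available."""
    if shutil.which("squeue") is None:
        return []  # not running on a machine with Slurm client tools
    user = user or getpass.getuser()
    # -h: no header, -u: user, -t: job state, -o: output format
    # %i = job ID, %T = state, %r = pending reason
    result = subprocess.run(
        ["squeue", "-h", "-u", user, "-t", "PENDING", "-o", "%i|%T|%r"],
        capture_output=True, text=True, check=True,
    )
    return parse_squeue_lines(result.stdout)


if __name__ == "__main__":
    for job in pending_jobs():
        print(f"{job['job_id']}: {job['state']} ({job['reason']})")
```

A reason of Resources generally means the job is waiting for the requested resources to become free, while Priority means other jobs are ahead of it in the queue.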

Problem: Error message “Could not load dynamic library ‘libcudart.so.*’”
Solution: If you receive this error on newton (the login server), launch the script on a node using Slurm. newton is a virtual server; it has no GPUs and does not have CUDA installed.
If you receive the error on a node, add these lines at the end of the .bashrc file in your home folder (~/.bashrc):
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
Save the file, exit the session (or job), and reconnect. Then start bash on a node:
srun --pty /bin/bash
and run this command:
nvcc -V
You should get CUDA’s version details.
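
If the error persists, it can help to confirm that the CUDA runtime library is actually present in one of the directories listed in LD_LIBRARY_PATH. The following is a minimal sketch; find_cudart is a hypothetical helper, and it only inspects LD_LIBRARY_PATH, not the other locations the dynamic loader searches.

```python
import glob
import os


def find_cudart(search_path=None):
    """Return libcudart.so* files found in the directories of search_path.

    search_path is a colon-separated list of directories; it defaults to
    the current value of LD_LIBRARY_PATH.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    matches = []
    for directory in search_path.split(":"):
        if not directory:
            continue  # empty entries appear when the variable was unset
        pattern = os.path.join(directory, "libcudart.so*")
        matches.extend(sorted(glob.glob(pattern)))
    return matches


if __name__ == "__main__":
    found = find_cudart()
    if found:
        print("libcudart found:", *found, sep="\n  ")
    else:
        print("libcudart not found on LD_LIBRARY_PATH; check ~/.bashrc")
```

Run it inside a job on a node, not on newton, since the login server has no CUDA installation.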

Problem: Error message “RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 10.76 GiB total capacity; 8.94 GiB already allocated; 45.44 MiB free; 9.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF”
Solution:
- Run the job on a different node (with the -w flag) that has a different GPU, and check whether you receive the same error.
- Release memory that PyTorch has reserved but is not using by clearing its cache: torch.cuda.empty_cache()
- Release allocated memory. In C/CUDA code this is done with the matching free function: memory from malloc() is released with free(), memory from cudaMalloc() with cudaFree(), and pinned host memory from cudaMallocHost() with cudaFreeHost().
- Check a summary of CUDA memory use: print(torch.cuda.memory_summary())
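
The PyTorch steps above can be sketched as follows. This is a minimal example, not a definitive recipe: gpu_memory_report is a hypothetical helper, and the max_split_size_mb value of 128 is only an illustrative assumption that should be tuned per workload.

```python
import os

# Assumption: 128 is an illustrative value, not a recommended setting.
# The variable must be set before the first CUDA allocation is made.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

try:
    import torch
except ImportError:  # PyTorch is not installed in this environment
    torch = None


def gpu_memory_report(device=0):
    """Free cached blocks, then return allocated/reserved memory in MiB."""
    if torch is None or not torch.cuda.is_available():
        return None  # no PyTorch or no GPU (for example on the login server)
    torch.cuda.empty_cache()  # hand unused cached blocks back to the driver
    mib = 1024 ** 2
    return {
        "allocated_mib": torch.cuda.memory_allocated(device) / mib,
        "reserved_mib": torch.cuda.memory_reserved(device) / mib,
    }


if __name__ == "__main__":
    report = gpu_memory_report()
    if report is None:
        print("No CUDA device visible; run this inside a Slurm job on a GPU node.")
    else:
        print(report)
        print(torch.cuda.memory_summary())
```

Note that torch.cuda.empty_cache() only releases cached memory that no tensor is using. Memory held by live tensors is freed only once the references to them are dropped, for example with del.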