Problem: Error message “ssh: Could not resolve hostname newton: Name or service not known”.
Solution: This occurs mainly when connecting from outside the campus. Make sure the VPN is connected, then try to SSH to 220.127.116.11 instead.
Problem: Error message “Remote side unexpectedly closed network connection” when trying to upload or download files to or from the server.
Solution: The SSH session limit is 10. Close other open connections and try again.
Problem: Job pending.
Solution: The job is either requesting more resources than are available to it (over-scheduling) or waiting because of overall cluster load. See the Workload management and Job and resource limits sections.
Problem: Error message “Could not load dynamic library ‘libcudart.so.*’”
Solution: If you receive this error on newton (the login server), launch the script on a compute node using Slurm; newton is a virtual server with no GPUs and no CUDA installation.
If you receive the error on a node:
Add these lines at the end of the .bashrc file in your home folder (~/.bashrc):
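The exact lines depend on where CUDA is installed on the cluster's nodes; as an illustration only (with /usr/local/cuda as a hypothetical install path), they typically look like this:

```shell
# Hypothetical CUDA install path -- adjust to the cluster's actual location
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

Adding the CUDA lib64 directory to LD_LIBRARY_PATH is what lets the dynamic loader find libcudart.so at runtime.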
Save the file, exit the session (or job) and reconnect. Start bash on a node:
srun --pty /bin/bash
and run a command that reports the CUDA version, e.g. nvcc --version. You should get CUDA's version details.
Problem: RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 10.76 GiB total capacity; 8.94 GiB already allocated; 45.44 MiB free; 9.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
- Run the script on a different node (select it with srun's -w flag) with a different GPU and check whether the same error occurs.
- Free the memory PyTorch has cached by calling torch.cuda.empty_cache().
- Release allocated memory. Host memory allocated with malloc() is freed with free(); device memory allocated with cudaMalloc() is freed with cudaFree(). Note that pinned host memory allocated through the CUDA API, e.g. with cudaMallocHost(), must be freed with cudaFreeHost(), not free().
- Check a summary of CUDA memory usage with torch.cuda.memory_summary().
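The allocation/free pairing mentioned above can be sketched as follows (a minimal illustration, assuming the CUDA toolkit is installed on the node; the buffer size is arbitrary):

```c
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    float *d_buf = NULL;   /* device memory */
    float *h_buf = NULL;   /* pinned host memory */
    size_t n = 1 << 20;    /* 1M floats, illustrative size */

    cudaMalloc((void **)&d_buf, n * sizeof(float));      /* device allocation */
    cudaMallocHost((void **)&h_buf, n * sizeof(float));  /* pinned host allocation */

    /* ... kernel launches and memcpys ... */

    cudaFree(d_buf);       /* device memory  -> cudaFree()  */
    cudaFreeHost(h_buf);   /* pinned host memory -> cudaFreeHost(), not free() */
    return 0;
}
```

Leaked device allocations are a common cause of "CUDA out of memory" errors accumulating across runs, so every cudaMalloc()/cudaMallocHost() should have a matching free call.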