📖 How to Find a GPU Hosting Service – A Guide by Viraaj Akuthota
To fine-tune models and create embeddings on large corpora of qualitative data, a large amount of GPU memory (VRAM) is required. For example, fine-tuning BERT on a dataset of 15k cases of varying length yields roughly 100k-200k sequences at a 512-token limit, which requires approximately 140 GB of VRAM. This hardware requirement puts such tasks beyond most consumer-grade machines. I ran an exercise to identify an affordable and relatively easy-to-use cloud compute option, and faced many difficulties along the way. The benefits and disadvantages of the majority of the service providers I reviewed are summarised in the table below.
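Before committing to hardware, a back-of-envelope estimate can help size an instance. The sketch below is a rough heuristic only, assuming fp32 full fine-tuning with Adam and a crude per-layer activation term; actual usage varies with mixed precision, gradient checkpointing, and implementation details.

```python
# Rough VRAM estimate for full fine-tuning (illustrative heuristic,
# not an exact model of any framework's memory use).

def estimate_finetune_vram_gb(num_params: float, batch_size: int,
                              seq_len: int, hidden_size: int,
                              num_layers: int) -> float:
    # fp32 weights (4 B) + gradients (4 B) + Adam moments (8 B) per parameter
    model_gb = num_params * (4 + 4 + 8) / 1e9
    # Crude activation term: a handful of fp32 copies of
    # (batch, seq_len, hidden) kept per layer for the backward pass.
    act_gb = batch_size * seq_len * hidden_size * num_layers * 16 * 4 / 1e9
    return model_gb + act_gb

# BERT-large-ish: ~340M params, 24 layers, hidden size 1024
print(f"{estimate_finetune_vram_gb(340e6, 32, 512, 1024, 24):.0f} GB")
```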
Overall, the production setup I landed on uses:
·   Paperspace Core with a Windows Server instance, to avoid using the terminal as much as possible.
·   Always-available multi-GPU instances, for example 4 x Nvidia A6000 GPUs with 192 GB VRAM total for roughly $7 USD an hour.
·   Approximately $3 USD per month for 50 GB of persistent storage, making offline costs negligible.
·   For Linux users, Paperspace offers a Python ML template that saves time installing Python, packages, CUDA, etc. (a quick GPU check is sketched after this list).
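Once an instance is up, it is worth confirming that all advertised GPUs are actually visible. A minimal check, assuming PyTorch with CUDA is installed (e.g. via the ML template):

```python
# List each visible GPU and its total VRAM; on a 4 x A6000 instance
# this should print four devices of roughly 48 GB each.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB")
else:
    print("No CUDA devices found")
```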
Before production, I use either Google Colab or HuggingFace:
·   For testing fine-tuning or creating embeddings, Google Colab's free T4 instance provides, I believe, the most VRAM of any free tier.
·   For testing LLMs, HuggingFace's serverless inference free tier lets you use a variety of LLMs, such as Llama 405B. However, the Pro tier at $9 USD per month raises the rate limit on this inference; I receive approximately 300 API calls per hour (a minimal call is sketched after this list).
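For reference, a serverless inference call can be as short as the sketch below. It assumes the `huggingface_hub` package and a valid access token; the model ID is illustrative, so substitute whichever hosted model you are testing.

```python
# Minimal HuggingFace serverless inference call (model ID illustrative).
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_...")  # or log in via `huggingface-cli login`
response = client.chat_completion(
    model="meta-llama/Llama-3.1-405B-Instruct",  # assumed model ID
    messages=[{"role": "user", "content": "Summarise this case in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```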
| Provider | Benefits | Disadvantages | GPU Limit |
| --- | --- | --- | --- |
| Amazon EC2 | | | |
| Amazon Notebooks | | | |
| Microsoft Azure | | | |
| Google Cloud | | | |
| Google Colab | | | |
| Paperspace Notebooks | | | |
| Paperspace Server/Console | | | |