LLMs
Large language models (LLMs) are AI models that can understand and generate text, primarily using transformer architectures. They are widely used for tools and tasks such as chatbots, translation, summarization, sentiment analysis, and question answering.
This page is about running LLMs on Aalto Triton. As a prerequisite, it is recommended to get familiar with the basics of using the cluster, including running jobs and using Python (Tutorials).
Note
If at any point something doesn’t work, or you are unsure how to get started or proceed, do not hesitate to contact the Aalto RSEs.
You can visit us at the daily Zoom help session at 13.00-14.00.
HuggingFace Models
The most common way to use pre-trained open-source LLMs is to access them through HuggingFace and to leverage their 🤗 Transformers Python library.
HuggingFace provides a wide range of tools and pre-trained models, making it easy to integrate and utilize these models in your projects.
You can explore their offerings at 🤗 HuggingFace.
Note
We are keeping an eye on the latest models and have pre-downloaded some of them for you. If you need any other models, please contact the Aalto RSEs.
Run the command ls /scratch/shareddata/dldata/huggingface-hub-cache/hub to see the full list of available models.
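The directory listing shows folder names rather than model names. As a minimal sketch, assuming the shared cache follows the usual HuggingFace hub layout in which each model is stored in a folder named models--<organization>--<name>, the following Python snippet turns those folder names back into model identifiers that can be passed to pipeline(). The helper names folder_to_model_id and list_cached_models are hypothetical and not part of any library.

```python
from pathlib import Path

# Cache location taken from the note above.
HUB_CACHE = Path("/scratch/shareddata/dldata/huggingface-hub-cache/hub")


def folder_to_model_id(folder_name):
    """Convert 'models--org--name' to 'org/name'; return None for other entries.

    Assumption: folders follow the HuggingFace hub cache naming scheme.
    """
    prefix = "models--"
    if not folder_name.startswith(prefix):
        return None
    # Only the first '--' separates the organization from the model name.
    return folder_name[len(prefix):].replace("--", "/", 1)


def list_cached_models(cache_dir=HUB_CACHE):
    """Return a sorted list of model identifiers found in cache_dir."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    ids = (folder_to_model_id(p.name) for p in cache_dir.iterdir() if p.is_dir())
    return sorted(m for m in ids if m is not None)


if __name__ == "__main__":
    for model_id in list_cached_models():
        print(model_id)
```

Run on a Triton login node, this should print one identifier per line, in the same form as the model argument used in the example further down this page.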
Below is an example of how to use the 🤗 Transformers pipeline() to load a pre-trained model and use it for question answering.
Example: Question Answering
In the following sbatch script, we request computational resources, load the necessary modules, and run a Python script that uses a HuggingFace model for question answering.
huggingface_example.sh:
#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=16GB # This is system memory, not GPU memory.
#SBATCH --gpus=1
#SBATCH --output huggingface.%J.out
#SBATCH --error huggingface.%J.err

# By loading the model-huggingface module, models will be loaded from /scratch/shareddata/dldata/huggingface-hub-cache which is a shared scratch space.
module load model-huggingface

# Load a ready to use conda environment to use HuggingFace Transformers
module load scicomp-llm-env

python huggingface_example.py
The huggingface_example.py Python script uses the HuggingFace model mistralai/Mistral-7B-Instruct-v0.3, which is tuned for conversations and for following instructions.
huggingface_example.py:
from transformers import pipeline
import torch

# Initialize the pipeline
pipe = pipeline(
    "text-generation",  # Task type
    model="mistralai/Mistral-7B-Instruct-v0.3",  # Model name
    device_map="auto",  # Let the pipeline automatically select best available device
    max_new_tokens=1000
)

# Prepare prompts
messages = [
    {"role": "system", "content": "You're a helpful assistant. Answer the questions to the best of your abilities."},
    {"role": "user", "content": "Continue the following sequence: 1, 2, 3, 5, 8"},
]

# Generate text and print the response
response = pipe(messages, return_full_text=False)[0]["generated_text"]
print(response)
For reference, here is a table of the memory needed to hold the model weights for different model sizes and data types:
| Model Size | Parameters | float32 (4 bytes/param) | float16 (2 bytes/param) | int8 (1 byte/param) |
|---|---|---|---|---|
| 1B parameters | 1e9 | 4 GB | 2 GB | 1 GB |
| 7B parameters | 7e9 | 28 GB | 14 GB | 7 GB |
| 13B parameters | 13e9 | 52 GB | 26 GB | 13 GB |
In addition to the model size, you should also consider additional memory overhead for intermediate activations and input token embeddings.
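The figures in the table follow from multiplying the number of parameters by the bytes per parameter. The following is a minimal sketch of that arithmetic in Python. The function names are hypothetical, and the overhead_fraction default of 0.2 is only an illustrative placeholder, since the real overhead depends on the model, batch size, and sequence length.

```python
# Bytes used per parameter for each data type, as in the table.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype="float16"):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def inference_memory_gb(num_params, dtype="float16", overhead_fraction=0.2):
    """Rough inference estimate: weights plus an assumed overhead.

    overhead_fraction is a hypothetical placeholder for activations and
    token embeddings; it is not a measured value.
    """
    return weight_memory_gb(num_params, dtype) * (1 + overhead_fraction)


print(weight_memory_gb(7e9, "float16"))  # 14.0, matching the 7B row of the table
```

Comparing the result with the memory of the GPU you request gives a quick indication of whether a model is likely to fit.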
Note: these figures apply to inference. For training, memory requirements are significantly higher due to gradients, optimizer states (e.g., Adam maintains momentum and variance estimates), gradient accumulation buffers, and larger activation caches. Training can require 3-4x more memory than the model size alone.
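To see where a multiplier of this size comes from, the sketch below counts the per-parameter copies kept in memory during training: the weights, their gradients, and the optimizer states. It assumes that every copy uses the same data type and it ignores activations and accumulation buffers, so it should be read as a lower bound under those assumptions rather than an exact figure. The function name is hypothetical.

```python
def training_state_memory_gb(num_params, bytes_per_param=4, optimizer_states=2):
    """Memory for weights, gradients and optimizer states, in GB (1 GB = 1e9 bytes).

    optimizer_states=2 corresponds to Adam (momentum and variance estimates).
    Assumption: all copies are stored with the same bytes_per_param.
    Activations and gradient accumulation buffers are not included.
    """
    copies = 1 + 1 + optimizer_states  # weights + gradients + optimizer states
    return num_params * bytes_per_param * copies / 1e9


print(training_state_memory_gb(7e9))  # 112.0, i.e. 4x the 28 GB of float32 weights
```

With these assumptions a 7B-parameter model trained with Adam in float32 already needs four times the memory of its weights before any activations are counted.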
You can look at the model card for more information about the model.
Other Frameworks
While HuggingFace provides a convenient way to access and use LLMs, there are other popular frameworks available for running LLMs, such as vLLM for high-performance inference, Ollama for local deployment, DeepSpeed, and LangChain for building LLM applications.
If you need assistance running LLMs in these or other frameworks, please contact the Aalto RSEs.
More examples
AaltoRSE has prepared a repository with miscellaneous examples of using LLMs on Triton. You can find it here.