This guide provides a practical introduction to the Hugging Face ecosystem. You’ll learn how to find and use models via the API, work with key components like Transformers and Tokenizers, and grasp the fundamentals of fine-tuning. Finally, we cover designing, training, and evaluating custom models for downstream tasks.
Hugging Face intro
The platform where the machine learning community collaborates on models, datasets, and applications.
Hugging Face provides the transformers library, which is used to load and run pre-trained models. It also provides the datasets library, which is used to load datasets, and the tokenizers library, which is used to tokenize text data.
| pip install transformers datasets tokenizers
|
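Model categories
On the Hugging Face Hub, models can be filtered by task, parameter count, library, app, inference provider, and license.
Task
- Text Generation: decoder-style language models such as GPT-2, LLaMA, and Qwen that continue a text prompt.
- Any-to-Any: multimodal models that accept and produce more than one modality (for example text, image, and audio).
- Image-Text-to-Text: vision-language models such as LLaVA and BLIP-2 that take an image plus a text prompt and return text.
- Text-to-Video: models that generate a video clip from a text prompt.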
Parameters
- < 1B: gpt2
- 1B - 6B: whisper-large-v3
- 6B - 12B: Qwen2-7B
- 12B - 32B: DeepSeek-R1-Distill-Qwen-14B
- 32B - 128B: LLaMA-2-70B
- 128B - 500B: DeepSeek-V2.5
- > 500B: DeepSeek-R1
Libraries
- PyTorch: The primary deep learning framework, offering flexibility for building and training most Hugging Face models.
- TensorFlow: A powerful, production-ready framework also supported by Hugging Face for building and deploying models.
- JAX: A high-performance framework for cutting-edge research and fast model training on modern hardware accelerators.
- Transformers: The core library providing a unified interface to thousands of pretrained models for various tasks.
- Diffusers: A specialized library providing easy access to state-of-the-art models for image and audio generation.
Apps
- vLLM: A high-throughput inference and serving engine for large language models.
- TGI: Text Generation Inference, Hugging Face's server for deploying and serving text generation models with a focus on efficiency and scalability.
- llama.cpp: A lightweight C/C++ inference engine for LLaMA and other models, designed for efficient inference on a wide range of devices.
Inference Providers
- Cerebras: A provider offering high-performance inference solutions for large language models.
- Novita: A platform specializing in efficient inference for various machine learning models.
- Nebius AI: A provider focused on scalable and efficient inference solutions for AI applications.
Licenses
- apache-2.0: A permissive license allowing for wide usage and modification, commonly used in open-source projects.
- mit: A simple and permissive license allowing for free use, modification, and distribution.
- openrail: An Open Responsible AI License (OpenRAIL) that permits open access, reuse, and redistribution of AI models while attaching use-based restrictions.
- cc-by-nc-4.0: A Creative Commons license allowing for non-commercial use, requiring attribution to the original creator.
Hugging Face API
Online requests to the hosted API
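| import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"
API_TOKEN = "*"  # replace with your Hugging Face access token
headers = {"Authorization": f"Bearer {API_TOKEN}"}

def query(payload):
    """POST the payload to the hosted model and return the decoded JSON."""
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

print(query({"inputs": "Hello, I'm a language model,"}))
|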
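As a usage sketch for the query helper above: the hosted API generally expects a JSON body with an "inputs" field and an optional nested "parameters" object. The helper names build_payload and build_headers, the token value, and the chosen parameters are illustrative assumptions, and nothing is sent over the network here.

```python
import json

def build_payload(prompt, **generation_params):
    """Build the JSON body for a text-generation request (hypothetical helper)."""
    payload = {"inputs": prompt}
    if generation_params:
        # Generation settings travel in a nested "parameters" object.
        payload["parameters"] = generation_params
    return payload

def build_headers(api_token):
    """Return the bearer-token header used for authenticated requests."""
    return {"Authorization": f"Bearer {api_token}"}

payload = build_payload("Hello, I'm a language model,", max_new_tokens=20, temperature=0.7)
headers = build_headers("hf_xxx")  # placeholder token, not a real credential
print(json.dumps(payload, indent=2))
```

Passing the resulting payload to query(payload) would then send the request.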
Offline use
Download
| from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
cache_dir = "model/bert-base-uncased"  # local directory where the files are stored

# The first call downloads the files into cache_dir; later calls reuse them.
model = AutoModel.from_pretrained(model_name, cache_dir=cache_dir)
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
|
Local directory structure of the downloaded model:
| - bert-base-uncased
  ├ blobs
  ├ refs
  ┗ snapshots
    ┗ (commit hash)
      ├ config.json
      ├ model.safetensors
      ├ tokenizer.json
      ├ tokenizer_config.json
      ┗ vocab.txt
|
- config.json: Defines the model’s architecture and hyperparameters, like hidden size and number of attention heads.
- model.safetensors: Contains the model’s trained weights in a secure and efficient format for fast loading.
- tokenizer.json: A single file that holds all the necessary tokenizer information, including vocabulary and rules.
- tokenizer_config.json: Specifies tokenizer settings, like whether to lowercase text, and special token information.
- vocab.txt: Lists the vocabulary of the tokenizer, mapping each token to a unique ID.
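To illustrate how two of these files are laid out, here is a minimal sketch that reads config.json as JSON and turns vocab.txt into a token-to-ID mapping (the ID is the zero-based line number). The helper names and the toy file contents are assumptions made so the example runs offline; a real snapshot has many more config keys and a much larger vocabulary.

```python
import json
import tempfile
from pathlib import Path

def load_config(snapshot_dir):
    """Read config.json from a snapshot directory."""
    with open(Path(snapshot_dir) / "config.json", encoding="utf-8") as f:
        return json.load(f)

def load_vocab(snapshot_dir):
    """Map each token in vocab.txt to its ID (the zero-based line number)."""
    with open(Path(snapshot_dir) / "vocab.txt", encoding="utf-8") as f:
        return {line.rstrip("\n"): idx for idx, line in enumerate(f)}

# Stand-in snapshot directory with toy files so the sketch runs offline.
with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp)
    (snapshot / "config.json").write_text(
        json.dumps({"hidden_size": 768, "num_attention_heads": 12, "vocab_size": 5}),
        encoding="utf-8",
    )
    (snapshot / "vocab.txt").write_text("[PAD]\n[UNK]\nhello\nworld\n##s\n", encoding="utf-8")

    config = load_config(snapshot)
    vocab = load_vocab(snapshot)

print(config["hidden_size"], config["num_attention_heads"])
print(vocab["hello"])
```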
text-generation
| from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
model_dir = r"/path"
model = AutoModelForCausalLM.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device="cuda")
output = pipe("Hello, I'm a language model,", max_length=50, num_return_sequences=1)
print(output[0]['generated_text'])
|
Tuning parameters
| #...
output = pipe(
    "Hello, I'm a language model,",  # prompt: the initial text that generation continues from
    max_length=50,  # maximum total length (prompt plus generated text), in tokens
    num_return_sequences=1,  # number of generated sequences to return
    truncation=True,  # truncate the input text if it exceeds the model's maximum length
    do_sample=True,  # enable sampling; temperature, top_k and top_p have no effect under greedy decoding
    temperature=0.7,  # controls randomness: lower values make output more deterministic, higher values (e.g., 1.0) more random
    top_k=50,  # sample only from the 50 most probable tokens
    top_p=0.95,  # nucleus sampling: consider the smallest set of tokens whose cumulative probability reaches 0.95
    clean_up_tokenization_spaces=False,  # whether to clean up spaces in tokenization; False means no cleanup
)
|
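To make the interaction of temperature, top_k and top_p more concrete, the following is a minimal pure-Python sketch of one sampling step over a toy four-token vocabulary. It is not the Transformers implementation; the function name sample_next_token and the logits are assumptions chosen for illustration.

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.95, rng=random):
    """Pick one token index from raw logits using temperature, top-k and top-p."""
    # Temperature: divide logits before softmax; values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens, most probable first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample among the survivors; random.choices renormalises the weights.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0], kept

logits = [4.0, 3.0, 1.0, -2.0]  # toy scores for a 4-token vocabulary
token, candidates = sample_next_token(logits, top_k=3, top_p=0.9, rng=random.Random(0))
print(token, candidates)
```

With these toy logits, top_k=3 first drops the least likely token, and top_p=0.9 then keeps only the two most likely ones, so the sampled token is always one of those two.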
text-classification
| from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
model_dir = r"/path/to/your/model"
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, device="cuda")
output = pipe("This is a great movie!", truncation=True, max_length=512)
# output
# [{'label': 'POSITIVE', 'score': 0.9998}]
|
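The label and score in the output above come from the classifier's raw logits. As a rough sketch of that post-processing for a single-label classifier (softmax followed by picking the most probable class), assuming a two-class label mapping and made-up logits:

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits, id2label):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed label mapping
prediction = to_prediction([-4.2, 4.3], id2label)  # made-up logits
print(prediction)
```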
question answering
| from transformers import pipeline

pipe = pipeline("question-answering", model="distilbert-base-cased-distilled-squad", device="cuda")
result = pipe(
    question="What is the capital of France?",
    context="Paris is the capital of France.",
)
print(result)  # a dict with 'score', 'start', 'end', and 'answer' keys
|
Tokenizer
vocab
The BERT tokenizer uses a vocabulary to transform text into tokens. The vocabulary contains every token the model can understand, and each token is mapped to a unique ID that is used as input to the model.
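BERT's tokenizer is a WordPiece tokenizer: a word that is not in the vocabulary as a whole is split greedily into the longest matching pieces, with continuation pieces marked by a ## prefix. The sketch below shows that idea on a tiny made-up vocabulary; the function name and the vocabulary are assumptions, and the real tokenizer also handles lowercasing, punctuation, and a maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word becomes [UNK]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary mapping tokens to IDs (made up for this example).
vocab = {"[UNK]": 0, "play": 1, "##ing": 2, "##ed": 3, "un": 4, "##play": 5}

tokens = wordpiece_tokenize("playing", vocab)
ids = [vocab[t] for t in tokens]
print(tokens, ids)  # ['play', '##ing'] [1, 2]
```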
text to tokens
Use the tokenizer to convert text into tokens and then into index IDs. This step must ensure that the sequence length and the special tokens match the model’s input requirements.
| from transformers import BertTokenizerFast

# The fast tokenizer is used here because offset mappings are not available in the slow BertTokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
sentences = ["Hello, how are you?", "I am fine, thank you!"]
encode_output = tokenizer.batch_encode_plus(
    batch_text_or_text_pairs=sentences,
    add_special_tokens=True,  # Add [CLS] and [SEP] tokens
    truncation=True,  # Truncate sentences to max_length
    padding="max_length",  # Pad sentences to max_length
    max_length=128,  # Set the maximum length for padding/truncation
    return_tensors=None,  # None returns Python lists; options: "pt" (PyTorch), "tf" (TensorFlow), "np" (NumPy)
    return_attention_mask=True,  # Return attention masks for the input tokens
    return_token_type_ids=True,  # Return token type IDs for distinguishing sentences in pairs
    return_special_tokens_mask=True,  # Return a mask indicating special tokens
    return_offsets_mapping=True,  # Return offsets for each token in the original text
    return_length=True,  # Return the length of each encoded sentence
)
for k, v in encode_output.items():
    print(f"{k}: {v}")
print(tokenizer.decode(encode_output['input_ids'][0]))
|
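What add_special_tokens, truncation, and padding do to a sequence of IDs can be sketched in plain Python. The helper below is an illustration rather than the tokenizer's implementation; it uses the [CLS], [SEP], and [PAD] IDs of bert-base-uncased (101, 102, and 0) and treats the word IDs as arbitrary toy values.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased

def encode_ids(token_ids, max_length):
    """Add [CLS]/[SEP], truncate, pad on the right, and build the attention mask."""
    # Reserve two positions for [CLS] and [SEP], then truncate the rest.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Pad up to max_length; padded positions get attention mask 0.
    pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * pad
    attention_mask += [0] * pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}

encoded = encode_ids([7592, 1010, 2129], max_length=8)  # toy word IDs
print(encoded["input_ids"])       # [101, 7592, 1010, 2129, 102, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Every sequence in a batch ends up with the same length, which is what allows the batch to be stacked into one tensor, and the attention mask tells the model which positions are padding.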
add tokens
| from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
new_tokens = ["[NEW_TOKEN1]", "[NEW_TOKEN2]"]
num_added_tokens = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added_tokens} new tokens.")
print(f"New vocabulary size: {len(tokenizer)}")
# If a model is used with this tokenizer, resize its embeddings as well:
# model.resize_token_embeddings(len(tokenizer))

encode_output = tokenizer.encode(
    text="This is a [NEW_TOKEN1] example.",  # the text must contain the token exactly as it was added
    text_pair=None,  # No second text input
    truncation=True,
    padding="max_length",
    max_length=128,
    add_special_tokens=True,
    return_tensors=None,
)
print(f"Encoded output: {encode_output}")
print(f"Decoded text: {tokenizer.decode(encode_output)}")
|
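The bookkeeping behind adding tokens can be sketched in pure Python (this is not the library's implementation): new tokens receive the next free IDs, tokens already in the vocabulary are skipped, and the embedding table has to grow by the same number of rows, which in Transformers is what model.resize_token_embeddings(len(tokenizer)) is for. The function add_tokens_sketch, the toy vocabulary, and the zero-initialised rows are illustrative assumptions.

```python
def add_tokens_sketch(vocab, embeddings, new_tokens):
    """Append unseen tokens to vocab and grow the embedding table to match."""
    dim = len(embeddings[0])
    added = 0
    for token in new_tokens:
        if token in vocab:
            continue  # already known: no new ID, no new row
        vocab[token] = len(vocab)       # next free ID
        embeddings.append([0.0] * dim)  # placeholder row for the new token
        added += 1
    return added

vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2}
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # one row per token ID
added = add_tokens_sketch(vocab, embeddings, ["[NEW_TOKEN1]", "hello", "[NEW_TOKEN2]"])
print(added, vocab["[NEW_TOKEN1]"], len(embeddings))
```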
Fine-tuning
concepts and workflow
Fine-tuning means taking a pre-trained model and training it further so that it adapts to a specific downstream task. BERT, for example, learns general language representations during pre-training and is then fine-tuned on specific tasks such as sentiment analysis and named entity recognition. In the approach used in this guide, the pre-trained layers of the model are locked (frozen), and only the task-specific layers are trained. Fine-tuning can also update all of the weights, but freezing is cheaper and still lets the model leverage the knowledge learned during pre-training while adapting to the specific requirements of the downstream task.
Load dataset
A dataset for a sentiment analysis task contains text samples and corresponding labels that indicate the sentiment (e.g., positive or negative). With Hugging Face’s datasets library, you can easily load and preprocess datasets for training and evaluation.
| # offline load dataset
from datasets import load_dataset
dataset = load_dataset('csv', data_files='path/to/your/dataset.csv')
print(dataset)
|
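For reference, here is a sketch of the CSV layout this guide appears to assume: a header row with a text column and a label column, one sample per line. The column names match what the dataset code in this guide reads; the two sample rows and the helper functions are made up for illustration and use only the standard library.

```python
import csv
import tempfile
from pathlib import Path

# Two made-up samples: label 1 = positive, label 0 = negative (assumed encoding).
rows = [
    {"text": "This is a great movie!", "label": 1},
    {"text": "The plot was dull and predictable.", "label": 0},
]

def write_dataset_csv(path, rows):
    """Write samples with a 'text,label' header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "label"])
        writer.writeheader()
        writer.writerows(rows)

def read_dataset_csv(path):
    """Read the samples back, converting labels to integers."""
    with open(path, newline="", encoding="utf-8") as f:
        return [{"text": r["text"], "label": int(r["label"])} for r in csv.DictReader(f)]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "dataset.csv"
    write_dataset_csv(csv_path, rows)
    loaded = read_dataset_csv(csv_path)

print(loaded)
```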
| # CustomDataset.py
from torch.utils.data import Dataset
from datasets import load_from_disk

class CustomDataset(Dataset):
    def __init__(self, dataset_path):
        # Load a dataset that was previously saved to disk
        self.dataset = load_from_disk(dataset_path)

    def __len__(self):  # get the length of the dataset
        return len(self.dataset)

    def __getitem__(self, idx):  # return one (text, label) pair
        row = self.dataset[idx]
        return row['text'], row['label']

# Guarded so that importing this module does not run the demo
if __name__ == "__main__":
    dataset = CustomDataset('path/to/your/dataset')
    for data in dataset:
        print(data)  # Each data is a tuple (text, label)
|
Downstream tasks design
Before fine-tuning, you need to design the downstream task model. It consists of one or more fully connected layers placed on top of the pre-trained model to adapt it to the specific task.
| # network.py
from transformers import BertModel
import torch

# Define the device for model training
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pretrained = BertModel.from_pretrained("bert-base-uncased").to(DEVICE)

# Define downstream task model
class DownstreamTaskModel(torch.nn.Module):
    # Initialize the model with a pre-trained BERT model and a classifier layer
    def __init__(self, pretrained_model, num_labels):
        super().__init__()
        self.bert = pretrained_model
        # Lock the upstream model so that only the classifier layer is trained
        for param in self.bert.parameters():
            param.requires_grad_(False)
        self.fc = torch.nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        # The upstream model is locked, so no gradients are computed for it
        with torch.no_grad():
            output = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        # Downstream task: classify from the [CLS] token representation
        logits = self.fc(output.last_hidden_state[:, 0])
        # Return raw logits: torch.nn.CrossEntropyLoss applies log-softmax itself
        return logits
|
| flowchart LR
dataset --> Bert
subgraph Model
Bert-->FullyConnected
end
FullyConnected --> output
|
Custom Model training
| import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import BertTokenizer

from CustomDataset import CustomDataset
from network import DownstreamTaskModel, pretrained

# Define devices
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
EPOCHS = 100  # number of training epochs
NUM_LABELS = 2  # assumption: binary sentiment labels (positive / negative)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode a batch of (text, label) pairs
def collate_fn(data):
    sentences, labels = [i[0] for i in data], [i[1] for i in data]
    data = tokenizer.batch_encode_plus(
        batch_text_or_text_pairs=sentences,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt",  # Return PyTorch tensors
        return_length=True,  # Return the length of each encoded sentence
    )
    input_ids = data['input_ids']
    attention_mask = data['attention_mask']
    token_type_ids = data['token_type_ids']  # Segment IDs, returned by the BERT tokenizer
    label = torch.LongTensor(labels)  # Convert labels to tensor
    return input_ids, attention_mask, token_type_ids, label

# Create dataset
train_dataset = CustomDataset("train")  # path to the saved training data
# Create DataLoader
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=32,  # Batch size for every load
    shuffle=True,  # Shuffle the dataset for each epoch
    drop_last=True,  # Drop the last incomplete batch
    collate_fn=collate_fn,  # Custom collate function for batching
)

if __name__ == "__main__":
    model = DownstreamTaskModel(pretrained, NUM_LABELS).to(DEVICE)  # Initialize the model
    optimizer = AdamW(model.parameters(), lr=5e-4)  # Optimizer for training
    loss_fn = torch.nn.CrossEntropyLoss()  # Loss function for classification
    for epoch in range(EPOCHS):
        for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(train_loader):
            input_ids = input_ids.to(DEVICE)
            attention_mask = attention_mask.to(DEVICE)
            token_type_ids = token_type_ids.to(DEVICE)
            labels = labels.to(DEVICE)
            # Forward computation
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
            # Calculate loss
            loss = loss_fn(outputs, labels)
            # Optimization steps
            optimizer.zero_grad()  # 1. Clear gradients
            loss.backward()  # 2. Backpropagation
            optimizer.step()  # 3. Update model parameters
            if i % 5 == 0:  # Print loss and accuracy every 5 batches
                print(f"Epoch [{epoch+1}/{EPOCHS}], Step [{i+1}/{len(train_loader)}], Loss: {loss.item():.4f}")
                accuracy = (outputs.argmax(dim=1) == labels).sum().item() / len(labels)
                print(f"Accuracy: {accuracy:.4f}")
|
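One detail of the loss is worth spelling out: torch.nn.CrossEntropyLoss expects raw logits and applies log-softmax internally, so a model that already applies softmax before the loss ends up normalising twice. The pure-Python sketch below reproduces the arithmetic on made-up scores to show the effect; the helper functions are illustrative and do not use PyTorch.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def softmax(scores):
    return [math.exp(v) for v in log_softmax(scores)]

def cross_entropy(scores, target):
    """Negative log-likelihood of the target class after log-softmax."""
    return -log_softmax(scores)[target]

logits = [4.0, -2.0]  # a confident, correct prediction for class 0
loss_from_logits = cross_entropy(logits, target=0)
loss_from_probs = cross_entropy(softmax(logits), target=0)  # softmax applied twice

print(f"{loss_from_logits:.4f} {loss_from_probs:.4f}")
```

When probabilities are fed in, the two scores can differ by at most 1, so for two classes the loss can never fall below about 0.31 even for a perfect prediction. Passing logits to the loss avoids this floor and gives stronger gradients.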
Save the model
After training, you can save the model's parameters (its state_dict) to a file. To use them later for inference or further training, re-create the model with the same architecture and then load the saved parameters into it.
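| # Save the model
torch.save(model.state_dict(), "sentiment_analysis_model.pt")

# Load the model: rebuild the same architecture first (num_labels=2 assumes binary sentiment labels)
model = DownstreamTaskModel(pretrained, num_labels=2)
model.load_state_dict(torch.load("sentiment_analysis_model.pt", map_location=DEVICE))
model.eval()  # Set the model to evaluation mode
|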
Evaluate the model
Accuracy is a common metric for evaluating classification models. It measures the proportion of correct predictions out of the total number of predictions.
Testing the model is similar to the training process, but without the optimization steps. Build the DataLoader the same way as for training, ideally from held-out data that the model has not seen, then set the model to evaluation mode and disable gradient calculation.
| #...
if __name__ == "__main__":
    # num_labels=2 assumes binary sentiment labels
    model = DownstreamTaskModel(pretrained, num_labels=2).to(DEVICE)  # Initialize the model
    model.load_state_dict(torch.load("params/2bert.pt", map_location=DEVICE))  # Load the trained model parameters
    model.eval()  # Set the model to evaluation mode
    correct = 0
    total = 0
    with torch.no_grad():  # Disable gradient calculation for evaluation
        # test_loader is assumed to be built like train_loader, over held-out test data
        # (swap in train_loader to reuse the training data)
        for input_ids, attention_mask, token_type_ids, labels in test_loader:
            input_ids = input_ids.to(DEVICE)
            attention_mask = attention_mask.to(DEVICE)
            token_type_ids = token_type_ids.to(DEVICE)
            labels = labels.to(DEVICE)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
            predicted = outputs.argmax(dim=1)  # Get the predicted class
            total += labels.size(0)  # Update total count
            correct += (predicted == labels).sum().item()  # Count correct predictions
    accuracy = correct / total  # Calculate accuracy over all batches
    print(f"Accuracy: {accuracy:.4f}")
|
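The accuracy computation in the loop above can be isolated into a small, framework-free sketch. The function names and the toy predictions are assumptions for illustration; the point is that correct and total counts are accumulated across batches before dividing.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def accuracy_over_batches(batches):
    """Accumulate correct/total counts batch by batch, as the evaluation loop does."""
    correct = total = 0
    for predictions, labels in batches:
        correct += sum(p == y for p, y in zip(predictions, labels))
        total += len(labels)
    return correct / total

batches = [([1, 0, 1], [1, 0, 0]), ([1], [1])]  # toy (predictions, labels) pairs
overall = accuracy_over_batches(batches)
print(overall)  # 0.75
```

Accumulating counts rather than averaging per-batch accuracies keeps the result correct when batches differ in size: here the per-batch accuracies are 2/3 and 1, whose plain average would overstate the overall 0.75.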