How to Fine-Tune a Transformer Architecture NLP Model – Visual Studio Magazine
The Data Science Lab
How to Fine-Tune a Transformer Architecture NLP Model
The goal is sentiment analysis: accept the text of a movie review (for example, "This movie was a great waste of time.") and output class 0 (negative review) or class 1 (positive review).
This article describes how to fine-tune a pretrained Transformer Architecture model for natural language processing. Specifically, the article explains how to fine-tune a condensed version of a pretrained BERT model to create a binary classifier for a subset of the IMDB movie review dataset. The goal is sentiment analysis: accept the text of a movie review (for example, "This movie was a great waste of time.") and output class 0 (negative review) or class 1 (positive review).
You can think of a pretrained transformer architecture (TA) model as a sort of English language expert. But the TA expert doesn't know anything about movies, so you provide additional training to fine-tune the model so that it understands the difference between a positive movie review and a negative review.
There are several pretrained TA models for natural language processing (NLP). Two of the best known are BERT (bidirectional encoder representations from transformers) and GPT (generative pretrained transformer). TA models are huge, with millions of weight and bias parameters.
TA models have revolutionized NLP, but TA systems are extremely complex and implementing them from scratch can take hundreds or even thousands of hours of work. Hugging Face (HF) is an open source library that provides pretrained models and a set of APIs for working with them. The HF library makes implementing NLP systems that use TA models much less difficult (see "How to Create a Transformer Architecture Model for Natural Language Processing").
A good way to see where this article is headed is to take a look at the screenshot of a demo program in Figure 1. The demo program begins by loading a small 200-item subset of the IMDB movie review dataset into memory. The full dataset contains 50,000 movie reviews: 25,000 reviews for training and 25,000 reviews for testing, where each set has 12,500 positive reviews and 12,500 negative reviews. Working with the full dataset is very time-consuming, so the demo data uses only the first 100 positive training reviews and the first 100 negative training reviews.
The movie reviews are in plain text form. The reviews are read into memory and then converted to a data structure that holds integer tokens. For example, the word "movie" has token ID = 3185. The data structure of tokenized movie reviews is passed to a PyTorch Dataset object, which is used to send batches of tokenized reviews and their associated labels to the training code.
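If you want to see how a piece of review text maps to integer token IDs, a short snippet such as the following, which is not part of the demo program, does the trick. The exact IDs depend on the distilbert-base-uncased vocabulary:

# quick look at tokenization (not part of the demo program)
from transformers import DistilBertTokenizer

toker = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
enc = toker("This movie was a great waste of time.")
print(enc['input_ids'])  # integer token IDs; 3185 is the ID for "movie"
print(toker.convert_tokens_to_ids('movie'))  # 3185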
Once the movie review data is prepared, the demo loads a pretrained DistilBERT model into memory. DistilBERT is a condensed ("distilled"), but still large, version of the enormous BERT model. The uncased version of DistilBERT has about 66 million weights and biases. The demo then fine-tunes the pretrained model by training it using standard PyTorch techniques. The demo concludes by saving the fine-tuned model to file.
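If you're curious about the 66 million figure, you can count the parameters of the loaded model with a snippet like this. It's not part of the demo program, and the exact total for the sequence classification version is slightly larger than the base model:

# count weights and biases (not part of the demo program)
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained(
  'distilbert-base-uncased')
n_params = sum(p.numel() for p in model.parameters())
print("total parameters = " + str(n_params))  # roughly 66-67 million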
This article assumes you have an intermediate or better familiarity with a C-family programming language, preferably Python, and a basic familiarity with PyTorch, but does not assume you know anything about the Hugging Face code library. The complete source code for the demo program is presented in this article, and the code is also available in the accompanying file download.
To run the demo program, you must have Python, PyTorch, and HF installed on your machine. The demo programs were developed on Windows 10 using the Anaconda 2020.02 64-bit distribution (which contains Python 3.7.6), PyTorch version 1.8.0 for CPU installed via pip, and HF transformers version 4.11.3. Installation is not trivial. You can find detailed step-by-step installation instructions for PyTorch in my blog post. Installing the HF transformers library is relatively simple; you can run the shell command "pip install transformers".
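After installation you can verify the versions on your machine with a couple of lines of Python; your version numbers may differ from those used to develop the demo:

# check installed versions
import torch as T
import transformers
print("PyTorch version: " + T.__version__)                  # demo used 1.8.0 (CPU)
print("transformers version: " + transformers.__version__)  # demo used 4.11.3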
Overall program structure
The structure of the demonstration program is:
# import modules and packages

device = torch.device('cpu')

class IMDbDataset(T.utils.data.Dataset): . . .

def read_imdb(root_dir): . . .

def main():
  # 0. preparation
  # 1. load raw IMDB train data into memory
  # 2. tokenize the raw data reviews text
  # 3. load tokenized text, labels into PyTorch Dataset
  # 4. load (possibly cached) pretrained HF model
  # 5. fine-tune / train model using standard PyTorch
  # 6. save trained model weights and biases

if __name__ == "__main__":
  main()
The HF library can use either the PyTorch or TensorFlow libraries; the demo uses PyTorch. The IMDbDataset is a program-defined PyTorch class that holds the training data and serves it up in batches. The read_imdb() function is a helper that reads movie review text data from file into memory. All program logic is contained in a single main() function.
The complete demo code, with a few minor edits to save space, is shown in Listing 1. I prefer to indent using two spaces rather than the standard four spaces. The backslash character is used for line continuation to break long statements.
Listing 1: The complete fine-tuning demo program
# imdb_hf_01_tune.py
# fine-tune HF pretrained model for IMDB
# zipped raw data at:
# https://ai.stanford.edu/~amaas/data/sentiment/

import numpy as np  # not used
from pathlib import Path
from transformers import DistilBertTokenizer
import torch as T
from torch.utils.data import DataLoader
from transformers import AdamW, DistilBertForSequenceClassification
from transformers import logging  # suppress warnings

device = T.device('cpu')

class IMDbDataset(T.utils.data.Dataset):
  def __init__(self, reviews_lst, labels_lst):
    self.reviews_lst = reviews_lst  # list of token IDs
    self.labels_lst = labels_lst    # list of 0-1 ints

  def __getitem__(self, idx):
    item = {}  # [input_ids] [attention_mask] [labels]
    for key, val in self.reviews_lst.items():
      item[key] = T.tensor(val[idx]).to(device)
    item['labels'] = \
      T.tensor(self.labels_lst[idx]).to(device)
    return item

  def __len__(self):
    return len(self.labels_lst)

def read_imdb(root_dir):
  reviews_lst = []; labels_lst = []
  root_dir = Path(root_dir)
  for label_dir in ["pos", "neg"]:
    for f_handle in (root_dir/label_dir).iterdir():
      reviews_lst.append(f_handle.read_text(
        encoding='utf-8'))
      if label_dir == "pos":
        labels_lst.append(1)
      else:
        labels_lst.append(0)
  return (reviews_lst, labels_lst)  # lists of strings

def main():
  # 0. get ready
  print("\nBegin fine-tune for IMDB sentiment ")
  logging.set_verbosity_error()  # suppress wordy warnings
  T.manual_seed(1)
  np.random.seed(1)

  # 1. load raw IMDB train data into memory
  print("\nLoading IMDB train data subset into memory ")
  train_reviews_lst, train_labels_lst = \
    read_imdb(".\\DataSmall\\aclImdb\\train")
  print("Done ")

  # consider creating validation set here

  # 2. tokenize the raw data reviews text
  print("\nTokenizing training text ")
  toker = DistilBertTokenizer.from_pretrained(
    'distilbert-base-uncased')
  train_tokens = toker(train_reviews_lst,
    truncation=True, padding=True)  # token IDs and mask

  # 3. load tokenized text and labels into PyTorch Dataset
  print("\nLoading tokenized text into Pytorch Datasets ")
  train_dataset = IMDbDataset(train_tokens,
    train_labels_lst)
  print("Done ")

  # 4. load (possibly cached) pretrained HF model
  print("\nLoading pre-trained DistilBERT model ")
  model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased')
  model.to(device)
  model.train()  # set into training mode
  print("Done ")

  # 5. fine-tune / train model using standard PyTorch
  print("\nLoading Dataset bat_size = 10 ")
  train_loader = DataLoader(train_dataset,
    batch_size=10, shuffle=True)
  print("Done ")

  print("\nFine-tuning the model ")
  optim = AdamW(model.parameters(), lr=5.0e-5)  # wt decay
  for epoch in range(3):
    epoch_loss = 0.0
    for (b_ix, batch) in enumerate(train_loader):
      optim.zero_grad()
      inpt_ids = batch['input_ids']        # tensor
      attn_mask = batch['attention_mask']  # tensor
      lbls = batch['labels']               # tensor
      outputs = model(inpt_ids,
        attention_mask=attn_mask, labels=lbls)
      loss = outputs[0]
      epoch_loss += loss.item()  # accumulate batch loss
      loss.backward()
      optim.step()
      if b_ix % 5 == 0:  # 200 items is 20 batches of 10
        print(" batch = %5d curr batch loss = %0.4f " % \
          (b_ix, loss.item()))
      # if b_ix >= xx: break  # to save time for demo
    print("end epoch = %4d  epoch loss = %0.4f " % \
      (epoch, epoch_loss))
  print("Training complete ")

  # 6. save trained model weights and biases
  print("\nSaving tuned model state ")
  model.eval()
  T.save(model.state_dict(),
    ".\\Models\\imdb_state.pt")  # just state
  print("Done ")

  print("\nEnd demo ")

if __name__ == "__main__":
  main()
Getting the IMDB training data
The IMDB movie review data is available online in compressed form as the aclImdb_v1.tar.gz file, which on a Windows system must be unzipped and extracted using a utility program such as WinZip or 7-Zip. Both utilities are good, but I prefer 7-Zip.
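If you prefer to avoid a GUI utility, one alternative is to extract the archive with Python's standard tarfile module; this sketch assumes the aclImdb_v1.tar.gz file is in the current directory:

# extract the archive programmatically (alternative to WinZip / 7-Zip)
import tarfile
with tarfile.open("aclImdb_v1.tar.gz", "r:gz") as tf:
  tf.extractall(".")  # creates the aclImdb root folder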
The unzipped files are placed in a root folder named aclImdb ("ACL IMDB"). The root folder contains subdirectories named test and train. The test and train directories each contain subdirectories named pos and neg. Each of those directories holds 12,500 text files, where each file is one movie review.
The file names look like 0_9.txt and 113_3.txt, where the first part of the name, before the underscore, is a 0-based index and the second part is the review's actual numeric rating. Ratings of 7, 8, 9, and 10 are positive reviews (all in the pos directory) and ratings of 1, 2, 3, and 4 are negative reviews. Movie reviews that received ratings of 5 or 6 (neither strongly positive nor strongly negative) are not included in the IMDB dataset. Note that the actual 1-to-10 ratings are not used by the demo.
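Although the demo ignores the 1-to-10 ratings, a tiny snippet shows how a file name could be split into its index and rating if you ever need them; the file name used here is just an example:

# parse a review file name (illustration only -- not used by the demo)
fname = "113_3.txt"           # example file name
stem = fname[:-4]             # "113_3"
idx, rating = stem.split("_")
print(int(idx), int(rating))  # 113 3 -- ratings 1-4 are negative, 7-10 are positive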
To reduce the IMDB dataset to a manageable size for experimentation, I used only the training files and deleted all reviews except the first 100 positive and the first 100 negative ones, leaving a total of 200 training reviews.
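I created the small subset by hand, but you could do the same thing programmatically with a sketch like the one below. The source and destination paths are assumptions that match the path used by the demo program, and "first 100" here means the first 100 files in directory-listing order:

# build the 200-review training subset (a sketch; paths are assumptions)
import shutil
from pathlib import Path

src_root = Path("./aclImdb/train")            # full 25,000-review training data
dst_root = Path("./DataSmall/aclImdb/train")  # small subset used by the demo
for label_dir in ["pos", "neg"]:
  (dst_root / label_dir).mkdir(parents=True, exist_ok=True)
  files = sorted((src_root / label_dir).iterdir())[:100]
  for f in files:
    shutil.copy(f, dst_root / label_dir / f.name)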
The program-defined read_imdb() function reads the review text and labels into memory. It is implemented as:
from pathlib import Path

def read_imdb(root_dir):
  reviews_lst = []; labels_lst = []
  root_dir = Path(root_dir)
  for label_dir in ["pos", "neg"]:
    for f_handle in (root_dir/label_dir).iterdir():
      reviews_lst.append(f_handle.read_text(
        encoding='utf-8'))
      if label_dir == "pos":
        labels_lst.append(1)
      else:
        labels_lst.append(0)
  return (reviews_lst, labels_lst)  # lists of strings
The Python pathlib library is relatively new (added in Python 3.4) and is a bit more robust than the older os library (which still works fine). The return result of the read_imdb() function is a Python tuple where the first element is a Python list of the review strings and the second element is a list of the associated class labels, 0 for a negative review and 1 for a positive review.
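A quick way to check that the data loaded as expected, assuming the 200-review subset is at .\DataSmall\aclImdb\train, is:

# sanity-check the loaded data (not part of the demo program)
reviews, labels = read_imdb(".\\DataSmall\\aclImdb\\train")
print(len(reviews), len(labels))   # 200 200
print(labels[0], reviews[0][:60])  # label 1 and the first 60 characters of a positive review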
Loading movie reviews
The demo program begins execution with these statements:
def main():
  # 0. get ready
  print("\nBegin fine-tune for IMDB sentiment ")
  logging.set_verbosity_error()  # suppress warnings
  T.manual_seed(1)
  np.random.seed(1)
. . .
Suppressing warnings isn't a good practice, but I did it to keep the output tidy for the screenshot in Figure 1. Setting the NumPy and PyTorch random number seeds isn't required, but it's generally a good idea to try to make program runs reproducible.
The movie review text and labels are loaded into memory like so:
# 1. load raw IMDB train data into memory
print("\nLoading IMDB train data subset into memory ")
train_reviews_lst, train_labels_lst = \
  read_imdb(".\\DataSmall\\aclImdb\\train")  # text list
print("Done ")
Tokenizing the review text
The demo creates a tokenizer object and then tokenizes the review text with these two statements:
# 2. tokenize the raw data reviews text
print("\nTokenizing training text ")
toker = DistilBertTokenizer.from_pretrained(
  'distilbert-base-uncased')
train_tokens = toker(train_reviews_lst,
  truncation=True, padding=True)
In general, each HF model has its own associated tokenizer that breaks the source sequence text into tokens. This is different from earlier language systems, which often used a generic tokenizer such as spaCy. Therefore, the demo loads the distilbert-base-uncased tokenizer. The return result of applying the tokenizer to the IMDB reviews is a data structure with two components: an input_ids field that holds the integer IDs corresponding to the words in the review text, and an attention_mask field that holds 0s and 1s indicating which tokens are active and which should be ignored (typically padding tokens).
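You can peek at the tokenizer output for the first review with a few lines like these; the exact IDs and padded length depend on the review text:

# inspect the tokenizer output (not part of the demo program)
first_ids = train_tokens['input_ids'][0]        # list of integer token IDs
first_mask = train_tokens['attention_mask'][0]  # 1 = real token, 0 = padding
print(len(first_ids), len(first_mask))          # same padded length
print(first_ids[0:8])                           # begins with 101, the [CLS] token ID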
The tokenized text and attention mask, along with the list of class labels, are passed to the IMDbDataset constructor:
# 3. load tokenized text and labels into PyTorch Dataset
print("\nLoading tokenized text into Pytorch Datasets ")
train_dataset = IMDbDataset(train_tokens,
  train_labels_lst)
print("Done ")
The return result is a PyTorch Dataset object that can serve up batches of training items. The underlying items are dictionary collections with keys ['input_ids'], ['attention_mask'], and ['labels'].
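You can examine one item served by the Dataset object with a snippet like this (not part of the demo program):

# inspect one Dataset item (not part of the demo program)
item = train_dataset[0]         # a dictionary of tensors
print(item.keys())              # dict_keys(['input_ids', 'attention_mask', 'labels'])
print(item['labels'])           # tensor(1) because the first item is a positive review
print(item['input_ids'].shape)  # 1-D tensor of padded token IDs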