I fine-tuned a GPT-2 transformer model on a custom dataset to classify news articles, building a pipeline for creating the dataset, fine-tuning the model, and analyzing the results.
The project began as part of my Machine Learning course with scraping data from several technology and politics websites, which produced the custom dataset (news_dataset.csv). GPT-2, a transformer model, was then fine-tuned on this dataset for the specific task of classifying news articles as technology or politics.
A code snippet showing the source websites and the dataset creation is given below:
import csv
import random
import sqlite3

import newspaper

# Local SQLite store for the scraped articles (the database filename here is assumed).
conn = sqlite3.connect('articles.db')
c = conn.cursor()
c.execute("CREATE TABLE IF NOT EXISTS articles (title TEXT, text TEXT, label TEXT)")

tech_websites = ['https://www.techradar.com/', 'https://www.theverge.com/', 'https://www.wired.com/',
                 'https://www.tomshardware.com/']
political_websites = ['https://www.politico.com/',
                      'https://www.nytimes.com/section/politics']

for website in tech_websites + political_websites:
    print(f'Scraping {website}...')
    paper = newspaper.build(website)
    count = 0
    for article in paper.articles:
        # Cap each site at 200 articles to keep the two classes roughly balanced.
        if count == 200:
            break
        try:
            article.download()
            article.parse()
            title = article.title
            text = article.text
            label = 'technology' if website in tech_websites else 'political'
            c.execute("INSERT INTO articles (title, text, label) VALUES (?, ?, ?)", (title, text, label))
            count += 1
        except Exception as e:
            print(f"Error processing article: {str(e)}")
conn.commit()

# Pull the labeled articles back out, shuffle them, and write the CSV used for fine-tuning.
c.execute("SELECT title, text, label FROM articles WHERE label='technology' OR label='political'")
results = c.fetchall()
random.shuffle(results)
with open('news_dataset.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['text', 'label'])
    for result in results:
        writer.writerow([result[1], result[2]])
conn.close()
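The split into train and test sets is not shown in the original snippets; a minimal sketch, assuming pandas and an 80/20 split with a fixed seed (both assumptions on my part, not the project's recorded settings), could look like this:

import pandas as pd

# Load the CSV written above; the 80/20 ratio and the random seed are assumptions.
df = pd.read_csv('news_dataset.csv').dropna(subset=['text', 'label'])
df_train = df.sample(frac=0.8, random_state=42)
df_test = df.drop(df_train.index)

The two dataframes, df_train and df_test, are what the Dataset class below wraps.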
Another snippet shows how the data is preprocessed into PyTorch Dataset objects for the train and test splits:
import numpy as np
import torch
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.padding_side = "left"
# GPT-2 has no pad token, so reuse the end-of-text token for padding.
tokenizer.pad_token = tokenizer.eos_token

labels = {
    "technology": 0,
    "political": 1
}

class Dataset(torch.utils.data.Dataset):
    def __init__(self, df):
        self.labels = [labels[label] for label in df['label']]
        self.texts = [tokenizer(text,
                                padding='max_length',
                                max_length=128,
                                truncation=True,
                                return_tensors="pt") for text in df['text']]

    def classes(self):
        return self.labels

    def __len__(self):
        return len(self.labels)

    def get_batch_labels(self, idx):
        return np.array(self.labels[idx])

    def get_batch_texts(self, idx):
        return self.texts[idx]

    def __getitem__(self, idx):
        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)
        return batch_texts, batch_y
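The fine-tuning step itself is not shown above. A minimal sketch of how it could look with Hugging Face's GPT2ForSequenceClassification, the Dataset class defined here, and the df_train dataframe from the split sketch earlier; the learning rate, batch size, and epoch count are assumptions rather than the project's actual settings:

import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import GPT2ForSequenceClassification

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Two labels: technology (0) and political (1). The pad token id must be set
# explicitly because GPT-2 was pretrained without one.
model = GPT2ForSequenceClassification.from_pretrained('gpt2', num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id
model.to(device)

train_loader = DataLoader(Dataset(df_train), batch_size=8, shuffle=True)
optimizer = AdamW(model.parameters(), lr=2e-5)  # assumed hyperparameters

model.train()
for epoch in range(3):
    for batch_texts, batch_y in train_loader:
        # Each tokenized example has shape [1, 128]; drop the extra dim after batching.
        input_ids = batch_texts['input_ids'].squeeze(1).to(device)
        attention_mask = batch_texts['attention_mask'].squeeze(1).to(device)
        targets = batch_y.to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=targets)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f'epoch {epoch}: last batch loss {outputs.loss.item():.4f}')

Evaluation on df_test would follow the same pattern with model.eval() and torch.no_grad().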
Tools used: The newspaper3k library was employed for web scraping. GPT-2, a powerful transformer model, was fine-tuned for the news classification task.
The project successfully demonstrated GPT-2's capability to classify news articles. It also laid the groundwork for further improvements, such as scraping more news articles to increase the training set size and exploring other pre-trained transformer models like BERT, BART, and RoBERTa.
The work presented insights into the effectiveness of fine-tuning GPT-2 for news classification, paving the way for potential applications in information categorization.
The project provided hands-on experience in web scraping, dataset creation, and fine-tuning transformer models, and it highlighted the significance of dataset size and diversity for model performance. In future iterations, I would scrape a larger and more diverse set of news articles and experiment with different pre-trained transformer models to improve accuracy. Incorporating text from domains beyond news articles could also improve how well the model generalizes.
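As a sketch of what exploring other pre-trained models would involve: with the transformers Auto classes, the rest of the pipeline stays largely the same and only the tokenizer and model change (the checkpoint names below are just examples):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Any of these checkpoints could be dropped in place of GPT-2.
checkpoint = 'roberta-base'  # or 'bert-base-uncased', 'facebook/bart-base'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Unlike GPT-2, these encoder models already ship with a pad token, so the pad_token and padding_side overrides used above would not be needed.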