I fine-tuned a GPT-2 transformer model on a custom dataset to classify news articles, building a pipeline for creating the dataset, fine-tuning the model, and analyzing the results.
The project began as part of my Machine Learning course with scraping data from several technology and politics websites, which produced the custom dataset (news_dataset.csv). GPT-2, a transformer model, was then fine-tuned on this dataset for the specific task of classifying news articles as technology or politics.
A code snippet showing the source websites and the dataset creation is given below:
import csv
import random
import sqlite3

import newspaper

# Local SQLite store for the scraped articles (the database filename here is assumed).
conn = sqlite3.connect('articles.db')
c = conn.cursor()
c.execute("CREATE TABLE IF NOT EXISTS articles (title TEXT, text TEXT, label TEXT)")

tech_websites = ['https://www.techradar.com/', 'https://www.theverge.com/', 'https://www.wired.com/',
                 'https://www.tomshardware.com/']
political_websites = ['https://www.politico.com/',
                      'https://www.nytimes.com/section/politics']

for website in tech_websites + political_websites:
    print(f'Scraping {website}...')
    paper = newspaper.build(website)
    count = 0
    for article in paper.articles:
        # Cap each site at 200 articles to keep the two classes roughly balanced.
        if count == 200:
            break
        try:
            article.download()
            article.parse()
            title = article.title
            text = article.text
            label = 'technology' if website in tech_websites else 'political'
            c.execute("INSERT INTO articles (title, text, label) VALUES (?, ?, ?)", (title, text, label))
            count += 1
        except Exception as e:
            print(f"Error processing article: {str(e)}")
conn.commit()

# Pull the labeled articles back out, shuffle them, and write the CSV used for fine-tuning.
c.execute("SELECT title, text, label FROM articles WHERE label='technology' OR label='political'")
results = c.fetchall()
random.shuffle(results)
with open('news_dataset.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['text', 'label'])
    for result in results:
        writer.writerow([result[1], result[2]])
conn.close()
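The split into train and test sets is not shown in the original snippets; a minimal sketch, assuming pandas and an 80/20 split with a fixed seed (both assumptions on my part, not the project's recorded settings), could look like this:

import pandas as pd

# Load the CSV written above; the 80/20 ratio and the random seed are assumptions.
df = pd.read_csv('news_dataset.csv').dropna(subset=['text', 'label'])
df_train = df.sample(frac=0.8, random_state=42)
df_test = df.drop(df_train.index)

The two dataframes, df_train and df_test, are what the Dataset class below wraps.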
Another snippet shows how the data is preprocessed into PyTorch Dataset objects for the train and test splits:
import numpy as np
import torch
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.padding_side = "left"
# GPT-2 has no pad token, so reuse the end-of-text token for padding.
tokenizer.pad_token = tokenizer.eos_token

labels = {
    "technology": 0,
    "political": 1
}

class Dataset(torch.utils.data.Dataset):
    def __init__(self, df):
        self.labels = [labels[label] for label in df['label']]
        self.texts = [tokenizer(text,
                                padding='max_length',
                                max_length=128,
                                truncation=True,
                                return_tensors="pt") for text in df['text']]

    def classes(self):
        return self.labels

    def __len__(self):
        return len(self.labels)

    def get_batch_labels(self, idx):
        return np.array(self.labels[idx])

    def get_batch_texts(self, idx):
        return self.texts[idx]

    def __getitem__(self, idx):
        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)
        return batch_texts, batch_y
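The fine-tuning step itself is not shown above. A minimal sketch of how it could look with Hugging Face's GPT2ForSequenceClassification, the Dataset class defined here, and the df_train dataframe from the split sketch earlier; the learning rate, batch size, and epoch count are assumptions rather than the project's actual settings:

import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import GPT2ForSequenceClassification

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Two labels: technology (0) and political (1). The pad token id must be set
# explicitly because GPT-2 was pretrained without one.
model = GPT2ForSequenceClassification.from_pretrained('gpt2', num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id
model.to(device)

train_loader = DataLoader(Dataset(df_train), batch_size=8, shuffle=True)
optimizer = AdamW(model.parameters(), lr=2e-5)  # assumed hyperparameters

model.train()
for epoch in range(3):
    for batch_texts, batch_y in train_loader:
        # Each tokenized example has shape [1, 128]; drop the extra dim after batching.
        input_ids = batch_texts['input_ids'].squeeze(1).to(device)
        attention_mask = batch_texts['attention_mask'].squeeze(1).to(device)
        targets = batch_y.to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=targets)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f'epoch {epoch}: last batch loss {outputs.loss.item():.4f}')

Evaluation on df_test would follow the same pattern with model.eval() and torch.no_grad().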
Tools used: The newspaper3k library was employed for web scraping. GPT-2, a powerful transformer model, was fine-tuned for the news classification task.
The project successfully demonstrated GPT-2's capability to classify news articles. It also laid the groundwork for further improvements, such as scraping more news articles to increase the training set size and exploring other pre-trained transformer models like BERT, BART, and RoBERTa.
The work presented insights into the effectiveness of fine-tuning GPT-2 for news classification, paving the way for potential applications in information categorization.
The project provided hands-on experience in web scraping, dataset creation, and fine-tuning transformer models, and it highlighted the significance of dataset size and diversity for model performance. In future iterations, I would scrape a larger and more diverse set of news articles and experiment with different pre-trained transformer models to improve accuracy. Incorporating text from domains beyond news articles could also improve how well the model generalizes.
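As a sketch of what exploring other pre-trained models would involve: with the transformers Auto classes, the rest of the pipeline stays largely the same and only the tokenizer and model change (the checkpoint names below are just examples):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Any of these checkpoints could be dropped in place of GPT-2.
checkpoint = 'roberta-base'  # or 'bert-base-uncased', 'facebook/bart-base'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Unlike GPT-2, these encoder models already ship with a pad token, so the pad_token and padding_side overrides used above would not be needed.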