fastai ULMFiT versus a basic PyTorch NLP classification model for StockTwits messages
Comparing the performance of ULMFiT with a plain PyTorch NLP model for StockTwits message sentiment analysis.
- Introduction
- PyTorch NLP model
- Preprocessing the Data
- Neural Network
- Training
- ULMFiT approach
- Preprocessing the Data
- Self-supervised language model
- Fine-Tuning the Language Model
- Classifier for StockTwits sentiments
- Model results comparison
Introduction
In this project we compare the results of a sentiment analysis model using an RNN implemented directly in PyTorch with a model built with the fastai library using the ULMFiT (Universal Language Model Fine-tuning) approach from this paper by Jeremy Howard and Sebastian Ruder.
The ULMFiT approach starts from a pretrained language model and fine-tunes it on the vocabulary and text of the new application. Once the language model has been fine-tuned on the new data, we use it as the base for our classification problem. A language model is a model trained to guess the next word in a text given the words that came before. This kind of training is called self-supervised learning: no external labels are needed, because the targets come from the text itself.
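To make the next-word idea concrete, here is a tiny illustration in plain Python (not part of either model) of the kind of context-to-next-word pairs a language model is trained on; the sample sentence is made up for the example:
# Illustrative only: a language model sees the words so far and must predict the next one.
text = "earnings were above expectations this stock should trend up".split()
for i in range(1, len(text)):
    context, target = text[:i], text[i]
    print(' '.join(context), '->', target)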
Let's move on to the actual classification problem we are going to solve.
When assessing the value of a company, it's important to follow the news: for example, a product recall or a natural disaster in a company's supply chain. You want to be able to turn this information into a trading signal.
We will be using posts from the social media site StockTwits. The community on StockTwits is full of investors, traders, and entrepreneurs. Each message posted is called a Twit. This is similar to Twitter's version of a post, called a Tweet. We will build a model around these twits that generates a sentiment score.
The data is a large collection of messages that were hand labeled with the sentiment of each. The degree of sentiment is a five-point scale: very negative, negative, neutral, positive, very positive. Each twit is labeled -2 to 2 in steps of 1, from very negative to very positive respectively.
The first thing we should do is load the data.
PyTorch NLP model
Load Twits Data
This JSON file contains a list of objects for each twit in the 'data' field:
{'data': [
    {'message_body': 'Neutral twit body text here',
     'sentiment': 0},
    {'message_body': 'Happy twit body text here',
     'sentiment': 1},
    ...
]}
The fields represent the following:
- 'message_body': The text of the twit.
- 'sentiment': Sentiment score for the twit, ranging from -2 to 2 in steps of 1, with 0 being neutral.
Let's see what the data look like by printing the number of twits and the first few messages from the list.
import json
import os
from fastai.text.all import *
with open(os.path.join('data', 'stocktwits_sentiment', 'twits.json'), 'r') as f:
    twits = json.load(f)
print(len(twits['data']))
And a couple of messages
print(twits['data'][:5])
Let's split out the messages and the labels:
messages = [twit['message_body'] for twit in twits['data']]
# Since the sentiment scores are discrete, we'll scale the sentiments to 0 to 4 for use in our network
sentiments = [twit['sentiment'] + 2 for twit in twits['data']]
messages[45], sentiments[45]
Preprocessing the Data
With our data in hand we need to preprocess our text. These twits were collected by filtering on ticker symbols, which are denoted with a leading $ symbol in the twit itself. For example,
{'message_body': 'RT @google Our annual look at the year in Google blogging (and beyond) http://t.co/sptHOAh8 $GOOG',
'sentiment': 0}
The ticker symbols don't provide information on the sentiment, and they are in every twit, so we should remove them. This twit also has the @google username, which again provides no sentiment information, so we should remove it as well. We also see a URL, http://t.co/sptHOAh8. Let's remove these too.
import json
import nltk
import os
import random
import re
import torch
from torch import nn, optim
import torch.nn.functional as F
nltk.download('wordnet')
def preprocess_1(message):
    """
    This function takes a string as input, then performs these operations:
        - lowercase
        - remove URLs
        - remove ticker symbols
        - remove punctuation
        - tokenize by splitting the string on whitespace
        - remove any single-character tokens

    Parameters
    ----------
        message : The text message to be preprocessed.

    Returns
    -------
        tokens: The preprocessed text split into tokens.
    """
    # Lowercase the twit message
    text = message.lower()
    # Replace URLs with a space in the message
    text = re.sub(r'https?://[^\s]+', ' ', text)
    # Replace ticker symbols with a space. The ticker symbols are any stock symbol that starts with $.
    text = re.sub(r'\$[a-z]*\b', ' ', text)
    # Replace StockTwits usernames with a space. The usernames are any word that starts with @.
    text = re.sub(r'@\w*\b', ' ', text)
    # Replace everything that is not a letter with a space
    text = re.sub(r'[^a-z]', ' ', text)
    # Tokenize by splitting the string on whitespace into a list of words
    tokens = text.split()
    # Lemmatize words using the WordNetLemmatizer, ignoring any single-character token
    wnl = nltk.stem.WordNetLemmatizer()
    tokens = [wnl.lemmatize(t) for t in tokens if len(t) > 1]
    return tokens
tokenized = [preprocess_1(m) for m in messages]
tokenized[:1], messages[:1]
Looking good.
Let's check for messages that ended up with no tokens:
len([token for token in tokenized if len(token) == 0])
Clean that up
good_tokens = [idx for idx, token in enumerate(tokenized) if len(token) > 0]
tokenized = [tokenized[idx] for idx in good_tokens]
sentiments = [sentiments[idx] for idx in good_tokens]
from collections import Counter
"""
Create a vocabulary by using Bag of words
"""
stacked_tokens = [word for twit in tokenized for word in twit]
bow = Counter(stacked_tokens)
# sort by decreasing order
sorted_bow = sorted(bow, key=bow.get, reverse=True)
Frequency of Words Appearing in Messages
With our vocabulary built, let's remove some of the most common words such as 'the', 'and', 'it', etc. These words don't contribute to identifying sentiment and are so common that they add a lot of noise to our input. If we can filter them out, our network should have an easier time learning.
We also want to remove really rare words that show up in only a few twits. Here we compute a rough frequency for each word and then drop words whose frequency is at or below a small cutoff, as well as the most common words.
# The key is the token and the value is its count divided by the vocabulary size, a rough frequency measure.
total_words = len(bow)
freqs = {word: count/total_words for word, count in bow.items()}
# Float that is the frequency cutoff. Drop words with a frequency that is lower or equal to this number.
low_cutoff = 1e-5
# Integer that is the cut off for most common words. Drop words that are the `high_cutoff` most common words.
high_cutoff = 15
# The k most common words in the corpus. Use `high_cutoff` as the k.
K_most_common = [word[0] for word in bow.most_common(high_cutoff)]
filtered_words = [word for word in freqs if (freqs[word] > low_cutoff and word not in K_most_common)]
print(K_most_common)
len(filtered_words)
from tqdm import tqdm
# A dictionary for the `filtered_words`. The key is the word and value is an id that represents the word.
vocab = {word: i for i, word in enumerate(filtered_words, 1)}
# Reverse of the `vocab` dictionary. The key is word id and value is the word.
id2vocab = {i: word for i, word in enumerate(filtered_words, 1)}
# tokenized with the words not in `filtered_words` removed.
filtered_set = set(filtered_words)  # set lookup is much faster than a list scan
filtered = [[w for w in msg if w in filtered_set] for msg in tqdm(tokenized)]
Balancing the classes
If we look at how our twits are labeled, we'll find that about 45% of them are neutral. This means that our network would be about 45% accurate just by guessing 0 every single time. To help our network learn appropriately, we'll want to balance our classes, that is, make sure each of the different sentiment scores shows up roughly as often in the data.
What we can do here is go through the examples and randomly drop twits with neutral sentiment: we keep each neutral twit with probability keep_prob, chosen so that the number of neutral twits we keep roughly matches the average count of the other four classes. That brings the data from around 45% neutral down to around 20% neutral.
for i in range(0, 5):
    print(f'{i}: {sentiments.count(i)/len(sentiments)}')
balanced = {'messages': [], 'sentiments': []}
n_neutral = sum(1 for each in sentiments if each == 2)
N_examples = len(sentiments)
keep_prob = (N_examples - n_neutral)/4/n_neutral
print(f'keep_prob: {keep_prob}')
for idx, sentiment in enumerate(sentiments):
    message = filtered[idx]
    if sentiment != 2 or random.random() < keep_prob:
        balanced['messages'].append(message)
        balanced['sentiments'].append(sentiment)
Let's check that we are balanced:
n_neutral = sum(1 for each in balanced['sentiments'] if each == 2)
N_examples = len(balanced['sentiments'])
n_neutral/N_examples
Looks good, about a fifth of our samples are neutral now.
Let's convert our tokens into integer ids which we can pass to the network.
token_ids = [[vocab[word] for word in message] for message in balanced['messages']]
sentiments = balanced['sentiments']
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_size, lstm_size, output_size, lstm_layers=1, dropout=0.1):
        """
        Initialize the model by setting up the layers.

        Parameters
        ----------
            vocab_size : The vocabulary size.
            embed_size : The embedding layer size.
            lstm_size : The LSTM layer size.
            output_size : The output size.
            lstm_layers : The number of LSTM layers.
            dropout : The dropout probability.
        """
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.lstm_size = lstm_size
        self.output_size = output_size
        self.lstm_layers = lstm_layers
        self.dropout_prob = dropout

        # Setup embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, lstm_size, lstm_layers,
                            dropout=dropout, batch_first=False)
        # Setup additional layers
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(lstm_size, output_size)
        self.logsoftmax = nn.LogSoftmax(dim=1)

    def init_hidden(self, batch_size):
        """
        Initializes the hidden state.

        Parameters
        ----------
            batch_size : The size of batches.

        Returns
        -------
            hidden_state
        """
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for the hidden state and cell state of the LSTM
        weight = next(self.parameters()).data
        hidden = (weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_(),
                  weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_())
        return hidden

    def forward(self, nn_input, hidden_state):
        """
        Perform a forward pass of our model on nn_input.

        Parameters
        ----------
            nn_input : The batch of input to the NN.
            hidden_state : The LSTM hidden state.

        Returns
        -------
            logps: log softmax output
            hidden_state: The new hidden state.
        """
        # embeddings and lstm_out
        nn_input = nn_input.long()
        embeds = self.embedding(nn_input)
        lstm_out, hidden_state = self.lstm(embeds, hidden_state)
        # keep only the output of the last time step
        lstm_out = lstm_out[-1, :, :]
        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        logps = self.logsoftmax(out)
        return logps, hidden_state
model = TextClassifier(len(vocab)+1, 10, 6, 5, dropout=0.1, lstm_layers=2)
model.embedding.weight.data.uniform_(-1, 1)
nn_input = torch.randint(0, 1000, (5, 4), dtype=torch.int64)
hidden = model.init_hidden(4)
logps, _ = model(nn_input, hidden)
print(logps)
print(model)
Training
DataLoaders and Batching
Now we should build a generator that we can use to loop through our data. It'll be more efficient if we can pass our sequences in as batches. Our input tensors should look like (sequence_length, batch_size). So if our sequences are 40 tokens long and we pass in 25 sequences, then we'd have an input size of (40, 25).
If we set our sequence length to 40, what do we do with messages that are more or less than 40 tokens? For messages with fewer than 40 tokens, we will pad the empty spots with zeros. We should be sure to left pad so that the RNN starts from nothing before going through the data. If the message has 20 tokens, then the first 20 spots of our 40 long sequence will be 0. If a message has more than 40 tokens, we'll just keep the first 40 tokens.
def dataloader(messages, labels, sequence_length=30, batch_size=32, shuffle=False):
    """
    Build a dataloader.
    """
    if shuffle:
        indices = list(range(len(messages)))
        random.shuffle(indices)
        messages = [messages[idx] for idx in indices]
        labels = [labels[idx] for idx in indices]

    total_sequences = len(messages)
    for ii in range(0, total_sequences, batch_size):
        batch_messages = messages[ii: ii+batch_size]
        # First initialize a tensor of all zeros
        batch = torch.zeros((sequence_length, len(batch_messages)), dtype=torch.int64)
        for batch_num, tokens in enumerate(batch_messages):
            token_tensor = torch.tensor(tokens)
            # Left pad!
            start_idx = max(sequence_length - len(token_tensor), 0)
            batch[start_idx:, batch_num] = token_tensor[:sequence_length]
        label_tensor = torch.tensor(labels[ii: ii+len(batch_messages)])
        yield batch, label_tensor
split_frac = 0.8
split_idx = int(len(token_ids)*split_frac)
train_features, valid_features = token_ids[:split_idx], token_ids[split_idx:]
train_labels, valid_labels = sentiments[:split_idx], sentiments[split_idx:]
text_batch, labels = next(iter(dataloader(train_features, train_labels, sequence_length=20, batch_size=64)))
model = TextClassifier(len(vocab)+1, 200, 128, 5, dropout=0.)
hidden = model.init_hidden(64)
logps, hidden = model.forward(text_batch, hidden)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TextClassifier(len(vocab)+1, 1024, 512, 5, lstm_layers=2, dropout=0.2)
model.embedding.weight.data.uniform_(-1, 1)
model.to(device)
"""
Train your model with dropout. Make sure to clip your gradients.
Print the training loss, validation loss, and validation accuracy for every 100 steps.
"""
epochs = 3
batch_size = 1024
learning_rate = .001
clip = 5
print_every = 100
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
model.train()
train_losses = []
valid_losses = []
valid_accs = []
for epoch in range(epochs):
    print('Starting epoch {}'.format(epoch + 1))
    steps = 0
    for text_batch, labels in dataloader(
            train_features, train_labels, batch_size=batch_size, sequence_length=20, shuffle=True):
        steps += 1
        # initialize hidden state
        hidden = model.init_hidden(batch_size=labels.shape[0])
        # move the batch and the hidden state to the device
        text_batch, labels = text_batch.to(device), labels.to(device)
        hidden = tuple(each.to(device) for each in hidden)
        # reset gradients
        model.zero_grad()
        # get output from model
        log_probs, hidden = model(text_batch, hidden)
        # calculate the loss and perform backprop
        loss = criterion(log_probs, labels)
        loss.backward()
        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()

        if steps % print_every == 0:
            model.eval()
            # Get validation loss and accuracy, averaged over the validation batches
            val_losses = []
            val_accs = []
            with torch.no_grad():
                for text_batch, labels in dataloader(valid_features, valid_labels,
                                                     batch_size=batch_size, sequence_length=20,
                                                     shuffle=True):
                    val_hidden = model.init_hidden(labels.shape[0])
                    text_batch, labels = text_batch.to(device), labels.to(device)
                    val_hidden = tuple(each.to(device) for each in val_hidden)
                    val_log_probs, val_hidden = model(text_batch, val_hidden)
                    val_losses.append(criterion(val_log_probs, labels).item())
                    # Accuracy
                    probs = torch.exp(val_log_probs)
                    top_prob, top_class = probs.topk(1)
                    equality = top_class == labels.view(*top_class.shape)
                    val_accs.append(torch.mean(equality.type(torch.FloatTensor)).item())
            valid_loss = sum(val_losses)/len(val_losses)
            valid_acc = sum(val_accs)/len(val_accs)
            train_losses.append(loss.item())
            valid_losses.append(valid_loss)
            valid_accs.append(valid_acc)
            model.train()
            print(f'Epoch: {epoch+1} / {epochs} \tStep: {steps}',
                  f'\n Train Loss: {loss.item():.3f}',
                  f' Validation Loss: {valid_loss:.3f}',
                  f' Validation Accy: {valid_acc:.3f}')
So we get about 72% accuracy on the validation set. Let's try the ULMFiT approach.
from fastai.text.all import *
with open(os.path.join('data', 'stocktwits_sentiment', 'twits.json'), 'r') as f:
twits = json.load(f)
messages = [twit['message_body'] for twit in twits['data']]
# Since the sentiment scores are discrete, we'll scale the sentiments to 0 to 4 for use in our network
sentiments = [twit['sentiment'] + 2 for twit in twits['data']]
For the preprocessing step, fastai does a lot of the work for us, using the spaCy tokenizer by default. It also keeps track of uppercase letters with special tokens, so we do not lowercase everything ourselves. Our preprocess function is therefore much simpler.
def preprocess(message):
    text = message
    # Replace URLs with a space in the message
    text = re.sub(r'https?://[^\s]+', ' ', text)
    # Replace ticker symbols with a space. The ticker symbols are any stock symbol that starts with $.
    text = re.sub(r'\$[a-zA-Z]*\b', ' ', text)
    # Replace StockTwits usernames with a space. The usernames are any word that starts with @.
    text = re.sub(r'@\w*\b', ' ', text)
    # Replace everything that is not a letter with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Remove multiple spaces
    text = ' '.join(text.split())
    # No manual tokenization or lemmatization here: fastai's tokenizer handles that for us.
    return text
messages_clean = [preprocess(m) for m in messages]
messages_clean[:10], sentiments[:10]
defaults.text_proc_rules
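To see how fastai keeps capitalization information, we can run its default tokenizer on a sample message. This is just an illustrative check (the sample text is made up): WordTokenizer wraps spaCy by default, and Tokenizer applies the preprocessing rules listed above, inserting special tokens such as xxbos (beginning of stream) and xxmaj (next word was capitalized).
# Quick look at fastai's default tokenization (relies on the fastai.text.all import above)
spacy_tok = WordTokenizer()          # spaCy-based word tokenizer
tkn = Tokenizer(spacy_tok)           # applies the default text processing rules
print(coll_repr(tkn('AAPL Earnings were GREAT today!'), 20))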
Again, neutral messages make up around 45% of our sample data, so guessing neutral every time would already give an accuracy of 45%. We can randomly drop neutral-labelled rows so that neutral data makes up around 1/5 of the total data.
balanced = {'messages': [], 'sentiments': []}
n_neutral = sum(1 for each in sentiments if each == 2)
N_examples = len(sentiments)
keep_prob = (N_examples - n_neutral)/4/n_neutral
print(f'keep_prob: {keep_prob}')
for idx, sentiment in enumerate(sentiments):
    message = messages_clean[idx]
    if sentiment != 2 or random.random() < keep_prob:
        balanced['messages'].append(message)
        balanced['sentiments'].append(sentiment)
Check that neutral data is balanced
n_neutral = sum(1 for each in balanced['sentiments'] if each == 2)
N_examples = len(balanced['sentiments'])
n_neutral/N_examples
data = {'messages': balanced['messages'], 'sentiments': balanced['sentiments']}
df = pd.DataFrame(data); df.head()
Remove blank messages
df = df[df['messages']!='']
df.head()
df.shape
The first step of the ULMFiT approach is to build a language model: a model that predicts the next word given the previous words. This is self-supervised learning, so in a batch the input is a text buffer and the target is the same buffer shifted by one word.
dls = TextDataLoaders.from_df(df, text_col='messages', label_col='sentiments', is_lm=True)
dls.show_batch(max_n=3)
With this data we can now fine-tune the language model. We will be using a recurrent neural network (RNN) with an architecture called AWD-LSTM.
learn = language_model_learner(
dls, AWD_LSTM, drop_mult=0.3,
metrics=[accuracy, Perplexity()]).to_fp16()
language_model_learner automatically calls freeze when using a pretrained model, so fit_one_cycle only trains the embeddings, i.e. the weights for words that are in our StockTwits vocab but aren't in the pretrained model's vocab.
learn.fit_one_cycle(1, 2e-2)
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)
learn.summary()
We now have a language model, originally pretrained on Wikipedia data, fine-tuned on StockTwits data. The final goal is to predict the sentiment of messages, but let's first look at what this language model has learned.
TEXT = "Earnings were above expectations. This stock should be trending up."
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75)
for _ in range(N_SENTENCES)]
print("\n".join(preds))
We now train a classifier on the sentiment labels, using our fine-tuned language model as a starting point.
dls_clas = TextDataLoaders.from_df(df, text_col='messages', label_col='sentiments', is_lm=False)
dls_clas.show_batch(max_n=3)
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
metrics=accuracy).to_fp16()
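One part of the ULMFiT recipe is not shown in the cells above: saving the fine-tuned language model's encoder and loading it into the classifier, so the classifier actually starts from the fine-tuned weights rather than only the Wikipedia-pretrained ones. A minimal sketch of that step, assuming the encoder was saved from the language-model learner before it was replaced (the name 'finetuned_lm' is an assumption, and ideally dls_clas would be built with text_vocab=dls.vocab so both learners share the same vocabulary):
# On the language-model learner, right after the fit_one_cycle calls above:
# learn.save_encoder('finetuned_lm')   # saves the model minus the next-word prediction head
# On the classifier learner created just above:
learn.load_encoder('finetuned_lm')     # initialise the classifier with the fine-tuned encoder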
We train this classifier with discriminative learning rates and gradual unfreezing. For NLP classifiers, the fastai library authors found that unfreezing a few layers at a time makes a real difference.
learn.fit_one_cycle(1, 2e-2)
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))
learn.unfreeze()
learn.fit_one_cycle(10, slice(1e-3/(2.6**4),1e-3))
We reach around 75% accuracy, about 3 percentage points better than our plain PyTorch model. 75% accuracy on a five-level sentiment classifier is quite impressive. The fastai implementation is also built on PyTorch, so it makes sense that the results are relatively close, but the plain PyTorch implementation requires noticeably more work. As a reminder, the labels map to:
- 0: very negative
- 1: negative
- 2: neutral
- 3: positive
- 4: very positive
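One of the predictions below calls learn_inf, which presumably refers to an inference learner reloaded from an exported model. A minimal sketch of how such a learner could be created (the file name is an assumption):
learn.export('stocktwits_classifier.pkl')              # serialize the trained Learner
learn_inf = load_learner('stocktwits_classifier.pkl')  # reload it for inference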
learn.predict(preprocess('$AAPL doing my part. Just bought my first iPad'))
learn.predict(preprocess('$AAPL historic reversal selloff on news , historic crash news'))
learn_inf.predict(preprocess('$TSLA if you haven’t seen it yet, Tsla had a double bottom and has upward momentum. $410 is strong support and this might close the week above $440. Only positive news from now till Election Day'))
learn.predict(preprocess("Sold my $AAPL 10/16 $120c this morning and MAN I'm glad I'm out of there!\n\nAlthough, this was one of my favorite trades, holding through the volatility last week was ROUGH and I didn't want to be caught holding the bag (after the iPhone event)\n\nWhat's next? since $SPY making a new high... Might looks for some shorts now 🐻🐻🐻"))
learn.predict(preprocess('$AAPL the news already leaked. Nothing new just a battery that last an extra hour and the 5G. They are going to launch 4 phones. Probably this time they will introduce the middle finger option to unlock the phone. To make the idiots happy that they brought something new. Lol'))