Introduction

In this project we will compare the results of a sentiment analysis model built with an RNN implemented directly in PyTorch against one built with the fastai library, using the ULMFit (Universal Language Model Fine-tuning) approach from the paper by Jeremy Howard and Sebastian Ruder.

The ULMFit approach starts from a pretrained language model and fine-tunes it on the vocabulary and text of the new application. Once the language model has been fine-tuned on the new data, we use it as the base for our classification problem. A language model is a model trained to guess the next word in a text from the words that came before. This kind of training is called self-supervised learning: no labels need to be provided, the text itself is the supervision.

Let's move on to the actual classification problem we are going to solve.

When assessing the value of a company, it's important to follow the news: for example, a product recall or a natural disaster in a company's supply chain. We want to be able to turn this kind of information into a signal.

We will be using posts from the social media site StockTwits. The community on StockTwits is full of investors, traders, and entrepreneurs. Each message posted is called a twit, similar to Twitter's version of a post, a tweet. We will build a model around these twits that generates a sentiment score.

The data is a large collection of messages that were hand labeled with the sentiment of each. The degree of sentiment is a five-point scale: very negative, negative, neutral, positive, very positive. Each twit is labeled -2 to 2 in steps of 1, from very negative to very positive respectively.
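
For reference, the mapping between the numeric labels and their meaning can be written out explicitly (an illustrative snippet, not part of the original notebook; the labels are later shifted from [-2, 2] to [0, 4] before training):

# Illustrative only: the five-point sentiment scale used in the data
sentiment_scale = {
    -2: 'very negative',
    -1: 'negative',
     0: 'neutral',
     1: 'positive',
     2: 'very positive',
}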

The first thing we should do is load the data.

PyTorch NLP model

Load Twits Data

The JSON file we load below contains a list of twit objects in its 'data' field:

{'data':
  [{'message_body': 'Neutral twit body text here',
    'sentiment': 0},
   {'message_body': 'Happy twit body text here',
    'sentiment': 1},
   ...
  ]
}

The fields represent the following:

  • 'message_body': The text of the twit.
  • 'sentiment': Sentiment score for the twit, ranges from -2 to 2 in steps of 1, with 0 being neutral.

To see what the data looks like, we first load the JSON file and then print a few of the twits from the list.

from fastai.text.all import *
with open(os.path.join('data', 'stocktwits_sentiment', 'twits.json'), 'r') as f:
    twits = json.load(f)

Length of data

Let's look at the length of our data:

print(len(twits['data']))
1548010

And a few of the messages:

print(twits['data'][:5])
[{'message_body': '$FITB great buy at 26.00...ill wait', 'sentiment': 2, 'timestamp': '2018-07-01T00:00:09Z'}, {'message_body': '@StockTwits $MSFT', 'sentiment': 1, 'timestamp': '2018-07-01T00:00:42Z'}, {'message_body': '#STAAnalystAlert for $TDG : Jefferies Maintains with a rating of Hold setting target price at USD 350.00. Our own verdict is Buy  http://www.stocktargetadvisor.com/toprating', 'sentiment': 2, 'timestamp': '2018-07-01T00:01:24Z'}, {'message_body': '$AMD I heard there’s a guy who knows someone who thinks somebody knows something - on StockTwits.', 'sentiment': 1, 'timestamp': '2018-07-01T00:01:47Z'}, {'message_body': '$AMD reveal yourself!', 'sentiment': 0, 'timestamp': '2018-07-01T00:02:13Z'}]

Let's split the messages and the labels:

messages = [twit['message_body'] for twit in twits['data']]
# Since the sentiment scores are discrete, we'll scale the sentiments to 0 to 4 for use in our network
sentiments = [twit['sentiment'] + 2 for twit in twits['data']]
messages[45], sentiments[45]
('$NFLX just noticed they have the last jedi on stream. Love this stock', 2)

Preprocessing the Data

With our data in hand, we need to preprocess our text. These twits were collected by filtering on ticker symbols, which are denoted by a leading $ symbol in the twit itself. For example,

{'message_body': 'RT @google Our annual look at the year in Google blogging (and beyond) http://t.co/sptHOAh8 $GOOG', 'sentiment': 0}

The ticker symbols don't provide information on the sentiment, and they are in every twit, so we should remove them. This twit also has the @google username, again not providing sentiment information, so we should also remove it. We also see a URL http://t.co/sptHOAh8. Let's remove these too.

import json
import nltk
import os
import random
import re
import torch

from torch import nn, optim
import torch.nn.functional as F
nltk.download('wordnet')

def preprocess_1(message):
    """
    This function takes a string as input, then performs these operations: 
        - lowercase
        - remove URLs
        - remove ticker symbols 
        - removes punctuation
        - tokenize by splitting the string on whitespace 
        - removes any single character tokens
    
    Parameters
    ----------
        message : The text message to be preprocessed.
        
    Returns
    -------
        tokens: The preprocessed text into tokens.
    """ 
    # Lowercase the twit message
    text = message.lower()
    
    # Replace URLs with a space in the message
    text = re.sub(r'https?://[^\s]+', ' ', text)
    
    # Replace ticker symbols with a space. The ticker symbols are any stock symbol that starts with $.
    text = re.sub(r'\$[a-z]*\b', ' ', text)
    
    # Replace StockTwits usernames with a space. The usernames are any word that starts with @.
    text = re.sub(r'@\w*\b', ' ', text)

    # Replace everything not a letter with a space
    text = re.sub(r'[^a-z]', ' ', text)
    
    # Tokenize by splitting the string on whitespace into a list of words
    tokens = text.split()

    # Lemmatize words using the WordNetLemmatizer. You can ignore any word that is not longer than one character.
    wnl = nltk.stem.WordNetLemmatizer()
    tokens = [wnl.lemmatize(t) for t in tokens if len(t) > 1]
    
    return tokens
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
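
As a quick, illustrative check (not part of the original run), we can apply preprocess_1 to the example twit shown above and confirm that the URL, the ticker symbol and the username are stripped out:

sample = ('RT @google Our annual look at the year in Google blogging '
          '(and beyond) http://t.co/sptHOAh8 $GOOG')
print(preprocess_1(sample))
# The URL, the $GOOG ticker and the @google username should all be gone,
# leaving only lowercase word tokens such as 'annual', 'look', 'year', ...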

Preprocess All the Twits

tokenized = [preprocess_1(m) for m in messages]
tokenized[:1], messages[:1]
([['great', 'buy', 'at', 'ill', 'wait']],
 ['$FITB great buy at 26.00...ill wait'])

Looking good.

Let's check how many messages end up with no tokens at all:

len([token for token in tokenized if len(token) == 0])
48528

Let's clean those up by dropping the empty messages (and their labels):

good_tokens = [idx for idx, token in enumerate(tokenized) if len(token) > 0]
tokenized = [tokenized[idx] for idx in good_tokens]
sentiments = [sentiments[idx] for idx in good_tokens]

Bag of Words

Now with all of our messages tokenized, we want to create a vocabulary and count up how often each word appears in our entire corpus.

from collections import Counter
"""
Create a vocabulary by using Bag of words
"""
stacked_tokens = [word for twit in tokenized for word in twit]

bow = Counter(stacked_tokens)
# sort by decreasing order
sorted_bow = sorted(bow, key=bow.get, reverse=True)

Frequency of Words Appearing in Messages

With our vocabulary built, let's remove some of the most common words such as 'the', 'and', 'it', etc. These words don't contribute to identifying sentiment and are so common that they mostly add noise to our input. If we filter these out, our network should have an easier time learning.

We also want to remove really rare words that show up in only a few twits. Here we normalize each word's count into a frequency and then drop the words whose frequency falls at or below a small cutoff.

# Number of unique words in the corpus, used to normalize the counts
total_words = len(bow)

# The key is the token and the value is its normalized frequency in the corpus
freqs = {word: count/total_words for word, count in bow.items()}

# Float that is the frequency cutoff. Drop words with a frequency that is lower or equal to this number.
low_cutoff = 1e-5

# Integer that is the cut off for most common words. Drop words that are the `high_cutoff` most common words.
high_cutoff = 15

# The k most common words in the corpus. Use `high_cutoff` as the k.
K_most_common = [word[0] for word in bow.most_common(high_cutoff)]


filtered_words = [word for word in freqs if (freqs[word] > low_cutoff and word not in K_most_common)]
print(K_most_common)
len(filtered_words) 
['the', 'to', 'is', 'for', 'on', 'of', 'and', 'in', 'this', 'it', 'at', 'will', 'up', 'are', 'you']
98448

Build the Vocabulary and Filter the Messages

from tqdm import tqdm

# A dictionary for the `filtered_words`. The key is the word and value is an id that represents the word.
vocab = {word: i for i, word in enumerate(filtered_words, 1)}

# Reverse of the `vocab` dictionary. The key is word id and value is the word. 
id2vocab = {i: word for i, word in enumerate(filtered_words, 1)}

# `tokenized` with the words not in `filtered_words` removed.
# (Turning `filtered_words` into a set first would make this membership test much faster.)
filtered = [[w for w in msg if w in filtered_words] for msg in tqdm(tokenized)]
100%|██████████| 1499482/1499482 [2:13:51<00:00, 186.69it/s]

Balancing the classes

If we look at how our twits are labeled, we'll find that 45% of them are neutral. This means that our network will be 45% accurate just by guessing 0 every single time. To help our network learn appropriately, we'll want to balance our classes. That is, make sure each of our different sentiment scores show up roughly as frequently in the data.

What we can do here is go through our examples and randomly drop twits with neutral sentiment, going from roughly 45% neutral down to around 20%. Keeping each neutral twit with probability keep_prob = (N_examples - n_neutral) / (4 * n_neutral) leaves, on average, as many neutral twits as the average size of the other four classes, i.e. one fifth of the balanced data.

for i in range(0,5):
    print(f'{i}: {sentiments.count(i)/len(sentiments)}')
0: 0.08710474683924181
1: 0.11403071193918966
2: 0.44561321843143165
3: 0.20476004380179288
4: 0.14849127898834397
balanced = {'messages': [], 'sentiments':[]}

n_neutral = sum(1 for each in sentiments if each == 2)
N_examples = len(sentiments)
keep_prob = (N_examples - n_neutral)/4/n_neutral
print(f'keep_prob: {keep_prob}')

for idx, sentiment in enumerate(sentiments):
    message = filtered[idx]
    if sentiment != 2 or random.random() < keep_prob:
        balanced['messages'].append(message)
        balanced['sentiments'].append(sentiment) 
keep_prob: 0.31102465021124265

Let's check that we are balanced:

n_neutral = sum(1 for each in balanced['sentiments'] if each == 2)
N_examples = len(balanced['sentiments'])
n_neutral/N_examples
0.1998550428903639

Looks good, about 1/5 of our samples are neutral now.

Let's convert our tokens into integer ids which we can pass to the network.

token_ids = [[vocab[word] for word in message] for message in balanced['messages']]
sentiments = balanced['sentiments']

Neural Network

With our vocabulary mapped to integer ids, we are now ready to build our neural network.

Here is a diagram showing the network:

Embed -> RNN -> Dense -> Softmax

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_size, lstm_size, output_size, lstm_layers=1, dropout=0.1):
        """
        Initialize the model by setting up the layers.
        
        Parameters
        ----------
            vocab_size : The vocabulary size.
            embed_size : The embedding layer size.
            lstm_size : The LSTM layer size.
            output_size : The output size.
            lstm_layers : The number of LSTM layers.
            dropout : The dropout probability.
        """
        
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.lstm_size = lstm_size
        self.output_size = output_size
        self.lstm_layers = lstm_layers
        self.dropout = dropout

        # Setup embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, lstm_size, lstm_layers, 
                            dropout=dropout, batch_first=False)
        
        # Setup additional layers
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(lstm_size, output_size)
        self.logsoftmax = nn.LogSoftmax(dim=1)


    def init_hidden(self, batch_size):
        """ 
        Initializes hidden state
        
        Parameters
        ----------
            batch_size : The size of batches.
        
        Returns
        -------
            hidden_state
            
        """
        
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        hidden = (weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_(),
                  weight.new(self.lstm_layers, batch_size, self.lstm_size).zero_())
        
        return hidden


    def forward(self, nn_input, hidden_state):
        """
        Perform a forward pass of our model on nn_input.
        
        Parameters
        ----------
            nn_input : The batch of input to the NN.
            hidden_state : The LSTM hidden state.

        Returns
        -------
            logps: log softmax output
            hidden_state: The new hidden state.

        """
        # embeddings and lstm_out
        nn_input = nn_input.long()
        embeds = self.embedding(nn_input)
        lstm_out, hidden_state = self.lstm(embeds, hidden_state)
        
        lstm_out = lstm_out[-1, : , :]
        
        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        
        logps = self.logsoftmax(out)
        
        return logps, hidden_state

View Model

model = TextClassifier(len(vocab), 10, 6, 5, dropout=0.1, lstm_layers=2)
model.embedding.weight.data.uniform_(-1, 1)
input = torch.randint(0, 1000, (5, 4), dtype=torch.int64)
hidden = model.init_hidden(4)

logps, _ = model.forward(input, hidden)
print(logps)
print(model)
tensor([[-2.0672, -1.8099, -1.3161, -1.5152, -1.5058],
        [-2.0718, -1.7388, -1.3464, -1.5252, -1.5115],
        [-2.0702, -1.7568, -1.3375, -1.5287, -1.5054],
        [-2.0329, -1.9265, -1.2760, -1.5094, -1.4997]], grad_fn=<LogSoftmaxBackward>)
TextClassifier(
  (embedding): Embedding(98448, 10)
  (lstm): LSTM(10, 6, num_layers=2, dropout=0.1)
  (dropout): Dropout(p=0.1, inplace=False)
  (fc): Linear(in_features=6, out_features=5, bias=True)
  (logsoftmax): LogSoftmax(dim=1)
)
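
As a quick illustrative check of the shapes flowing through the network (not part of the original notebook): the input is laid out as (sequence_length, batch_size) and the log-probabilities come out as one row per example.

print(input.shape)   # torch.Size([5, 4])  -> 5 time steps, batch of 4
print(logps.shape)   # torch.Size([4, 5])  -> 4 examples, 5 sentiment classes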

Training

DataLoaders and Batching

Now we should build a generator that we can use to loop through our data. It'll be more efficient if we can pass our sequences in as batches. Our input tensors should look like (sequence_length, batch_size). So if our sequences are 40 tokens long and we pass in 25 sequences, then we'd have an input size of (40, 25).

If we set our sequence length to 40, what do we do with messages that are more or less than 40 tokens? For messages with fewer than 40 tokens, we will pad the empty spots with zeros. We should be sure to left pad so that the RNN starts from nothing before going through the data. If the message has 20 tokens, then the first 20 spots of our 40 long sequence will be 0. If a message has more than 40 tokens, we'll just keep the first 40 tokens.

def dataloader(messages, labels, sequence_length=30, batch_size=32, shuffle=False):
    """ 
    Build a dataloader.
    """
    if shuffle:
        indices = list(range(len(messages)))
        random.shuffle(indices)
        messages = [messages[idx] for idx in indices]
        labels = [labels[idx] for idx in indices]

    total_sequences = len(messages)

    for ii in range(0, total_sequences, batch_size):
        batch_messages = messages[ii: ii+batch_size]
        
        # First initialize a tensor of all zeros
        batch = torch.zeros((sequence_length, len(batch_messages)), dtype=torch.int64)
        for batch_num, tokens in enumerate(batch_messages):
            token_tensor = torch.tensor(tokens)
            # Left pad!
            start_idx = max(sequence_length - len(token_tensor), 0)
            batch[start_idx:, batch_num] = token_tensor[:sequence_length]
        
        label_tensor = torch.tensor(labels[ii: ii+len(batch_messages)])
        
        yield batch, label_tensor
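
As a quick sanity check of the left padding (a toy example with made-up token ids, not part of the original notebook):

toy_batch, toy_labels = next(iter(dataloader([[7, 8, 9]], [2], sequence_length=6, batch_size=1)))
print(toy_batch[:, 0])   # tensor([0, 0, 0, 7, 8, 9]) -> zeros pad on the left
print(toy_labels)        # tensor([2])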

Training and Validation

Split it into training and validation sets.

split_frac = 0.8
split_idx = int(len(token_ids)*split_frac)

train_features, valid_features = token_ids[:split_idx], token_ids[split_idx:]
train_labels, valid_labels = sentiments[:split_idx], sentiments[split_idx:]
text_batch, labels = next(iter(dataloader(train_features, train_labels, sequence_length=20, batch_size=64)))
model = TextClassifier(len(vocab)+1, 200, 128, 5, dropout=0.)
hidden = model.init_hidden(64)
logps, hidden = model.forward(text_batch, hidden)

Train model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = TextClassifier(len(vocab)+1, 1024, 512, 5, lstm_layers=2, dropout=0.2)
model.embedding.weight.data.uniform_(-1, 1)
model.to(device)
TextClassifier(
  (embedding): Embedding(98449, 1024)
  (lstm): LSTM(1024, 512, num_layers=2, dropout=0.2)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=512, out_features=5, bias=True)
  (logsoftmax): LogSoftmax(dim=1)
)
"""
Train your model with dropout. Make sure to clip your gradients.
Print the training loss, validation loss, and validation accuracy for every 100 steps.
"""

epochs = 3
batch_size = 1024
learning_rate = .001
clip = 5


print_every = 100
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
model.train()

train_losses = []
valid_losses = []
valid_accs = []

for epoch in range(epochs):
    print('Starting epoch {}'.format(epoch + 1))
    
    steps = 0
    for text_batch, labels in dataloader(
            train_features, train_labels, batch_size=batch_size, sequence_length=20, shuffle=True):
        steps += 1
        # initialize hidden state
        hidden = model.init_hidden(batch_size=labels.shape[0])
        
        # Move the batch to the training device. Note that Tensor.to() is not
        # in-place, so the hidden state tensors must be reassigned.
        text_batch, labels = text_batch.to(device), labels.to(device)
        hidden = tuple(each.to(device) for each in hidden)
        
        # reset gradients
        model.zero_grad()
        
        # get output from model
        log_probs, hidden = model(text_batch, hidden)
        
        # calculate the loss and perform backprop
        loss = criterion(log_probs, labels)
        loss.backward()
        
        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        
        if steps % print_every == 0:
            model.eval()
            
            # Get validation loss and accuracy (the values reported below are
            # those of the last validation batch)
            with torch.no_grad():
                for text_batch, labels in dataloader(valid_features, valid_labels,
                                                     batch_size=batch_size, sequence_length=20,
                                                     shuffle=True):
                    val_hidden = model.init_hidden(labels.shape[0])
                    
                    text_batch, labels = text_batch.to(device), labels.to(device)
                    val_hidden = tuple(each.to(device) for each in val_hidden)
                    
                    val_log_probs, val_hidden = model(text_batch, val_hidden)
                    valid_loss = criterion(val_log_probs, labels)
                    
                    # Accuracy
                    probs = torch.exp(val_log_probs)
                    top_prob, top_class = probs.topk(1)
                    equality = top_class == labels.view(*top_class.shape)
                    valid_acc = torch.mean(equality.type(torch.FloatTensor))
            
            train_losses.append(loss.item())
            valid_losses.append(valid_loss.item())
            valid_accs.append(valid_acc.item())

            model.train()
            
            print(f'Epoch: {epoch+1} / {epochs} \tStep: {steps}',
                  f'\n  Train Loss: {loss.item():.3f}',
                  f'  Validation Loss: {valid_loss.item():.3f}',
                  f'  Validation Accy: {valid_acc.item():.3f}')
            
Starting epoch 1
Epoch: 1 / 3 	Step: 100 
  Train Loss: 0.913   Validation Loss: 0.956   Validation Accy: 0.634
Epoch: 1 / 3 	Step: 200 
  Train Loss: 0.856   Validation Loss: 0.877   Validation Accy: 0.658
Epoch: 1 / 3 	Step: 300 
  Train Loss: 0.781   Validation Loss: 0.841   Validation Accy: 0.670
Epoch: 1 / 3 	Step: 400 
  Train Loss: 0.783   Validation Loss: 0.724   Validation Accy: 0.716
Epoch: 1 / 3 	Step: 500 
  Train Loss: 0.745   Validation Loss: 0.761   Validation Accy: 0.713
Epoch: 1 / 3 	Step: 600 
  Train Loss: 0.738   Validation Loss: 0.706   Validation Accy: 0.725
Epoch: 1 / 3 	Step: 700 
  Train Loss: 0.722   Validation Loss: 0.688   Validation Accy: 0.746
Epoch: 1 / 3 	Step: 800 
  Train Loss: 0.730   Validation Loss: 0.752   Validation Accy: 0.714
Starting epoch 2
Epoch: 2 / 3 	Step: 100 
  Train Loss: 0.640   Validation Loss: 0.747   Validation Accy: 0.738
Epoch: 2 / 3 	Step: 200 
  Train Loss: 0.652   Validation Loss: 0.780   Validation Accy: 0.706
Epoch: 2 / 3 	Step: 300 
  Train Loss: 0.689   Validation Loss: 0.756   Validation Accy: 0.694
Epoch: 2 / 3 	Step: 400 
  Train Loss: 0.686   Validation Loss: 0.753   Validation Accy: 0.722
Epoch: 2 / 3 	Step: 500 
  Train Loss: 0.679   Validation Loss: 0.722   Validation Accy: 0.722
Epoch: 2 / 3 	Step: 600 
  Train Loss: 0.674   Validation Loss: 0.746   Validation Accy: 0.701
Epoch: 2 / 3 	Step: 700 
  Train Loss: 0.680   Validation Loss: 0.720   Validation Accy: 0.715
Epoch: 2 / 3 	Step: 800 
  Train Loss: 0.683   Validation Loss: 0.683   Validation Accy: 0.729
Starting epoch 3
Epoch: 3 / 3 	Step: 100 
  Train Loss: 0.600   Validation Loss: 0.756   Validation Accy: 0.704
Epoch: 3 / 3 	Step: 200 
  Train Loss: 0.602   Validation Loss: 0.733   Validation Accy: 0.731
Epoch: 3 / 3 	Step: 300 
  Train Loss: 0.615   Validation Loss: 0.744   Validation Accy: 0.722
Epoch: 3 / 3 	Step: 400 
  Train Loss: 0.625   Validation Loss: 0.716   Validation Accy: 0.719
Epoch: 3 / 3 	Step: 500 
  Train Loss: 0.563   Validation Loss: 0.758   Validation Accy: 0.694
Epoch: 3 / 3 	Step: 600 
  Train Loss: 0.663   Validation Loss: 0.760   Validation Accy: 0.716
Epoch: 3 / 3 	Step: 700 
  Train Loss: 0.587   Validation Loss: 0.698   Validation Accy: 0.719
Epoch: 3 / 3 	Step: 800 
  Train Loss: 0.666   Validation Loss: 0.689   Validation Accy: 0.716

So we get about 72% accuracy on the validation set. Let's try the ULMFit approach.

ULMFit approach

from fastai.text.all import *

Import Twits

with open(os.path.join('data', 'stocktwits_sentiment', 'twits.json'), 'r') as f:
    twits = json.load(f)
messages = [twit['message_body'] for twit in twits['data']]
# Since the sentiment scores are discrete, we'll scale the sentiments to 0 to 4 for use in our network
sentiments = [twit['sentiment'] + 2 for twit in twits['data']]

Preprocessing the Data

For the preprocessing step, fastai does a lot of the work for us, using the spaCy tokenizer by default. Its default rules keep track of uppercase letters with special tokens, so we do not lowercase anything ourselves. Our preprocess function is therefore much simpler.

def preprocess(message):
    """
    Lighter cleaning for the fastai pipeline: remove URLs, ticker symbols,
    usernames and non-letter characters, but keep the original casing.
    Tokenization, casing markers and lowercasing are left to fastai's
    default text processing rules.
    """
    text = message
    
    # Replace URLs with a space in the message
    text = re.sub(r'https?://[^\s]+', ' ', text)
    
    # Replace ticker symbols with a space. The ticker symbols are any stock symbol that starts with $.
    text = re.sub(r'\$[a-zA-Z]*\b', ' ', text)
    
    # Replace StockTwits usernames with a space. The usernames are any word that starts with @.
    text = re.sub(r'@\w*\b', ' ', text)

    # Replace everything not a letter with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    
    # Collapse multiple spaces
    text = ' '.join(text.split())
    
    # Unlike preprocess_1, we skip tokenization and lemmatization here:
    # fastai's tokenizer handles that for us.
    return text
messages_clean = [preprocess(m) for m in messages]
messages_clean[:10], sentiments[:10]
(['great buy at ill wait',
  '',
  'STAAnalystAlert for Jefferies Maintains with a rating of Hold setting target price at USD Our own verdict is Buy',
  'I heard there s a guy who knows someone who thinks somebody knows something on StockTwits',
  'reveal yourself',
  'Why the drop I warren Buffet taking out his position',
  'bears have reason on to pay more attention',
  'ok good we re not dropping in price over the weekend lol',
  'Daily Chart we need to get back to above',
  'drop per week after spike if no news in months back to s if BO then bingo what is the odds'],
 [4, 3, 4, 3, 2, 3, 0, 3, 4, 0])
fastai applies these default text processing rules to the raw text before tokenization:

defaults.text_proc_rules
[<function fastai.text.core.fix_html>,
 <function fastai.text.core.replace_rep>,
 <function fastai.text.core.replace_wrep>,
 <function fastai.text.core.spec_add_spaces>,
 <function fastai.text.core.rm_useless_spaces>,
 <function fastai.text.core.replace_all_caps>,
 <function fastai.text.core.replace_maj>,
 <function fastai.text.core.lowercase>]
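
The replace_all_caps, replace_maj and lowercase rules are what preserve the casing information for us. Here is a small illustrative sketch of how they behave (our own example, not from the notebook; the exact token placement is fastai's concern):

from fastai.text.all import replace_all_caps, replace_maj, lowercase

# Illustrative sketch: casing is kept as special tokens rather than thrown away,
# which is why our preprocess() above does not lowercase the text itself.
print(replace_all_caps('BUY the Dip'))   # should mark the all-caps word with an xxup token
print(replace_maj('the Dip'))            # should mark the capitalized word with an xxmaj token
print(lowercase('The Dip'))              # lowercases and, by default, prepends the xxbos marker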

Build dataframe

Again neutral messages make up around 45% of our sample data. So guessing neutral alone would give us an accuracy of 45%. We can randomly drop neutral labelled rows so that neutral data makes up around 1/5 of the total data.

balanced = {'messages': [], 'sentiments':[]}

n_neutral = sum(1 for each in sentiments if each == 2)
N_examples = len(sentiments)
keep_prob = (N_examples - n_neutral)/4/n_neutral
print(f'keep_prob: {keep_prob}')

for idx, sentiment in enumerate(sentiments):
    message = messages_clean[idx]
    if sentiment != 2 or random.random() < keep_prob:
        balanced['messages'].append(message)
        balanced['sentiments'].append(sentiment) 
keep_prob: 0.3016022730997995

Check that neutral data is balanced

n_neutral = sum(1 for each in balanced['sentiments'] if each == 2)
N_examples = len(balanced['sentiments'])
n_neutral/N_examples
0.19986973504599923
data = {'messages': balanced['messages'], 'sentiments': balanced['sentiments']}
df = pd.DataFrame(data); df.head()
messages sentiments
0 great buy at ill wait 4
1 3
2 STAAnalystAlert for Jefferies Maintains with a rating of Hold setting target price at USD Our own verdict is Buy 4
3 I heard there s a guy who knows someone who thinks somebody knows something on StockTwits 3
4 reveal yourself 2

Remove blank messages

df = df[df['messages']!='']
df.head()
messages sentiments
0 great buy at ill wait 4
2 STAAnalystAlert for Jefferies Maintains with a rating of Hold setting target price at USD Our own verdict is Buy 4
3 I heard there s a guy who knows someone who thinks somebody knows something on StockTwits 3
4 reveal yourself 2
5 Why the drop I warren Buffet taking out his position 3
df.shape
(1500631, 2)

Self-supervised language model

The first step of the ULMFit approach is to fine-tune a language model: a model that predicts the next word given the previous words. This is self-supervised learning, so in a batch we have a buffer of text as input and the same text shifted by one word as the target.
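
As a toy illustration (not from the notebook) of what a language-model training pair looks like:

# The target is simply the input shifted by one token.
tokens    = ['the', 'stock', 'is', 'going', 'up']
lm_input  = tokens[:-1]   # ['the', 'stock', 'is', 'going']
lm_target = tokens[1:]    # ['stock', 'is', 'going', 'up']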

Text data loader

dls = TextDataLoaders.from_df(df, text_col='messages', label_col='sentiments', is_lm=True)

Look at the data

dls.show_batch(max_n=3)
text text_
0 xxbos funny how these bears comes out today where were you yesterday when i was all alone getting flamed by these young bulls xxbos holey moley batman xxbos holding my waiting for kiki to tell me she loves me i mean jcpenney xxbos xxmaj china s xxmaj zhoushan city woos xxmaj exxon xxmaj mobil for a billion ethylene plant xxbos is max pain xxbos they are overselling again bring that xxup rsi funny how these bears comes out today where were you yesterday when i was all alone getting flamed by these young bulls xxbos holey moley batman xxbos holding my waiting for kiki to tell me she loves me i mean jcpenney xxbos xxmaj china s xxmaj zhoushan city woos xxmaj exxon xxmaj mobil for a billion ethylene plant xxbos is max pain xxbos they are overselling again bring that xxup rsi down
1 xxmaj volatility is expensive to forecast xxbos this is looking good xxbos wth xxbos about to break the daily high xxbos xxmaj all top ranks are xxup nvidia no others brands xxmaj see bounds back to xxbos people can run their fingers on the keyboard xxmaj xxunk and watch the chart you could learn a thing or two xxmaj now let s see premarket xxbos xxmaj update xxmaj aug xxmaj puts xxmaj volatility is expensive to forecast xxbos this is looking good xxbos wth xxbos about to break the daily high xxbos xxmaj all top ranks are xxup nvidia no others brands xxmaj see bounds back to xxbos people can run their fingers on the keyboard xxmaj xxunk and watch the chart you could learn a thing or two xxmaj now let s see premarket xxbos xxmaj update xxmaj aug xxmaj puts xxmaj up
2 unitedhealth xxmaj group s xxup pt raised by xxmaj raymond xxmaj james to strong buy rating xxbos xxmaj great summary of xxmaj boyar xxmaj value xxmaj group s xxmaj letter stay clear of xxup faang xxbos xxmaj absolutely ridiculous xxbos pump fake xxbos waiting to head to xxbos xxmaj it amuses me when bears here think they control the price with xxup st messages xxbos if i m wrong i m wrong xxmaj group s xxup pt raised by xxmaj raymond xxmaj james to strong buy rating xxbos xxmaj great summary of xxmaj boyar xxmaj value xxmaj group s xxmaj letter stay clear of xxup faang xxbos xxmaj absolutely ridiculous xxbos pump fake xxbos waiting to head to xxbos xxmaj it amuses me when bears here think they control the price with xxup st messages xxbos if i m wrong i m wrong i

Create language learner

With this data we can now fine-tune the language model. We will be using a recurrent neural network (RNN) with an architecture called AWD-LSTM.

learn = language_model_learner(
    dls, AWD_LSTM, drop_mult=0.3, 
    metrics=[accuracy, Perplexity()]).to_fp16()

Fine-Tuning the Language Model

Because we start from a pretrained model, the learner is created in a frozen state, so this first call to fit_one_cycle only trains the last layer group, in particular the embeddings for words that are in our StockTwits vocab but aren't in the pretrained model's vocab.

learn.fit_one_cycle(1, 2e-2)
epoch train_loss valid_loss accuracy perplexity time
0 4.079166 3.907568 0.344360 49.777721 18:13

Continue fine-tuning the model after unfreezing

learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)
epoch train_loss valid_loss accuracy perplexity time
0 3.667701 3.579307 0.391628 35.848701 22:17
1 3.556901 3.493224 0.401178 32.891827 22:18
2 3.501796 3.440667 0.407269 31.207756 22:08
3 3.429217 3.405697 0.411324 30.135296 21:58
4 3.346377 3.383271 0.414011 29.466999 22:09
5 3.297989 3.370040 0.416521 29.079689 22:22
6 3.222261 3.362387 0.418393 28.858006 22:10
7 3.179726 3.362380 0.419272 28.857779 22:16
8 3.155103 3.368545 0.419497 29.036251 22:22
9 3.091218 3.374530 0.419274 29.210562 22:23

Display the RNN layers

learn.summary()
SequentialRNN (Input shape: ['64 x 72'])
================================================================
Layer (type)         Output Shape         Param #    Trainable 
================================================================
RNNDropout           64 x 72 x 400        0          False     
________________________________________________________________
RNNDropout           64 x 72 x 1152       0          False     
________________________________________________________________
RNNDropout           64 x 72 x 1152       0          False     
________________________________________________________________
Linear               64 x 72 x 39784      15,953,384 True      
________________________________________________________________
RNNDropout           64 x 72 x 400        0          False     
________________________________________________________________

Total params: 15,953,384
Total trainable params: 15,953,384
Total non-trainable params: 0

Optimizer used: <function Adam at 0x7fc7a15836a8>
Loss function: FlattenedLoss of CrossEntropyLoss()

Model frozen up to parameter group #3

Callbacks:
  - ModelResetter
  - RNNRegularizer
  - ModelToHalf
  - TrainEvalCallback
  - Recorder
  - ProgressCallback
  - MixedPrecision

Test our basic language model

We now have a language model that was originally trained on Wikipedia data and has been fine-tuned on StockTwits data. The end goal is to predict the sentiment of messages, but let's first look at what this language model has learned.

TEXT = "Earnings were above expectations. This stock should be trending up."
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]
print("\n".join(preds))
Earnings were above expectations xxunk This stock should be trending up xxunk Update Aug Calls Up to per contract since alerted on Jul days to expire Some of todays top open interest changes Update Sep Puts Up sincealerted on
Earnings were above expectations xxunk This stock should be trending up xxunk Lot of Buying New Insider Filing On THICC JOHN days Transaction Code Both the short term and long term trends are positive This is a very

Classifier for Stocktwits sentiments

We now train our model on the sentiment labels, using our language model as a starting point.

Create the data loader with sentiment labels

dls_clas = TextDataLoaders.from_df(df, text_col='messages', label_col='sentiments', is_lm=False)
dls_clas.show_batch(max_n=3)
text category
0 xxbos i xxup want xxup him xxup to xxup feel xxup max xxup pain xxup the xxup rest xxup ill xxup play xxup by xxup play xxup it xxup same xxup as xxup amd xxup at xxup exp xxup why xxup it xxup had xxup to xxup hit xxup then xxup it xxup did xxup lulu xxup wk b xxup ibm b 0
1 xxbos xxup the xxup xxunk xxup fucks xxup are xxup waiting xxup for xxup mu xxup to xxup go xxup below xxup to xxup buy xxup it xxup back xxup lets xxup just xxup all xxup max xxup out xxup our xxup credit xxup cards xxup and xxup buy xxup more xxup mu xxup at xxup this xxup low xxpad xxpad xxpad 4
2 xxbos lookslike xxmaj about xxup mm s xxmaj trying xxmaj to xxmaj hold xxup bid xxmaj in xxmaj low s xxmaj but xxup mm xxmaj forcing xxmaj trading xxmaj in xxmaj low s xxup jpm xxmaj tusa lol xxmaj glta xxmaj bulls xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad xxpad 4

Create the learner to classify our "Twits"

learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, 
                                metrics=accuracy).to_fp16()
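
Note that the full ULMFit recipe also transfers the encoder fine-tuned on the language-modelling task into the classifier before fine-tuning it. Below is a minimal sketch of that step using fastai's save_encoder/load_encoder; the name learn_lm and the file name 'finetuned_enc' are our own placeholders, since the notebook above reuses the name learn for both learners.

# Sketch only, not executed above: transfer the fine-tuned language-model encoder.
# Assumes `learn_lm` is the fine-tuned language-model learner and `dls_clas`
# the classification DataLoaders defined earlier.
learn_lm.save_encoder('finetuned_enc')               # save the AWD-LSTM encoder weights

learn_clf = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                    metrics=accuracy).to_fp16()
learn_clf.load_encoder('finetuned_enc')              # start the classifier from that encoder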
 

Fine tuning the classifier

We now train our model with discriminative learning rates and gradual unfreezing. For NLP classifiers, the fastai library authors found that unfreezing a few layers at a time makes a real difference.

learn.fit_one_cycle(1, 2e-2)
epoch train_loss valid_loss accuracy time
0 1.555983 1.285381 0.455808 10:37
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
epoch train_loss valid_loss accuracy time
0 1.098982 0.991681 0.593996 10:53
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))
epoch train_loss valid_loss accuracy time
0 0.906630 0.828425 0.670413 10:43
learn.unfreeze()
learn.fit_one_cycle(10, slice(1e-3/(2.6**4),1e-3))
epoch train_loss valid_loss accuracy time
0 0.862675 0.810372 0.678238 11:56
1 0.831596 0.728818 0.714281 11:56
2 0.751489 0.685342 0.732929 11:56
3 0.694547 0.662922 0.742413 12:00
4 0.706285 0.652665 0.747716 12:01
5 0.665387 0.639441 0.750803 12:02
6 0.658407 0.631364 0.753900 11:57
7 0.673705 0.632880 0.753232 11:50
8 0.626122 0.632878 0.754316 11:56
9 0.634495 0.630918 0.754664 12:02

Model results comparison

We reach around 75% accuracy, about 3 percentage points better than our plain PyTorch model. 75% accuracy on a five-level sentiment classifier is quite impressive. The fastai implementation is also based on PyTorch, so it makes sense that the results are relatively close, but the plain PyTorch implementation requires noticeably more work.

Predictions with the ULMFit model

  • 0: very negative
  • 1: negative
  • 2: neutral
  • 3: positive
  • 4: very positive

Very positive

learn.predict(preprocess('$AAPL doing my part. Just bought my first iPad'))
('4',
 tensor(4),
 tensor([    0.0003,     0.0176,     0.0395,     0.0209,     0.9217]))

Very negative

learn.predict(preprocess('$AAPL historic reversal selloff on news , historic crash news'))
('0',
 tensor(0),
 tensor([    0.8399,     0.0051,     0.0703,     0.0847,     0.0000]))

Very Positive

learn.predict(preprocess('$TSLA if you haven’t seen it yet, Tsla had a double bottom and has upward momentum. $410 is strong support and this might close the week above $440. Only positive news from now till Election Day'))
('4',
 tensor(4),
 tensor([    0.0001,     0.0003,     0.0320,     0.0357,     0.9319]))

Negative

learn.predict(preprocess('Sold my $AAPL 10/16 $120c this morning and MAN I\'m glad I\'m out of there!\n\nAlthough, this was one of my favorite trades,  holding through the volatility last week was ROUGH and I didn\'t want to be caught holding the bag (after the iPhone event)\n\nWhat\'s next? since $SPY  making a new high... Might looks for some shorts now 🐻🐻🐻'))
('1',
 tensor(1),
 tensor([    0.0524,     0.7317,     0.0754,     0.1390,     0.0015]))

Neutral

learn.predict(preprocess('$AAPL the news already leaked. Nothing new just a battery that last an extra hour and the 5G. They are going to launch 4 phones. Probably this time they will introduce the middle finger option to unlock the phone. To make the idiots happy that they brought something new. Lol'))
('2',
 tensor(2),
 tensor([    0.0001,     0.0303,     0.7729,     0.1964,     0.0002]))