S3 partners short interest dataset initial analysis
First impressions and Exploratory Data Analysis of S3 partners short interest dataset
- S3 partners short interest
- S3 Data Exploratory Data Analysis
- xgboost model
- Confusion Matrix
- Feature importance
- Conclusion
S3 partners short interest
Short interest is usually estimated from shares on loan, but that estimate is incomplete: some brokers don't need to borrow shares because they already hold them in inventory.
In their promotional presentation, S3 states (e.g., for the Russell 3000):
- stock-borrow data is within 10% of the actual reported short interest about 45% of the time
- S3's data is within 10% of the actual reported short interest about 85% of the time
Common Short Float = shares shorted / shares available to trade
Despite its widespread use, this calculation is flawed in two main ways:
- US investors are only required to report short shares twice per month, so the short interest number is roughly ten days stale by the time it reaches investors.
- Float does not accurately represent the shares available to trade on a daily basis.
To combat these flaws, S3 provides a true daily shares-shorted measure and calculates more accurate “tradeable shares” than the general definition of float provides.
S3 points out that “what is missing [from the general definition for float] are the ‘synthetic longs’ that are created as a result of a short sale which, in some stocks, can be a very significant number.” The synthetic long is a result of a long shareholder lending out their shares, a short seller borrowing those shares, and a long buyer on the other side of the short sale now owning the shares. In this case, the long buyer on the other side of the short sale has increased the market’s potential tradable quantity of shares. The interesting feature in the S3 data is the Squeeze Risk, which we will look at in depth.
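To make the synthetic-long effect concrete, here is a toy calculation. The numbers are made up, and the assumption that S3 Float ≈ reported float + current short interest is our reading of S3's description, not their published methodology.
# Toy illustration of how synthetic longs enlarge the tradable float.
# Assumption (ours, not S3's formula): s3_float ≈ reported_float + shares_short,
# since each shorted share creates a new long position on the buyer's side.
reported_float = 50_000_000   # hypothetical reported float (shares)
shares_short = 20_000_000     # hypothetical shares currently shorted
common_short_float = shares_short / reported_float   # the usual Short Float metric
s3_float = reported_float + shares_short             # float including synthetic longs
s3_si_pct_float = shares_short / s3_float            # S3-style short % of float
print(f"Common Short Float: {common_short_float:.1%}")   # 40.0%
print(f"S3 SI % of Float:   {s3_si_pct_float:.1%}")      # 28.6%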
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import liberator
from datetime import datetime, timedelta
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 40)
# S3 data for 2021 as of December 23rd, US stocks only
df_s3_21_us = pd.read_csv('df_s3_23dec21_us.csv', dtype={"Cusip":"string"});
print(f"S3 short float mean: {df_s3_21_us['S3SIPctFloat'].mean():.3f}")
print(f"S3 short max: {df_s3_21_us['S3SIPctFloat'].max()}")
print(f"S3 short median: {df_s3_21_us['S3SIPctFloat'].median()}")
print(f"S3 short skew: {df_s3_21_us['S3SIPctFloat'].skew():.3f}")
df_s3_21_us.iloc[:,7:].head()
- Crowding: S3’s proprietary index score measuring the magnitude of shorting/covering activity relative to the security’s float, borrow capacity and financing rate.
- Short Interest: Real-time short interest expressed in shares.
- ShortInterestNotional: ShortInterest * Price (USD)
- ShortInterestPct: Real-time short interest as a percentage of equity float.
- S3Float: The number of tradable shares including synthetic longs created by short selling.
- S3SIPctFloat: Real-time short interest projection divided by the S3 float.
- IndicativeAvailability: S3's projected available lendable quantity.
- Utilization: S3's Utilization is defined as S3 Short Interest divided by total lendable supply.
- DaysToCover10Day: Liquidity measure = Short Interest / 10-day average daily trading volume (ADTV); see the sketch after this list.
- DaysToCover30Day: Liquidity measure = Short Interest / 30-day ADTV.
- DaysToCover90Day: Liquidity measure = Short Interest / 90-day ADTV.
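As a quick illustration of the DaysToCover definitions above (a sketch with made-up numbers, not S3's code):
# Days to cover ≈ how many typical trading days shorts would need to buy back their position.
short_interest = 20_000_000   # shares currently short (hypothetical)
adtv_10d = 4_000_000          # 10-day average daily trading volume (hypothetical)
days_to_cover_10d = short_interest / adtv_10d
print(f"DaysToCover10Day: {days_to_cover_10d:.1f}")   # 5.0 days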
Finviz scan for a tradable stock universe with average volume > 500k, price > $10, and ATR > 0.5
# average volume > 500k, price > 10$, ATR > 0.5
df_finviz = pd.read_csv('finviz_dec_23.csv')
df_finviz.head()
# 1294 stocks
df_finviz.shape
Let's extract the S3 data for December 23rd to compare with the Finviz data for that day.
# extract s3 data for dec 23, the day of our finviz screen
df_s3_dec_23 = df_s3_21_us[df_s3_21_us['timestamp'].str.startswith('2021-12-23')]
# filter s3 data for our target stocks from finviz
df_s3_dec23_filtered = df_s3_dec_23[df_s3_dec_23.symbol.isin(df_finviz.Ticker)]
df_s3_dec23_filtered.head()
Top 10 stocks by S3 Squeeze Risk for December 23
# top 10 s3 for squeeze risk on december 23
s3_top10 = df_s3_dec23_filtered.nlargest(10, 'Squeeze Risk')[['symbol','ShortInterestPct','S3SIPctFloat','Crowded Score','Squeeze Risk']]
s3_top10
# add new column 'fshortn' with Float short as a float
df_finviz['fshortn'] = df_finviz['Float Short'].str.replace('%','').astype(float)
# top 20 finviz Float Short
finviz_top20 = df_finviz.nlargest(20,'fshortn')[['Ticker','Shares Float','Float Short']]
finviz_top20
# function to get common elements in 2 lists
def Intersection(lst1, lst2):
    return set(lst1).intersection(lst2)
How many of our top 10 Squeeze Risk stocks are caught by the Finviz top 20 short float list for December 23?
# see what symbols from top 10 float short in S3 are in finviz top 20
Intersection(s3_top10.symbol, finviz_top20.Ticker)
The top 20 float short stocks from Finviz catch 7/10 of the top Squeeze Risk S3 stocks.
# top 10 s3 for Short Interest Pct on december 23
s3_SIP_top10 = df_s3_dec23_filtered.nlargest(10, 'ShortInterestPct')[['symbol','ShortInterestPct','S3SIPctFloat','Crowded Score','Squeeze Risk']]
s3_SIP_top10
Looking at the short interest values, we can see that some stocks with similar short float values have very different Squeeze Risk scores, so the proprietary Squeeze Risk metric carries information beyond short float.
How many of the top 10 S3 short float stocks do we catch with the Finviz top 20?
# see what symbols from top 10 float short in S3 are in finviz top 20
Intersection(s3_SIP_top10.symbol, finviz_top20.Ticker)
The top 20 float short stocks from Finviz for December 23 catch 8/10 of the top Short Interest Pct S3 stocks.
# let's merge the Finviz data with our S3 data
merge_df = pd.merge(df_finviz, df_s3_dec23_filtered[['symbol','ShortInterestPct','S3SIPctFloat','Squeeze Risk']],left_on='Ticker',right_on='symbol')
merge_df.head()
Let's see how correlated the S3 short float data is with the Finviz equivalent.
# Scatter plot of Finviz Float Short versus S3 SI Pct Float
sns.set(rc = {'figure.figsize':(10,8)})
sns.scatterplot(x=merge_df['fshortn'], y=merge_df['S3SIPctFloat']*100);
plt.xlabel("Finviz Short Float");
The S3 SI percent float values look to be highly correlated with the Finviz Short Float values and would not add much to a model.
print(f"correlation between S3 and Finviz short float: {merge_df['S3SIPctFloat'].corr(merge_df['fshortn']):.3f}")
Anything above 0.8 is generally considered highly correlated. A correlation this high suggests that, based solely on the S3 percent float data, the gain in accuracy S3 advertises is not significant compared to the Finviz Elite data. Granted, this check was only repeated for about 20 trading days, but the results were consistent.
The S3 metric we are really interested in is their proprietary Squeeze Risk.
sns.scatterplot(x=merge_df['S3SIPctFloat']*100, y=merge_df['Squeeze Risk']);
print(f"correlation between S3 short float and S3 Squeeze Risk:\n{merge_df['S3SIPctFloat'].corr(merge_df['Squeeze Risk']):.3f}")
S3's proprietary Squeeze Risk seems to contain more information than their percent float value alone; this looks more promising.
Let's look at the correlations between the S3 data features.
target = ['Short Momentum',
'Short Interest',
'ShortInterestNotional',
'ShortInterestPct',
'S3Float',
'S3SIPctFloat',
'IndicativeAvailability',
'DaysToCover10Day',
'DaysToCover30Day',
'DaysToCover90Day',
'Crowded Score',
'Squeeze Risk',
'symbol']
plt.figure(figsize=(12,7));
sns.heatmap(df_s3_21_us[target].corr(),annot=True,cmap='Blues');
This heatmap indicates that none of the S3 features are highly correlated with one another, so in a model every feature could be included and contribute to performance.
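As a sanity check on that reading of the heatmap, we can scan the correlation matrix programmatically for any pair above an (arbitrary) 0.8 threshold:
# flag any feature pair with |correlation| above 0.8 (threshold chosen arbitrarily)
corr = df_s3_21_us[target].drop(columns='symbol').corr().abs()
# keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > 0.8]
print(high_pairs if not high_pairs.empty else "No feature pair exceeds 0.8")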
# let's filter our S3 data down to our Finviz tradable stock universe
df_s3_21_us_filtered = df_s3_21_us.copy()
df_s3_21_us_filtered = df_s3_21_us_filtered[df_s3_21_us_filtered.symbol.isin(list(df_finviz.Ticker))]
df_s3_21_us_filtered.shape
# add timestamp date column
df_s3_21_us_filtered['Date'] = pd.to_datetime(df_s3_21_us_filtered.timestamp)
df_s3_21_us_filtered.iloc[:,7:].head()
# set index to date
df_s3_21_us_filtered.index = df_s3_21_us_filtered['Date']
del df_s3_21_us_filtered['timestamp']
del df_s3_21_us_filtered['Date']
df_s3_21_us_filtered.head()
Let's build a dataframe with the daily top 10 S3 Squeeze Risk stocks and their quote data, including the previous day's close and vwap. At a given trading day's open, one would have all of the S3 data, the open price, and the previous day's close and vwap.
# new dataframe with S3 and quote data
df_top10 = pd.DataFrame(columns=['date','symbol','Open','High','Low','Close','Volume','vwap','prev_close','prev_vwap','Offer Rate','Bid Rate','Last Rate','Short Momentum','Short Interest','ShortInterestNotional','ShortInterestPct','S3Float','S3SIPctFloat','IndicativeAvailability','DaysToCover10Day','DaysToCover30Day','DaysToCover90Day','Crowded Score','Squeeze Risk','squeeze_return','profitable'])
# build date array for 2021
# dates = list(set(list(df_s3_21_us_filtered.index)))
dates = np.unique(np.array(df_s3_21_us_filtered.index))
dates[:5]
# for some reason S3 has data on Thanksgiving, November 25th, which is a non-trading day
index_to_remove = np.where(np.array([str(d) for d in dates]) == '2021-11-25T00:00:00.000000000')
# remove Thanksgiving
dates = np.delete(dates, index_to_remove)
# get date strings for liberator api
date_strings = [str(d)[:10] for d in dates]
date_strings = np.array(date_strings)
Loop to populate our dataframe. As a rough proof of concept, we add a profitable feature indicating whether a stock reached 10% or more above its open price that day.
So we are assuming a squeeze occurs if the day's high reaches 10% above the stock's open. This threshold should probably be higher, but this is just a first exploration of the data.
index = 0
for d in dates[1:]:
    # daily top 10 by S3 Squeeze Risk, keeping the S3 features we need
    top_10 = df_s3_21_us_filtered.loc[d].nlargest(10, 'Squeeze Risk')[['symbol','Offer Rate','Bid Rate','Last Rate','Short Momentum','Short Interest','ShortInterestNotional','ShortInterestPct','S3Float','S3SIPctFloat','IndicativeAvailability','DaysToCover10Day','DaysToCover30Day','DaysToCover90Day','Crowded Score','Squeeze Risk']]
    top_10_symbols = list(top_10.symbol)
    date = str(d)[:10]
    # find the index for the current date
    date_index = np.where(date_strings == date)[0][0]
    # step back to the previous trading day
    date_index -= 1
    end = str((pd.Timestamp(d) + timedelta(days=1)).date())
    # get quotes for the top 10 Squeeze Risk symbols for the current and previous trading day
    quotes = liberator.get_dataframe(liberator.query(symbols=top_10_symbols, as_of=end, back_to=date_strings[date_index], name='daily_bars'))
    # iterate through each top 10 stock for the current day
    for dd, row in top_10.iterrows():
        # quotes for this symbol
        quote = quotes[quotes.symbol == row['symbol']]
        # no quotes for this symbol on this date, skip it
        if quote.empty:
            print(f'{row["symbol"]} has no quotes for {date}')
            continue
        # current day quote
        q = quote[quote.timestamp.str.startswith(date)]
        # previous trading day quote
        qprev = quote[~quote.timestamp.str.startswith(date)]
        high = q['high'].values[0]
        open_price = q['open'].values[0]
        # potential_return is the max possible return from the open
        potential_return = round((high - open_price) / open_price, 3)
        row['date'] = date
        row['Open'] = open_price
        row['High'] = high
        row['Low'] = q['low'].values[0]
        row['Close'] = q['close'].values[0]
        row['Volume'] = q['volume'].values[0]
        row['vwap'] = q['vwap'].values[0]
        # previous trading day values
        row['prev_vwap'] = qprev['vwap'].values[0]
        row['prev_close'] = qprev['close'].values[0]
        # max potential return from the open
        row['squeeze_return'] = potential_return
        # label as profitable if the high reached 10% or more above the open
        row['profitable'] = 1 if potential_return >= 0.1 else 0
        df_top10.loc[index] = row
        index += 1
# df_top10.to_csv('df_top10_full.csv', index=False)
df_top10 = pd.read_csv('df_top10_full.csv')
# Extract our data with a high - open greater than 10%
profitable = df_top10.copy()[df_top10.profitable == 1]
# top symbol count of squeeze > 10% for the year
pd.DataFrame(profitable.groupby('symbol').count()['squeeze_return'].sort_values(ascending=False)[:20])
This looks good: the top symbols under our rough, arbitrary 10% squeeze threshold are AMC, the infamous GME, and BBBY.
# squeeze count by date
pd.DataFrame(profitable.groupby('date').count()['squeeze_return'].sort_values(ascending=False)[:20])
With the 10% criterion we got a high of 6 hits on January 27th.
# Average of all max squeeze returns > 10% is 18%
print(f"{profitable['squeeze_return'].mean():.3f}")
# average of 1.5 top 10 stocks a day with max return > 10%
print(f"{profitable.groupby('date').count()['squeeze_return'].mean():.2f}")
plt.figure(figsize=(10,8))
plt.hist(profitable['Squeeze Risk']);
Most gains > 10% occur at a 100% Squeeze Risk score in our top 10 Squeeze Risk dataset.
pd.DataFrame(profitable[['Squeeze Risk']].value_counts(), columns=['count'])
pd.DataFrame(profitable.groupby('Squeeze Risk')['squeeze_return'].mean())
There seems to be an outlier at 75% Squeeze Risk.
plt.figure(figsize=(10,8))
plt.plot(profitable.groupby('Squeeze Risk')['squeeze_return'].mean());
# check outlier at 75%
profitable[profitable['Squeeze Risk'] == 75]
ISIG had a high of $24.85 on December 8th against an open of $12.80.
profitable.shape
# 248 trading days
len(set(df_s3_21_us_filtered.index))
# range of Squeeze Risks for potential trade returns > 10%
profitable['Squeeze Risk'].min(), profitable['Squeeze Risk'].max()
# number of data points with Squeeze Risk >= 75%
df_s3_21_us_filtered[df_s3_21_us_filtered['Squeeze Risk'] >= 75].shape[0]
# fraction of the daily top 10 S3 Squeeze Risk picks that reach at least 10% above the open
df_top10['profitable'].value_counts()[1]/df_top10.shape[0]
7.3% of the entries in our top 10 dataset reach 10% above the open, so our dataset is unbalanced.
This data could be useful as part of a model, but it is unlikely to yield a good model on its own.
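A quick check of the class ratio also motivates the scale_pos_weight value passed to XGBoost below (the usual negatives-to-positives heuristic; applying it here is our assumption):
# class imbalance: ratio of negative to positive labels
# this is the usual heuristic behind XGBoost's scale_pos_weight (~13 here)
counts = df_top10['profitable'].astype(int).value_counts()
print(counts)
print(f"scale_pos_weight ~ {counts[0] / counts[1]:.1f}")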
# Let's do a quick XGBoost model with our data
# convert our target 'profitable' to int
df_top10['profitable'] = df_top10['profitable'].astype(int)
df_top10.info()
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
X = df_top10.copy()[['Open','prev_close','prev_vwap','Offer Rate',
'Bid Rate',
'Last Rate',
'Short Momentum',
'Short Interest',
'ShortInterestNotional',
'ShortInterestPct',
'S3Float',
'S3SIPctFloat',
'IndicativeAvailability',
'DaysToCover10Day',
'DaysToCover30Day',
'DaysToCover90Day',
'Crowded Score',
'Squeeze Risk']]
X.head()
# let's normalize price values by the Open and put percent values on the same scale
X['prev_close'] = X['prev_close']/X['Open']
X['prev_vwap'] = X['prev_vwap']/X['Open']
X['ShortInterestPct'] = X['ShortInterestPct']*100
X['S3SIPctFloat'] = X['S3SIPctFloat'] * 100
del X['Open']
X.head()
X['Squeeze Risk'].value_counts()
The target is our >10% open-to-high label.
y = df_top10['profitable']
# Split off a 20% test set, stratifying to maintain the positive-class ratio.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=2, stratify=y)
X_train.shape, X_test.shape
xgb = XGBClassifier(booster='gbtree',
objective='binary:logistic', max_depth=6,
learning_rate=0.1, n_estimators=100,
random_state=2, n_jobs=-1, scale_pos_weight=13, use_label_encoder=False, eval_metric='logloss')
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
score = accuracy_score(y_test, y_pred)
print('Score: ' + str(score))
An accuracy of 89% is fairly meaningless on such an unbalanced dataset.
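For context, a model that always predicts the majority class would score about the same, so the per-class precision and recall are what matter; a quick check using sklearn's classification_report:
from sklearn.metrics import classification_report
# baseline: always predict the majority class (0 = not profitable)
baseline_acc = (y_test == 0).mean()
print(f"Majority-class baseline accuracy: {baseline_acc:.3f}")
# per-class precision/recall/F1 for the actual model
print(classification_report(y_test, y_pred, digits=3))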
# Confusion matrix code
def make_confusion_matrix(cf,
                          group_names=None,
                          categories='auto',
                          count=True,
                          percent=True,
                          cbar=True,
                          xyticks=True,
                          xyplotlabels=True,
                          sum_stats=True,
                          figsize=None,
                          cmap='Blues',
                          title=None):
    '''
    This function will make a pretty plot of an sklearn Confusion Matrix cm using a Seaborn heatmap visualization.
    '''
    # CODE TO GENERATE TEXT INSIDE EACH SQUARE
    blanks = ['' for i in range(cf.size)]
    if group_names and len(group_names) == cf.size:
        group_labels = ["{}\n".format(value) for value in group_names]
    else:
        group_labels = blanks
    if count:
        group_counts = ["{0:0.0f}\n".format(value) for value in cf.flatten()]
    else:
        group_counts = blanks
    if percent:
        group_percentages = ["{0:.2%}".format(value) for value in cf.flatten()/np.sum(cf)]
    else:
        group_percentages = blanks
    box_labels = [f"{v1}{v2}{v3}".strip() for v1, v2, v3 in zip(group_labels, group_counts, group_percentages)]
    box_labels = np.asarray(box_labels).reshape(cf.shape[0], cf.shape[1])
    # CODE TO GENERATE SUMMARY STATISTICS & TEXT FOR SUMMARY STATS
    if sum_stats:
        # Accuracy is sum of diagonal divided by total observations
        accuracy = np.trace(cf) / float(np.sum(cf))
        # if it is a binary confusion matrix, show some more stats
        if len(cf) == 2:
            # Metrics for binary confusion matrices
            precision = cf[1, 1] / sum(cf[:, 1])
            recall = cf[1, 1] / sum(cf[1, :])
            f1_score = 2 * precision * recall / (precision + recall)
            stats_text = "\n\nAccuracy={:0.3f}\nPrecision={:0.3f}\nRecall={:0.3f}\nF1 Score={:0.3f}".format(
                accuracy, precision, recall, f1_score)
        else:
            stats_text = "\n\nAccuracy={:0.3f}".format(accuracy)
    else:
        stats_text = ""
    # SET FIGURE PARAMETERS ACCORDING TO OTHER ARGUMENTS
    if figsize is None:
        # get the default figure size if not set
        figsize = plt.rcParams.get('figure.figsize')
    if not xyticks:
        # do not show categories if xyticks is False
        categories = False
    # MAKE THE HEATMAP VISUALIZATION
    plt.figure(figsize=figsize)
    sns.heatmap(cf, annot=box_labels, fmt="", cmap=cmap, cbar=cbar, xticklabels=categories, yticklabels=categories)
    if xyplotlabels:
        plt.ylabel('True label')
        plt.xlabel('Predicted label' + stats_text)
    else:
        plt.xlabel(stats_text)
    if title:
        plt.title(title)
cm1=confusion_matrix(y_test, y_pred)
labels = ['True Negative','False Positive','False Negative','True Positive']
categories = [ 'not profitable > 10%','profitable > 10%']
make_confusion_matrix(cm1,
group_names=labels,
categories=categories,
figsize=(10,8),
cmap='Blues')
As expected, the metrics are poor given this limited S3 data and the class imbalance.
What we are really interested in is the feature importance that our model came up with.
xgb.feature_importances_
sorted_idx = xgb.feature_importances_.argsort()
plt.figure(figsize=(10,8))
plt.barh(np.array(list(X))[sorted_idx], xgb.feature_importances_[sorted_idx])
plt.xlabel("Xgboost Feature Importance");
In this preliminary model, DaysToCover10Day is by far the most important feature, with more than double the importance of Squeeze Risk, which we would have expected at the top.
This would need to be explored further.
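One way to explore it would be permutation importance, which measures the drop in a chosen score when each feature is shuffled and can give a less split-biased view than the built-in importances (a sketch, not part of the original analysis; the F1 scoring choice is ours):
from sklearn.inspection import permutation_importance
# permutation importance on the held-out test set (parameters chosen arbitrarily)
perm = permutation_importance(xgb, X_test, y_test, n_repeats=10,
                              random_state=2, scoring='f1')
perm_sorted = perm.importances_mean.argsort()
plt.figure(figsize=(10,8))
plt.barh(np.array(list(X))[perm_sorted], perm.importances_mean[perm_sorted])
plt.xlabel("Permutation Importance (mean decrease in F1)");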
S3SIPctFloat has zero importance; it must be highly correlated with another feature in our top 10 Squeeze Risk dataset.
X[['S3SIPctFloat','ShortInterestPct']].corr()
Confirmed: our tree model split on the ShortInterestPct feature, which is 99% correlated with S3SIPctFloat.
The S3 data is a valuable dataset: the short data could be used in alpha factor research and contribute to models, and the more accurate float data is valuable on its own, since float affects the way a stock trades.
Finviz Elite short interest data seems to be highly correlated with the S3 short data, so S3's stated accuracy gain is not obvious. However, the S3 data includes other proprietary features like Squeeze Risk that provide additional information value.
This is just an initial look at the S3 data, more analysis is warranted.