How to Create and Backtest a Trading Strategy Based on Twitter Sentiment

Here at dxFeed, a market data vendor and a subsidiary of Devexperts, we have a number of sandbox projects. For our latest project, our team created a dxCurrent Python library for convenient and fast integration with dxFeed data. To test it, we created a few common tasks in the fields of quantitative finance, data science, and business analysis.

We chose to start with an approach that has attracted a lot of attention in modern financial data analysis: stock movement prediction based on Twitter sentiment models (Bollen et al., 2011; Nguyen et al., 2015).

Basically, this task can be split into two consecutive parts:

  • Twitter sentiment scoring and strategy composition
  • Applying and backtesting our strategy

Since the second part is fully implemented in our dxCurrent package, we needed only to create sentiment scores in order to demonstrate our solution.

Data selection

First of all, we needed to choose Twitter text sources for sentiment analysis and stocks for prediction.

There are two main approaches to selection:

  • Acquire pairs of specific stocks and tweets that mention their tickers in the body of the message
  • Acquire pairs of Twitter feeds (from one or more sources) and the market sectors that these feeds represent (in the form of indices)

We preferred the second approach for data collection. Our reasoning behind this decision is simple – Twitter feeds from multiple sources are more likely to provide us with a signal every day while emotional tweets about specific stocks could be quite rare (as long as it’s not AAPL).

To choose the Twitter feeds, we carefully hand-picked sources based on their impact and content.

Here is the list of all sources that we used:

@business, @WSJMarkets, @WSJMoneyBeat, @stocktwits, @benzinga, @markets, @IBDinvestors, @nytimesbusiness, @jimcramer, @TheStalwart, @ReformedBroker, @bespokeinvest, @stlouisfed, @Wu_Tang_Finance, @StockCats, @LizAnnSonders, @The_Real_Fly, @charliebilello, @lindayueh, @ukarlewitz, @paulkrugman, @EIAgov, @MarketWatch, @SeekingAlpha, @zerohedge

As the most relevant instrument for this aggregation of Twitter sources, we chose SPY (an ETF that tracks the S&P 500 index, SPX). The reasoning behind this decision is that the selected accounts cover the main industries of the US market and SPX is broad enough to reflect the general market attitude.

We acquired Twitter data via the public Twitter API and index data from our dxCurrent Python library.

Algorithm selection

We decided to use simple, dictionary-based methods and started with the VaderSentiment algorithm. This method combines a sentiment dictionary with a set of rules for applying it, and it is tuned for social media text. We implemented it in our pipeline and, after the first few experiments, concluded that this algorithm was too general for us. For example, the word ‘rising’ is completely neutral for Vader, while in the financial world it should naturally have a non-zero sentiment.

We found a finance-specific sentiment dictionary in Oliveira, 2016. The authors provide an open dictionary with sentiment scores in negative and positive contexts. In this dictionary, the word ‘rising’ has a strong positive sentiment. However, using it did not improve the algorithm’s results, probably because our data consists mostly of general language, so the differences in sentiment for financial jargon are not significant for our model. Therefore, we decided to present only the Vader sentiment analysis results.
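To make the difference between a general and a finance-specific dictionary concrete, here is a minimal sketch of dictionary-based scoring. It is not Vader and not the Oliveira dictionary: both lexicons, their scores, and the `lexicon_score` helper are made up for illustration.

```python
import re

# Toy lexicons with made-up scores, for illustration only.
GENERAL_LEXICON = {"great": 0.8, "good": 0.5, "bad": -0.6, "damage": -0.7}
# Hypothetical finance-specific additions (not taken from any published dictionary).
FINANCE_LEXICON = {**GENERAL_LEXICON, "rising": 0.6, "falling": -0.6}


def lexicon_score(text, lexicon):
    """Average the scores of known words; unknown words are ignored."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


headline = "Stocks are rising again"
general = lexicon_score(headline, GENERAL_LEXICON)  # 'rising' is unknown here
finance = lexicon_score(headline, FINANCE_LEXICON)  # 'rising' is scored here
print(general, finance)
```

With the general lexicon the headline contains no known words and comes out neutral, while the finance-aware lexicon assigns it a positive score.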

Metrics selection

The next big question for us was which metrics to use to evaluate our models. We saw our sentiment scores as inputs for two types of models: a classification model for predicting market movement and a trading model for making a profit based on signals extracted from Twitter.

Both models were constructed in a mostly identical fashion:

  • Select a Twitter source (or an aggregation)
  • Calculate a sentiment score for every tweet
  • Create a daily sentiment series by averaging scores within each day
  • Create a signal (-1/1 for a market movement classification and 1/0/-1 for a trading strategy)

As a result, we got two series of signals for every Twitter source with one signal per day (negative/positive for the classification task and negative/neutral/positive for trading). Accordingly, we selected two sets of metrics to check the performance of each model.
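The steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the tweet scores are invented, the `STRONG_POSITIVE` threshold is a hypothetical value (the article does not give one), and a zero score is arbitrarily assigned to the ‘falling’ class.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-tweet scores: (date, sentiment in [-1, 1]).
tweet_scores = [
    ("2019-01-02", 0.40), ("2019-01-02", 0.50),
    ("2019-01-03", -0.30), ("2019-01-03", 0.10),
    ("2019-01-04", 0.05),
]

STRONG_POSITIVE = 0.2  # assumed threshold, not given in the article


def daily_sentiment(scores):
    """Average tweet scores within each calendar day."""
    by_day = defaultdict(list)
    for day, score in scores:
        by_day[day].append(score)
    return {day: mean(values) for day, values in sorted(by_day.items())}


def classification_signal(score):
    """Two-class signal: 1 for 'rising', -1 for 'falling' (ties go to -1)."""
    return 1 if score > 0 else -1


def trading_signal(score, threshold=STRONG_POSITIVE):
    """Three-class signal: 1 buy, 0 hold, -1 sell."""
    if score < 0:
        return -1
    return 1 if score >= threshold else 0


daily = daily_sentiment(tweet_scores)
class_signals = {d: classification_signal(s) for d, s in daily.items()}
trade_signals = {d: trading_signal(s) for d, s in daily.items()}
print(class_signals)
print(trade_signals)
```

The two signal series differ only on days with weak positive sentiment, which the classifier labels as ‘rising’ while the trading rule stays neutral.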

Classifier metrics

We formulated our experiment as a classification task: based on sentiment from the previous day we classified the following trading day as either “rising” or “falling” and compared it to the realized return for that day (positive or negative, accordingly).

To check the performance of such a model, we used the F1 score and ROC AUC. As an additional metric, we also calculated the Pearson correlation between the daily return rate and the lagged daily sentiment score.
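For readers who want to see how these three metrics fit together with the one-day lag, here is a minimal standard-library sketch. The sentiment and return values are invented, and the helper functions are our own illustrative implementations rather than the code used for the results in this article.

```python
import math


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive day outranks a random negative day."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: sentiment[i] and returns[i] both refer to day i.
sentiment = [0.3, -0.2, 0.1, 0.05, -0.1, 0.2]
returns = [0.010, 0.004, 0.006, 0.002, -0.008, -0.003]

# One-day lag: yesterday's sentiment is paired with today's return.
lagged_sentiment = sentiment[:-1]
next_returns = returns[1:]

y_true = [1 if r > 0 else -1 for r in next_returns]
y_pred = [1 if s > 0 else -1 for s in lagged_sentiment]

f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, lagged_sentiment)
corr = pearson(lagged_sentiment, next_returns)
print(f1, auc, corr)
```

Note that the F1 score is computed on the thresholded signal, while ROC AUC and the Pearson correlation use the raw lagged sentiment score.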

Financial metrics

Using our dxCurrent signal processing and backtesting modules, we tested every sentiment-based strategy on dxFeed financial data and collected classical strategy metrics such as total return, volatility, and Sharpe ratio.

We compared our strategies with a buy and hold strategy and a risk-free investment. The buy and hold strategy had the same starting portfolio as the sentiment strategies but made no trades. The risk-free investment was assumed to yield 2.5% per annum.
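The exact formulas used inside dxCurrent are not shown here, so the following is only a sketch of common textbook definitions of these metrics. The 252-day annualisation convention and the sample portfolio values are assumptions; the 2.5% risk-free rate is the one mentioned above.

```python
import math

TRADING_DAYS = 252        # common annualisation convention (assumption)
RISK_FREE_ANNUAL = 0.025  # the 2.5% per annum mentioned above


def daily_returns(values):
    """Simple day-over-day returns of a portfolio value series."""
    return [b / a - 1 for a, b in zip(values, values[1:])]


def total_return(values):
    """Return over the whole period, from first to last value."""
    return values[-1] / values[0] - 1


def annual_volatility(returns):
    """Sample standard deviation of daily returns, annualised."""
    m = sum(returns) / len(returns)
    var = sum((r - m) ** 2 for r in returns) / (len(returns) - 1)
    return math.sqrt(var * TRADING_DAYS)


def sharpe_ratio(returns, risk_free=RISK_FREE_ANNUAL):
    """Annualised mean return in excess of the risk-free rate, per unit of volatility."""
    annual_mean = sum(returns) / len(returns) * TRADING_DAYS
    return (annual_mean - risk_free) / annual_volatility(returns)


def max_drawdown(values):
    """Largest peak-to-trough decline, as a non-positive fraction."""
    peak, worst = values[0], 0.0
    for v in values:
        peak = max(peak, v)
        worst = min(worst, v / peak - 1)
    return worst


# Hypothetical daily portfolio values in dollars.
portfolio = [4800.0, 4850.0, 4700.0, 4900.0, 4950.0]
rets = daily_returns(portfolio)
print(total_return(portfolio), annual_volatility(rets),
      sharpe_ratio(rets), max_drawdown(portfolio))
```

Applying the same functions to the value series of the sentiment strategy and of buy and hold gives directly comparable numbers.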

Analysis

We used our dxCurrent library for easy and fast strategy testing on historical dxFeed data.

First of all, we acquired SPY historical data. We were quite lucky – there was a major drop in price in January 2019 – a real test for a sentiment strategy!

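# dm is assumed to be a dxCurrent module alias; its import is not shown here
# start_date and end_date are the backtest bounds (set in the trading setup)
dxf = dm.DxFeed()
spy_data = dxf.get_feed(symbols=['SPY'],
                        date_from=start_date,
                        date_to=end_date)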

The next step was to acquire Twitter data and perform sentiment analysis. To do that, we used tweepy (an open-source library). We ended up collecting 39,958 tweets with a mean of 2,854 tweets per user, covering up to a three-year period for some users.

# Custom functions (not a part of dxCurrent solution) 
tweets = get_data(users)
df = prepare_data(tweets, spy_data, sentiment)
strategy = create_signal(tweets, sentiment)

After calculating sentiment we decided to perform a sanity check by reading the most positive and negative tweets:

The most negative tweet with -0.95 sentiment score:

@paulkrugman: ‘step 1: trade war step 2: emergency policies to offset damage from trade war step 3: policies to offset damage from’ (https://twitter.com/paulkrugman/status/1021844506767970305)

The most positive tweet with 0.95 sentiment score: 

@jimcramer: ‘so many great ones. $abt, $msft still cheap. same with $avgo.. $el is so fabulous. paychex good yield. $pg great organic growth. ‘ (https://twitter.com/jimcramer/status/1106569336234352640)

Interestingly, the most accurate source of Twitter sentiment was the @stocktwits account: based on its sentiment alone, we achieved a correlation of 0.12 (p < 0.005) between SPY returns and sentiment scores. For market movement prediction, the F1 score was 0.69 and the ROC AUC was 0.56.

Next, we created a strategy based on mean daily sentiment: any negative sentiment was interpreted as a signal to sell 1 share of SPY, zero or weak positive sentiment was considered neutral, and strong positive sentiment was a signal to buy 1 share of SPY. We also introduced a one-day lag between the sentiment signal and stock prices: today’s Twitter signal defined tomorrow’s action, which avoids look-ahead bias.

We started with $2,000 and 10 SPY shares.
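To show what such a lagged one-share rule does to a portfolio, here is a minimal sketch. It is not the dxCurrent implementation: the `simulate` function is hypothetical, the prices and signals are invented, and we assume trades execute at the given daily price with no commissions, no short selling, and no buying without enough cash.

```python
def simulate(prices, signals, money=2000.0, shares=10):
    """Apply yesterday's signal to today's price, one share per trade.

    prices[i] and signals[i] refer to the same day; signals are 1 (buy),
    0 (hold) or -1 (sell). Returns the portfolio value for each day.
    """
    values = []
    for day, price in enumerate(prices):
        action = signals[day - 1] if day > 0 else 0  # one-day lag
        if action == 1 and money >= price:
            money -= price
            shares += 1
        elif action == -1 and shares > 0:
            money += price
            shares -= 1
        values.append(money + shares * price)
    return values


# Hypothetical prices and signals, for illustration only.
prices = [270.0, 268.0, 272.0, 275.0]
signals = [1, -1, 0, 1]

sentiment_values = simulate(prices, signals)
buy_and_hold_values = simulate(prices, [0] * len(prices))
print(sentiment_values)
print(buy_and_hold_values)
```

Passing an all-zero signal series reproduces the idle buy and hold portfolio, which makes the two value series directly comparable.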

# trading setup 
money = 2000
initial_portfolio = {'SPY':10}
start_date, end_date = '2018-01-26', '2019-04-08'
# Initialise dxCurrent trader 
tr = Trader(symbols=list(initial_portfolio.keys()), 
            money=money, 
            initial_portfolio=initial_portfolio, 
            date_from=start_date, 
            date_to=end_date)
# Trade by generated signal
tr.trade_by_mask(strategy)

We can plot the actions of our strategy on the historical feed:

tr.plot_feed()

In general, we could see that sentiment analysis catches trends (but the statistical significance of such a claim requires further analysis). More importantly, the sentiment strategy ends up outperforming both the idle (buy and hold) and the risk-free strategy.

Let’s check the backtesting statistics:

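import pandas as pd

# Combine both strategies' PnL records and pivot to one PnL column per strategy
trade_df = pd.concat([sentiment_strategy,
                      idle_strategy]).reset_index()
trade_df = trade_df.pivot(index='index', columns='strategy', values='PnL')
bt = run_backtest(trade_df)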

Metric                Sentiment strategy   Buy&hold
Total return          8.24%                4.71%
Max drawdown          -17.59%              -11.76%
Num. of trades        274                  0
Volatility (ann.)     54%                  34%
Sharpe ratio (ann.)   1.82                 1.35

Conclusion

We created a simple but efficient strategy and backtested it with our dxCurrent solution. While it’s not publicly available yet (but soon will be!), a demo can be requested at sd@dxfeed.com. We hope that our tool will make your process of market data exploration and financial research much easier and faster.

Special thanks to the rest of the dxFeed Index Management team for their help and support.