Building frequency counts and bigrams using the posts of traders

Reading time: 9 minutes

Stocktwits is the largest social network for investors and traders of all levels, and it lets us see what is happening in the financial markets. Today we will build a frequency dictionary and bigrams of the users’ posts and split them into groups by the author’s number of followers. This will allow us to see how the posts of different types of traders differ.

This is how the feed for the CCIV security looks on Stocktwits:

Some users have the status of officials:

Scraping the posts

Stocktwits has an API that returns 30 posts at a time. The API responds with JSON, so we will write a get_30_messages function that reads the parsed response and appends every entry to a list called rows. Each post already contains information about its author, so we will not create separate tables and will save everything in one DataFrame. For this purpose, we will create a list with the column names and initialize an empty list called rows, to which we will append all the scraped posts.

Some posts don’t have a “likes” key in the JSON file which results in KeyError. To avoid the error, we will assign 0 to the “likes” in such posts.

# Column names of the final DataFrame; they must match the keys of the dicts in rows
cols = ['post_id', 'text', 'created_at', 'user_id', 'likes', 'sentiment', 'symbol', 'identity', 'followers', 'following', 'ideas', 'watchlist_stocks_count', 'like_count', 'plus_tier']
rows = []
 
def get_30_messages(data):
    # symbol is read from the enclosing scope: it is the loop variable of the scraping loop
    for p in data['messages']:
        # Posts without likes have no 'likes' key at all
        try:
            likes = p['likes']['total']
        except KeyError:
            likes = 0
        rows.append({'post_id': p['id'], 
                    'text': p['body'], 
                    'created_at': p['created_at'], 
                    'user_id': p['user']['id'], 
                    'likes': likes,
                    'sentiment': p['entities']['sentiment'], 
                    'symbol': symbol,
                    'identity': p['user']['identity'],
                    'followers': p['user']['followers'], 
                    'following': p['user']['following'], 
                    'ideas': p['user']['ideas'], 
                    'watchlist_stocks_count': p['user']['watchlist_stocks_count'], 
                    'like_count': p['user']['like_count'], 
                    'plus_tier': p['user']['plus_tier']
                    })
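The missing-key case can also be handled without try/except. The sketch below is only an illustration: the likes_total helper and the two sample messages are hypothetical, shaped like the fields used above rather than taken from a real API response.

```python
def likes_total(message):
    """Return the number of likes of a message, or 0 when the key is absent."""
    # dict.get returns the given default instead of raising KeyError
    return message.get("likes", {}).get("total", 0)


# Hypothetical messages, shaped like the fields used above
with_likes = {"id": 1, "body": "Buy the dip", "likes": {"total": 3}}
without_likes = {"id": 2, "body": "Holding"}

print(likes_total(with_likes))     # 3
print(likes_total(without_likes))  # 0
```

Both approaches give the same result for a missing key; chained get calls simply keep the fallback value next to the lookup.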

We will scrape the posts from the pages of the 16 most trending securities.

symbols = ['DIA', 'SPY', 'QQQ', 'INO', 'OCGN', 'BTC.X', 'SNAP', 'INTC', 'VXX', 'ASTS', 'SKLZ', 'RIOT', 'DJIA', 'GOLD', 'GGII', 'COIN']

As the API request returns only the 30 most recent posts, to get older posts we need to save the id of the last post into a dictionary and pass it as the max parameter of the next request. Unfortunately, the API allows only 200 requests per hour, so to stay within the limit we will make only 11 requests for each security: 16 securities × 11 requests = 176 requests.

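import json
import requests

# Oldest post id seen so far for each symbol, used as the max parameter of the next request
last_id_values = dict()
        
for symbol in symbols:
    # The first request returns the 30 most recent posts
    file = requests.get(f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json")
    data = json.loads(file.content)
    
    for i in range(10):
        get_30_messages(data)
            
        last_id = data['cursor']['max']
        last_id_values[symbol] = last_id
        
        # Ask for the next, older page of posts
        file = requests.get(f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json?max={last_id}")
        data = json.loads(file.content)
    
    # Save the posts from the last page that was fetched
    get_30_messages(data)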
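The cursor logic is easier to follow when it is separated from the network calls. The following is a minimal sketch under stated assumptions: fetch_pages and fake_fetch are hypothetical names, and fake_fetch is a stand-in for the API that returns only posts strictly older than the cursor. Whether the real max parameter includes the boundary post is not stated in this article.

```python
def fetch_pages(fetch, symbol, n_pages):
    """Collect messages from up to n_pages pages by following the 'max' cursor."""
    messages = []
    max_id = None  # no cursor on the first request
    for _ in range(n_pages):
        data = fetch(symbol, max_id)
        if not data["messages"]:
            break  # nothing older is available
        messages.extend(data["messages"])
        max_id = data["cursor"]["max"]
    return messages, max_id


# Hypothetical stand-in for the API: ids count down from 100, 3 posts per page
def fake_fetch(symbol, max_id, page_size=3):
    start = 100 if max_id is None else max_id - 1
    ids = list(range(start, max(start - page_size, 0), -1))
    return {"messages": [{"id": i} for i in ids],
            "cursor": {"max": ids[-1] if ids else None}}


posts, last_id = fetch_pages(fake_fetch, "SPY", n_pages=2)
print([p["id"] for p in posts], last_id)  # [100, 99, 98, 97, 96, 95] 95
```

Passing the fetch function as an argument makes it possible to check the paging logic without spending any of the hourly request quota.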

So far we have collected fewer than 6000 posts (at most 16 × 11 × 30 = 5280), which is not enough for the analysis. That is why we will set a timer that reruns the collection every 1 hour and 5 minutes, for 11 cycles.

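import time

def get_older_posts():
    for symbol in symbols:
        # 12 requests per symbol: 16 × 12 = 192 requests, below the hourly limit of 200
        for i in range(12):
            file = requests.get(f"https://api.stocktwits.com/api/2/streams/symbol/{symbol}.json?max={last_id_values[symbol]}")
            data = json.loads(file.content)        
            get_30_messages(data)
 
            # Remember where to continue from in the next request
            last_id = data['cursor']['max']
            last_id_values[symbol] = last_id
 
for i in range(11):
    # Wait 1 hour and 5 minutes (3900 seconds) so that the request quota resets
    time.sleep(3900)
    get_older_posts()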
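The request counts used in the two loops can be checked with a few lines of arithmetic. This is only a sanity check; the constant and function names are illustrative.

```python
REQUESTS_PER_HOUR = 200  # limit stated above
N_SYMBOLS = 16


def requests_per_cycle(n_symbols, requests_per_symbol):
    """Total number of API requests made in one pass over all symbols."""
    return n_symbols * requests_per_symbol


first_pass = requests_per_cycle(N_SYMBOLS, 11)  # 1 initial + 10 paged requests
later_pass = requests_per_cycle(N_SYMBOLS, 12)  # get_older_posts
print(first_pass, later_pass)                   # 176 192

# Largest per-symbol request count that still fits into the limit
print(REQUESTS_PER_HOUR // N_SYMBOLS)           # 12
```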

After all the data is collected, let’s create a DataFrame.

import pandas as pd
df = pd.DataFrame(rows, columns=cols)

The resulting table will look like this:

It is important to check that post_id has no duplicate values. Comparing the number of unique values with the total number of values in post_id shows that we have about 10000 duplicates.

df.post_id.nunique(), len(df.post_id)

This happened because some posts appear on the pages of several securities. So the last step is to drop the duplicates.

df.drop_duplicates(subset="post_id", inplace=True)
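For illustration, the same idea can be written in plain Python. This is a sketch with a hypothetical drop_duplicate_posts helper and made-up sample rows; like drop_duplicates with its default settings, it keeps the first occurrence of each id.

```python
def drop_duplicate_posts(rows, key="post_id"):
    """Keep only the first row for every value of key."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] in seen:
            continue  # this post was already collected from another page
        seen.add(row[key])
        unique.append(row)
    return unique


# Made-up rows: post 1 shows up on the pages of two securities
sample = [{"post_id": 1, "symbol": "SPY"},
          {"post_id": 2, "symbol": "SPY"},
          {"post_id": 1, "symbol": "QQQ"}]
unique_rows = drop_duplicate_posts(sample)
print(len(sample), len(unique_rows))  # 3 2
```

Note that the symbol of a repeated post is lost for every page except the first one, which is also true for the DataFrame version.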

Frequency counts and bigrams

First of all, let’s create a frequency count for posts without dividing them into groups.

df.text.str.split(expand=True).stack().value_counts()

We can see that articles, conjunctions, and prepositions prevail over the other words:

Thus, we need to remove them from the dataset. However, even after the dataset is cleaned, the results look like this. Apart from the fact that 39 is the most frequent token (most likely a leftover of &#39;, the HTML entity for an apostrophe), the data is not very informative, and it is difficult to draw any conclusions from it.
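The stray 39 token most likely appears because apostrophes arrive escaped as the HTML entity &#39;. This is an assumption, since the raw texts are not shown here. Under that assumption, decoding the entities before counting removes the token. The sketch below uses only the standard library; the stop word set and the sample posts are made up for the example.

```python
import html
import re
from collections import Counter

# Tiny illustrative stop word list, not the nltk one
STOP_WORDS = {"the", "a", "to", "and", "is", "s"}


def word_counts(texts):
    """Count words after decoding HTML entities and removing stop words."""
    counts = Counter()
    for text in texts:
        decoded = html.unescape(text).lower()     # "let&#39;s" -> "let's"
        words = re.findall(r"[a-z0-9]+", decoded)
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


raw = ["Let&#39;s buy the dip", "let&#39;s go to the moon"]

# Without decoding, the entity leaves a "39" token behind
print(re.findall(r"[a-z0-9]+", raw[0].lower()))  # ['let', '39', 's', 'buy', 'the', 'dip']

counts = word_counts(raw)
print(counts.most_common(1))  # [('let', 2)]
```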

In this case, we will need to build bigrams. A bigram is a sequence of two adjacent elements, here two words standing next to each other. There are many algorithms for building n-grams with different levels of optimization. We will use a built-in nltk function to build the bigrams for one group. First, let’s import the additional libraries, download the English stop words, and clean the data. Then we will add more stop words, including the stock tickers that appear in every post.

import collections
import unicodedata
from string import punctuation

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
 
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

english_stopwords = stopwords.words("english")
symbols = ['DIA', 'SPY', 'QQQ', 'INO', 'OCGN', 'BTC.X', 'SNAP', 'INTC', 'VXX', 'ASTS', 'SKLZ', 'RIOT', 'DJIA', 'GOLD', 'GGII', 'COIN']
symbols_lower = [sym.lower() for sym in symbols]
append_stopword = ['https', 'www', 'btc', 'x', 's', 't', 'p', 'amp', 'utm', 'm', 'gon', 'na', '’', '2021', '04', 'stocktwits', 'com', 'spx', 'ndx', 'gld', 'slv', 'es', 'f', '...', '--', 'cqginc', 'cqgthom', 'gt']
english_stopwords.extend(symbols_lower)
english_stopwords.extend(append_stopword)

Let’s define a function that prepares the text: it converts all the words to lowercase, reduces them to their base form, and removes stop words and punctuation.

wordnet_lemmatizer = WordNetLemmatizer()
 
def preprocess_text(text):
    # Lowercase the post and split it into tokens
    tokens = nltk.word_tokenize(text.lower())
    # Drop stop words and punctuation, then reduce each remaining token to its base form
    tokens = [wordnet_lemmatizer.lemmatize(token) for token in tokens
              if token not in english_stopwords
              and token != " "
              and token.strip() not in punctuation]
    
    return " ".join(tokens)

# Apply the cleaning to every post
df.text = df.text.apply(preprocess_text)

For example, let’s take the group of the least popular users, those with fewer than 300 followers, build the bigrams, and output the most frequent ones.

non_pop_df = df[(df['followers'] < 300)]
 
non_pop_counts = collections.Counter()
for sent in non_pop_df.text:
    words = nltk.word_tokenize(sent)
    non_pop_counts.update(nltk.bigrams(words))
non_pop_counts.most_common()
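The same counting can be repeated for every follower group. The sketch below is a standard-library illustration, not the code used for the results: follower_group, bigrams, and bigram_counts_by_group are hypothetical helpers, the sample posts are made up, and the handling of the exact boundaries (for instance, whether 300 followers belongs to the first or the second group) is an assumption.

```python
from collections import Counter


def follower_group(followers):
    """Map a follower count to one of the four groups (boundaries are assumed)."""
    if followers < 300:
        return "less than 300"
    if followers < 3000:
        return "300 to 3000"
    if followers < 30000:
        return "3000 to 30000"
    return "more than 30000"


def bigrams(words):
    """Yield pairs of adjacent words, like nltk.bigrams does for a list."""
    return zip(words, words[1:])


def bigram_counts_by_group(posts):
    """posts is an iterable of (followers, cleaned_text) pairs."""
    counts = {}
    for followers, text in posts:
        group = follower_group(followers)
        counts.setdefault(group, Counter()).update(bigrams(text.split()))
    return counts


# Made-up, already cleaned posts
sample_posts = [(120, "buy dip buy dip"),
                (50, "buy dip next week"),
                (45000, "stock moving average")]
result = bigram_counts_by_group(sample_posts)
print(result["less than 300"].most_common(1))  # [(('buy', 'dip'), 3)]
```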

Results of the bigrams study

Users with fewer than 300 followers mostly write about their personal plans for making money. This is shown by collocations like short term, long term, and make money.
Less than 300 followers:
1. look like, 439
2. next week, 422
3. let 39, 364
4. capital gain, 306
5. long term, 274
6. let go, 261
7. stock market, 252
8. buy dip, 252
9. gain tax, 221
10. make money, 203
11. short term, 201
12. buy buy, 192

More popular users with 300 to 3000 followers discuss more abstract issues like sweep premium, stock price and artificial intelligence.
From 300 to 3000 followers:
1. sweep premium, 166
2. price target, 165
3. total day, 140
4. stock market, 139
5. ask premium, 132
6. stock price, 129
7. current stock, 117
8. money trade, 114
9. trade option, 114
10. activity alert, 113
11. trade volume, 113
12. artificial intelligence, 113

Popular users with 3000 to 30000 followers discuss their observations and also promote their accounts or articles.
From 3000 to 30000 followers:
1. unusual option, 632
2. print size, 613
3. option activity, 563
4. large print, 559
5. activity alerted, 355
6. observed unusual, 347
7. sweepcast observed, 343
8. |🎯 see, 311
9. see profile, 253
10. profile link, 241
11. call expiring, 235
12. new article, 226

Very popular traders with more than 30000 followers mostly act as information sources and post about changes in the stock market. This is indicated by the frequent up and down arrows and by collocations like “stock x-day” or “moving average”.
Users with more than 30000 followers:
1. dow stock, 69
2. elliottwave trading, 53
3. ⇩ indexindicators.com, 51
4. ⇧ indexindicators.com, 50
5. u stock, 47
6. stock 5-day, 36
7. moving average, 29
8. stock moving, 28
9. stock x-day, 27
10. ⇧ 10-day, 26
11. stock daily, 25
12. daily rsi, 25

We have also built the bigrams for officials, but the results turned out to be very similar to those of the most popular users.

2021   data analytics   nltk