Twitter is not just a source of news; it is one of the best available samples of the world’s thoughts. With more than 330 million active users, it is one of the top platforms where people like to share what they think. Twitter data can be used for a variety of purposes such as research, consumer insights, demographic analysis, and much more.
Hence, the primary aim of this tutorial is to teach you how to get a sample of Twitter data relevant to your project or business.
Before proceeding, make sure you have all of these variables handy:
- Consumer Key
- Consumer Secret
- Access Token
- Access Token Secret
If you want to know how to acquire the above-mentioned details, read the blog post written by my colleague Dattatray Upase.
Now let’s do some coding!
Defining the input variables
First, you have to define some of the global variables that you would need for the program:
```python
import sys

start_date = sys.argv[1]   # e.g. "2018-01-09"
end_date = sys.argv[2]     # e.g. "2018-01-10"

consumerKey = "Enter_Your_Consumer_Key_Here"
consumerSecret = "Enter_Your_Consumer_Secret_Here"
accessToken = "Enter_Your_Access_Token_Here"
accessTokenSecret = "Enter_Your_Access_Token_Secret_Here"

keyword = sys.argv[3]      # e.g. "tcs"
lang = "en"                # see what Twitter offers for language filtering
data = {}                  # collected tweets will be stored here
```
I am importing ‘sys’ to read command-line arguments, because I might want to change the keyword, start date, or end date between runs. For the language I picked English, but you might want to check which other languages are supported. The results will be stored in ‘data’ at the end.
As a result, a typical usage of the script would be like this:
```
python script.py start_date end_date keyword
```
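For example, using the placeholder values shown in the comments above (the dates and the keyword are just sample inputs), a run could look like this:

```
python script.py 2018-01-09 2018-01-10 tcs
```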
Accessing the Twitter API
```python
import oauth2

req_count = 0

def oauth_req(url, http_method="GET", post_body=b"", http_headers=None):
    global req_count, consumerKey, consumerSecret, accessToken, accessTokenSecret
    req_count += 1
    consumer = oauth2.Consumer(key=consumerKey, secret=consumerSecret)
    token = oauth2.Token(key=accessToken, secret=accessTokenSecret)
    client = oauth2.Client(consumer, token)
    resp, content = client.request(
        url,
        method=http_method,
        body=post_body,
        headers=http_headers
    )
    return content
```
The ‘req_count’ variable tracks the number of times the API has been called during the execution of the program. With the code as originally written, I was facing the following error:
TypeError: Unicode-objects must be encoded before hashing
To avoid this, I changed post_body="" to post_body=b"" (a bytes literal), which fixes the problem.
Twitter API Usage and Reference
It’s time to set the API URL used to get the Twitter data. I am using the parameter ‘min_faves’. Here’s an explanation of the URL parameters and some optimization tricks:
- ‘min_faves’ sets the minimum number of favorites a tweet must have in order to appear in the results. It’s a very useful filter, but it isn’t mentioned in the Twitter API documentation.
- ‘q’ is the query, i.e. the keywords you want to search for. Here it’s important to use as few keywords as possible. For example, imagine I want tweets about Facebook and Google. If I pass both as keywords, say FACEBOOK and GOOGLE, a single request returns at most 100 tweets, since that’s the API’s limit. But if I run the query twice, once with Facebook and once with Google, I can get up to 200 tweets in total. Long story short, it’s better to use one keyword per query.
- ‘lang’ is the language of the filtered tweets. Since I want tweets in English, I set it to ‘en’.
- ‘since’ is the start date of the period you want to search. This start date must be within the last 7 days. This is another feature that is not covered in the Twitter API documentation.
- ‘until’ is the end date of your desired period. Logically, it must also be within the last 7 days, and it is likewise undocumented.
- ‘result_type’ specifies the kind of tweets you want. It has 3 values:
  - ‘recent’ returns the most recent tweets, i.e. the tweets at the end of the selected period.
  - ‘popular’ returns only the most popular tweets and therefore misses a lot of tweets. You would always get the tweets with the most favorites and retweets, so the min_faves filter would be of no use here.
  - ‘mixed’ returns a mix of recent and popular tweets.
- ‘count’ is the maximum number of tweets returned per request. The default is 15 and the maximum is 100.
With the mixed ‘result_type’ and the ‘min_faves’ filter, we can collect the maximum number of tweets by running the query multiple times.
```python
def get_tweets(min_faves):
    global keyword, start_date, end_date, lang
    return oauth_req('https://api.twitter.com/1.1/search/tweets.json?' + '&q=' + keyword + '&lang=' + lang
                     + '%20since%3A' + start_date + '%20until%3A' + end_date
                     + '%20min_faves%3A' + str(min_faves) + '&result_type=mixed&count=100')
```
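To make the string concatenation above more concrete, here is roughly what the assembled request URL looks like for the sample inputs used earlier (keyword ‘tcs’, the dates from the example run, and an illustrative min_faves of 1000). The since, until and min_faves operators are URL-encoded (%20 is a space, %3A is a colon) and simply appended onto the query string by the concatenation above:

```
https://api.twitter.com/1.1/search/tweets.json?&q=tcs&lang=en%20since%3A2018-01-09%20until%3A2018-01-10%20min_faves%3A1000&result_type=mixed&count=100
```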
For more documented features, you can also check out Twitter’s API documentation.
Saving/Autosaving the Retrieved Tweet Data
As a next step, you need to define an autosave method with a parameter ‘saveOverride’, which lets you bypass the autosave time restriction and force the file to be saved. To do this, I create ‘t_last’ to store the program’s start time. Inside the function I check whether 5 minutes have passed since ‘t_last’ (the time of the last save); if they have, I mark ‘saveStatus’ as True.
Next, I check ‘saveOverride’, which tells the program that the file should be saved right now, no matter what. In that case I also set ‘saveStatus’ to True.
Then, if ‘saveStatus’ is True, the script updates ‘t_last’ to the current time, creates a dictionary object, and prints “Autosave at [time]” so that you know the data is being autosaved.
Next, I check if the output file already exists. If it does, I merge the current data with the data from the already saved file and write the result back to the same file. If it doesn’t, a new file is created and the data is written to it.
```python
import os
import json
import time
import datetime

t_last = time.time()

def autosave(saveOverride=False):
    global t_last
    saveStatus = (time.time() > t_last + 300)
    if saveOverride == True:
        saveStatus = True
    if saveStatus:
        t_last = time.time()
        tmp = {}
        print("Autosave at " + str(datetime.datetime.now().strftime("%Y-%m-%d_%H:%M:%S")))
        fnamea = keyword + "-st-" + start_date + "-ed-" + end_date + '.json'
        if os.path.exists(fnamea) == True:
            with open(fnamea, 'r+') as f:
                tmp = json.load(f)
        for i in data.keys():
            tmp[i] = data[i]
        with open(fnamea, 'w+') as f:
            json.dump(tmp, f)
```
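As a quick illustration of how the two modes are meant to be used later in the main loop (this only calls the function defined above, nothing new):

```python
autosave()      # periodic call: writes only if more than 5 minutes have passed since the last save
autosave(True)  # forced save: writes immediately, e.g. after an error or before the program exits
```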
With that, almost the entire functionality I wanted is now in place.
Optimizing Further
Finally, it’s time to make use of these functions. I am writing a while(1) loop, which is equivalent to while(True): it keeps the instructions running forever until a ‘break’ statement is reached or an unhandled exception is raised.
First, I collect the tweet data in the dictionary ‘d’ using json.loads, which converts the API response into a dictionary. Then, I wrap the block of code that extracts statuses from the data in a try/except. I use try/except because sometimes the Twitter API returns not data but a JSON object describing an error, and I don’t want my program to stop in such cases. Instead, I want it to report at which request number that happened and to save the Twitter data collected so far with the autosave function. Twitter allows us to make 180 requests per 15 minutes, which is 12 requests per minute, or one request every five seconds. Just to be safe, I add a sleep command so the program sleeps for 5 seconds after each iteration.
After that, the code will display the number of tweets the script has collected so far.
Finally, it’s time for the major optimization trick. I tested this script for almost a week and got the following number of tweets for each min_faves value. I can get a maximum of 100 tweets per request and I want to collect as many as possible. Currently there are not many tweets with high min_faves values, but we want to account for the times when, perhaps, the company or the keyword is trending. The maximum value of min_faves can be 999999.
| min_faves Value | Number of Tweets |
|---|---|
| 100,000 | 1 |
| 90,000 | 1 |
| 80,000 | 1 |
| 70,000 | 2 |
| 60,000 | 3 |
| 50,000 | 6 |
| 40,000 | 6 |
| 30,000 | 12 |
| 25,000 | 12 |
Therefore, I use the following logic: start with a min_faves value of 60000 and decrease it by 10000 on each request until it reaches 10000. But if, say, the keyword is trending and I get the full 100 tweets while min_faves is 30000, the step is halved to 5000 and min_faves is increased to 35000 before querying again, so the new step becomes 5000 instead of 10000. However, once the step has shrunk to 1000 or below, the halving stops and the script simply keeps subtracting it. Finally, once min_faves is less than or equal to 10000, it decreases by a small fixed interval on every request.
At the end, the program lets you know that the work is done by displaying ‘End’.
```python
min_faves = 60000
change = 10000   # high reduction in min_faves to extract data
interval = 500   # normal reduction in min_faves to extract data
c = 0            # tweets returned by the last request (avoids a NameError if the first request errors out)

while(1):
    d = json.loads(get_tweets(min_faves))
    try:
        for i in d['statuses']:
            data[i['id']] = i
        c = len(d['statuses'])
    except Exception as e:
        # Twitter sometimes returns an error JSON instead of statuses; log it and save what we have
        print("Error at request : " + str(req_count))
        autosave(True)
    print("At request: " + str(req_count) + " Total Tweets Collected: " + str(len(data)) + " with Min Faves: " + str(min_faves))
    if c == 100 and min_faves > 10000:
        # the request hit the 100-tweet cap, so back off: halve the step and raise min_faves again
        if change > 1000:
            change //= 2   # integer division keeps min_faves an integer in the query URL
            min_faves += change
        else:
            min_faves -= change
    elif min_faves > 10000:
        min_faves -= change
    else:
        min_faves -= interval
    if min_faves < 0:
        fnamea = keyword + '.json'
        autosave(True)
        break
    autosave()
    time.sleep(5)
print("End")
```
You can find the entire code on GitHub.
That’s all. In the next Twitter data tutorial, I am going to show you how to retrieve real-time tweets using the big data tool ‘Flume’. Stay tuned!