Data Crawling Everything: Case of Social Media

Data scraping is the process of extracting information from a website and saving it in a spreadsheet or a local file on your computer. In this tutorial, we will explore how to scrape data from social media sites, starting with Twitter. We will then do the same for Reddit using its official API. Lastly, we will learn how to scrape the text content of a web page and how to download pictures from web pages. So, let’s get started!

Prerequisites

Before you go ahead, please note that there are a few prerequisites for this tutorial. You should have some basic knowledge of Machine Learning, as well as basic programming knowledge in any language (preferably Python). We will be using Jupyter Notebook for writing our code. If you do not already have it installed, visit the Jupyter Notebook website to install it, or work in any other code editor of your liking.

1. Scraping tweets from Twitter using Twint

There are a number of ways to scrape tweets from Twitter. You can do so using the Twitter API, but its shortcoming is that it limits the number of tweets that can be scraped. Collecting the tweets manually is also an option, but it takes unnecessary time and effort. This is why we will be using Twint to collect our tweets. Twint is a tool that allows you to scrape tweets by different criteria, e.g. the tweets of a particular user, tweets containing a particular keyword, tweets posted after or within a certain time, etc.

Installations

You can install Twint by typing the following command in your terminal:

pip install twint

Scraping Twitter tweets using Twint

Scraping tweets of a particular user

import twint

# Configure the search
config = twint.Config()
# Search tweets posted by the user 'BarackObama'
config.Username = "BarackObama"
# Limit search results to 20
config.Limit = 20
# Return tweets that were published after 20:30:15 on Jan 1st, 2020
config.Since = "2020-01-01 20:30:15"
# Format each tweet that is printed
config.Format = "Tweet Id {id}, tweeted at {time}, {date}, by {username} says: {tweet}"
# Store the tweets in a CSV file
config.Store_csv = True
config.Output = "Barack Obama"
twint.run.Search(config)

Output:

Tweet Id 1261004586359422979, tweeted at 18:44:56, 2020-05-14, by BarackObama says: Vote.
Tweet Id 1260955716644470784, tweeted at 15:30:44, 2020-05-14, by BarackObama says: Michelle and I want to do our part to give all you parents a break today, so we’re reading “The Word Collector” for @chipublib. It’s a fun book that vividly illustrates the transformative power of words––and we hope you enjoy it as much as we did. pic.twitter.com/ADYbL6Dzg4
Tweet Id 1260707691900612615, tweeted at 23:05:11, 2020-05-13, by BarackObama says: Despite all the time that’s been lost, we can still make real progress against the virus, protect people from the economic fallout, and more safely approach something closer to normal if we start making better policy decisions now. https://www.vox.com/2020/5/13/21248157/testing-quarantine-masks-stimulus …
....

Scraping tweets with a particular keyword

import twint

# Configure the search
config = twint.Config()
# Search tweets that mention Taylor Swift
config.Search = "taylor swift"
# Limit search results to 20
config.Limit = 20
# Return tweets that were published after 20:30:15 on Jan 1st, 2020
config.Since = "2020-01-01 20:30:15"
# Format each tweet that is printed
config.Format = "Tweet Id {id}, tweeted at {time}, {date}, by {username} says: {tweet}"
# Store the tweets in a CSV file
config.Store_csv = True
config.Output = "Taylor Swift"
twint.run.Search(config)
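
Once the tweets are stored in a CSV file, you will usually want to read them back and filter them. The following is a minimal sketch that uses only Python's standard csv module. It assumes the stored file is comma-separated and has columns named date, username and tweet; check the header row of your own file, since these names are an assumption here. The function name find_tweets and the sample rows are made up for illustration.

```python
import csv


def find_tweets(lines, keyword):
    """Return (date, username, tweet) tuples whose tweet text contains keyword.

    `lines` is any iterable of CSV lines, for example an open file object.
    The column names 'date', 'username' and 'tweet' are assumptions.
    """
    keyword = keyword.lower()
    matches = []
    for row in csv.DictReader(lines):
        text = row.get("tweet") or ""
        if keyword in text.lower():
            matches.append((row.get("date") or "", row.get("username") or "", text))
    return matches


# Made-up sample rows standing in for a stored CSV file
sample = [
    "date,username,tweet",
    "2020-05-14,BarackObama,Vote.",
    "2020-01-02,example_user,Listening to Taylor Swift",
]
print(find_tweets(sample, "vote"))
```

To run it on a real file, open the file with open(path, newline="", encoding="utf-8") and pass the file object as the first argument.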

2. Scraping Reddit using Reddit API

We will be scraping donation requests made on Reddit by using the official Reddit API. To access it, you need to:

1. Go to the official Reddit website.
2. Log into your Reddit account or create a new one.
3. Go to User Settings.
4. Go to Privacy and Security.
5. Go to App authorization.
6. Click on ‘are you a developer? create an app’.
7. Create a name for your application and fill in the other relevant fields. In the redirect URL field, put the URL of your localhost.
8. Click on ‘create app’.
9. Copy the characters underneath ‘personal use script’ and next to ‘secret’ and save them in a file or notepad. You will need them to gain access to the API.
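
Instead of pasting the saved characters directly into your code, one option is to keep them in environment variables and read them at run time. The sketch below shows this idea under stated assumptions: the variable names REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET and REDDIT_USER_AGENT and the function load_reddit_credentials are hypothetical names chosen for this example, not something Reddit requires.

```python
import os

# Hypothetical environment variable names; use whatever suits your setup.
ENV_NAMES = {
    "client_id": "REDDIT_CLIENT_ID",          # characters under 'personal use script'
    "client_secret": "REDDIT_CLIENT_SECRET",  # characters next to 'secret'
    "user_agent": "REDDIT_USER_AGENT",        # name of your application
}


def load_reddit_credentials(env=None):
    """Read the three credentials from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in ENV_NAMES.items()}
```

The returned dictionary uses the keys client_id, client_secret and user_agent, which are the same names the Reddit client in the Python code below is configured with.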

Installations

We will be using a Python package named PRAW (the Python Reddit API Wrapper) to easily use the Reddit API. To install it, run the following command in your terminal:

pip install praw

Python Code

import praw
import pandas as pd

# Fill in your own credentials. The characters under 'personal use script' are your client_id,
# those next to 'secret' are your client_secret, and user_agent is the name of your application.
reddit = praw.Reddit(client_id='',
                     client_secret='',
                     user_agent='')

# Subreddits related to donations
subreddits = [
    'donate',
    'Assistance',        # offers
    'Charity',
    'Donation',
    'gofundme',          # lots of categories
    'RandomKindness',
    'donationrequest',
]

# Collect the 10 hottest posts from each subreddit in a list
posts = []
for name in subreddits:
    for post in reddit.subreddit(name).hot(limit=10):
        posts.append([post.title, post.score, post.id, str(post.subreddit), post.url,
                      post.num_comments, post.selftext, post.created])

columns = ['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created']
posts = pd.DataFrame(posts, columns=columns)
# Save all collected fields to a CSV file
posts.to_csv('donations.csv', index=False)

# Data processing: keep only the title, score and body of each post
df = pd.read_csv('donations.csv')
df = df[['title', 'score', 'body']]
print(df.head())
print(df.shape)
# Save the processed donation posts to the CSV file
df.to_csv('donations.csv', index=False)
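
The created value collected for each post is not human-readable on its own. The sketch below converts it to a readable date, assuming the value is a Unix timestamp in seconds (which is how Reddit's API reports creation times). The helper name to_utc_iso is made up for this example.

```python
from datetime import datetime, timezone


def to_utc_iso(created):
    """Convert a Unix timestamp in seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(float(created), tz=timezone.utc).isoformat()


# One day after the Unix epoch
print(to_utc_iso(86400))
```

If you want to keep the date in your dataset, you could apply this function to the created column before dropping it.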

3. Scraping contents of a web page

We will be scraping the text content of the Wikipedia page on coronaviruses using a simple and powerful Python library named BeautifulSoup. It is important to be familiar with some of the basics of HTML for web scraping. First, right-click on the webpage and open your browser’s inspector. Hover your cursor over the section whose content you want to scrape, and you should see a blue box surrounding it. If you click it, the related HTML will be highlighted in the inspector. The section that we wish to scrape is a div that contains the entire text of the page.
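
To make the idea of "a div that contains the text" concrete, here is a small dependency-free sketch that picks out the text of one div by its id from an HTML string. It uses Python's built-in html.parser module rather than BeautifulSoup, and it assumes well-formed HTML in which every div is closed. The class and function names, and the tiny sample page, are made up for this example.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text found inside the <div> with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # how many <div> levels deep we are inside the target
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1            # a nested <div> inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1             # entering the target <div>

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_div_text(html, target_id):
    parser = DivTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    # Join the pieces with spaces and collapse runs of whitespace
    return " ".join(" ".join(parser.chunks).split())


page = ('<div id="nav">Menu</div>'
        '<div id="bodyContent"><p>Hello <b>world</b></p><div>inner</div></div>'
        '<div id="footer">Bye</div>')
print(extract_div_text(page, "bodyContent"))
```

Only the text inside the div with id bodyContent is kept; the navigation and footer text are ignored.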

Installations

To install BeautifulSoup, run the following command in your terminal:

pip install BeautifulSoup4

Python Code

# import libraries
import urllib.request
from bs4 import BeautifulSoup

# specify the URL of the webpage whose content you need to scrape
url = "https://en.wikipedia.org/wiki/Coronavirus"
request = urllib.request.Request(url)
# query the website and return the HTML of the webpage
response = urllib.request.urlopen(request)
# parse the HTML using BeautifulSoup
soup = BeautifulSoup(response, 'html.parser')
# find the <div> that holds the page content and get its text
text_box = soup.find('div', attrs={'id': 'bodyContent'})
text = text_box.text.strip()
print(text)

Output:

From Wikipedia, the free encyclopedia
Jump to navigation
Jump to search
This article is about the group of viruses. For the ongoing disease involved in the COVID-19 pandemic, see Coronavirus disease 2019. For the virus that causes this disease, see Severe acute respiratory syndrome coronavirus 2.
Subfamily of viruses in the family Coronaviridae
Orthocoronavirinae
Transmission electron micrograph (TEM) of avian infectious bronchitis virus
Illustration of the morphology of coronaviruses; the club-shaped viral spike peplomers, colored red, create the look of a corona surrounding the virion when observed with an electron microscope.
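
Text scraped this way often contains blank lines and stray whitespace. A minimal clean-up sketch, assuming you only want to strip each line and drop the empty ones, could look like this (the function name tidy_scraped_text is made up for this example):

```python
def tidy_scraped_text(text):
    """Strip every line and drop the lines that end up empty."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


print(tidy_scraped_text("  From Wikipedia  \n\n\n  Jump to search \n"))
```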

4. Scraping images

We will be scraping images in batches using the Fatkun Batch Download Image extension.

Prerequisites

You will need the Google Chrome browser along with the Fatkun Batch Download Image extension.

Steps:

1. After you have installed the extension, open the website that contains the pictures you want to download.
2. Click on the extension’s icon.
3. The extension opens a new tab showing all the images it has detected. By default, every picture on this tab is selected for download. After making your selection, click on ‘save image’.
4. The extension will show a warning and ask where to save the files before they are downloaded; you have to confirm each image.
5. The extension creates a new folder named after the title of the website and downloads the selected images there. You can also click on ‘more options’ to filter the images by link, rename them, and sort them by size.
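
If you prefer to stay in Python instead of using a browser extension, a rough equivalent of the "detect all images" step can be sketched with the standard library. The example below is not how the extension works internally; it simply collects the src attribute of every img tag in an HTML string and resolves it against the page URL. The class and function names and the sample page are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect the absolute URLs of all <img> tags in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            url = urljoin(self.base_url, src)   # resolve relative links
            if url not in self.urls:            # skip duplicates
                self.urls.append(url)


def collect_image_urls(html, base_url):
    parser = ImageCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls


page = ('<img src="/a.png"><p><img src="https://x.example/b.jpg"></p>'
        '<img alt="no source"><img src="/a.png">')
print(collect_image_urls(page, "https://example.com/gallery/"))
```

Each collected URL could then be downloaded, for example with urllib.request.urlretrieve(url, filename). Note that images loaded by JavaScript after the page opens will not appear in the raw HTML, so this sketch may find fewer images than the extension does.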

While crawling gives easy access to many web-based data collections, the data usually comes with too much noise and contamination to be used as a dataset right away. Companies and researchers therefore have to put considerable effort into quality control, and having enough human resources for it is always a challenge. It is often more efficient to find a service that does the laborious work (both collection and preprocessing) for you. For that, we could be your perfect solution!

Here at DATUMO, we crowdsource our tasks to diverse users located around the globe to ensure both quality and quantity. Moreover, our in-house managers double-check the quality of the collected or processed data! Check us out at datumo.com for more information!
