Extracting Reddit Data With R and the package RedditExtractoR (2023 Update)

James Cook
17 Apr 2023 16:50
Educational, Learning

TLDR
James Cook from the University of Maine at Augusta demonstrates how to use the RedditExtractoR package in R for social media data analysis. He walks through installing R and RStudio, using RedditExtractoR to obtain data from Reddit, and turning that data into datasets. The tutorial covers fetching recent post URLs and then the comments on those posts, using the Tar Heels sports subreddit as the example, and stresses how simple and efficient the package makes data extraction.

Takeaways
  • 📊 The video is a tutorial on using the RedditExtractoR package in R for data analysis.
  • 💻 The presenter, James Cook from the University of Maine at Augusta, explains how to obtain datasets from Reddit for analysis.
  • 🔄 The RedditExtractoR package has recently been rewritten, which simplifies the process of data extraction.
  • 📱 The tutorial recommends RStudio for its user-friendly interface and its consistency across different computers.
  • 🔧 To begin, install R and RStudio, then install the RedditExtractoR package along with its dependencies.
  • 📚 The 'library' command in R loads the RedditExtractoR package so its functions can be used.
  • 🔍 The 'find_thread_urls' command extracts the URLs of recent posts from a specified subreddit, with options to sort the results and limit the search period.
  • 📊 The 'get_thread_content' command fetches information about the comments associated with each URL extracted previously.
  • ⏳ The data extraction process requires an internet connection and may take some time, depending on the amount of data.
  • 📈 The resulting dataset includes variables such as date, timestamp, title, subreddit, number of comments, and URL.
  • 🔎 The video demonstrates how to view and interpret the data, including the structure of threads and comments, and points to possibilities for further analysis.
Q & A
  • What is the purpose of using the RedditExtractoR package in the video?

    -The purpose of using the RedditExtractoR package is to obtain data from the social media platform Reddit and convert it into data sets that can be used for various types of analysis.

  • Which software programs are recommended for use with the RedditExtractoR package?

    -The recommended programs are R and RStudio. RStudio is recommended because it is user-friendly and behaves consistently across different types of computer.

  • How does one install the RedditExtractoR package?

    -To install the RedditExtractoR package, first download R from the R Project's website, then install RStudio, and finally call the 'install.packages' function in R with 'RedditExtractoR' as the argument.

  • What is the first command used in the video to collect data?

    -The first command used in the video to collect data is 'find_thread_urls', which retrieves a list of URLs for the most recent posts in a specified subreddit.

  • How can the data collection be sorted?

    -The data collection can be sorted with the 'sort_by' option. Setting it to 'new' returns the posts from newest to oldest.

  • What is the 'period' option used for in the RedditExtractoR package?

    -The 'period' option determines how far back in time the search should go, with the values 'hour', 'day', 'week', 'month', 'year', or 'all'.

  • What is the significance of the 'URL' variable in the collected data set?

    -The 'URL' variable is significant because it provides the web address for each post, allowing users to access not only the main text of the post but also the comments section for further analysis.

  • How does the 'get_thread_content' function work in the video?

    -The 'get_thread_content' function takes a list of URLs and gathers details about all the comments under each post, creating a new data set with this additional information.

  • What are the two elements in the 'TarHeels comments' object?

    -The two elements in the 'TarHeels comments' object are 'threads' and 'comments', with 'threads' containing information about the posts and 'comments' containing the text and details of individual comments.

  • What types of analysis can be performed with the collected data?

    -With the collected data, various types of analysis can be performed, such as examining the popularity of posts through upvotes and downvotes, studying trends over time, and analyzing the relationships between commenters.

  • How long did it take to collect data for the 'TarHeels comments' object in the video?

    -It took approximately 21 minutes to collect data for the 'TarHeels comments' object after initiating the data collection process.
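
The popularity and trend analyses mentioned in the answers above could look roughly like the following sketch. It runs offline on a small made-up data frame. The column names (date, title, upvotes, downvotes) are assumptions modelled on the variables described in this summary, and real output from the package may differ.

```r
# Made-up stand-in for the thread-level data; values are illustrative only.
threads <- data.frame(
  date      = as.Date(c("2023-04-10", "2023-04-10", "2023-04-11")),
  title     = c("Post A", "Post B", "Post C"),
  upvotes   = c(12, 40, 7),
  downvotes = c(2, 5, 1),
  stringsAsFactors = FALSE
)

# Popularity: net votes per post, most popular first.
threads$net_votes <- threads$upvotes - threads$downvotes
ranked <- threads[order(-threads$net_votes), c("title", "net_votes")]
print(ranked)

# Trend over time: number of posts per day.
posts_per_day <- table(threads$date)
print(posts_per_day)
```

The same two steps (derive a score, then sort or tabulate) apply to the comment-level data as well.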

Outlines
00:00
📚 Introduction to RedditExtractoR

This section introduces James Cook from the University of Maine at Augusta, who explains how to use the RedditExtractoR package within R for data analysis. He emphasizes the simplicity of the process and the recent rewrite of the package. The first step is to start R, preferably through RStudio for its user-friendly features. He suggests downloading R from the official website and installing RStudio, which is free. The next step is to install the RedditExtractoR package and then activate it with the library command in R. The section concludes with a brief mention of the two rounds of data collection that will be performed.
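
As a minimal sketch of the setup step described above, assuming R is already installed and the machine can reach CRAN:

```r
# Install once per machine. install.packages() also fetches the packages
# that RedditExtractoR needs in order to run.
if (!requireNamespace("RedditExtractoR", quietly = TRUE)) {
  install.packages("RedditExtractoR")
}

# Load the package at the start of every new R session.
library(RedditExtractoR)
```

The requireNamespace() check is a convenience added here so that rerunning the script does not reinstall the package. The video itself may simply use the RStudio install dialog.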

05:01
🔍 Collecting Data from a Specific Subreddit

In this section, James Cook details the process of collecting data from a subreddit, specifically the 'Tar Heels' subreddit dedicated to the North Carolina Tar Heels sports teams. He explains the use of the 'find_thread_urls' command to extract the URLs of recent posts from the subreddit, sorted from new to old. The section also discusses the option to define the period of data collection, which can range from the last hour to all available data. Cook demonstrates running the command and explains that an internet connection is required for data collection. He notes the appearance of the 'Tar Heels threads' object in the R environment, which indicates that the command has returned its results.
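
The first round of collection can be sketched as below. The helper name collect_recent_threads, the subreddit name "tarheels", and the object name tarheels_threads are hypothetical and chosen for illustration. The call itself uses the package's find_thread_urls function with its subreddit, sort_by and period arguments.

```r
# Hypothetical helper: collect the newest posts from one subreddit.
collect_recent_threads <- function(subreddit,
                                   period = c("month", "hour", "day",
                                              "week", "year", "all")) {
  # Reject unsupported look-back windows before contacting Reddit.
  period <- match.arg(period)

  RedditExtractoR::find_thread_urls(
    subreddit = subreddit,
    sort_by   = "new",   # newest posts first
    period    = period
  )
}

# Needs the RedditExtractoR package and an internet connection:
# tarheels_threads <- collect_recent_threads("tarheels", period = "month")
# head(tarheels_threads)
```

Wrapping the call in a function is a design choice for this sketch only. Typing the find_thread_urls call directly at the console works the same way.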

10:03
📊 Analyzing Thread and Comment Data

This section covers the thread and comment data collected from the subreddit. Cook describes the structure of the data set, which includes variables such as date, timestamp, title, subreddit, number of comments, and URL. He highlights the importance of the URL variable for accessing the comments on individual posts. The section then explains how to use the 'get_thread_content' command to extract information about comments from the URLs collected in the previous step. Cook walks through running the command and what to expect while it runs, including the stop sign that RStudio displays while R is still processing.
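
A rough sketch of the second round follows. The real call is shown as a comment because it needs an internet connection and can run for a long time. The rest uses a made-up stand-in object so the inspection steps can be tried offline. The object names and the stand-in's column names are assumptions based on the variables this summary describes.

```r
# Real call (needs RedditExtractoR and an internet connection):
# tarheels_comments <- RedditExtractoR::get_thread_content(tarheels_threads$url)

# Made-up stand-in with the same two-element shape, for illustration only.
tarheels_comments <- list(
  threads = data.frame(
    url       = c("https://example.com/post_a", "https://example.com/post_b"),
    title     = c("Post A", "Post B"),
    upvotes   = c(12, 40),
    downvotes = c(2, 5),
    stringsAsFactors = FALSE
  ),
  comments = data.frame(
    url        = c("https://example.com/post_a",
                   "https://example.com/post_a",
                   "https://example.com/post_b"),
    comment    = c("First comment", "A reply", "Another comment"),
    upvotes    = c(3, 1, 8),
    downvotes  = c(0, 0, 1),
    comment_id = c("1", "1_1", "1"),
    stringsAsFactors = FALSE
  )
)

# The result is a list; pull out each data frame by name.
print(names(tarheels_comments))
thread_info  <- tarheels_comments$threads
comment_info <- tarheels_comments$comments

# How many comments were collected for each post?
comments_per_url <- table(comment_info$url)
print(comments_per_url)
```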

15:05
🤖 Interpreting the Collected Data

In the final section, Cook discusses how to interpret the collected data, which now includes both threads and comments. He explains the structure of the 'Tar Heels comments' object, which contains two elements: 'threads' and 'comments'. The 'threads' element offers additional information about the posts, such as upvotes and downvotes, while the 'comments' element provides detailed information about individual comments, including the text, upvotes, downvotes, and comment ID. Cook also touches on the potential for analyzing the relationships between commenters and the dynamics of the discussion. He concludes by encouraging users to explore data extraction and analysis with the RedditExtractoR package in R.
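
One way to study relationships between commenters is to work out who replied to whom. The sketch below assumes that comment IDs encode nesting with underscores (for example, "1_2" is the second reply to top-level comment "1"), and that all rows belong to a single thread. Both points are assumptions to check against real output. The data is made up.

```r
# Made-up comments from one thread; authors and ids are illustrative only.
comments <- data.frame(
  author     = c("alice", "bob", "carol", "dave"),
  comment_id = c("1", "1_1", "1_1_1", "2"),
  stringsAsFactors = FALSE
)

# Parent id = the id with its last "_n" segment removed.
# Top-level comments have no parent.
has_parent <- grepl("_", comments$comment_id)
comments$parent_id <- ifelse(has_parent,
                             sub("_[0-9]+$", "", comments$comment_id),
                             NA_character_)

# Look up the author of each parent comment.
parent_row <- match(comments$parent_id, comments$comment_id)
comments$replied_to <- comments$author[parent_row]

# Edge list: one row per reply, from the replier to the person replied to.
edges <- comments[!is.na(comments$replied_to), c("author", "replied_to")]
print(edges)
```

With data from several threads, the lookup would need to be done within each thread URL, since comment IDs are likely to repeat across threads.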

Keywords
💡Reddit
Reddit is a social media platform that hosts communities known as 'subreddits' where users can post, discuss, and vote on content. In the video, Reddit is the primary source of data for analysis, with the 'Tar Heels' subreddit used as the example to extract data from.
💡RedditExtractoR
RedditExtractoR is an R package used for extracting data from Reddit. It allows users to gather information such as post titles, comments, and user interactions for analysis. The video provides an update on how to use this tool, since it has recently been rewritten.
💡R programming language
R is a programming language and software environment for statistical computing and graphics. It is widely used for data analysis and is the foundation for using the RedditExtractoR package. The video assumes the viewer will be using R to run the data extraction process.
💡RStudio
RStudio is an integrated development environment (IDE) for R. It provides a user-friendly interface for coding in R and helps in managing the development process, including debugging and visualization of data. The video recommends using RStudio for its consistency across different computer systems.
💡Data sets
Data sets are structured collections of data, often used for analysis in research. In the context of the video, data sets are created from the information extracted from Reddit, containing variables such as post titles, comments, and timestamps.
💡Library command
In R, the library command loads a package into the R session. This is necessary to access the functions provided by the package. The video uses the library command to activate the RedditExtractoR package.
💡Data collection
Data collection refers to the process of gathering information and storing it in a format that can be analyzed. In the video, data collection involves using the RedditExtractoR package to retrieve posts and comments from a subreddit.
💡Threads
Threads on Reddit are individual posts, each with a title and any number of comments. In the context of the video, threads are the primary unit of data extracted, with each thread potentially containing a list of comments and other interactions.
💡Comments
Comments on Reddit are user-generated responses to a thread or to other comments. They represent interactions and discussions within the community. The video focuses on extracting and analyzing comment data to understand social dynamics and trends.
💡Data analysis
Data analysis involves systematically processing data in order to draw conclusions from it. In the video, data analysis is the intended use of the extracted Reddit data, where the data sets can be examined for trends, relationships, and insights.
💡Dependencies
Dependencies in software are other packages or libraries that a particular package relies on to function correctly. In the context of the video, installing the RedditExtractoR package also requires installing its dependencies so that the package works as intended.
Highlights

James Cook gives an updated walkthrough of using RedditExtractoR in R to collect data from Reddit.

The RedditExtractoR package has been rewritten, so earlier instructions for using it needed updating.

Initial setup involves downloading R and RStudio for a user-friendly experience.

Installing RedditExtractoR in RStudio is straightforward, with emphasis on installing dependencies.

The library command is used to activate the RedditExtractoR package in R.

Data collection is demonstrated through gathering data from the 'Tar Heels' subreddit.

The 'find_thread_urls' command in RedditExtractoR retrieves URLs for subreddit posts.

Sorting options and time periods can be specified for targeted data collection.

The process collects thread data, including post URLs, for further analysis.

Subsequent commands fetch detailed comments for each thread.

RedditExtractoR enables deeper analysis by capturing comment content and metadata.

Data sets are generated, showcasing threads and comments with various attributes.

Analysis potential includes examining relationships between commenters and comment popularity.

The tutorial emphasizes simplicity and efficiency in using RedditExtractoR for research.

Final thoughts encourage exploration of Reddit data analysis with the updated RedditExtractoR.
