Extracting Reddit Data With R and the package RedditExtractoR (2023 Update)
TLDR
James Cook from the University of Maine at Augusta demonstrates how to use the RedditExtractoR package in R for social media data analysis. He walks through installing R and RStudio, using RedditExtractoR to pull data from Reddit, and turning the results into datasets. The tutorial covers fetching recent post URLs and comments from a subreddit, using the Tar Heels sports subreddit as the example, and emphasizes the package's simplicity and efficiency in data extraction.
Takeaways
- The video is a tutorial on using the RedditExtractoR package in R for data analysis.
- The presenter, James Cook from the University of Maine at Augusta, explains how to obtain datasets from Reddit for analysis.
- The RedditExtractoR package has recently been rewritten, simplifying the process of data extraction.
- The tutorial recommends RStudio for its user-friendly interface and consistency across different computers.
- To begin, install R and RStudio, then install the RedditExtractoR package along with its dependencies.
- The 'library' command in R activates the RedditExtractoR package for use.
- The 'find_thread_urls' command extracts URLs of recent posts from a specified subreddit, with options to sort the results and limit the search period.
- The 'get_thread_content' command fetches information about the comments associated with each URL extracted previously.
- The data extraction process requires an internet connection and may take some time depending on the amount of data.
- The resulting dataset includes variables such as date, timestamp, title, subreddit, number of comments, and URL.
- The video demonstrates how to view and interpret the data, including the structure of threads and comments and the potential for further analysis.
Q & A
What is the purpose of using the RedditExtractoR package in the video?
-The purpose of using the RedditExtractoR package is to obtain data from the social media platform Reddit and convert it into datasets that can be used for various types of analysis.
Which software programs are recommended for use with the RedditExtractoR package?
-The recommended software programs are R and RStudio, with RStudio being particularly user-friendly and consistent across different types of computers.
How does one install the RedditExtractoR package?
-To install the RedditExtractoR package, first download R from the R Project website, then install RStudio, and finally run the 'install.packages' function in R with 'RedditExtractoR' as the argument.
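As a minimal sketch, assuming the current CRAN release of the package, the install and load steps look like this:

```r
# Install RedditExtractoR from CRAN together with its dependencies,
# then load it into the current R session.
install.packages("RedditExtractoR", dependencies = TRUE)
library(RedditExtractoR)
```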
What is the first command used in the video to collect data?
-The first command used in the video to collect data is 'find_thread_urls', which retrieves a list of URLs for the most recent posts in a specified subreddit.
How can the data collection be sorted?
-The order of the results is controlled with the 'sort_by' option; passing 'new', as in the video, returns the posts from newest to oldest.
What is the 'period' option used for in the RedditExtractoR package?
-The 'period' option determines how far back in time the search goes, with options such as 'hour', 'day', 'week', 'month', or 'all'.
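A rough sketch of the call described above, using the argument names from the current version of RedditExtractoR ('subreddit', 'sort_by', and 'period'); the subreddit, sort, and period values are only examples:

```r
library(RedditExtractoR)

# Collect URLs of recent posts from the r/tarheels subreddit,
# ordered newest first and limited to roughly the past month.
tarheels_threads <- find_thread_urls(
  subreddit = "tarheels",
  sort_by   = "new",
  period    = "month"
)
```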
What is the significance of the 'URL' variable in the collected data set?
-The 'URL' variable is significant because it provides the web address for each post, allowing users to access not only the main text of the post but also the comments section for further analysis.
How does the 'get_thread_content' function work in the video?
-The 'get_thread_content' function takes a list of URLs and gathers details about all the comments under each post, creating a new dataset with this additional information.
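A sketch of that step, assuming the thread URLs were saved in a data frame named 'tarheels_threads' with a 'url' column, as in the earlier example:

```r
# Fetch post details and all comments for each collected URL.
# This queries Reddit once per thread, so long URL lists take a while.
tarheels_comments <- get_thread_content(tarheels_threads$url)
```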
What are the two elements in the 'TarHeels comments' object?
-The two elements in the 'TarHeels comments' object are 'threads' and 'comments', with 'threads' containing information about the posts and 'comments' containing the text and details of individual comments.
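A few quick ways to inspect those two elements (a sketch; the exact columns depend on the package version, and 'tarheels_comments' is the example object name used above):

```r
names(tarheels_comments)          # "threads" "comments"
str(tarheels_comments$threads)    # one row per post: title, author, votes, ...
str(tarheels_comments$comments)   # one row per comment: text, votes, comment ID, ...
```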
What types of analysis can be performed with the collected data?
-With the collected data, various types of analysis can be performed, such as examining the popularity of posts through upvotes and downvotes, studying trends over time, and analyzing the relationships between commenters.
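Two small starting points for that kind of analysis, assuming the 'threads' element has 'title', 'comments', and 'upvotes' columns and the 'comments' element has an 'author' column (names may differ across package versions):

```r
threads  <- tarheels_comments$threads
comments <- tarheels_comments$comments

# Which posts drew the most discussion?
head(threads[order(-threads$comments), c("title", "comments", "upvotes")])

# Who commented most often in the collected threads?
head(sort(table(comments$author), decreasing = TRUE), 10)
```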
How long did it take to collect data for the 'TarHeels comments' object in the video?
-It took approximately 21 minutes to collect data for the 'TarHeels comments' object after initiating the data collection process.
Outlines
Introduction to RedditExtractoR
This paragraph introduces James Cook from the University of Maine at Augusta, who explains how to use the RedditExtractoR package in R for data analysis. He emphasizes the simplicity of the process and notes that the package has recently been rewritten. The first step is to start R, with RStudio recommended for its user-friendly features; R can be downloaded from the R Project website and RStudio installed for free. The next step is to install the RedditExtractoR package and activate it with the library command. The paragraph concludes with a brief mention of the two rounds of data collection that will be performed.
Collecting Data from a Specific Subreddit
In this paragraph, James Cook details how to collect data from a subreddit, specifically the 'Tar Heels' subreddit dedicated to the North Carolina Tar Heels sports teams. He uses the 'find_thread_urls' command to extract the URLs of recent posts from the subreddit, sorted from new to old, and discusses the option to define the collection period, which can range from an hour back to all available data. Cook demonstrates running the command, explains that an internet connection is required for data collection, and notes the appearance of the 'Tar Heels threads' object in the R environment, signifying the start of data collection.
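Once the call returns, the 'Tar Heels threads' object is an ordinary data frame that can be inspected directly (a sketch, assuming it was saved as 'tarheels_threads' as in the earlier example):

```r
nrow(tarheels_threads)     # how many post URLs were collected
names(tarheels_threads)    # the variables: date, timestamp, title, subreddit, comments, url, ...
head(tarheels_threads$url) # the URLs that feed the next step
```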
Analyzing Thread and Comment Data
This paragraph delves into the thread and comment data collected from the subreddit. Cook describes the structure of the dataset, which includes variables such as date, timestamp, title, subreddit, number of comments, and URL, and highlights the URL variable as the key to accessing the comments on each post. He then explains how to use the 'get_thread_content' command to extract comment information from the URLs collected in the previous step, providing a step-by-step guide to running the command and describing what to expect during data collection, including the stop sign that indicates R is still processing the data.
Interpreting the Collected Data
In the final paragraph, Cook discusses the interpretation of the collected data, which now include both threads and comments. The 'Tar Heels comments' object contains two elements: 'threads', which offers additional information about the posts such as upvotes and downvotes, and 'comments', which provides details about individual comments, including the text, upvotes, downvotes, and comment ID. Cook also touches on the potential for analyzing the relationships between commenters and the dynamics of the discussion, and concludes by encouraging viewers to explore data extraction and analysis with the RedditExtractoR package in R.
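One hedged sketch of the "relationships between commenters" idea: build an edge list of people who commented under the same post, which could then be handed to a network package such as igraph. This assumes the 'comments' element has 'url' and 'author' columns and that the data were collected as 'tarheels_comments' above.

```r
comments <- tarheels_comments$comments

# Pair up every two distinct authors who commented under the same post.
edges <- do.call(rbind, lapply(split(comments$author, comments$url), function(a) {
  a <- unique(a)
  if (length(a) < 2) return(NULL)
  t(combn(a, 2))          # all author pairs sharing this thread
}))

head(edges)               # a two-column matrix of co-commenter pairs
```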
Keywords
Reddit
RedditExtractoR
R programming language
RStudio
Data sets
Library command
Data collection
Threads
Comments
Data analysis
Dependencies
Highlights
James Cook gives an updated walkthrough of using RedditExtractoR in R to collect data from Reddit.
The RedditExtractoR package has been rewritten, so its usage needed updating.
Initial setup involves downloading R and RStudio for a user-friendly experience.
Installing RedditExtractoR in RStudio is straightforward, with emphasis on installing its dependencies.
The library command is used to activate the RedditExtractoR package in R.
Data collection is demonstrated through gathering data from the 'Tar Heels' subreddit.
The 'find_thread_urls' command in RedditExtractoR retrieves URLs for subreddit posts.
Sorting options and time periods can be specified for targeted data collection.
The process collects thread data, including post URLs, for further analysis.
Subsequent commands fetch detailed comments for each thread.
RedditExtractoR enables deeper analysis by capturing comment content and metadata.
Data sets are generated, showcasing threads and comments with various attributes.
Analysis potential includes examining relationships between commenters and comment popularity.
The tutorial emphasizes simplicity and efficiency in using RedditExtractoR for research.
Final thoughts encourage exploration of Reddit data analysis with the updated RedditExtractoR.