I built my own Reddit API to beat Inflation. Web Scraping for data collection.

Dreams of Code
10 Sept 2023 · 19:42
Educational · Learning
32 Likes · 10 Comments

TLDR

The video discusses the rising costs of APIs and presents a solution by detailing the process of creating an alternative Reddit API. The creator uses web scraping techniques with Playwright and leverages the older version of Reddit to access data. The data is stored cost-effectively using AWS services such as DynamoDB and SQS. The project aims to provide a cheaper alternative to the official Reddit API, with the added benefit of being open and accessible.

Takeaways
  • 🚀 The rising costs of APIs have led developers to seek alternative ways to access data from companies like Reddit and Microsoft.
  • 🌐 The principle of openness on which the World Wide Web was built makes it difficult for companies to restrict access to their data completely.
  • 🛠️ Web scraping is a technique used to extract data from websites, making information intended for human consumption available to machines.
  • 🎨 Playwright and Puppeteer are tools typically used for automated testing of web code but can also be repurposed for data collection through web scraping.
  • 📚 The developer chose to build a Reddit API using the old version of Reddit (old.reddit.com) for its simplicity and easier navigation.
  • 🔍 By inspecting HTML elements, the developer was able to identify the structure of the data needed for the project.
  • 🔄 A recursive algorithm was used to parse the tree structure of comments on Reddit, preserving the context of each comment.
  • 💾 Data storage was a challenge, but using a message queue like AWS SQS allowed for decoupling the database from the scraper.
  • 🌍 AWS Lambda and DynamoDB were chosen for their low cost and scalability, making them suitable for a project with unknown traffic.
  • 🔧 Bright Data's scraping browser was integrated for its residential IP addresses and ability to run multiple concurrent browsers, reducing the chance of being blocked.
  • 📈 The final cost of running the project for three weeks was around 8 cents, primarily due to DynamoDB usage, which was more cost-effective than other options.
Q & A
  • What is the main issue discussed in the script?

    -The main issue discussed is the increasing cost of APIs, specifically mentioning companies like Reddit and Microsoft, and how it affects cash-strapped developers.

  • What does the term 'API' stand for?

    -API stands for Application Programming Interface. However, in the context of the script, it is playfully referred to as Application Programming Inflation due to the rising costs.

  • What is the speaker's solution to the expensive API problem?

    -The speaker's solution is to build their own Reddit API using web scraping techniques to access the same data without the associated costs.

  • Which tools are mentioned for automating data collection through web scraping?

    -The tools mentioned for automating data collection are Playwright and Puppeteer.

  • Why did the speaker choose Playwright over Puppeteer?

    -The speaker chose Playwright over Puppeteer due to better support for pulling out data attributes from HTML elements, which was necessary for the project.
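
To make that concrete, here is a minimal Playwright sketch of pulling data attributes from a post element. The div.thing selector and data-* attribute names reflect old.reddit.com's markup as described elsewhere in this summary; treat them as assumptions rather than the video's exact code.

```typescript
import { chromium } from "playwright";

async function main() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://old.reddit.com/r/programming/");

  // On old Reddit, each post is a div.thing whose metadata
  // lives in data-* attributes on the element itself.
  const firstPost = page.locator("div.thing").first();
  console.log({
    id: await firstPost.getAttribute("data-fullname"),
    subreddit: await firstPost.getAttribute("data-subreddit"),
    author: await firstPost.getAttribute("data-author"),
  });

  await browser.close();
}

main();
```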

  • What is the strategy the speaker used for scraping data from Reddit?

    -The strategy was to pull all posts from the /r/programming subreddit from the last 24 hours, using the old version of Reddit (old.reddit.com) for easier navigation and data retrieval.
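
A sketch of that strategy, assuming the /new listing (newest first) and old Reddit's data-timestamp attribute in milliseconds; a fuller version would also follow the "next" button across pages until it reaches the cutoff:

```typescript
import { chromium } from "playwright";

const DAY_MS = 24 * 60 * 60 * 1000;

async function scrapeLast24Hours() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  // /new/ lists posts newest-first, which makes a time cutoff simple.
  await page.goto("https://old.reddit.com/r/programming/new/");

  const cutoff = Date.now() - DAY_MS;
  const posts: { id: string | null; timestamp: number }[] = [];

  for (const post of await page.locator("div.thing").all()) {
    const raw = await post.getAttribute("data-timestamp");
    if (!raw) continue; // skip promoted/non-post rows without a timestamp
    const ts = Number(raw);
    if (ts < cutoff) break; // newest-first: everything after this is older
    posts.push({ id: await post.getAttribute("data-fullname"), timestamp: ts });
  }

  await browser.close();
  return posts;
}
```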

  • How did the speaker handle the data structure of comments on Reddit?

    -The speaker treated the comments as a tree structure and used a recursive algorithm to parse and pull out the comments and their child comments.
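
One way to express that recursion, running inside the browser via page.evaluate. The selectors (.commentarea, .child, .sitetable) are guesses at old Reddit's comment markup, not taken from the video:

```typescript
import type { Page } from "playwright";

interface RedditComment {
  author: string | null;
  body: string;
  replies: RedditComment[];
}

async function parseCommentTree(page: Page): Promise<RedditComment[]> {
  return page.evaluate(() => {
    // Each comment element may contain a nested listing of child
    // comments, so the parser calls itself on every reply it finds.
    function parse(node: Element): RedditComment {
      return {
        author: node.getAttribute("data-author"),
        body: node.querySelector(":scope > .entry .md")?.textContent?.trim() ?? "",
        replies: Array.from(
          node.querySelectorAll(":scope > .child > .sitetable > .comment"),
        ).map(parse),
      };
    }
    // Top-level comments sit directly under the page's comment area.
    return Array.from(
      document.querySelectorAll(".commentarea > .sitetable > .comment"),
    ).map(parse);
  });
}
```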

  • What database solution did the speaker choose and why?

    -The speaker chose DynamoDB, a NoSQL database from AWS, because it is a low-cost, usage-based database with a generous free tier.

  • How did the speaker decouple the database from the scraper?

    -The speaker decoupled the database from the scraper by writing to an SQS (Simple Queue Service) message queue, allowing the interface to remain constant even if the database technology changes later on.
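
A sketch of that decoupling with the AWS SDK for JavaScript v3. The queue URL environment variable and message shape are placeholders; the video wires the real queue up through Terraform:

```typescript
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// The scraper only knows about the queue; whatever consumes the
// messages (DynamoDB today, something else later) can change freely.
async function publishScrapedItem(item: {
  type: "post" | "comment";
  data: Record<string, unknown>;
}) {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: process.env.SCRAPER_QUEUE_URL!, // placeholder env var
      MessageBody: JSON.stringify(item),
    }),
  );
}
```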

  • What service did the speaker use to overcome the challenges of running a scraper on AWS Lambda?

    -The speaker used Bright Data's scraping browser service, which provided residential IPs and allowed for concurrent browsers to speed up the scraping process.
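
Connecting Playwright to a remote scraping browser is a small change to the launch code. The endpoint below is illustrative; Bright Data supplies the actual wss:// URL and credentials:

```typescript
import { chromium } from "playwright";

async function main() {
  // Connect over the Chrome DevTools Protocol to a browser that runs
  // remotely, so no Chromium binary has to run inside AWS Lambda.
  const browser = await chromium.connectOverCDP(
    process.env.BRIGHTDATA_WS_ENDPOINT!, // e.g. wss://user:pass@host:port (placeholder)
  );
  const page = await browser.newPage();
  await page.goto("https://old.reddit.com/r/programming/new/");
  // ...the rest of the scraper is unchanged...
  await browser.close();
}

main();
```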

  • How did the speaker optimize the scraper to reduce costs with Bright Data?

    -The speaker optimized the scraper by filtering out unnecessary resources like images, videos, CSS, and JavaScript, and by changing the way attributes were pulled out, using a single API call instead of multiple.
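
Both optimizations are straightforward in Playwright. A sketch, with the blocked resource types and data-* attribute names as assumptions:

```typescript
import type { Page } from "playwright";

// Abort requests for resources the scraper never needs (images, video,
// CSS, JS, fonts), cutting the bandwidth billed by the proxy service.
async function blockHeavyResources(page: Page) {
  await page.route("**/*", (route) => {
    const blocked = ["image", "media", "stylesheet", "font", "script"];
    return blocked.includes(route.request().resourceType())
      ? route.abort()
      : route.continue();
  });
}

// Pull all attributes for every post in a single evaluateAll round trip,
// rather than one getAttribute call (and network hop) per attribute.
async function extractPosts(page: Page) {
  return page.locator("div.thing").evaluateAll((nodes) =>
    nodes.map((n) => ({
      id: n.getAttribute("data-fullname"),
      author: n.getAttribute("data-author"),
      subreddit: n.getAttribute("data-subreddit"),
      timestamp: n.getAttribute("data-timestamp"),
      permalink: n.getAttribute("data-permalink"),
    })),
  );
}
```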

  • What were the final components needed to create the speaker's own API?

    -The final components needed were two DynamoDB tables for storing post and comment data, an AWS Lambda function to read from the message queue and write to DynamoDB, and an API created with AWS Lambda and API Gateway to serve the data.
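
A sketch of the queue-to-database Lambda using the AWS SDK v3 document client. Table names and the message shape are placeholders; the video doesn't show its exact schema:

```typescript
import type { SQSEvent } from "aws-lambda";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

const db = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Placeholder table names for the two tables described in the video.
const TABLES = { post: "posts", comment: "comments" };

// Triggered by SQS: each record is one scraped post or comment,
// written to the matching DynamoDB table.
export const handler = async (event: SQSEvent) => {
  for (const record of event.Records) {
    const message = JSON.parse(record.body) as {
      type: "post" | "comment";
      data: Record<string, unknown>;
    };
    await db.send(
      new PutCommand({ TableName: TABLES[message.type], Item: message.data }),
    );
  }
};
```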

Outlines
00:00
💡 The Rising Costs of APIs and a DIY Solution

This paragraph discusses the increasing costs of APIs, exemplified by companies like Reddit and Microsoft, and how this trend is affecting cash-strapped developers. The speaker shares their journey to create a cost-effective alternative to Reddit's API, using web scraping techniques to access the same data without the associated costs. The paragraph highlights the principle of openness on which the World Wide Web was built and the challenges companies face in restricting this openness. The speaker introduces the concept of web scraping as a solution to the problem, explaining how it allows for the extraction of data from web pages that are intended for human consumption.

05:01
πŸ› οΈ Building a Reddit API through Web Scraping

The speaker delves into the technical process of building their own Reddit API by web scraping. They describe choosing the Playwright tool over Puppeteer for its better support in extracting data attributes from HTML elements. The strategy involves scraping data from the /r/programming subreddit, targeting posts from the last 24 hours. The speaker explains the process of inspecting HTML elements to identify the required data and writing code to extract this information. They also discuss the challenges of dealing with the logged-out version of Reddit and how using the older version of the site (old.reddit.com) provided a more straightforward solution.

10:01
📊 Storing and Managing Scraped Data

In this paragraph, the speaker addresses the challenge of storing the scraped data. They discuss the need for a cost-effective, usage-based database solution with a generous free tier, ultimately choosing Amazon's DynamoDB. The speaker also introduces the concept of decoupling the database from the scraper by using a message queue, specifically AWS SQS, to maintain a consistent interface for data storage. The process of setting up the SQS queue using Terraform and AWS is detailed, along with the integration of the scraper with the SQS queue to ensure no data loss during the development of the rest of the architecture.
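
The video provisions this infrastructure with Terraform; as a rough equivalent, here is what creating one of the tables looks like through the AWS SDK, with on-demand billing to match the usage-based pricing the speaker wanted. The table name and key schema are assumptions:

```typescript
import { DynamoDBClient, CreateTableCommand } from "@aws-sdk/client-dynamodb";

async function createPostsTable() {
  const client = new DynamoDBClient({});
  await client.send(
    new CreateTableCommand({
      TableName: "posts", // placeholder name
      BillingMode: "PAY_PER_REQUEST", // pay only for what is used
      AttributeDefinitions: [{ AttributeName: "id", AttributeType: "S" }],
      KeySchema: [{ AttributeName: "id", KeyType: "HASH" }],
    }),
  );
}
```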

15:03
🚀 Deploying the Scraper and Creating an API

The speaker describes the deployment of the scraper to AWS Lambda and the challenges faced, including the need for significant RAM due to the use of Playwright and the risk of being blocked by Reddit. They discuss the solution of using Bright Data's scraping browser, which provides residential IPs and allows for concurrent browsers to speed up the scraping process. The speaker explains the optimization of data usage and processing with Bright Data, and the integration with the service. They also cover the deployment of the scraper to AWS Lambda, the creation of DynamoDB tables for storing post and comment data, and the development of an API using AWS Lambda and API Gateway. The speaker concludes with the costs associated with running the project and a recommendation for Bright Data's services.
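
A sketch of the read side: a Lambda behind API Gateway that returns stored posts as JSON. A Scan keeps the example short; a real table would more likely be queried by subreddit and date. Names are placeholders:

```typescript
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, ScanCommand } from "@aws-sdk/lib-dynamodb";

const db = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Serves the scraped posts over API Gateway as a JSON response.
export const handler = async (
  _event: APIGatewayProxyEvent,
): Promise<APIGatewayProxyResult> => {
  const result = await db.send(new ScanCommand({ TableName: "posts" }));
  return {
    statusCode: 200,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(result.Items ?? []),
  };
};
```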

Keywords
💡API
API, or Application Programming Interface, is a set of rules and protocols for building and interacting with software applications. In the context of the video, it refers to the programmatic access to data from platforms like Reddit and Microsoft, which has become increasingly expensive. The video discusses creating an alternative to the official Reddit API to circumvent these costs.
💡Web Scraping
Web scraping is a technique used to extract data from websites by automating the navigation and reading of web pages. In the video, the speaker uses web scraping to gather data from Reddit, as it provides a way to access the same information without the costs associated with using the official API.
💡Node.js
Node.js is a cross-platform, open-source JavaScript runtime environment that allows developers to run JavaScript code outside of a web browser. In the video, the speaker uses Node.js to create a new project and leverage the capabilities of libraries like Playwright and Puppeteer for web scraping.
💡Playwright
Playwright is an open-source tool for automating web browsers. It is used in the video to automate data collection from Reddit by interacting with the website's HTML elements and extracting the desired data.
💡Puppeteer
Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. Although not used in the video, it is mentioned as an alternative to Playwright for web scraping and automated testing of web code.
💡Reddit
Reddit is a social news aggregation, web content rating, and discussion website. In the video, Reddit serves as the primary source of data that the speaker aims to access and scrape without incurring the costs of the official Reddit API.
💡Data Attributes
Data attributes in HTML are custom attributes that can be used to store extra information about an element or to store values to be used in JavaScript. In the video, the speaker examines the HTML elements on Reddit to identify the data attributes that contain the information needed for the scraper.
💡Message Queue
A message queue is a communication method used in computer systems to transfer messages between processes or applications. In the video, the speaker uses Amazon SQS, a managed message queuing service, to decouple the scraper from the database and ensure that data is not lost during the database implementation process.
💡DynamoDB
DynamoDB is a NoSQL database service provided by Amazon Web Services (AWS) that offers fast and predictable performance with seamless scalability. In the video, the speaker chooses DynamoDB for storing the scraped data due to its low cost, usage-based pricing, and generous free tier.
💡Lambda
AWS Lambda is a serverless computing service that runs code in response to events and automatically manages the computing resources required to run that code. The speaker deploys the scraper to AWS Lambda to run periodically, aiming for a cost-effective solution for executing the web scraping task.
💡Bright Data
Bright Data is a company that provides web scraping and proxy solutions. In the video, the speaker uses Bright Data's scraping browser product to run multiple concurrent browsers with residential IPs, reducing the chance of being blocked by Reddit and removing the need to run an instance of Chrome inside AWS Lambda.
Highlights

API costs are increasing, with companies like Reddit and Microsoft raising prices for programmatic access to their data.

The term API has shifted from 'application programming interface' to 'application programming inflation' due to rising costs.

A cash-strapped developer seeks to build a cost-effective alternative to Reddit's API using web scraping techniques.

The World Wide Web was built on the principle of openness, making it difficult for companies to restrict access to their data.

Web scraping is a technique to extract data from websites, which can be used as an alternative to expensive APIs.

Playwright and Puppeteer are tools typically used for automated testing but can also be utilized for web scraping.

The developer chose Playwright for its superior support in extracting data attributes from HTML elements.

Older versions of websites like old.reddit.com have less JavaScript and are easier to navigate for scraping purposes.

The scraping strategy involved extracting posts from the /r/programming subreddit over the last 24 hours.

The developer overcame challenges such as dealing with new tabs opening and the need to scroll to load more content.

Data attributes of interest included post ID, subreddit, timestamp, author, and URL to comments.

A recursive algorithm was used to parse the tree structure of comments on Reddit, capturing the context of each comment.

The developer opted for a message queue (SQS) to store data, allowing for future database changes without altering the scraper code.

AWS DynamoDB was chosen as a cost-effective, usage-based database with a generous free tier.

Bright Data's scraping browser product was used for its residential IPs and ability to run multiple concurrent browsers.

Data processing was optimized by filtering unnecessary resources and consolidating API calls.

The scraper was deployed to AWS Lambda, with a CloudWatch event to invoke it hourly.

An API was created using AWS Lambda and API Gateway, providing a low-cost alternative to the official Reddit API.

The project cost approximately 8 cents to run over three weeks, primarily due to DynamoDB usage.

The developer's homemade Reddit API allows for unlimited access, unlike the official API which has restrictions.
