I built my own Reddit API to beat Inflation. Web Scraping for data collection.
TLDR
The video discusses the rising costs of APIs and presents a solution by detailing the process of creating an alternative Reddit API. The creator uses web scraping techniques with Playwright and leverages older versions of Reddit to access data. The data is stored in a cost-effective manner using AWS services such as DynamoDB and SQS. The project aims to provide a cheaper alternative to the official Reddit API, with the added benefit of being open and accessible.
Takeaways
- The rising costs of APIs have led developers to seek alternative ways to access data from platforms like Reddit and Microsoft.
- The principle of openness on which the World Wide Web was built makes it difficult for companies to restrict access to their data completely.
- Web scraping is a technique used to extract data from websites, making information intended for human consumption available to machines.
- Playwright and Puppeteer are tools typically used for automated testing of web code but can also be repurposed for data collection through web scraping.
- The developer chose to build a Reddit API using the old version of Reddit (old.reddit.com) for its simplicity and easier navigation.
- By inspecting HTML elements, the developer was able to identify the structure of the data needed for the project.
- A recursive algorithm was used to parse the tree structure of comments on Reddit, preserving the context of each comment.
- Data storage was a challenge, but using a message queue like AWS SQS allowed for decoupling the database from the scraper.
- AWS Lambda and DynamoDB were chosen for their low cost and scalability, making them suitable for a project with unknown traffic.
- Bright Data's scraping browser was integrated for its residential IP addresses and ability to run multiple concurrent browsers, reducing the chance of being blocked.
- The final cost of running the project for three weeks was around 8 cents, primarily due to DynamoDB usage, which was more cost-effective than other options.
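The recursive comment parsing mentioned in the takeaways can be sketched roughly as follows. This is a minimal illustration in Python rather than the Node.js stack the video mentions, and the `Comment` class, its fields, and the `flatten` function are hypothetical names chosen for the example, not the creator's actual code. Each comment is visited depth-first, and its parent ID and depth are recorded so the context of every reply is preserved.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Comment:
    """Hypothetical shape of a scraped comment; field names are illustrative."""
    id: str
    author: str
    text: str
    children: List["Comment"] = field(default_factory=list)


def flatten(comment: Comment, parent_id: Optional[str] = None, depth: int = 0) -> list:
    """Walk the comment tree depth-first, recording each comment's parent and depth."""
    rows = [{
        "id": comment.id,
        "parent_id": parent_id,
        "depth": depth,
        "author": comment.author,
        "text": comment.text,
    }]
    for child in comment.children:
        # Recurse into replies, passing this comment's ID down as their parent.
        rows.extend(flatten(child, parent_id=comment.id, depth=depth + 1))
    return rows


# Made-up sample thread for demonstration only.
thread = Comment("c1", "alice", "Top-level comment", [
    Comment("c2", "bob", "First reply", [
        Comment("c3", "carol", "Nested reply"),
    ]),
    Comment("c4", "dave", "Second reply"),
])

rows = flatten(thread)
for row in rows:
    print("  " * row["depth"] + f'{row["author"]}: {row["text"]}')
```

Storing `parent_id` on every row means the tree can be rebuilt later from flat database records.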
Q & A
What is the main issue discussed in the script?
-The main issue discussed is the increasing cost of APIs, specifically mentioning companies like Reddit and Microsoft, and how it affects cash-strapped developers.
What does the term 'API' stand for?
-API stands for Application Programming Interface. However, in the context of the script, it is playfully referred to as Application Programming Inflation due to the rising costs.
What is the speaker's solution to the expensive API problem?
-The speaker's solution is to build their own Reddit API using web scraping techniques to access the same data without the associated costs.
Which tools are mentioned for automating data collection through web scraping?
-The tools mentioned for automating data collection are Playwright and Puppeteer.
Why did the speaker choose Playwright over Puppeteer?
-The speaker chose Playwright over Puppeteer due to better support for pulling out data attributes from HTML elements, which was necessary for the project.
What is the strategy the speaker used for scraping data from Reddit?
-The strategy was to pull out all posts from the /r/programming subreddit for the last 24 hours using the old version of Reddit (old.reddit.com) for easier navigation and data retrieval.
How did the speaker handle the data structure of comments on Reddit?
-The speaker treated the comments as a tree structure and used a recursive algorithm to parse and pull out the comments and their child comments.
What database solution did the speaker choose and why?
-The speaker chose DynamoDB, a NoSQL database from AWS, because it is a low-cost, usage-based database with a generous free tier.
How did the speaker decouple the database from the scraper?
-The speaker decoupled the database from the scraper by writing to an SQS (Simple Queue Service) message queue, allowing the interface to remain constant even if the database technology changes later on.
What service did the speaker use to overcome the challenges of running a scraper on AWS Lambda?
-The speaker used Bright Data's scraping browser service, which provided residential IPs and allowed for concurrent browsers to speed up the scraping process.
How did the speaker optimize the scraper to reduce costs with Bright Data?
-The speaker optimized the scraper by filtering out unnecessary resources like images, videos, CSS, and JavaScript, and by changing the way attributes were pulled out, using a single API call instead of multiple.
What were the final components needed to create the speaker's own API?
-The final components needed were two DynamoDB tables for storing post and comment data, an AWS Lambda function to read from the message queue and write to DynamoDB, and an API created with AWS Lambda and API Gateway to serve the data.
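The queue-reading Lambda described in the last answer might look something like the sketch below. It is written in Python for illustration and assumes a message envelope with `type` and `data` keys, which is an invention for this example; the `Records` and `body` keys follow the event shape AWS Lambda receives from SQS. The `write` hook is a hypothetical stand-in for a real DynamoDB write.

```python
import json


def route_records(event: dict) -> dict:
    """Split an SQS-triggered Lambda event into per-table item lists."""
    items = {"posts": [], "comments": []}
    for record in event.get("Records", []):
        # Each SQS record carries the original message as a string in "body".
        message = json.loads(record["body"])
        # The "type"/"data" envelope is an assumption made for this sketch.
        table = "posts" if message["type"] == "post" else "comments"
        items[table].append(message["data"])
    return items


def handler(event, context, write=print):
    """Lambda-style entry point. `write` is a hypothetical hook; a real version
    might call a boto3 DynamoDB Table's put_item(Item=row) instead."""
    items = route_records(event)
    for table, rows in items.items():
        for row in rows:
            write(table, row)
    return {"written": sum(len(rows) for rows in items.values())}


# Made-up sample event for demonstration only.
sample_event = {
    "Records": [
        {"body": json.dumps({"type": "post", "data": {"id": "t3_abc"}})},
        {"body": json.dumps({"type": "comment",
                             "data": {"id": "t1_def", "post_id": "t3_abc"}})},
    ]
}
result = handler(sample_event, None)
print(result)
```

Keeping the routing logic in a plain function makes it testable locally without any AWS resources.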
Outlines
The Rising Costs of APIs and a DIY Solution
This paragraph discusses the increasing costs of APIs, exemplified by companies like Reddit and Microsoft, and how this trend is affecting cash-strapped developers. The speaker shares their journey to create a cost-effective alternative to Reddit's API, using web scraping techniques to access the same data without the associated costs. The paragraph highlights the principle of openness on which the World Wide Web was built and the challenges companies face in restricting this openness. The speaker introduces the concept of web scraping as a solution to the problem, explaining how it allows for the extraction of data from web pages that are intended for human consumption.
Building a Reddit API through Web Scraping
The speaker delves into the technical process of building their own Reddit API by web scraping. They describe choosing the Playwright tool over Puppeteer for its better support in extracting data attributes from HTML elements. The strategy involves scraping data from the /r/programming subreddit, targeting posts from the last 24 hours. The speaker explains the process of inspecting HTML elements to identify the required data and writing code to extract this information. They also discuss the challenges of dealing with the logged-out version of Reddit and how using the older version of the site (old.reddit.com) provided a more straightforward solution.
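To make the attribute-extraction step concrete, here is a minimal sketch that pulls `data-*` attributes out of post elements and keeps only posts from the last 24 hours. It deliberately swaps in Python's standard-library `html.parser` on a static snippet instead of Playwright, so it runs without a browser. The attribute names (`data-fullname`, `data-timestamp`, and so on), the `thing` class, and the sample HTML are assumptions modeled on the fields the summary lists and should be checked against the live page.

```python
from datetime import datetime, timedelta, timezone
from html.parser import HTMLParser

# Assumed attribute names mapped to the fields the summary mentions
# (post ID, subreddit, timestamp, author, comments URL).
WANTED = {
    "data-fullname": "id",
    "data-subreddit": "subreddit",
    "data-timestamp": "timestamp_ms",
    "data-author": "author",
    "data-permalink": "comments_url",
}


class PostParser(HTMLParser):
    """Collects data-* attributes from every element whose class includes 'thing'."""

    def __init__(self):
        super().__init__()
        self.posts = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "thing" not in classes:
            return
        self.posts.append({name: attr_map.get(attr) for attr, name in WANTED.items()})


def posts_since(posts, now, window=timedelta(hours=24)):
    """Keep posts whose millisecond timestamp falls within `window` before `now`."""
    cutoff_ms = (now - window).timestamp() * 1000
    return [p for p in posts if int(p["timestamp_ms"]) >= cutoff_ms]


# Made-up sample markup for demonstration only.
SAMPLE_HTML = """
<div class="thing link" data-fullname="t3_abc" data-subreddit="programming"
     data-timestamp="1700000000000" data-author="alice"
     data-permalink="/r/programming/comments/abc/example/"></div>
<div class="thing link" data-fullname="t3_old" data-subreddit="programming"
     data-timestamp="1690000000000" data-author="bob"
     data-permalink="/r/programming/comments/old/example/"></div>
"""

parser = PostParser()
parser.feed(SAMPLE_HTML)
now = datetime.fromtimestamp(1700003600, tz=timezone.utc)  # one hour after the first post
recent = posts_since(parser.posts, now)
print([p["id"] for p in recent])
```

Reading several attributes from one element in a single pass mirrors the idea of fetching all attributes at once rather than issuing one call per attribute.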
Storing and Managing Scraped Data
In this paragraph, the speaker addresses the challenge of storing the scraped data. They discuss the need for a cost-effective, usage-based database solution with a generous free tier, ultimately choosing Amazon's DynamoDB. The speaker also introduces the concept of decoupling the database from the scraper by using a message queue, specifically AWS SQS, to maintain a consistent interface for data storage. The process of setting up the SQS queue using Terraform and AWS is detailed, along with the integration of the scraper with the SQS queue to ensure no data loss during the development of the rest of the architecture.
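The decoupling idea can be illustrated with a small sketch in which the scraper depends only on a "send a message" interface. This is written in Python for illustration; `MessageSink`, `InMemoryQueue`, and `publish_post` are hypothetical names, and the in-memory queue is a local stand-in for SQS rather than the creator's setup.

```python
import json
from typing import Protocol


class MessageSink(Protocol):
    """The only interface the scraper depends on (a hypothetical name)."""

    def send(self, body: str) -> None: ...


class InMemoryQueue:
    """Local stand-in for a real queue. A production sink could instead wrap
    boto3's SQS client and call send_message(QueueUrl=..., MessageBody=body)."""

    def __init__(self):
        self.messages = []

    def send(self, body: str) -> None:
        self.messages.append(body)


def publish_post(sink: MessageSink, post: dict) -> None:
    """Serialize a scraped post and hand it to whatever sink is configured."""
    # The "type"/"data" envelope is an assumption made for this sketch.
    sink.send(json.dumps({"type": "post", "data": post}, sort_keys=True))


queue = InMemoryQueue()
publish_post(queue, {"id": "t3_abc", "author": "alice", "subreddit": "programming"})
print(queue.messages[0])
```

Because the scraper never touches the database directly, the storage layer behind the queue can change without any edits to the scraping code.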
Deploying the Scraper and Creating an API
The speaker describes the deployment of the scraper to AWS Lambda and the challenges faced, including the need for significant RAM due to the use of Playwright and the risk of being blocked by Reddit. They discuss the solution of using Bright Data's scraping browser, which provides residential IPs and allows for concurrent browsers to speed up the scraping process. The speaker explains the optimization of data usage and processing with Bright Data, and the integration with the service. They also cover the deployment of the scraper to AWS Lambda, the creation of DynamoDB tables for storing post and comment data, and the development of an API using AWS Lambda and API Gateway. The speaker concludes with the costs associated with running the project and a recommendation for Bright Data's services.
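The resource-filtering optimization can be sketched as a simple decision function. This is an illustrative Python example, not the creator's code: the function names and the sample sizes are made up, while the type strings follow Playwright's request resource types (`image`, `media`, `stylesheet`, `script`).

```python
# Resource types to skip, following the summary's list (images, videos, CSS, JavaScript).
# "media" is the Playwright resource type that covers video and audio.
BLOCKED_TYPES = {"image", "media", "stylesheet", "script"}


def should_block(resource_type: str) -> bool:
    """Return True for requests the scraper does not need to download."""
    return resource_type in BLOCKED_TYPES


def bytes_saved(requests) -> int:
    """Sum the sizes of requests that would be aborted. `requests` is a list of
    (resource_type, size_in_bytes) pairs; the sizes below are made-up sample data."""
    return sum(size for rtype, size in requests if should_block(rtype))


sample = [
    ("document", 40_000),
    ("image", 250_000),
    ("script", 120_000),
    ("stylesheet", 30_000),
]
print(bytes_saved(sample))

# With Playwright for Python, this could be wired up roughly like so (not run here):
#   page.route("**/*", lambda route: route.abort()
#              if should_block(route.request.resource_type)
#              else route.continue_())
```

Since a proxy service of this kind typically bills by data transferred, skipping heavy resources that the parser never reads is a direct cost saving.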
Keywords
API
Web Scraping
Node.js
Playwright
Puppeteer
Reddit
Data Attributes
Message Queue
DynamoDB
Lambda
Bright Data
Highlights
API costs are increasing, with companies like Reddit and Microsoft raising prices for programmatic access to their data.
The video jokes that API now stands for 'application programming inflation' rather than 'application programming interface' because of rising costs.
A cash-strapped developer seeks to build a cost-effective alternative to Reddit's API using web scraping techniques.
The World Wide Web was built on the principle of openness, making it difficult for companies to restrict access to their data.
Web scraping is a technique to extract data from websites, which can be used as an alternative to expensive APIs.
Playwright and Puppeteer are tools typically used for automated testing but can also be utilized for web scraping.
The developer chose Playwright for its superior support in extracting data attributes from HTML elements.
Older versions of websites like old.reddit.com have less JavaScript and are easier to navigate for scraping purposes.
The scraping strategy involved extracting posts from the /r/programming subreddit over the last 24 hours.
The developer overcame challenges such as dealing with new tabs opening and the need to scroll to load more content.
Data attributes of interest included post ID, subreddit, timestamp, author, and URL to comments.
A recursive algorithm was used to parse the tree structure of comments on Reddit, capturing the context of each comment.
The developer opted for a message queue (SQS) to store data, allowing for future database changes without altering the scraper code.
AWS DynamoDB was chosen as a cost-effective, usage-based database with a generous free tier.
Bright Data's scraping browser product was used for its residential IPs and ability to run multiple concurrent browsers.
Data processing was optimized by filtering unnecessary resources and consolidating API calls.
The scraper was deployed to AWS Lambda, with a CloudWatch event to invoke it hourly.
An API was created using AWS Lambda and API Gateway, providing a low-cost alternative to the official Reddit API.
The project cost approximately 8 cents to run over three weeks, primarily due to DynamoDB usage.
The developer's homemade Reddit API allows for unlimited access, unlike the official API which has restrictions.