Big Data Problems: Crash Course Statistics #39
TLDR
This video discusses potential issues with big data, like bias and privacy concerns. It gives examples of algorithms inadvertently learning biased associations, leading to unfair or inaccurate outputs. Privacy is also a major concern as more personal data is collected, with questions around access, usage, and security. Possible solutions are presented, like requiring algorithmic transparency, implementing data protection regulations, and anonymizing data. Overall, excitement for big data's potential should be balanced with caution about its downsides, so that it can be used judiciously and ethically.
Takeaways
- 😱 Bias can be inadvertently introduced into algorithms trained on big data
- 😎 Garbage in, garbage out - bad input data leads to bad output decisions
- 😞 Biased algorithms can negatively impact people's lives, like in sentencing decisions
- 👀 Lack of algorithmic transparency makes bias harder to detect
- 🤐 Lots of personal data is collected, often without consent or knowledge
- 😤 Privacy laws try to protect people's data and inform them of usage
- 🌟 Anonymization techniques like k-anonymity help share data while preserving privacy
- 😨 Hacks and data breaches put people's information at risk
- 🤔 Companies must balance data sharing and privacy protections
- 😃 Big data offers opportunities to advance science and society if used responsibly
Q & A
What was the main finding from the investigation into the COMPAS algorithm by ProPublica?
-ProPublica found that the COMPAS algorithm falsely labeled black defendants as likely future criminals at almost twice the rate of white defendants.
How can bias enter into algorithms created using big data?
-Bias can enter algorithms when the training data used contains inherent biases. For example, if images used to train an image recognition algorithm contain more white faces than black faces, the algorithm may be more accurate at recognizing white faces.
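To make this concrete, here is a minimal sketch with synthetic data (the two groups, the cluster centers, and the nearest-neighbor model are illustrative assumptions, not details from the video): a classifier trained on a sample that underrepresents one group tends to be less accurate on that group.

```python
import random

random.seed(0)

def make_group(label, n, center):
    # Two synthetic features clustered around a group-specific center.
    return [((random.gauss(center, 1.0), random.gauss(center, 1.0)), label)
            for _ in range(n)]

# Skewed training set: 90 examples from group A, only 10 from group B.
train = make_group("A", 90, 0.0) + make_group("B", 10, 1.5)

# Balanced test set: 100 examples from each group.
test = make_group("A", 100, 0.0) + make_group("B", 100, 1.5)

def predict(point):
    # 1-nearest-neighbor: copy the label of the closest training example.
    px, py = point
    nearest = min(train, key=lambda ex: (ex[0][0] - px) ** 2 + (ex[0][1] - py) ** 2)
    return nearest[1]

for group in ("A", "B"):
    members = [ex for ex in test if ex[1] == group]
    correct = sum(predict(ex[0]) == ex[1] for ex in members)
    print(f"accuracy on group {group}: {correct / len(members):.2f}")
```

Running this typically prints noticeably lower accuracy for group B: the model isn't malicious, it has simply seen too few examples of B to separate it reliably.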
What does the GDPR law aim to address?
-The General Data Protection Regulation (GDPR) addresses privacy concerns related to the use of big data. It requires companies to be more transparent about what user data they are collecting and who has access to it.
What does k-anonymity mean?
-K-anonymity is a property used to protect privacy in shared datasets. A dataset is k-anonymous if every combination of identifying characteristics (quasi-identifiers, like zip code and age range) is shared by at least k records, so no individual can be distinguished from at least k-1 others. This helps keep individual data private while still allowing the data to be shared.
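To make the definition concrete, here is a minimal sketch in Python (the records, column names, and generalization scheme are hypothetical, not from the video) that computes the k-anonymity level of a small dataset:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    # The k-anonymity level is the size of the smallest group of records
    # that share identical values for every quasi-identifier.
    groups = Counter(tuple(r[qi] for qi in quasi_identifiers) for r in records)
    return min(groups.values())

# Toy medical records with generalized (coarsened) quasi-identifiers.
records = [
    {"zip": "440**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "440**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "441**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "441**", "age": "40-49", "diagnosis": "diabetes"},
]

# Every (zip, age) combination is shared by at least 2 records, so this
# dataset is 2-anonymous: no one can be narrowed down past 2 people.
print(k_anonymity(records, ["zip", "age"]))  # -> 2
```

In practice, a dataset is generalized (e.g., masking trailing zip-code digits or bucketing ages into ranges) until the smallest group reaches the desired k.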
How was the suspected Golden State Killer identified using DNA and genealogy databases?
-Investigators took DNA from a crime scene and looked for partial matches in public genealogy databases. This allowed them to identify relatives of the perpetrator, which ultimately led them to identify Joseph James DeAngelo as the suspect.
What is an example of a data breach mentioned in the video?
-Examples of major data breaches mentioned include the Equifax breach in 2017, the iCloud celebrity photo leak in 2014, and the Ashley Madison breach exposing users' private data.
What does the phrase 'garbage in, garbage out' mean in relation to algorithms?
-It means that if an algorithm is given low-quality, biased, or inappropriate input data, it will produce meaningless, biased, or garbage outputs. The quality of the input data is critical to producing meaningful outputs.
What laws aim to protect children's privacy and data collection?
-The Children's Online Privacy Protection Act (COPPA) protects the privacy of children under 13 by requiring verifiable parental consent before their data is collected and by restricting the use of that data for targeted advertising.
What are some responsibilities of companies collecting user data?
-Responsibilities include securely storing user data, protecting it from unauthorized access or hacking, being transparent about data collection and use policies, allowing users control over their data, and properly handling breaches if they occur.
How can transparency around algorithms and big data analysis benefit society?
-Algorithmic transparency would allow biases to be recognized and addressed. Understanding what algorithms are doing allows us to use big data analysis responsibly and ensure decisions influenced by algorithms are fair and unbiased.
Outlines
😊 Introducing issues around bias, transparency, and privacy with big data algorithms
This paragraph introduces some of the potential downsides and ethical concerns of using big data and algorithms: inadvertently introducing bias into algorithms through the data used to create them, lacking transparency into how complex algorithms make decisions, and raising privacy questions about what personal data is collected and shared.
😕 Examples of bias and lack of transparency in algorithms
This paragraph provides examples of bias being introduced into algorithms, like the COMPAS recidivism prediction tool exhibiting racial bias. It also discusses the difficulty in auditing algorithms to understand their reasoning due to their complexity or proprietary nature.
😳 Privacy concerns and laws around use of personal data
This paragraph covers privacy issues related to collection and use of personal data, providing examples like genetic testing companies sharing data. It mentions privacy laws like GDPR and COPPA, but notes there are still many open questions around ethical use of data.
Keywords
💡Bias
💡Privacy
💡Transparency
💡Accountability
💡Data breaches
💡Medical research
💡DNA databases
💡Garbage in, garbage out
💡k-anonymity
💡Encryption
Highlights
Big data algorithms can inadvertently introduce bias based on the data they are trained on
Biased data inputs lead to biased algorithmic outputs - "garbage in, garbage out"
Lack of algorithmic transparency makes it difficult to understand how algorithms arrive at decisions
EU's GDPR law requires transparency around companies' data collection and usage
US Children's Online Privacy Protection Act limits how kids' data can be collected/used
K-anonymity protects privacy by ensuring multiple subjects share the same characteristics
DNA database GEDmatch was used by police to identify the Golden State Killer through relatives' data
23andMe shares customer DNA data with medical researchers while allowing opt in/out
Large-scale security breaches expose personal data, as in the Equifax, Yahoo, and Ashley Madison hacks
Companies that collect data have responsibility to protect it, but policies are still developing
Excitement over big data's potential shouldn't crowd out caution about privacy, security, and bias
Solutions are needed for big data's problems: bias, lack of transparency, and privacy concerns
Must ensure big data is used responsibly for social good, not harm
The biased COMPAS algorithm judged black defendants more likely to reoffend than comparable white defendants
An image classifier that seemed to detect wolves was actually keying on snow in the photo backgrounds, not wolf traits