Tech giants like Apple, Nvidia, and Salesforce have been secretly using subtitles from thousands of YouTube videos to train their AI models, despite YouTube’s rules against such practices. The dataset, named YouTube Subtitles, includes content from educational channels like Khan Academy and even popular shows like The Late Show and Last Week Tonight. It also contains videos from YouTube celebrities such as MrBeast and PewDiePie. Creators like David Pakman and channels like Crash Course were unaware their content was being used, sparking concerns about fair compensation and permission. This controversy highlights broader ethical questions about AI development and the rights of content creators in the digital age.
Tech companies are using questionable methods, frequently without the creators’ knowledge, to feed their data-hungry artificial intelligence models. These methods involve sucking up books, websites, images, and social network posts.
While most AI companies keep their training data sources secret, a Proof News investigation found that some of the world’s wealthiest AI companies have been using content from thousands of YouTube videos for AI training. Despite YouTube’s policies prohibiting unapproved extraction of content from the platform, the companies did so anyway, Wired reports.
Our analysis revealed that major Silicon Valley players, including Anthropic, Nvidia, Apple, and Salesforce, used subtitles from 173,536 YouTube videos drawn from more than 48,000 channels.
Video transcripts from educational and online learning channels like Khan Academy, MIT, and Harvard are included in the YouTube Subtitles collection. Videos from the Wall Street Journal, NPR, BBC, The Late Show With Stephen Colbert, Last Week Tonight With John Oliver, and Jimmy Kimmel Live were also used to train AI.
YouTube megastars MrBeast (289 million subscribers, two videos taken for training), Marques Brownlee (19 million subscribers, seven videos taken), Jacksepticeye (nearly 31 million subscribers, 377 videos taken), and PewDiePie (111 million subscribers, 337 videos taken) were among the celebrities whose content Proof News discovered in the dataset. Some of the training material also promoted conspiracies such as the “flat-earth theory.”
Proof News developed a tool to search for creators in the YouTube AI training dataset.
David Pakman, the host of The David Pakman Show, a left-leaning politics channel with over 2 million subscribers and over 2 billion views, said, “No one came to me and said, ‘We would like to use this.'” Nearly 160 of his videos were included in the YouTube Subtitles training dataset.
The four full-time employees of Pakman’s company post many videos every day in addition to creating content for TikTok, podcasts, and other platforms. Pakman said that if AI companies profit from his data, he should be paid for its use, noting that several media companies have recently signed contracts requiring payment for the use of their work to train AI.
“This is my livelihood, and I put time, resources, money, and staff time into creating this content,” Pakman said. “There’s really no shortage of work.”
Dave Wiskus, the CEO of Nebula, a streaming service partially owned by its creators, some of whose work has been taken from YouTube to train AI, called the practice “theft.”
Wiskus stated that using artists’ creations without permission is “disrespectful,” particularly in light of the possibility that studios will utilize “generative AI to replace as many of the artists along the way as they can.”
“Will this be used to exploit and harm artists? Yes, absolutely,” Wiskus said.
The makers of the dataset, EleutherAI, did not reply to inquiries about Proof News’ findings, including claims that videos were used without authorization. According to its website, the nonprofit’s main objective is to make AI development more accessible to people outside Big Tech. It has pursued this in the past by giving people “access to cutting-edge AI technologies by training and releasing models.”
The YouTube Subtitles dataset contains only the plain text of video subtitles; it frequently includes translations into other languages, such as Arabic, German, and Japanese.
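The reduction described above, timed caption segments stripped down to plain text, can be sketched in a few lines. The segment format here (dicts with `text`, `start`, and `duration` keys) is an assumption modeled on common caption-download tools, not the dataset's actual pipeline.

```python
# Minimal sketch: flattening timed caption segments into the kind of
# plain text a dataset like YouTube Subtitles stores. The segment
# format is an assumption, not the dataset's real schema.

def flatten_segments(segments):
    """Join caption segment texts into one plain-text transcript,
    discarding all timing information."""
    return " ".join(seg["text"].strip() for seg in segments if seg["text"].strip())

segments = [
    {"text": "Hello and welcome", "start": 0.0, "duration": 2.1},
    {"text": "to the channel.", "start": 2.1, "duration": 1.8},
]
print(flatten_segments(segments))  # -> Hello and welcome to the channel.
```

Once flattened this way, the text carries no trace of the video itself, which is part of why creators had no way to know their captions had been collected.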
According to a paper published by EleutherAI, the dataset is part of a compilation the nonprofit calls the Pile. In addition to YouTube subtitles, the Pile’s creators incorporated content from the English Wikipedia, the European Parliament, and a vast collection of Enron Corporation staff emails made public during a federal investigation into the company.
Most of the Pile’s datasets are accessible to anyone on the internet with enough storage space and computing power. The dataset has been used not only by Big Tech companies but also by academics and other developers.
Apple, Nvidia, and Salesforce, companies with market values in the hundreds of billions and trillions of dollars, describe in their research papers and posts how they used the Pile to train AI. Records also show that Apple used the Pile to train OpenELM, a high-profile model released in April, just weeks before the company announced plans to add new AI capabilities to iPhones and MacBooks. Disclosures from Bloomberg and Databricks show those companies also trained models on the Pile.
Anthropic, a leading AI developer that received a $4 billion investment from Amazon and promotes its emphasis on “AI safety,” has also used the Pile.
“The Pile includes a very small subset of YouTube subtitles,” Jennifer Martinez, a spokesperson for Anthropic, said in a statement confirming use of the Pile in Anthropic’s generative AI assistant Claude. “YouTube’s terms cover direct use of its platform, which is distinct from use of the Pile dataset. On the point about potential violations of YouTube’s terms of service, we’d have to refer you to the Pile authors.”
Salesforce likewise stated that it used the Pile to build an AI model for “academic and research purposes.” In a statement, Caiming Xiong, the company’s vice president of AI research, stressed that the dataset was “publicly available.”
Salesforce later released that same AI model to the public in 2022; according to its Hugging Face page, it has since been downloaded at least 86,000 times. In their research paper, Salesforce developers noted that the Pile contained profanity as well as “biases against gender and certain religious groups,” which could lead to “vulnerabilities and safety concerns.” Proof News found thousands of instances of profanity and racially and gender-disparaging remarks in the YouTube subtitles. The Salesforce representative did not respond to questions about those safety concerns.
An Nvidia spokesperson declined to comment. Representatives for Bloomberg, Databricks, and Apple did not respond to requests for comment.
YouTube’s Data Treasure Trove
According to Jai Vipra, an AI policy researcher and CyberBRICS fellow at Fundação Getulio Vargas Law School in Rio de Janeiro, Brazil, AI companies compete with one another in part by obtaining higher-quality data, which is one reason businesses keep their data sources close to the vest.
The New York Times revealed earlier this year that Google, the company that owns YouTube, used text from YouTube videos to train its models. A representative replied by informing the newspaper that its usage was allowed under contracts with YouTube creators.
The Times’ investigation also found that OpenAI had improperly used YouTube footage. Company representatives neither confirmed nor denied the article’s findings.
Executives at OpenAI have consistently declined to say publicly whether Sora, the company’s AI product that generates videos from text prompts, was trained on YouTube footage. A Wall Street Journal reporter put the question to OpenAI’s chief technology officer, Mira Murati, earlier this year.
Murati answered, “I’m actually not sure about that.”
Vipra described speech-to-text data, such as YouTube subtitles, as a potential “gold mine” since it can be used to train models that mimic human speech and conversation.
“It’s still the sheer principle of it,” said Dave Farina, host of Professor Dave Explains, a science education channel with 3 million subscribers, 140 of whose videos had their subtitles lifted from YouTube.
“If you’re profiting off of work that I’ve done [to build a product] that will put me out of work or people like me out of work, then there needs to be a conversation on the table about compensation or some kind of regulation,” he said.
The YouTube Subtitles dataset, compiled in 2020, also includes subtitles from over 12,000 videos that have since been removed from YouTube. In at least one instance, a creator erased their entire online identity, yet their work lives on in an unknown number of AI models.
Proof News attempted to contact the owners of the channels named in this report. Many did not respond to requests for comment. None of the creators we reached knew their work had been taken, much less how it had been used.
The creators of SciShow (8 million subscribers, 228 videos taken) and Crash Course (almost 16 million subscribers, 871 videos taken), the cornerstones of Hank and John Green’s educational video empire, were among those caught off guard.
Complexly, the production company behind the shows, released a statement from CEO Julie Walsh Smith: “We are frustrated to learn that our thoughtfully produced educational content has been used in this way without our consent.”
YouTube Subtitles is only the latest AI training dataset to collide with the creative industries.
Proof News contributor Alex Reisner obtained a copy of Books3, another Pile dataset, and revealed in an article published in The Atlantic last year that it contained over 180,000 books taken without permission, including works by Zadie Smith, Margaret Atwood, and Michael Pollan. Numerous writers have since sued AI companies, claiming the companies violated their copyright and used their work without permission, and Books3’s host platform has taken it down.
Defendants including Meta, OpenAI, and Bloomberg have countered the lawsuits, arguing their activities qualify as fair use. The plaintiffs voluntarily dropped their lawsuit against EleutherAI, the organization that originally scraped the books and made them publicly available.
The remaining cases are still in the early stages of litigation, so the questions of authorization and payment remain unresolved. The Pile is no longer available from its official download page, but file-sharing services still host it.
Amy Keller, a partner at DiCello Levitt and consumer protection attorney, stated, “Technology companies have run roughshod.” Keller has filed litigation on behalf of creatives whose work has allegedly been taken up by AI firms without their permission.
“People are concerned about the fact that they didn’t have a choice in the matter,” Keller said. “I think that’s what’s really problematic.”
Mimicking a Parrot
Many creators are apprehensive about what lies ahead.
Many YouTubers monitor the internet for unauthorized use of their content, frequently filing takedown requests, and some worry that AI may soon be able to generate content similar to their own, if not outright replicas.
Pakman, the creator of The David Pakman Show, recently glimpsed AI’s potential while browsing TikTok. He came across a video labeled as a Tucker Carlson clip, but when he watched it, he was shocked. It sounded like Carlson, but it was word for word what Pakman had said on his YouTube show, down to the cadence. He was equally troubled that only one commenter on the video seemed to realize it was fake, an AI voice clone of Carlson reading Pakman’s script.
Pakman stated, “This is going to be a problem,” in a video he posted on YouTube concerning the hoax. “You can do this essentially with anybody.”
Sid Black, a cofounder of EleutherAI, wrote on GitHub that he used a script to download the YouTube subtitles. The script retrieves the captions from a YouTube API in the same way a viewer’s browser does when subtitles are displayed during playback. According to documentation on GitHub, Black used 495 search terms to select the videos, including “funny vloggers,” “Einstein,” “black protestant,” “Protective Social Services,” “infowars,” “quantum chromodynamics,” “Ben Shapiro,” “Uighurs,” “fruitarian,” “cake recipe,” “Nazca lines,” and “flat earth.”
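To illustrate the mechanism, a browser-like caption request boils down to constructing a URL against a public caption endpoint. The endpoint and parameters below are assumptions based on YouTube’s historically public `timedtext` caption URL format; this is not the actual code Black used.

```python
# Sketch of the kind of caption request a browser-like scraper issues.
# The timedtext endpoint and its parameters are assumptions, modeled
# on YouTube's historically public caption URL format.
from urllib.parse import urlencode

def caption_url(video_id, lang="en"):
    """Build the URL a caption request for one video might use."""
    query = urlencode({"v": video_id, "lang": lang})
    return f"https://video.google.com/timedtext?{query}"

print(caption_url("dQw4w9WgXcQ"))
# -> https://video.google.com/timedtext?v=dQw4w9WgXcQ&lang=en
```

Because the request is indistinguishable from what a browser sends during normal playback, blocking it server-side is difficult, which is the point Depoix makes below.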
Despite YouTube’s terms of service forbidding access to its videos by “automated means,” the code has been starred or recommended by over 2,000 GitHub users.
“There are many ways in which YouTube could prevent this module from working if that was what they are after,” machine learning engineer Jonas Depoix, who published the code Black used to access YouTube subtitles, wrote in a GitHub discussion. “This hasn’t happened so far.”
In an email to Proof News, Depoix said he hadn’t used the code since writing it for a college project years ago and was surprised anyone found it helpful. He declined to answer questions about YouTube’s rules.
In response to a request for comment, Google spokesperson Jack Malon said in an email that the company has “taken action over the years to prevent abusive, unauthorized scraping.” He did not respond to questions about other companies’ use of the material as AI training data.
Among the videos used by AI companies are 146 from Einstein Parrot, a channel with close to 150,000 subscribers. Marcia, the famous bird’s caretaker, declined to give her last name out of concern for the parrot’s safety. She admitted that when she first learned AI models had ingested the words of a mimicking parrot, she found it amusing.
“Who would want to use a parrot’s voice?” Marcia said. “But then, I know that he speaks very well. He speaks in my voice. So he’s parroting me, and then AI is parroting the parrot.”
Once AI has ingested the data, it cannot be unlearned. Marcia was troubled by how her bird’s data might be used, including the possibility of creating a digital replica of the parrot, one that might even be made to curse.
“We’re treading on uncharted territory,” Marcia said.
This month, GreatGameInternational reported that companies are taking drastic measures to stop AI systems from scraping their text for training purposes. This has sparked a fierce battle between content-rich websites and AI developers who need vast amounts of text to improve their models.