OpenAI and Google trained their AI models on text transcribed from YouTube videos, potentially violating creators’ copyrights, according to The New York Times. The report, which outlines efforts by OpenAI, Google and Meta to maximize the amount of data they can feed to their AI models, cites numerous people with knowledge of the companies’ practices. It comes just days after YouTube CEO Neal Mohan commented in an interview on OpenAI’s alleged use of YouTube videos to train its new text-to-video generator, Sora.
According to The New York Times, OpenAI used its speech recognition tool Whisper to transcribe over a million hours of YouTube videos, which were then used to train GPT-4. It was previously reported that OpenAI used YouTube videos and podcasts to train the two AI systems. OpenAI President Greg Brockman was reportedly part of the team involved. Under Google’s policies, “unauthorized scraping or downloading of YouTube content” is not allowed, Matt Bryant, a Google spokesperson, told The New York Times, adding that the company was unaware of any such use by OpenAI.
The report, however, claims that some people at Google knew about it but did not take action against OpenAI because Google was using YouTube videos to train its own AI models. Google told The New York Times that it only does so with videos from creators who have agreed to participate in an experimental program. Engadget has contacted Google and OpenAI for comment.
The New York Times report also claims that Google changed its privacy policy in June 2022 to more broadly cover its use of publicly available content, including Google Docs and Google Sheets, to train its AI models and products. Bryant told The New York Times that this is only done with the permission of users who opt in to Google’s experimental features, and that the company “has not started training on additional data types based on this language change.”