NVIDIA reportedly scraped vast amounts of copyrighted content for AI training. On Monday, 404 Media’s Samantha Cole reported that the $2.4 trillion company instructed workers to download videos from YouTube, Netflix, and other sources to develop commercial AI projects. The graphics card maker seems to have embraced a “move fast and break things” approach as it competes for dominance in the intense and often controversial AI gold rush.
The downloaded footage was allegedly used to train models for products like its Omniverse 3D world generator, self-driving car systems, and “digital human” initiatives.
YouTube seems to disagree. Spokesperson Jack Malon referred us to a Bloomberg story from April, where CEO Neal Mohan stated that using YouTube to train AI models would be a “clear violation” of its terms.
NVIDIA employees who raised ethical and legal concerns about the practice were reportedly told by their managers that it had already been approved by the company’s top executives. “This is an executive decision,” responded Ming-Yu Liu, vice president of research at NVIDIA. “We have an umbrella approval for all of the data.” Others in the company allegedly described the scraping as an “open legal issue” to be addressed later.
The situation is reminiscent of Facebook’s (now Meta’s) old “move fast and break things” motto, an approach that famously compromised the privacy of millions of users.
In addition to YouTube and Netflix videos, NVIDIA reportedly instructed employees to train on other datasets, including the movie trailer database MovieNet, internal libraries of video game footage, and the GitHub-hosted video datasets WebVid (since removed after a cease-and-desist) and InternVid-10M, which contains 10 million YouTube video IDs.
Some of the data that NVIDIA allegedly trained on was only intended for academic or non-commercial use. For instance, HD-VG-130M, a library of 130 million YouTube videos, includes a usage license specifying it’s for academic research only. NVIDIA reportedly dismissed concerns about these academic-only terms, insisting the data was fair game for its commercial AI projects.
To avoid detection and bans by YouTube, NVIDIA reportedly downloaded content using virtual machines (VMs) with rotating IP addresses. When a worker suggested using a third-party IP address-rotating tool, another NVIDIA employee reportedly responded, “We are on Amazon Web Services and restarting a virtual machine instance gives a new public IP. So, that’s not a problem so far.”