Data Laundering: How Stability AI managed to get millions of copyrighted artworks without paying artists [Finance Fridays]

Understanding the very creative ways tech gets around copyright to train Large Models.

Aug 12, 2023

Hey, it’s your favorite cult leader here 🐱‍👤

On Fridays, you get posts about money 💲💰💰💲. Expect a mix of personal finance, breakdowns of the most relevant events in the tech industry, and insights into its unique workings 🔍🔍. Use these posts to master the money game and solidify your finances to ride out the meanest of recessions.

In my collaboration with the Gradient, an amazing publication started at Stanford, I wrote the article: Artists enable AI art - shouldn't they be compensated? Our audience has grown a lot since then, and when I spoke to a few of the newer readers, many of them were unaware of the ideas discussed in that article. In this piece, I will go over one of its major ideas: Data Laundering, and how tech companies use it to get around copyright to build their large datasets (using quotes from the original). We have seen many companies and organizations try to build their own language models. Given the trend towards multi-modality, we will likely see more orgs use such techniques to work around copyright (and I’m sure many already have). Understanding how this works is important for building better regulations and procedures to ensure everyone is paid fairly.

[Image: a group of washing machines. Photo by Ambitious Studio* - Rick Barrett on Unsplash]

Key Highlights

What is Data Laundering- Data Laundering is the conversion of stolen data so that it may be sold or used by ostensibly legitimate databases. “As with other forms of data theft, data harvested from hacked databases is sold on darknet sites. However, instead of selling to identity thieves and fraudsters, data is sold into legitimate competitive intelligence and market research channels.” (Source: ZDNet, Cyber-criminals boost sales through ‘data laundering’.)
How Stability AI got around artist copyright- In the case of Stability AI and AI art, the process plays out like this:
1. Create or fund a non-profit entity to create the datasets for you. The non-profit, research-oriented nature of these entities allows them to use copyrighted material more easily.
2. Then use this dataset to create commercial products, without offering any compensation for the use of copyrighted material.
Think I’m making things up? Think back to Stable Diffusion, Stability’s AI text-to-image generator. Who created it? Many people think it was Stability AI. They’re wrong. It was created by researchers at the Ludwig Maximilian University of Munich, with a compute donation from Stability. Look at the GitHub of Stable Diffusion to see for yourself:
“Stable Diffusion is a latent text-to-image diffusion model. Thanks to a generous compute donation from Stability AI and support from LAION, we were able to train a Latent Diffusion Model on 512x512 images from a subset of the LAION-5B database.”
So the non-profit created the dataset/model, and the company then worked to monetize it. As noted in AI Data Laundering: How Academic and Nonprofit Researchers Shield Tech Companies from Accountability:
“A federal court could find that the data collection and model training was infringing copyright, but because it was conducted by a university and a nonprofit, falls under fair use…
Meanwhile, a company like Stability AI would be free to commercialize that research in their own DreamStudio product, or however else they choose, taking credit for its success to raise a rumored $100M funding round at a valuation upwards of $1 billion, while shifting any questions around privacy or copyright onto the academic/nonprofit entities they funded.”
To their credit, Stability has started trying to use more licensed datasets since the release of Stable Diffusion v1. But they are far from the only ones relying on such tactics.
Why this is not Fair Use- In traditional fair use, attribution is given to the original creator. Using work without attribution has another, less flattering name: plagiarism. Adjusting models to acknowledge their ‘sources’ would be an imperfect but great start. My article with the Gradient discusses how that can be implemented with AI Art (since then, various organizations have done something similar, so yippee!). However, we need to look at extending that beyond just AI Art (the way DeepMind trained Gato might provide some promising insights). I plan to do a detailed dive into multi-modal joint embeddings soon.
What can be done till then- Till we can come up with refined attribution systems, I’m going to suggest something heretical: just pay the damn people whose work you use to create solutions. We already have writers striking against Hollywood’s exploitation. In our sister publication, AI Made Simple, I’ve covered multiple research papers showing that high-quality data offers the highest ROI for developing performant models (for the most recent example, look at my breakdown of the excellent LIMA paper).
Antagonizing the people whose work provides us with high-quality data just sounds like a very stupid business decision. As Adam Smith wrote in his seminal work, An Inquiry into the Nature and Causes of the Wealth of Nations (a very interesting and surprisingly insightful book, fyi), wealth creation is not a zero-sum game. We can all eat.

It’s important to get the conversation going regarding this topic. At the very least, we need to push for transparency in the datasets used to ensure that companies aren’t cooking up creative ways to get around copyright (shoutout to Meta for their continued commitment to Open Source here).
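
To make the transparency ask concrete, here is a minimal sketch of what a dataset manifest check could look like. Everything in it is an assumption for illustration: the `DatasetEntry` fields, the `transparency_report` helper, and the sample data are hypothetical and do not describe how LAION, Stability, or anyone else actually tracks their data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetEntry:
    """One item in a hypothetical training-data manifest."""
    url: str
    creator: Optional[str]   # who made the work, if known
    license: Optional[str]   # e.g. "CC-BY-4.0", or None if unknown
    paid: bool               # whether the creator was compensated


def transparency_report(entries):
    """Count entries that lack attribution, a license, or payment."""
    return {
        "total": len(entries),
        "missing_creator": sum(1 for e in entries if not e.creator),
        "missing_license": sum(1 for e in entries if not e.license),
        "unpaid": sum(1 for e in entries if not e.paid),
    }


# Made-up sample data, purely for illustration.
sample = [
    DatasetEntry("https://example.com/a.png", "Artist A", "CC-BY-4.0", False),
    DatasetEntry("https://example.com/b.png", None, None, False),
    DatasetEntry("https://example.com/c.png", "Artist C", "licensed", True),
]

report = transparency_report(sample)
print(report)
```

A published report like this would not settle the legal questions, but it would at least let artists and regulators see how much of a dataset is unattributed, unlicensed, or unpaid.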

That is it for this piece. I appreciate your time. As always, if you’re interested in working with me or checking out my other work, my links will be at the end of this email/post. If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow.

Reach out to me

Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.

Small Snippets about Tech, AI and Machine Learning over here

AI Newsletter- https://artificialintelligencemadesimple.substack.com/

My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/

Check out my other articles on Medium: https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819

Aditya Anil

Aug 13, 2023

Great post! I didn't know about Data Laundering. I think you are right: we need to address this on a large scale, so that artists' works stay protected.

This mass 'seemingly legal' burglary of data will reduce the concept of IP rights and copyrights to a mere joke.

Michael Woudenberg

Nice summary of an important problem.
