Protecting art from generative AI is vital, now and for the future
Part 2 of our series on generative AI delves into the issues around how it’s currently trained
Image credit: Rock Paper Shotgun
In Part 1, Our Generation, we looked at what we mean when we use the term ‘generative AI’.
Today in Part 2, The Art Of The Steal, we look at why people say generative AI is stealing art.
Generative AI is all over the entertainment industries right now, and lots of people in games are making excited noises about finding new ways to integrate it into their products, from game developers and publishers such as Ubisoft and Square Enix to platform holders and hardware firms such as Epic and Nvidia. This new industry obsession is still taking shape, and there are lots of questions still to be answered about how much it might cost in the future, who will have access to it, and what it will actually help with, not to mention fears about job losses and other harms. But there’s a bigger question bubbling underneath all of this that threatens to burst the wobbly generative AI bubble: is the entire boom built on stolen labour?
Building and training machine learning systems often requires data – a lot of data. Sometimes we’re lucky, and we can make our own data. When OpenAI trained a bot to play 1v1 Mid matches in Dota 2, they did it through a process called Reinforcement Learning, having the bot play against itself over and over, with the only feedback being who won and who lost. This feedback, sometimes called a ‘reward’, helps a machine learning system play a sort of ‘hot and cold’ game, changing its internal wiring to try and get a better reward the next time it performs the task. If you’re old enough to remember the game Black & White (or, god forbid, Creatures), you’ll know these games worked on a similar principle. If your little animal did a good thing, you gave it a treat, and if it hurled several villagers into a lake, you told it off (or gave it a treat, if you’re into that).
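To make that ‘hot and cold’ idea concrete, here’s a minimal, purely illustrative Python sketch – not OpenAI’s actual setup, and every name and number in it is invented – where the system never sees the goal directly, only a reward telling it whether its latest tweak was warmer or colder:

```python
import random

# Toy "hot and cold" learning loop. The agent never sees the target
# directly; it only gets a reward saying whether its latest attempt
# was warmer (better) or colder (worse) than its best so far.
# Everything here is hypothetical and far simpler than a real
# reinforcement learning system.

target = 42.0                         # the behaviour we want (hidden from the agent)
guess = random.uniform(0.0, 100.0)    # the agent's initial "internal wiring"
best_reward = -abs(guess - target)    # reward: the closer, the warmer

for episode in range(1000):
    # Randomly tweak the wiring and check whether the reward improves.
    candidate = guess + random.uniform(-5.0, 5.0)
    reward = -abs(candidate - target)
    if reward > best_reward:          # warmer: keep the change
        guess, best_reward = candidate, reward

print(f"learned value: {guess:.2f} (hidden target was {target})")
```

Real systems swap that single number for millions of neural network weights, and the warm/cold signal for the win or loss at the end of a full match, but the feedback loop is the same shape.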
Sometimes we can’t make our own data. If we want to train an AI to be an artist, we can’t have it just doodle and learn from the results, because we can’t easily define what the feedback should be. In Dota 2’s 1v1 Mid, if you die then it’s game over, and there are similarly clear-cut goals in Chess, Go, Starcraft and so many other games that AIs have tried to play. In art, defining winners and losers is considerably harder. So we need to find some data that already exists, a dataset of art that already looks like the kind of stuff we’d like our machine learning system to be able to do. But where do we find that? Where we find the answer to every problem in life: on random websites.
Dota 2 is one of many games that have been used to train AI tools on how to win matches. | Image credit: Valve
Chances are, if you’ve heard of a machine learning system that generates art – Midjourney, Stable Diffusion, DALL-E – it was trained on millions or billions of images scraped directly from the Internet. Most of these datasets are unfiltered, gathered from pages strewn across the web, especially sites where content is easily accessible such as Flickr, Reddit or stock image databases. They’re also massive, with one popular dataset, LAION-5B, containing over five billion images. These images are all gathered automatically, with only minimal attempts to filter the contents. As a result, the datasets used to train these AI models are full of copyrighted content, illegal material, and personal information. And it’s all being fed into for-profit AI products that a huge proportion of us are using every day – including game developers, journalists and players themselves.
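For a sense of how thin that filtering typically is, here’s a hedged Python sketch – the file name and column names are hypothetical, loosely modelled on the URL-plus-caption metadata that web-scraped datasets like LAION ship – showing that a typical ‘filter’ is an automated flag, not a rights check:

```python
import csv

# Hypothetical sketch of reading scraped image-dataset metadata.
# Note that the dataset is just pointers (URLs) and captions: the
# images live wherever they were scraped from, and nothing in the
# metadata records copyright status or the creator's consent.

kept, dropped = [], 0

with open("scraped_images.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):  # assumed columns: url, caption, nsfw_flag
        # Filtering this shallow is common: one automated content flag,
        # applied at a scale where no human ever reviews the results.
        if row.get("nsfw_flag") == "UNLIKELY":
            kept.append((row["url"], row["caption"]))
        else:
            dropped += 1

print(f"kept {len(kept):,} rows, dropped {dropped:,}")
# Nothing above checked who made each image or whether they agreed
# to its use; that information simply isn't in the dataset.
```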
This is the chief reason why you’ll see so many people posting about AI ‘stealing’ content online. One of the best-known effects of these murky, legally questionable datasets is that you can ask an AI to mimic the style of a particular artist, like Greg Rutkowski, who has illustrated for games such as the Anno series, and card games like Magic: The Gathering. Rutkowski’s work is beloved, widely shared online, and clearly labelled with his name, all of which means an AI is going to see a lot of examples of his work, as he discovered one day. But there are many more examples that you may never hear about, or that may never come to light. For example, AI Dungeon – which used OpenAI’s GPT-2 to generate RPG stories – took and used thousands of choose-your-own-adventure stories from an online community without permission, leading to a lot of disappointment from the original authors (note that almost all threads on that link are pretty yikes, but it’s here for your context anyway).
Theft is sometimes simple, and sometimes complicated. When I wander into a boss fight in Valheim that my friends have spent hours preparing for and I hoover up all the loot like a hungry, fantasy Roomba, that’s clearly not stealing – that’s just sharing the wealth. In the real world, where courts and legal systems get involved, theft is a lot greyer. Games are often criticised for ‘stealing’ things from other games or media – whether it’s creature appearances in Palworld, dances in Fortnite, or entire games, as happened to Vlambeer’s Ridiculous Fishing and Asher Vollmer’s Threes. But we don’t always agree on what theft is, especially in courtrooms. The lawsuits against Fortnite were all dismissed, while the small indie developers whose work was cloned had almost no legal recourse at all. Deciding on what constitutes theft is unfortunately often more about power than it is about justice.
Right now, there are several lawsuits running around the world targeting various AI models, companies and datasets for infringing all manner of laws and regulations. Some models have been shown to memorise personal information and leak it later, while others have been trained on harmful content or will reproduce copyrighted works. More technical legal arguments get deep into the details of these systems – that the very act of training on copyrighted material constitutes a breach of copyright, for instance. It’s not clear which of these arguments, if any, will break through in court. Companies claim that all of this falls under fair use, that harmful content is a temporary flaw of the system that can be fixed later, and that licensing agreements will help provide respite for artists in the future.
But often it’s not necessarily about what’s legal today, but about how we want the world to work in the future. There are many, many examples throughout the last century of us regulating technology not because it breaks existing laws, but because it lets people work around those laws in ways that no one could have expected. However, the arguments that defend this large-scale exploitation of creative work miss the point: this isn’t just a question of legality, but one of humanity. It makes sense to protect creative work and the people who work hard to make it, because it plays a vital role in society.
Palworld has recently drawn a lot of criticism over the design of its Pal monsters. | Image credit: Rock Paper Shotgun/Pocketpair
It’s this long-term damage that a lot of creative people and AI researchers fear the most. Just like the recent games industry layoffs, the effects of big changes in the industry take a while to fully hit. All the games that were due to come out in a given year will probably come out, and a lot of them will be fun, and maybe you’ll wonder if those layoffs really affected anything. But the impact of disruption like this can take years to be noticed, and decades to be reversed. These generative AI systems could only be built today because they had decades, centuries even, of human creativity to look at on the Internet. If they play a role in devaluing or destabilising the jobs that let those people make that art, what culture will there be to learn from at the end of this century? Even if you’re a die-hard AI accelerationist, consider this: many are worried that we have permanently infected the Internet with so much AI-generated content that it may be impossible to train an AI system on human-authored content ever again.
In Part 3, Better Living, we look at how generative AI must be centred around creators if we’re going to achieve a best-case scenario future.
Missed Part 1? Here, we take a closer look at what generative AI means, and why it needs better language to describe how we use it.