For the authors who just found out their books were stolen to train AI
And all the nosy folks who want to know about this, too (relatable)
The other day, someone DMed me a video that was clearly AI-generated. (It was also labeled as AI, and not all are, so that, at least, was a plus.) It wasn’t meant to be anything significant; it was just a video of a lamb looking and smiling(…?) at the “camera.” To me, it hit the uncanny valley. But obviously, plenty of others thought it was cute.
What I found ironic, though, was that the user who posted it had another video — a response to a comment asking if their job — therapy — could be replaced by AI. In that video, they listed out a zillion reasons their job could not be replaced by AI, and how it might actually be harmful for patients to be “treated” with AI.
I’m with them. I don’t think AI can replace a real-life therapist. But it’s frustrating to see that some people want to protect their own jobs from being taken over by AI, while they don’t seem to have the same respect for others’ work. (Which also can’t be replaced by AI.)
It’s not like there’s a shortage of real lambs out there.
Last month, while I was at my tax appointment waiting to find out how close I’d come with my estimated payments, my tax guy asked if I would ever use AI to “write” a book.
I said no. And that’s all I said. Hey, I can be chill!
But here’s the thing . . . it’s becoming a more and more common question, one I hear a lot, and I’m sure other authors are hearing it a lot, too.
But . . . why would I abdicate the fun part of my job?
Like so many authors I know, I grew up wanting to write books. I love telling stories. And being able to do that professionally is a dream come true.
Why on Earth would I give that up?
I mean, yes, it’s hard sometimes, but I enjoy the challenge. I enjoy figuring out the best way to tell a story. I enjoy spending time with my characters and building the world they live in.
And when I run into a problem, there’s nothing like the surge of triumph when I figure out how to make something work. The feeling of finishing a book or a series after literal years of work — it’s unmatched.
And yeah, when I was a querying writer and agents finally said, “Yes, I’d like to represent you,” after years of striving to improve, after hundreds of rejections and well over a dozen completed manuscripts . . . I felt immensely validated. I’d worked my tail off to become a good writer. A sellable writer. So after writing over a million words no one wanted to represent, when agents offered, I knew I was finally writing books that were worth something.
Now I’m eighteen books into my career (#18 is coming out this Fall!) — even more evidence that what I write is valuable.[1]
If you peeked into an author’s social media yesterday, you probably saw that — once again — we have a tool to search a dataset for our books . . . which were used to train LLMs — large language models — aka “AI.”
A quick lesson for anyone who needs it:
It’s not really artificial intelligence. I’ve mostly called it “AI” here because that’s how it’s presented in marketing language and news articles, but LLMs are, in short, extremely advanced predictive text. And yes, it’s fair to say that the term “AI” is misleading and gives people a false sense of what’s actually going on here.
Anyway, these models are nothing without training data, so they need to be fed — ideally — a lot of high-quality writing in order to function. Enter books. Massive amounts of words following words following words.
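If you’re curious what “predictive text” actually means here, the little Python sketch below is a deliberately tiny, made-up illustration (nothing like how a real LLM is built, and the example sentence is invented). It just counts which word tends to follow which and guesses accordingly; real models do a vastly more sophisticated version of the same job, which is exactly why they need to ingest so much writing.

```python
# A toy, made-up sketch of "predictive text": count which word tends to follow
# which in some training text, then always suggest the most common follower.
# Real LLMs are enormously more sophisticated, but the core job is the same:
# guess the next word from the words that came before.
from collections import Counter, defaultdict

training_text = "the lamb looked at the camera and the lamb smiled"

# Build a table of "word -> counts of the words seen right after it."
followers = defaultdict(Counter)
words = training_text.split()
for current, nxt in zip(words, words[1:]):
    followers[current][nxt] += 1

def predict_next(word):
    """Suggest the word most often seen after `word` in the training text."""
    if word not in followers:
        return None  # the model can't predict anything it was never fed
    return followers[word].most_common(1)[0][0]

print(predict_next("the"))   # -> "lamb" (it followed "the" twice, "camera" once)
```

Scale that idea up enormously and you can see why the quality and quantity of the training text is the whole game.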
But . . . books are expensive, if you need a hundred thousand of them. Or a million of them.
Or, as we’ve now learned, the latest dataset contains 7.5 million books and 81 million research papers. Read The Atlantic’s “The Unbelievable Scale of AI’s Pirated-Books Problem” for more on that.
If you just gasped a gasp out loud, I get you. It’s an incredible amount of data. And by data, I mean real humans’ hard work. Literal lifetimes of dreams. Efforts. Passion.
Stolen.
Wait, stolen?
Yeah. Like I said, it’s expensive to buy all those books (and scientific papers and short stories and translations), so these were taken from pirate sites, where users illegally upload and download books.
That Atlantic article has a tool to search the database; you can search any author’s name to see which books of theirs are included in the dataset. All my books are in it.
In fact, my agent told me that, in her opinion, literally every book you’ve ever read has likely been swept up into illegal LLM training; if you don’t see a book in there, it’s probably because of a listing error, not because it wasn’t used.
This is actually my second time seeing my books appear in one of these. Many of my books were also in the Books3 dataset, revealed a couple of years ago, which was likewise used by Meta and other companies to train their LLMs.
And when I searched my friends’ names — yep, they’re there too.
All across socials, I’m seeing my colleagues share screenshots of their work that was illegally included in this . . . and then used to train an LLM so that Meta, one of the biggest companies in the world, could make a profit off it.
Wait, Meta? Yes. Meta. That Meta. Facebook, Instagram, Threads, WhatsApp, etc. Owned by a billionaire. That’s the one.
A couple more articles worth reading:
Wired: “Meta Secretly Trained Its AI on a Notorious Piracy Database, Newly Unredacted Court Docs Reveal”
Axios: “AI firms push to use copyrighted content freely”
Okay, so now what?
Well, there’s a class-action lawsuit going on already. That’s why we’re seeing this information come out. I know some folks will probably ask: what’s the point? They already have our work.
But here’s the thing: we don’t want to let them set precedent. We want what they’re doing to go on the record as illegal. They’re using stolen material for their own profit.
It will take years, I’m sure, but it’s worth fighting back.
And remember, this isn’t new information. There have been lots of folks paying attention to this for years. People like my agent, the AALA, and the Authors Guild have been working on this for a really long time, and they’ve all been operating under the assumption that pirated works were being fed to LLMs.
I know it’s easy to feel panicked and angry right now, but you’re definitely not alone in this fight! There have been people fighting on your behalf for literal years.
But I know waiting stinks, and action feels good, so here’s what you can do in the meantime:
Grab screenshots of which works of yours were included in the dataset. You may never need them, but it’s always good to have evidence just in case.
And even if you aren’t in the dataset, you can still help out your favorite (and least favorite) authors (and potentially your future self!) by simply not using AI/LLMs.
There are some “AI” features you can turn off, like “Apple Intelligence” on your phone.
Use search engines that either don’t respond with AI “overviews” at the top, or give you the ability to turn those off. (I use DuckDuckGo on both my phone and my computer.)
Don’t stop with LLMs, either. Don’t use AI to generate “art.” Don’t use it to make promotional materials, book or album covers, videos, or audiobooks.
As I said on (Meta-owned) Threads, if you’re an author and you’ve just discovered your work was used to train AI without your consent, remember this feeling. If you’re ever tempted to use AI to generate something to promote your work — or even just for brainstorming — remember that someone else’s art was stolen for that. They’ve experienced the same feelings you’re feeling right now.
Right now, folks are (rightfully!) mad about books being used for this, but make sure you’re not encouraging the exploitation of other artists. We all have to stand together on this.
Just . . . don’t use it. Don’t interact with it. Block/mute accounts that share AI-generated material. Because giving them your attention encourages them to go harder for it — and in some cases, it makes them money.[2]
I know it can be unavoidable in some situations, though, so do just what you can.
And if you’re worried about your future as an author . . . remember, you can do things AI can’t. AI “art” is not art. It’s generated. It isn’t created with intention.
You can internalize what you read and find meaning in it. You can make deliberate decisions about what to include on the page — or not include! You can actually understand what you’re writing and what it means, and how it fits in with literature on a larger scale.
Also, you, as a human, can copyright your work. So there’s that.
I know it’s demoralizing to see authors’ work once again taken without consent, without compensation, and fed into a billionaire’s environment-killing profit machine, but please know that your work is valuable.
Otherwise, they wouldn’t want to steal it.
New book announcement, in case you missed it: CONFESSIONS FROM THE GROUP CHAT
[1] Thanks for allowing me to indulge my sometimes-struggling sense of pride for a moment.
[2] Meta has a bonus program, so it is financially beneficial for some people to share, essentially, rage bait. Block them.
People who think an actual writer would write a book using AI, or people who think about writing their first book using AI... I believe all of them just have this incredibly wrong notion that it's IDEAS that matter, and not the execution. That the ideas are the art, not the actual doing (painting, writing, you name it).
"I have so many good ideas, but I don't really know how to write/paint that well..."
I am so bemused by such statements.
Mate, the ideas were never the problem. The magic lies in the process, in the doing.
Trying to merge your crafting abilities with that image/feeling in your head - that's what being an artist is all about (or, you know, the commonly shared anguish about this gap being unbridgeable...)
100%: copyrighted work should be protected.
I don't know that not using "AI" is an effective step, or even possible. We've stepped on copyright so frivolously to this point I think we hardly notice anymore.
For me it was Pinterest. A vast array of images, photographs, art with no attribution, no link back to the creator. Use images however, wherever you want with no regard for copyright. I honestly sent dozens of messages reporting pins: "No link to the creator." "Some random person is using this image to promote their business or their..." The response was honestly along the lines of, "We don't really care about the creator or copyright." Everyone loved Pinterest and didn't think about how those images came into existence.
Now we've integrated different versions of AI so quickly into so many services, I don't know that we can stop using it (without becoming hobbits).
- I don't know a lot about the technology behind LLMs, but logically, predictive text probably leverages an LLM. I honestly don't know if that's fed from the browser or the website, or if there's a central API they all tap into, but almost every website I go to suggests what I should type next.
- I can (and do) turn predictive text off in Word because it's annoying. But Microsoft's profits aren't more or less because I'm not using predictive text.
- I also use DuckDuckGo and it's started to give me predictive answers to search questions.
- Canva offers MagicWrite, which promises to "help write copy and brainstorm ideas."
- Since Meta is the one stealing work to support their LLM, I imagine that's filtering into Facebook in some way.
- Grammarly, I would imagine (though I haven't researched it), could have had the rules of grammar programmed in. But we don't know if it was trained solely on work people submitted to the site for review or from other sources. And did the people who submitted text for Grammarly to review know that their words would be used to train the service? (I imagine there was a footnote somewhere.)
- Substack has AI to allow me to generate an image for my post.
Not all versions of AI are LLMs, but it seems everyone wants an algorithm to automate some part of a process for them.
I would very much like to see the DOJ crack down on pirate sites, period. Copyrighted work should be protected.
I would also like to see copyright enforced in regard to training any AI model. I think big companies would still try to get around it. But one judgment that AI models have to respect copyright would set a precedent that makes it easier for artists to pursue justice in protecting their work.
And I'd actually like to see a company create AI that respects copyright (I did see Tess mentioned in the comments below). I think we'd be surprised how many people would be willing to allow their work to train AI if they're fairly compensated (and who knows what "fair" is in this wild west).