The other day I was hanging out with my Dad and I introduced him to ChatGPT. We asked it to name the best cricketer from our home state of WA, and it answered “Dennis Lillee”, describing him as a bowler of “lethal pace”. That immediately struck us as exactly the kind of phrase a cricket commentator would use. And it turns out that multiple sources do indeed use that exact phrasing.
This is no isolated case. While it is hard to quantify such things, one trial found that GPT-4 generated copyrighted material on 44% of test prompts.
All AIs are produced by compiling a large amount of data. For models of limited scale, that data can be a known set of openly available information used according to a legitimate allowance. But this is nowhere near enough for large models like GPT-4.
No-one really knows exactly what goes into their data. OpenAI isn’t telling. But it is likely that it includes pretty much all the data there is. Reports say they had essentially used up most of the available data by 2021 and since then have had to resort to tricks like transcribing YouTube videos and adding the transcripts, a practice that is legally dubious. They’re now adding “synthetic data”, namely using AI to create data to be consumed by AI in order to create data. An even bigger problem is that to get a linear improvement you need an exponential increase in data. No doubt there are efficiencies to be gained, but as things stand, it is very possible that we are already approaching the limits of this technology.
Basically they suck everything up and crunch it in massive fields full of servers. Now they will be looking to extract more data. They’ll be getting data from any place, legitimate or not. They’ll scrape private documents, social media, anything they can get their hands on. But it will just keep growing as long as we let them. They’ll tap into our cameras, our devices, into the billions of computers embedded in cars and cappuccino machines, into location data from phones and heart rates from smartwatches. Anything they can. Increasingly, people are hooking their brains directly to computers; have no doubt, they’ll scrape your brain waves right out of your skull if they can.
There are laws against this kind of thing, but they were created in another time with another set of technologies. While details vary per jurisdiction, the general assumption so far has been that use of data for training models is not subject to the normal copyright restrictions. This means that, for example, the copyrighted work of an artist can be scraped up into an AI, which will then output images that to a normal human eye are indistinguishable from work in the style of that artist. Not only would the artist have no legal recourse, they themselves would be subject to any legal restrictions regarding the generated image (although the scope of that is also contested). Creative artists and authors are begging for AI regulation to protect their livelihoods.
One implication of all this is that large models rest on very fragile legal ground. If it is ruled that their use of this data is in fact illegal, the models will have to be shut down. This is no idle speculation; the New York Times is currently taking OpenAI to court, seeking an order:
Ordering destruction under 17 U.S.C. § 503(b) of all GPT or other LLM models and training sets that incorporate Times Works
Meanwhile, AI firm Anthropic is being sued by music publishers with the request that they destroy all infringing copies of works, which would entail destroying the model itself.
The key to this is that any legal challenge must be weighed against the pro-AI momentum; once it is embedded too deeply in too many things there will be no going back. And that, of course, is their strategy.
They want you to think AI will make your job easier. But the reason your job is hard is not because of the tech. It’s because you’re working in a system that does not value you as a person, but only in how much money they can extract from you. No matter how hard you work, how efficient you are, they will keep pushing until they extract as much as they can. And when you’ve given everything you’ve got until you can give no more, they’ll replace you with a machine. AI is not the cause of this problem, but it is a great way to accelerate it.