AI-12: What’s good in AI is taken from humans

The other day I was hanging out with my Dad and I introduced him to ChatGPT. We asked it to name the best cricketer from our home state of WA, and it answered “Dennis Lillee”, describing him as a bowler of “lethal pace”. That phrase immediately struck us as sounding exactly like something a cricket commentator would say. And it turns out, multiple sources do indeed use that exact phrasing.

This is no isolated case. While such things are hard to quantify, one trial found that GPT-4 reproduced copyrighted material on 44% of test prompts.

All AIs are produced by training on a large amount of data. For models of limited scale, that data can be a known set of openly available information used under a legitimate allowance. But this is nowhere near enough for large models like GPT-4.

No-one really knows exactly what goes into their training data. OpenAI isn’t telling. But it is likely that it includes pretty much all the data there is. Reports say they had essentially used up most available data by 2021, and since then have had to resort to tricks like transcribing YouTube videos and adding the transcripts, a practice that is legally dubious. They’re now adding “synthetic data”: using AI to generate data, which is then fed back in to train other AIs. An even bigger problem is that to get a linear improvement in quality you need an exponential increase in data. No doubt there are efficiencies to be gained, but as things stand, it is very possible that we are already approaching the limits of this technology.
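To make that scaling point concrete, here is a toy sketch assuming a Chinchilla-style power law, where loss falls as a power of the training-data size. The constants are invented for illustration, not measured values; the point is only the shape of the curve: each fixed step of improvement costs a multiplicative jump in data.

```python
# Toy power-law scaling: loss L(D) = A * D**(-alpha).
# A and alpha are made-up constants for illustration only.
A, alpha = 10.0, 0.1

def loss(tokens: float) -> float:
    return A * tokens ** (-alpha)

def tokens_needed(target_loss: float) -> float:
    # Invert L(D) = A * D**(-alpha)  =>  D = (A / L) ** (1 / alpha)
    return (A / target_loss) ** (1 / alpha)

# Each equal 0.5 drop in loss demands a multiplicative increase in data.
for target in (2.0, 1.5, 1.0):
    print(f"loss {target}: ~{tokens_needed(target):.3e} tokens")
```

On these made-up constants, going from a loss of 2.0 to 1.0 takes roughly a thousand times more data, which is the “linear improvement, exponential data” pattern described above.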

Basically they suck everything up and crunch it in massive fields full of servers. And they will keep looking to extract more data, from any place, legitimate or not. They’ll scrape private documents, social media, anything they can get their hands on. It will just keep growing as long as we let them. They’ll tap into our cameras, our devices, into the billions of computers embedded in cars and cappuccino machines, into location data from phones and heart rates from smartwatches. Anything they can. Increasingly, people are hooking their brains directly to computers; have no doubt, they’ll scrape your brain waves right out of your skull if they can.

There are laws against this kind of thing, but they were created in another time for another set of technologies. While details vary by jurisdiction, the general assumption so far has been that the use of data for training models is not subject to the normal copyright restrictions. This means that, for example, the copyrighted work of an artist can be scraped up into an AI, which will then output images that to a normal human eye are indistinguishable from a work in the style of that artist. Not only does the artist have no legal recourse; they themselves could be subject to legal restrictions regarding the generated image (although the scope of that is also contested). Creative artists and authors are begging for AI regulation to protect their livelihoods.

One implication of all this is that large models stand on legally very fragile ground. If it is ruled that their use of data is in fact illegal, the models may have to be shut down. This is no idle speculation; the New York Times is currently taking OpenAI to court, asking for an order:

Ordering destruction under 17 U.S.C. § 503(b) of all GPT or other LLM models and training sets that incorporate Times Works

Meanwhile, AI firm Anthropic is being sued by music publishers with the request that they destroy all infringing copies of works, which would entail destroying the model itself.

The key question is how any legal challenge weighs against the pro-AI momentum; once AI is embedded too deeply in too many things, there will be no going back. And that, of course, is their strategy.

They want you to think AI will make your job easier. But the reason your job is hard is not because of the tech. It’s because you’re working in a system that does not value you as a person, but only in how much money they can extract from you. No matter how hard you work, how efficient you are, they will keep pushing until they extract as much as they can. And when you’ve given everything you’ve got until you can give no more, they’ll replace you with a machine. AI is not the cause of this problem, but it is a great way to accelerate it.


As it so happens, meanwhile on the Internet, AI researchers announce “new and improved” models:

OpenAI claims the new GPT-4 Turbo model updates information up to April 2024.

At this point, it is clear everyone is scraping the entire Internet to build these models, because that is relatively easy to do. Meta is no doubt scraping their vast social media assets, but everybody is basically scraping off everybody else.

Whether the creators intend it or not, there is no doubt these models are trained on copyrighted content. But even “open source” models don’t disclose their sources - they simply release the weights of the trained models.

Personally I am a bit leery of using a model trained on Trump’s speeches, all the hatred and cyberbullying in social media, and recursively on fake news generated by other AI models. Of course, all the gems of literature are included in that training as well.

Perhaps it is best to view these models as an art form. It is interesting to ruminate and speculate on their outputs. In a strange and potentially terrifying way they are a mirror of humanity; in peering into them, perhaps we look at ourselves, imperfections and all. And by realising this, perhaps we free ourselves from our delusions of self-perception.


This is an interesting article:

Last week, the Wall Street Journal published a 10-minute-long interview with OpenAI CTO Mira Murati, with journalist Joanna Stern asking a series of thoughtful yet straightforward questions that Murati failed to satisfactorily answer. When asked about what data was used to train Sora, OpenAI’s app for generating video with AI, Murati claimed it used publicly available data, and when Stern asked her whether it used videos from YouTube, Murati’s face contorted in a mix of confusion and pain before saying she “actually wasn’t sure about that.” When Stern pushed a third time, asking about videos from Facebook or Instagram, Murati shook her head and said that if videos were “publicly available…to use, there might be the data, I’m not sure, I’m not confident about it.”

I don’t mean to pick on the individual in this video, but it was indeed very cringe to see that they didn’t know the answer to such a simple question: where did the content come from?

That’s really interesting. So it seems that people who say the technology will just keep getting better and better are incorrect. Kind of like how we still aren’t driving flying cars.


I think it’s technically an open question at the moment. I certainly think the confident claims of continual progress are overblown, and there’s a significant chance that it will level out. After all, every other technology has.

But where it gets tricky is where it encounters philosophical beliefs, especially accelerationism. This is basically the idea that we’ve gone too far, there’s no going back, and the only way is to keep pushing further and faster until we get to the other side.

There’s an astonishing clip of Altman talking to a bunch of VC investors. He says, “We have no profits, no profit model, and no road to profitability. Our plan is to create an AGI and ask it how to make money.”

So the question of whether it will level out is one thing. The question of what happens when you give trillions of dollars to people in a wild gamble that it will not level out is another.


One thing I wanted to mention that not many people are aware of is that Meta does train their AI models on WhatsApp messages, which some people may have expected to be private.

Whilst it is true that WhatsApp offers end-to-end encryption and Meta technically does not store or look at WhatsApp messages, I was told they tokenise messages locally on the app prior to encryption and feed them through an ingestion step, with the results of that ingestion sent separately to Meta for training purposes.

So technically Meta does not have access to the messages themselves, but the messages are used to train models. They even tried to alter the terms of service to make clear they were doing this, which raised an uproar - but they were already doing it before the proposed change to the terms of service.
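For what it’s worth, the kind of on-device pipeline described above might look something like this minimal sketch. Everything here - the function names, the trivial tokeniser, the bag-of-words features - is invented for illustration; it is not Meta’s actual code, and the underlying claim itself is secondhand:

```python
# Hypothetical sketch: tokenise a message on-device and send only lossy,
# aggregated features - not the message - off for training.
# All names and the tokenisation scheme are invented for illustration.

def tokenise(message: str) -> list[str]:
    # Trivial whitespace tokeniser standing in for a real subword tokeniser.
    return message.lower().split()

def extract_features(tokens: list[str]) -> dict[str, int]:
    # Bag-of-words token counts: word order is discarded, so the exact
    # message cannot be reconstructed from the features alone.
    counts: dict[str, int] = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts

message = "see you at the match at noon"
features = extract_features(tokenise(message))
# The message itself would be end-to-end encrypted; only `features`
# would be sent separately, per the described scheme.
```

The design point is the one made later in this thread: aggregation loses information, which is the basis of the “privacy is protected” argument - and also why it is hard to verify exactly what is being sent.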

For this reason I deleted my WhatsApp account several years ago. I was also getting spam from relatives, and it’s hard telling family members that no, I am not interested in watching YouTube or TikTok videos, no matter how cute, and I am not interested in political messages about how so-and-so is corrupt.

People also don’t seem to understand that WhatsApp is a target for zero-click exploits, since all it takes to send someone a message is their phone number, which is easy to guess.

Similarly, Google harvests search queries, as well as pretty much anything you put into any Google service. The same goes for Microsoft with Windows 11 and so on. At this point, we are all feeding input into AI models whether we like it or not. Even Apple has been caught harvesting data from their devices.


Do you have a source for this claim?

I would prefer not to provide an answer. As @sujato noted, we are all one degree of separation removed from our tech overlords. The fact that they tried to change the terms of service to make this explicit should be evidence enough.

One thing I would add is that Eric Schmidt once said: “Any user of Google services should not have any expectation of privacy.” (This is a paraphrase; I do not remember his exact words.) He also added that anyone who is not comfortable with this should stop using Google services. That very night, I deleted all Google apps from my phone.

Of course, all this is from a few years ago. Companies do change. I do not speak for current operating practices, as I am no longer active in the role I once played.


The relentless optimism of tech bros (and they are mostly bros, although occasionally there is a sis) can be captivating, even if it sometimes comes across as a bit naive.

A few years ago I attended a tech startup conference in Hong Kong. I wanted to “feel” the “vibe”, and to be honest I had a few ideas I wouldn’t have minded exploring. It was Peak Tech Startup - interest rates were at all-time lows, and everyone wanted to either create the next unicorn or invest in it.

Now when I reflect on it, it was truly surreal. I was told that China was making more investments in tech startups than Silicon Valley. Everyone wanted to do AI.

I was accosted by a woman who assured me she could connect me with a few angel investors if I was interested - this before I had even pitched my ideas to her. I mentioned to a company that I had ideas, and they said they would be happy to see a prototype and we could move on from there.

Self-driving was big; everyone assumed the technology would be in production within 1-2 years. When I expressed my doubts, everyone looked at me as if I was crazy. I said I thought we were years away from self-driving, and might never reach the goal, because … well, maybe it is not a problem that can be completely solved by reinforcement learning, and it may require ethical decision-making capabilities. I also predicted that eventually we would see deaths caused by self-driving cars. I was told categorically by several self-styled “experts” that I was wrong: the models were fully working, and it was only pesky regulators and doubters like me slowing everyone down.


We aren’t reading your messages; we are just collecting the ASCII characters for each letter you have ever pressed.

I guess you could legally do this, because the company could argue that the data becomes an aggregated feature with some information loss, so it’s not 100% recoverable - for example, in the case of some out-of-vocabulary emojis.


The person who told me claimed exactly that - training is inherently lossy, so privacy is protected.

However, we now know that in certain circumstances LLMs can “leak” their training input. The possibility of this is reduced by quantisation, although quantisation also loses fidelity.

The current generation of Copilot chat is heavily quantised, primarily to save computing power, but I suspect also to prevent leakage.
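As a rough illustration of the fidelity loss, here is a minimal sketch of symmetric 8-bit weight quantisation - a generic technique, not Copilot’s actual scheme, which isn’t public:

```python
# Symmetric int8 quantisation: map float weights to integers in [-127, 127]
# with a single scale factor, then map back. The round-trip error is the
# fidelity loss mentioned above.
import numpy as np

def quantise_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    scale = float(np.abs(weights).max()) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.array([0.03, -1.2, 0.7, 0.001], dtype=np.float32)
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
error = float(np.abs(weights - restored).max())  # small but non-zero
```

Because every weight is snapped to one of 255 levels, fine detail in the weights is irreversibly discarded - which plausibly also makes memorised training data harder to recover, though that is a side effect rather than a guarantee.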
