Your Content Probably Helped Train AI

If you’ve ever created something online, whether you shared a piece of text or a photo, wrote a blog post, or even left a comment on Reddit or YouTube, you have most likely helped train an AI model somewhere out there.

There’s an old Internet saying that comes from the badly translated video game Zero Wing: “All your base are belong to us.”

Today, it’s more like, “All your content are belong to us.”

What I’m saying is that all artificial intelligence has to be trained. It has to be taught how to respond, act, or create content and answers seemingly out of thin air. But how do you train something like a computer? You need to feed it information.

How could you possibly feed it that much information? You would need written material on just about every topic imaginable so that it can be an expert on everything – how trees grow, ancient Roman history, cooking instructions for scrambled eggs, you get the idea.

And that’s the best part about this AI story: the information these machines are being fed, the material they’re being trained on, is essentially the data all of us have created on the web. All the data that’s open and freely available. It’s been scraped, neatly packaged into digital text files and datasets, and then fed to a computer.

Everything – blogs, tweets, posts on Reddit, anything that is public and shareable – is now being fed to computers so that they can copy, learn from, and act on that information.

LLM is a term that is used in AI quite a bit. It is essentially the foundation of most AI products. LLM stands for large language model. That is, after all, how you train this kind of AI: feed it gazillions of words, phrases, and sentences. The model doesn’t file that text away by keyword like a library. It learns the patterns in the language, mostly by practicing how to predict which word comes next, and then draws on those patterns when it’s asked something.
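To make the “feed it words” idea a bit more concrete, here is a toy sketch in Python. It is nowhere near a real LLM (those use neural networks trained on enormous datasets); it just counts which word tends to follow which in a tiny made-up sentence and then guesses the next word. The function names and the sample text are my own illustration, not anything from an actual AI product.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follow_counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        follow_counts[current_word][next_word] += 1
    return follow_counts


def predict_next(model, word):
    """Return the word seen most often after `word`, or None if it was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical "training data": one short sentence standing in for the whole web.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram_model(corpus)

print(predict_next(model, "the"))    # "cat" follows "the" twice, "mat" only once
print(predict_next(model, "zebra"))  # never seen in training, so None
```

The more text you feed something like this, the better its guesses get, which is roughly why everything we’ve ever posted is so valuable as training material.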

Where does the training data for these LLMs come from? Everything that’s open on the Internet. There is so much language on the Internet. The comments sections of sites like Reddit and YouTube hold an incredible amount of data, covering everything from answering complex questions to responding in specific ways.

This is the amazing thing about AI in its current state. It is the greatest rug pull I can think of. All of us gave our content away for free, and it is now being used to train this epic new tool.

Google, Microsoft, OpenAI, and Facebook have all been harvesting vast amounts of user-generated content, consuming the text, video, and everything else that’s created. They host that content forever. They store it. And they train computers on it.

All of the free content people have shared on the Internet is now being used to train and build some fairly powerful tools and services.

I actually have no problem with this. I have been sharing freely for 10+ years now on Twitter, my blog, other sites and newsletters. So be it. But I think it’s important to really understand this going forward.

For those who think it’s plagiarism or a rug pull: it’s too late now. I guess we just sit back and enjoy it, learn to call it what it is, and learn to identify it or use it to enhance our own capabilities. Maybe some of this post was written by AI… it wasn’t.

Nevertheless, the point remains: if you’re reading this, and if you’ve ever posted on social media or made free content on the Internet, you’re a part of AI just as much as I am, or more importantly, just as much as its founders.
