How Does AI Actually Work? 28 Learning Resources.
Find out how generative AI---from LLMs to diffusion models---does its magic.
I’ve been writing about generative AI for over a year.
Yet I’ll be the first to admit that I don’t really know what happens inside the black box called ChatGPT or Midjourney.
All I know is that we can get them to spit out nursery rhymes about quantum physics or paint an image of a steampunk Batman baking delicious cupcakes.
But, like, how?
I bet many of you are in the same boat.
So let’s try to get smarter together, shall we?
I’ve curated a list of resources that explain and showcase the inner workings of AI models.
Take a look.
📝 1. LLMs (e.g. GPT-4)
ChatGPT is widely credited with being the product that got everyone talking about generative AI.
Behind ChatGPT is a large language model from OpenAI called GPT-3.5 (GPT-4 for Plus users). In a nutshell, large language models are algorithms that can understand and write text.
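To make that concrete, here's a minimal sketch of the next-token prediction loop at the heart of every LLM. I'm using GPT-2 as a small, open stand-in (what ChatGPT actually runs is vastly bigger, but the principle is the same):

```python
# A minimal sketch of next-token prediction, the core mechanism behind LLMs.
# GPT-2 stands in here for models like GPT-3.5/GPT-4.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Twinkle, twinkle, little"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# The model assigns a score to every possible next token; generating text
# is just repeatedly picking one and appending it to the sequence.
with torch.no_grad():
    logits = model(input_ids).logits[0, -1]   # scores for the next token
probs = torch.softmax(logits, dim=-1)

top = torch.topk(probs, 5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx)!r}: {p:.1%}")
```

Run it and you'll see the model is very confident the next word is "star." That's all the "understanding" is at the mechanical level: one probability distribution after another.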
Here are a few great ways to learn how LLMs function:
LLM Visualization: An interactive diagram of an LLM accompanied by a step-by-step explanation of what happens under the hood.
Large language models, explained with a minimum of math and jargon: A fantastic primer that delivers on its promise.
What Is ChatGPT Doing … and Why Does It Work?: A deeper dive into what makes an LLM tick.
If video is more your thing, here's an excellent 1-hour video by Andrej Karpathy (of OpenAI, Tesla, and Google DeepMind fame) covering how LLMs work, how they're trained and fine-tuned, and more:
🖼️ 2. Text-to-image (e.g. Midjourney)
Text-to-image tools are what first got me into generative AI.
My first love was Stable Diffusion, which I promptly abandoned for Midjourney when V4 came out. (Insert the customary Distracted Boyfriend meme here.)
The current generation of image AI is mostly based on diffusion models. These work by learning to “extract” an image from static noise.
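If you'd like a taste of the math, here's a toy DDPM-style sketch of both halves of the process: the forward pass that buries an image in noise, and the reverse loop that digs it back out. Note that `denoiser` is a placeholder for the trained neural network, and I've left out the timestep-weighting details:

```python
# A toy sketch of the two halves of a diffusion model (DDPM-style).
# "denoiser" stands in for the trained network; the rest is the
# standard forward/reverse math from the original DDPM paper.
import torch

T = 1000                                   # number of noise steps
betas = torch.linspace(1e-4, 0.02, T)      # noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def add_noise(x0, t):
    """Forward process: blend a clean image with Gaussian noise."""
    eps = torch.randn_like(x0)
    xt = alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * eps
    return xt, eps                          # the model trains to predict eps

def sample(denoiser, shape):
    """Reverse process: start from pure noise, remove it step by step."""
    x = torch.randn(shape)                  # the "static noise"
    for t in reversed(range(T)):
        eps_hat = denoiser(x, t)            # the network's guess at the noise
        x = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # re-inject a bit of noise
    return x                                # the "extracted" image
```

Every image you've seen from Midjourney or Stable Diffusion came out of a loop shaped roughly like `sample()` above, just with a very large, text-conditioned network doing the denoising.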
Want to learn more? Start here:
An Introduction to Diffusion Models for Machine Learning: Great at explaining the basics, even though the specific models it references are a bit outdated now.
Introduction to Diffusion Models for Machine Learning: This gets a bit technical and mathy, but it’s worth it if you can follow.
Introduction to Diffusion Models for Image Generation – A Comprehensive Guide: This one dives into a bit of history and compares diffusion models with Generative Adversarial Networks (GANs), which used to be the standard.
If you’re going the video route, this is one of the most concise, no-hype, no-fluff options:
📽️ 3. AI video (e.g. Runway)
AI video models are advancing at breakneck speed this year. Not so long ago, I took 6 text-to-video sites for a spin.
As far as I can tell, most of them start with an image generated through the above diffusion process and then extend it into additional frames to create the final video.
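In rough pseudocode, that pipeline looks something like the sketch below. To be clear, none of these function names are a real API; `generate_image`, `extend_frame`, and `interpolate_and_upsample` are hypothetical stand-ins for the models involved:

```python
# A purely illustrative sketch of the "image first, then extend" idea.
# All three helper functions are hypothetical stand-ins, not a real library.
def text_to_video(prompt, num_frames=16):
    frames = [generate_image(prompt)]        # ordinary text-to-image diffusion
    while len(frames) < num_frames:
        # A video model denoises the next frame conditioned on the prompt
        # AND the frames so far, which keeps the motion temporally consistent.
        frames.append(extend_frame(prompt, frames))
    return interpolate_and_upsample(frames)  # many models add these extra stages
```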
Here are a few explainers:
Text-to-Video: The Task, Challenges and the Current State: This primer from Hugging Face touches on the basics, compares text-to-video to text-to-image, and provides a few free test models.
Make-A-Video: Text-to-Video Generation Without Text-Video Data [PDF]: This is a research paper from Meta AI, so it gets a bit heavy, but it breaks down the anatomy of Meta’s text-to-video models quite well.
Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning [PDF]: Another research paper from Meta, focusing on their latest Emu video model.
This clip from Google Research is an excellent intro that provides a top-level understanding of text-to-video:
🎹 4. Text-to-music (e.g. Suno)
Another area that saw multiple models spring up this year is AI-generated music. (Here's my comparison of MusicGen, MusicLM, and Riffusion.)
It was surprisingly hard to find good beginner-friendly explainers of what happens inside an AI music model. Here’s what I do have:
Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion: A brief intro from Stability AI to how their latest Stable Audio model works.
Simple and Controllable Music Generation [PDF]: A detailed research paper from Meta AI focusing on their MusicGen model.
Moûsai: Efficient Text-to-Music Diffusion Models [PDF]: A deep dive into the Moûsai text-to-audio diffusion model (also used in Stable Audio).
The following video is 2 years old, but it does a good job of covering certain concepts behind AI-generated music:
🧊 5. AI 3D generation (e.g. Luma Labs)
There’s an increasing variety of AI tools that can generate 3D assets from 2D images or even from a text prompt. (I most recently mentioned 3D just a month ago.)
There are multiple approaches at play within 3D AI generation. I'm sharing three research papers that cover several of them:
DreamFusion: Text-to-3D using 2D Diffusion: Research from Google that uses a 2D image from a diffusion model as a starting point (I sketch the core idea right after this list). Includes a video visualization.
Point·E: A System for Generating 3D Point Clouds from Complex Prompts [PDF]: Research paper from OpenAI about their “point cloud” method.
Magic3D: High-Resolution Text-to-3D Content Creation: NVIDIA’s 3D mesh approach. Includes useful video demos.
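For the curious, here's a much-simplified sketch of DreamFusion's core trick, score distillation: render the 3D scene, noise the render, and use a frozen 2D diffusion model's opinion as the training signal. Every helper below is a hypothetical stand-in, and I'm omitting the weighting terms from the paper:

```python
# A much-simplified sketch of DreamFusion-style score distillation (SDS).
# sample_random_camera, nerf.render, add_noise, and diffusion.predict_noise
# are all hypothetical stand-ins; the paper's timestep weighting is omitted.
import torch

def dreamfusion_step(nerf, diffusion, prompt, optimizer):
    camera = sample_random_camera()          # view the scene from a random angle
    image = nerf.render(camera)              # differentiable 2D render of the 3D scene
    t = torch.randint(20, 980, (1,))         # random diffusion timestep
    noisy, eps = add_noise(image, t)         # same forward process as in section 2
    eps_hat = diffusion.predict_noise(noisy, t, prompt)
    # If the render already matched the prompt, eps_hat would equal eps.
    # The gap between them becomes the gradient that reshapes the 3D scene.
    residual = (eps_hat - eps).detach()      # SDS skips backprop through the U-Net
    loss = (residual * image).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```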
This is currently the only video on YouTube that truly explains how 3D AI generation works, based on over a dozen research papers, including the above.
🤖 6. AI agents
For a brief period in early 2023, everyone got super excited about autonomous AI agents like BabyAGI and AutoGPT.
Unlike LLMs, which require constant back-and-forth prompting, these AI agents can be given a broad goal and then work towards it on their own by maintaining a running list of sub-tasks and continuously executing them.
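Stripped of the bells and whistles, a BabyAGI-style loop boils down to something like this sketch, where `llm` is a hypothetical stand-in for any chat-completion call:

```python
# An illustrative BabyAGI-style agent loop, boiled down to its essentials.
# "llm" is a hypothetical stand-in for any chat-completion call; real agents
# add memory, tools, and smarter prioritization on top of this skeleton.
def run_agent(goal: str, max_steps: int = 10):
    tasks = [f"Come up with a first step towards: {goal}"]
    results = []
    for _ in range(max_steps):
        if not tasks:
            break                                 # goal (hopefully) reached
        task = tasks.pop(0)                       # grab the next sub-task
        results.append(llm(f"Goal: {goal}\nTask: {task}\nComplete the task."))
        # Ask the model to revise its own to-do list based on what it learned.
        updated = llm(
            f"Goal: {goal}\nDone so far: {results}\nRemaining: {tasks}\n"
            "Return an updated, prioritized task list, one task per line."
        )
        tasks = [t for t in updated.splitlines() if t.strip()]
    return results
```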
For now, available agents don’t quite live up to the hype, but they’re widely considered to be the next major leap in AI.
Here are a few useful articles:
What is an AI agent?: An easily digestible intro to AI agents from Zapier.
Exploring Intelligent Agents in Artificial Intelligence: Another quick primer by Simplilearn.
Agents in Artificial Intelligence: A more thorough look that discusses and illustrates the different types of agents.
This video is a good intro that also explores potential applications of AI agents in different disciplines:
⚙️ 7. BONUS: Hands-on Machine Learning
In the course of my research, I came across a few sites that explain machine learning in visual and interactive ways. They can get quite technical and might not be everyone’s cup of tea.
But I’m dropping them here just in case:
A visual introduction to machine learning: A two-part series that eases people into machine learning through worked examples.
Play with GANs in your browser: A look at how Generative Adversarial Networks function that also gives you a sandbox to play with.
A Neural Network Playground: Another platform that requires prior knowledge to be truly useful but lets you tweak and view the impact of multiple parameters.
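And if you'd rather poke at the real thing, here's a tiny, self-contained version of what the Neural Network Playground lets you tweak: one hidden layer, a learning-rate knob, and plain gradient descent on toy data:

```python
# A tiny neural network from scratch: one hidden layer, trained by
# gradient descent on a toy XOR-like classification problem.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 2))                    # 200 random 2-D points
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]  # XOR-like labels

W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)      # hidden layer: 8 "neurons"
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)
lr = 0.5                                            # the learning-rate slider

for step in range(2000):
    h = np.tanh(X @ W1 + b1)                        # forward pass
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))            # predicted probability
    # Backward pass: nudge every weight to reduce the classification error.
    dp = (p - y) / len(X)
    dW2, db2 = h.T @ dp, dp.sum(0)
    dh = dp @ W2.T * (1 - h ** 2)
    dW1, db1 = X.T @ dh, dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"accuracy: {((p > 0.5) == y).mean():.0%}")
```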
If you’re particularly interested in machine learning, this freeCodeCamp course by Kylie Ying should be a fantastic place to start:
Over to you…
Have you come across better tutorials that I could include? Are there any relevant generative AI topics I didn’t address? If so, don’t hesitate to let me know, so I can update and expand this post.
You can send an email to whytryai@substack.com or leave a comment below.