4.2. Generative AI


Generative Adversarial Networks (GANs)

Core Concepts

Generative Adversarial Networks (GANs) are a class of machine learning frameworks introduced by Ian Goodfellow and colleagues in 2014. A GAN consists of two competing neural networks: a Generator and a Discriminator. The generator’s role is to create synthetic data (e.g. images) resembling the training data, while the discriminator’s job is to distinguish the generator’s fakes from real data. Formally, the two networks are locked in a zero-sum game: the generator tries to “fool” the discriminator by producing outputs that appear real, and the discriminator tries to correctly classify inputs as real or fake. Through this adversarial training process, both models improve – the generator learns to produce more realistic outputs, and the discriminator becomes a better judge. This iterative back-and-forth continues until the discriminator can no longer reliably tell generated data apart from real examples.

To train a GAN, one typically alternates between updating the discriminator and the generator. First, the discriminator is shown a mix of real samples and generator-produced samples and learns to label real vs. fake correctly. Next, the generator is updated based on feedback from the discriminator’s mistakes – it adjusts its parameters to generate data that the current discriminator is more likely to accept as real. Over many such rounds, the generator can indirectly learn the true data distribution by exploiting the discriminator’s responses. Notably, GAN training is unsupervised (no manual labels for real/fake are needed beyond the real dataset) and can be viewed as finding a Nash equilibrium between the two networks’ objectives. The generator network is often a deep neural network that maps random noise (a latent vector) to a synthetic data sample, and the discriminator is usually a classifier network (for images, often a convolutional neural network) that outputs a probability of the input being real. This setup is analogous to an evolutionary “arms race,” with the generator evolving to create ever more lifelike data and the discriminator evolving to become a stricter critic.
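In Goodfellow et al.’s original formulation, this adversarial game is captured by a single minimax objective, written below in standard notation (D(x) is the discriminator’s estimated probability that x is real, and G(z) is a sample generated from random noise z):

    min_G max_D V(D, G) = E_{x ~ p_data}[ log D(x) ] + E_{z ~ p_z}[ log(1 - D(G(z))) ]

The discriminator D tries to maximize this value (assigning high probability to real data and low probability to fakes), while the generator G tries to minimize it; at the theoretical optimum, the generator’s distribution matches the data distribution and D outputs 1/2 everywhere.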

Training GANs – Pseudocode Example: The following pseudocode illustrates a simplified GAN training loop:

for each training step:
    # 1. Train Discriminator on real and fake data
    real_data = sample_batch(real_dataset)
    z = sample_noise(batch_size)               # latent random input
    fake_data = Generator(z)                   # produce synthetic data
    loss_D = -[log(Discriminator(real_data)) + log(1 - Discriminator(fake_data))]
    update(Discriminator, grad(loss_D))        # maximize probability of correct labels

    # 2. Train Generator to fool the Discriminator
    z = sample_noise(batch_size)
    fake_data = Generator(z)
    loss_G = - log(Discriminator(fake_data))   # generator tries to make the discriminator classify fakes as real
    update(Generator, grad(loss_G))            # update generator parameters

In practice, numerous tricks (such as feature matching, label smoothing, Wasserstein loss, etc.) are used to stabilize GAN training, as it can be notoriously difficult to converge. When successful, however, the result is a generator model capable of producing remarkably realistic data samples from noise.
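To make the loop above concrete, here is a minimal runnable sketch in PyTorch. It is not any particular published GAN: the small MLP networks, the flattened 28×28 image shape, the random stand-in for a real data loader, and the one-sided label-smoothing value (0.9) are all illustrative assumptions.

# Minimal GAN training loop in PyTorch (illustrative sketch, not a specific published model).
import torch
import torch.nn as nn

latent_dim, img_dim, batch_size = 100, 28 * 28, 64

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, img_dim), nn.Tanh(),             # outputs scaled to [-1, 1]
)
discriminator = nn.Sequential(
    nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),                              # raw logit: higher = "looks real"
)

opt_G = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def sample_real_batch():
    # Placeholder for a real data loader; random tensors stand in for images here.
    return torch.rand(batch_size, img_dim) * 2 - 1

for step in range(1000):
    # 1. Train the discriminator on real and fake data
    real = sample_real_batch()
    fake = generator(torch.randn(batch_size, latent_dim)).detach()    # no gradient into G here
    loss_D = bce(discriminator(real), torch.full((batch_size, 1), 0.9)) \
           + bce(discriminator(fake), torch.zeros(batch_size, 1))     # 0.9 = one-sided label smoothing
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 2. Train the generator to fool the discriminator
    fake = generator(torch.randn(batch_size, latent_dim))
    loss_G = bce(discriminator(fake), torch.ones(batch_size, 1))      # push D(G(z)) toward "real"
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

The structure mirrors the pseudocode: the discriminator update uses detached generator outputs so only the discriminator’s weights change, and the generator update uses the non-saturating objective. Swapping the binary cross-entropy for a Wasserstein loss with gradient penalty is one of the common stabilization changes mentioned above.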

Applications

GANs are a leading technique in generative AI with wide-ranging applications:

  • Synthetic Data Creation: GANs can learn to generate additional data samples to augment limited datasets. In domains like medicine, this has been used to produce realistic synthetic medical images (e.g. MRI or CT scans) that help train AI models without risking patient privacy. For example, researchers have shown that a GAN (specifically a Progressive GAN) trained on retinal images can create artificial examples that improve a disease detection model’s performance, while the synthetic images remain free of any real patient identifiers. By increasing both the quantity and diversity of training data, GAN-generated datasets can make AI models more robust – especially in cases of rare conditions or where data sharing is restricted. This use of GANs provides an “ethical alternative to using sensitive patient data,” allowing medical training and experimentation without compromising privacy.

  • Realistic Image Generation: One of the most headline-grabbing uses of GANs is creating photorealistic images. StyleGAN, for instance, can generate high-resolution human face images so authentic that they could easily be mistaken for real people. ThisPersonDoesNotExist.com is a popular demo that shows random AI-generated faces – none of the people in those images actually exist. Because GANs learn the intricate structure of photographs (facial features, lighting, textures), they can produce new images that match the statistical distribution of real photos.

An example of a photorealistic face generated by a GAN (NVIDIA’s StyleGAN). The person in this image is not real – it was produced by a neural network trained on a large dataset of portraits. GAN models can synthesize faces and other objects with such detail that they are often indistinguishable from real images.

Beyond human faces, GANs have been used for image super-resolution (upscaling images with realistic details), image-to-image translation (e.g. turning sketches into photos, or day images to night), and even generating art and scenery. In creative fields, artists employ GANs to spawn endless streams of new visuals. The images produced by GANs have a unique “dreamlike” aesthetic – “the images created by GANs have become the defining look of contemporary AI art,” as one art critic noted. This has given rise to a new genre of AI-generated art.

  • Art and Design: GANs are empowering designers and artists to explore new frontiers. In graphic design and fashion, GANs can generate novel designs for everything from clothing to interior layouts. In the art world, GAN-generated pieces have made headlines by appearing in prestigious galleries and auctions. A notable case was the “Edmond de Belamy” portrait – a canvas painted by a GAN trained on classical portraits – which sold for $432,500 at Christie’s in 2018. GAN art often involves a human-AI collaboration: artists curate or tweak the outputs of GANs to arrive at a final piece. The creative potential of GANs is so high that they can produce an essentially infinite stream of new images, leaving the artist’s role to sift through or guide the output. This synergy has led to “creative adversarial networks” – a term for using GANs explicitly in the creative process. Overall, GANs have expanded the toolkit of creators, allowing rapid prototyping of visuals and inspiring styles that would be hard to imagine manually.

Case Studies

Deepfake Technology: One famous (and controversial) application of GAN-like generative models is deepfakes. Deepfakes refer to synthetic media where a person’s likeness (in video or audio) is convincingly replaced with someone else’s. For example, an ordinary video can be manipulated so that the subject’s face is replaced by a celebrity’s face, yielding a fake video that appears real. GANs and related models enable this by learning to map one person’s facial movements and expressions onto another’s appearance. While the technology has compelling entertainment and film post-production uses (e.g. recreating photorealistic effects without expensive CGI), it has also been used maliciously – such as creating non-consensual fake videos of public figures or fake pornography. Concerns have been raised about GAN-generated fake images and videos being used for misinformation or defamation. By 2019, the problem had grown enough that some jurisdictions (e.g. California) even passed laws to outlaw certain uses of synthetic media (particularly deepfake porn and election-related fakes). Deepfake detectors and forensic techniques are now an active area of research to combat these AI-generated forgeries. This case study highlights a dual aspect of GANs: the same power that allows creative expression also poses ethical and security challenges. The term “deepfake” itself comes from the deep learning techniques used to create fake media, and GANs (or encoder-decoder networks) are often the engine behind these systems. As GANs continue to improve, deepfakes become ever more realistic, raising the stakes for detection and policy. Researchers and companies are working on watermarking AI-generated content and advanced verification to mitigate the risks while still leveraging the positive uses of this technology.

Synthetic Training Data for Medical AI: A positive case study for GANs is their use in medical research to generate synthetic training data. High-quality annotated medical images (like MRI scans, retinal images, or pathology slides) are invaluable for training diagnostic AI systems, but such data can be scarce, costly, and locked down by privacy regulations. GANs offer a solution by learning to produce artificial medical images that are statistically similar to real patient data but do not correspond to any actual individual. For instance, a GAN can be trained on a set of brain MRI scans and then generate additional fake MRI images that have realistic tumors or anatomy. These synthetic MRIs can be used to augment the training set of a neural network that detects tumors, thereby improving its performance. Crucially, because the GAN-generated scans aren’t real patients, they sidestep patient privacy issues. In a 2022 study on retinopathy of prematurity (ROP) (an eye disease in infants), researchers used a GAN variant to create synthetic retinal images with and without signs of disease. When they trained a diagnostic model on the GAN-synthesized images, it actually performed better at detecting the disease in real infants than a model trained only on the limited real images – achieving a higher AUC (area under curve) and agreeing more closely with expert doctors. This demonstrated that GAN-generated data can not only fill data gaps but also introduce useful variability that makes models more generalizable. Such synthetic data generation is becoming a key tool in medical AI for addressing data paucity, class imbalance, and privacy preservation. It’s important, however, to validate that the GAN’s outputs are clinically accurate and that no subtle patient-identifiable features leak into the synthetic data. Researchers are actively developing methods to ensure synthetic images are both realistic and privacy-guaranteed (e.g. by using differential privacy or by confirming that generated samples don’t exactly replicate any single real sample). This case study underlines how generative adversarial training can be harnessed for socially beneficial applications in healthcare and beyond.

Large Language Models (LLMs)

Core Concepts

Large Language Models (LLMs) are a prominent category of generative AI models focused on producing and understanding human language. Modern LLMs are built on the Transformer architecture, which has become the de facto standard for NLP (Natural Language Processing) tasks. Transformers were introduced in the seminal 2017 paper “Attention Is All You Need” by Vaswani et al. – a breakthrough that replaced earlier recurrent neural network (RNN) approaches with a new architecture based entirely on self-attention mechanisms. In a Transformer, rather than processing words sequentially one by one (as RNNs did), the model processes the entire sequence in parallel and learns contextual relationships through attention. The attention mechanism allows the model to weigh the relevance of every other word in the input when producing an output for a given position. In essence, the model learns “what to attend to” – for example, in the sentence “The cat sat on the mat because it was tired,” an attention-based model can figure out that “it” refers to “the cat” by attending to that word’s representation when processing “it”. This ability to capture long-range dependencies and relationships is a key reason for Transformers’ success.

The Transformer model architecture (encoder-decoder structure) as originally proposed by Vaswani et al. (2017). In the encoder (left), input text is processed through self-attention layers (allowing each word to attend to others) and feed-forward networks, producing a set of contextual vector representations. The decoder (right) generates output text autoregressively, using masked self-attention (to only attend to earlier output words) and attending to the full encoder output. Multi-head attention is used in both encoder and decoder: multiple parallel attention heads enable the model to capture different types of relationships (e.g. syntactic vs. semantic) in the data. This architecture can be scaled up by increasing layers and attention heads, which has been central to building large language models.
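To make the attention computation at the core of this architecture concrete, the sketch below implements scaled dot-product attention and a simplified multi-head self-attention layer in PyTorch. It is a bare-bones illustration under simplifying assumptions: no masking, dropout, residual connections, or layer normalization, and the dimensions (512-dimensional model, 8 heads) simply follow the original paper’s defaults.

# Scaled dot-product attention and a simplified multi-head self-attention layer.
# Illustrative sketch: no masking, dropout, residual connections, or layer normalization.
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, d_head)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))    # similarity of each query to every key
    weights = torch.softmax(scores, dim=-1)                     # attention distribution over positions
    return weights @ v                                          # weighted sum of value vectors

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):                 # defaults follow the original paper
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)              # project inputs to queries, keys, values
        self.out = nn.Linear(d_model, d_model)                  # recombine the concatenated heads

    def forward(self, x):                                       # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split_heads(m):                                     # (b, t, d_model) -> (b, heads, t, d_head)
            return m.reshape(b, t, self.n_heads, self.d_head).transpose(1, 2)

        attended = scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
        attended = attended.transpose(1, 2).reshape(b, t, -1)   # concatenate the heads again
        return self.out(attended)

# Example: a batch of 2 "sentences" of 10 tokens, each token a 512-dim embedding
x = torch.randn(2, 10, 512)
print(MultiHeadSelfAttention()(x).shape)                        # torch.Size([2, 10, 512])

Stacking such layers with feed-forward blocks, residual connections, and normalization yields a Transformer encoder; adding a causal mask to the attention scores turns it into the decoder-style self-attention used by GPT-like models.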

Several core concepts enable LLMs to function effectively:

  • Transformer Architecture & Multi-Head Attention: The Transformer is composed of stacks of self-attention layers and feed-forward layers, usually with residual connections and layer normalization to help training. Each attention head in a self-attention layer projects the input sequence into a query-key-value representation and computes attention scores to re-weight the information from other tokens. By having multiple heads, the model can attend to different aspects of context simultaneously (for example, one head might focus on syntactic relations like which noun a pronoun refers to, while another head might focus on broader thematic context). The original Transformer design includes an encoder-decoder structure: an encoder that reads the input text (e.g. a source sentence in translation) and a decoder that produces output text (e.g. the translated sentence) one token at a time, while attending to the encoder’s output. Many large language models, however, use a simplified decoder-only Transformer, especially for generation tasks. For instance, OpenAI’s GPT models dispense with the encoder and use a large stacked Transformer decoder that predicts the next word given all previous words (this is known as an autoregressive language model). Whether using encoder-decoder (useful for tasks like translation or summarization where you have input-to-output mapping) or decoder-only (useful for free-form text generation), the Transformer’s key innovation is the ability to model complex, long-range dependencies in text with computationally efficient parallelism.

  • Attention Mechanisms: The self-attention mechanism introduced by Transformers is a generalization of the “attention” idea first developed for RNN-based translation models in 2014. It allows a model to dynamically focus on relevant parts of the input sequence as it generates each part of the output. In Transformers, self-attention means each position in a sequence attends to every other position, enabling a rich encoding of context. For example, when a transformer-based model processes a word in a sentence, it can “see” all the words before (and after, in the case of the encoder) to build its representation for that word. This is fundamentally different from a left-to-right RNN which at a given word only had access to the past words (unless using bidirectional RNNs). Multi-head attention extends this by having multiple sets of attention computations in parallel – the outputs are then concatenated, allowing the model to capture different perspectives on the data. Attention mechanisms also naturally handle variable-length input and enable transfer learning: a transformer can be pre-trained on a huge corpus and then fine-tuned for a specific task without architectural changes, largely thanks to its flexible attention-based representation of text.

  • Pre-training and Fine-tuning: Most large language models go through two phases: pre-training on a very large corpus of text with a generic objective, and then fine-tuning on a narrower dataset for a specific task or to align with desired behavior. Pre-training typically uses self-supervised learning: for example, training the model to predict the next word in millions of sentences (the approach for GPT models) or to fill in missing words in a sentence (the approach for BERT models). Through this process, the model learns a broad understanding of language, facts, and even some reasoning patterns from the raw text, effectively acquiring a massive store of implicit knowledge about language and the world. The result of pre-training is a foundation model that is not specialized to any one task but has general language capabilities. After that, fine-tuning is employed to specialize the model or make it follow specific instructions. For instance, OpenAI’s GPT-3 was pre-trained on hundreds of billions of tokens of internet text, then later fine-tuned (in models like InstructGPT) on demonstration data and with human feedback to make it better at following user instructions politely and accurately. Fine-tuning typically uses a labeled dataset for a task (if available) or other techniques like Reinforcement Learning from Human Feedback (RLHF), where humans rate the model’s outputs and this feedback is used to refine the model’s behavior. The pre-train-then-fine-tune paradigm has been hugely successful – it allows leveraging vast amounts of unlabeled text to build a powerful general model, after which only a relatively small amount of task-specific data or feedback is needed to achieve high performance on a target task. This is how, for example, a giant pre-trained model can be fine-tuned to become a conversational assistant: by training on example dialogues and preferences, it shifts from “predicting generic internet text” to responding helpfully and safely to user queries. The Transformer architecture is very amenable to fine-tuning; one can often fine-tune an LLM on a new task in a matter of hours or days, as opposed to training from scratch, which can take weeks on supercomputers. A minimal code sketch of this fine-tuning step follows this list.
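As a rough illustration of the fine-tuning step described above (pre-training itself requires far more data and compute), the sketch below adapts a small pre-trained causal language model using the Hugging Face transformers library. The choice of GPT-2 as the base checkpoint, the two toy instruction/response strings, and the output directory name are placeholder assumptions; a real run would use a proper dataset, batching, and evaluation.

# Minimal sketch of fine-tuning a pre-trained causal language model.
# Assumptions: GPT-2 as the base checkpoint (any causal LM works), a toy
# in-memory dataset, and a plain PyTorch training loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")    # weights from self-supervised pre-training
tokenizer.pad_token = tokenizer.eos_token               # GPT-2 has no pad token by default

# Toy instruction-style examples standing in for a real fine-tuning dataset
examples = [
    "Instruction: Summarize the report.\nResponse: The report covers Q3 revenue growth.",
    "Instruction: Translate 'bonjour' to English.\nResponse: Hello.",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()

for epoch in range(3):
    for text in examples:
        batch = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
        # For causal LMs, passing labels = input_ids yields the next-token prediction loss
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("gpt2-finetuned-demo")            # hypothetical output directory

RLHF-style alignment is a further stage on top of supervised fine-tuning like this: human preference ratings train a reward model, which then guides additional updates to the language model.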

Applications

Large language models have demonstrated a remarkable range of applications in natural language generation and processing. Some key applications include:

  • Content Generation: LLMs can autonomously generate human-like text content. This includes writing paragraphs of descriptive text, composing stories or poetry, drafting news articles, and even writing computer code. Given a prompt, a model like GPT-3 or GPT-4 can produce several coherent paragraphs on virtually any topic. This capability is being used for generating first drafts of documents, brainstorming, and creative writing assistance. For example, LLMs can write fictional narratives, create marketing copy, generate product descriptions, or even produce creative dialogues. In software development, models trained on programming languages (such as OpenAI’s Codex, based on GPT) can generate code snippets or entire functions from natural language prompts. Content generation has also been applied to data augmentation (creating synthetic text data for training other models) and to build conversational agents that produce natural responses. A key aspect of using LLMs for generation is control – users often guide generation with prompts or use techniques like prompt engineering to influence style and content. As these models have grown, their generated content has become increasingly fluent and hard to distinguish from human-written text for many applications.

  • Automated Writing Assistance: LLMs serve as powerful writing aids, helping users to compose and refine text. They can autocomplete sentences, suggest rewrites, fix grammar, and adapt the tone or style of writing. For instance, in email clients and word processors, AI-powered autocomplete features (like Google’s Smart Compose or Microsoft’s AI Editor) use underlying language models to suggest likely next phrases or corrections as you type. Given a rough draft or an outline, an LLM can expand it into a more polished piece, or conversely, summarize a long text into key points. Tools like Grammarly have incorporated advanced language models to not only correct grammar but also suggest clarity improvements and more effective wording. More recently, products like Microsoft 365 Copilot (integrated into Word, Outlook, etc.) and Notion AI act as on-demand writing assistants: you can ask them to draft a section of a document, brainstorm different phrasings, or adjust the tone of a passage (e.g. make it more formal or more friendly). Such systems rely on the generative capability of LLMs to produce text aligned with user instructions. The benefit of LLM-based writing assistants is that they can understand the context of what’s been written so far and produce contextually relevant suggestions. Over time, these assistants learn from user feedback (e.g. which suggestions were accepted) to improve their recommendations. This application augments human productivity by handling mundane or repetitive writing tasks and by providing inspiration during writer’s block.

  • Language Translation: Translating text from one language to another has been dramatically advanced by the advent of large language models and transformers. While earlier machine translation systems (like older versions of Google Translate) relied on phrase-based statistical methods, modern systems use massive neural models that capture deeper linguistic relationships. A transformer-based translation model can be pre-trained on multiple languages and then fine-tuned, or even prompted, to perform translation. Models such as Google’s multilingual T5 (mT5) or Meta’s NLLB (No Language Left Behind) are trained on many languages at once and can translate between any of them. Large models like GPT-4 have shown the ability to translate text with quality close to dedicated translation systems, even without being explicitly fine-tuned for translation, simply by being prompted with examples (this is sometimes called few-shot translation; a small prompt-construction sketch follows this list). The attention mechanism in transformers is well-suited to translation because it can align words in the source sentence with words in the target sentence (this emerges as a byproduct of training). Today’s LLM-powered translation tools can handle subtle nuances and context better than past approaches, making machine translation more accurate and fluent. For example, an LLM can maintain context across a long paragraph, ensuring that pronouns and formality are handled correctly in the translated output. Language translation using LLMs is not limited to text – paired with speech recognition and synthesis, it can enable near real-time speech-to-speech translation. Overall, LLMs have taken machine translation to new levels, enabling more natural and context-aware translations for a wide range of language pairs.
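As a small illustration of the few-shot prompting mentioned in the translation item above, the sketch below only constructs the prompt string; the example sentence pairs are made up, and the finished prompt would be submitted to whatever LLM API or local model is available.

# Building a few-shot translation prompt for a general-purpose LLM.
# The demonstration pairs are illustrative placeholders.
few_shot_pairs = [
    ("Where is the train station?", "Où est la gare ?"),
    ("I would like a coffee, please.", "Je voudrais un café, s'il vous plaît."),
]

def build_translation_prompt(sentence: str) -> str:
    lines = ["Translate English to French."]
    for en, fr in few_shot_pairs:
        lines.append(f"English: {en}\nFrench: {fr}")
    lines.append(f"English: {sentence}\nFrench:")   # the model completes this line
    return "\n\n".join(lines)

print(build_translation_prompt("The meeting was moved to Thursday."))

The trailing "French:" line is what the model completes; swapping the demonstration pairs changes the language pair or style without any retraining.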

Case Studies

  • OpenAI’s GPT Series: Perhaps the most famous large language models are the GPT (Generative Pre-trained Transformer) series developed by OpenAI. The series illustrates the rapid progress in LLM capabilities. The first model, GPT-1 (2018), had on the order of 100 million parameters and demonstrated the feasibility of pre-training on a large corpus and fine-tuning for specific tasks. This was followed by GPT-2 in 2019, which contained ~1.5 billion parameters and could generate surprisingly coherent and fluid text. In fact, GPT-2’s outputs were so convincing that OpenAI initially chose not to release the full model out of concern for misuse (e.g. generating fake news). In 2020, OpenAI introduced GPT-3, a massively larger model with 175 billion parameters, trained on hundreds of billions of tokens drawn from web text, books, and other sources. GPT-3 could perform tasks it wasn’t explicitly trained for (like answering questions or composing essays) just by being given prompts – a phenomenon known as “few-shot learning,” where the model adapts to a task based on examples in the prompt. OpenAI offered GPT-3 via an API, and it powered numerous applications in writing, coding, and conversation. The impact of GPT-3 was significant, but arguably ChatGPT (released at the end of 2022) made even bigger waves. ChatGPT is essentially a fine-tuned descendant of GPT-3 (often referred to as GPT-3.5) optimized for dialogue through human feedback. Its public demo showed that AI could engage in detailed, coherent back-and-forth conversations with users on almost any topic, which captured the public’s imagination. In 2023, OpenAI announced GPT-4, which brought improvements in accuracy, the ability to handle multimodal inputs (it can accept images as part of the prompt, not just text), and greater reasoning ability. Notably, OpenAI did not disclose GPT-4’s exact size or architecture details, reflecting the increasingly commercial and competitive landscape of LLM development. Nevertheless, evaluations of GPT-4 showed it could achieve high scores on professional exams and exhibit a deeper understanding of complex queries than its predecessors. The GPT series has thus set the benchmark for LLM performance and has been a driving force in popularizing generative AI.

  • Google Bard: Bard is Google’s entrant into the large language model chatbot arena, launched in 2023. It was developed in response to the emergence of ChatGPT and the generative AI race. Bard is powered by Google’s LaMDA (Language Model for Dialogue Applications), a family of conversational LLMs that Google had been developing internally. LaMDA was designed to carry on open-ended conversations and trained specifically on dialogue data to produce responses that are sensible, factual, and contextually appropriate. Google announced Bard in February 2023 as a lightweight version of LaMDA that could be opened to the public. Initially, Bard was released to a limited set of users and later integrated into Google’s search interface to augment search results with conversational answers. For example, you can ask Google’s Bard-enabled search a question and get a human-like explanatory answer with relevant info, in addition to the usual list of links. Bard (and LaMDA behind it) uses the transformer architecture as well, and Google has reported focusing on metrics like sensibleness, specificity, and safety in its training. Compared to the GPT series, Bard’s strength comes from Google’s vast search data integration – it can use up-to-date information from the web (in some implementations) and is tuned for high-quality factual responses. Google later announced upgrades to Bard by incorporating more advanced models (such as PaLM 2) and enabling multimodal features, reflecting the fast-paced evolution in this field. The Bard case study exemplifies how major tech companies are deploying LLMs to transform search and communication, while also grappling with challenges of ensuring the AI’s responses are accurate and harmless.

  • Anthropic Claude: Claude is a large language model developed by Anthropic, an AI safety-focused company founded by former OpenAI researchers. First released in March 2023, Claude is designed as an AI assistant similar to ChatGPT, capable of conversing, answering questions, and performing a variety of text-processing tasks. What sets Claude apart is Anthropic’s emphasis on AI alignment and safety. They introduced a novel training approach called “Constitutional AI” to make Claude helpful, honest, and harmless. In Constitutional AI, instead of only using human feedback to correct the model, the model is guided by a set of written principles or a “constitution.” The model engages in a self-dialogue where one part of it generates a potentially problematic output and another part critiques it based on the principles, and then it revises accordingly. This process, along with reinforcement learning from human feedback, was used to fine-tune Claude. The result is that Claude is often more resistant to giving harmful or disallowed answers, and it tries to explain its reasoning or refusal when it won’t comply with a request. Technically, Claude is also a transformer-based GPT-like model with billions of parameters (Anthropic hasn’t publicly disclosed the exact size, but it’s on the order of the largest GPT models). Anthropic has iterated on Claude with versions like Claude 2, improving its capabilities and context length (how much text it can consider at once). Claude has been made available via an API and has been adopted in enterprise settings where safety is paramount. In evaluations, Claude is competitive with other top-tier LLMs on many tasks, and sometimes produces more detailed or cautious explanations. The Claude case is illustrative of the broader trend to address the issues of LLMs (like hallucinations or toxic outputs) through better training techniques. It demonstrates an alternative path to building AI assistants that prioritize alignment with human values, which is an active area of research as AI models become more powerful.

References:

  1. I. Goodfellow et al., “Generative Adversarial Nets,” in Proc. of the 27th International Conf. on Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, Dec. 2014, pp. 2672–2680.

  2. A. Radford et al., “Language Models are Unsupervised Multitask Learners,” OpenAI, Tech. report, 2019. (Introduces GPT-2)

  3. A. Vaswani et al., “Attention Is All You Need,” in Advances in Neural Information Processing Systems, vol. 30, Curran Associates, Inc., 2017, pp. 5998–6008.

  4. Wikipedia, “Generative adversarial network,” Wikipedia, The Free Encyclopedia, 2023 (accessed Oct. 1, 2025). [Online]. Available: https://en.wikipedia.org/wiki/Generative_adversarial_network

  5. J. Vincent, “A never-ending stream of AI art goes up for auction,” The Verge, Mar. 5, 2019.

  6. A. S. Coyner et al., “Synthetic Medical Images for Robust, Privacy-Preserving Training of Artificial Intelligence: Application to Retinopathy of Prematurity Diagnosis,” Ophthalmology Science, vol. 2, no. 2, p. 100126, Jun. 2022.

  7. Wikipedia, “Large language model,” Wikipedia, The Free Encyclopedia, 2023 (accessed Oct. 1, 2025). [Online]. Available: https://en.wikipedia.org/wiki/Large_language_model

  8. Wikipedia, “LaMDA,” Wikipedia, The Free Encyclopedia, 2023 (accessed Oct. 1, 2025). [Online]. Available: https://en.wikipedia.org/wiki/LaMDA

  9. Wikipedia, “Claude (language model),” Wikipedia, The Free Encyclopedia, 2023 (accessed Oct. 1, 2025). [Online]. Available: https://en.wikipedia.org/wiki/Claude_(language_model)

  10. OpenAI, “GPT-4 Technical Report,” OpenAI, Mar. 2023 (GPT-4 capabilities and release details).

  11. California State Legislature, “AB-602 (Nonconsensual Sexually Explicit Deepfakes) and AB-730 (Deceptive Media About Election Candidates),” Oct. 2019. (Legislation addressing deepfakes.)
