Generative AI: Creation Across Mediums

Generative AI creates new content, from text to images to music. Discover the possibilities of this creative technology.

Generative AI refers to artificial intelligence that can create. Large Language Models (LLMs) like the GPT models behind ChatGPT have received the most attention, but generative AI covers far more: generating human-like text, synthesizing voice and speech, designing images, writing code, composing music, even creating videos or building data models, among many other tasks. Unlike traditional AI, which focuses on analyzing existing data or making predictions, generative AI produces new, original content.

Of course, a generative model can still be used to perform analyses and make predictions. For example, an LLM can simply be instructed to generate text that answers an analytical question or states a prediction. Whether this is the best or most efficient way to perform such tasks is up for debate, but the paradigm of generative AI is undoubtedly extremely powerful and versatile.

Model Architecture

At the core of most generative AI models are various types of neural networks. There are many different network architectures, and this is still an active area of research; the most relevant are:

  • Transformers, typically used for generating text
  • Diffusion Models, the current state of the art for high-quality image and video creation
  • GANs (Generative Adversarial Networks), the original architecture that enabled image generation
  • VAEs (Variational Autoencoders), which are versatile across multiple modalities like images and music

Transformers, in particular, power today’s LLMs and have significantly improved the generation of human-like text. While all of these architectures have contributed to advancements in generative AI, transformer models have become especially dominant in recent years, driving many of the most prominent applications. Ideas from transformers have also been carried over to improve models for modalities other than text.

All generative AI models work by converting their inputs into numerical representations. These representations are then processed using millions or even billions of parameters, which the model adjusts to weigh and combine inputs effectively. What makes transformers unique is their attention mechanism, which dynamically determines the relevance of different parts of the input, allowing the model to focus on the most important information for generating outputs.
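To make the attention mechanism concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation inside transformers. The toy dimensions and function names are ours, chosen purely for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (sequence_length, d) matrices of numerical token representations
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # how relevant is each position to every other?
    weights = softmax(scores)      # normalize relevance scores so each row sums to 1
    return weights @ V             # weighted combination of the value vectors

# Toy example: 4 tokens, each represented by an 8-dimensional vector
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)  # self-attention over the sequence
print(out.shape)  # (4, 8): one updated representation per token
```

The attention weights are exactly the dynamically computed relevance scores described above: for every token, the model decides how much each other token matters when producing its output representation.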

While the intricacies of different model architectures are interesting, they aren’t crucial for grasping the fundamentals of generative AI. Instead, it’s more valuable to focus on how these models are trained and the types of data they rely on during training. This will give you a clearer understanding of how generative AI creates content, and we will dive deeper into these aspects in the following section.

Training Process

When it comes to generating text, Large Language Models (LLMs) are some of the most successful generative models. Although there are many fascinating advancements in other modalities like image, music, and video generation, much of the recent progress in these areas builds on foundational concepts from LLMs. As such, this section will focus on the training and development of LLMs, which often provide insights into how other types of generative models are created and refined.

Phase 1: Pre-training on a Large Text Corpus

Task: “Predict the next word” on vast amounts of text, essentially “the internet”. The training process tries to use the available parameters as effectively as possible to internalize the training corpus; in effect, it performs a kind of lossy compression of the entire internet. This includes learning a great deal of factual knowledge. After all, it is easier to predict the next word when you know the facts. Just think of predicting the next word in a sentence like: “When heated to 100°C, water starts to […]”.
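The following toy PyTorch sketch shows what this objective looks like in code: a tiny model (far simpler than a real transformer) is trained so that, at every position, it assigns high probability to the token that actually comes next. Vocabulary size, model size, and data are made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim = 1000, 64  # toy numbers; real LLMs are vastly larger

class TinyNextTokenModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.head = nn.Linear(embed_dim, vocab_size)  # a score for every possible next token

    def forward(self, tokens):
        return self.head(self.embed(tokens))

model = TinyNextTokenModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# A batch of token IDs; the target at each position is simply the following token
tokens = torch.randint(0, vocab_size, (8, 33))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

optimizer.zero_grad()
logits = model(inputs)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()
```

Everything a pre-trained LLM “knows” is acquired by repeating this single step over an enormous corpus: the loss rewards the model for predicting the actual continuation, which indirectly forces it to absorb facts, grammar, and style.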

This leads to a model that can craft believable text, such as completing articles or drafting documents. While this capability can be useful, it is prone to generating errors, known as hallucinations. These occur when the model produces information that sounds correct but is, in fact, made up. For instance, it might invent a news event that has yet to happen or create descriptions of fictional products. This occurs because the model focuses on predicting likely sequences of words rather than verifying facts.

Phase 2: Instruction Fine-Tuning

The next step is transforming the model into a helpful assistant. While the internet contains useful information, it isn’t always reliable or structured in a way that directly benefits users. To make the model more helpful and user-focused, it undergoes fine-tuning. This involves training it to prioritize helpfulness, relevance, and clarity, ensuring the generated responses align more closely with user needs and expectations.

This fine-tuning process relies on specially curated datasets, where interactions between a user and an assistant are modeled. These conversations are manually created, which is why this phase is sometimes also called Supervised Fine-Tuning (SFT). It is structured to guide the model in adopting helpful behaviors, such as admitting when it doesn’t know something rather than fabricating information. This stage requires vast amounts of data, often millions of user-assistant interactions, and the quality of this training data is a key differentiator among leading AI models. Consequently, these datasets can be important assets that help creators of LLMs distinguish themselves from competitors.
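A single SFT training example typically pairs a user request with the desired assistant reply. The exact schema varies between vendors; the chat-style format below is just one common convention, with illustrative field names:

```python
# One hand-crafted training example for supervised fine-tuning.
# The model is trained to produce the assistant's reply given the preceding turns.
sft_example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the 2030 World Cup?"},
        {"role": "assistant",
         "content": "I don't know. The 2030 World Cup lies outside my training data, "
                    "so I can't say who won."},
    ]
}

# For training, the conversation is flattened into one token sequence;
# the loss is usually computed only on the assistant's tokens.
flat = "\n".join(f"{m['role']}: {m['content']}" for m in sft_example["messages"])
print(flat)
```

Note how the example explicitly demonstrates the desired behavior of admitting ignorance instead of inventing an answer; curating many such demonstrations is what shapes the assistant’s character.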

After fine-tuning, the model is expected to behave like a more reliable assistant, capable of responding to a wide range of queries while adhering to helpful, user-centered behavior. However, it still relies heavily on the vast knowledge it internalized during the pre-training phase, meaning it may sometimes pull from incomplete or inaccurate information. Fine-tuning enhances its performance, but the pre-training phase remains crucial to how the model generates responses.

Phase 3: Comparison Labels

Manually creating the fine-tuning data in phase two is costly and time-consuming, especially for complex queries. A more efficient approach is to have the model generate several responses to a question and ask human labelers to rank these answers by quality. Additionally, once a model is deployed, usage data from real interactions can be leveraged to gather similar feedback automatically. This reduces the need for manual labeling while still providing critical insights that help refine the model’s performance.

These rankings produce comparison labels, which serve as new training data. Using these comparisons, the model is further fine-tuned to improve the quality of its responses. This process, called Reinforcement Learning from Human Feedback (RLHF), enables continuous refinement and adaptation.
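One common way to put comparison labels to work is to train a reward model with a pairwise ranking loss: it should score the preferred answer higher than the rejected one. Here is a minimal sketch of that loss (the reward model itself is omitted, and the names are hypothetical):

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    # Bradley-Terry style objective: push the preferred answer's score
    # above the rejected one. Both arguments hold one scalar score per pair.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores a reward model might assign to two candidate answers per question
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.7, 0.9, -0.5])
print(pairwise_ranking_loss(chosen, rejected))  # driven down during training
```

The trained reward model can then stand in for human labelers, scoring the LLM’s outputs during reinforcement learning so that feedback scales far beyond what manual ranking alone could provide.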

Note that phases 2 and 3 are somewhat interchangeable. Phase 3 can be seen as an optional but more efficient way to further refine a model. Conversely, if you had perfect comparison labels reinforcing precisely the desired behavior, there would be little need for phase 2 anymore. In practice, however, it is not clear how to realistically obtain such data, and it is fair to assume that today’s best models went through all three phases.

Phase 4 (Optional): Task-Specific Fine-Tuning

In an optional subsequent phase, task-specific fine-tuning can be performed using dedicated datasets that are structured similarly to those from earlier phases but focus on more specific content. For example, if you want a model to construct queries for an internal API and interact with it, suitable examples of exactly those interactions are highly valuable.

Such data enables fine-tuning of an open-source model on your own hardware. Alternatively, some vendors allow their closed-source models to be fine-tuned on demand. This process typically does not modify the original model weights but instead trains small additional components known as adapters, which can be trained at a much lower cost and with far fewer resources than the original model.
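As a rough illustration of the adapter idea, here is a conceptual LoRA-style layer in PyTorch: the original weight matrix stays frozen, and only two small low-rank matrices are trained. The dimensions, rank, and scaling below are illustrative, not taken from any particular model:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the original weights stay frozen
        # Two small trainable matrices; rank << min(in_features, out_features)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen base output plus a low-rank, trainable correction
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # far fewer parameters than the frozen 512x512 base weight
```

Because only the small matrices A and B receive gradients, the number of trainable parameters, and thus the memory and compute required, shrinks dramatically compared with updating the full model.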

The second part of this article series about GenAI highlights some of the most impactful and innovative uses of this powerful technology. Read on.

 


Author © 2024: Dr. Björn Buchhold – www.linkedin.com/in/björn-buchhold-3a497a209/

