└── Deep Learning
    └── Generative AI (GenAI)
        ├── Natural Language Generation
        │   ├── LLMs: Text Generation (ChatGPT, Gemini)
        │   └── Speech Generation (Speechify)
        ├── Image Generation
        │   └── Text-to-Image (DALL-E, Midjourney)
        └── Audio and Video Generation
            ├── Music Generation (AIVA, SUNO, MusicLM)
            └── Video Generation (SORA, VEO)

Firstly, the term ‘generative’ implies the ability to generate or create, made possible by the attention-based Transformer architecture (Vaswani et al., 2017), an utterly ground-breaking invention that revolutionised deep learning.
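At the heart of the Transformer is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V, which lets every token in a sequence weigh every other token when building its representation (Vaswani et al., 2017). Below is a minimal NumPy sketch of that formula; the toy dimensions and function name are illustrative assumptions, not taken from the paper:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al., 2017)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # weighted mix of the values

rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 4))       # 3 toy tokens, 4-dimensional embeddings
out = scaled_dot_product_attention(tokens, tokens, tokens)  # self-attention
print(out.shape)                       # (3, 4): one context-aware vector per token

Stacking many such attention layers, with learned projections producing Q, K and V, is what gives Transformers their power.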

In the years since, Large Language Models (LLMs) have consumed “a total data size exceeding 774.5 TB for pre-training corpora” (Liu et al., 2024), handling information in a way entirely new to classical computing. OpenAI, ironically no longer open, introduced ChatGPT (Generative Pre-Trained Transformer) in November 2022, and the consensus was amazement at its ability: within seconds, the LLM generated streams of text that read like expertly written professional essays (see Fig 2.1).
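As a concrete illustration, here is a minimal sketch of generating text with an LLM through OpenAI’s official Python library (pip install openai); the model name and prompt are illustrative assumptions, and a valid OPENAI_API_KEY must be set in the environment:

from openai import OpenAI

# Minimal text-generation sketch: send a prompt, print the model's reply.
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; any available chat model works
    messages=[{"role": "user", "content": "Write a short professional essay on generative AI."}],
)
print(response.choices[0].message.content)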

[Image: a tree whose trunk is labelled “471 billion parameters” and whose branches show the diverse tasks a large language model can handle, e.g. question answering, semantic parsing, arithmetic, code completion, reading comprehension, summarisation, logical inference chains, common-sense reasoning, translation, dialogue and language understanding.]
Fig 2.1: The diverse range of tasks a large language model can handle (The Royal Institution, 2023).

Other modalities were emerging to make bespoke images from words, such as the diffusion models Midjourney and DALL-E, while GANs (Generative Adversarial Networks) excel at producing high-quality images by training a generator network against a discriminator. VAEs (Variational Autoencoders) can work with text, image, audio and video using an encoder-decoder architecture, and Google’s BERT (Bidirectional Encoder Representations from Transformers) excels at finding context in text (Abdullahi, 2024).
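To make the encoder-decoder idea concrete, here is a minimal VAE sketch in PyTorch; the class name, layer sizes and flattened 28x28-image input are illustrative assumptions rather than details from Abdullahi (2024):

import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        # Encoder: compress the input into the mean and log-variance of a latent Gaussian
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        # Decoder: reconstruct (or freshly generate) data from a latent sample
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterisation trick
        return self.decoder(z), mu, logvar

vae = TinyVAE()
x = torch.rand(8, 784)             # e.g. a batch of flattened 28x28 images
recon, mu, logvar = vae(x)
print(recon.shape)                 # torch.Size([8, 784])

Because the encoder outputs a distribution rather than a single point, new data can be generated simply by sampling a latent vector and decoding it; this is what makes the model generative rather than merely reconstructive.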

[Screenshot of an AI prompt to make a song: “An upbeat catchy song about cyber security and the satisfying and soul enriching inner world of being a geek. Key words to use: python, sql injections, linux, AI, transformers, encryption if possible.”]
Fig 2.2: AI prompt in Suno.

Currently, affordable software subscriptions are making this technology accessible to everyone. For example, Suno can take a prompt and auto-generate a published song within one minute (see Fig 2.2; Suno, 2024).

Moving forward, GenAI video generators such as SORA and VEO will confidently stride into the box office with hyper-realistic movies and soundtracks. Disney has been working on ‘replacing humans’ for a decade (Dams, 2022).

[250 words]

References

Abdullahi, A. (2024) Generative AI Models: A Complete Guide, eWEEK. Available at: https://www.eweek.com/artificial-intelligence/generative-ai-model/ (Accessed: 2 June 2024).

Dams, T. (2022) AI transforming movie production at Disney – with more to come, IBC. Available at: https://www.ibc.org/news/ai-transforming-movie-production-at-disney-with-more-to-come/9075.article (Accessed: 1 June 2024).

Liu, Y. et al. (2024) Datasets for Large Language Models: A Comprehensive Survey, arXiv.org. Available at: https://arxiv.org/abs/2402.18041 (Accessed: 1 June 2024).

SUNO (2024) Code to my Heart by @angelabevan, Suno.com. Available at: https://suno.com/song/7331f662-d7eb-4d8d-9019-b893626de621 (Accessed: 1 June 2024).

The Royal Institution (2023) ‘What is generative AI and how does it work? – The Turing Lectures with Mirella Lapata’, YouTube. Available at: https://www.youtube.com/watch?v=_6R7Ym6Vy_I (Accessed: 30 May 2024).

Vaswani, A. et al. (2017) ‘Attention Is All You Need’, Advances in Neural Information Processing Systems 30. Available at: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (Accessed: 22 May 2024).

Further Reading

Anthropic (2024) Claude, Anthropic.com. Available at: https://www.anthropic.com/claude (Accessed: 29 May 2024).

Crouse, M. (2024) ChatGPT Gets an Upgrade With ‘Natively Multimodal’ GPT-4o, TechRepublic. Available at: https://www.techrepublic.com/article/openai-next-flagship-model-gpt-4o/ (Accessed: 31 May 2024).

Meta AI (2024) Introducing Meta Llama 3: The most capable openly available LLM to date, Meta.com. Available at: https://ai.meta.com/blog/meta-llama-3/ (Accessed: 12 May 2024).

MusicLM (2024) MusicLM examples, google-research.github.io. Available at: https://google-research.github.io/seanet/musiclm/examples/ (Accessed: 29 May 2024).

OpenAI (2024) Sora, OpenAI.com. Available at: https://openai.com/index/video-generation-models-as-world-simulators/ (Accessed: 29 May 2024).

Gemini Team, Google (2024) Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. Available at: https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf (Accessed: 19 February 2024).