Data Analysis with LLMs: Text, tables, images and sound (Immanuel Trummer)


Overview of Mini-Projects

Data         Project
Audio        Transcribing speech recordings to text
             Answering voice queries about tabular data
             Translating speech to another language
Graphs       Translating questions about graphs to Cypher queries
Images       Answering arbitrary questions about images
             Recognizing and tagging specific persons in images
Multimodal   Extracting information from multimodal data
             Building an autonomous agent for data analysis
Tables       Translating natural language questions to SQL
Text         Classifying product reviews by the underlying sentiment
             Extracting key information from application materials
             Clustering text documents by their content
Videos       Generating titles for videos based on content

Data Analysis with LLMs
Text, tables, images and sound
Immanuel Trummer
MANNING
SHELTER ISLAND

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2025 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

∞ Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Dustin Archibald
Technical editor: Timothy Andrew Roberts
Review editor: Kishor Rit
Production editor: Keri Hales
Copy editor: Tiffany Taylor
Proofreader: Melody Dolab
Technical proofreader: Karsten Strøbaek
Typesetter: Ammar Taha Mohamedy
Cover designer: Marija Tudor

ISBN 9781633437647
Printed in the United States of America

To my beloved family

brief contents

Part 1 Introducing language models
    1 Analyzing data with large language models
    2 Chatting with ChatGPT

Part 2 Data analysis with language models
    3 The OpenAI Python library
    4 Analyzing text data
    5 Analyzing structured data
    6 Analyzing images and videos
    7 Analyzing audio data

Part 3 Advanced topics
    8 GPT alternatives
    9 Optimizing cost and quality
    10 Software frameworks

contents

preface
acknowledgments
about this book
about the author
about the cover illustration

Part 1 Introducing language models

1 Analyzing data with large language models
    1.1 What can language models do?
    1.2 What you will learn
    1.3 How to use language models
        Prompting · Example prompt · Interfaces
    1.4 Using language models for data analysis
        Using language models directly on data · Data analysis via external tools
    1.5 Minimizing costs
        Picking the best model · Optimally configuring models · Prompt engineering
    1.6 Advanced software frameworks and agents

2 Chatting with ChatGPT
    2.1 Accessing the web interface
    2.2 Making introductions
    2.3 Processing text with ChatGPT
    2.4 Processing tables with ChatGPT
        Processing tables in the web interface · Processing tables on your platform

Part 2 Data analysis with language models

3 The OpenAI Python library
    3.1 Prerequisites
    3.2 Installing OpenAI’s Python library
    3.3 Listing available models
    3.4 Chat completion
    3.5 Customizing model behavior
        Configuring termination conditions · Configuring output generation · Configuring randomization · Customization example · Further parameters

4 Analyzing text data
    4.1 Preliminaries
    4.2 Classification
        Overview · Creating prompts · Calling the model · End-to-end classification code · Classifying documents · Running the code · Trying out variants
    4.3 Text extraction
        Overview · Generating prompts · Postprocessing · End-to-end extraction code · Trying it out
    4.4 Clustering
        Overview · Calculating embeddings · Clustering vectors · End-to-end code for text clustering · Trying it out · Other use cases for embedding vectors

5 Analyzing structured data
    5.1 Chapter outline
    5.2 A natural language query interface for analyzing game sales
        Setting up an SQLite database · SQL basics · Overview · Generating prompts for text-to-SQL translation · Complete code · Trying it out
    5.3 A general natural language query interface
        Executing queries · Extracting the database structure · Complete code · Trying it out
    5.4 A natural language query interface for graph data
        What is graph data? · Setting up a Neo4j database · The Cypher query language · Translating questions to Cypher queries · Generating prompts · Complete code · Trying it out

6 Analyzing images and videos
    6.1 Setup
    6.2 Answering questions about images
        Specifying multimodal input · Code discussion · Trying it out
    6.3 Tagging people in images
        Overview · Encoding locally stored images · Sending locally stored images to OpenAI · The end-to-end implementation · Trying it out
    6.4 Generating titles for videos
        Overview · Encoding video frames · The end-to-end implementation · Trying it out

7 Analyzing audio data
    7.1 Preliminaries
    7.2 Transcribing audio files
        Transcribing speech · End-to-end code · Trying it out
    7.3 Querying relational data via voice
        Preliminaries · Overview · Recording audio · End-to-end code · Trying it out
    7.4 Speech-to-speech translation
        Overview · Generating speech · End-to-end code · Trying it out

Part 3 Advanced topics

8 GPT alternatives
    8.1 Anthropic
        Chatting with Claude · Python library
    8.2 Cohere
        Chatting with Command R+ · Python library
    8.3 Google
        Chatting with Gemini · The Python library
    8.4 Hugging Face
        Web platform · Python library

9 Optimizing cost and quality
    9.1 Example scenario
    9.2 Untuned classifier
    9.3 Model tuning
    9.4 Model selection
    9.5 Prompt engineering
    9.6 Tunable classifier
    9.7 Fine-tuning
    9.8 Generating training data
    9.9 Starting a fine-tuning job
    9.10 Using the fine-tuned model

10 Software frameworks
    10.1 LangChain
    10.2 Classifying reviews with LangChain
        Overview · Creating a classification chain · Putting it together · Trying it out
    10.3 Agents: Putting the large language model into the driver’s seat
    10.4 Building an agent for data analysis
        Overview · Creating an agent with LangChain · Complete code for data-analysis agent · Trying it out
    10.5 Adding custom tools
        The currency converter · Trying it out
    10.6 Indexing multimodal data with LlamaIndex
        Overview · Installing LlamaIndex · Implementing a simple question-answering system · Trying it out
    10.7 Concluding remarks

index

preface

Using a large language model for the first time is an almost magical experience. I still remember my first chat with GPT-3 (nowadays an outdated model). For the first time, it seemed to me that my computer actually understood me and could react appropriately to a wide range of complex inputs. What’s more, I gave it various tasks, ranging from text analysis to coding, and the model was able to solve them based on my instructions alone! I was used to a world in which neural networks had to be trained for highly specialized tasks using large amounts of task-specific training data that had to be labeled tediously by hand, so this was an absolute game-changer that opened a world of new and exciting possibilities. I was hooked, and since then I have dedicated a large portion of my professional career to exploiting the amazing capabilities of language models.

Coming from a data-analysis background, it was natural for me to look at language models from a data-analysis perspective. How can we use language models to make the most of our data sets? Since I started using language models, a big change has been the types of data to which language models can be applied. Starting with text analysis, modern models have expanded their scope to multimodal inputs including images, audio, video, and text. This makes them an invaluable tool for any kind of data science, allowing users to build complex analysis pipelines with just a few lines of Python code along with instructions for the model in natural language describing the task to solve.

In my work, I regularly meet data scientists and data workers who could benefit tremendously from the possibilities offered by language models. However, getting into this new area can be challenging. I had to rely on blog posts and online tutorials to piece together the information I needed to use language models for various data-analysis tasks. This is the book I wish I’d had when I started my journey. I hope you will find the book useful and enjoyable!

acknowledgments

Thanks to the editorial staff at Manning, as well as to the behind-the-scenes production staff who helped shepherd this book into its final format. In addition, thanks to Timothy Andrew Roberts, the technical editor for this book.

Also, thanks to all the reviewers: Al Pezewski, Amitabh Premraj Cheekoth, Anindita Nath, Anto Aravinth, Brendan O’Hara, Clemens Baader, Darrin Bishop, Dotan Cohen, Eli Mayost, George E. Carter, Giri Swaminathan, Harcharan Kabbay, Ikechukwu Okonkwo, Jaume Valls Altadil, Jeremy Chen, John Guthrie, John V. McCarthy, John Williams, Krzysztof Jędrzejewski, Lex Drennan, Marcio Francisco Nogueira, Marjorie Roswell, Marvin Schwarze, Paul Silisteanu, Rahul Jain, Robert Rozploch, Sumit Bhattacharyya, Swapna Yeleswarapu, Thiago Britto Borges, Todd Cook, Tony Holdroyd, Vatsal Desai, Vinoth Nageshwaran, and Walter Alexander Mata López. Your suggestions helped make this a better book.

about this book

This book was written to help developers build applications for multimodal data analysis using state-of-the-art language models. It introduces language models and the most important libraries for using them in Python. Via a series of mini projects, it showcases how to use language models to analyze text, tabular data, graph data, images, videos, and audio files. By discussing topics such as prompt engineering, fine-tuning, and advanced software frameworks, the book will enable you to quickly build complex data-analysis applications with language models that are effective and cost-efficient.

Who should read this book?

Whether you are a software developer, data scientist, or hobbyist interested in data analysis, this book is for you if you want to exploit the powerful abilities of large language models to perform various types of data analysis. Prior experience with language models is unnecessary, as the book covers all the basics. However, experience with Python is helpful, at least at a beginner’s level, as this book uses Python to interact with language models.

How this book is organized: A road map

This book has 10 chapters in three parts. Part 1 introduces language models and gives a first impression of their benefits for data analysis:

    Chapter 1 introduces language models and explains how they can be used for data analysis.
    Chapter 2 guides you through a chat with ChatGPT, illustrating the analysis of text and tabular data in the ChatGPT web interface.

Part 2 introduces OpenAI’s Python library and shows how to analyze various types of data using language models directly from Python:

    Chapter 3 introduces OpenAI’s Python library, enabling users to send requests to language models and configure their behavior in various ways.
    Chapter 4 shows how to use language models to process text data: for example, to classify text documents or extract specific information.
    Chapter 5 demonstrates how to build natural language query interfaces using language models, translating questions in natural language to formal queries referring to data tables or graphs.
    Chapter 6 describes how to use multimodal language models to process images or video data for tasks such as object detection, question-answering, and captioning.
    Chapter 7 illustrates multiple use cases for language models in analyzing audio data: for instance, transcribing audio recordings, realizing voice query interfaces, or translating spoken input to other languages.

Part 3 covers advanced topics, enabling you to optimize your choice of models, configurations, and frameworks:

    Chapter 8 discusses different providers of large language models and gives a short overview of the models they offer and the corresponding Python libraries.
    Chapter 9 demonstrates methods that can be used to minimize processing fees and maximize output quality when working with language models, including optimizing model choices and parameter settings and fine-tuning.
    Chapter 10 discusses several software frameworks, particularly LangChain and LlamaIndex, that can be used to build complex applications on top of large language models with lower implementation overheads.

It is recommended that you start by reading chapter 1, which introduces important terms and concepts. You can skip chapter 2 if you have already used language models via web interfaces. Most of the remaining chapters are based on OpenAI’s Python library. It is therefore a good idea to read chapter 3 before diving into any later chapters. Chapters 4 to 7 focus on different data types and can be read in any order. Similarly, chapters 8 to 10 are independent, and you can study them in any order.

About the code

This book contains various code samples in numbered and unnumbered listings. All code in numbered listings is available for download from the book’s companion website at www.dataanalysiswithllms.com. Code, as well as suitable test data, is categorized by book chapter. Code files are named using the number of the corresponding listing in the book. The entire code and data repository can also be downloaded from the publisher’s website at www.manning.com/books/data-analysis-with-llms.

The source code is formatted in a fixed-width font like this to separate it from ordinary text. In many cases, the original source code has been reformatted; we’ve added line breaks and reworked indentation to accommodate the available page space in the book. Additionally, comments in the source code have often been removed from the listings when the code is described in the text. Code annotations accompany many of the listings, highlighting important concepts.

liveBook discussion forum

Purchase of Data Analysis with LLMs includes free access to liveBook, Manning’s online reading platform. Using liveBook’s exclusive discussion features, you can attach comments to the book globally or to specific sections or paragraphs. It’s a snap to make notes for yourself, ask and answer technical questions, and receive help from the author and other users. To access the forum, go to https://livebook.manning.com/book/data-analysis-with-llms/discussion. You can also learn more about Manning’s forums and the rules of conduct at https://livebook.manning.com/discussion.

Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the author can take place. It is not a commitment to any specific amount of participation on the part of the author, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the author some challenging questions lest his interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.

about the author

Immanuel Trummer is an associate professor of computer science at Cornell University. His research focuses on topics at the intersection of data analysis and machine learning. In particular, he studies applications of large language models to data-analysis problems, resulting in various award-winning publications and industry collaborations. His video tutorials have obtained over a million views on YouTube. Besides working with language models, Immanuel enjoys playing the violin, exploring the beautiful outdoors in upstate New York, and spending as much time as possible with his family.

about the cover illustration

The figure on the cover of Data Analysis with LLMs, titled “Le Spéculateur,” or “The Speculator,” is taken from a book by Louis Curmer published in 1841. Each illustration is finely drawn and colored by hand.

In those days, it was easy to identify where people lived and what their trade or station in life was just by their dress. Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional culture centuries ago, brought back to life by pictures from collections such as this one.

Part 1 Introducing language models

So what are language models, exactly? And how can we use them for data analysis? This part of the book answers both those questions.

In chapter 1, we discuss the principles underlying language models and what makes them special. We also discuss all the different ways in which language models can be used for data analysis, covering options to use them directly on data as well as the possibility of using them as interfaces to more specialized data-analysis tools.

In chapter 2, we have a “chat” with ChatGPT: that is, we interact with a popular language model by OpenAI. We witness the flexibility of ChatGPT when performing a variety of tasks on text, ranging from text classification to extracting specific pieces of information from text based on a concise task description. We also see that ChatGPT does well when translating questions about data, formulated in natural language, to formal query languages such as SQL.

After reading this part, you should have a good understanding of what language models are and how you can use them for data analysis.

1 Analyzing data with large language models

This chapter covers
    An introduction to language models
    Data analysis with language models
    Using language models efficiently

Language models are powerful neural networks that can be used for various data-processing tasks. This chapter introduces language models and shows how and why to use them for data analysis.

1.1 What can language models do?

We will start this section with a little poem and an associated picture (figure 1.1) connecting the two main topics of this book, data analysis and large language models:

    In the silent hum of the server’s light,
    Data flows through the veins of night.
    Rows and columns, a structured sea,
    With stories hidden, waiting to be free.
