(This page has no text content)
Advanced SQL Implementing Modern Data Solutions and ML Applications Rui Machado, Hélder Russa, and Pedro Esmeriz
Advanced SQL by Rui Machado, Hélder Russa, and Pedro Esmeriz Copyright © 2026 Rui Pedro Machado, Hélder Russa, and Pedro Esmeriz. All rights reserved. Published by O’Reilly Media, Inc., 141 Stony Circle, Suite 195, Santa Rosa, CA 95401. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com. Acquisitions Editor: Andy Kwan Development Editor: Corbin Collins Production Editor: Beth Kelly Copyeditor: Piper Content Partners Proofreader: Tim Stewart Indexer: Sue Klefstad Cover Designer: Susan Brown Cover Illustrator: Monica Kamsvaag Interior Designer: David Futato Interior Illustrator: Kate Dullea July 2026: First Edition
Revision History for the First Edition 2026-07-01: First Release See http://oreilly.com/catalog/errata.csp?isbn=9798341627482 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Advanced SQL, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 979-8-341-62748-2 [LSI]
Preface There is a moment, and perhaps you have already lived it, when you ask a large language model to write a SQL query, and it does, without hesitation and with better quality than you were expecting. A window function, a recursive common table expression (CTE), a lateral join that would have taken you twenty minutes of careful thinking: done in seconds. For many practitioners, this moment produces a peculiar mix of relief and unease. If the machine can do this, you may ask, what exactly am I here for? We wrote much of this book before that question became urgent. But the world shifted under our feet as we wrote, and we owe you an honest account of what that shift means, explaining why it matters more than ever. Tools like Claude and ChatGPT have done something remarkable: they have made advanced SQL accessible to anyone willing to describe what they need in plain language. The syntax barrier, for so long the wall separating those who could build data systems from those who merely used them, is coming down. Window functions no longer require a ritual of Stack Overflow searches. Complex aggregations no longer demand a specialist. A domain expert who understands the business can now, more than ever before, reach directly into the data. This is genuinely good and is worth celebrating. But capability without understanding is dangerous. A large language model (LLM) can generate a beautifully structured query that computes the wrong thing, silently, plausibly, and at scale. It can produce a partition strategy that works on your development dataset and destroys performance in production. It can write ML feature pipelines in SQL that look correct and encode subtle data leakage that poisons every model trained on them. The query runs, the numbers appear, and decisions get made. When we set out to write this book, we kept returning to a single word: trust. Not trust in a narrow technical sense, though constraints, tests, and data contracts matter
enormously, but trust in a human sense. The confidence a chief revenue officer feels when quoting a number in a board meeting. The certainty a data scientist has that the features fed into her model reflect reality rather than an artifact of how the extract, transform, and load (ETL) was written. The moment when data stops being a source of anxiety and becomes a genuine foundation for action. That kind of trust is not a property of the code; it is a property of the practitioner who wrote it. Every decision in advanced SQL work carries judgment: how you handle NULLs at the boundary of a join, how you define the grain before aggregating, how you account for late-arriving events in a time-series computation, and whether a machine learning feature should be calculated before or after a specific business event. These decisions require context that lives outside any prompt, the organizational history, the known quirks of a source system, the argument last quarter about how refunds should be treated, or the implicit definition of “conversion” that differs between teams. LLMs are extraordinary at execution. But they are, by their nature, unable to handle context efficiently. You are the one who holds it. That makes you, in the age of generative AI, more important, not less. Paradoxically, the easier it becomes to generate code, the more critical it becomes to understand what that code does. When writing SQL was slow and deliberate, errors were caught in the effort. Today, when fifty lines can be generated in seconds, the bottleneck has shifted from production to verification. And you cannot verify what you do not understand. This book covers advanced SQL not as a historical artifact but as the living language of modern data systems, the substrate on which analytical platforms, real-time pipelines, and machine learning applications are built. When you understand how window functions track state across ordered partitions, you will know when the generated code handles ties incorrectly. When you understand the mechanics of incremental processing, you will catch the moment an AI-suggested approach duplicates records under late data. When you understand how SQL integrates with ML workflows,
feature stores, training pipelines, and model scoring, you will identify when a well-intentioned query has introduced a leak between training and serving that will surface only when the model is in production. Understanding the system is not an alternative to using AI assistance. It is what makes AI assistance safe to use. The chapters ahead move deliberately from the foundations of advanced query mechanics through the architecture of modern data solutions and into the territory where SQL meets machine learning. Each section builds on the previous one, and we have been intentional about showing you not just how but why: the reasoning behind every pattern and the trade-offs it makes. We encourage you to read with an LLM by your side. Generate examples, ask for variations, push the model to refactor and then interrogate its choices. Use it as a tireless collaborator, one that is fast, knowledgeable, and completely dependent on you to set direction, define standards, and catch mistakes. That is the appropriate relationship between a skilled practitioner and an AI tool. Not replacement. Not fear. Partnership, with your judgment at the center. This book does not aim to make you faster at writing SQL. Rather, its goal is to make you the kind of practitioner who can be trusted: with complex systems, with data that drives consequential decisions, with the ML applications that increasingly shape what organizations know about the world. In a moment when code can be written almost on demand, that trust is the rarest and the most valuable thing you can offer. Build systems worth trusting, guide the machines well. That has always been the work, and it still is. Conventions Used in This Book The following typographical conventions are used in this book: Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions. Constant width Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords. TIP This element signifies a tip or suggestion. NOTE This element signifies a general note. WARNING This element indicates a warning or caution. Using Code Examples Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/helder-russa/advanced-sql.git. If you have a technical question or a problem using the code examples, please send email to support@oreilly.com. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a
program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission. We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Advanced SQL by Rui Machado, Hélder Russa, and Pedro Esmeriz (O’Reilly). Copyright 2026 Rui Machado, Hélder Russa, and Pedro Esmeriz, 979-8-341-62748-2.” If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com. O’Reilly Online Learning NOTE For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed. Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com. How to Contact Us
Please address comments and questions concerning this book to the publisher: O’Reilly Media, Inc. 141 Stony Circle, Suite 195 Santa Rosa, CA 95401 800-889-8969 (in the United States or Canada) 707-827-7019 (international or local) 707-829-0104 (fax) support@oreilly.com https://oreilly.com/about/contact.html We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/advanced-sql. For news and information about our books and courses, visit https://oreilly.com. Find us on LinkedIn: https://linkedin.com/company/oreilly. Watch us on YouTube: https://youtube.com/oreillymedia. Acknowledgments All of the authors would like to thank our technical reviewers Sami Ahlroos, Robert de Graaf, and Sudhanshu Badoni. We are grateful for their time and assistance in ensuring technical accuracy.
Rui Machado Books are never written alone. Behind every page and every example are people who make the work possible, not by writing a single word but by giving something far more precious: time, patience, and unconditional support. To my wife, Ana Ferreira, you are my constant. Through late nights and distracted weekends, you never gave up on this. This book exists because you believed it should. To my daughters, Maria Inês and Margarida, you are the reason any of this matters. You may not yet understand what databases are, or why someone would spend hours thinking about how data moves through a system, but I hope that one day you look at this book and see proof that curiosity and a love for building things are always worth it. I wrote this for you, even when it didn’t look that way. And then there are the people who give you the most valuable thing of all, time. A special thanks to my mother and father for the countless nursery trips. You may not have written a single line, but this book would not exist without you. Writing a technical book is an act of optimism, a belief that knowledge shared is knowledge multiplied. Hélder Russa Writing a book is a journey of self-knowledge. There are moments of pure enthusiasm, and others of real frustration. Writing a book demands consistency, patience, and resilience. I would not have gone through it the same way without you, my wife, Patrícia. Your love, your support, your understanding, and ultimately your ability to lift me up in the toughest moments were my ground when I needed it the most. For that, I am deeply grateful. To my son, Tomás. You are the purest form of love and joy I know. As I write this, you are still a baby, with those toothless smiles that can light up
an entire room. You may not realize it yet, but you are already one of the strongest forces that keep me going. I love you with all my heart. To my parents, thank you for every sacrifice you made so I could keep learning, growing, and building my own path. The foundations you gave me made all of this possible. This book is also yours. Pedro Esmeriz If someone had told me a few years ago that I would coauthor a book, I would have laughed. And yet, here we are. Writing a book for the first time is equal parts thrilling and terrifying, and I could not have done it without the people who kept me grounded along the way. To my partner in crime, Raquel—thank you for never letting this be just my project. From the very beginning, you made it ours. Every late night, every “just five more minutes” that turned into two hours, you were there, not waiting on the sidelines but cheering right beside me. This book has my name on it, but it carries your fingerprints all over it. To my daughter, Júlia—or Ju, as we already call you. You have no idea what a Transformer is—and no, I don’t mean the robots—but honestly, I hope you take your time getting there. What I must confess, though, is that you were named after a programming language. One day I stumbled upon Julia Lang and thought, “That’s actually a beautiful name.” So yes, before you even existed, tech was already shaping your life. I’m sorry. I love you endlessly. To Rui and Hélder—you are far more than coauthors. You are true friends. Working with people who are positive, inspiring, supportive, and genuinely caring made this process not just bearable but joyful. Thank you for bringing out the best in this book and in me. One book down, on to the next quest. Onwards!
Chapter 1. Evolution of SQL SQL has had a remarkable impact in the world of technology as it has continuously adapted and evolved for decades. It has shaped the way we model and query data by introducing a declarative approach, in which we specify what data we want, not how we retrieve it. From the first mainframe databases to modern cloud data warehouses, SQL has remained a constant. It combines solid mathematical foundations with practical developments and has evolved from a simple query language to the core of today’s data-driven industry. Its past, present, and future are closely connected with the history of data itself, and mastering this language remains essential for any data professional. The evolution of SQL from its beginnings in the 1970s to the SQL:2023 standard shows how a technology can evolve for a long period without losing its core philosophy. Over the years, it has standardized access to data, enabled countless applications and analytics, and spawned an ecosystem of databases that underpins virtually every industry. With consistency came innovation as SQL always embraced new ideas such as object-oriented data, online analytical processing, XML/JSON data formats, and geographical data. Few languages remain as relevant after half a century, and the combination of theoretical rigor and practical utility has secured SQL’s place at the center of data processing. Looking to the future, SQL shows no signs of stagnating. Database communities and standards bodies are actively exploring new frontiers such as integrating machine learning and vector search capabilities directly into SQL, expanding analytics capabilities for time series and pattern matching, and making SQL more accessible. For example, the latest SQL:2023 standard introduces formal support for graph queries, and better support for polymorphic storage engines that include row-based, column-based, or document stores is being discussed. Given this track record, it is likely that
these innovations will be adopted in due course to ensure that SQL remains the ideal language for an ever-growing number of data tasks. Historical Overview: From Relational to Modern SQL Understanding the future of something often involves learning its past, to envision an informed set of next steps and evolutions. With this in mind, we recall that the SQL language has its roots in the early 1970s with Dr. Edgar F. Codd’s relational model. Inspired by Dr. Codd’s relational model, IBM researchers Donald Chamberlin and Raymond Boyce developed Structured English Query Language (SEQUEL), later renamed SQL. Commercial products like SQL/DS (1981) and Oracle V2 (1979) helped establish it as a standard data access language. As SQL gained popularity, vendors implemented their own dialects with slightly different syntax and capabilities. Standard SQL-86: Laying the Foundations In 1986, the American National Standards Institute (ANSI) and International Organization for Standardization (ISO) published the first SQL standard, SQL-86, which defined the core of the language. It was relatively simple and covered basic SQL operations and data types. With it came core capabilities such as table creation and modification, basic filtering and aggregation, and support for primitive types such as INTEGER, SMALLINT, and VARCHAR. SQL-86 created a common baseline for SQL but was limited by modern standards. Notably, it lacked many features we take for granted today. For example, no JOIN operations, combining tables had to be done via Cartesian products and filters, no support for foreign keys or explicit
constraints, no views, and only minimal transaction control. SQL-86 was a basic toolkit reflecting the state of relational databases at that time. Standard SQL-89: Adding Integrity and Security Building on the initial standard, SQL-89 introduced essential features for data integrity and access control, including the definition of PRIMARY KEY constraints and commands like GRANT and REVOKE to manage user permissions. These additions strengthened SQL as a consistent and portable query language across systems. However these features were not enough for the demanding industry that was asking for ways to manipulate data efficiently and at scale, thus allowing for more use cases to be developed in SQL. That’s when the real innovation started. Standard SQL-92: Adding Modeling Capabilities Officially released in 1992, SQL-92 significantly expanded SQL capabilities and is often referred to as the first modern SQL. SQL-92, sometimes called SQL2, introduced a variety of features that greatly improved the language’s expressive power for data modeling and querying. This standard marked a major step forward by standardizing JOIN syntax, which made expressing multi-table relationships more intuitive and portable, and by formalizing subqueries and set operations like UNION and INTERSECT, which dramatically expanded SQL’s expressiveness. It also introduced new data types such as DATE, TIME, and TIMESTAMP. It also reinforced referential integrity with FOREIGN KEY and CHECK constraints and defined transaction control with COMMIT, ROLLBACK, and isolation levels such as READ COMMITTED and SERIALIZABLE. With SQL-92, SQL became much more powerful, allowing developers and analysts to perform complex joins and subqueries in one go instead of multiple passes. Databases could enforce more data quality rules internally with foreign keys and checks, and they could handle dates and times
properly. SQL could now express more sophisticated queries and maintain consistency, paving the way for it to dominate data processing in the enterprise. SQL-92 also laid the foundation for robust, enterprise-wide database applications. This standard has been widely adopted, and many core features of SQL-92 remain fundamental to all SQL dialects today. Standard SQL:1999—Handling Industry Needs In the late 1990s, databases became increasingly complex and new use cases were added, to which the SQL standard responded accordingly. SQL:1999, often referred to as SQL3, was an important further evolution that extended SQL beyond purely relational operations and incorporated concepts from the object-oriented and analytical areas. This was a turning point that introduced object-relational modeling, TRIGGERS, and STORED PROCEDURES, allowing databases to encapsulate business rules and reactive logic. Most notably, it added WITH and WITH RECURSIVE and common table expressions (CTEs), which are a first-class SQL construct that transformed how complex transformation and preparation logic is written; it also directly influenced modern open source tools like the data build tool (dbt), which rely on CTEs for modular, maintainable data pipelines. With SQL:1999, the language became far more powerful and expressive. It effectively turned SQL into a richer query and programming language within the database, reducing the need for external logic for many use cases. From the SQL:2003 Standard to SQL:2008 In the 2000s, the computing landscape saw the rise of XML for data exchange and a push for better database programmability. The SQL standards of the time focused on integrating these trends and adding some long-awaited conveniences.
SQL:2003 introduced major additions including SQL/XML, enabling native storage and querying of XML data, while standardizing SEQUENCE generators and IDENTITY columns for portable key generation. It also significantly extended window functions—adding RANK(), DENSE_RANK(), and custom window frames—which transformed SQL into a robust engine for in-database analytics. Minor revisions followed: SQL:2006 improved XML integration, and SQL:2008 added practical enhancements like INSTEAD OF triggers for view updates and TRUNCATE TABLE for efficient batch deletes in data warehouses. By 2008, SQL had transformed from a simple query language for basic operations into a robust platform for complex data processing. With XML and later JSON integration, window functions for analytics, and other improvements like sequences and improved triggers, SQL could handle a much wider array of tasks within the database. From the SQL:2011 Standard to SQL:2016 With the further development of data systems, new requirements for handling time-based data and semi-structured formats such as JSON have emerged. The SQL:2011 and SQL:2016 standards address these areas and ensure that SQL remains equipped for modern use cases: SQL:2011 introduced temporal tables via FOR SYSTEM_TIME, allowing databases to track and query historical versions of data— crucial for auditing, financial records, and time-based analysis. This unified what had previously been vendor-specific features into a standard approach for handling time-dependent data. SQL:2016 brought native support for JSON, including a JSON data type and functions like JSON_VALUE, JSON_QUERY, and JSON_OBJECT, enabling powerful document-style queries directly in SQL. This bridged the gap between relational and semi- structured data, making SQL a first-class option for modern APIs and analytics.
This standardization meant that SQL could natively process JSON documents. A user could store JSON in a column and query fields in that JSON using SQL syntax. This was an important step because it closed the gap between relational storage and NoSQL-style document storage: relational databases could now conveniently process hybrid data. By 2016, SQL had proven remarkably adaptable. It could manage relational data, time-dimensional historical data, and schema-free JSON data all within one framework. Many databases had already added JSON features on their own, and the standard unified these trends. The combination of temporal and JSON features showed that SQL was responding to modern needs like advanced auditing or semi-structured data without abandoning its relational base foundation. The Graph Revolution with SQL:2023 The latest standard, SQL:2023, demonstrates the continuous evolution of SQL to meet today’s requirements. Published in 2023, the standard introduces features that push SQL into new areas such as graph processing and further improve its usability and compatibility with modern data formats. SQL:2023 expands into new territory with native property graph queries, allowing users to express patterns like paths and neighborhoods directly on relational data, supporting use cases such as fraud detection and social network analysis. It also deepens JSON integration with a true JSON data type, improved functions like JSON_SERIALIZE and JSON_SCALAR, and better indexing potential. Alongside these, syntax simplifications and refinements, such as clearer handling of NULLs in constraints, enhance developer productivity and eliminate long-standing inconsistencies. This 2023 standard shows that even after decades, SQL continues to adapt. By incorporating graph query support and deepening JSON integration, the standard recognizes that modern data platforms often blend paradigms, namely relational, document (NoSQL), or graph. SQL is positioning itself as a universal query interface that can handle diverse data models.
These innovations ensure SQL remains relevant as data evolves— developers can use familiar SQL syntax for new kinds of queries. Importantly, learning SQL’s history is also a lesson in how computer science evolves, namely, the language has continually adapted to new technology trends while maintaining backward compatibility and the core relational principles. Transition to Multipurpose SQL Over the past decade, SQL has evolved into an indispensable multipurpose language for modern data management. Today’s SQL engines seamlessly process structured, semi-structured, and even unstructured data, integrate with machine learning (ML) and large language models (LLMs), and serve as a unified interface for various computing paradigms. In the following sections, we explore how SQL expanded into these new areas. Semi-Structured Data Handling in SQL Modern SQL databases can natively handle semi-structured data formats like JSON and XML, which are ubiquitous in web services and APIs. They offer sophisticated functions for parsing, querying, and generating JSON directly within SQL. By supporting JSON, SQL databases allow the combination of relational and document-style data in one system. For example, SQL implementations now include powerful JSON query functions such as JSON_VALUE, JSON_QUERY, and even JSON table constructs to project JSON fields as rows. Example 1-1 shows the advanced use of JSON. Example 1-1. Parsing JSON in SQL SELECT account_id, JSON_VALUE(profile_data, '$.name') AS name, JSON_QUERY(profile_data, '$.preferences') AS prefs FROM users WHERE JSON_EXISTS(profile_data, '$.preferences[? (@.category=="music")]');
Loading comments...
Reply to Comment
Edit Comment