(This page has no text content)
(This page has no text content)
The Art of SQL While heeding the profit of my counsel, avail yourself also of any helpful circumstances over and beyond the ordinary rules. —Sun Tzu, The Art of War
Other resources from O’Reilly Related titles SQL in a Nutshell SQL Tuning SQL Pocket Guide SQL Cookbook™ oreilly.com oreilly.com is more than a complete catalog of O’Reilly books. You’ll also find links to news, events, articles, weblogs, sample chapters, and code examples. oreillynet.com is the essential portal for developers interested in open and emerging technologies, including new plat- forms, programming languages, and operating systems. Conferences O’Reilly brings diverse innovators together to nurture the ideas that spark revolutionary industries. We specialize in documenting the latest tools and systems, translating the innovator’s knowledge into useful skills for those in the trenches. Visit conferences.oreilly.com for our upcoming events. Safari Bookshelf (safari.oreilly.com) is the premier online reference library for programmers and IT professionals. Conduct searches across more than 1,000 books. Sub- scribers can zero in on answers to time-critical questions in a matter of seconds. Read the books on your Bookshelf from cover to cover or simply flip to the page you need. Try it today for free.
Beijing • Cambridge • Farnham • Köln • Paris • Sebastopol • Taipei • Tokyo The Art of SQL Stéphane Faroult with Peter Robson
The Art of SQL by Stéphane Faroult with Peter Robson Copyright © 2006 O’Reilly Media, Inc. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc. 1005 Gravenstein Highway North, Sebastopol, CA 95472 O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (safari.oreilly.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com. Editor: Jonathan Gennick Production Editors: Jamie Peppard and Marlowe Shaeffer Copyeditor: Nancy Reinhardt Indexer: Ellen Troutman Zaig Cover Designer: Mike Kohnke Interior Designer: Marcia Friedman Illustrators: Robert Romano, Jessamyn Read, and Lesley Borash Printing History: March 2006: First Edition. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. The Art of SQL and related trade dress are trademarks of O’Reilly Media, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps. While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein. This book uses RepKover™, a durable and flexible lay-flat binding. ISBN-10: 0-596-00894-5 ISBN-13: 978-0-596-00894-9 [M] [10/06]
The French humorist Alphonse Allais (1854–1905), once dedicated one of his short stories as follows: To the only woman I love and who knows it well. . . . with the following footnote: This is a very convenient dedication that I cannot recommend too warmly to my fellow writers. It costs nothing, and can, all at once, please five or six persons. I can take a piece of wise advice when I meet one. STÉPHANE FAROULT
(This page has no text content)
C O N T E N T S Preface ix 1 Laying Plans 1 Designing Databases for Performance 2 Waging War 27 Accessing Databases Efficiently 3 Tactical Dispositions 55 Indexing 4 Maneuvering 75 Thinking SQL Statements 5 Terrain 105 Understanding Physical Implementation 6 The Nine Situations 127 Recognizing Classic SQL Patterns 7 Variations in Tactics 167 Dealing with Hierarchical Data 8 Weaknesses and Strengths 199 Recognizing and Handling Difficult Cases 9 Multiple Fronts 225 Tackling Concurrency 10 Assembly of Forces 247 Coping with Large Volumes of Data 11 Stratagems 279 Trying to Salvage Response Times 12 Employment of Spies 307 Monitoring Performance Photo Credits 333 Index 335
(This page has no text content)
ix C H A P T E R P R E F A C E There used to be a time when what is known today as “Information Technology” or IT was less glamorously known as “Electronic Data Processing.” And the truth is that for all the buzz about trendy techniques, the processing of data is still at the core of our sys- tems—and all the more as the volume of data under management seems to be increasing even faster than the speed of processors. The most vital corporate data is today stored in databases and accessed through the imperfect, but widely known, SQL language—a com- bination that had begun to gain acceptance in the pinstriped circles at the beginning of the 1980s and has since wiped out the competition. You can hardly interview a young developer today who doesn’t claim a good working knowledge of SQL, the lingua franca of database access, a standard part of any basic IT course. This claim is usually reasonably true, if you define knowledge as the ability to obtain, after some effort, functionally correct results. However, enterprises all over the world are today confronted with exploding volumes of data. As a result, “functionally correct” results are no longer enough: they also have to be fast. Database performance has become a major headache in many companies. Interestingly, although everyone agrees that the source of performance issues lies in the code, it seems accepted everywhere that the first concern of developers should be to provide code that works— which seems to be a reasonable expectation. The thought seems to be that the database
x P R E F A C E access part of their code should be as simple as possible, for maintenance reasons, and that “bad SQL” should be given to senior database administrators (DBAs) to tweak and make run faster, with the help of a few magic database parameters. And if such tweaking isn’t enough, then it seems that upgrading the hardware is the proper course to take. It is quite often that what appears to be the common-sense and safe approach ends up being extremely harmful. Writing inefficient code and relying on experts for tuning the “bad SQL” is actually sweeping the dirt under the carpet. In my view, the first ones to be concerned with performance should be developers, and I see SQL issues as something encompassing much more than the proper writing of a few queries. Performance seen from a developer’s perspective is something profoundly different from “tuning,” as practiced by DBAs. A database administrator tries to get the most out of a system—a given hardware, processors and storage subsystem, or a given version of the database. A database administrator may have some SQL skills and be able to tune an especially poorly performing statement. But developers are writing code that may well run for 5 to 10 years, surviving several major releases (Internet-enabled, ready-for-the-grid, you name it) of the Database Management System (DBMS) it was written for—and on several generations of hardware. Your code must be fast and sound from the start. It is a sorry assessment to make but if many developers “know” SQL, very few have a sound understanding of this language and of the relational theory. Why Another SQL Book? There are three main types of SQL books: books that teach the logic and the syntax of a particular SQL dialect, books that teach advanced techniques and take a problem-solving approach, and performance and tuning books that target experts and senior DBAs. On one hand, books show how to write SQL code. On the other hand, they show how to diagnose and fix SQL code that has been badly written. I have tried, in this book, to teach people who are no longer novices how to write good SQL code from the start and, most importantly, to have a view of SQL code that goes beyond individual SQL statements. Teaching how to use a language is difficult enough; but how can one teach how to efficiently use a language? SQL is a language that can look deceivingly simple once you have been initiated. And yet it allows for an almost infinite number of cases and combinations. The first comparison that occurred to me was the game of chess, but it suddenly dawned on me that chess was invented to teach war. I have a natural tendency to consider every new performance challenge as a battle to be fought against an army of rows, and I realized that the problem of teaching developers how to use databases efficiently was similar to the problem of teaching officers how to conduct a war. You need knowledge, you need skills, and you need talent. Talent cannot be taught, but it can be nurtured. This is what most strategists, from Sun Tzu, who wrote his Art of War 25
P R E F A C E xi centuries ago, to modern-day generals, have believed—so they tried to pass on the experience acquired on the field through simple maxims and rules that they hoped would serve as guiding stars among the sound and fury of battles. I have tried to apply this method to more peaceful aims, and I have mostly followed the same plan as Sun Tzu—and I’ve borrowed his title. Many respected IT specialists claim the status of scientists; “Art” seems to me more appropriate than “Science” when it comes to defining an activity that requires flair, experience, and creativity, as much as rigor and understanding.* It is quite likely that my fondness for Art will be frowned upon by some partisans of Science, who claim that for each SQL problem, there is one optimal solution, which can be attained by rigorous analysis and a good knowledge of data. However, I don’t see the two positions at odds. Rigor and a scientific approach will help you out of one problem at one given moment. In SQL development, if you don’t have the uncertainties linked to the next move of the adversary, the big uncertainties lie in future evolutions. What if, rather unexpectedly, the volume of this or that table increases? What if, following a merger, the number of users doubles? What if we want to keep several years of data online? How will a program behave on hardware totally different from what we have now? Some architectural choices are gambles on the future. You will certainly need rigor and a very sound theoretical knowledge—but those qualities are prerequisites of any art. Ferdinand Foch, the future Supreme Commander of the Allied armies of WWI, remarked at a lecture at the French Ecole Supérieure de Guerre in 1900 that: The art of war, like all other arts, has its theory, its principles—otherwise, it wouldn’t be an art. This book is not a cookbook, listing problems and giving “recipes.” The aim is much more to help developers—and their managers—to raise good questions. You may well still write awful, costly queries after having read and digested this book. One sometimes has to. But, hopefully, it will be knowingly and with good reason. Audience This book is targeted at: • Developers with significant (one year or, preferably, more) experience of develop- ment with an SQL database • Their managers • Software architects who design programs with significant database components * One of my favorite computer books happens to be D.E. Knuth’s classic Art of Computer Programming (Addison Wesley).
xii P R E F A C E Although I hope that some DBAs, and particularly those that support development databases, will enjoy reading this book, I am sorry to tell them I had somebody else in mind while writing. Assumptions This Book Makes I assume in this book that you have already mastered the SQL language. By mastering I don’t mean that you took SQL 101 at the university and got an A+, nor, at the other end of the spectrum, that you are an internationally acknowledged SQL guru. I mean that you have already developed database applications using the SQL language, that you have had to think about indexing, and that you don’t consider a 5,000-row table to be a big table. It is not the purpose of this book to tell you what a “join” is—not even an outer one—nor what indexes are meant to be used for. Although you don’t need to feel totally comfortable with arcane SQL constructs, if, when given a set of tables and a question to answer, you are unable to come up with a functionally correct piece of code, there are probably a couple of books you had better read before this one. I also assume that you are at least familiar with one computer language and with the principles of computer programming. I assume that you have already been down in the trenches and that you have already heard users complain about slow and poorly performing systems. Contents of This Book I found the parallel between war and SQL so strong that I mostly followed Sun Tzu’s outline—and kept most of his titles.* This book is divided into twelve chapters, each containing a number of principles or maxims. I have tried to explain and illustrate these principles through examples, preferably from real-life cases. Chapter 1, Laying Plans Examines how to design databases for performance Chapter 2, Waging War Explains how programs must be designed to access databases efficiently Chapter 3, Tactical Dispositions Tells why and how to index Chapter 4, Maneuvering Explains how to envision SQL statements Chapter 5, Terrain Shows how physical implementation impacts performance * A few titles were borrowed from Clausewitz’s On War.
P R E F A C E xiii Chapter 6, The Nine Situations Covers classic SQL patterns and how to approach them Chapter 7, Variations in Tactics Explains how to deal with hierarchical data Chapter 8, Weaknesses and Strengths Provides indications about how to recognize and handle some difficult cases Chapter 9, Multiple Fronts Describes how to face concurrency Chapter 10, Assembly of Forces Addresses how to cope with large volumes of data Chapter 11, Stratagems Offers a few tricks that will help you survive rotten database designs Chapter 12, Employment of Spies Concludes the book by explaining how to define and monitor performance Conventions Used in This Book The following typographical conventions are used in this book: Italic Indicates emphasis and new terms, as well as book titles. Constant width Indicates SQL and, generally speaking, programming languages’ keywords; table, index and column names; functions; code; or the output from commands. Constant width bold Shows commands or other text that should be typed literally by the user. This style is used only in code examples that mix both input and output. Constant width italic Shows text that should be replaced with user-supplied values. This icon signifies a maxim and summarizes an important principle in SQL. N O T E This is a tip, suggestion, or general note. It contains useful supplementary information about the topic at hand.
xiv P R E F A C E Using Code Examples This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact O’Reilly for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission. O’Reilly, Media Inc. appreciates, but does not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “The Art of SQL by Stéphane Faroult with Peter Robson. Copyright © 2006 O’Reilly Media, 0-596-00894-5.” If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact the publisher at permissions@oreilly.com. Comments and Questions Please address comments and questions concerning this book to the publisher: O’Reilly Media, Inc. 1005 Gravenstein Highway North Sebastopol, CA 95472 (800) 998-9938 (in the U.S. or Canada) (707) 829-0515 (international or local) (707) 829-0104 (fax) The publisher has a web page for this book, where we list errata, examples, and any additional information. You can access this page at: http://www.oreilly.com/catalog/artofsql To comment or ask technical questions about this book, send email to: bookquestions@oreilly.com For more information about our books, conferences, Resource Centers, and the O’Reilly Network, see O’Reilly’s web site at: http://www.oreilly.com You can also visit the author’s company web site at: http://www.roughsea.com
P R E F A C E xv Safari® Enabled When you see a Safari® Enabled icon on the cover of your favorite technology book, that means the book is available online through the O’Reilly Network Safari Bookshelf. Safari offers a solution that’s better than e-books. It’s a virtual library that lets you easily search thousands of top tech books, cut and paste code samples, download chapters, and find quick answers when you need the most accurate, current information. Try it for free at http://safari.oreilly.com. Acknowledgments Writing a book in a language that is neither your native language nor the language of the country where you live requires an optimism that (in retrospect) borders on insanity. Fortunately, Peter Robson, whom I had met at several conferences as a fellow speaker, brought to this book not only his knowledge of the SQL language and database design issues, but an unabated enthusiasm for mercilessly chopping my long sentences, placing adverbs where they belong, or suggesting an alternative to replace a word that was last heard in Merry England under the Plantagenets.* Being edited by Jonathan Gennick, the best-selling author of the O’Reilly SQL Pocket Guide and several other noted books, was a slightly scary honor. I discovered in Jonathan an editor extremely respectful of authors. His professionalism, attention to detail, and challenging views made this book a much better book than Peter and I would have written on our own. Jonathan also contributed to give a more mid-Atlantic flavor to this book (as Peter and I discovered, setting the spelling checker to “English (US)” is a prerequisite, but not quite enough). I would like to express my gratitude to the various people, from three continents, who took the time to read parts or the whole of the drafts of this book and to give me frank opinions: Philippe Bertolino, Rachel Carmichael, Sunil CS, Larry Elkins, Tim Gorman, Jean-Paul Martin, Sanjay Mishra, Anthony Molinaro, and Tiong Soo Hua. I feel a particular debt towards Larry, because the concept of this book probably finds its origin in some of our email discussions. I would also like to thank the numerous people at O’Reilly who made this book a reality. These include Marcia Friedman, Rob Romano, Jamie Peppard, Mike Kohnke, Ron Bilodeau, Jessamyn Read, and Andrew Savikas. Thanks, too, to Nancy Reinhardt for her most excellent copyedit of the manuscript. * For readers unfamiliar with British history, the Plantagenet dynasty ruled England between 1154 and 1485.
xvi P R E F A C E Special thanks to Yann-Arzel Durelle-Marc for kindly providing a suitable scan of the picture used to illustrate Chapter 12. Thanks too, to Paul McWhorter for permission to use his battle map as the basis for the Chapter 6 figure. Finally, I would like to thank Roger Manser and the staff at Steel Business Briefing for supplying Peter and me with an office and much-needed coffee for work sessions in London, halfway between our respective bases, and Qian Lena (Ashley) for providing me with the Chinese text of the Sun Tzu quote at the beginning of this book.
Chapter 1. C H A P T E R O N E Laying Plans Designing Databases for Performance C’est le premier pas qui, dans toutes les guerres, décèle le génie. It is the first step that reveals genius in all wars. —Joseph de Maistre (1754–1821) Lettre du 27 Juillet 1812 à Monsieur le Comte de Front
2 C H A P T E R O N E The great nineteenth century German strategist, Clausewitz, famously remarked that war is the continuation of politics by other means. Likewise, any computer program is, in one way or another, the continuation of the general activity within an organization, allowing it to do more, faster, better, or cheaper. The main purpose of a computer program is not simply to extract data from a database and then to process it, but to extract and process data for some particular goal. The means are not the end. A reminder that the goal of a given computer program is first of all to meet some business requirement* may come across as a platitude. In practice, the excitement of technological challenges often slowly causes attention to drift from the end to the means, from upholding the quality of the data that records business activity to writing programs that perform as intended and in an acceptable amount of time. Like a general in command of his army at the beginning of a campaign, we must know clearly what our objectives are— and we must stick to them, even if unexpected difficulties or opportunities make us alter the original plan. Whenever the SQL language is involved, we are fighting to keep a faithful and consistent record of business activity over time. Both faithfulness and consistency are primarily associated with the quality of the database model. The database model that SQL was initially designed to support is the relational model. One cannot overemphasize the importance of having a good model and a proper database design, because this is the very foundation of any information system. The Relational View of Data A database is nothing but a model of a small part of a real-life situation. As any representation, a database is always an imperfect model, and a very narrow depiction of a rich and complex reality. There is rarely a single way to represent some business activity, but rather several variants that in a technical sense will be semantically correct. However, for a given set of processes to apply, there is usually one representation that best meets the business requirement. The relational model is thus named, not because you can relate tables to one another (a popular misconception), but as a reference to the relationships between the columns in a table. These are the relationships that give the model its name; in other words, relational means that if several values belong to the same row in a table, they are related. The way columns are related to each other defines a relation, and a relation is a table (more exactly, a table represents one relation). The business requirements determine the scope of the real-world situation that is to be modeled. Once you have defined the scope, you can proceed to identify the data that you * The expression business requirement is meant to encompass non-commercial as well as commercial activities.
Comments 0
Loading comments...
Reply to Comment
Edit Comment