Allen B. Downey Think Data Structures ALGORITHMS AND INFORMATION RETRIEVAL IN JAVA
Allen B. Downey Think Data Structures Algorithms and Information Retrieval in Java Boston Farnham Sebastopol TokyoBeijing
978-1-491-97239-7 [LSI] Think Data Structures by Allen B. Downey Copyright © 2017 Allen Downey. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/insti‐ tutional sales department: 800-998-9938 or corporate@oreilly.com. Editors: Nan Barber and Brian Foster Production Editor: Kristen Brown Copyeditor: Charles Roumeliotis Proofreader: Amanda Kersey Indexer: Allen B. Downey Interior Designer: David Futato Cover Designer: Karen Montgomery Illustrator: Rebecca Demarest July 2017: First Edition Revision History for the First Edition 2017-07-07: First Release The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Think Data Structures, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. Think Data Structures is available under the Creative Commons Attribution-NonCommercial 3.0 Unpor‐ ted License. The author maintains an online version at http://greenteapress.com/wp/think-data-structures/.
Table of Contents Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii 1. Interfaces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Why Are There Two Kinds of List? 2 Interfaces in Java 2 List Interface 3 Exercise 1 4 2. Analysis of Algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Selection Sort 8 Big O Notation 10 Exercise 2 11 3. ArrayList. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 Classifying MyArrayList Methods 15 Classifying add 17 Problem Size 19 Linked Data Structures 19 Exercise 3 21 A Note on Garbage Collection 23 4. LinkedList. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Classifying MyLinkedList Methods 25 Comparing MyArrayList and MyLinkedList 27 Profiling 28 Interpreting Results 30 Exercise 4 32 iii
5. Doubly Linked List. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Performance Profiling Results 33 Profiling LinkedList Methods 35 Adding to the End of a LinkedList 36 Doubly Linked List 38 Choosing a Structure 39 6. Tree Traversal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 Search Engines 41 Parsing HTML 42 Using jsoup 44 Iterating Through the DOM 46 Depth-First Search 46 Stacks in Java 47 Iterative DFS 48 7. Getting to Philosophy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 Getting Started 51 Iterables and Iterators 52 WikiFetcher 54 Exercise 5 55 8. Indexer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Data Structure Selection 57 TermCounter 59 Exercise 6 61 9. The Map Interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Implementing MyLinearMap 65 Exercise 7 66 Analyzing MyLinearMap 67 10. Hashing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 Hashing 71 How Does Hashing Work? 73 Hashing and Mutation 74 Exercise 8 76 11. HashMap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 Exercise 9 77 Analyzing MyHashMap 78 The Tradeoffs 80 iv | Table of Contents
Profiling MyHashMap 81 Fixing MyHashMap 81 UML Class Diagrams 83 12. TreeMap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 What’s Wrong with Hashing? 85 Binary Search Tree 86 Exercise 10 88 Implementing a TreeMap 89 13. Binary Search Tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 A Simple MyTreeMap 93 Searching for Values 94 Implementing put 95 In-Order Traversal 97 The Logarithmic Methods 98 Self-Balancing Trees 100 One More Exercise 100 14. Persistence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Redis 102 Redis Clients and Servers 103 Making a Redis-Backed Index 103 Redis Data Types 105 Exercise 11 107 More Suggestions If You Want Them 108 A Few Design Hints 109 15. Crawling Wikipedia. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 The Redis-Backed Indexer 111 Analysis of Lookup 113 Analysis of Indexing 114 Graph Traversal 115 Exercise 12 116 16. Boolean Search. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 Crawler Solution 119 Information Retrieval 121 Boolean Search 122 Exercise 13 122 Comparable and Comparator 124 Extensions 127 Table of Contents | v
17. Sorting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 Insertion Sort 130 Exercise 14 131 Analysis of Merge Sort 133 Radix Sort 134 Heap Sort 136 Bounded Heap 137 Space Complexity 138 Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 vi | Table of Contents
Preface The Philosophy Behind the Book Data structures and algorithms are among the most important inventions of the last 50 years, and they are fundamental tools software engineers need to know. But in my opinion, most of the books on these topics are too theoretical, too big, and too “bot‐ tom up”: Too theoretical Mathematical analysis of algorithms is based on simplifying assumptions that limit its usefulness in practice. Many presentations of this topic gloss over the simplifications and focus on the math. In this book I present the most practical subset of this material and omit or de-emphasize the rest. Too big Most books on these topics are at least 500 pages, and some are more than 1,000. By focusing on the topics I think are most useful for software engineers, I kept this book under 150 pages. Too “bottom up” Many data structures books focus on how data structures work (the implementa‐ tions), with less about how to use them (the interfaces). In this book, I go “top down”, starting with the interfaces. Readers learn to use the structures in the Java Collections Framework before getting into the details of how they work. Finally, some books present this material out of context and without motivation: it’s just one damn data structure after another! I try to liven it up by organizing the top‐ ics around an application—web search—that uses data structures extensively, and is an interesting and important topic in its own right. This application motivates some topics that are not usually covered in an introduc‐ tory data structures class, including persistent data structures with Redis. vii
I have made difficult decisions about what to leave out, but I have made some com‐ promises. I include a few topics that most readers will never use, but that they might be expected to know, possibly in a technical interview. For these topics, I present both the conventional wisdom as well as my reasons to be skeptical. This book also presents basic aspects of software engineering practice, including ver‐ sion control and unit testing. Most chapters include an exercise that allows readers to apply what they have learned. Each exercise provides automated tests that check the solution. And for most exercises, I present my solution at the beginning of the next chapter. Prerequisites This book is intended for college students in computer science and related fields, as well as professional software engineers, people training in software engineering, and people preparing for technical interviews. Before you start this book, you should know Java pretty well; in particular, you should know how to define a new class that extends an existing class or implements an inter face. If your Java is rusty, here are two books you might start with: • Downey and Mayfield, Think Java (O’Reilly Media, 2016), which is intended for people who have never programmed before • Sierra and Bates, Head First Java (O’Reilly Media, 2005), which is appropriate for people who already know another programming language If you are not familiar with interfaces in Java, you might want to work through the tutorial called “What Is an Interface?” at http://thinkdast.com/interface. One vocabulary note: the word “interface” can be confusing. In the context of an application programming interface (API), it refers to a set of classes and methods that provide certain capabilities. In the context of Java, it also refers to a language feature, similar to a class, that speci‐ fies a set of methods. To help avoid confusion, I’ll use “interface” in the normal type‐ face for the general idea of an interface, and interface in the code typeface for the Java language feature. You should also be familiar with type parameters and generic types. For example, you should know how create an object with a type parameter, like ArrayList<Integer>. If not, you can read about type parameters at http://thinkdast.com/types. You should be familiar with the Java Collections Framework (JCF), which you can read about at http://thinkdast.com/collections. In particular, you should know about the List interface and the classes ArrayList and LinkedList. viii | Preface
Ideally you should be familiar with Apache Ant, which is an automated build tool for Java. You can read more about Ant at http://thinkdast.com/anttut. And you should be familiar with JUnit, which is a unit testing framework for Java. You can read more about it at http://thinkdast.com/junit. Working with the Code The code for this book is in a Git repository at http://thinkdast.com/repo. Git is a version control system that allows you to keep track of the files that make up a project. A collection of files under Git’s control is called a repository. GitHub is a hosting service that provides storage for Git repositories and a conve‐ nient web interface. It provides several ways to work with the code: • You can create a copy of the repository on GitHub by pressing the Fork button. If you don’t already have a GitHub account, you’ll need to create one. After forking, you’ll have your own repository on GitHub that you can use to keep track of code you write. Then you can clone the repository, which downloads a copy of the files to your computer. • Alternatively, you could clone the repository without forking. If you choose this option, you don’t need a GitHub account, but you won’t be able to save your changes on GitHub. • If you don’t want to use Git at all, you can download the code in a ZIP archive using the Download button on the GitHub page, or this link: http://think dast.com/zip. After you clone the repository or unzip the ZIP file, you should have a directory called ThinkDataStructures with a subdirectory called code. The examples in this book were developed and tested using Java SE Development Kit 7. If you are using an older version, some examples will not work. If you are using a more recent version, they should all work. Conventions Used in This Book The following typographical conventions are used in this book: Italic Indicates emphasis, keystrokes, menu options, URLs, and email addresses. Bold Used for new terms where they are defined. Preface | ix
Constant width Used for program listings, as well as within paragraphs to refer to filenames, file extensions, and program elements such as variable and function names, data types, statements, and keywords. Constant width bold Shows commands or other text that should be typed literally by the user. Safari® Books Online Safari Books Online (www.safaribooksonline.com) is an on- demand digital library that delivers expert content in both book and video form from the world’s leading authors in tech‐ nology and business. Technology professionals, software developers, web designers, and business and crea‐ tive professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training. Safari Books Online offers a range of plans and pricing for enterprise, government, education, and individuals. Members have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kauf‐ mann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and hundreds more. For more information about Safari Books Online, please visit us online. How to Contact Us Please address comments and questions concerning this book to the publisher: O’Reilly Media, Inc. 1005 Gravenstein Highway North Sebastopol, CA 95472 800-998-9938 (in the United States or Canada) 707-829-0515 (international or local) 707-829-0104 (fax) To comment or ask technical questions about this book, send email to bookques‐ tions@oreilly.com. x | Preface
For more information about our books, courses, conferences, and news, see our web‐ site at http://www.oreilly.com. Find us on Facebook: http://facebook.com/oreilly Follow us on Twitter: http://twitter.com/oreillymedia Watch us on YouTube: http://www.youtube.com/oreillymedia Contributors This book is an adapted version of a curriculum I wrote for the Flatiron School in New York City, which offers a variety of online classes related to programming and web development. They offer a class based on this material, which provides an online development environment, help from instructors and other students, and a certificate of completion. You can find more information at http://flatironschool.com. • At the Flatiron School, Joe Burgess, Ann John, and Charles Pletcher provided guidance, suggestions, and corrections from the initial specification all the way through implementation and testing. Thank you all! • I am very grateful to my technical reviewers, Barry Whitman, Patrick White, and Chris Mayfield, who made many helpful suggestions and caught many errors. Of course, any remaining errors are my fault, not theirs! • Thanks to the instructors and students in Data Structures and Algorithms at Olin College, who read this book and provided useful feedback. • Charles Roumeliotis copyedited the book for O’Reilly Media and made many improvements. If you have comments or ideas about the text, please send them to feedback@greentea‐ press.com. Preface | xi
(This page has no text content)
CHAPTER 1 Interfaces This book presents three topics: Data structures Starting with the structures in the Java Collections Framework (JCF), you will learn how to use data structures like lists and maps, and you will see how they work. Analysis of algorithms I present techniques for analyzing code and predicting how fast it will run and how much space (memory) it will require. Information retrieval To motivate the first two topics, and to make the exercises more interesting, we will use data structures and algorithms to build a simple web search engine. Here’s an outline of the order of topics: • We’ll start with the List interface and you will write classes that implement this interface two different ways. Then we’ll compare your implementations with the Java classes ArrayList and LinkedList. • Next I’ll introduce tree-shaped data structures and you will work on the first application: a program that reads pages from Wikipedia, parses the contents, and navigates the resulting tree to find links and other features. We’ll use these tools to test the “Getting to Philosophy” conjecture (you can get a preview by reading http://thinkdast.com/getphil). • We’ll learn about the Map interface and Java’s HashMap implementation. Then you’ll write classes that implement this interface using a hash table and a binary search tree. 1
• Finally, you will use these classes (and a few others I’ll present along the way) to implement a web search engine, including a crawler that finds and reads pages, an indexer that stores the contents of web pages in a form that can be searched efficiently, and a retriever that takes queries from a user and returns relevant results. Let’s get started. Why Are There Two Kinds of List? When people start working with the Java Collections Framework, they are sometimes confused about ArrayList and LinkedList. Why does Java provide two implementa‐ tions of the List interface? And how should you choose which one to use? I will answer these questions in the next few chapters. I’ll start by reviewing interfaces and the classes that implement them, and I’ll present the idea of “programming to an interface”. In the first few exercises, you’ll implement classes similar to ArrayList and Linked List, so you’ll know how they work, and we’ll see that each of them has pros and cons. Some operations are faster or use less space with ArrayList; others are faster or smaller with LinkedList. Which one is better for a particular application depends on which operations it performs most often. Interfaces in Java A Java interface specifies a set of methods; any class that implements this interface has to provide these methods. For example, here is the source code for Comparable, which is an interface defined in the package java.lang: public interface Comparable<T> { public int compareTo(T o); } This interface definition uses a type parameter, T, which makes Comparable a generic type. In order to implement this interface, a class has to • Specify the type T refers to, and • Provide a method named compareTo that takes an object as a parameter and returns an int. 2 | Chapter 1: Interfaces
For example, here’s the source code for java.lang.Integer: public final class Integer extends Number implements Comparable<Integer> { public int compareTo(Integer anotherInteger) { int thisVal = this.value; int anotherVal = anotherInteger.value; return (thisVal<anotherVal ? -1 : (thisVal==anotherVal ? 0 : 1)); } // other methods omitted } This class extends Number, so it inherits the methods and instance variables of Number; and it implements Comparable<Integer>, so it provides a method named compareTo that takes an Integer and returns an int. When a class declares that it implements an interface, the compiler checks that it provides all methods defined by the interface. As an aside, this implementation of compareTo uses the “ternary operator”, sometimes written ?:. If you are not familiar with it, you can read about it at http://think dast.com/ternary. List Interface The Java Collections Framework (JCF) defines an interface called List and pro‐ vides two implementations, ArrayList and LinkedList. The interface defines what it means to be a List; any class that implements this interface has to provide a particular set of methods, including add, get, remove, and about 20 more. ArrayList and LinkedList provide these methods, so they can be used interchangea‐ bly. A method written to work with a List will work with an ArrayList, LinkedList, or any other object that implements List. Here’s a contrived example that demonstrates the point: public class ListClientExample { private List list; public ListClientExample() { list = new LinkedList(); } private List getList() { return list; } List Interface | 3
public static void main(String[] args) { ListClientExample lce = new ListClientExample(); List list = lce.getList(); System.out.println(list); } } ListClientExample doesn’t do anything useful, but it has the essential elements of a class that encapsulates a List; that is, it contains a List as an instance variable. I’ll use this class to make a point, and then you’ll work with it in the first exercise. The ListClientExample constructor initializes list by instantiating (that is, creat‐ ing) a new LinkedList; the getter method called getList returns a reference to the internal List object; and main contains a few lines of code to test these methods. The important thing about this example is that it uses List whenever possible and avoids specifying LinkedList or ArrayList unless it is necessary. For example, the instance variable is declared to be a List, and getList returns a List, but neither specifies which kind of list. If you change your mind and decide to use an ArrayList, you only have to change the constructor; you don’t have to make any other changes. This style is called interface-based programming, or more casually, “programming to an interface” (see http://thinkdast.com/interbaseprog). Here we are talking about the general idea of an interface, not a Java interface. When you use a library, your code should only depend on the interface, like List. It should not depend on a specific implementation, like ArrayList. That way, if the implementation changes in the future, the code that uses it will still work. On the other hand, if the interface changes, the code that depends on it has to change, too. That’s why library developers avoid changing interfaces unless absolutely neces‐ sary. Exercise 1 Since this is the first exercise, we’ll keep it simple. You will take the code from the previous section and swap the implementation; that is, you will replace the Linked List with an ArrayList. Because the code programs to an interface, you will be able to swap the implementation by changing a single line and adding an import state‐ ment. Start by setting up your development environment. For all of the exercises, you will need to be able to compile and run Java code. I developed the examples using Java SE Development Kit 7. If you are using a more recent version, everything should still work. If you are using an older version, you might find some incompatibilities. 4 | Chapter 1: Interfaces
I recommend using an interactive development environment (IDE) that provides syntax checking, auto-completion, and source code refactoring. These features help you avoid errors or find them quickly. However, if you are preparing for a technical interview, remember that you will not have these tools during the interview, so you might also want to practice writing code without them. If you have not already downloaded the code for this book, see the instructions in “Working with the Code” on page ix. In the directory named code, you should find these files and directories: • build.xml is an Ant file that makes it easier to compile and run the code. • lib contains the libraries you’ll need (for this exercise, just JUnit). • src contains the source code. If you navigate into src/com/allendowney/thinkdast, you’ll find the source code for this exercise: • ListClientExample.java contains the code from the previous section. • ListClientExampleTest.java contains a JUnit test for ListClientExample. Review ListClientExample and make sure you understand what it does. Then com‐ pile and run it. If you use Ant, you can navigate to the code directory and run ant ListClientExample. You might get a warning like: List is a raw type. References to generic type List<E> should be parameterized. To keep the example simple, I didn’t bother to specify the type of the elements in the List. If this warning bothers you, you can fix it by replacing each List or LinkedList with List<Integer> or LinkedList<Integer>. Review ListClientExampleTest. It runs one test, which creates a ListClientExam ple, invokes getList, and then checks whether the result is an ArrayList. Initially, this test will fail because the result is a LinkedList, not an ArrayList. Run this test and confirm that it fails. NOTE: This test makes sense for this exercise, but it is not a good example of a test. Good tests should check whether the class under test satisfies the requirements of the interface; they should not depend on the details of the implementation. In the ListClientExample, replace LinkedList with ArrayList. You might have to add an import statement. Compile and run ListClientExample. Then run the test again. With this change, the test should now pass. Exercise 1 | 5
To make this test pass, you only had to change LinkedList in the constructor; you did not have to change any of the places where List appears. What happens if you do? Go ahead and replace one or more appearances of List with ArrayList. The program should still work correctly, but now it is “overspecified”. If you change your mind in the future and want to swap the interface again, you would have to change more code. In the ListClientExample constructor, what happens if you replace ArrayList with List? Why can’t you instantiate a List? 6 | Chapter 1: Interfaces
CHAPTER 2 Analysis of Algorithms As we saw in the previous chapter, Java provides two implementations of the List interface, ArrayList and LinkedList. For some applications LinkedList is faster; for other applications ArrayList is faster. To decide which one is better for a particular application, one approach is to try them both and see how long they take. This approach, which is called profiling, has a few problems: 1. Before you can compare the algorithms, you have to implement them both. 2. The results might depend on what kind of computer you use. One algorithm might be better on one machine; the other might be better on a different machine. 3. The results might depend on the size of the problem or the data provided as input. We can address some of these problems using analysis of algorithms. When it works, algorithm analysis makes it possible to compare algorithms without having to imple‐ ment them. But we have to make some assumptions: 1. To avoid dealing with the details of computer hardware, we usually identify the basic operations that make up an algorithm—like addition, multiplication, and comparison of numbers—and count the number of operations each algorithm requires. 2. To avoid dealing with the details of the input data, the best option is to analyze the average performance for the inputs we expect. If that’s not possible, a com‐ mon alternative is to analyze the worst case scenario. 7
Comments 0
Loading comments...
Reply to Comment
Edit Comment