Clojure Data Structures and Algorithms Cookbook 25 recipes to deeply understand and implement advanced algorithms in Clojure Rafik Naccache BIRMINGHAM - MUMBAI www.it-ebooks.info
Clojure Data Structures and Algorithms Cookbook Copyright © 2015 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. First published: August 2015 Production reference: 1140815 Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK. ISBN 978-1-78528-145-7 www.packtpub.com www.it-ebooks.info
Credits Author Rafik Naccache Reviewers Vitomir Kovanović Muktabh Mayank Commissioning Editor Nadeem N. Bagban Acquisition Editor Tushar Gupta Content Development Editor Natasha DSouza Technical Editor Vivek Pala Copy Editors Brandt D'Mello Neha Vyas Project Coordinator Vijay Kushlani Proofreader Safis Editing Indexer Mariammal Chettiyar Production Coordinator Conidon Miranda Cover Work Conidon Miranda www.it-ebooks.info
About the Author Rafik Naccache is a Tunisian, who is experienced in software architecture and is an emergent technology enthusiast. He earned his bachelor's degree in computer science engineering from the University of Tunis in 2001. Rafik fell in love with Clojure back in 2012 and has been developing it professionally since 2013. He has occupied various positions in the telecom and banking sectors and has launched a few innovative start-ups on the Internet, in which he was able to deploy Clojure apps. He also founded the Tunisian Clojure enthusiasts community. He contributes to open source projects such as Cryogen (https://github.com/cryogen-project/cryogen/graphs/contributors) and Milestones (https://github.com/automagictools/milestones). First of all, I am grateful to my mom, Safia, and dad, Abdelaziz, for the love and education that they generously provided me with. Thanks to Spectrum ZX, which we had back in the 80s. I grew as addicted to computers as I am right now, and this was the start of everything. I would also like to thank my in-laws, aunt Zohra and uncle Hammadi, who always supported me and had blind and unconditional faith in the work I did. They really wanted to see this book published. I am very thankful to my editors Tushar Gupta, Vivek Pala, and Natasha Dsouza; and reviewers, Vitomir Kovanovic and Multabh Mayank, for the valuable advice and professional guidance that they provided to accomplish this book. I would especially like to thank my friends Anas Zdadou and Sahbi Chakroun and my family: Tselma, Soussou, Dah, Hafedh, Zazza, and Idriss. I owe so much to you all. However, most of all, I am extremely grateful to my super wife, Khawla, who patiently had to suffer my absence while I wrote this book. During this period, she always kept her smile on and never complained. I can say that this book would probably never have been possible if she hadn't been there all along with our little Fatma Ezzahra, casting a light to help brighten my hard journey toward achievement and success. www.it-ebooks.info
About the Reviewers Vitomir Kovanović is a PhD student at the School of Informatics in the University of Edinburgh, UK. He received an MSc degree in computer science and software engineering in 2011 and a BSc degree in information systems and business administration in 2009 from the University of Belgrade, Serbia. His research interests include learning analytics, educational data mining, and online education. He is a member of the Society for Learning Analytics Research and the program committees of several conferences and journals in technology-enhanced learning. In his PhD research, he focuses on the use of trace data for understanding the effects of technology on the quality of the social learning process and learning outcomes. To find out more about him, visit http://vitomir.kovanovic.info/. www.it-ebooks.info
www.PacktPub.com Support files, eBooks, discount offers, and more For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks. TM https://www2.packtpub.com/books/subscription/packtlib Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books. Why Subscribe? f Fully searchable across every book published by Packt f Copy and paste, print, and bookmark content f On demand and accessible via a web browser Free Access for Packt account holders If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access. www.it-ebooks.info
i Table of Contents Preface iii Chapter 1: Revisiting Arrays 1 Introduction 1 Efficiently compressing a byte array 2 Using Pascal's triangle to draw fractals 9 Simulating multithreading using time-sharing 14 Simulating a call stack using arrays 28 Chapter 2: Alternative Linked Lists 37 Building a doubly linked XOR list 37 Speeding up access to linked list elements 42 Building a simple shift-reduce parser 47 Implementing a skew binary random access list 59 Chapter 3: Walking Down Forests of Data 69 Introduction 69 Building self-balancing, search-efficient splay trees 70 Designing an efficient key-value store using B-trees 79 Devising an undo-capable data structure using a rope 86 Designing an autocomplete system using a trie 98 Chapter 4: Making Decisions with the Help of Science 105 Introduction 105 Designing a live recommendation engine 106 Resolving cost and profit optimization problems 112 Finding optimal paths in a graph 120 Summarizing texts by extracting the most representative sentences 123 www.it-ebooks.info
ii Table of Contents Chapter 5: Programming with Logic 131 Introduction 131 Querying a social website's data 134 Designing a type inferencer 138 Playing a round of checkers 142 Chapter 6: Sharing by Communicating 153 Introduction 153 Building a tiny web crawler 155 Designing an HTML5 game 160 Designing an online taxi-booking engine 170 Chapter 7: Transformations as First-class Citizens 177 Introduction 177 Building a recursive descent parser using trampoline 178 Implementing a reusable mini-firewall using transducers 183 Building a little unification engine with the continuation-passing style 188 Index 195 www.it-ebooks.info
iii Preface The invention of Lisp by John McCarthy in 1958 is certainly one of the most seminal events in the history of computer science. Originally intended as a way of porting Alonzo Church's lambda calculus theory into the realm of computer programs, Lisp pioneered many original ideas, such as higher order functions, recursion, and even garbage collection that proved to be so highly pragmatic that practically every modern high-level programming language of today is very likely built on some significant Lisp legacy. However, beyond any practical or technical contribution, Lisp's most important trait is undeniably its unparalleled expressiveness. It's simplistic, yet its extremely elegant syntax propels it as a privileged tool for creative computer scientists, one which you could use as a powerful "building material" to erect algorithmic monuments without worrying about ancillary implementation details. It's certainly this ability of abstracting away "incidental complexity" that made Lisp the language for conducting Artificial Intelligence experiments, for instance. Clojure, as a modern Lisp language, leverages this extraordinary expressive power to provide a platform that is highly suitable for algorithmic exploration. The abstractions it offers, the functional approach it suggests, as well as the built-in concurrency it supports, are all valuable facilities that enable straight and noise-free problem solving, which is an alternative way of approaching algorithms' design and sometimes, even innovative out-of-the-box thinking. This book tries to underpin this idea. Using Clojure, we'll consider seven areas of algorithmic challenges and try to address them by taking advantage of all the power that we can get from this Lisp dialect. Besides, while choosing these seven problem domains, I tried to attain two main objectives. First, I wanted to tackle algorithms that have concrete real-world usage, and theory in the recipes will only be present to serve well-defined use cases in our everyday work with computers. Second, I tried to come up with material as varied as possible, so that the recipes cover a wide range of topics, from compressing files and building parsers to designing HTML5 interactive games and assembling type inferencers. www.it-ebooks.info
Preface iv As far as the recipes' general layout is concerned, at the beginning, you will be given a thorough introduction to the intuition and the theory behind the algorithm being studied. Then, I will elaborate on every recipe's detail, making sure that I've explained every step and extensively commented the code. At the end of every recipe, you will see a sample usage of what has been implemented. This way, you will be guided through the whole process: the algorithm inception, its implementation, and its testing. In this process, I hope to have had mirrored the Clojure interactive workflow, in which the developer builds their program function by function, going back and forth to his/her REPL. I really enjoyed the process of writing this book. I can, for sure, assert that the Clojure promise of high expressive power has been fulfilled for this particular project, as I came up with quite complex algorithmic implementations with reasonable effort while being really productive. My only wish is that you, by the end of this work, will be as convinced by Clojure and Lisp as I am. You can reach me at @turbopape on Twitter, I'll answer, with extreme pleasure, all your questions. What this book covers Chapter 1, Revisiting Arrays, explores some interesting alternative uses for the array data structure. You'll learn in this recipe how to implement data compression using the LZ77 algorithm. Then, we'll see how you can use Pascal's triangle in order to draw some fractals. Next, we'll design a little multithreaded program execution simulator. We will end this chapter by studying an algorithm used to handle a call stack frames operation during a program execution. Chapter 2, Alternative Linked Lists, delves into the advanced matters related to linked lists. We will cover a method using XOR addressing in order to get doubly linked lists. We'll then cover how to speed up a linked list's element retrieval, thanks to caching. We'll also use this data structure to build a shift-reduce parser. At the end, we'll explore an immutable functional data representation of linked lists by using the skew binary numbers representation. Chapter 3, Walking Down Forests of Data, focuses on recipes related to tree data structure. First, we'll cover the self-balancing, search-optimized splay tree. Then, we'll elaborate on B-trees and show you how we can use a B-tree in order to build a key-value data store. Next, we'll show you how ropes can be used in order to create an undo-capable text editor. The last recipe of this chapter will be about tries and how they allow you to create efficient autocomplete engines. www.it-ebooks.info
Preface v Chapter 4, Making Decisions with the Help of Science, gives you an overview of a few machine learning and optimization algorithms. We'll first show you an approach that is used to build live recommendation engines. Then, we'll use the branch and bound algorithm to solve a cost/ profit optimization problem, which can only accept a natural numbers solution. Next, we'll use the Dijkstra algorithm in order to find the optimal paths in graphs. The final recipe in this chapter is about using the LexRank algorithm in order to summarize text documents. Chapter 5, Programming with Logic, focuses on logic programming. We'll first use this highly declarative approach in order to draw interesting facts out of a social networking website's traffic data. Then, we'll show you how a simple type inferencer can be built using logic programming. At the end, we'll design a simple IA module capable of playing one round of checkers. Chapter 6, Sharing by Communicating, gives particular attention to asynchronous programming. We'll begin by using this concurrency paradigm in order to build a tiny web scraper. Then, we'll go through the process of creating an interactive HTML5 game. Finally, we'll design an online taxi-booking platform as a complex system that could benefit from asynchronous programming. Chapter 7, Transformations as First-class Citizens, dives into a few particular algorithmic cases inherent to the functional nature of Clojure. We'll start by designing a simple recursive descent parser, making use of the efficient mutual recursion offered by the Trampoline construct. Then, we'll see the new Clojure 1.7 feature—the transducers—in action while developing a mini firewall simulator. Finally, we'll introduce you to the continuation-passing style while designing a little symbolic expression unification engine. What you need for this book To be able to follow along, you need a running Clojure REPL. Getting to a working Clojure environment is generally done through Leiningen (https://leiningen.org), which will need a JVM installation in order to be able to work. Clojure 1.6 will be fine for all the recipes except for the second recipe in Chapter 7, Transformations as First-class Citizens, which will need Clojure 1.7. Though not absolutely necessary, you can consider using a full-fledged Clojure development environment, such as Emacs/CIDER, IntelliJ/Cursive, Light Table, or Nightcode. Who this book is for This book is for intermediate Clojure developers who can read and write in this language quite comfortably. Besides, it is assumed that you have some knowledge of how to set up Clojure projects, include dependencies, how to run REPLs, and so on through Leiningen and Figwheel. No prior awareness of any of the algorithms covered in this book is needed, and, when appropriate, pointers are given to the explanation material about any theory related to them. www.it-ebooks.info
Preface vi Sections In this book, you will find several headings that appear frequently (Getting ready, How to do it, How it works, There's more, and See also). To give clear instructions on how to complete a recipe, we use these sections as follows: Getting ready This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe. How to do it… This section contains the steps required to follow the recipe. How it works… This section usually consists of a detailed explanation of what happened in the previous section. There's more… This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe. See also This section provides helpful links to other useful information for the recipe. Conventions In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning. Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We can use the core.match library in our program so we can use pattern matching." www.it-ebooks.info
Preface vii A block of code is set as follows: (defn accept-requests! [c->m requests] (go (while true (let [r (<! c->m)] ;;If something comes up ;;through the channel (swap! requests conj r))))) Any command-line input or output is written as follows: # lein repl New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "This step is followed when p is not at the root and p and x are aligned to the left or right, as shown in the next figure." Warnings or important notes appear in a box like this. Tips and tricks appear like this. Reader feedback Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of. To send us general feedback, simply send an e-mail to feedback@packtpub.com, and mention the book title via the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors. www.it-ebooks.info
Preface viii Customer support Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase. Downloading the example code You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. Errata Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support. Piracy Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at copyright@packtpub.com with a link to the suspected pirated material. We appreciate your help in protecting our authors, and our ability to bring you valuable content. Questions You can contact us at questions@packtpub.com if you are having a problem with any aspect of the book, and we will do our best to address it. www.it-ebooks.info
1 1 Revisiting Arrays In this chapter, we will see how you can use array abstractions in Clojure to cover the following topics: f Efficiently compressing a byte array f Using Pascal's triangle to draw fractals f Simulating multithreading using time-sharing f Simulating a call stack using arrays Introduction In this book, we will go on a journey through the broad land of algorithms and data structures, taking a ride on the comfortable vehicle that is Clojure programming language. First, we will take a look at arrays, exploring their particular structures to tackle problems as interesting as compression, fractal drawing, multithreading, and call stacks. Then we will elaborate on linked lists, transforming them in to doubly linked lists. We will do this to speed up access to their elements, build parsers, and devise fast random access. The next step of our trip will concern trees of data. We'll show you how to implement self- balancing red-black trees, how to design efficient key-value stores — thanks to B-trees (way to go in order to design undo-capable text editors), and lastly, a methodology to construct autocomplete text typing systems. After that, we'll focus on exploring some optimization and machine-learning techniques. We will see how to set up a recommendation engine, the way to go to optimize a problem where costs and profits are involved, a methodology to find the best possible paths in graphs, and how to summarize texts. www.it-ebooks.info
Revisiting Arrays 2 Then we'll study the topic of logic programming, analyzing some website traffic logs to detect visitors of interest to us. Doing this, we'll dive into the problem of type inferencing for the Java language, and simulate a turn of a checkers game. At that point, we'll talk about asynchronous programming as a means of tackling difficult problems. We'll build a tiny web spider, design an interactive HTML5 game, and design a complex online taxi-booking solution. The last rally point of this trip, but certainly not the least, is that we'll have a look at the higher order functions and transducers at the heart of Clojure. We'll design a recursive descent parser using a trampoline, build a reusable mini firewall thanks to transducers, and lastly, explore the continuation passing style as a tool to design a simple unification engine. This will be quite a tour, in which we will bring to life various real-world use cases related to the essential theory of computing as far as data structures and algorithms are involved, which are all served by the high expressive power of Clojure. By the end of this book, you'll be familiar with many of the advanced concepts that fuel most of the nontrivial applications of our world while you enhance your mastery of Clojure! Efficiently compressing a byte array Compressing a byte array is a matter of recognizing repeating patterns within a byte sequence and devising a method that can represent the same underlying information to take advantage of these discovered repetitions. To get a rough idea of how this works, imagine having a sequence as: ["a" "a" "a" "b" "b" "b" "b" "b" "b" "b" "c" "c"] It is intuitively more efficient to represent this as: [3 times "a", 7 times "b", 2 times "c"] Now, we are going to use a methodology based on a well-known algorithm, that is, the LZ77 compression method. LZ77 is, despite being quite old, the basis of most of all the well-known and currently used compression methods, especially the Deflate algorithm. Deflate is at the heart of the ZIP family of compression algorithms. It uses a slightly modified version of LZ77 plus a special encoding, that is, the Huffman encoding. The point of LZ77 is to walk a sequence and recognize a pattern in the past elements that will occur in the upcoming elements, replacing those with a couple of values: how many elements should go backwards in order to locate the recurring pattern, which is called "distance"; and how long is the recurring pattern, which is labeled as "length". www.it-ebooks.info
Chapter 1 3 The iteration of the LZ77 compression would look as follows: 1. At any point of time, the algorithm is processing a particular element, which is located at the current position. Consider a window of the size n, as a set of n elements preceding the one that is occupying current position, and consider lookahead as the rest of the elements up until the input's end. 2. Begin with the first element of the input. 3. Move on to the next element. 4. Find in the window (that is, past n elements), the longest pattern that can be found in lookahead. 5. If such a sequence is found, consider distance as the location where, the matching sequence was found, expressed in regards to the current position, consider length as the length of the matching pattern, and proceed with the two following actions: Replace the match in lookahead by "distance" and "length". Move forward using the "length" elements and resume algorithm execution at step 4. 6. Otherwise, resume at step 3. The procedure to uncompress is as follows: 1. Walk the compressed sequence. 2. If the "distance" and "length" are found, go back to the "distance" elements and replace this couple with the "length" elements. 3. If not, lay out the element that you've found. Let's see this in action in Clojure! How to do it... 1. First of all, here is the ns declaration containing the Clojure facilities that we are going to use: (ns recipe1.core (:require [clojure.set :as cset])) ;; => we'll need set operations later on. 2. Let's begin by working on the uncompressing part. First of all, we need an expand function that takes the source array as a vector of the elements distance and length and generates a repetition of a sub-vector of the last distance characters from the source array until the length is reached: (defn expand [the-vector distance www.it-ebooks.info
Revisiting Arrays 4 length] (let [end (count the-vector) start (- end distance) ;;=> Here we go backwards 'distance' elements. pattern (subvec the-vector start end)] ;=> We have our pattern. (into [] (take length ;=> We exactly take "length" from (cycle pattern))))) ;; an infinite repetition of our pattern. 3. Now, let's define un-LZ77 using expand function while walking through a sequence of bytes: (defn un-LZ77 [bytes] (loop [result [] remaining bytes] ;;=> We recur over the contents of the array. (if (seq remaining) (let [current (first remaining) the-rest (rest remaining)] ;=> Current element under scrutiny; (if-not (vector? Current) ;=> If it is not a vector, add to result (recur (conj result ;; the very element, and move on. current) the-rest) (recur (into result (expand result ;;=> This is a vector, then we'll expand here and move on (current 0) (current 1))) the-rest))) result))) ;;=> end of recursion, return result. www.it-ebooks.info
Chapter 1 5 4. Now let's address the topic of compressing. First of all, we need to grab all sub-vectors, as we'll have to find matches between window and lookahead and then pick the longest one among them: (defn all-subvecs-from-beginning ;;=> this function will generate a set of all sub-vectors starting ;; from begin [v] (set (map #(subvec v 0 %) ;;=> we apply subvec from 0 to all indices from 1 up to the size ;; of the array + 1. (range 1 (inc (count v)))))) (defn all-subvecs ;;=> this function will generate all [v] ; sub-vectors, applying (loop [result #{} ;; all-subvecs from beginning to ;; all possible beginnings. remaining v] (if (seq remaining) (recur (into result (all-subvecs-from-beginning remaining)) (into[] (rest remaining))) ;;=> We recur fetching all sub-vectors for next beginning. result))) ;;=> end of recursion, I return result. 5. Now we define a function to grab the longest match in left array with the beginning of right array: (defn longest-match-w-beginning [left-array right-array] (let [all-left-chunks (all-subvecs left-array) all-right-chunks-from-beginning ;;=> I take all sub-vectors from left-array (all-subvecs-from-beginning right-array) ;;=> I take all sub-vectors from right-array www.it-ebooks.info
Comments 0
Loading comments...
Reply to Comment
Edit Comment