📄 Page
1
M A N N I N G Radu Gheorghe Matthew Lee Hinman Roy Russo www.it-ebooks.info
📄 Page
2
Elasticsearch in Action
Licensed to Thomas Snead <n.ordickan@gmail.com> www.it-ebooks.info
📄 Page
4
Elasticsearch in Action
RADU GHEORGHE
MATTHEW LEE HINMAN
ROY RUSSO
MANNING
SHELTER ISLAND
📄 Page
5
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2016 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Susan Conant
Technical development editor: David Pombal
Copyeditor: Linda Recktenwald
Proofreader: Melody Dolab
Technical proofreader: Valentin Crettaz
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor

ISBN: 9781617291623
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – EBM – 20 19 18 17 16 15
📄 Page
6
brief contents

PART 1 ........ 1
1 ■ Introducing Elasticsearch 3
2 ■ Diving into the functionality 20
3 ■ Indexing, updating, and deleting data 53
4 ■ Searching your data 83
5 ■ Analyzing your data 118
6 ■ Searching with relevancy 148
7 ■ Exploring your data with aggregations 179
8 ■ Relations among documents 215

PART 2 ........ 259
9 ■ Scaling out 261
10 ■ Improving performance 293
11 ■ Administering your cluster 340
📄 Page
8
contents

preface xv
acknowledgments xvii
about this book xix
about the cover illustration xxiii

PART 1 ........ 1

1 Introducing Elasticsearch 3
1.1 Solving search problems with Elasticsearch 4
    Providing quick searches 5 ■ Ensuring relevant results 6
    Searching beyond exact matches 7
1.2 Exploring typical Elasticsearch use cases 8
    Using Elasticsearch as the primary back end 9
    Adding Elasticsearch to an existing system 9
    Using Elasticsearch with existing tools 11
    Main Elasticsearch features 12 ■ Extending Lucene functionality 13 ■ Structuring your data in Elasticsearch 15
    Installing Java 15 ■ Downloading and starting Elasticsearch 16 ■ Verifying that it works 16
1.3 Summary 18
📄 Page
9
2 Diving into the functionality 20
2.1 Understanding the logical layout: documents, types, and indices 22
    Documents 23 ■ Types 24 ■ Indices 25
2.2 Understanding the physical layout: nodes and shards 25
    Creating a cluster of one or more nodes 26 ■ Understanding primary and replica shards 27 ■ Distributing shards in a cluster 30 ■ Distributed indexing and searching 31
2.3 Indexing new data 32
    Indexing a document with cURL 32 ■ Creating an index and mapping type 35 ■ Indexing documents from the code samples 36
2.4 Searching for and retrieving data 37
    Where to search 38 ■ Contents of the reply 39
    How to search 42 ■ Getting documents by ID 45
2.5 Configuring Elasticsearch 46
    Specifying a cluster name in elasticsearch.yml 46
    Specifying verbose logging via logging.yml 47
    Adjusting JVM settings 47
2.6 Adding nodes to the cluster 48
    Starting a second node 50 ■ Adding additional nodes 51
2.7 Summary 52

3 Indexing, updating, and deleting data 53
3.1 Using mappings to define kinds of documents 54
    Retrieving and defining mappings 56 ■ Extending an existing mapping 57
3.2 Core types for defining your own fields in documents 58
    String 59 ■ Numeric 61 ■ Date 62 ■ Boolean 63
3.3 Arrays and multi-fields 63
    Arrays 64 ■ Multi-fields 64
3.4 Using predefined fields 65
    Controlling how to store and search your documents 66
    Identifying your documents 68
3.5 Updating existing documents 70
    Using the update API 72 ■ Implementing concurrency control through versioning 74
📄 Page
10
3.6 Deleting data 78
    Deleting documents 78 ■ Deleting indices 80
    Closing indices 81 ■ Re-indexing sample documents 81
3.7 Summary 82

4 Searching your data 83
4.1 Structure of a search request 84
    Specifying a search scope 85 ■ Basic components of a search request 86 ■ Request body–based search request 88
    Understanding the structure of a response 91
4.2 Introducing the query and filter DSL 92
    Match query and term filter 92 ■ Most used basic queries and filters 95 ■ Match query and term filter 102
    Phrase_prefix query 103
4.3 Combining queries or compound queries 105
    bool query 105 ■ bool filter 107
4.4 Beyond match and filter queries 109
    Range query and filter 109 ■ Prefix query and filter 111
    Wildcard query 112
4.5 Querying for field existence with filters 113
    Exists filter 114 ■ Missing filter 114 ■ Transforming any query into a filter 115
4.6 Choosing the best query for the job 116
4.7 Summary 117

5 Analyzing your data 118
5.1 What is analysis? 119
    Character filtering 120 ■ Breaking into tokens 120
    Token filtering 120 ■ Token indexing 120
5.2 Using analyzers for your documents 121
    Adding analyzers when an index is created 122
    Adding analyzers to the Elasticsearch configuration 123
    Specifying the analyzer for a field in the mapping 124
5.3 Analyzing text with the analyze API 126
    Selecting an analyzer 127 ■ Combining parts to create an impromptu analyzer 127 ■ Analyzing based on a field’s mapping 128 ■ Learning about indexed terms using the terms vectors API 128
📄 Page
11
5.4 Analyzers, tokenizers, and token filters, oh my! 130
    Built-in analyzers 130 ■ Tokenization 132
    Token filters 134
5.5 Ngrams, edge ngrams, and shingles 141
    1-grams 141 ■ Bigrams 141 ■ Trigrams 141
    Setting min_gram and max_gram 141 ■ Edge ngrams 142
    Ngram settings 142 ■ Shingles 143
5.6 Stemming 145
    Algorithmic stemming 145 ■ Stemming with dictionaries 146
    Overriding the stemming from a token filter 146
5.7 Summary 147

6 Searching with relevancy 148
6.1 How scoring works in Elasticsearch 149
    How scoring documents works 149 ■ Term frequency 150
    Inverse document frequency 150 ■ Lucene’s scoring formula 151
6.2 Other scoring methods 152
    Okapi BM25 154
6.3 Boosting 154
    Boosting at index time 155 ■ Boosting at query time 156
    Queries spanning multiple fields 157
6.4 Understanding how a document was scored with explain 158
    Explaining why a document did not match 160
6.5 Reducing scoring impact with query rescoring 160
6.6 Custom scoring with function_score 162
    weight 162 ■ Combining scores 164 ■ field_value_factor 164
    Script 165 ■ random 166 ■ Decay functions 167
    Configuration options 169
6.7 Tying it back together 170
6.8 Sorting with scripts 171
6.9 Field data detour 172
    The field data cache 173 ■ What field data is used for 174
    Managing field data 174
6.10 Summary 178
📄 Page
12
7 Exploring your data with aggregations 179
7.1 Understanding the anatomy of an aggregation 182
    Structure of an aggregation request 182 ■ Aggregations run on query results 184 ■ Filters and aggregations 185
7.2 Metrics aggregations 186
    Statistics 186 ■ Advanced statistics 188
    Approximate statistics 189
7.3 Multi-bucket aggregations 192
    Terms aggregations 193 ■ Range aggregations 200
    Histogram aggregations 202
7.4 Nesting aggregations 204
    Nesting multi-bucket aggregations 206 ■ Nesting aggregations to get result grouping 208 ■ Using single-bucket aggregations 209
7.5 Summary 213

8 Relations among documents 215
8.1 Overview of options for defining relationships among documents 216
    Object type 217 ■ Nested type 218 ■ Parent-child relationships 219 ■ Denormalizing 220
8.2 Having objects as field values 221
    Mapping and indexing objects 222 ■ Searching in objects 223
8.3 Nested type: connecting nested documents 225
    Mapping and indexing nested documents 226
    Searches and aggregations on nested documents 229
8.4 Parent-child relationships: connecting separate documents 236
    Indexing, updating, and deleting child documents 238
    Searching in parent and child documents 240
8.5 Denormalizing: using redundant data connections 247
    Use cases for denormalizing 248 ■ Indexing, updating, and deleting denormalized data 250 ■ Querying denormalized data 253
8.6 Application-side joins 255
8.7 Summary 256
📄 Page
13
PART 2 ........ 259

9 Scaling out 261
9.1 Adding nodes to your Elasticsearch cluster 262
    Adding nodes to your cluster 262
9.2 Discovering other Elasticsearch nodes 265
    Multicast discovery 265 ■ Unicast discovery 266
    Electing a master node and detecting faults 267
    Fault detection 268
9.3 Removing nodes from a cluster 269
    Decommissioning nodes 270
9.4 Upgrading Elasticsearch nodes 274
    Performing a rolling restart 274 ■ Minimizing recovery time for a restart 276
9.5 Using the _cat API 276
9.6 Scaling strategies 279
    Over-sharding 279 ■ Splitting data into indices and shards 280
    Maximizing throughput 281
9.7 Aliases 282
    What is an alias, really? 283 ■ Alias creation 284
9.8 Routing 286
    Why use routing? 287 ■ Routing strategies 287
    Using the _search_shards API to determine where a search is performed 289 ■ Configuring routing 290
    Combining routing with aliases 291
9.9 Summary 292

10 Improving performance 293
10.1 Grouping requests 294
    Bulk indexing, updating, and deleting 295
    Multisearch and multiget APIs 299
10.2 Optimizing the handling of Lucene segments 301
    Refresh and flush thresholds 302 ■ Merges and merge policies 305 ■ Store and store throttling 308
10.3 Making the best use of caches 312
    Filters and filter caches 312 ■ Shard query cache 318
    JVM heap and OS caches 321 ■ Keeping caches up with warmers 323
📄 Page
14
10.4 Other performance tradeoffs 325
    Big indices or expensive searches 326 ■ Tuning scripts or not using them at all 329 ■ Trading network trips for less data and better distributed scoring 333 ■ Trading memory for better deep paging 336
10.5 Summary 338

11 Administering your cluster 340
11.1 Improving defaults 341
    Index templates 341 ■ Default mappings 344
11.2 Allocation awareness 347
    Shard-based allocation 347 ■ Forced allocation awareness 349
11.3 Monitoring for bottlenecks 350
    Checking cluster health 350 ■ CPU: slow logs, hot threads, and thread pools 353 ■ Memory: heap size, field, and filter caches 356 ■ OS caches 360 ■ Store throttling 361
11.4 Backing up your data 362
    Snapshot API 362 ■ Backing up data to a shared file system 362
    Restoring from backups 366 ■ Using repository plugins 367
11.5 Summary 368

appendix A Working with geospatial data 369
appendix B Plugins 383
appendix C Highlighting 390
appendix D Elasticsearch monitoring plugins 410
appendix E Turning search upside down with the percolator 419
appendix F Using suggesters for autocomplete and did-you-mean functionality 437
index 461
📄 Page
16
preface

While writing this book, my objective was to provide you with the information I needed when I started using Elasticsearch: what its main features are and how they work under the hood. To give you a better overview of this objective, let me tell you a more detailed story of how this book came to life.

I first met Elasticsearch in 2011 while working on a project for centralizing logs. My colleague Mihai Sandu showed me Graylog, which used Elasticsearch for log search, and setting everything up was extremely easy. Two servers could handle all our logging needs at the time, but we expected the data volume to grow hundreds of times in about one year. And it did. On top of that, we had more and more complex analysis requirements, so we quickly found out that tuning and scaling the setup required a deep understanding of Elasticsearch and its features.

There was no book to teach us that, so we had to learn the hard way: lots of experiments, lots of questions and answers on the mailing list. The upside was that I got to know a lot of nice people who posted there regularly. This is how I came to work at Sematext, where I could concentrate on Elasticsearch full-time, and this is why Manning asked me if I would be interested in writing about Elasticsearch.

Of course I was. They warned me it was hard work, but told me that Lee Hinman was also interested, so we joined forces. With two authors, we thought it was going to be easy, especially as Lee and I really clicked and provided useful feedback to one another. Little did we know that it’s much easier to present features in the early chapters than to combine those features into best practices for various use cases in later chapters. Then, with feedback from our reviewers, we found that it’s even more work
📄 Page
17
to fit everything together, so our pace became slower and slower. That’s when Roy Russo joined us and helped with that final push.

After two and a half years of early mornings, late nights, and weekends, I can finally say we’re done. It was a tough experience, but a rich one as well. I would surely have loved to have this book in my hands four years ago, and I hope you’ll enjoy it, too.

RADU GHEORGHE
📄 Page
18
acknowledgments

Many people provided their invaluable support to make this book possible:

■ Susan Conant, our development editor at Manning, who supported us in so many ways: by providing valuable feedback on draft chapters, helping to plan book and individual chapter structures, giving encouragement, advising us on upcoming steps, helping us overcome bumps in the road, and so on
■ Jettro Coenradie, our technical editor, who helped us review big chunks of the manuscript before it went to production and again helped with the final steps before the book went to press
■ Valentin Crettaz, who helped with his thorough technical proofread
■ Our Manning Early Access Program (MEAP) readers who posted so many helpful comments in the Author Online forum
■ The reviewers from the development process who provided such good feedback that I can’t even begin to imagine how the book would look without them: Achim Friedland, Alan McCann, Artur Nowak, Bhaskar Karambelkar, Daniel Beck, Gabriel Katenbaumn, Gianluca Rhigetto, Igor Motov, Jeelani Shaik, Joe Gallo, Konstantin Yakushev, Koray Güclü, Michael Schleichardt, Paul Stadig, Ray Lugo Jr., Sen Xu, and Tanguy Leroux

RADU GHEORGHE

I’d like to express my thanks in chronological order. To my colleagues from Avira: Mihai Sandu, Mihai Efrim, Martin Ahrens, Matthias Ollig and many others, for supporting me in learning about Elasticsearch and tolerating my not-always-successful
📄 Page
19
experiments. To my colleagues from Sematext: Otis Gospodnetić, who supported me in learning and interacting with the community, and Rafał Kuć (aka Master Rafał) for his invaluable tips and tricks. Finally, I’d like to thank my family for supporting me in so many ways that I can barely scratch the surface here: my parents, Nicoleta and Mihai Gheorghe, and my in-laws, Mădălina and Adrian Radu, for providing good food, quiet spaces, and the all-important moral support. My wife Alexandra, for being a real hero: she somehow managed to write her own stuff and still take care of everything in order for me to write. Last but not least, my son Andrei, now 6, for his understanding and his creative solutions on spending time together, like working on his own book next to me.

LEE HINMAN

First and foremost I’d like to give my sincerest thanks to my wife Delilah for encouraging me in this endeavor and for being my adventuring partner. You have given me so much support in this and so many other parts of my life. Thank you for continuing to encourage me throughout the birth of our daughter, Vera Ovelia. I’d also like to thank all of the people who have contributed to Elasticsearch. Without you, open source software would not be possible. I’m honored to contribute to such a wide-reaching and powerful piece of software.

ROY RUSSO

I would like to thank my daughters Olivia and Isabella, my son Jacob, and my wife Roberta, for standing beside me throughout my career and acting as a source of inspiration and motivation. You guys make the impossible possible with your support, love, and understanding.
📄 Page
20
about this book

Since it came out in 2010, Elasticsearch has become increasingly popular. It’s being used in a variety of setups, from product search (the traditional use case for a search engine) to real-time analytics of social media, application logs, and other flowing data. The strong points of Elasticsearch have always been its distributed model, which makes it scale out easily and efficiently, as well as its rich analytics functionality. All of this was built on top of the already established Apache Lucene search engine library. Lucene has evolved during this time as well, making it possible to process the same amount of data with less CPU, memory, and disk space.

Elasticsearch in Action covers all the major features of Elasticsearch, from relevancy tuning by using different analyzers and query types to using aggregations for real-time analytics, as well as more “exotic” features, like geo-spatial search and document percolation.

You’ll quickly find that Elasticsearch is easy to get started with. You can get your documents in, search them, build statistics, and even distribute and replicate your data onto multiple machines in a matter of hours. Default behavior and settings are very developer-friendly, making proofs of concept that much easier to build. Moving from prototypes to production is often more difficult, as you’ll bump into various functionality or performance limitations. That’s why we explain how each feature works under the hood, so you can tweak the right knobs in order to get good relevance out of your searches and good performance for both reads and writes to your cluster.