📄 Page
1
MANNING
Alessandro Negro
Foreword by Dr. Jim Webber
📄 Page
2
Graph-Powered Machine Learning
📄 Page
4
Graph-Powered Machine Learning
ALESSANDRO NEGRO
FOREWORD BY DR. JIM WEBBER
MANNING
SHELTER ISLAND
📄 Page
5
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2021 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

The author and publisher have made every effort to ensure that the information in this book was correct at press time. The author and publisher do not assume and hereby disclaim any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause, or from any usage of the information herein.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Dustin Archibald
Technical development editors: Michiel Trimpe & Al Krinker
Review editor: Ivan Martinović
Production editor: Andy Marinkovich
Copy editor: Keir Simpson
Proofreader: Katie Tennant
Technical proofreader: Alex Ott
Typesetter: Gordan Salinovic
Cover designer: Marija Tudor

ISBN 9781617295645
Printed in the United States of America
📄 Page
6
To Filippo and Flavia: I hope you are as proud of your father as I am always proud of you.
📄 Page
8
brief contents

PART 1 INTRODUCTION 1
1 ■ Machine learning and graphs: An introduction 3
2 ■ Graph data engineering 30
3 ■ Graphs in machine learning applications 71

PART 2 RECOMMENDATIONS 113
4 ■ Content-based recommendations 119
5 ■ Collaborative filtering 166
6 ■ Session-based recommendations 202
7 ■ Context-aware and hybrid recommendations 227

PART 3 FIGHTING FRAUD 263
8 ■ Basic approaches to graph-powered fraud detection 265
9 ■ Proximity-based algorithms 295
10 ■ Social network analysis against fraud 320
📄 Page
9
PART 4 TAMING TEXT WITH GRAPHS 357
11 ■ Graph-based natural language processing 359
12 ■ Knowledge graphs 389
📄 Page
10
contents

foreword xiii
preface xv
acknowledgments xvii
about this book xix
about the author xxiii
about the cover illustration xxiv

PART 1 INTRODUCTION 1

1 Machine learning and graphs: An introduction 3
  1.1 Machine learning project life cycle 5
      Business understanding 7 ■ Data understanding 8 ■ Data preparation 8 ■ Modeling 9 ■ Evaluation 9 ■ Deployment 9
  1.2 Machine learning challenges 10
      The source of truth 10 ■ Performance 13 ■ Storing the model 14 ■ Real time 14
  1.3 Graphs 15
      What is a graph? 15 ■ Graphs as models of networks 17
  1.4 The role of graphs in machine learning 23
      Data management 25 ■ Data analysis 25 ■ Data visualization 26
  1.5 Book mental model 27
📄 Page
11
2 Graph data engineering 30
  2.1 Working with big data 33
      Volume 34 ■ Velocity 36 ■ Variety 38 ■ Veracity 39
  2.2 Graphs in the big data platform 40
      Graphs are valuable for big data 41 ■ Graphs are valuable for master data management 48
  2.3 Graph databases 53
      Graph database management 54 ■ Sharding 57 ■ Replication 60 ■ Native vs. non-native graph databases 61 ■ Label property graphs 67

3 Graphs in machine learning applications 71
  3.1 Graphs in the machine learning workflow 73
  3.2 Managing data sources 76
      Monitor a subject 79 ■ Detect a fraud 82 ■ Identify risks in a supply chain 85 ■ Recommend items 87
  3.3 Algorithms 93
      Identify risks in a supply chain 93 ■ Find keywords in a document 96 ■ Monitor a subject 98
  3.4 Storing and accessing machine learning models 100
      Recommend items 101 ■ Monitoring a subject 103
  3.5 Visualization 106
  3.6 Leftover: Deep learning and graph neural networks 109

PART 2 RECOMMENDATIONS 113

4 Content-based recommendations 119
  4.1 Representing item features 122
  4.2 User modeling 136
  4.3 Providing recommendations 143
  4.4 Advantages of the graph approach 164

5 Collaborative filtering 166
  5.1 Collaborative filtering recommendations 170
  5.2 Creating the bipartite graph for the User-Item dataset 172
  5.3 Computing the nearest neighbor network 177
  5.4 Providing recommendations 189
📄 Page
12
  5.5 Dealing with the cold-start problem 194
  5.6 Advantages of the graph approach 198

6 Session-based recommendations 202
  6.1 The session-based approach 203
  6.2 The events chain and the session graph 206
  6.3 Providing recommendations 212
      Item-based k-NN 213 ■ Session-based k-NN 219
  6.4 Advantages of the graph approach 224

7 Context-aware and hybrid recommendations 227
  7.1 The context-based approach 228
      Representing contextual information 231 ■ Providing recommendations 235 ■ Advantages of the graph approach 253
  7.2 Hybrid recommendation engines 254
      Multiple models, single graph 256 ■ Providing recommendations 258 ■ Advantages of the graph approach 260

PART 3 FIGHTING FRAUD 263

8 Basic approaches to graph-powered fraud detection 265
  8.1 Fraud prevention and detection 267
  8.2 The role of graphs in fighting fraud 271
  8.3 Warm-up: Basic approaches 279
      Finding the origin point of credit card fraud 279 ■ Identifying a fraud ring 287 ■ Advantages of the graph approach 293

9 Proximity-based algorithms 295
  9.1 Proximity-based algorithms: An introduction 296
  9.2 Distance-based approach 298
      Storing transactions as a graph 300 ■ Creating the k-nearest neighbors graph 302 ■ Identifying fraudulent transactions 309 ■ Advantages of the graph approach 318

10 Social network analysis against fraud 320
  10.1 Social network analysis concepts 323
  10.2 Score-based methods 326
      Neighborhood metrics 330 ■ Centrality metrics 336 ■ Collective inference algorithms 344
📄 Page
13
  10.3 Cluster-based methods 348
  10.4 Advantages of graphs 354

PART 4 TAMING TEXT WITH GRAPHS 357

11 Graph-based natural language processing 359
  11.1 A basic approach: Store and access sequence of words 363
      Advantages of the graph approach 373
  11.2 NLP and graphs 373
      Advantages of the graph approach 387

12 Knowledge graphs 389
  12.1 Knowledge graphs: Introduction 390
  12.2 Knowledge graph building: Entities 393
  12.3 Knowledge graph building: Relationships 402
  12.4 Semantic networks 409
  12.5 Unsupervised keyword extraction 415
      Keyword co-occurrence graph 423 ■ Clustering keywords and topic identification 425
  12.6 Advantages of the graph approach 428

appendix A Machine learning algorithms taxonomy 431
appendix B Neo4j 435
appendix C Graphs for processing patterns and workflows 449
appendix D Representing graphs 458

index 461
📄 Page
14
foreword

The technology world is abuzz with machine learning. Every day we are bombarded with articles on its applications and advances. But there is a quiet revolution brewing among practitioners, and that revolution puts graphs at the very heart of machine learning.

Alessandro wrote this book after almost a decade of practice at the confluence of graphs and machine learning. Had Alessandro worked for one of the Web giants, distilling the knowledge of an army of PhDs working on special one-off systems, this would be an interesting book, but for the majority of us it would satisfy our curiosity rather than being a practical guide. Fortunately for us, while Alessandro does have a PhD, he works in the enterprise space and has deep empathy and understanding for the kinds of systems that enterprises build. The book reflects this: Alessandro ably addresses the kinds of practical design and implementation challenges that software engineers and data professionals building contemporary systems outside of the hyperscale Web giants must circumvent.

Graph-Powered Machine Learning demonstrates how important graphs are to the future of machine learning. It shows not only that graphs provide a superior means of fuelling contemporary ML pipelines, but also how graphs are a natural way of organizing, analyzing, and processing data for machine learning.

The book offers a rich, curated tour of graph machine learning, and each topic is underpinned with detailed examples drawing on Alessandro’s deep experience and the easy, refined confidence of a long-term practitioner.
📄 Page
15
The book eases us in, providing an overall framework to reason about machine learning and integrate it into our data systems. It follows up immediately with a practical approach to recommendations covering a variety of approaches, such as collaborative filtering, content- and session-based recommendations, and hybrid styles. Alessandro calls out the lack of explainability in state-of-the-art techniques and shows that this isn’t an issue with the graph approach. He then continues to tackle fraud detection, taking in concepts like proximity and social network analysis, where we relearn the maxim that “birds of a feather flock together” in the context of criminal networks. Finally, the book deals with knowledge graphs: the ability of graph technology to consume documents and distil connected knowledge from them, disambiguate terms, and handle ambiguous query terms. The breadth of topics is vast, but the quality of information is always excellent.

Throughout the book, Alessandro gently guides the reader, building up from the basics to advanced concepts. With the examples and companion code, practically minded readers are able to get examples working quickly, and from there to adapt them for their own needs.

You will finish this book armed with a variety of practical tools at your disposal and, if you like, some dirt under your fingernails. You will be ready to extract graph features to make your existing models perform better today, and you’ll be equipped to work natively with graphs tomorrow. I promise it’s going to be a wonderful journey.

DR. JIM WEBBER, CHIEF SCIENTIST @ NEO4J
📄 Page
16
preface

The summer of 2012 was one of the warmest I can remember in southern Italy. My wife and I were awaiting our first son, who was going to be delivered quite soon, so we had few chances to go out or take any refreshment in the awesomely fresh, clean water of Apulia. Under those conditions, you can get crazy with DIY (not my case), or you can keep your mind busy with something challenging. Because I’m not a great fan of Sudoku, I started working on a night and weekend project: attempting to build a generic recommendation engine that could serve multiple scopes and scenarios, from small and simple to complex and articulated datasets of user-item interactions, eventually with related contextual information.

This was the moment when graphs forcefully entered my life. Such a flexible data model allowed me to store in the same way not only the users’ purchases, but also all the corollary information (later formally defined as contextual information) together with the resulting recommendation model. At that time, Neo4j 1.x had recently been released. Although it didn’t have Cypher or the other advanced query mechanisms it has now, it was stable enough for me to select it as the main graph database for my project. Adopting graphs helped me unblock the project, and after four months, I released the alpha version of reco4j: the first graph-powered recommendation engine in history!

It was the beginning of a true and passionate love story. For three years, I experimented on my own, trying to sell the reco4j idea here and there (not so successfully, to be honest) when I had a call with Michal Bachman, CEO of GraphAware. A few days later, I flew to London to sign my contract as the sixth employee of this small consultancy
📄 Page
17
firm, which helped companies succeed in their graph projects. Finally, graphs had become my raison d’être (after my two children, of course).

After that, the graph ecosystem changed a lot. More and bigger companies started to adopt graphs as their core technology to deliver advanced services to their customers or solve internal problems. At GraphAware, which had grown significantly and where I had become Chief Scientist, I had the opportunity to help companies build new services and improve existing ones with the help of graphs. Graphs were capable not only of solving classical problems (from basic search facilities to recommendation engines, from fraud detection to information retrieval); they also became a prominent technology for improving and enhancing machine learning projects. Network science and graph algorithms provided new tools for performing different types of analysis of naturally connected data and unconnected data.

In many years of consultancy, speaking with data scientists and data engineers, I found a lot of common problems that could have been solved by using graph models or graph algorithms. This experience of showing people a different way to approach machine learning projects led me to write the book you have in your hands. Graphs don’t pretend to solve all problems, but they could be another arrow in your quiver. This book is where your own love story could begin.
📄 Page
18
acknowledgments

This book took more than three years to be released. It required a lot of work, definitely more than what I thought when this crazy idea hit my mind. At the same time, it has been the most exciting experience of my career up until now. (And, yes, I am planning a second book.) I enjoyed crafting this book, but it was a long journey. That’s why I’d like to thank quite a few people for helping me along the way.

First and foremost, I want to thank my family. My poor wife, Aurora, had to sleep alone during all my long nights and early-starting mornings, and my kids rarely saw my face except on the laptop screen for countless weekends. Thank you for your understanding and your unconditional love.

Next, I’d like to acknowledge my development editor at Manning, Dustin Archibald. Thank you for working with me and teaching me all I know about writing. I especially thank you for being so patient when I was late month after month. Your commitment to the quality of this book has made it better for everyone who reads it. Another big thank you goes to all the other people at Manning who worked with me at each stage of the publishing process. Your team is a great and well-oiled machine, and I enjoyed working with each of you.

To all the reviewers: Alex Ott, Alex Lucas, Amlan Chatterjee, Angelo Simone Scotto, Arnaud Castelltort, Arno Bastenhof, Dave Bechberger, Erik Sapper, Helen Mary Labao Barrameda, Joel Neely, Jose San Leandro Armendáriz, Kalyan Reddy, Kelvin Rawls, Koushik Vikram, Lawrence Nderu, Manish Jain, Odysseas Pentakalos, Richard Vaughan, Robert Diana, Rohit Mishra, Tom Heiman, Tomaz Bratanic, Venkata Marrapu, and Vishwesh Ravi Shrimali; your suggestions helped make this a better book.
📄 Page
19
Last but not least, this book would not exist without GraphAware and especially without Michal, not only because he hired me and allowed me to grow in such an amazing company, but also because when, over a beer, I told him I was thinking about writing a book, he said, “I think you should do it!” This book also would not exist without Chris and Luanne, who have been my greatest fans from day one; without KK, who always had the right words at the right moment to cheer me up and motivate me; without Claudia, who helped me review the images; and without my awesome colleagues with whom I had the most interesting and challenging discussions. You all made this book happen!
📄 Page
20
about this book

Graph-Powered Machine Learning is a practical guide to using graphs effectively in machine learning applications, showing you all the stages of building complete solutions in which graphs play a key role. It focuses on methods, algorithms, and design patterns related to graphs. Based on my experience in building complex machine learning applications, this book suggests many recipes in which graphs are the main ingredient of a tasty product for your customers. Across the life cycle of a machine learning project, such approaches can be useful in several aspects, such as managing data sources more efficiently, implementing better algorithms, storing prediction models so that they can be accessed faster, and visualizing the results in a more effective way for further analysis.

Who should read this book?

Is this book the right book for you? If you are a practicing data scientist or data engineer, it could help you complete or start your learning path. If you are a manager who has to start or drive a new machine learning project, it could help you suggest a different perspective to your team. If you are an advanced developer who’s interested in exploring the power of graphs, it could help you discover a new perspective on the role of the graph not only as a kind of database, but also as an enabling technique for AI.

This book is not a compendium on machine learning techniques in general; it focuses on methods, algorithms, and design patterns related to graphs, which are the prominent topic here. Specifically, the book focuses on how graph approaches can help you develop and deliver better machine learning projects. Graph model techniques are