Graph Algorithms for Data Science: With examples in Neo4j (Tomaž Bratanič)

MANNING
Tomaž Bratanič
Foreword by Michael Hunger
With examples in Neo4j

Path to becoming a graph data scientist

Graph modeling and construction. Learn how to
• Identify relationships between data points
• Describe a graph structure
• Import data into a graph database

Graph query language. Learn how to
• Identify graph patterns
• Traverse connections
• Aggregate data
• Perform exploratory data analysis

Graph algorithms and inferred networks. Learn how to
• Find the most important or critical nodes
• Group nodes into communities
• Identify similar nodes
• Analyze indirect relationships

Graph machine learning. Learn how to
• Extract features from graphs
• Predict node labels
• Predict new connections

Graph Algorithms for Data Science
WITH EXAMPLES IN NEO4J
TOMAŽ BRATANIČ
MANNING
SHELTER ISLAND

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2024 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

The authors and publisher have made every effort to ensure that the information in this book was correct at press time. The authors and publisher do not assume and hereby disclaim any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause, or from any usage of the information herein.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Dustin Archibald
Technical editor: Arturo Geigel
Technical development editor: Ninoslav Cerkez
Review editor: Aleksandar Dragosavljević
Production editor: Deirdre S. Hiam
Copy editor: Christian Berk
Proofreader: Katie Tennant
Technical proofreader: Jerry Kuch
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor

ISBN: 9781617299469
Printed in the United States of America

brief contents

PART 1 INTRODUCTION TO GRAPHS
1 ■ Graphs and network science: An introduction
2 ■ Representing network structure: Designing your first graph model

PART 2 SOCIAL NETWORK ANALYSIS
3 ■ Your first steps with Cypher query language
4 ■ Exploratory graph analysis
5 ■ Introduction to social network analysis
6 ■ Projecting monopartite networks
7 ■ Inferring co-occurrence networks based on bipartite networks
8 ■ Constructing a nearest neighbor similarity network

PART 3 GRAPH MACHINE LEARNING
9 ■ Node embeddings and classification
10 ■ Link prediction
11 ■ Knowledge graph completion
12 ■ Constructing a graph using natural language processing techniques

contents

foreword
preface
acknowledgments
about this book
about the author
about the cover illustration

PART 1 INTRODUCTION TO GRAPHS

1 Graphs and network science: An introduction
1.1 Understanding data through relationships
1.2 How to spot a graph-shaped problem
    Self-referencing relationships ■ Pathfinding networks ■ Bipartite graphs ■ Complex networks

2 Representing network structure: Designing your first graph model
2.1 Graph terminology
    Directed vs. undirected graph ■ Weighted vs. unweighted graphs ■ Bipartite vs. monopartite graphs ■ Multigraph vs. simple graph ■ A complete graph
2.2 Network representations
    Labeled-property graph model
2.3 Designing your first labeled-property graph model
    Follower network ■ User–tweet network ■ Retweet network ■ Representing graph schema
2.4 Extracting knowledge from text
    Links ■ Hashtags ■ Mentions ■ Final Twitter social network schema

PART 2 SOCIAL NETWORK ANALYSIS

3 Your first steps with Cypher query language
3.1 Cypher query language clauses
    CREATE clause ■ MATCH clause ■ WITH clause ■ SET clause ■ REMOVE clause ■ DELETE clause ■ MERGE clause
3.2 Importing CSV files with Cypher
    Clean up the database ■ Twitter graph model ■ Unique constraints ■ LOAD CSV clause ■ Importing the Twitter social network
3.3 Solutions to exercises

4 Exploratory graph analysis
4.1 Exploring the Twitter network
4.2 Aggregating data with Cypher query language
    Time aggregations
4.3 Filtering graph patterns
4.4 Counting subqueries
4.5 Multiple aggregations in sequence
4.6 Solutions to exercises

5 Introduction to social network analysis
5.1 Follower network
    Node degree distribution
5.2 Introduction to the Neo4j Graph Data Science library
    Graph catalog and native projection
5.3 Network characterization
    Weakly connected component algorithm ■ Strongly connected components algorithm ■ Local clustering coefficient
5.4 Identifying central nodes
    PageRank algorithm ■ Personalized PageRank algorithm ■ Dropping the named graph
5.5 Solutions to exercises

6 Projecting monopartite networks
6.1 Translating an indirect multihop path into a direct relationship
    Cypher projection
6.2 Retweet network characterization
    Degree centrality ■ Weakly connected components
6.3 Identifying the most influential content creators
    Excluding self-loops ■ Weighted PageRank variant ■ Dropping the projected in-memory graph
6.4 Solutions to exercises

7 Inferring co-occurrence networks based on bipartite networks
7.1 Extracting hashtags from tweets
7.2 Constructing the co-occurrence network
    Jaccard similarity coefficient ■ Node similarity algorithm
7.3 Characterization of the co-occurrence network
    Node degree centrality ■ Weakly connected components
7.4 Community detection with the label propagation algorithm
7.5 Identifying community representatives with PageRank
    Dropping the projected in-memory graphs
7.6 Solutions to exercises

8 Constructing a nearest neighbor similarity network
8.1 Feature extraction
    Motifs and graphlets ■ Betweenness centrality ■ Closeness centrality
8.2 Constructing the nearest neighbor graph
    Evaluating features ■ Inferring the similarity network
8.3 User segmentation with the community detection algorithm
8.4 Solutions to exercises

PART 3 GRAPH MACHINE LEARNING

9 Node embeddings and classification
9.1 Node embedding models
    Homophily vs. structural roles approach ■ Inductive vs. transductive embedding models
9.2 Node classification task
    Defining a connection to a Neo4j database ■ Importing a Twitch dataset
9.3 The node2vec algorithm
    The word2vec algorithm ■ Random walks ■ Calculate node2vec embeddings ■ Evaluating node embeddings ■ Training a classification model ■ Evaluating predictions
9.4 Solutions to exercises

10 Link prediction
10.1 Link prediction workflow
10.2 Dataset split
    Time-based split ■ Random split ■ Negative samples
10.3 Network feature engineering
    Network distance ■ Preferential attachment ■ Common neighbors ■ Adamic–Adar index ■ Clustering coefficient of common neighbors
10.4 Link prediction classification model
    Missing values ■ Training the model ■ Evaluating the model
10.5 Solutions to exercises

11 Knowledge graph completion
11.1 Knowledge graph embedding model
    Triple ■ TransE ■ TransE limitations
11.2 Knowledge graph completion
    Hetionet ■ Dataset split ■ Train a PairRE model ■ Drug application predictions ■ Explaining predictions
11.3 Solutions to exercises

12 Constructing a graph using natural language processing techniques
12.1 Coreference resolution
12.2 Named entity recognition
    Entity linking
12.3 Relation extraction
12.4 Implementation of information extraction pipeline
    SpaCy ■ Coreference resolution ■ End-to-end relation extraction ■ Entity linking ■ External data enrichment
12.5 Solutions to exercises

appendix The Neo4j environment
references
index


foreword

When you read this book, I hope you are as astonished by the power of relationships and connected information as I was when I first met Emil Eifrém, one of the founders of Neo4j, 15 years ago on a geek cruise on the Baltic Sea. Ten years later, a similarly inspiring and impactful meeting happened when Tomaž and I met for the first time in person in London. He’d been active in the Neo4j community for a while. After that meeting, his contributions skyrocketed, initially helping test and document the predecessor of the Graph Data Science library and at the same time becoming a prolific author on data science topics related to graphs, NLP, and their practical applications (bratanic-tomaz.medium.com). Tomaž must have published hundreds of articles by the time we were contacted by Manning to discuss creating a book on graph analytics, the one you’re holding right now.

Tomaž was the obvious choice to become its author, and he did an amazing job, distilling his experience, educational writing style, and real-world examples into an insightful and entertaining book. This book is a journey into the hidden depths of connected data using graph algorithms, new ML techniques like node embeddings, and graph machine learning tasks like link prediction and node classification, many of which now find applications in areas like vector search or large language models based on transformers like GPT.

I’ve often said that in the real world, there is no such thing as isolation; everything is connected: people, events, devices, content and products, art, history, politics, markets, biological pathways, and climate tipping points, from the smallest subatomic particles (relational quantum dynamics) to the largest structures in the universe (galactic pathways). Humans have accelerated the volume and density of those connections by adding information technology, the internet, social networks, mobile computing, IoT, and widespread use of ML models. Our lives depend on all those networks working properly, even if we are unaware of most of them.

How does one make sense of all these obvious and hidden relationships, which add context and meaning to all individual data points? Sure, you can query for patterns that you already know, but what about the unknown unknowns? This is where graph analytics and graph-based ML techniques shine. They help you find the insights you need.

We start with centrality or clustering algorithms, like PageRank or Louvain, which can be used for unsupervised learning about the structure and importance of elements in your data. One of my favorite examples is still Network of Thrones by Andrew Beveridge, where he used spatial closeness of characters in the natural-language-processed texts of the Game of Thrones books to determine importance, groups, and dependencies. Those algorithms achieved results impressively similar to what you as a human would find if you read the books.

Using results from those algorithms as feature vectors in your ML models already improves the accuracy of your predictions, as they capture the context of your entities both structurally and behaviorally. But you can even go a step further and explicitly compute embeddings for nodes based on graph operations. One of the first algorithms in this space was node2vec, which used the word2vec embeddings on paths from random walks out of your starting point (an approach conceptually similar to PageRank). Now, we are much further along, with knowledge graph embeddings using graphs as inputs and outputs that can make real use of the richness of connected data. And in current ML papers and architectures, you will commonly find mentions of graph structures and algorithms, so this is now a kind of foundational technology.

Tomaž will take you along the learning journey, starting from data modeling, ingestion, and querying; to the first applications of graph algorithms; all the way to extracting knowledge graphs from text using NLP; and, finally, utilizing embeddings of nodes and graphs in ML training applications. Enjoy the ride to its graph epiphany, and I hope you will come out on the other side a graph addict, as we all turned out to be.

MICHAEL HUNGER, senior director of user innovation, Neo4j

preface

I transitioned to software development in my professional path about seven years ago. As if the universe had a plan for me, I was gently pushed toward graphs in my first developer job. I am probably one of the few people who can claim Cypher query language was the first language they were introduced to and started using, even before SQL or any scripting language, like Python. Kristjan Pećanac, my boss at the time, foresaw that graphs, particularly labeled-property graphs, were the future. At the time, there weren’t many native graph databases out there, so Neo4j felt like a clear-cut choice.

The more I learned about graphs and Neo4j, the more I liked them. However, one thing was rather obvious. Even though there were so many awesome things I could do with graphs, the documentation could have been much better. I started writing a blog to showcase all the remarkable things one can do with graphs and to spare people the effort of searching the internet and source code to learn how to implement various features and workflows. Additionally, I treated the blog as a repository of code I could use and copy in my projects.

Fast-forward five years: after more than 70 published blog posts, I authored a post about combining natural language processing and graphs. It was probably my best post to date, and interestingly, I wrote in the summary that if I ever wrote a book, that post would be a chapter in it. Life is a combination of lucky coincidences. Michael Hunger read my NLP post and asked if I was serious about writing a book. I half-jokingly replied that writing a book might be a good idea and would help me advance in my career. Michael took it seriously, and we met with Manning the next month. The rest is history, and the book before you is the result of my journey to make graphs and graph data science easier to learn, understand, and implement in your projects.

acknowledgments

At first, I didn’t realize how much work goes into writing a book. After writing this book, I have gained considerable respect for any author who has published a book. Luckily, I had great people around me who helped improve the book with their ideas, reviews, and feedback.

First, I would like to thank my development editor at Manning, Dustin Archibald, for helping me become a better writer and guiding and introducing me to the many concepts that make a great book even better. Thank you as well, Deirdre Hiam, my project editor; Christian Berk, my copyeditor; Katie Tennant, my proofreader; and Aleksandar Dragosavljević, my reviewing editor.

I would also like to thank, in no particular order, the many people who contributed their ideas and helped with reviews: Ljubica Lazarevic, Gopi Krishna Phani Dathar, David Allen, Charles Tapley Hoyt, Pere-Lluís Huguet Cabot, Amy Hodler, Vlad Batushkov, Jaglan Gaurav, Megan Tomlin, Al Krinker, Andrea Paciolla, Atilla Ozgur, Avinash Tiwari, Carl Yu, Chris Allan, Clair Sullivan, Daivid Morgan, Dinesh Ghanta, Hakan Lofquist, Ian Long, Ioannis Atsonios, Jan Pieter Herweijer, Karthik Rajan, Katie Roberts, Lokesh Kumar, Marcin Sęk, Mark Needham, Mike Fowler, Ninoslav Cerkez, Pethuru Raj, Philip Patterson, Prasad Seemakurthi, Richard Tobias, Sergio Govoni, Simone Sguazza, Subhash Talluri, Sumit Pal, Syed Nouman Hasany, Thomas Joseph Heiman, Tim Wooldridge, Tom Kelly, Viron Dadala, and Yogesh Kulkarni.

I would also like to extend my gratitude to Jerry Kuch and Arturo Geigel for their invaluable technical comments. Arturo is an independent researcher from Puerto Rico. He received his PhD in computer science from Nova Southeastern University, is recognized for being the inventor of Neural Trojans, and currently carries out research in machine learning, graph theory, and technological analysis.

about this book

Graph Algorithms for Data Science was written to help you incorporate graph analytic toolkits into your analytics workflows. The idea behind the book is to take a person who has never heard of graphs before and walk them through their first graph model and graph aggregations, eventually arriving at more advanced graph machine learning workflows, like node classification and link prediction.

Who should read this book

Graph Algorithms for Data Science is intended for data analysts and developers looking to augment their data analytics toolkit by incorporating graph algorithms to explore relationships between data points. This book is perfect for individuals with a basic understanding of Python and machine learning concepts, like classification models, eager to enhance their data analysis capabilities. With its structured approach, this book caters to a wide range of readers, aiding junior analysts in building a strong foundation in graph algorithms while also providing more experienced analysts with new perspectives and advanced techniques, thereby broadening their data science competencies.

How this book is organized

The book has three parts that cover 12 chapters. Part 1 introduces graphs and walks you through a graph modeling task:

• Chapter 1 introduces the concept of graphs and how to spot a graph-shaped problem. It also introduces the types of graph algorithms you will learn about throughout the book.
• Chapter 2 starts by presenting basic graph terminology you can use to describe a graph. It continues by introducing a labeled-property graph model and walking you through a graph modeling task.

Part 2 introduces Cypher query language and frequently used graph algorithms:

• Chapter 3 covers the basic Cypher query language syntax and clauses. It also demonstrates how to import a graph from CSV files.
• Chapter 4 walks you through an exploratory graph analysis. You will learn how to retrieve, filter, and aggregate data using Cypher query language.
• Chapter 5 demonstrates how to use Cypher query language and graph algorithms to characterize a graph. It also shows how to find the most important nodes by using the PageRank algorithm.
• Chapter 6 illustrates how to transform indirect relationships between data points into direct ones, which can be used as input to graph algorithms. Additionally, it introduces the weighted variants of some graph algorithms, like node degree and PageRank.
• Chapter 7 shows how to project a co-occurrence network, where the number of common neighbors between a pair of nodes defines how similar they are.
• Chapter 8 demonstrates how to characterize node roles in the network using various features and metrics. Later in the chapter, you will learn how to construct a k-nearest neighbor graph and find communities of nodes with similar roles.

Part 3 covers more advanced graph machine learning workflows, such as node classification and link prediction:

• Chapter 9 introduces node embedding models and walks you through a node classification task.
• Chapter 10 walks you through link prediction tasks, where you use Cypher query language to extract relevant features and use them to train a link prediction model.
• Chapter 11 covers the difference between link prediction in simple versus complex graphs and introduces knowledge graph embedding models, which can be used to predict links in complex networks.
• Chapter 12 shows how to construct a graph using natural language processing techniques, like named entity recognition and relationship extraction.

In overview, the first two chapters introduce you to basic graph theory and terminology while also discussing the Twitter graph model that will be used in chapters 3–8. Chapters 3 and 4 are intended to familiarize you with Cypher query language. The following chapters are designed as individual analyst assignments, introducing relevant graph algorithms where needed.

About the code

This book contains many examples of source code both in numbered listings and in line with normal text. In both cases, source code is formatted in a fixed-width font like this to separate it from ordinary text. Sometimes code is also in bold to highlight code that has changed from previous steps in the chapter, such as when a new feature adds to an existing line of code.

In many cases, the original source code has been reformatted; we’ve added line breaks and reworked indentation to accommodate the available page space in the book. In rare cases, even this was not enough, and listings include line-continuation markers (➥). Additionally, comments in the source code have often been removed from the listings when the code is described in the text. Code annotations accompany many of the listings, highlighting important concepts.

The source code for chapters 3–8 is only available as part of the book, while the source code for chapters 9–12 is provided as Jupyter notebooks on this GitHub repository: https://github.com/tomasonjo/graphs-network-science.

liveBook discussion forum

Purchase of Graph Algorithms for Data Science includes free access to liveBook, Manning’s online reading platform. Using liveBook’s exclusive discussion features, you can attach comments to the book globally or to specific sections or paragraphs. It’s a snap to make notes for yourself, ask and answer technical questions, and receive help from the author and other users. To access the forum, go to https://livebook.manning.com/book/graph-algorithms-for-data-science/discussion. You can also learn more about Manning’s forums and the rules of conduct at https://livebook.manning.com/discussion.

Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the author can take place. It is not a commitment to any specific amount of participation on the part of the author, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking him some challenging questions lest his interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.
