
Author: Christos Tjortjis (ed.)

Graph databases have gained increasing popularity recently, disrupting areas traditionally dominated by conventional relational, SQL-based databases, as well as domains requiring the extra capabilities afforded by graphs. This book is a timely effort to capture the state of the art of Graph databases and their applications in domains, such as Social Media analysis and Smart Cities. This practical book aims at combining various advanced tools, technologies, and techniques to aid understanding and better utilizing the power of Social Media Analytics, Data Mining, and Graph Databases. The book strives to support students, researchers, developers, and simple users involved with Data Science and Graph Databases to master the notions, concepts, techniques, and tools necessary to extract data from social media or smart cities that facilitate information acquisition, management, and prediction. The contents of the book guide the interested reader into a tour starting with a detailed comparison of relational SQL Databases with NoSQL and Graph Databases, reviewing their popularity, with a focus on Neo4j.

Tags
graph database
ISBN: 1032024798
Publisher: CRC Press
Publish Year: 2024
Language: English
Pages: 19
File Format: PDF
File Size: 12.0 MB
Graph Databases: Applications on Social Media Analytics and Smart Cities

A SCIENCE PUBLISHERS BOOK

Editor
Christos Tjortjis
Dean, School of Science and Technology
International Hellenic University
Greece
First edition published 2024
by CRC Press, 2385 NW Executive Center Drive, Suite 320, Boca Raton FL 33431
and by CRC Press, 4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

© 2024 Taylor & Francis Group, LLC

CRC Press is an imprint of Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf.co.uk

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data (applied for)

ISBN: 978-1-032-02478-3 (hbk)
ISBN: 978-1-032-02479-0 (pbk)
ISBN: 978-1-003-18353-2 (ebk)
DOI: 10.1201/9781003183532

Typeset in Times New Roman by Innovative Processors
Preface

The idea for the book came about after fruitful discussions with members of the Data Mining and Analytics research group, stemming out of frustrations using conventional database management systems for our research in Data Mining, Social Media Analytics, and Smart cities, as well as aspirations to enhance the utilisation of complex heterogeneous big data for our day-to-day research. The need was confirmed after discussions with colleagues across the globe, as well as surveying the state of the art, so we happily embarked on the challenge to put together a high-quality collection of chapters complementary but coherent, telling the story of the ever-increasing rate of graph database usage, especially in the context of social media and smart cities.

Meanwhile, our planet was taken aback by the new pandemic storm. Plans were disrupted, priorities changed, and attention was diverted towards more pressing matters. Yet, the notion of the need for social media analytics coupled with smart city applications including healthcare and facilitated by graph databases emerged even stronger.

The editor is grateful to all authors who weathered the storm for their contributions and to the editorial team for their support throughout the compilation of this book. I hope that the selected chapters offer a firm foundation, but also new knowledge and ideas for readers to understand, use and improve applications of graph databases in the areas of Social Media Analytics, Smart Cities, and beyond.

September 2022
Christos Tjortjis
Contents

Preface
Introduction
1. From Relational to NoSQL Databases – Comparison and Popularity Graph Databases and the Neo4j Use Cases
   Dimitrios Rousidis and Paraskevas Koukaras
2. A Comparative Survey of Graph Databases and Software for Social Network Analytics: The Link Prediction Perspective
   Nikos Kanakaris, Dimitrios Michail and Iraklis Varlamis
3. A Survey on Neo4j Use Cases in Social Media: Exposing New Capabilities for Knowledge Extraction
   Paraskevas Koukaras
4. Combining and Working with Multiple Social Networks on a Single Graph
   Ahmet Anil Müngen
5. Child Influencers on YouTube: From Collection to Overlapping Community Detection
   Maximilian Kissgen, Joachim Allgaier and Ralf Klamma
6. Managing Smart City Linked Data with Graph Databases: An Integrative Literature Review
   Anestis Kousis
7. Graph Databases in Smart City Applications – Using Neo4j and Machine Learning for Energy Load Forecasting
   Aristeidis Mystakidis
8. A Graph-Based Data Model for Digital Health Applications
   Jero Schäfer and Lena Wiese
Index
Introduction

Christos Tjortjis
Dean of the School of Science and Technology, International Hellenic University
email: c.tjortjis@ihu.edu.gr

With Facebook having 2.9 billion active users, YouTube following with 2.5 billion, Instagram with 1.5 billion, TikTok with 1 billion, and Twitter with 430 million, the amount of data published daily is enormous. In 2022 it was estimated that 500 million tweets were published daily, 1 billion items were posted daily across Facebook apps, there were 17 billion posts with location tracking on Facebook, and 350 million photos were uploaded daily, with the accumulated number of uploaded photos reaching 350 billion. Every minute 500 hours of new video content is uploaded to YouTube, meaning that about 82.2 years of video content is uploaded daily. On Instagram, 95 million photos and videos are uploaded daily. Gathering such rich data, often called “the digital gold rush”, processing it, and retrieving information from it are vital.

Graph databases have gained increasing popularity recently, disrupting areas traditionally dominated by conventional relational, SQL-based databases, as well as domains requiring the extra capabilities afforded by graphs. This book is a timely effort to capture the state of the art of Graph databases and their applications in domains, such as Social Media analysis and Smart Cities. This practical book aims at combining various advanced tools, technologies, and techniques to aid understanding and better utilizing the power of Social Media Analytics, Data Mining, and Graph Databases. The book strives to support students, researchers, developers, and simple users involved with Data Science and Graph Databases to master the notions, concepts, techniques, and tools necessary to extract data from social media or smart cities that facilitate information acquisition, management, and prediction.

The contents of the book guide the interested reader on a tour starting with a detailed comparison of relational SQL Databases with NoSQL and Graph Databases, reviewing their popularity, with a focus on Neo4j.
Chapter 1 details the characteristics and reviews the pros and cons of relational and NoSQL databases, assessing and explaining the increasing popularity of the latter, in particular when it comes to Neo4j. The chapter includes a categorization of NoSQL Database Management Systems (DBMS) into i) Column, ii) Document, iii) Key-value, iv) Graph and v) Time Series. Neo4j Use Cases and related scientific research are detailed, and the chapter concludes with an insightful discussion. It is essential reading for any reader who is not familiar with the related concepts before engaging with the following chapters.

Next, two surveys review the state of play regarding graph databases and social media data. The former emphasises analytics from a link prediction perspective and the latter focuses on knowledge extraction from social media data stored in Neo4j. Graph databases can manage highly connected data originating from social media, as they are suitable for storing, searching, and retrieving data that are rich in relationships.

Chapter 2 reviews the literature for graph databases and software libraries suitable for performing common social network analytic tasks. It proposes a taxonomy of graph database approaches for social network analytics based on the available algorithms and the provided means of storing, importing, exporting, and querying data, as well as the ability to deal with big social graphs, and the corresponding CPU and memory usage. Various graph technologies are evaluated by experiments related to the link prediction problem on datasets of diverse sizes.

Chapter 3 introduces novel capabilities for knowledge extraction by surveying Neo4j usage for social media. It highlights the importance of transitioning from SQL to NoSQL databases and proposes a categorization of Neo4j use cases in Social Media.
The relevant literature is reviewed including various domains, such as Recommendation systems, marketing, learning applications, Healthcare analytics, Influence detection, and Fake news. The theme is further developed by two more chapters: one elaborating on combining multiple social networks on a single graph, and another on YouTube child influencers and the relevant community detection. Chapter 4 makes the case for combining multiple Social Networks into a Single Graph, since one user maintains several accounts across a variety of social media platforms. This combination brings forward the potential for improved recommendations enhancing user experience. It studies actual data from thousands of users on nine social networks (Twitter, Instagram, Flickr, Meetup, LinkedIn, Pinterest, Reddit, Foursquare, and YouTube). Node similarity methods were developed, and node matching success was increased. In addition, a new alignment method for multiple social networks is proposed. Success rates are measured and a broad user profile covering more than one social network is created.
Chapter 5 investigates data collection and analysis about child influencers and their follower communities on YouTube to detect overlapping communities and understand the socioeconomic impact of child influencers in different cultures. It presents an approach to data collection and storage using the graph database ArangoDB, and analysis with overlapping community detection algorithms, such as SLPA, CliZZ, and LEMON. With the open source WebOCD framework, community detection revealed that communities form around child influencer channels with similar topics, and that there is a potential divide between family channel communities and singular child influencer channel communities. The network collected contains 72,577 channels and 2,025,879 edges with 388 confirmed child influencers. The collection scripts, the software, and the data set in the database are available freely for further use in education and research.

The smart city theme is investigated in three chapters. First comes a comprehensive literature survey on using graph databases to manage smart city linked data. Two case studies follow, one emphasising energy load forecasting using graph databases and Machine Learning, and another on digital health applications which utilise a Graph-Based data model.

Chapter 6 provides a detailed literature survey integrating the concepts of Smart City Linked Data with Graph Databases and social media. Based on the concept of a smart city as a complex linked system producing vast amounts of data, and carrying many connections, it capitalises on the opportunities for efficient organization and management of such complex networks provided by Graph databases, given their high performance, flexibility, and agility. The insights gained through a detailed and critical review and synthesis of the related work show that graph databases are suitable for all layers of smart city applications.
These relate to social systems including people, commerce, culture, and policies, posing as user-generated in social media. Graph databases are an efficient tool for managing the high density and interconnectivity that characterizes smart cities. Chapter 7 ventures further into the domain of smart cities focusing on the case of Energy Load Forecasting (ELF) using Neo4j, the leading NoSQL Graph database, and Machine Learning. It proposes and evaluates a method for integrating multiple approaches for executing ELF tests on historical building data. The experiments produce data resolution for 15 minutes as one step ahead of the time series forecast and reveal accuracy comparisons. The chapter provides guidelines for developing correct insights for energy demand predictions and proposes useful extensions for future work. Finally, Chapter 8 concludes with an interesting Graph-Based Data Model for Digital Health Applications in the context of smart cities. A key challenge for modern smart cities is the generation of large volumes of heterogeneous data to be integrated and managed to support the discovery of complex relationships in
domains like healthcare. This chapter proposes a methodology for transferring a relational to a graph database by mapping the relational schema to a graph schema. To this end, a relational schema graph is constructed for the relational database and transformed in several steps. The approach is demonstrated in the example of a graph-based medical information system using a dashboard on top of a Neo4j database system to visualize, explore and analyse the stored data.
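As a rough illustration of what moving from a relational to a graph representation involves, the sketch below turns rows into nodes and foreign-key values into edges. It is a generic mapping written for this note, not the multi-step schema-graph transformation that Chapter 8 proposes; the table names, column names, and the `rows_to_graph` helper are all hypothetical.

```python
def rows_to_graph(tables, foreign_keys):
    """Map relational rows to graph nodes and foreign keys to edges.

    tables:       {table_name: [row dicts, each with an "id" column]}
    foreign_keys: {(table_name, column): referenced_table}
    Returns (nodes, edges): nodes maps (table, id) -> properties,
    edges is a list of (source_node, relationship_name, target_node).
    """
    nodes, edges = {}, []
    for table, rows in tables.items():
        # Foreign-key columns of this table become relationships, not properties.
        fk_cols = {col: target
                   for (t, col), target in foreign_keys.items() if t == table}
        for row in rows:
            node_id = (table, row["id"])
            nodes[node_id] = {k: v for k, v in row.items() if k not in fk_cols}
            for col, target in fk_cols.items():
                if row.get(col) is not None:
                    edges.append((node_id, col, (target, row[col])))
    return nodes, edges


# Hypothetical example data: two tables linked by one foreign key.
tables = {
    "patient": [{"id": 1, "name": "P1"}, {"id": 2, "name": "P2"}],
    "visit": [{"id": 10, "patient_id": 1, "ward": "A"},
              {"id": 11, "patient_id": 2, "ward": "B"}],
}
foreign_keys = {("visit", "patient_id"): "patient"}
nodes, edges = rows_to_graph(tables, foreign_keys)
```

In this sketch the join that a relational query would compute at read time is stored explicitly as an edge, which is the property that makes relationship-heavy queries cheap in a graph database.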
CHAPTER 1

From Relational to NoSQL Databases – Comparison and Popularity Graph Databases and the Neo4j Use Cases

Dimitrios Rousidis [0000-0003-0632-9731] and Paraskevas Koukaras [0000-0002-1183-9878]
The Data Mining and Analytics Research Group, School of Science and Technology, International Hellenic University, GR-570 01 Thermi, Thessaloniki, Greece
e-mail: d.rousidis@ihu.edu.gr, p.koukaras@ihu.edu.gr

In this chapter an in-depth comparison between Relational Databases (RD) and NoSQL Databases is performed. There is a recent trend for the IT community and enterprises to increasingly rely on the NoSQL Databases. This chapter briefly introduces the main types of NoSQL Databases. It also investigates this trend by discussing the disadvantages of RD and the benefits of NoSQL Databases. The interchanges of the popularity in the past 10 years of both types of Databases are also depicted. Finally, the most important Graph Databases are discussed and the most popular one, namely Neo4j, is analyzed.

1.1 Introduction

In Codd We Trust. Published on March 6th, 1972, the paper with the title “Relational Completeness of Data Base Sublanguages” [1] written by Edgar Frank “Ted” Codd (19 August 1923–18 April 2003), an Oxford-educated mathematician working for IBM, was one of the most seminal and ground-breaking IT publications of the 20th century. The abstract of the publication starts with “In the near future, we can expect a great variety of languages to be proposed for interrogating and updating data bases. This paper attempts to provide a theoretical basis which may be used to determine how complete a selection capability is provided in a proposed data
sublanguage independently of any host language in which the sublanguage may be embedded”, whilst the last section of the paper with the heading “Calculus Versus Algebra” begins with “A query language (or other data sublanguage) which is claimed to be general purpose should be at least relationally complete in the sense defined in this paper. Both the algebra and calculus described herein provide a foundation for designing relationally complete query languages without resorting to programming loops or any other form of branched execution – an important consideration when interrogating a data base from a terminal”.

The paper, which proved to be prophetic, proposed a relational data model for massive, shared data banks, established a relational model based on mathematical set theory, and laid the foundation for a world created and organized by relations, tuples, attributes and relationships. In the 1970s Codd and C. J. Date developed the relational model; in 1979 Oracle developed the first commercial Relational Database Management System (RDBMS), and after more than 40 years the relational model is still the predominant database model.

Since then, the world has entered the Big Data era. According to an article by Seed Scientific titled “How much data is created every day? – 27 staggering facts”, the total volume of data in the world was predicted to reach 44 zettabytes at the start of 2020, with more than 2.5 quintillion bytes produced each day, and Google, Facebook, Microsoft, and Amazon storing at least 1,200 petabytes of data [2]. These numbers are expected to get even bigger, as Internet of Things (IoT) connected devices are projected to reach more than 75 billion, and the volume of data created globally each day is predicted to reach 463 exabytes by 2025 [3].
The constant rise in the use of Social Media (SM) [14, 15] and Cloud Computing provides many useful mechanisms in our digital world [16], but at the same time this massive production of data has inflated data volumes and forced the IT industry to search for new Database Management Systems (DBMS) that are more powerful, more flexible, and able to cope with enormous datasets. Therefore, NoSQL, originally referring to “non-SQL” but nowadays referring to “Not-Only SQL”, a term introduced by C. Strozzi in 1998 [4], provides a way of enhancing the features of standard RDBMS.

The aim of this chapter is to articulate the advantages of NoSQL DBs, and also the prospects and potential that can be realized by incorporating Machine Learning (ML) methods. It is intended to provide practitioners and academics with sufficient reasoning for the need for NoSQL DBs, which types can be used on which occasions, and what mining tasks can be conducted with them.

The rest of the chapter is structured as follows: Section 1.2 presents the characteristics, the advantages and disadvantages of both RDs and NoSQL DBs, and the categorization of NoSQL DBs. Section 1.3 analyzes the popularity of the NoSQL DBs, based on the ranking of the db-engines.com website. The next section focuses on Graph DBMS, presenting the leading one, Neo4j, and its most important use cases. The chapter concludes with the most important findings of this research.
1.2 NoSQL Databases

1.2.1 Characteristics of NoSQL Databases

In comparison to the Relational Databases’ ACID (Atomicity, Consistency, Isolation, Durability) concept, NoSQL is built on the BASE (Basically Available, Soft State, and Eventually Consistent) concept. Its key benefit is the simplicity with which it can store, handle, manipulate, and retrieve massive amounts of data, making it perfect for data-intensive internet applications [5] and giving several functional advantages and data mining capabilities [6]. The following are the primary features of NoSQL DBs:

1. Non-relational: they do not fully support relational DB capabilities such as joins.
2. No schema (they do not have a fixed data structure).
3. Data are replicated across numerous nodes, making them fault-tolerant.
4. Horizontal scalability (connecting multiple hardware or software entities so that they work as a single logical unit).
5. Since they are open source, they are inexpensive and simple to install.
6. Outstanding write-read-remove-get performance.
7. Stable consistency (all users see the same data).
8. High availability (every user has at least one copy of the desired data).
9. Partition-tolerant (the total system keeps its characteristics, even when being deployed on different servers, transparently to the client).

1.2.2 Drawbacks and Advantages

According to [7] the main uses of NoSQL in industry are:

1. Session store (which manages session data).
2. User profile store (which enables online transactions and creates a user-friendly environment).
3. Content and metadata store (building a data and metadata warehouse, enabling the storage of different types of data).
4. Mobile applications.
5. IoT (assisting the concurrent expansion, access, and manipulation of data from billions of devices).
6. Third-party aggregation (with the ease of managing massive amounts of data, allowing access by third-party organizations).
7. E-commerce (storing and handling enormous volumes of data).
8. Social gaming.
9. Ad-targeting (enabling tracking user details quickly).

From Table 1.1, where the main advantages and disadvantages of NoSQL have been gathered and briefly explained, it is clear that the benefits of NoSQL outweigh its few drawbacks, making NoSQL very effective, providing better performance, and offering a cost-effective way of creating, collecting, manipulating, querying, sharing and visualizing data.
Table 1.1. Advantages/Disadvantages of NoSQL over SQL [8]

Advantages | Explanation
Non-relational | Not tuple-based – no joins and other RD features
Schema-less | Not strict/fixed structure
Data are replicated to multiple nodes and can be partitioned | Down nodes are simply replaced, and there is no single point of failure
Horizontally scalable | Cheap, simple to set up (open-source), vast write performance, and quick key-value access
Provides a wide range of data models | Supports new/modern datatypes and models
Database administrators are not required | Not direct management and supervision required
Less hardware failures | NoSQL DBaaS providers such as Riak and Cassandra are designed to deal with equipment failures
Faster, more efficient, and flexible | Simple, scalable, efficient, multifunctional
Has evolved at a very high pace | Fast growth by IT leading companies
Less time writing queries | More time comprehending answers – more condensed and functional queries
Less time debugging queries | More time spent developing the next piece of code, elevated overall code quality
Code is easier to read | Faster ramp-up for new project members, improved maintainability and troubleshooting
The growing of big data - in that high data velocity, data variety, data volume, and data complexity | Data is constantly available. True spatial transparency. Transactional capabilities for the modern era. Data architecture that is adaptable. High-performance architecture with a high level of intelligence
It has huge volumes of fast changing structured, semi-structured, and unstructured data that are generated by users | Quick schema iteration, agile sprints and frequently code pushes quickly. Support for object-oriented programming languages that are simple to comprehend and use in a short amount of time. NoSQL is a globally distributed scale-out architecture that is not costly and monolithic. Agnostic to schema, scalability, speed, and high availability. Handles enormous amounts of data with ease and at a cheaper cost

Disadvantages | Explanation
Immature | Need additional time to acquire the RDs’ consistency, sustainability and maturity
No standard query language | Compared to RD’s SQL
Some NoSQL DBs are not ACID compliant | Atomicity, consistency, isolation, durability offers stability
No standard interface. Maintenance is difficult | Some DBs do not offer GUI yet
1.2.3 Categorization

According to the bibliography, there are five main/dominant categories for NoSQL DBMS: (1) Column, (2) Document, (3) Key-value, (4) Graph and (5) Time Series. In “NoSQL Databases List by Hosting Data” [9] it is mentioned that there are 15 categories; the five aforementioned ones and ten more that were characterized as Soft NoSQL Systems (6 to 15): (6) Multimodel DB, (7) Multivalue DB, (8) Multidimensional DB, (9) Event Sourcing, (10) XML DB, (11) Grid & Cloud DB Solutions, (12) Object DB, (13) Scientific and Specialized DBs, (14) Other NoSQL related DB, and finally (15) Unresolved and uncategorized [10], whilst the db-engines.com website has 15 categories in total, too, to be discussed in the following section.

Next, the five most popular categories are analyzed as described at the db-engines website:

(1) Key-value stores¹: Key-value stores are based on Amazon’s Dynamo paper [11], and “they are considered to be the simplest NoSQL DBMS since they can only store pairs of keys and values, as well as retrieve values when a key is known”.

Figure 1.1. Key-value data.

(2) Wide column stores²: They were first introduced in Google’s BigTable paper [12], and they are “also called extensible record stores, store data in records with an ability to hold very large numbers of dynamic columns. Since the column names as well as the record keys are not fixed, and since a record can have billions of columns, wide column stores can be seen as two-dimensional key-value stores”.

Figure 1.2. Wide column data.

¹ https://db-engines.com/en/article/Key-value+Stores
² https://db-engines.com/en/article/Wide+Column+Stores
(3) Graph databases³: Also called graph-oriented DBMS, they represent data in graph structures as nodes and edges, which are relationships between nodes. “They allow easy processing of data in that form, and simple calculation of specific properties of the graph, such as the number of steps needed to get from one node to another node”.

Figure 1.3. Graph databases data.

(4) Document stores⁴: “Document stores, also called document-oriented database systems, are characterized by their schema-free organization of data”. That is, the columns can have more than one value (arrays), the records can have a nested configuration, and the configuration of the records and the types of data are not obliged to be uniform. Thus, different records can have different columns, and the values of individual columns may be different for each record.

    persons
    {
      name: “George”,
      date_of_birth: “31/12/1999”,
      owns: [“car”, “house”, “boat”],
      like: “Mary”
    }
    {
      name: “Mary”,
      date_of_birth: “10/10/1990”,
      owns: “car”
    }

Figure 1.4. Document based data.

(5) Time series⁵: “A Time Series DBMS is a database management system that is optimized for handling time series data: each entry is associated with a timestamp. Time Series DBMS are designed to efficiently collect, store and query various time series with high transaction volumes. Although time series data can be managed with other categories of DBMS (from key-value stores to relational systems), the specific challenges often require specialized systems”.

³ https://db-engines.com/en/article/Graph+DBMS
⁴ https://db-engines.com/en/article/Document+Stores
⁵ https://db-engines.com/en/article/Time+Series+DBMS
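To make the differences between these data models more concrete, the following minimal sketch represents the first four of them with plain Python structures. It only illustrates the shapes of the data, not how any particular DBMS stores them; apart from the two person records adapted from Figure 1.4, all keys, values, node names, and the `steps_between` helper are hypothetical.

```python
from collections import deque

# (1) Key-value store: a value can only be retrieved when its key is known.
key_value = {"user:1": "George", "user:2": "Mary"}

# (2) Wide column store: record key -> {column name -> value}. Records need
# not share columns, so it behaves like a two-dimensional key-value store.
wide_column = {
    "row1": {"name": "George", "car": "yes"},
    "row2": {"name": "Mary", "car": "yes", "boat": "no", "house": "no"},
}

# (4) Document store: records may be nested and need not be uniform
# (the two documents are adapted from Figure 1.4).
persons = [
    {"name": "George", "date_of_birth": "31/12/1999",
     "owns": ["car", "house", "boat"], "like": "Mary"},
    {"name": "Mary", "date_of_birth": "10/10/1990", "owns": "car"},
]

# (3) Graph: nodes and directed edges, held here as an adjacency list.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}


def steps_between(graph, start, goal):
    """Return the minimum number of edges from start to goal, or None.

    Uses breadth-first search, so the first time the goal is reached
    the path length is minimal.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

The `steps_between` function corresponds to the quoted example of a graph property, "the number of steps needed to get from one node to another node"; a graph DBMS would typically expose such a computation through its query language rather than through application code.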
Measurement Time | System Threads | System Processes
2021/07/01 00:00 | 1,078 | 65
2021/07/01 01:00 | 1,119 | 66
2021/07/01 02:00 | 654 | 39
…
2021/07/01 21:00 | 655 | 33
2021/07/01 22:00 | 1,251 | 72
2021/07/01 23:00 | 1,975 | 99

Figure 1.5. Time series data.

1.3 Popularity

In this section the popularity of the Relational and the NoSQL databases is discussed, based on the ranking of an excellent website (db-engines.com). The creators of db-engines.com have accumulated hundreds of databases and they have developed a methodology where all these databases are ranked based on their popularity.

1.3.1 DB engines

The creators of db-engines claim that the platform “is an initiative to collect and present information on database management systems (DBMS). In addition to established relational DBMS, systems and concepts of the growing NoSQL area are emphasized. The DB-Engines Ranking is a list of DBMS ranked by their current popularity. The list is updated monthly. The most important properties of numerous systems are shown in the overview of database management systems”. Users can examine each system’s attributes and compare systems side by side. The terms and concepts of the field are explained in the database encyclopedia. Recent DB-Engines news, citations and major events are also highlighted on the website.

1.3.2 Methodology

The platform’s creators established a system for computing DBMS scores termed ‘DB-Engines Ranking’, which is a list of DBMS rated by their current popularity. They use the following parameters to assess a system’s popularity⁶:

• Number of mentions of the system on websites, assessed by the number of results in Google and Bing search engine inquiries. To count only relevant results, they search for the system name followed by the phrase database, such as ‘Oracle’ and ‘database’.
• General interest in the system, measured by the number of queries in Google Trends.
• Frequency of technical discussions about the system. They utilize the number of related questions and interested users on the well-known IT-related Q&A sites Stack Overflow and DBA Stack Exchange.
• Number of job offers, in which the system is referenced and where they utilize the number of offers on the top job search engines Indeed and Simply Hired.
• Number of profiles in professional networks, where the system is referenced, as determined by data from the most prominent professional network LinkedIn.
• Relevance in social networks, where they calculate the number of tweets, in which the system is cited.

⁶ https://db-engines.com/en/ranking_definition

They calculate the popularity value of a system “by standardizing and averaging of the individual parameters. These mathematical transformations are made in a way so that the distance of the individual systems is preserved. That means, when system A has twice as large a value in the DB-Engines Ranking as system B, then it is twice as popular when averaged over the individual evaluation criteria”. To remove the impact induced by changing amounts of the data sources themselves, the popularity score is a relative number that should only be understood in relation to other systems. However, the DB-Engines Ranking does not take into account the number of systems installed or their application in IT systems.

On the website there is information on 432 database systems (https://db-engines.com/en/systems), which are examined and divided into 15 categories. However, just 381 databases are ranked in accordance with the aforementioned methodology. The number and percentage of databases per category is shown in Table 1.2.

Table 1.2. Number of DBMS by category

Categories | Number of databases | Percentage (%)
Relational DBMS | 152 | 39.90
Key-value stores | 64 | 16.80
Document stores | 53 | 13.91
Time series DBMS | 39 | 10.24
Graph DBMS | 36 | 9.45
Object oriented DBMS | 21 | 5.51
Search engines | 21 | 5.51
Wide column stores | 21 | 5.51
RDF stores | 20 | 5.25
Multivalue DBMS | 11 | 2.89
Native XML DBMS | 7 | 1.84
Spatial DBMS | 5 | 1.31
Event stores | 3 | 0.79
Content stores | 2 | 0.52
Navigational DBMS | 2 | 0.52
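The percentages in Table 1.2 appear to be each category's count divided by the 381 ranked systems, not by the sum of the counts. A minimal check in Python, assuming that reading (the variable names are chosen for this note):

```python
TOTAL_RANKED = 381  # number of ranked databases stated in the text

# Category counts copied from Table 1.2.
counts = {
    "Relational DBMS": 152, "Key-value stores": 64, "Document stores": 53,
    "Time series DBMS": 39, "Graph DBMS": 36, "Object oriented DBMS": 21,
    "Search engines": 21, "Wide column stores": 21, "RDF stores": 20,
    "Multivalue DBMS": 11, "Native XML DBMS": 7, "Spatial DBMS": 5,
    "Event stores": 3, "Content stores": 2, "Navigational DBMS": 2,
}

# Percentage of the ranked systems that fall in each category.
percentages = {name: round(100 * n / TOTAL_RANKED, 2)
               for name, n in counts.items()}

# Total number of category memberships across all systems.
total_memberships = sum(counts.values())
```

Under this assumption the computed values agree with the table, and the category memberships add up to 457, which is more than the 381 ranked systems.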
The category counts add up to more than 381, since some databases belong to more than one category.

1.3.3 Measuring Popularity

As per the DB-Engines ranking, “an initiative to collect and present information on DBMS” that ranks (updated monthly) DB management systems based on their popularity, NoSQL DBs are steadily on the rise, whereas relational DBs, while still at the top, either remain stable or show a minor declining trend. The DB-engines technique for determining a system’s popularity is based on the following six parameters: (1) Number of mentions of the system on websites, (2) General interest in the system, (3) Frequency of technical debates about the system, (4) Number of job offers mentioning the system, (5) Number of profiles in professional networks mentioning the system, and (6) Relevance in social networks. The top 20 most popular DBMSs, with their scores, are presented in Table 1.3, which also includes the change of their position since the previous month and the previous year (2020).

Table 1.3. Databases popularity (December 2021)

Rank Dec 2021 | Rank Nov 2021 | Rank Dec 2020 | DBMS | Database model | Score Dec 2021
1 | 1 | 1 | Oracle | Relational, Multi-model | 1,281.74
2 | 2 | 2 | MySQL | Relational, Multi-model | 1,206.04
3 | 3 | 3 | Microsoft SQL Server | Relational, Multi-model | 954.02
4 | 4 | 4 | PostgreSQL | Relational, Multi-model | 608.21
5 | 5 | 5 | MongoDB | Document, Multi-model | 484.67
6 | 6 | 7 | Redis | Key-value, Multi-model | 173.54
7 | 7 | 6 | IBM Db2 | Relational, Multi-model | 167.18
8 | 8 | 8 | Elasticsearch | Search engine, Multi-model | 157.72
9 | 9 | 9 | SQLite | Relational | 128.68
10 | 11 | 11 | Microsoft Access | Relational | 125.99
11 | 10 | 10 | Cassandra | Wide column | 119.2
12 | 12 | 12 | MariaDB | Relational, Multi-model | 104.36
13 | 13 | 13 | Splunk | Search engine | 94.32
14 | 15 | 16 | Microsoft Azure SQL Database | Relational, Multi-model | 83.25
15 | 14 | 15 | Hive | Relational | 81.93
16 | 16 | 17 | Amazon DynamoDB | Multi-model | 77.63
17 | 18 | 41 | Snowflake | Relational | 71.03
18 | 17 | 14 | Teradata | Relational, Multi-model | 70.29
19 | 19 | 19 | Neo4j | Graph | 58.03
20 | 22 | 21 | Solr | Search engine, Multi-model | 57.72
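The text describes the score only as "standardizing and averaging" the six parameters in a way that preserves ratios between systems; the exact DB-Engines formula is not given. One way to sketch such a score in Python, purely as an assumption, is to divide each parameter by its mean across all systems and then average the results. The function name and the input numbers below are hypothetical.

```python
def popularity_scores(raw):
    """Compute a relative popularity score per system.

    raw: {system_name: [parameter values]}, with the same number of
    parameters for every system and a positive mean for each parameter.
    Each parameter is scaled by its mean across systems (an assumed form
    of standardization), then the scaled values are averaged per system.
    """
    n_params = len(next(iter(raw.values())))
    means = [sum(values[i] for values in raw.values()) / len(raw)
             for i in range(n_params)]
    return {
        name: sum(values[i] / means[i] for i in range(n_params)) / n_params
        for name, values in raw.items()
    }


# Hypothetical measurements for two systems and three parameters
# (for example mentions, job offers, and profiles).
raw = {"A": [200, 40, 10], "B": [100, 20, 5]}
scores = popularity_scores(raw)
```

Because every parameter of system A is twice that of system B in this toy input, A's score comes out twice B's, which matches the property quoted from the ranking definition. The scores are relative numbers and only meaningful in comparison with one another.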