Pethuru Raj · Alvaro Rocha · Simar Preet Singh · Pushan Kumar Dutta · B. Sundaravadivazhagan (Editors)

Information Systems Engineering and Management, Volume 14

Building Embodied AI Systems: The Agents, the Architecture Principles, Challenges, and Application Domains
Information Systems Engineering and Management, Volume 14

Series Editor: Álvaro Rocha, ISEG, University of Lisbon, Lisbon, Portugal

Editorial Board: Abdelkader Hameurlain, Université Toulouse III Paul Sabatier, Toulouse, France · Ali Idri, ENSIAS, Mohammed V University, Rabat, Morocco · Ashok Vaseashta, International Clean Water Institute, Manassas, VA, USA · Ashwani Kumar Dubey, Amity University, Noida, India · Carlos Montenegro, Francisco José de Caldas District University, Bogota, Colombia · Claude Laporte, University of Quebec, Québec, QC, Canada · Fernando Moreira, Portucalense University, Porto, Portugal · Francisco Peñalvo, University of Salamanca, Salamanca, Spain · Gintautas Dzemyda, Vilnius University, Vilnius, Lithuania · Jezreel Mejia-Miranda, CIMAT—Center for Mathematical Research, Zacatecas, Mexico · Jon Hall, The Open University, Milton Keynes, UK · Mário Piattini, University of Castilla-La Mancha, Albacete, Spain · Maristela Holanda, University of Brasilia, Brasilia, Brazil · Mincong Tang, Beijing Jiaotong University, Beijing, China · Mirjana Ivanović, Department of Mathematics and Informatics, University of Novi Sad, Novi Sad, Serbia · Mirna Muñoz, CIMAT Center for Mathematical Research, Progreso, Mexico · Rajeev Kanth, University of Turku, Turku, Finland · Sajid Anwar, Institute of Management Sciences, Peshawar, Pakistan · Tutut Herawan, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia · Valentina Colla, TeCIP Institute, Scuola Superiore Sant'Anna, Pisa, Italy · Vladan Devedzic, University of Belgrade, Belgrade, Serbia
The book series "Information Systems Engineering and Management" (ISEM) publishes innovative and original works in the various areas of planning, development, implementation, and management of information systems and technologies by enterprises, citizens, and society for the improvement of the socio-economic environment.

The series is multidisciplinary, focusing on technological, organizational, and social domains of information systems engineering and management. Manuscripts published in this book series focus on relevant problems and research in the planning, analysis, design, implementation, exploration, and management of all types of information systems and technologies. The series contains monographs, lecture notes, edited volumes, pedagogical and technical books as well as proceedings volumes.

Some topics/keywords to be considered in the ISEM book series are, but are not limited to: Information Systems Planning; Information Systems Development; Exploration of Information Systems; Management of Information Systems; Blockchain Technology; Cloud Computing; Artificial Intelligence (AI) and Machine Learning; Big Data Analytics; Multimedia Systems; Computer Networks, Mobility and Pervasive Systems; IT Security, Ethics and Privacy; Cybersecurity; Digital Platforms and Services; Requirements Engineering; Software Engineering; Process and Knowledge Engineering; Security and Privacy Engineering; Autonomous Robotics; Human-Computer Interaction; Marketing and Information; Tourism and Information; Finance and Value; Decisions and Risk; Innovation and Projects; Strategy and People.

Indexed by Google Scholar. All books published in the series are submitted for consideration in the Web of Science. For book or proceedings proposals please contact Alvaro Rocha (amrrocha@gmail.com).
Editors

Pethuru Raj, Reliance Jio Platforms, Bengaluru, Karnataka, India
Alvaro Rocha, University of Lisbon, Lisbon, Portugal
Simar Preet Singh, Bennett University, Greater Noida, Uttar Pradesh, India
Pushan Kumar Dutta, Department of Electronics and Communication Engineering, Amity University, Kolkata, West Bengal, India
B. Sundaravadivazhagan, Department of Information and Technology, University of Technology and Applied Sciences, Al Mussanah, Oman

ISSN 3004-958X · ISSN 3004-9598 (electronic)
Information Systems Engineering and Management
ISBN 978-3-031-68255-1 · ISBN 978-3-031-68256-8 (eBook)
https://doi.org/10.1007/978-3-031-68256-8

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

If disposing of this product, please recycle the paper.
Contents

Building Embodied AI Systems: The Agents, the Architectural Principles, Challenges and Application Domains
Anubhav Bewerwal, Amit Kumar, and Manoj Kumar

Demystifying Embodied AI
Sunil Sable and Mitesh Ikar

Navigating the Nexus: Unravelling Challenges, Ethics, and Applications of Embodied AI in Drone Technology Through the Lens of Computer Vision
G. Arun Sampaul Thomas, S. Muthukaruppasamy, S. Sathish Kumar, Beulah J. Karthikeyan, and K. Saravanan

Artificial Intelligence Algorithm Models for Agents of Embodiment for Drone Applications
Sateesh Kourav, Kirti Verma, and M. Sundararajan

Artificial Intelligence Algorithms and Models for Embodied Agents: Enhancing Autonomy in Drones and Robots
Gnanasankaran Natarajan, Elakkiya Elango, B. Sundaravadivazhagan, and Sandha Rethinam

Enhanced Security and Privacy from Industry 4.0 and 5.0 Vision
Tarun Kumar Vashishth, Vikas Sharma, Kewal Krishan Sharma, and Bhupendra Kumar

Exploring Applications: Intelligent Drones and Robots in Industrial Settings
Divya Bansal, Naboshree Bhattacharya, and Priyanka Shandilya

The Industrial Revolution: Harnessing Embodied AI Systems
B. Sundaravadivazhagan, S. Lakshmisridevi, N. A. Natraj, M. Ramkumar, and K. Madhusudhana Rao

Synergistic Fusion: Vision-Language Models in Advancing Autonomous Driving and Intelligent Transportation Systems
Abha Kiran Rajpoot and Gaurav Agrawal

Health Care Industry Use Cases of Embodied AI
S. N. Kumar, Jomin Joy, Alen J. James, and Andrew Dixen

Computing, Clouds, Analytics and Artificial Intelligence at the Edge
Agha Urfi Mirza and Praveen Kumar Chandapeta

Enhancing Law Enforcement Through Pose-Based Facial Recognition and Image Normalization Techniques
Özen Özer

Embracing the Future: Navigating the Challenges and Solutions in Embodied Artificial Intelligence
Wasim Khan and Mohammad Ishrat

Architecture and Advances in Unsupervised Models: A Conceptual Approach for 21st Century Smart Life Style
Shivani Trivedi, Vanshika Aggarwal, and Rohit Rastogi

Artificial Intelligence in 2D Games: Analysis on Customised Character Generation
Hemlata Parmar and Utsav Krishan Murari

The Industrial Use Cases of Embodied AI Systems
M. Shanmuga Sundari and Kisara Rishitha

The Industrial AI Revolution: A Guide to Embodied AI Systems
Abhishek Choubey and Shruti Bhargava Choubey

Illuminate Metaverse Multisensor Fusion and Dynamic Routing Technologies Across Web3-Powered for Autonomous Vehicles Shaping Efficient Urban Transport Solutions of Future in Smart City Era: Deep Dive into Protocols for Benefiting Society Lensing Prospects and Challenges
Bhupinder Singh, Komal Vig, Pushan Kumar Dutta, and Christian Kaunert

Artificial Intelligence (AI) Algorithm and Models for Embodied Agents (Robots and Drones)
P. Chitra and A. Saleem Raja

Entertainment Recommendation and Rating System Based on Emotions
Geetraj Kumar Shaw, Romit Chattopadhyay, Anubhab Paul, Kumar Rounik, Sawan Kumar, Biswapa Saha, and Pabak Indu

Securing Embodied AI: Addressing Cybersecurity Challenges in Physical Systems
Fatema khalifa said ALSaidi, Sabah Ali AL'Abd AL-Busaidi, and B. Sundaravadivazhagan

The 5G Era: Transforming Connectivity and Enabling New Use Cases Across Industries
Maad M. Mijwil, Mostafa Abotaleb, and Pushan Kumar Dutta
Building Embodied AI Systems: The Agents, the Architectural Principles, Challenges and Application Domains

Anubhav Bewerwal, Amit Kumar, and Manoj Kumar

Abstract  The field of embodied AI is experiencing tremendous growth due to recent advancements in computer vision, machine learning, and artificial intelligence. Intel Labs and Facebook AI Research (FAIR) have been leading new initiatives in the field. The definition of "embodied" is "giving an idea a tangible or visible form," so "embodied AI" is, put simply, "AI for virtual robots." More precisely, embodied AI is the area of study that deals with AI challenges for virtual robots that can move, see, talk, and communicate with other virtual robots. Solutions developed for these simulated robots are subsequently transferred to real-world robots. AI subfields have been largely divided due to a variety of constraints; embodied AI, on the other hand, unites a number of interdisciplinary disciplines, including robotics, computer vision, reinforcement learning, navigation, physics-based simulation, and Natural Language Processing (NLP). Computer vision techniques have played a major role in the evolution of embodied AI as a research area, even though embodied AI must also succeed across numerous AI subfields.

Keywords  Kernel point convolution · Residual neural networks · Contrastive language-image pretraining · Mask R-CNN · Proximal policy optimization · DDPPO

A. Bewerwal (corresponding author), Department of Computer Science and Engineering, Graphic Era Hill University, Bhimtal Campus, Nainital, Uttarakhand, India (anubhavbewerwal@gehu.ac.in)
A. Kumar, Department of Computer Science and Engineering, Ajay Kumar Garg Engineering College, Ghaziabad, UP, India (kumaramit2@akgec.ac.in)
M. Kumar, School of Computer Science, FEIS, University of Wollongong in Dubai, Dubai Knowledge Park, Dubai, UAE; MEU Research Unit, Middle East University, Amman 11831, Jordan (manojkumar@uowdubai.ac.ae)
1 The Connection Between Natural Language Processing and Computer Vision

From a technical standpoint, natural language processing tasks are more diverse than those in computer vision. This diversity spans the examination of abstract semantics, morphology, and segmentation abilities, as well as the handling of variable syntax. Complex tasks in natural language processing include direct machine translation, dialogue interface learning, digital information extraction, and rapid key summarization.

Three major, interconnected processes are involved in the coupling of computer vision with natural language processing: recognition, reconstruction, and reorganization.

Recognition: In this procedure, items in the image are given digital labels. For 2D objects, examples include handwriting or facial recognition; for 3D assignments, problems such as moving-object recognition are addressed, which aids autonomous robotic handling.

Reconstruction: This is the process of generating a 3D scene from specific visual inputs by combining digital shading, multiple angles of view, and sensory depth information. The end product is a 3D digital model, which is subsequently used for further processing.

Reorganization: This procedure divides raw pixels into data groups that represent the blueprint for a preset setup. Higher-level tasks encompass semantic segmentation, which may overlap with recognition; lower-level tasks involve detection of corners, edges, and contours.

2 Algorithms Used in Semantic Segmentation in 3D Point Clouds

Point cloud data comprises an extensive collection of positional coordinates, each representing a specific point in physical space. Because it retains spatial depth information, which is valuable for an agent, it serves as an excellent data format for embodied AI. Enabling an agent to train a semantic vision system on point clouds can enhance the agent's understanding of a scene.

2.1 Convolutional Neural Network (CNN)

A convolution is a matrix operation in which a new matrix is created by sliding a kernel over an existing matrix. Because it takes each pixel's neighbors into account, this operation is well suited to applying machine learning to images. It is well known for producing filters that recognize various image edges and turn them into a representation that a dense neural network (a multi-layer perceptron) can employ.
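To make the operation concrete, here is a minimal NumPy sketch of a 2D convolution with 'valid' padding. The 6×6 input and the Sobel kernel are arbitrary illustrative choices, not anything prescribed by this chapter:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a kernel over an image, summing elementwise products (valid padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

image = np.random.rand(6, 6)          # toy stand-in for a grayscale image
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])      # a classic hand-crafted edge-detecting kernel
print(conv2d(image, sobel_x).shape)   # (4, 4)
```

A CNN stacks many such kernels, learned from data rather than hand-picked like the Sobel filter above.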
2.2 Kernel Point Convolution (KPC)

It seemed sensible that convolutional neural networks, which have been used successfully for perceptual purposes since the 2010s, would also perform well on point clouds. However, implementing a convolution on a sparse, higher-dimensional dataset was not clear-cut. The kernel point convolution (KPC) [1], which convolves around a portion of the point cloud and uses about 7 points as the kernel, proved successful. Other methodologies include multi-layer perceptron (MLP) networks, which apply an MLP directly to the data points; projection convolution networks, which first transform point data into a voxel or image format; and graph convolutional neural networks, which consider the interconnections among the data points [2].
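As a concrete illustration of the simplest of these alternatives, the MLP-on-points approach, here is a hedged PyTorch sketch that applies a shared MLP independently to every point. The layer widths and the 13-class output are arbitrary placeholders (13 happens to be the class count of some indoor benchmarks), not values taken from [1] or [2]:

```python
import torch
import torch.nn as nn

class PerPointMLP(nn.Module):
    """A shared MLP applied independently to each point of an (N, 3) cloud,
    producing per-point class logits. This is a PointNet-style baseline,
    not the KPC operator itself."""
    def __init__(self, num_classes=13):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, points):      # points: (N, 3) xyz coordinates
        return self.mlp(points)     # (N, num_classes) per-point logits

cloud = torch.rand(1024, 3)         # a toy point cloud of 1024 points
logits = PerPointMLP()(cloud)
print(logits.shape)                 # torch.Size([1024, 13])
```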
3 Visual Navigation in Embodied AI

Computer vision is commonly employed in the field of embodied AI with architectures such as Residual Neural Networks (ResNet), Contrastive Language-Image Pretraining (CLIP), or Mask R-CNN.

3.1 Residual Neural Networks (ResNet)

It is now possible to construct convolutional neural networks (CNNs) with a large number of convolutional layers, even reaching thousands. These networks outperform shallower networks thanks to the Residual Network (ResNet) design [3], which effectively addresses the problem of the vanishing gradient.

Successive architectural designs in the CNN field added extra layers in order to reduce the error rate. While this approach works for lower layer counts, as the number of layers increases a common problem in deep learning, the vanishing/exploding gradient, becomes apparent: the gradient either approaches zero or grows very large. Consequently, both training and test error rates increase as the number of layers increases. On CIFAR-10, a 56-layer "plain" CNN exhibits greater error on both the training and test sets than a 20-layer plain CNN (Fig. 1). After further analysis, the authors of the ResNet architecture concluded that the vanishing/exploding gradient is the cause of this error behavior.

Fig. 1  On CIFAR-10, training error (left) and test error (right) with 20-layer and 56-layer "plain" networks. Both training and test error are higher in the deeper network [3]

To tackle the vanishing/exploding gradient specifically, the design proposes the concept of residual blocks, which rely on a technique known as skip connections. A skip connection connects the activations of one layer directly to the activations of a later layer, bypassing some intermediate layers; the result is a residual block, and residual blocks are stacked to form ResNets. The idea underpinning this network is to let the layers fit a residual mapping rather than requiring them to learn the underlying mapping directly. So, instead of fitting H(x) as the initial mapping, the network fits F(x) := H(x) − x, which gives H(x) := F(x) + x. Adding this kind of skip connection has the benefit of allowing regularization to bypass any layers that degrade performance. As a result, a very deep neural network can be trained without vanishing/exploding gradient issues.

Network design: The network adds shortcut connections to a 34-layer plain network design modelled on VGG-19. These shortcut links transform the architecture into a residual network.

Fig. 2  Residual learning: a building block [3]
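The building block of Fig. 2 is compact enough to write out directly. Below is a minimal PyTorch sketch, assuming the common 3×3-convolution/batch-norm layout inside F(x); the exact layers inside the residual branch are a design choice, not something this chapter fixes:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic residual block: the skip connection adds the input x back onto
    the learned residual F(x), so the block outputs H(x) = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))   # F(x), first half
        residual = self.bn2(self.conv2(residual))       # F(x), second half
        return self.relu(residual + x)                  # skip connection: F(x) + x

x = torch.rand(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```

If a block's weights collapse toward zero, F(x) ≈ 0 and the block simply passes x through, which is why adding blocks does not degrade the network the way adding plain layers does.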
Fig. 3  Illustrative network structures for the ImageNet dataset. Left: the VGG-19 model [4] (19.6 billion FLOPs) as a reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs). Right: a residual network with 34 parameter layers (3.6 billion FLOPs). The dotted shortcuts increase dimensions [3]
A 152-layer ResNet was employed on the ImageNet dataset; it has fewer parameters yet is eight times deeper than VGG-19. On the ImageNet test set, an ensemble of these ResNets produced a top-5 error of just 3.57%, the lowest in the field, winning the ILSVRC 2015 classification competition. Because of its extremely deep representation, it also produces a 28% relative improvement on the COCO object detection dataset. The results further demonstrated that shortcut connections address the problem caused by adding layers: in contrast to the plain network, the error rate on the ImageNet validation set drops as the depth is increased from 18 to 34 layers.

3.2 Contrastive Language-Image Pretraining (CLIP)

Traditional image classifiers are constrained by comparatively small, handcrafted, supervised datasets (ImageNet, COCO, etc.). CLIP [5] instead recommends training on publicly available image-text pairs (e.g., images with informative text captions). This approach can draw on roughly 400 million training instances from the internet, compared to the ~100k training samples of MS-COCO. Each image is paired with a label that describes it in plain language, and CLIP is trained on these pairs. During training, CLIP is shown a batch of (say, N) images and labels, which yields N² potential image-text pairs. The goal is to maximize the cosine similarity of the N true pairings, each consisting of an image and its label (a written description), and to minimize the cosine similarity of all other (N² − N) pairs. CLIP uses two essential building components to accomplish this:

1. A visual image encoder based on a Vision Transformer (ViT)
2. A text encoder based on a text transformer.

In the first stage, referred to as 'contrastive pre-training', the visual and text encoders are trained jointly to predict the correct matches. The trained text encoder then serves as a zero-shot classifier during the testing phase. The possible classes are the class names (or descriptions) of the target test dataset. A prompt of the form 'A photo of a {object}' is used with the class names, where {object} can be any of the classes in the target dataset (e.g., plane, automobile, dog, bird). The prompt is needed because the labels used for pre-training CLIP were picture descriptions, whereas standard image datasets like MS-COCO use single-word class names (e.g., cat, dog); the prompt bridges the two. Furthermore, a prompt offers more context than a single label. As an aside, the authors point out that the absence of label context in many existing datasets causes ambiguity (take datasets with the term 'boxer' as an example).
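A hedged sketch of the symmetric contrastive objective described above, with random tensors standing in for the ViT image encoder and the text transformer; the 512-dimensional embedding and the 0.07 temperature are illustrative values, not taken from this chapter:

```python
import torch
import torch.nn.functional as F

N, d = 4, 512                                  # a batch of N image-text pairs
image_emb = F.normalize(torch.randn(N, d), dim=-1)   # stand-in for encoded images
text_emb = F.normalize(torch.randn(N, d), dim=-1)    # stand-in for encoded captions

logits = image_emb @ text_emb.t() / 0.07       # (N, N) cosine similarities / temperature
labels = torch.arange(N)                       # the i-th image matches the i-th text

# Cross-entropy in both directions pushes up the N diagonal (true) pairs
# and pushes down the remaining N^2 - N off-diagonal pairs.
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
print(loss.item())
```

At test time the same similarity computation, between one image embedding and the embeddings of the prompts 'A photo of a {object}', yields the zero-shot classification scores.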
Fig. 4  Contrastive pre-training jointly optimises image and text pairs [5]

Fig. 5  Test-phase single-shot classification [5]
3.3 Mask R-CNN

Mask R-CNN [6] is a cutting-edge Convolutional Neural Network (CNN) that excels at image and instance segmentation. This form of deep neural network detects and identifies objects within an image, generating a highly accurate segmentation mask for each instance. Mask R-CNN was built on the foundation of Faster R-CNN, a region-based convolutional neural network.

Understanding image segmentation is a prerequisite for understanding how Mask R-CNN works. In computer vision, image segmentation is the procedure of dividing an image into several segments (sets of pixels, also referred to as image objects). Segmentation is used to locate objects and boundaries (lines, curves, etc.). In the context of Mask R-CNN, there are two primary types of image segmentation:

1. Semantic segmentation
2. Instance segmentation

Semantic segmentation: With semantic segmentation, every pixel is categorized into a predetermined set of classes without distinguishing between different instances of an object. Put differently, semantic segmentation is the pixel-by-pixel recognition and grouping of related objects into a single class; every person in an image, for example, falls into one "person" class. Semantic segmentation, also known as background segmentation, differentiates the objects of an image from its background.

Instance segmentation: Instance segmentation, sometimes referred to as instance recognition, encompasses two tasks: the precise detection of every object in an image and the accurate segmentation of each individual instance. It is therefore the merging of object classification, object localization, and object detection. In other words, this type of segmentation takes the additional step of distinguishing between objects classified as the same kind. In an image where all depicted objects are humans, this segmentation technique treats each person as a separate and distinct entity. Instance segmentation can thus be thought of as foreground segmentation: it focuses on extracting the subjects of an image rather than the background.

Fig. 6  Differences between semantic segmentation and instance segmentation [6]
Working of Mask R-CNN: Mask R-CNN was built using Faster R-CNN as its foundation. Whereas Faster R-CNN generates only two outputs for each candidate object, a class label and a bounding-box offset, Mask R-CNN introduces an additional branch that generates the object mask. Because the mask output differs in nature from the class and box outputs, it requires extracting a much more precise spatial layout of the object. Mask R-CNN is thus an enhanced version of Faster R-CNN: a supplementary branch for predicting an object mask on each Region of Interest is added alongside the preexisting branch for recognizing bounding boxes.

Advantages of Mask R-CNN:

• Simplicity: Training Mask R-CNN is straightforward.
• Performance: Mask R-CNN outperforms every existing single-model entry on every task.
• Efficiency: The technique adds very little overhead to Faster R-CNN and is highly efficient.
• Flexibility: Mask R-CNN is easily applied to a variety of tasks. For instance, the same system may be used for human pose estimation.

The essential component of Mask R-CNN is precise pixel alignment, which is absent in Fast/Faster R-CNN. Mask R-CNN employs the same two-stage procedure, commencing with the initial RPN stage. In the second stage, Mask R-CNN produces a binary mask for each region of interest (ROI) while also predicting the class and box offset; in most contemporary systems, by contrast, mask predictions serve as the foundation for classification. Moreover, because the Faster R-CNN framework accommodates various customizable architectural designs, creating and training Mask R-CNN is straightforward. Furthermore, the mask branch's small computational overhead enables a fast system and rapid experimentation.

Fig. 7  The Mask R-CNN structure for instance segmentation [6]
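In practice, a pretrained Mask R-CNN of exactly this design is available in torchvision. A minimal inference sketch, assuming a recent torchvision release with the weights= API and using a random tensor as a stand-in for a real RGB image:

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Mask R-CNN with a ResNet-50 FPN backbone, pretrained on COCO.
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)      # stand-in for an RGB image scaled to [0, 1]
with torch.no_grad():
    (pred,) = model([image])         # one prediction dict per input image

# Per detected instance: a class label, a box, a confidence score, and a soft mask.
print(pred["labels"].shape, pred["boxes"].shape, pred["masks"].shape)
binary_masks = pred["masks"] > 0.5   # threshold the soft masks to binary masks
```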
4 Reinforcement Learning

Reinforcement Learning (RL) is a crucial field of machine learning that focuses on training intelligent agents to optimize their expected future rewards within a given environment. The agent's interactions with the environment are recorded and used to generate the agent's training data. A notable difference from conventional machine learning is that the actions resulting from the model's predictions can in turn affect the agent's dataset. RL is a powerful learning technique in which the agent samples the environment to obtain training data, letting engineers optimize agents across many problem domains, including robotics, controls, navigation, AI alignment, and various other fields. A successful application of reinforcement learning produces an agent that explores all available options and exploits the most beneficial ones.

4.1 Proximal Policy Optimization (PPO)

With supervised learning, the cost function can be implemented quickly, gradient descent applied to it, and outstanding results obtained with only minor hyperparameter adjustments. The path to success in reinforcement learning is less clear-cut, since the algorithms have numerous moving parts that are hard to debug and require significant tuning to produce desirable outcomes. PPO [7] attempts to compute an update at each step that minimizes the cost function while guaranteeing that the departure from the previous policy is small, striking a balance between sample complexity, ease of tuning, and ease of implementation. PPO uses the following objective function:

L^CLIP(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ]

where
• θ is the policy parameter,
• Ê_t denotes the empirical expectation over timesteps,
• r_t is the ratio of the probabilities under the new and old policies, respectively,
• Â_t is the estimated advantage at time t, and
• ε is a hyperparameter, usually 0.1 or 0.2.
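The objective above translates almost line for line into code. A minimal PyTorch sketch (the helper name ppo_clip_loss and the random batch of log-probabilities and advantages are illustrative, not from [7]); the loss is negated because optimizers minimize:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """The clipped surrogate objective L^CLIP from the equation above, negated."""
    ratio = torch.exp(logp_new - logp_old)                  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()            # -E_t[min(...)]

logp_old = torch.randn(64)                  # log-probs under the old policy
logp_new = logp_old + 0.05 * torch.randn(64)  # log-probs under the updated policy
advantages = torch.randn(64)                # advantage estimates A_t
print(ppo_clip_loss(logp_new, logp_old, advantages).item())
```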
By eliminating the KL penalty and the requirement for adaptive updates, this objective streamlines the algorithm while implementing a trust-region-style update that is compatible with stochastic gradient descent. Even though it is simpler to implement, this approach has demonstrated the best performance in tests on continuous control tasks. Pseudocode for the PPO algorithm is given in the PPO documentation [8].
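The pseudocode listing itself did not survive extraction. As a stand-in, here is a hedged, self-contained toy that follows the usual PPO structure, collect a rollout with the current policy, then run several epochs of clipped-surrogate updates on the same data; the one-step four-armed task, network sizes, and learning rate are all illustrative choices, not the listing from [8] verbatim:

```python
import torch
import torch.nn as nn

# Toy one-step task: pull one of 4 arms; arm 3 pays 1.0, the others pay 0.
torch.manual_seed(0)
policy = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-3)

def collect_rollout(n=256):
    states = torch.ones(n, 1)                    # a single dummy state
    dist = torch.distributions.Categorical(logits=policy(states))
    actions = dist.sample()
    rewards = (actions == 3).float()             # arm 3 is the good arm
    return states, actions, rewards, dist.log_prob(actions).detach()

for iteration in range(50):
    states, actions, rewards, logp_old = collect_rollout()
    advantages = rewards - rewards.mean()        # trivial baseline
    for epoch in range(4):                       # several epochs on the same rollout
        dist = torch.distributions.Categorical(logits=policy(states))
        ratio = torch.exp(dist.log_prob(actions) - logp_old)
        clipped = torch.clamp(ratio, 0.8, 1.2)   # eps = 0.2
        loss = -torch.min(ratio * advantages, clipped * advantages).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

print("P(best arm) =", torch.softmax(policy(torch.ones(1, 1)), -1)[0, 3].item())
```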
5 Deep Reinforcement Learning

Deep Reinforcement Learning (DRL) is a reinforcement learning technique that approximates the value or policy function using deep neural networks. One type of deep reinforcement learning is Decentralized Distributed Proximal Policy Optimization (DD-PPO).

5.1 Decentralized Distributed Proximal Policy Optimization (DD-PPO)

Decentralized Distributed Proximal Policy Optimization (DD-PPO) [9] is used for decentralized reinforcement learning in resource-rich artificial environments. Because DD-PPO is synchronous (no computation is ever 'stale'), decentralized (no centralized server is needed), and distributed (it uses numerous machines), it is conceptually straightforward and simple to implement.

Recall that PPO is a policy gradient technique whose goal was an algorithm that used only first-order optimization while retaining the data efficiency and dependable performance of TRPO. Let r_t(θ) denote the probability ratio

r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t), so that r_t(θ_old) = 1.

TRPO maximizes a 'surrogate' objective:

L(θ) = Ê_t[ (π_θ(a_t | s_t) / π_θ_old(a_t | s_t)) Â_t ] = Ê_t[ r_t(θ) Â_t ]

DD-PPO is a general abstraction that involves the following steps: at step k, a worker n possesses a copy of the parameters θ_n^k, computes the gradient δθ_n^k, and updates θ using this gradient:

θ_n^(k+1) = ParamUpdate(θ_n^k, AllReduce(δθ_1^k, ..., δθ_N^k)) = ParamUpdate(θ_n^k, (1/N) Σ_{i=1}^{N} δθ_i^k)

where ParamUpdate is any first-order optimization technique, such as gradient descent, and AllReduce is an operation that reduces (e.g., averages) all instances of a variable and distributes the outcome to all workers.

6 A Look at the Prominent Vision Systems Used in the Past

6.1 AlexNet Architecture

AlexNet [10] is the convolutional neural network architecture that won the 2012 ILSVRC competition. AlexNet contains 8 layers with weights: 5 convolutional layers and 3 fully connected layers. The ReLU activation function is applied after each layer except the final one, which uses a softmax function to produce a probability distribution over the 1000 class labels. Dropout is applied in the first two fully connected layers, and max-pooling follows the first, second, and fifth convolutional layers. The kernels of the second, fourth, and fifth convolutional layers are connected only to the kernel maps from the previous layer that reside on the same GPU, whereas the kernels of the third convolutional layer are connected to every kernel map in the second layer. In the fully connected layers, each neuron is connected to every neuron in the preceding layer.
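A compact PyTorch sketch of the architecture just described: 5 convolutional layers and 3 fully connected layers, ReLU activations, max-pooling after conv layers 1, 2, and 5, and dropout in the first two fully connected layers. The original's two-GPU kernel split is omitted, and the channel widths follow the common single-GPU variant rather than the paper's exact figures:

```python
import torch
import torch.nn as nn

alexnet = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),                               # pool after conv1
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),                               # pool after conv2
    nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),                               # pool after conv5
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),  # FC1, with dropout
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),         # FC2, with dropout
    nn.Linear(4096, 1000),      # FC3; softmax over 1000 classes applied at loss time
)

x = torch.rand(1, 3, 224, 224)
print(alexnet(x).shape)         # torch.Size([1, 1000])
```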