Practical Full Stack Machine Learning (Alok Kumar) (Z-Library)


Practical Full-stack Machine Learning
A Guide to Build Reliable, Reusable, and Production-Ready Full Stack ML Solutions
Alok Kumar
www.bpbonline.com

FIRST EDITION 2022
Copyright © BPB Publications, India
ISBN: 978-93-91030-42-1

All Rights Reserved. No part of this publication may be reproduced, distributed or transmitted in any form or by any means or stored in a database or retrieval system, without the prior written permission of the publisher, with the exception of the program listings, which may be entered, stored and executed in a computer system, but cannot be reproduced by means of publication, photocopy, recording, or by any electronic or mechanical means.

LIMITS OF LIABILITY AND DISCLAIMER OF WARRANTY
The information contained in this book is true and correct to the best of the author's and publisher's knowledge. The author has made every effort to ensure the accuracy of this publication, but the publisher cannot be held responsible for any loss or damage arising from any information in this book. All trademarks referred to in the book are acknowledged as properties of their respective owners, but BPB Publications cannot guarantee the accuracy of this information.

www.bpbonline.com

About the Author

Alok is an author, speaker, open source contributor, and ML practitioner. He currently leads the India Innovation Center at Publicis Sapient, leveraging emerging technologies to solve real-world challenges. He has extensive experience in leading strategic initiatives and driving cutting-edge, fast-paced, data-driven solutions ranging from products to platforms. His work has won several reputed awards. The inspiration to write a book on full-stack ML came from observing the struggle of scaling and productionizing ML systems and teams. Beyond work, he is passionate about democratizing knowledge. He manages multiple not-for-profit learning and creative groups in NCR. He can be reached on LinkedIn and Twitter.

About the Reviewer

Abhijeet Prakash has 6 years of extensive experience in Artificial Intelligence, Machine Learning, and Full Stack Development, using tools like Selenium, BS4, Google Colab, Apigee, FastAPI, AWS, and Google Cloud Platform, with programming languages like Ruby, Node, Python, etc. Abhijeet pursued an MCA from the Department of Mathematical Science and Information Technology, B.U. He has worked with various banks, NBFCs, and Fin-Tech companies. He has worked as a Machine Learning Engineer and currently works as a Full Stack AI Engineer. Abhijeet also wrote a book on Ethical Hacking.

Acknowledgements

I would like to thank my family for all their help, patience, and support. Without their support and assistance, I couldn't have imagined completing this book. All the tools used in the book come from the open source community, and I would like to take this opportunity to thank the community that has helped democratize AI knowledge so much. Finally, I would really like to thank the team at BPB for all their help and support in my journey writing this book. It has been a pleasure to write this book, and the team at BPB is certainly a big part of that. Many thanks to my development coordinator, Priyanka. Her continuous and regular follow-ups helped tremendously to keep us on track and focussed.

Preface

A successful data science project is not just about building powerful models, but about the efficient execution of the entire project lifecycle. Unfortunately, data science is often treated as an ART and the data scientist as an ARTIST who relies on hard-to-guess and unexplainable tricks. If you find it difficult to decide on the correct initialization of hyperparameters, or to re-run someone else's model training code, then you share the pain and frustration of the data science community. Experience may teach you a few tricks, but there are limits to how many we can remember and recall accurately at the time of need. Also, in the grand scheme of things, ML training is one part of a bigger data machine. This means inputs and outputs will always be reliant on other parts of the system. We would like you to pause here and think about this question: how would you revert a productionized model that is failing to meet the requirements? Or, a step before that, how would you know it is failing or that its accuracy has dropped below the accepted limit? Again, how many such tricks across the whole pipeline would you be able to learn, remember and recall?

The objective of this book is to introduce you to a collection of powerful, open-source tools and concepts needed to build an effective data science pipeline, so that you don't have to remember the tricks but only the right tools, which, in my experience, is much easier to do.

To ensure that you share the excitement with me, consider this example. You want to buy stocks and have therefore decided to seek advice. Being an experienced investor, you want to hear from more than one advisor. Here is how it may look. The percentage is each advisor's accuracy rate.

Financial Advisor – 75%
Stock Market Trader – 70%
Market Research Team – 75%
Social Media Expert – 60%

As you can see, no specialist's accuracy exceeds 75%. However, if you combine all their predictions, you get a completely different picture. Assuming the advisors err independently of one another, the chance that at least one of them is right is:

Accuracy Rate = 1 - (25% * 30% * 25% * 40%) = 99.25%
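The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name prob_at_least_one_correct and the dictionary are illustrative and not taken from the book, and the calculation assumes the advisors' errors are independent, which real advisors (and real models) rarely satisfy exactly.

```python
from math import prod

# Accuracy rates quoted in the text for each advisor.
advisor_accuracy = {
    "Financial Advisor": 0.75,
    "Stock Market Trader": 0.70,
    "Market Research Team": 0.75,
    "Social Media Expert": 0.60,
}


def prob_at_least_one_correct(accuracies):
    """Probability that at least one advisor is right.

    Assumes errors are independent, so the chance that everyone is
    wrong is the product of the individual error rates.
    """
    return 1.0 - prod(1.0 - a for a in accuracies)


combined = prob_at_least_one_correct(advisor_accuracy.values())
print(f"{combined:.2%}")  # prints 99.25%
```

Note that this figure is the probability that at least one advisor is right, which is an optimistic upper bound; a practical ensemble still needs a rule, such as voting or averaging, to decide which prediction to trust.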

This is the power of "ensembling". In the machine learning world, ensemble learning is essentially a combination of multiple machine learning techniques performed together. To experiment with different ensembling techniques, you can write them from scratch, or you can use the ML-Ensemble library. ML-Ensemble combines a Scikit-learn high-level API with a low-level computational graph framework to build memory-efficient, maximally parallelized ensemble networks in as few lines of code as possible. We hope this drives home the idea and purpose of this book. Since the book is about building effective pipelines and systems, we have organized the book around the common steps of a data science project. The steps look like this:

Figure 0.1: CRISP-DM common steps

I am sure you have seen a picture or diagram like this before. The steps are so common that they don't turn any heads. The process seems so logical that it is hard to believe that considerable effort was spent to build it. Interestingly, it was created before data science became the sexiest job. … CRISP-DM, or Cross-Industry Standard Process for Data Mining, is a process methodology for data mining applications.

While there are other methodologies too, CRISP-DM is one of the popular choices. You will find variations of the process adopted by a lot of data mining tools without any attribution to it. The objective of CRISP-DM is to provide an industry-independent, repeatable process for data mining work. Lately CRISP-DM has slowly begun to fall aside, but the underlying basics are still strong and useful. The book chapters are loosely organized around the different steps of CRISP-DM. The intent is to provide a frame to group the different tools and libraries. Our mind has the amazing ability to easily recall things that are grouped or connected. Here is how the chapters are organized:

Chapter 1: Organizing your data science project
Data science projects are experimental in nature, and how you organize your project has a huge impact on the ease and speed of your experiments. A machine learning model is code plus data, and hence both need to be organized properly. Getting started is not just about organizing your project but also deciding on the environment, framework, baseline, target metrics and In this chapter, we will explore concepts, tools and ideas that will help you put your best foot forward.

Chapter 2: Preparing your data for Data science project
Data collection and preparation are the foundation for trusted machine learning and deep learning models. A considerable amount of effort is spent at this step. The focus of this chapter will be to learn best practices and tools for data analysis and pre-processing in machine learning projects.

Chapter 3: Building your architecture for your data science projects
Building your architecture is not about using the latest popular and viral algorithm to build your model, but about training a model that meets real-world expectations and challenges. The focus of this chapter will be to learn best practices for algorithm selection, hyperparameter initialization and tuning, and debugging techniques to enhance model performance.

Chapter 4: Bye-Bye Scheduler, welcome airflow
Apache Airflow is an open-source project to programmatically author, schedule and monitor workflows. The key benefit of machine learning pipelines is the automation they offer for the different steps. Every new training dataset must go through the steps outlined in the CRISP-DM process. Most teams either do this manually or duct-tape the steps together, making the process extremely brittle. You need a pipeline if your model has users. If you are still not convinced, then Chapter 5 will do that job. The objective of this chapter is to give a gentle introduction to Airflow.

Chapter 5: Managing ML pipeline with MLflow
The majority of us have experienced the frustration of running someone else's code and model. Library dependencies, hidden configurations and undocumented setup steps make it extremely difficult to treat someone else's model like a black box. MLflow is an open-source project that helps you train, reuse, and deploy models with any library and package them into reproducible steps that other data scientists can use as a "black box," without even having to know which library you are using. The objective of this chapter is to introduce you to MLflow and how you can use it in your situations.

Chapter 6: Feature stores for ML
A feature store can be imagined as a warehouse of features. It is a central vault for storing documented, access-controlled features. The feature store is an emerging concept with the objective of removing the challenges in taking ML models to production. The focus of this chapter will be to learn about feature stores via an open-source feature store called

Chapter 7: Serving ML as API
The focus of this chapter will be to learn how we can deploy an ML model as an API. We will use FastAPI, which is a modern, high-performance Python web framework well suited to building RESTful APIs. FastAPI can handle both synchronous and asynchronous requests and has built-in support for data validation, JSON serialization, authentication, and authorization. As a bonus, you will also learn about

The chapters are independent and can be read in any sequence. This is helpful if you are interested in only a few chapters, or if you need the knowledge from some chapters immediately. This may sound cliché, but it is true: the best way to learn and remember is to practice and repeat. Try preparing notes for each chapter and revisit them frequently. Repetition helps. Now decide on the chapter you want to start with and get going. Happy learning!!

Downloading the code bundle and coloured images:
Please follow the link to download the Code Bundle and the Coloured Images of the book: https://rebrand.ly/cd277f

Errata
We take immense pride in our work at BPB Publications and follow best practices to ensure the accuracy of our content and to provide an engaging reading experience to our subscribers. Our readers are our mirrors, and we use their inputs to reflect and improve upon human errors, if any, that may have occurred during the publishing processes involved. To help us maintain the quality and reach out to any readers who might be having difficulties due to any unforeseen errors, please write to us at: errata@bpbonline.com

Your support, suggestions and feedback are highly appreciated by the BPB Publications' Family.

Did you know that BPB offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.bpbonline.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at business@bpbonline.com for more details. At you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on BPB books and eBooks.

BPB is searching for authors like you
If you're interested in becoming an author for BPB, please visit www.bpbonline.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

The code bundle for the book is also hosted on GitHub at In case there's an update to the code, it will be updated on the existing GitHub repository. We also have other code bundles from our rich catalog of books and videos available at Check them out!

PIRACY
If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at business@bpbonline.com with a link to the material.

If you are interested in becoming an author
If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit

REVIEWS
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at BPB can understand what you think about our products, and our authors can see your feedback on their book. Thank you! For more information about BPB, please visit

Table of Contents

1. Organizing Your Data Science Project
    Structure
    Objective
    1.1 Project folder and code organization
    1.2 GPU 101
    1.3 On-premises vs cloud
    1.4 Deciding your framework
    1.5 Deciding your targets
    1.6 Preparing baseline
    1.7 Managing workflow
    Observing an experiment
    Output:
    Conclusion
    Questions
    Points to remember
    Further reading

2. Preparing Your Data
    Structure
    Objective
    2.1 Data exploration with facets
    2.2 Missing data conundrum: Imputation techniques
    2.3 Scaling data
    2.4 Outlier treatment
    2.5 Feature engineering
    2.6 Data collection
    2.7 Large scale data processing with Dask
