Uploader

高宏飞

Shared on 2026-02-05

Author: John Paul Mueller, Luca Massaron

Publisher: For Dummies / Wiley
Publish Year: 2023
Language: English
File Format: PDF
File Size: 7.6 MB
Text Preview (First 20 pages)
Python® for Data Science For Dummies®

To view this book's Cheat Sheet, simply go to www.dummies.com and search for “Python for Data Science For Dummies Cheat Sheet” in the Search box.

Table of Contents

Cover
Title Page
Copyright
Introduction
  About This Book
  Foolish Assumptions
  Icons Used in This Book
  Beyond the Book
  Where to Go from Here
Part 1: Getting Started with Data Science and Python
  Chapter 1: Discovering the Match between Data Science and Python
    Understanding Python as a Language
    Defining Data Science
    Creating the Data Science Pipeline
    Understanding Python’s Role in Data Science
    Learning to Use Python Fast
  Chapter 2: Introducing Python’s Capabilities and Wonders
    Working with Python
    Performing Rapid Prototyping and Experimentation
    Considering Speed of Execution
    Visualizing Power
    Using the Python Ecosystem for Data Science
  Chapter 3: Setting Up Python for Data Science
    Working with Anaconda
    Installing Anaconda on Windows
    Installing Anaconda on Linux
    Installing Anaconda on Mac OS X
    Downloading the Datasets and Example Code
  Chapter 4: Working with Google Colab
    Defining Google Colab
    Working with Notebooks
    Performing Common Tasks
    Using Hardware Acceleration
    Executing the Code
    Viewing Your Notebook
    Sharing Your Notebook
    Getting Help
Part 2: Getting Your Hands Dirty with Data
  Chapter 5: Working with Jupyter Notebook
    Using Jupyter Notebook
    Performing Multimedia and Graphic Integration
  Chapter 6: Working with Real Data
    Uploading, Streaming, and Sampling Data
    Accessing Data in Structured Flat-File Form
    Sending Data in Unstructured File Form
    Managing Data from Relational Databases
    Interacting with Data from NoSQL Databases
    Accessing Data from the Web
  Chapter 7: Processing Your Data
    Juggling between NumPy and pandas
    Validating Your Data
    Manipulating Categorical Variables
    Dealing with Dates in Your Data
    Dealing with Missing Data
    Slicing and Dicing: Filtering and Selecting Data
    Concatenating and Transforming
    Aggregating Data at Any Level
  Chapter 8: Reshaping Data
    Using the Bag of Words Model to Tokenize Data
    Working with Graph Data
  Chapter 9: Putting What You Know into Action
    Contextualizing Problems and Data
    Considering the Art of Feature Creation
    Performing Operations on Arrays
Part 3: Visualizing Information
  Chapter 10: Getting a Crash Course in Matplotlib
    Starting with a Graph
    Setting the Axis, Ticks, and Grids
    Defining the Line Appearance
    Using Labels, Annotations, and Legends
  Chapter 11: Visualizing the Data
    Choosing the Right Graph
    Creating Advanced Scatterplots
    Plotting Time Series
    Plotting Geographical Data
    Visualizing Graphs
Part 4: Wrangling Data
  Chapter 12: Stretching Python’s Capabilities
    Playing with Scikit-learn
    Using Transformative Functions
    Considering Timing and Performance
    Running in Parallel on Multiple Cores
  Chapter 13: Exploring Data Analysis
    The EDA Approach
    Defining Descriptive Statistics for Numeric Data
    Counting for Categorical Data
    Creating Applied Visualization for EDA
    Understanding Correlation
    Working with Cramér's V
    Modifying Data Distributions
  Chapter 14: Reducing Dimensionality
    Understanding SVD
    Performing Factor Analysis and PCA
    Understanding Some Applications
  Chapter 15: Clustering
    Clustering with K-means
    Performing Hierarchical Clustering
    Discovering New Groups with DBScan
  Chapter 16: Detecting Outliers in Data
    Considering Outlier Detection
    Examining a Simple Univariate Method
    Developing a Multivariate Approach
Part 5: Learning from Data
  Chapter 17: Exploring Four Simple and Effective Algorithms
    Guessing the Number: Linear Regression
    Moving to Logistic Regression
    Making Things as Simple as Naïve Bayes
    Learning Lazily with Nearest Neighbors
  Chapter 18: Performing Cross-Validation, Selection, and Optimization
    Pondering the Problem of Fitting a Model
    Cross-Validating
    Selecting Variables Like a Pro
    Pumping Up Your Hyperparameters
  Chapter 19: Increasing Complexity with Linear and Nonlinear Tricks
    Using Nonlinear Transformations
    Regularizing Linear Models
    Fighting with Big Data Chunk by Chunk
    Understanding Support Vector Machines
    Playing with Neural Networks
  Chapter 20: Understanding the Power of the Many
    Starting with a Plain Decision Tree
    Getting Lost in a Random Forest
    Boosting Predictions
Part 6: The Part of Tens
  Chapter 21: Ten Essential Data Resources
    Discovering the News with Reddit
    Getting a Good Start with KDnuggets
    Locating Free Learning Resources with Quora
    Gaining Insights with Oracle’s AI & Data Science Blog
    Accessing the Huge List of Resources on Data Science Central
    Discovering New Beginner Data Science Methodologies at Data Science 101
    Obtaining the Most Authoritative Sources at Udacity
    Receiving Help with Advanced Topics at Conductrics
    Obtaining the Facts of Open Source Data Science from Springboard
    Zeroing In on Developer Resources with Jonathan Bower
  Chapter 22: Ten Data Challenges You Should Take
    Removing Personally Identifiable Information
    Creating a Secure Data Environment
    Working with a Multiple-Data-Source Problem
    Honing Your Overfit Strategies
    Trudging Through the MovieLens Dataset
    Locating the Correct Data Source
    Working with Handwritten Information
    Working with Pictures
    Identifying Data Lineage
    Interacting with a Huge Graph
Index
About the Authors
Connect with Dummies
End User License Agreement

List of Tables

Chapter 10
  TABLE 10-1 Matplotlib Line Styles
  TABLE 10-2 Matplotlib Colors
  TABLE 10-3 Matplotlib Markers
Chapter 18
  TABLE 18-1 Regression Evaluation Measures
  TABLE 18-2 Classification Evaluation Measures
Chapter 19
  TABLE 19-1 The SVM Module of Learning Algorithms
  TABLE 19-2 The Loss, Penalty, and Dual Constraints

List of Illustrations

Chapter 1
  FIGURE 1-1: Loading data into variables so that you can manipulate it.
  FIGURE 1-2: Using the variable content to train a linear regression model.
  FIGURE 1-3: Outputting a result as a response to the model.
Chapter 3
  FIGURE 3-1: Tell the wizard how to install Anaconda on your system.
  FIGURE 3-2: Specify an installation location.
  FIGURE 3-3: Configure the advanced installation options.
  FIGURE 3-4: Create a folder to use to hold the book’s code.
  FIGURE 3-5: Notebook uses cells to store your code.
  FIGURE 3-6: The housing object contains the loaded dataset.
Chapter 4
  FIGURE 4-1: Create a new Python 3 Notebook using the same techniques as normal.
  FIGURE 4-2: Use this dialog box to open existing notebooks.
  FIGURE 4-3: Colab maintains a history of the revisions for your project.
  FIGURE 4-4: Using GitHub means storing your data in a repository.
  FIGURE 4-5: Use gists to store individual files or other resources.
  FIGURE 4-6: Colab code cells contain a few extras not found in Notebook.
  FIGURE 4-7: Use the Editor tab of the Settings dialog box to modify ...
  FIGURE 4-8: Colab code cells contain a few extras not found in Notebook.
  FIGURE 4-9: Use the GUI to make formatting your text easier.
  FIGURE 4-10: Hardware acceleration speeds code execution.
  FIGURE 4-11: Use the table of contents to navigate your notebook.
  FIGURE 4-12: The notebook information includes both size and settings.
  FIGURE 4-13: Send a message or obtain a link to share your notebook.
  FIGURE 4-14: Use code snippets to write your applications more quickly.
Chapter 5
  FIGURE 5-1: Notebook makes adding styles to your work easy.
  FIGURE 5-2: Adding headings makes separating content in your notebooks easy.
  FIGURE 5-3: The Help menu contains a selection of common help topics.
  FIGURE 5-4: Take your time going through the magic function help, which has a l...
  FIGURE 5-5: Embedding images can dress up your notebook presentation.
Chapter 6
  FIGURE 6-1: The test image is 100 pixels high and 100 pixels long.
  FIGURE 6-2: The raw format of a CSV file is still text and quite readable.
  FIGURE 6-3: Use an application such as Excel to create a formatted CSV presenta...
  FIGURE 6-4: An Excel file is highly formatted and might contain information of ...
  FIGURE 6-5: The image appears onscreen after you render and show it.
  FIGURE 6-6: Cropping the image makes it smaller.
  FIGURE 6-7: XML is a hierarchical format that can become quite complex.
Chapter 8
  FIGURE 8-1: Plotting the original graph.
  FIGURE 8-2: Plotting the graph addition.
Chapter 10
  FIGURE 10-1: Creating a basic plot that shows just one line.
  FIGURE 10-2: Defining a plot that contains multiple lines.
  FIGURE 10-3: Specifying how the axes should appear to the viewer.
  FIGURE 10-4: Adding grids makes the values easier to read.
  FIGURE 10-5: Line styles help differentiate between plots.
  FIGURE 10-6: Markers help to emphasize individual values.
  FIGURE 10-7: Use labels to identify the axes.
  FIGURE 10-8: Annotation can identify points of interest.
  FIGURE 10-9: Use legends to identify individual lines.
Chapter 11
  FIGURE 11-1: Bar charts make it easier to perform comparisons.
  FIGURE 11-2: Histograms let you see distributions of numbers.
  FIGURE 11-3: Use boxplots to present groups of numbers.
  FIGURE 11-4: Use scatterplots to show groups of data points and their associate...
  FIGURE 11-5: Color arrays can make the scatterplot groups stand out better.
  FIGURE 11-6: Scatterplot trendlines can show you the general data direction.
  FIGURE 11-7: Use line graphs to show the flow of data over time.
  FIGURE 11-8: Add a trendline to show the average direction of change in a chart...
  FIGURE 11-9: Maps can illustrate data in ways other graphics can't.
  FIGURE 11-10: Undirected graphs connect nodes to form patterns.
  FIGURE 11-11: Use directed graphs to show direction between nodes.
Chapter 12
  FIGURE 12-1: The output from the memory test shows memory usage for each line o...
Chapter 13
  FIGURE 13-1: A contingency table based on groups and binning.
  FIGURE 13-2: A boxplot comparing all the standardized variables.
  FIGURE 13-3: A boxplot of body mass arranged by penguin groups.
  FIGURE 13-4: Parallel coordinates anticipate whether groups are easily separabl...
  FIGURE 13-5: Flipper length distribution and density.
  FIGURE 13-6: Histograms can detail better distributions.
  FIGURE 13-7: A scatterplot reveals how two variables relate to each other.
  FIGURE 13-8: A matrix of scatterplots displays more information at one time.
  FIGURE 13-9: A covariance matrix of the Palmer Penguins dataset.
  FIGURE 13-10: A correlation matrix of the Palmer Penguins dataset.
  FIGURE 13-11: The distribution of bill depth transformed into a uniform distrib...
  FIGURE 13-12: The distribution of bill depth transformed into a normal distribu...
Chapter 14
  FIGURE 14-1: The resulting projection of the handwritten data by the t-SNE algo...
  FIGURE 14-2: The example application would like to find similar photos.
  FIGURE 14-3: The output shows the results that resemble the test image.
Chapter 15
  FIGURE 15-1: Cross-tabulation of ground truth and K-means clusters.
  FIGURE 15-2: Rate of change of inertia for solutions up to k=20.
  FIGURE 15-3: Cross-tabulation of ground truth and Ward method’s agglomerative c...
  FIGURE 15-4: A clustering hierarchical tree obtained from agglomerative cluster...
  FIGURE 15-5: Cross-tabulation of ground truth and DBScan.
Chapter 16
  FIGURE 16-1: Descriptive statistics for the Diabetes DataFrame from Scikit-lear...
  FIGURE 16-2: Boxplots.
  FIGURE 16-3: The first two and last two components from the PCA.
Chapter 18
  FIGURE 18-1: Spatial distribution of house prices in California.
  FIGURE 18-2: Boxplot of house prices, grouped by clusters.
  FIGURE 18-3: Validation curves.
Chapter 19
  FIGURE 19-1: A slow descent optimizing squared error.
  FIGURE 19-2: Dividing two groups.
  FIGURE 19-3: A viable SVM solution for the problem of the two groups and more.
  FIGURE 19-4: The first ten handwritten digits from the digits dataset.
  FIGURE 19-5: The training and test scores of the neural network as it learns fr...
Chapter 20
  FIGURE 20-1: A tree model of survival rates from the Titanic disaster.
  FIGURE 20-2: A tree model of the Mushroom dataset using a depth of five splits.
  FIGURE 20-3: Verifying the impact of the number of estimators on Random Forest.
Python® for Data Science For Dummies®, 3rd Edition

Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, www.wiley.com

Copyright © 2024 by John Wiley & Sons, Inc., Hoboken, New Jersey

Media and software compilation copyright © 2023 by John Wiley & Sons, Inc. All rights reserved.

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and may not be used without written permission. Python is a registered trademark of Python Software Foundation Corporation. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: WHILE THE PUBLISHER AND AUTHORS HAVE USED THEIR BEST EFFORTS IN PREPARING THIS WORK, THEY MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES REPRESENTATIVES, WRITTEN SALES MATERIALS OR PROMOTIONAL STATEMENTS FOR THIS WORK. THE FACT THAT AN ORGANIZATION, WEBSITE, OR PRODUCT IS REFERRED TO IN THIS WORK AS A CITATION AND/OR POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE PUBLISHER AND AUTHORS ENDORSE THE INFORMATION OR SERVICES THE ORGANIZATION, WEBSITE, OR PRODUCT MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING PROFESSIONAL SERVICES. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR YOUR SITUATION. YOU SHOULD CONSULT WITH A SPECIALIST WHERE APPROPRIATE. FURTHER, READERS SHOULD BE AWARE THAT WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ. NEITHER THE PUBLISHER NOR AUTHORS SHALL BE LIABLE FOR ANY LOSS OF PROFIT OR ANY OTHER COMMERCIAL DAMAGES, INCLUDING BUT NOT LIMITED TO SPECIAL, INCIDENTAL, CONSEQUENTIAL, OR OTHER DAMAGES.

For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit https://hub.wiley.com/community/support/dummies.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Control Number: 2023946155

ISBN 978-1-394-21314-6 (pbk); ISBN 978-1-394-21308-5 (ebk); ISBN ePDF 978-1-394-21309-2 (ebk)
Introduction

The growth of the internet has been phenomenal. According to Internet World Stats (https://www.internetworldstats.com/emarketing.htm), 69 percent of the world is now connected in some way to the internet, including developing countries. North America has the highest penetration rate, at 93.4 percent, which means you now have access to nearly everyone just by knowing how to manipulate data. Data science turns this huge amount of data into capabilities that you use every day to perform an amazing array of tasks or to obtain services from someone else.

You’ve probably used data science in ways that you never expected. For example, when you used your favorite search engine this morning to look for something, it suggested alternative search terms. Those terms are supplied by data science. When you went to the doctor last week and discovered that the lump you found wasn’t cancer, the doctor likely made the diagnosis with the help of data science. In fact, you may work with data science every day and not even know it.

Even though many of the purposes of data science elude attention, you have probably become more aware of the data you generate, and with that awareness comes a desire for control over aspects of your life, such as when and where to shop, or whether to have someone perform the task for you. In addition to all its other uses, data science enables you to add the level of control that you, like many people, are looking for today.

Python for Data Science For Dummies, 3rd Edition not only gets you started using data science to perform a wealth of practical tasks but also helps you realize just how many places data science is used. By knowing how to answer data science problems and where to employ data science, you gain a significant advantage over everyone else, increasing your chances at promotion or that new job you really want.
About This Book

The main purpose of Python for Data Science For Dummies, 3rd Edition, is to take the scare factor out of data science by showing you that data science is not only really interesting but also quite doable using Python. You may assume that you need to be a computer science genius to perform the complex tasks normally associated with data science, but that’s far from the truth. Python comes with a host of useful libraries that do all the heavy lifting for you in the background. You don’t even realize how much is going on, and you don’t need to care. All you really need to know is that you want to perform specific tasks, and Python makes these tasks quite accessible.

Part of the emphasis of this book is on using the right tools. You start with either Jupyter Notebook (on desktop systems) or Google Colab (on the web), two tools that take the sting out of working with Python. The code you place in Jupyter Notebook or Google Colab is presentation quality, and you can mix a number of presentation elements right there in your document. It’s not really like using a traditional development environment at all.

You also discover some interesting techniques in this book. For example, you can create plots of all your data science experiments using Matplotlib, and this book gives you all the details for doing that. This book also spends considerable time showing you available resources (such as packages) and how you can use Scikit-learn to perform some very interesting calculations. Many people would like to know how to perform handwriting recognition, and if you’re one of them, you can use this book to get a leg up on the process.

Of course, you may still be worried about the whole programming environment issue, and this book doesn’t leave you in the dark there, either. At the beginning, you find the complete methods you need to get started with data science using Jupyter Notebook or Google Colab. The emphasis is on getting you up and running as quickly as possible, and on keeping the examples straightforward and simple so that the code doesn’t become a stumbling block to learning.
This third edition of the book provides you with updated examples using Python 3.x so that you’re using the most modern version of Python while reading. In addition, you find a stronger emphasis on making examples simpler, but also making the environment more inclusive by adding material on deep learning. More important, this edition of the book contains updated datasets that better demonstrate how data science works today. This edition of the book also touches on modern concerns, such as removing personally identifiable information and enhancing data security. You get a lot more out of this edition of the book thanks to the input provided by thousands of readers before you.

To make absorbing the concepts even easier, this book uses the following conventions:

Text that you’re meant to type just as it appears in the book is in bold. The exception is when you’re working through a list of steps: Because each step is bold, the text to type is not bold.

When you see words in italics as part of a typing sequence, you need to replace that value with something that works for you. For example, if you see “Type Your Name and press Enter,” you need to replace Your Name with your actual name.

Web addresses and programming code appear in monofont. If you're reading a digital version of this book on a device connected to the internet, note that you can click the web address to visit that website, like this: http://www.dummies.com.

When you need to type command sequences, you see them separated by a special arrow, like this: File ⇒ New File. In this example, you go to the File menu first and then select the New File entry on that menu.

Foolish Assumptions

You may find it difficult to believe that we've assumed anything about you; after all, we haven’t even met you yet! Although most