
[primer guide] Practical Statistics for Data Scientists, 3rd edition (Various Authors)(Z-Library)

Author: Various Authors

Data

Statistical methods are a key part of data science, yet few data scientists have formal statistical training. Courses and books on basic statistics rarely cover the topic from a data science perspective. And many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you're familiar with the R or Python programming languages and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.


📄 Text Preview (First 20 pages)


📄 Page 2
Practical Statistics for Data Scientists THIRD EDITION 50+ Essential Concepts Using R, Python, & GenAI With Early Release ebooks, you get books in their earliest form—the authors’ raw and unedited content as they write—so you can take advantage of these technologies long before the official release of these titles. Peter Bruce, Andrew Bruce, and Peter Gedeck
📄 Page 3
Practical Statistics for Data Scientists by Peter Bruce, Andrew Bruce, and Peter Gedeck Copyright © 2026 Peter Bruce, Andrew Bruce, and Peter Gedeck. All rights reserved. Published by O’Reilly Media, Inc., 141 Stony Circle, Suite 195, Santa Rosa, CA 95401. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (https://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com. Acquisitions Editor: Michelle Smith Development Editor: Corbin Collins Production Editor: Ashley Stussy Interior Designer: David Futato Interior Illustrator: Kate Dullea May 2017: First Edition May 2020: Second Edition May 2026: Third Edition Revision History for the Early Release 2026-01-12: First Release See https://oreilly.com/catalog/errata.csp?isbn=9798341666283 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Practical Statistics for Data Scientists, the cover image, and related trade
📄 Page 4
dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 979-8-341-66624-5 [FILL IN]
📄 Page 5
Brief Table of Contents (Not Yet Final)

Chapter 1: Exploratory Data Analysis (available)
Chapter 2: Data and Sampling Distributions (unavailable)
Chapter 3: Statistical Experiments and Significance Testing (unavailable)
Chapter 4: Regression and Prediction (unavailable)
Chapter 5: Classification (unavailable)
Chapter 6: Statistical Machine Learning (unavailable)
Chapter 7: Unsupervised Learning (unavailable)
Chapter 8: Neural Networks (unavailable)
Chapter 9: Deep Learning and Reinforcement Learning (unavailable)
Chapter 10: LLMs and Generative AI (available)
Chapter 11: Caveats and Concerns (unavailable)
📄 Page 6
Chapter 1. Exploratory Data Analysis

A NOTE FOR EARLY RELEASE READERS

With Early Release ebooks, you get books in their earliest form (the author’s raw and unedited content as they write), so you can take advantage of these technologies long before the official release of these titles. This will be the 1st chapter of the final book. Please note that the GitHub repo will be made active later on. If you’d like to be actively involved in reviewing and commenting on this draft, please reach out to the editor at ccollins@oreilly.com.

This chapter focuses on the first step in any data science project: exploring the data.

Classical statistics focused almost exclusively on inference, a sometimes complex set of procedures for drawing conclusions about large populations based on small samples. In 1962, John W. Tukey (Figure 1-1) called for a reformation of statistics in his seminal paper “The Future of Data Analysis”. He proposed a new scientific discipline called data analysis that included statistical inference as just one component. Tukey forged links to the engineering and computer science communities (he coined the terms bit, short for binary digit, and software), and his original tenets are surprisingly durable and form part of the foundation for data science.

The field of exploratory data analysis was established with Tukey’s now-classic 1977 book Exploratory Data Analysis. Tukey presented simple plots (e.g., boxplots, scatterplots) that, along with summary statistics (mean, median, quantiles, etc.), help paint a picture of a data set.

With the ready availability of computing power and expressive data analysis software, exploratory data analysis has evolved well beyond its original scope. Key drivers of this discipline have been the rapid development of new technology, access to more and bigger data, and the greater use of quantitative
📄 Page 7
analysis in a variety of disciplines. David Donoho, professor of statistics at Stanford University and former undergraduate student of Tukey’s, authored an excellent article based on his presentation at the Tukey Centennial workshop in Princeton, New Jersey. Donoho traces the genesis of data science back to Tukey’s pioneering work in data analysis.

Figure 1-1. John Tukey, the eminent statistician whose ideas developed over 50 years ago form the foundation of data science

Elements of Structured Data

Data comes from many sources: sensor measurements, events, text, images, and videos. The Internet of Things (IoT) is spewing out streams of information. Much of this data is unstructured: images are a collection of pixels, with each pixel containing RGB (red, green, blue) color information. Texts are sequences of words and nonword characters, often organized by sections, subsections, and so on. Clickstreams are sequences of actions by a user interacting with an app or a web page. In fact, a major challenge of data science is to harness this torrent of raw data into actionable information. To apply the statistical concepts covered in this book, unstructured raw data must be processed and manipulated into a structured form.

One of the commonest forms of structured data is a table with rows and columns, as data might emerge from a relational database or be collected for a study.

There are two basic types of structured data: numeric and categorical. Numeric data comes in two forms: continuous, such as wind speed or time
📄 Page 8
duration, and discrete, such as the count of the occurrence of an event. Categorical data takes only a fixed set of values, such as a type of TV screen (plasma, LCD, LED, etc.) or a state name (Alabama, Alaska, etc.). Binary data is an important special case of categorical data that takes on only one of two values, such as 0/1, yes/no, or true/false. Another useful type of categorical data is ordinal data in which the categories are ordered; an example of this is a numerical rating (1, 2, 3, 4, or 5). Why do we bother with a taxonomy of data types? It turns out that for the purposes of data analysis and predictive modeling, the data type is important to help determine the type of visual display, data analysis, or statistical model. In fact, data science software, such as R and Python, uses these data types to improve computational performance. More important, the data type for a variable determines how software will handle computations for that variable.
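As a minimal sketch of how this plays out in Python, the following example declares each kind of variable in a pandas DataFrame. The column names and values are made up for illustration; only the data types matter here.

```python
import pandas as pd

# Illustrative values only; the column names are invented for this sketch.
df = pd.DataFrame({
    "wind_speed": [3.2, 7.9, 5.4, 6.1],          # numeric, continuous
    "event_count": [0, 2, 1, 4],                 # numeric, discrete
    "screen": ["LCD", "LED", "plasma", "LED"],   # categorical
    "returned": [False, True, False, False],     # binary
    "rating": [3, 5, 1, 4],                      # ordinal, stored as numbers
})

# Declare the categorical column explicitly instead of leaving it as text.
df["screen"] = df["screen"].astype("category")

# Declare the rating as ordered categories 1 < 2 < 3 < 4 < 5.
rating_type = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
df["rating"] = df["rating"].astype(rating_type)

print(df.dtypes)

# An ordered categorical supports comparisons and min/max, but not a mean.
high_ratings = (df["rating"] >= 4).sum()
top_rating = df["rating"].max()
print(high_ratings, top_rating)
```

Declaring the rating as an ordered categorical is a design choice: it keeps the ordering available for comparisons and sorting while preventing arithmetic, such as averaging, that may not be meaningful for ordinal data.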
📄 Page 9
KEY TERMS FOR DATA TYPES

Numeric
    Data that are expressed on a numeric scale.
Continuous
    Data that can take on any value in an interval. (Synonyms: interval, float, numeric)
Discrete
    Data that can take on only integer values, such as counts. (Synonyms: integer, count)
Categorical
    Data that can take on only a specific set of values representing a set of possible categories. (Synonyms: enums, enumerated, factors, nominal)
Binary
    A special case of categorical data with just two categories of values, e.g., 0/1, true/false. (Synonyms: dichotomous, logical, indicator, boolean)
Ordinal
    Categorical data that has an explicit ordering. (Synonym: ordered factor)

Software engineers and database programmers may wonder why we even need the notion of categorical and ordinal data for analytics. After all, categories are merely a collection of text (or numeric) values, and the underlying database automatically handles the internal representation. However, explicit identification of data as categorical, as distinct from text, does offer some advantages:
📄 Page 10
- Knowing that data is categorical can act as a signal telling software how statistical procedures, such as producing a chart or fitting a model, should behave. In particular, ordinal data can be represented as an ordered.factor in R, preserving a user-specified ordering in charts, tables, and models. In Python, scikit-learn supports ordinal data with the sklearn.preprocessing.OrdinalEncoder.
- Storage and indexing can be optimized (as in a relational database).
- The possible values a given categorical variable can take are enforced in the software (like an enum).

The third “benefit” can lead to unintended or unexpected behavior. Before R 4.0.0, the default behavior of data import functions in R (e.g., read.csv) was to automatically convert a text column into a factor; since R 4.0.0, the conversion happens only on request (stringsAsFactors = TRUE). Once a column is a factor, subsequent operations on it assume that the only allowable values are the ones originally imported, and assigning a new text value produces a warning and an NA (missing value). The pandas package in Python will not make such a conversion automatically. However, you can specify a column as categorical explicitly in the read_csv function.

KEY IDEAS

- Data is typically classified in software by type.
- Data types include numeric (continuous, discrete) and categorical (binary, ordinal).
- Data typing in software acts as a signal to the software on how to process the data.

Further Reading

The pandas documentation describes the different data types and how they can be manipulated in Python.
📄 Page 11
Data types can be confusing, since types may overlap, and the taxonomy in one software package may differ from that in another. The R Tutorial website covers the taxonomy for R.

Databases are more detailed in their classification of data types, incorporating considerations of precision levels, fixed- or variable-length fields, and more; see the W3Schools guide to SQL.

Data Structures 1

The typical frame of reference for an analysis in data science is a rectangular data object, like a spreadsheet or database table.

Rectangular Data

Rectangular data is the general term for a two-dimensional matrix with rows indicating records (cases) and columns indicating features (variables); data frame is the specific format in R and Python. The data doesn’t always start in this form: unstructured data (e.g., text) must be processed and manipulated so that it can be represented as a set of features in the rectangular data (see “Elements of Structured Data”). Data in relational databases must be extracted and put into a single table for most data analysis and modeling tasks.
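To make the last point concrete, here is a minimal sketch that joins two related tables into one rectangular set of records, using Python's built-in sqlite3 module. The schema (sellers, auctions) and the rows are hypothetical and exist only for this illustration.

```python
import sqlite3

# In-memory database with a hypothetical two-table schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sellers (seller_id INTEGER PRIMARY KEY, rating INTEGER);
    CREATE TABLE auctions (
        auction_id INTEGER PRIMARY KEY,
        seller_id INTEGER REFERENCES sellers(seller_id),
        category TEXT,
        duration INTEGER
    );
    INSERT INTO sellers VALUES (1, 3249), (2, 3115);
    INSERT INTO auctions VALUES
        (10, 1, 'Music/Movie/Game', 5),
        (11, 2, 'Automotive', 7),
        (12, 2, 'Automotive', 7);
""")

# Join the tables so that each auction becomes one row with all its features.
cursor = conn.execute("""
    SELECT a.auction_id, a.category, a.duration, s.rating AS seller_rating
    FROM auctions AS a
    JOIN sellers AS s ON s.seller_id = a.seller_id
    ORDER BY a.auction_id
""")
columns = [col[0] for col in cursor.description]
records = [dict(zip(columns, row)) for row in cursor.fetchall()]
conn.close()

for record in records:
    print(record)
```

Each dictionary in records is one row of the resulting rectangular data; a list like this can be passed directly to a data frame constructor.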
📄 Page 12
KEY TERMS FOR DATA STRUCTURES

Data frame
    Rectangular data (like a spreadsheet) is the basic data structure for statistical and machine learning models.
Feature
    A column within a table is commonly referred to as a feature.
    Synonyms: attribute, input, predictor, variable
Outcome
    Many data science projects involve predicting an outcome, often a yes/no outcome (in Table 1-1, it is “auction was competitive or not”). The features are sometimes used to predict the outcome in an experiment or a study.
    Synonyms: dependent variable, response, target, output
Records
    A row within a table is commonly referred to as a record.
    Synonyms: case, example, instance, observation, pattern, sample
Data dictionary
    A document that describes the data, their source, and the meaning of variables (features).
📄 Page 13
    Synonyms: data codebook, metadata
Data catalog
    A structured inventory of data assets within an organization, including their location, format, and usage.
    Synonyms: data inventory, data registry

Table 1-1. A typical data frame format

Category          currency  sellerRating  Duration  en
Music/Movie/Game  US        3249          5         M
Music/Movie/Game  US        3249          5         M
Automotive        US        3115          7         Tu
Automotive        US        3115          7         Tu
Automotive        US        3115          7         Tu
Automotive        US        3115          7         Tu
Automotive        US        3115          7         Tu
Automotive        US        3115          7         Tu
📄 Page 14
In Table 1-1, there is a mix of measured or counted data (e.g., duration and price) and categorical data (e.g., category and currency). As mentioned earlier, a special form of categorical variable is a binary (yes/no or 0/1) variable, seen in the rightmost column in Table 1-1: an indicator variable showing whether an auction was competitive (had multiple bidders) or not. This indicator variable also happens to be an outcome variable, when the scenario is to predict whether an auction is competitive or not.

Data Frames and Indexes

Traditional database tables have one or more columns designated as an index, essentially a row number. This can vastly improve the efficiency of certain database queries. In Python, with the pandas library, the basic rectangular data structure is a DataFrame object. By default, an automatic integer index is created for a DataFrame based on the order of the rows. In pandas, it is also possible to set multilevel/hierarchical indexes to improve the efficiency of certain operations.

In R, the basic rectangular data structure is a data.frame object. A data.frame also has an implicit integer index based on the row order. The native R data.frame does not support user-specified or multilevel indexes, though a custom key can be created through the row.names attribute. To overcome this deficiency, several packages have gained widespread use: data.table and the collection of tidyverse packages. The data.table package supports multilevel indexes and offers significant speedups in working with a data.frame. The tidyverse package collection provides a wide variety of tools for data manipulation. Due to its popularity, we will use tidyverse’s tibble in this book.
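The pandas behavior described above can be sketched as follows. The first two rows reuse values from Table 1-1; the third row is invented so that the multilevel index has more than one currency to distinguish.

```python
import pandas as pd

# Two rows modeled on Table 1-1 plus one made-up row (GBP) for illustration.
df = pd.DataFrame({
    "Category": ["Music/Movie/Game", "Automotive", "Automotive"],
    "currency": ["US", "US", "GBP"],
    "sellerRating": [3249, 3115, 2992],
    "Duration": [5, 7, 10],
})

# By default, pandas creates an integer index based on the row order.
print(type(df.index).__name__)

# A multilevel (hierarchical) index built from two columns.
indexed = df.set_index(["Category", "currency"]).sort_index()
print(indexed.index.nlevels)

# Look up one row by both index levels, or all rows for one category.
rating = indexed.loc[("Automotive", "US"), "sellerRating"]
automotive = indexed.loc["Automotive"]
print(rating)
print(automotive)
```

Sorting the index after set_index is optional, but it lets pandas use faster lookups on the multilevel index.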
📄 Page 15
TERMINOLOGY DIFFERENCES

Terminology for rectangular data can be confusing. Statisticians and data scientists use different terms for the same thing. For a statistician, predictor variables are used in a model to predict a response or dependent variable. For a data scientist, features are used to predict a target. One synonym is particularly confusing: computer scientists will use the term sample for a single row; a sample to a statistician means a collection of rows.

Nonrectangular Data Structures

There are other data structures besides rectangular data.

Time series data records successive measurements of the same variable. It is the raw material for statistical forecasting methods, and it is also a key component of the data produced by devices (the Internet of Things).

Spatial data structures, which are used in mapping and location analytics, are more complex and varied than rectangular data structures. In the object representation, the focus of the data is an object (e.g., a house) and its spatial coordinates. The field view, by contrast, focuses on small units of space and the value of a relevant metric (pixel brightness, for example).

Graph (or network) data structures are used to represent physical, social, and abstract relationships. For example, a graph of a social network, such as Facebook or LinkedIn, may represent connections between people on the network. Distribution hubs connected by roads are an example of a physical network. Graph structures are useful for certain types of problems, such as network optimization and recommender systems.

Each of these data types has its specialized methodology in data science. The focus of this book is on rectangular data, the fundamental building block of predictive modeling.
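Although graph data is outside the focus of this book, a short sketch may help show how it differs from a table. The example below stores a small network of distribution hubs as an adjacency structure and uses a breadth-first search to count the fewest road segments between two hubs. The hub names, the roads, and the helper functions build_graph and fewest_hops are all hypothetical.

```python
from collections import deque

# Hypothetical distribution hubs connected by roads (undirected edges).
roads = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"), ("E", "D")]

def build_graph(edges):
    """Build an adjacency mapping: each node maps to the set of its neighbors."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

def fewest_hops(graph, start, goal):
    """Breadth-first search: number of edges on a shortest path, or None."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # goal is not reachable from start

graph = build_graph(roads)
print(graph["A"])
print(fewest_hops(graph, "A", "D"))
```

Note that the records here are relationships between entities rather than rows of features, which is why such data does not fit naturally into a single rectangular table.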
📄 Page 16
GRAPHS IN STATISTICS

In computer science and information technology, the term graph typically refers to a depiction of the connections among entities, and to the underlying data structure. In statistics, graph is used to refer to a variety of plots and visualizations, not just of connections among entities, and the term applies only to the visualization, not to the data structure.

Data Dictionaries and Catalogs

From the feature names in Table 1-1 you can make a reasonable guess about what each feature represents, but feature names are typically short, often cryptic, and sometimes carry no information at all. It is good practice to capture additional information in so-called data dictionaries. A data dictionary is a document that describes the data, its source, and the meaning of the variables (features). Data dictionaries are particularly important for the correct analysis and interpretation of data science results. For example (referring to Table 1-1):

- The currency column contains the currency a particular auction used. It is categorical and each row has one of three values: US, which stands for US dollar, GBP for English pound, and EUR for the European euro.
- The Duration column defines the length of the auction. However, we don’t know this for sure without consulting a data dictionary.

In general, a data dictionary should contain information about the dataset as a whole, as well as information about each variable (feature). It can take many forms; it could be a simple table or a more complex document. A sample data dictionary for the dataset in Table 1-1 is shown below. In this example, we formatted the information using YAML, a machine-readable format. This semi-structured format makes it particularly useful in AI-based data analysis, while remaining readable by humans.

    name: ebay Auctions
    description: >
      The dataset eBayAuctions.csv contains information on 1972 auctions that
📄 Page 17
      transacted on eBay.com during May-June 2004.
    source: Copyright 2016 Galit Shmueli and Peter Bruce
    size:
      observations: 1972
      features: 8
    features:
      Category: Category of the auctioned item
      currency: "US: US dollar, GBP: English pound, EUR: Euro"
      sellerRating: >
        a rating by eBay, as a function of the number of "good" and "bad"
        transactions the seller had on eBay
      Duration: Number of days the auction lasted (set by seller at auction start)
      endDay: Day of week that the auction closed
      ClosePrice: Price item sold at (converted into USD)
      OpenPrice: Initial price set by the seller (converted into USD)
      Competitive?: whether the auction had a single bid (0) or more (1)

Name, description, source, and size are metadata about the dataset as a whole. The features section gives a short description for each feature. A more complete version would also include information about the data type, the units of the features, and permissible values.

Data catalogs are a more comprehensive description of the data within an organization. A data catalog can be a collection of data dictionaries, but also contains additional information about the data, such as where it is stored, who has access to it, and how it can be used. A data catalog is a useful tool for data discovery and data governance.

An LLM can help you to create a data dictionary from scratch or improve a given data dictionary. We use the Python langchain package as a high-level interface to LLMs. The communication with the LLM is done through a prompt. Here is an example prompt that tasks an LLM to improve a data dictionary with additional information. In addition to the data dictionary, we provide a sample of the data in CSV format to give the LLM additional context.
📄 Page 18
    PROMPT = """
    Improve the attached data dictionary and return it in YAML format.
    In addition to the description of features, provide name, description,
    source, and size of the dataset.
    For each feature, provide the following information:
    - feature name
    - readable variable name (if required)
    - definition of the variable
    - data type
    - measurement units
    - allowed values (for categorical or nominal features, but only if you are sure)

    Here is the data dictionary in YAML format:
    {data_dictionary}

    First ten lines of the dataset in CSV format:
    {data_sample}
    """

{data_dictionary} and {data_sample} are placeholders which will be replaced with actual data. In this example, we use the GPT-5 model from OpenAI. To execute the code, you will need an API key from OpenAI that gives you access to this model.2

    from langchain_openai import ChatOpenAI
    # PromptTemplate is also available from langchain_core.prompts
    from langchain.prompts import PromptTemplate

    # Fill the template first, then pass the resulting prompt to the model.
    chain = (
        PromptTemplate.from_template(PROMPT)
        | ChatOpenAI(model="gpt-5")
    )
📄 Page 19
The text template is converted into a langchain prompt template. The vertical line | is used to chain the prompt template with the LLM. This means that after the prompt template is processed, the resulting prompt is passed to the ChatOpenAI object, which specifies the OpenAI model to use. There are similar interfaces for other LLM providers.

The chain object can now be used to generate an improved data dictionary. We provide the existing data dictionary and a sample of the data as input.

    import mlba

    data_dictionary = """
    name: ebay Auctions
    description: >
      The dataset eBayAuctions.csv contains information on 1972 auctions that
      transacted on eBay.com during May-June 2004.
    source: Copyright 2016 Galit Shmueli and Peter Bruce
    size:
      observations: 1972
      features: 8
    features:
      Category: Category of the auctioned item
      currency: "US: US dollar, GBP: English pound, EUR: Euro"
      sellerRating: >
        a rating by eBay, as a function of the number of "good" and "bad"
        transactions the seller had on eBay
      Duration: Number of days the auction lasted (set by seller at auction start)
      endDay: Day of week that the auction closed
      ClosePrice: Price item sold at (converted into USD)
      OpenPrice: Initial price set by the seller (converted into USD)
      Competitive?: whether the auction had a single bid (0) or more (1)
    """
📄 Page 20
    ebayAuctions = mlba.load_data("ebayAuctions")
    data_sample = ebayAuctions.sample(10, random_state=123).to_csv(index=False)
    context = {
        "data_dictionary": data_dictionary,
        "data_sample": data_sample,
    }
    result = chain.invoke(context)
    print(result.content)

The output is an improved data dictionary in YAML format. Here is an excerpt:

    name: eBay Auctions
    description: >
      This dataset contains information on 1,972 auctions that transacted on
      eBay.com during May-June 2004. It includes item category, listing
      currency, seller reputation, auction duration and closing weekday,
      opening and closing prices (in USD), and an indicator of whether the
      auction attracted more than one bid.
    source: Copyright 2016 Galit Shmueli and Peter Bruce
    size:
      observations: 1972
      features: 8
    notes:
      - All monetary amounts (OpenPrice and ClosePrice) are converted to U.S. dollars (USD).
      - The currency field records the original listing currency.
      - Auction durations in this dataset are the standard eBay options.
    features:
      - name: Category
        readable_name: category
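Because an LLM may state data types or allowed values that the data does not support, it can be worth checking a data dictionary against the data itself. The following is a minimal sketch using only the Python standard library. The dictionary structure, the sample rows, and the helper check_sample are assumptions made for this illustration; only the feature names and the three currency codes come from the dictionary shown earlier.

```python
import csv
import io

# Hypothetical, simplified dictionary: a type and optional allowed values.
dictionary = {
    "Category": {"type": str},
    "currency": {"type": str, "allowed": {"US", "GBP", "EUR"}},
    "Duration": {"type": int},
}

# Made-up sample; the last row deliberately violates the dictionary twice.
sample_csv = """Category,currency,Duration
Automotive,US,7
Music/Movie/Game,US,5
Automotive,CAD,seven
"""

def check_sample(dictionary, csv_text):
    """Return a list of human-readable problems found in the sample."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = set(reader.fieldnames or [])
    missing = set(dictionary) - columns
    for name in sorted(missing):
        problems.append(f"documented feature not in data: {name}")
    for name in sorted(columns - set(dictionary)):
        problems.append(f"column not documented: {name}")
    # Line 1 of the CSV is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        for name, spec in dictionary.items():
            if name in missing:
                continue
            value = row[name]
            try:
                spec["type"](value)  # can the text be read as this type?
            except ValueError:
                problems.append(f"line {line_no}: {name}={value!r} is not {spec['type'].__name__}")
                continue
            allowed = spec.get("allowed")
            if allowed is not None and value not in allowed:
                problems.append(f"line {line_no}: {name}={value!r} is not an allowed value")
    return problems

problems = check_sample(dictionary, sample_csv)
for problem in problems:
    print(problem)
```

A check like this does not replace reading the dictionary critically, but it catches simple contradictions between the documentation and the data before they propagate into an analysis.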