Python Data Science Essentials
Alberto Boschetti, Luca Massaron
Updated: 2021-08-13 15:20:17
Cover
Title Page
Copyright and Credits
Python Data Science Essentials Third Edition
Packt Upsell
Why subscribe?
Packt.com
Contributors
About the authors
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
First Steps
Introducing data science and Python
Installing Python
Python 2 or Python 3?
Step-by-step installation
Installing the necessary packages
Package upgrades
Scientific distributions
Anaconda
Leveraging conda to install packages
Enthought Canopy
WinPython
Explaining virtual environments
Conda for managing environments
A glance at the essential packages
NumPy
SciPy
pandas
pandas-profiling
Scikit-learn
Jupyter
JupyterLab
Matplotlib
Seaborn
Statsmodels
Beautiful Soup
NetworkX
NLTK
Gensim
PyPy
XGBoost
LightGBM
CatBoost
TensorFlow
Keras
Introducing Jupyter
Fast installation and first test usage
Jupyter magic commands
Installing packages directly from Jupyter Notebooks
Checking the new JupyterLab environment
How Jupyter Notebooks can help data scientists
Alternatives to Jupyter
Datasets and code used in this book
Scikit-learn toy datasets
The MLdata.org and other public repositories for open source data
LIBSVM data examples
Loading data directly from CSV or text files
Scikit-learn sample generators
Summary
Data Munging
The data science process
Data loading and preprocessing with pandas
Fast and easy data loading
Dealing with problematic data
Dealing with big datasets
Accessing other data formats
Putting data together
Data preprocessing
Data selection
Working with categorical and textual data
A special type of data – text
Scraping the web with Beautiful Soup
Data processing with NumPy
NumPy's n-dimensional array
The basics of NumPy ndarray objects
Creating NumPy arrays
From lists to unidimensional arrays
Controlling memory size
Heterogeneous lists
From lists to multidimensional arrays
Resizing arrays
Arrays derived from NumPy functions
Getting an array directly from a file
Extracting data from pandas
NumPy fast operation and computations
Matrix operations
Slicing and indexing with NumPy arrays
Stacking NumPy arrays
Working with sparse arrays
Summary
The Data Pipeline
Introducing EDA
Building new features
Dimensionality reduction
The covariance matrix
Principal component analysis
PCA for big data – RandomizedPCA
Latent factor analysis
Linear discriminant analysis
Latent semantical analysis
Independent component analysis
Kernel PCA
T-SNE
Restricted Boltzmann Machine
The detection and treatment of outliers
Univariate outlier detection
EllipticEnvelope
OneClassSVM
Validation metrics
Multilabel classification
Binary classification
Regression
Testing and validating
Cross-validation
Using cross-validation iterators
Sampling and bootstrapping
Hyperparameter optimization
Building custom scoring functions
Reducing the grid search runtime
Feature selection
Selection based on feature variance
Univariate selection
Recursive elimination
Stability and L1-based selection
Wrapping everything in a pipeline
Combining features together and chaining transformations
Building custom transformation functions
Summary
Machine Learning
Preparing tools and datasets
Linear and logistic regression
Naive Bayes
K-Nearest Neighbors
Nonlinear algorithms
SVM for classification
SVM for regression
Tuning SVM
Ensemble strategies
Pasting by random samples
Bagging with weak classifiers
Random Subspaces and Random Patches
Random Forests and Extra-Trees
Estimating probabilities from an ensemble
Sequences of models – AdaBoost
Gradient tree boosting (GTB)
XGBoost
LightGBM
CatBoost
Dealing with big data
Creating some big datasets as examples
Scalability with volume
Keeping up with velocity
Dealing with variety
An overview of Stochastic Gradient Descent (SGD)
A peek into natural language processing (NLP)
Word tokenization
Stemming
Word tagging
Named entity recognition (NER)
Stopwords
A complete data science example – text classification
An overview of unsupervised learning
K-means
DBSCAN – a density-based clustering technique
Latent Dirichlet Allocation (LDA)
Summary
Visualization, Insights, and Results
Introducing the basics of matplotlib
Trying curve plotting
Using panels for clearer representations
Plotting scatterplots for relationships in data
Histograms
Bar graphs
Image visualization
Selected graphical examples with pandas
Working with boxplots and histograms
Plotting scatterplots
Discovering patterns by parallel coordinates
Wrapping up matplotlib's commands
Introducing Seaborn
Enhancing your EDA capabilities
Advanced data learning representation
Learning curves
Validation curves
Feature importance for RandomForests
Gradient Boosting Trees partial dependence plotting
Creating a prediction server with machine-learning-as-a-service
Summary
Social Network Analysis
Introduction to graph theory
Graph algorithms
Types of node centrality
Partitioning a network
Graph loading, dumping, and sampling
Summary
Deep Learning Beyond the Basics
Approaching deep learning
Classifying images with CNN
Using pre-trained models
Working with temporal sequences
Summary
Spark for Big Data
From a standalone machine to a bunch of nodes
Making sense of why we need a distributed framework
The Hadoop ecosystem
Hadoop architecture
Hadoop Distributed File System
MapReduce
Introducing Apache Spark
PySpark
Starting with PySpark
Setting up your local Spark instance
Experimenting with Resilient Distributed Datasets
Sharing variables across cluster nodes
Read-only broadcast variables
Write-only accumulator variables
Broadcast and accumulator variables together—an example
Data preprocessing in Spark
CSV files and Spark DataFrames
Dealing with missing data
Grouping and creating tables in-memory
Writing the preprocessed DataFrame or RDD to disk
Working with Spark DataFrames
Machine learning with Spark
Spark on the KDD99 dataset
Reading the dataset
Feature engineering
Training a learner
Evaluating a learner's performance
The power of the machine learning pipeline
Manual tuning
Cross-validation
Final cleanup
Summary
Strengthen Your Python Foundations
Your learning list
Lists
Dictionaries
Defining functions
Classes, objects, and object-oriented programming
Exceptions
Iterators and generators
Conditionals
Comprehensions for lists and dictionaries
Learn by watching, reading, and doing
Massive open online courses (MOOCs)
PyCon and PyData
Interactive Jupyter
Don't be shy, take a real challenge
Other Books You May Enjoy
Leave a review - let other readers know what you think