Cleaning Data for Effective Data Science: Doing the other 80% of the work with Python, R, and command-line tools.

Data in its raw state is rarely ready for productive analysis. This book not only teaches you data preparation, but also what questions you should ask of your data. It focuses on the thought processes necessary for successful data cleansing as much as on concise and precise code examples that express these thoughts.

Bibliographic Details
Main Author: David Mertz
Format: eBook
Language: English
Published: Packt Publishing, 2021.
Subjects: Database management; Data integrity
LEADER 04685cam a2200397Ma 4500
001 cc040378-17cc-4518-a4a8-ed83bc026316
005 20250121000000.0
006 m o d
007 cr |||||||||||
008 210331s2021 enk o 000 0 eng d
040 |a UKAHL  |b eng  |c UKAHL  |d EBLCP  |d CZL  |d OCLCO  |d OCLCF 
020 |a 9781801074407  |q (e-book) 
020 |a 1801074402 
035 |a (OCoLC)1245420921 
050 4 |a QA76.9.D3  |b .M478 2021 
082 0 4 |a 005.7565  |2 23 
049 |a MAIN 
100 1 |a Mertz, David. 
245 1 0 |a Cleaning Data for Effective Data Science  |b Doing the other 80% of the work with Python, R, and command-line tools.  |c David Mertz. 
260 |b Packt Publishing  |c 2021. 
300 |a 1 online resource 
505 0 |a Cover -- Copyright -- Contributors -- Table of Contents -- Preface -- Part I -- Data Ingestion -- Chapter 1: Tabular Formats -- Tidying Up -- CSV -- Sanity Checks -- The Good, the Bad, and the Textual Data -- The Bad -- The Good -- Spreadsheets Considered Harmful -- SQL RDBMS -- Massaging Data Types -- Repeating in R -- Where SQL Goes Wrong (and How to Notice It) -- Other Formats -- HDF5 and NetCDF-4 -- Tools and Libraries -- SQLite -- Apache Parquet -- Data Frames -- Spark/Scala -- Pandas and Derived Wrappers -- Vaex -- Data Frames in R (Tidyverse) -- Data Frames in R (data.table) 
505 8 |a Bash for Fun -- Exercises -- Tidy Data from Excel -- Tidy Data from SQL -- Denouement -- Chapter 2: Hierarchical Formats -- JSON -- What JSON Looks Like -- NaN Handling and Data Types -- JSON Lines -- GeoJSON -- Tidy Geography -- JSON Schema -- XML -- User Records -- Keyhole Markup Language -- Configuration Files -- INI and Flat Custom Formats -- TOML -- Yet Another Markup Language -- NoSQL Databases -- Document-Oriented Databases -- Missing Fields -- Denormalization and Its Discontents -- Key/Value Stores -- Exercises -- Exploring Filled Area -- Create a Relational Model -- Denouement 
505 8 |a Chapter 3: Repurposing Data Sources -- Web Scraping -- HTML Tables -- Non-Tabular Data -- Command-Line Scraping -- Portable Document Format -- Image Formats -- Pixel Statistics -- Channel Manipulation -- Metadata -- Binary Serialized Data Structures -- Custom Text Formats -- A Structured Log -- Character Encodings -- Exercises -- Enhancing the NPY Parser -- Scraping Web Traffic -- Denouement -- Part II -- The Vicissitudes of Error -- Chapter 4: Anomaly Detection -- Missing Data -- SQL -- Hierarchical Formats -- Sentinels -- Miscoded Data -- Fixed Bounds -- Outliers -- Z-Score 
505 8 |a Interquartile Range -- Multivariate Outliers -- Exercises -- A Famous Experiment -- Misspelled Words -- Denouement -- Chapter 5: Data Quality -- Missing Data -- Biasing Trends -- Understanding Bias -- Detecting Bias -- Comparison to Baselines -- Benford's Law -- Class Imbalance -- Normalization and Scaling -- Applying a Machine Learning Model -- Scaling Techniques -- Factor and Sample Weighting -- Cyclicity and Autocorrelation -- Domain Knowledge Trends -- Discovered Cycles -- Bespoke Validation -- Collation Validation -- Transcription Validation -- Exercises -- Data Characterization 
505 8 |a Oversampled Polls -- Denouement -- Part III -- Rectification and Creation -- Chapter 6: Value Imputation -- Typical-Value Imputation -- Typical Tabular Data -- Locality Imputation -- Trend Imputation -- Types of Trends -- A Larger Coarse Time Series -- Understanding the Data -- Removing Unusable Data -- Imputing Consistency -- Interpolation -- Non-Temporal Trends -- Sampling -- Undersampling -- Oversampling -- Exercises -- Alternate Trend Imputation -- Balancing Multiple Features -- Denouement -- Chapter 7: Feature Engineering -- Date/Time Fields -- Creating Datetimes -- Imposing Regularity 
520 |a Data in its raw state is rarely ready for productive analysis. This book not only teaches you data preparation, but also what questions you should ask of your data. It focuses on the thought processes necessary for successful data cleansing as much as on concise and precise code examples that express these thoughts. 
590 |b PALCIMontclair 
650 0 |a Database management. 
650 0 |a Data integrity. 
650 6 |a Bases de données  |x Gestion. 
650 6 |a Intégrité des données. 
650 7 |a Data integrity.  |2 fast  |0 (OCoLC)fst01746571 
650 7 |a Database management.  |2 fast  |0 (OCoLC)fst00888037 
655 0 |a Electronic books. 
999 1 0 |i cc040378-17cc-4518-a4a8-ed83bc026316  |l on1245420921  |s US-NJUPM  |m cleaning_data_for_effective_data_sciencedoing_the_other_80_of_the_work_____2021_______packta________________________________________david_mertz__mertz_________________e