Cleaning Data for Effective Data Science: Doing the other 80% of the work with Python, R, and command-line tools.
Data in its raw state is rarely ready for productive analysis. This book not only teaches you data preparation, but also what questions you should ask of your data. It focuses on the thought processes necessary for successful data cleansing as much as on concise and precise code examples that express these thoughts.
Main Author: | David Mertz |
---|---|
Format: | eBook |
Language: | English |
Published: | Packt Publishing, 2021. |
Subjects: | Database management; Data integrity |
LEADER | 04685cam a2200397Ma 4500 | ||
---|---|---|---|
001 | cc040378-17cc-4518-a4a8-ed83bc026316 | ||
005 | 20250121000000.0 | ||
006 | m o d | ||
007 | cr ||||||||||| | ||
008 | 210331s2021 enk o 000 0 eng d | ||
040 | |a UKAHL |b eng |c UKAHL |d EBLCP |d CZL |d OCLCO |d OCLCF | ||
020 | |a 9781801074407 |q (e-book) | ||
020 | |a 1801074402 | ||
035 | |a (OCoLC)1245420921 | ||
050 | 4 | |a QA76.9.D3 |b .M478 2021 | |
082 | 0 | 4 | |a 005.7565 |2 23 |
049 | |a MAIN | ||
100 | 1 | |a David Mertz, Mertz. | |
245 | 1 | 0 | |a Cleaning Data for Effective Data Science |b Doing the other 80% of the work with Python, R, and command-line tools. |c David Mertz, Mertz. |
260 | |b Packt Publishing |c 2021. | ||
300 | |a 1 online resource | ||
505 | 0 | |a Cover -- Copyright -- Contributors -- Table of Contents -- Preface -- Part I -- Data Ingestion -- Chapter 1: Tabular Formats -- Tidying Up -- CSV -- Sanity Checks -- The Good, the Bad, and the Textual Data -- The Bad -- The Good -- Spreadsheets Considered Harmful -- SQL RDBMS -- Massaging Data Types -- Repeating in R -- Where SQL Goes Wrong (and How to Notice It) -- Other Formats -- HDF5 and NetCDF-4 -- Tools and Libraries -- SQLite -- Apache Parquet -- Data Frames -- Spark/Scala -- Pandas and Derived Wrappers -- Vaex -- Data Frames in R (Tidyverse) -- Data Frames in R (data.table) | |
505 | 8 | |a Bash for Fun -- Exercises -- Tidy Data from Excel -- Tidy Data from SQL -- Denouement -- Chapter 2: Hierarchical Formats -- JSON -- What JSON Looks Like -- NaN Handling and Data Types -- JSON Lines -- GeoJSON -- Tidy Geography -- JSON Schema -- XML -- User Records -- Keyhole Markup Language -- Configuration Files -- INI and Flat Custom Formats -- TOML -- Yet Another Markup Language -- NoSQL Databases -- Document-Oriented Databases -- Missing Fields -- Denormalization and Its Discontents -- Key/Value Stores -- Exercises -- Exploring Filled Area -- Create a Relational Model -- Denouement | |
505 | 8 | |a Chapter 3: Repurposing Data Sources -- Web Scraping -- HTML Tables -- Non-Tabular Data -- Command-Line Scraping -- Portable Document Format -- Image Formats -- Pixel Statistics -- Channel Manipulation -- Metadata -- Binary Serialized Data Structures -- Custom Text Formats -- A Structured Log -- Character Encodings -- Exercises -- Enhancing the NPY Parser -- Scraping Web Traffic -- Denouement -- Part II -- The Vicissitudes of Error -- Chapter 4: Anomaly Detection -- Missing Data -- SQL -- Hierarchical Formats -- Sentinels -- Miscoded Data -- Fixed Bounds -- Outliers -- Z-Score | |
505 | 8 | |a Interquartile Range -- Multivariate Outliers -- Exercises -- A Famous Experiment -- Misspelled Words -- Denouement -- Chapter 5: Data Quality -- Missing Data -- Biasing Trends -- Understanding Bias -- Detecting Bias -- Comparison to Baselines -- Benford's Law -- Class Imbalance -- Normalization and Scaling -- Applying a Machine Learning Model -- Scaling Techniques -- Factor and Sample Weighting -- Cyclicity and Autocorrelation -- Domain Knowledge Trends -- Discovered Cycles -- Bespoke Validation -- Collation Validation -- Transcription Validation -- Exercises -- Data Characterization | |
505 | 8 | |a Oversampled Polls -- Denouement -- Part III -- Rectification and Creation -- Chapter 6: Value Imputation -- Typical-Value Imputation -- Typical Tabular Data -- Locality Imputation -- Trend Imputation -- Types of Trends -- A Larger Coarse Time Series -- Understanding the Data -- Removing Unusable Data -- Imputing Consistency -- Interpolation -- Non-Temporal Trends -- Sampling -- Undersampling -- Oversampling -- Exercises -- Alternate Trend Imputation -- Balancing Multiple Features -- Denouement -- Chapter 7: Feature Engineering -- Date/Time Fields -- Creating Datetimes -- Imposing Regularity | |
520 | |a Data in its raw state is rarely ready for productive analysis. This book not only teaches you data preparation, but also what questions you should ask of your data. It focuses on the thought processes necessary for successful data cleansing as much as on concise and precise code examples that express these thoughts. | ||
590 | |b PALCIMontclair | ||
650 | 0 | |a Database management. | |
650 | 0 | |a Data integrity. | |
650 | 6 | |a Bases de données |x Gestion. | |
650 | 6 | |a Intégrité des données. | |
650 | 7 | |a Data integrity. |2 fast |0 (OCoLC)fst01746571 | |
650 | 7 | |a Database management. |2 fast |0 (OCoLC)fst00888037 | |
655 | 0 | |a Electronic books. | |
999 | 1 | 0 | |i cc040378-17cc-4518-a4a8-ed83bc026316 |l on1245420921 |s US-NJUPM |m cleaning_data_for_effective_data_sciencedoing_the_other_80_of_the_work_____2021_______packta________________________________________david_mertz__mertz_________________e |