───✱*.。:。✱*.:。✧*.。✰*.:。✧*.。:。*.。✱ ───
Data preprocessing is the process of transforming raw data into a useful and efficient format. We want our data to be free from noise, missing values, and discrepancies.
Why Preprocess?
- Data ≠ knowledge
- Knowledge acquisition is heavily dependent on data quality
- There are several measures for data quality
- Accuracy → correct or wrong
- Completeness → not recorded, unavailable
- Consistency → some modified but some not
- Timeliness → timely updates
- Believability → how trustable is the data
- Interpretability → how easy the data is to understand
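Two of the measures above, completeness and consistency, can be checked mechanically. A minimal sketch on a toy record set (the field names and the allowed gender vocabulary are illustrative assumptions, not from the notes):

```python
# Toy records with one missing value and one inconsistent code
records = [
    {"name": "Ana", "salary": 50000, "gender": "F"},
    {"name": "Bo",  "salary": None,  "gender": "Male"},  # missing + inconsistent
    {"name": "Cy",  "salary": 60000, "gender": "M"},
]

# Completeness: fraction of records with no missing values
complete = sum(all(v is not None for v in r.values()) for r in records)
print(f"completeness: {complete / len(records):.0%}")

# Consistency: do all gender codes come from one agreed vocabulary?
allowed = {"M", "F"}
inconsistent = [r["name"] for r in records if r["gender"] not in allowed]
print("inconsistent records:", inconsistent)
```

Accuracy and believability, by contrast, usually need a reference source outside the data set itself.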
Importance of Preprocessing
- Not only is preprocessing necessary for meaningful analysis, it also takes up the largest chunk of a data scientist’s time
- We’ve seen how quickly Python can perform the analysis components; ensuring the data is cleansed will require more from the analyst
- It’s a safe assumption that data loading/preprocessing will take around 30-40% of your time
Forms of Preprocessing
- Data Cleaning → fill in missing values, smooth noisy data, identify or remove outliers
- Data Integration → integration of multiple databases
- Data Reduction → compression, dimensionality and numerosity reduction
- Data Transformation/Discretization → normalization, concept hierarchy generation
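As an illustration of the transformation step, here is a minimal sketch of min-max normalization, which rescales a numeric column into [0, 1]; the salary figures are made up for the example:

```python
def min_max_normalize(values):
    """Rescale a list of numbers into the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

salaries = [30000, 45000, 60000, 90000]
print(min_max_normalize(salaries))    # smallest maps to 0.0, largest to 1.0
```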
Data Cleaning
- Data in the real world is often dirty: every data set can contain potentially incorrect data
- Incomplete → lacking values
- Salary = NaN
- Noisy → containing noise (errors)
- Salary = -10
- Inconsistent → containing discrepancies
- Gender = “M”, Gender = “Male”
- Outliers → data that’s outside of normal range
- 1, 2, 1000000, 3, 4
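The four kinds of dirty data above can be handled in a few lines of pandas. This is a minimal sketch, assuming a toy table with the example values from the list; the median-fill rule and the 1.5×IQR outlier cutoff are common conventions, not the only options:

```python
import pandas as pd

df = pd.DataFrame({
    "salary": [50000, None, -10, 1000000, 60000],
    "gender": ["M", "Male", "F", "Female", "M"],
})

# Incomplete: fill missing salaries with the column median
df["salary"] = df["salary"].fillna(df["salary"].median())

# Noisy: a negative salary is an error, so overwrite it too
df.loc[df["salary"] < 0, "salary"] = df["salary"].median()

# Inconsistent: map spelled-out labels onto a single coding
df["gender"] = df["gender"].replace({"Male": "M", "Female": "F"})

# Outliers: flag values far outside the interquartile range
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df["outlier"] = (df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)
print(df)
```

Whether an outlier like the 1000000 here is dropped, capped, or kept depends on the analysis; the flag only identifies it.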
───✱*.。:。✱*.:。✧*.。✰*.:。✧*.。:。*.。✱ ───