───✱*.。:。✱*.:。✧*.。✰*.:。✧*.。:。*.。✱ ───
Data preprocessing is the process of transforming raw data into a useful and efficient format. We want our data to be free of noise, missing values, and discrepancies.
Why Preprocess?
- Data ≠ knowledge
- Knowledge acquisition is heavily dependent on data quality
 
- There are several measures of data quality:
	- Accuracy → correct or wrong
	- Completeness → not recorded, unavailable
	- Consistency → some modified but some not
	- Timeliness → timely updates
	- Believability → how trustworthy the data is
	- Interpretability → how easy the data is to understand
 
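As a quick illustration, the completeness and consistency measures above can be checked mechanically. This is a minimal sketch over a hypothetical toy dataset (all names and values are made up):

```python
# Toy records; None marks a value that was never recorded (hypothetical data).
records = [
    {"name": "Ana",  "salary": 52000, "gender": "F"},
    {"name": "Ben",  "salary": None,  "gender": "Male"},
    {"name": "Cara", "salary": 61000, "gender": "M"},
]

# Completeness: fraction of non-missing values per field.
fields = ["name", "salary", "gender"]
completeness = {
    f: sum(r[f] is not None for r in records) / len(records) for f in fields
}
print(completeness)  # salary is only 2/3 complete

# Consistency: the gender field mixes two coding schemes ("M"/"F" vs "Male").
codes = {r["gender"] for r in records}
print(codes)
```

A real quality audit would run checks like these per column before any analysis begins.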
 
Importance of Preprocessing
- Not only is preprocessing necessary for meaningful analysis, it will also take up the largest chunk of a data scientist's time
- We've seen how quickly Python can perform the analysis itself; ensuring the data is clean will demand more from the analyst
- It's a safe assumption that data loading/preprocessing will take around 30-40% of your time
 
Forms of Preprocessing
- Data Cleaning → fill in missing values, smooth noisy data, identify or remove outliers
- Data Integration → integration of multiple databases
- Data Reduction → compression, dimensionality and numerosity reduction
- Data Transformation/Discretization → normalization, concept hierarchy generation
 
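As one example of the transformation step, min-max normalization rescales a numeric column into a fixed range. A minimal sketch (the function name and sample salaries are hypothetical):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max] (min-max normalization)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [new_min + (v - lo) / span * (new_max - new_min) for v in values]

salaries = [30000, 45000, 60000, 90000]
print(min_max_normalize(salaries))  # [0.0, 0.25, 0.5, 1.0]
```

Note this version assumes the column is not constant (span would be zero) and contains no missing values, which is why cleaning comes first.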
Data Cleaning
- Data in the real world is often dirty; every data set potentially contains incorrect data
- Incomplete → lacking values
	- Salary = NaN
- Noisy → containing noise (errors)
	- Salary = -10
- Inconsistent → containing discrepancies
	- Gender = “M”, Gender = “Male”
- Outliers → data outside the normal range
	- 1, 2, 1000000, 3, 4
 
 
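The four kinds of dirty data above can each be handled in a few lines of pandas. This is a minimal sketch, assuming a hypothetical DataFrame that mirrors the bullet examples (median imputation and the IQR rule are one common choice among many):

```python
import pandas as pd

# Hypothetical dirty data mirroring the examples above.
df = pd.DataFrame({
    "salary": [52000, float("nan"), -10, 61000, 58000, 1_000_000],
    "gender": ["M", "Male", "F", "Female", "M", "F"],
})

# Median of the plausible (non-missing, non-negative) salaries.
valid = df["salary"][df["salary"] >= 0]
median = valid.median()

# Incomplete: fill the missing salary with the median.
df["salary"] = df["salary"].fillna(median)

# Noisy: a negative salary is an error; replace it with the median too.
df.loc[df["salary"] < 0, "salary"] = median

# Inconsistent: collapse both coding schemes onto one.
df["gender"] = df["gender"].replace({"Male": "M", "Female": "F"})

# Outliers: flag values far outside the interquartile range (1.5 * IQR rule).
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df["outlier"] = (df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)
print(df)
```

Whether to impute, drop, or flag each problem depends on the analysis; the point is that every category of dirty data needs an explicit decision.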
───✱*.。:。✱*.:。✧*.。✰*.:。✧*.。:。*.。✱ ───