───✱*.。:。✱*.:。✧*.。✰*.:。✧*.。:。*.。✱ ───
Data preprocessing is the process of transforming raw data into a useful and efficient format. We want our data to be free from noise, missing values, and discrepancies.
Why Preprocess?
- Data ≠ knowledge
- Knowledge acquisition is heavily dependent on data quality
- There are several measures for data quality
- Accuracy → correct or wrong
- Completeness → not recorded, unavailable
- Consistency → some modified but some not
- Timeliness → timely updates
- Believability → how trustable is the data
- Interpretability → how easy the data is to understand
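Two of the measures above, completeness and consistency, can be checked mechanically. A minimal sketch on a toy record set (the field names and the allowed gender vocabulary are illustrative assumptions, not from the notes):

```python
# Toy records with one missing value and one inconsistent code
records = [
    {"name": "Ana", "salary": 50000, "gender": "F"},
    {"name": "Bo",  "salary": None,  "gender": "Male"},  # missing + inconsistent
    {"name": "Cy",  "salary": 60000, "gender": "M"},
]

# Completeness: fraction of records with no missing values
complete = sum(all(v is not None for v in r.values()) for r in records)
print(f"completeness: {complete / len(records):.0%}")

# Consistency: do all gender codes come from one agreed vocabulary?
allowed = {"M", "F"}
inconsistent = [r["name"] for r in records if r["gender"] not in allowed]
print("inconsistent records:", inconsistent)
```

Accuracy and believability, by contrast, usually need a reference source outside the data set itself.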
Importance of Preprocessing
- Not only is preprocessing necessary for meaningful analysis, it also takes up the largest chunk of a data scientist’s time
- We’ve seen how quickly Python can perform the analysis components; ensuring the data is cleansed will require more from the analyst
- It’s a safe assumption that data loading/preprocessing will take around 30-40% of your time
Forms of Preprocessing
- Data Cleaning → fill in missing values, smooth noisy data, identify or remove outliers
- Data Integration → integration of multiple databases
- Data Reduction → compression, dimensionality and numerosity reduction
- Data Transformation/Discretization → normalization, concept hierarchy generation
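As an illustration of the transformation step, here is a minimal sketch of min-max normalization, which rescales a numeric column into [0, 1]; the salary figures are made up for the example:

```python
def min_max_normalize(values):
    """Rescale a list of numbers into the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

salaries = [30000, 45000, 60000, 90000]
print(min_max_normalize(salaries))    # smallest maps to 0.0, largest to 1.0
```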
Data Cleaning
- Data in the real world is often dirty: every data set can contain potentially incorrect data
- Incomplete → lacking values
- Salary = NaN
- Noisy → containing noise (errors)
- Salary = -10
- Inconsistent → containing discrepancies
- Gender = “M”, Gender = “Male”
- Outliers → data that’s outside of normal range
- 1, 2, 1000000, 3, 4
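The four kinds of dirty data above can be handled in a few lines of pandas. This is a minimal sketch, assuming a toy table with the example values from the list; the median-fill rule and the 1.5×IQR outlier cutoff are common conventions, not the only options:

```python
import pandas as pd

df = pd.DataFrame({
    "salary": [50000, None, -10, 1000000, 60000],
    "gender": ["M", "Male", "F", "Female", "M"],
})

# Incomplete: fill missing salaries with the column median
df["salary"] = df["salary"].fillna(df["salary"].median())

# Noisy: a negative salary is an error, so overwrite it too
df.loc[df["salary"] < 0, "salary"] = df["salary"].median()

# Inconsistent: map spelled-out labels onto a single coding
df["gender"] = df["gender"].replace({"Male": "M", "Female": "F"})

# Outliers: flag values far outside the interquartile range
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df["outlier"] = (df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)
print(df)
```

Whether an outlier like the 1000000 here is dropped, capped, or kept depends on the analysis; the flag only identifies it.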
───✱*.。:。✱*.:。✧*.。✰*.:。✧*.。:。*.。✱ ───