───✱*.。:。✱*.:。✧*.。✰*.:。✧*.。:。*.。✱ ───

Data preprocessing is the process of transforming raw data into a useful and efficient format. We want our data to be free from noise, missing values, and discrepancies.

Why Preprocess?

  • Data ≠ knowledge
    • Knowledge acquisition is heavily dependent on data quality
  • There are several measures for data quality
    • Accuracy → correct or wrong
    • Completeness → not recorded, unavailable
    • Consistency → some modified but some not
    • Timeliness → timely updates
    • Believability → how trustable is the data
    • Interpretability → how easy the data is to understand
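Several of these quality measures can be checked programmatically. A minimal sketch, using hypothetical records and illustrative field names, that quantifies completeness and flags accuracy and consistency problems:

```python
# Hypothetical records; field names and values are illustrative.
records = [
    {"name": "Ana", "salary": 52000, "gender": "F"},
    {"name": "Ben", "salary": None,  "gender": "Male"},  # missing value
    {"name": "Cy",  "salary": -10,   "gender": "M"},     # impossible value
]

# Completeness: share of records with no missing fields.
complete = [r for r in records if all(v is not None for v in r.values())]
completeness = len(complete) / len(records)

# Consistency: every gender code drawn from one agreed vocabulary.
allowed = {"M", "F"}
inconsistent = [r["name"] for r in records if r["gender"] not in allowed]

# Accuracy: a simple range check flags the negative salary.
inaccurate = [r["name"] for r in records
              if r["salary"] is not None and r["salary"] < 0]

print(completeness)   # 2 of 3 records are fully populated
print(inconsistent)   # ['Ben'] uses "Male" instead of the code "M"
print(inaccurate)     # ['Cy'] has an impossible salary
```

Real pipelines would add domain-specific rules (valid ranges, reference vocabularies, freshness timestamps), but the pattern is the same: each quality measure becomes a checkable predicate over the data.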

Importance of Preprocessing

  • Not only is preprocessing necessary for meaningful analysis, it also takes up the largest chunk of a data scientist’s time
    • We’ve seen how quickly Python can perform the analysis components; ensuring the data is cleansed will demand more from the analyst
  • It’s a safe assumption that data loading/preprocessing will take around 30-40% of your time

Forms of Preprocessing

  • Data Cleaning → fill in missing values, smooth noisy data, identify or remove outliers
  • Data Integration → integration of multiple databases
  • Data Reduction → compression, dimensionality and numerosity reduction
  • Data Transformation/Discretization → normalization, concept hierarchy generation
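As one concrete transformation step, min-max normalization rescales a numeric attribute linearly into a target range. A minimal sketch (the function name and sample salaries are illustrative):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max] (min-max normalization)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [new_min + (v - lo) / span * (new_max - new_min) for v in values]

salaries = [30000, 45000, 60000, 90000]
normalized = min_max_normalize(salaries)
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```

Note that min-max normalization is sensitive to outliers: one extreme value stretches the range and compresses everything else, which is one reason cleaning typically happens before transformation.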

Data Cleaning

  • Data in the real world is often dirty; every data set potentially contains incorrect data
    • Incomplete → lacking values
      • Salary = NaN
    • Noisy → containing noise (errors)
      • Salary = -10
    • Inconsistent → containing discrepancies
      • Gender = “M”, Gender = “Male”
    • Outliers → data that’s outside of normal range
      • 1, 2, 1000000, 3, 4
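The four problems above each suggest a cleaning step. A minimal sketch with only the standard library, reusing the examples from the list (the mapping table and the outlier threshold are illustrative choices, not fixed rules):

```python
import statistics

# Incomplete: fill a missing salary with the column median (one common imputation).
salaries = [52000, None, 48000, 61000]
known = [s for s in salaries if s is not None]
fill = statistics.median(known)
salaries = [fill if s is None else s for s in salaries]

# Inconsistent: map variant codes onto one agreed vocabulary.
gender_map = {"M": "M", "Male": "M", "F": "F", "Female": "F"}
genders = [gender_map[g] for g in ["M", "Male", "F"]]

# Outliers: flag points far from the median relative to the
# median absolute deviation (a robust rule; the factor 10 is arbitrary).
values = [1, 2, 1000000, 3, 4]
med = statistics.median(values)                         # 3
mad = statistics.median(abs(v - med) for v in values)   # 1
outliers = [v for v in values if abs(v - med) > 10 * mad]

print(salaries)   # missing entry replaced by the median, 52000
print(genders)    # ['M', 'M', 'F']
print(outliers)   # [1000000]
```

Noisy values like Salary = -10 are usually handled the same way: a range check identifies them, and they are then either corrected, imputed, or removed, depending on the domain.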

───✱*.。:。✱*.:。✧*.。✰*.:。✧*.。:。*.。✱ ───