───✱*.。:。✱*.:。✧*.。✰*.:。✧*.。:。*.。✱ ───

Data and Goal

  • Data → a set of records (examples, instances, cases, rows, etc.)
    • 𝑘 attributes → 𝐴1, 𝐴2, …, 𝐴𝑘
    • A class → each example is labelled with a pre-defined class
  • Goal → to learn a classification model from the data that can be used to predict the classes of new cases/instances

Classification Process

  • Model construction involves describing a set of predetermined classes
    • Each record is assumed to belong to one predefined class
    • The set of records used for model construction is the training set
    • The model is represented as classification rules, decision trees, or probabilistic models
  • Model usage is for classifying future or unknown objects
    • Estimate the accuracy of the model
      • The predicted class of each test example is compared with its known label
      • Accuracy rate is the percentage of test set examples that the model classifies correctly
      • The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic and over-fitting goes undetected
      • If the accuracy is acceptable, use the model to classify future data whose class labels are not known
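
The construction and usage steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy records, the helper names (train_test_split, fit_majority_class, accuracy) and the 30% test fraction are hypothetical, and the "model" is a deliberately trivial majority-class predictor standing in for a real classifier.

```python
import random
from collections import Counter


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into disjoint training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_majority_class(training_set):
    """Model construction (placeholder): always predict the most common training label."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda attributes: majority


def accuracy(model, test_set):
    """Fraction of test examples whose predicted class equals the known label."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return correct / len(test_set)


# Hypothetical toy records: (attributes, class label) pairs.
records = [({"outlook": o}, label) for o, label in [
    ("sunny", "play"), ("sunny", "play"), ("rain", "stay"), ("overcast", "play"),
    ("rain", "play"), ("sunny", "stay"), ("overcast", "play"), ("rain", "stay"),
    ("sunny", "play"), ("overcast", "play"),
]]

training_set, test_set = train_test_split(records)
model = fit_majority_class(training_set)  # model construction
print(f"accuracy on held-out test set: {accuracy(model, test_set):.2f}")  # model usage
```

Because the test records are never seen during construction, the printed figure estimates how the model would behave on future, unlabelled data.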

Decision Tree Construction Algorithm

  • Attributes are categorical
  • Start at the root and work down
  • At each level of the tree, the splitting attribute is selected based on some statistical measure (typically the Gini index)
  • You stop picking attributes when
    • All samples for the given node belong to the same class
    • There are no remaining attributes for further partitioning
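
The procedure above can be sketched as a short recursive function. This is a minimal sketch, not a reference implementation: it assumes categorical attributes stored in dictionaries, uses the weighted Gini impurity of each candidate split as the selection measure, and falls back to the majority class when attributes run out. The function names and toy records are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(records, attribute):
    """Average Gini impurity of the partitions produced by splitting on one attribute."""
    groups = {}
    for attributes, label in records:
        groups.setdefault(attributes[attribute], []).append(label)
    n = len(records)
    return sum(len(g) / n * gini(g) for g in groups.values())


def build_tree(records, attributes):
    """Grow the tree top-down; a leaf is a class label, an inner node is a dict."""
    labels = [label for _, label in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:   # stop: all samples belong to the same class
        return labels[0]
    if not attributes:          # stop: no attributes left to partition on
        return majority
    best = min(attributes, key=lambda a: weighted_gini(records, a))
    remaining = [a for a in attributes if a != best]
    partitions = {}
    for record in records:
        partitions.setdefault(record[0][best], []).append(record)
    return {
        "attribute": best,
        "default": majority,    # used for attribute values not seen in training
        "branches": {v: build_tree(sub, remaining) for v, sub in partitions.items()},
    }


def classify(tree, attributes):
    """Follow branches from the root until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(attributes[tree["attribute"]], tree["default"])
    return tree


# Hypothetical toy records: (attributes, class label) pairs.
records = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rain", "windy": "no"}, "play"),
    ({"outlook": "rain", "windy": "yes"}, "stay"),
]
tree = build_tree(records, ["outlook", "windy"])
print(classify(tree, {"outlook": "rain", "windy": "yes"}))
```

Each recursive call removes the attribute it split on, so the depth of the tree is bounded by the number of attributes and both stopping conditions are eventually reached.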

Gini Impurity

  • Gini impurity measures how likely a randomly chosen instance is to be classified incorrectly if it is labelled at random according to the class distribution of the node
  • An attribute whose split gives a lower Gini impurity separates the classes more cleanly, so it is preferred over attributes with higher Gini impurity
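
For reference, the usual definition can be written out as follows. The notation is an assumption introduced here, not taken from the notes: S is the set of records at a node, c the number of classes, p_i the fraction of records in S with class i, and S_v the subset of S where attribute A takes value v.

```latex
% Gini impurity of a node
G(S) = 1 - \sum_{i=1}^{c} p_i^{2}

% Weighted Gini impurity after splitting S on attribute A
G_A(S) = \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, G(S_v)

% Worked example: a node with 2 "yes" and 2 "no" records
G = 1 - \left(0.5^{2} + 0.5^{2}\right) = 0.5

% A pure node (all records in one class)
G = 1 - 1^{2} = 0
```

A pure node has impurity 0, and for two classes the maximum is 0.5, reached when the classes are evenly mixed.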

───✱*.。:。✱*.:。✧*.。✰*.:。✧*.。:。*.。✱ ───