───✱*.｡:｡✱*.:｡✧*.｡✰*.:｡✧*.｡:｡*.｡✱ ───

Data and Goal

Data → a set of records (examples, instances, cases, rows, etc)
- Attributes → each record is described by a set of attributes
- A class → each example is labelled with a pre-defined class
Goal → to learn a classification model from the data that can be used to predict the classes of new cases/instances

Classification Process

Model construction involves describing a set of predetermined classes
- Each record is assumed to belong to one predefined class
- The set of records used for model construction is the training set
- The model is represented as classification rules, decision trees, or probabilistic models
Model usage is for classifying future or unknown objects
- Estimate the accuracy of the model
  - The class predicted by the model is compared with the known label of each test set example
  - Accuracy rate is the percentage of test set examples that are correctly classified by the model
  - Test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic and over-fitting goes undetected
  - If the accuracy is acceptable, use the model to classify future data whose class labels are not known
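
A minimal sketch of the construct-then-evaluate process, in plain Python. The helper names (train_test_split, train_majority, accuracy) and the toy records are made up for illustration, and the "model" is just a majority-class predictor standing in for a real classifier.

```python
import random
from collections import Counter

def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle the records and hold out a fraction as an independent test set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def train_majority(training_set):
    """Placeholder model construction: always predict the most common training class."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda attributes: majority

def accuracy(model, test_set):
    """Percentage of test examples whose predicted class matches the known label."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return 100.0 * correct / len(test_set)

# Each record is (attributes, class label); the data is invented.
records = [({"x": i}, "a" if i % 4 else "b") for i in range(8)]
training_set, test_set = train_test_split(records)
model = train_majority(training_set)      # model construction
print(accuracy(model, test_set))          # model usage: estimate accuracy first
```

The test set is split off before training, so no test example is seen during model construction.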

Attributes are categorical
Start at the root of the tree and work down
The attribute tested at each level of the tree is selected based on some statistical measure (typically the Gini index)
You stop picking attributes when
- All samples for the given node belong to the same class
- There are no remaining attributes for further partitioning
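
The top-down procedure and its two stopping conditions can be sketched as a short recursive function. This is an illustrative sketch, not a specific library's algorithm: the function names and toy weather records are invented, and labelling a leaf with the majority class when attributes run out is an assumption the notes do not state.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def split_gini(records, attribute):
    """Weighted Gini impurity of the partitions produced by splitting on one attribute."""
    partitions = {}
    for attrs, label in records:
        partitions.setdefault(attrs[attribute], []).append(label)
    return sum(len(p) / len(records) * gini(p) for p in partitions.values())

def build_tree(records, attributes):
    """Return a class label (leaf) or an (attribute, {value: subtree}) pair."""
    labels = [label for _, label in records]
    # Stop: all samples for this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no remaining attributes; use the majority class (assumption).
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Pick the attribute whose split has the lowest weighted Gini impurity.
    best = min(attributes, key=lambda a: split_gini(records, a))
    remaining = [a for a in attributes if a != best]
    groups = {}
    for attrs, label in records:
        groups.setdefault(attrs[best], []).append((attrs, label))
    return (best, {value: build_tree(group, remaining) for value, group in groups.items()})

def classify(tree, attrs):
    """Walk from the root down to a leaf, following the record's attribute values."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[attrs[attribute]]
    return tree

# Invented categorical data: (attributes, class label).
records = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]
tree = build_tree(records, ["outlook", "windy"])
print(classify(tree, {"outlook": "rainy", "windy": "yes"}))
```

The sketch assumes string class labels and raises KeyError for an attribute value that never appeared in the training set.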

Gini impurity measures how likely a randomly chosen instance is to be classified incorrectly if it is labelled at random according to the class distribution of the node
An attribute whose split gives a lower Gini impurity separates the classes better, so it is preferred over attributes with higher Gini impurity
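
For a node whose classes occur with proportions p_1 ... p_k, Gini impurity is 1 - (p_1^2 + ... + p_k^2): 0 for a pure node and 0.5 for an even two-class mix. A small sketch of that calculation, and of the size-weighted impurity used to compare attributes, follows. The function names and labels are illustrative only.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions at a node."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def weighted_gini(partitions):
    """Impurity of a split: each partition's Gini weighted by its share of the records."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * gini_impurity(p) for p in partitions)

# A pure node versus an evenly mixed two-class node.
print(gini_impurity(["yes", "yes", "yes", "yes"]))   # 0.0
print(gini_impurity(["yes", "yes", "no", "no"]))     # 0.5

# Attribute A separates the classes perfectly; attribute B does not separate them at all.
split_a = [["yes", "yes"], ["no", "no"]]
split_b = [["yes", "no"], ["yes", "no"]]
print(weighted_gini(split_a) < weighted_gini(split_b))  # True, so A is preferred
```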

───✱*.｡:｡✱*.:｡✧*.｡✰*.:｡✧*.｡:｡*.｡✱ ───