───✱*.。:。✱*.:。✧*.。✰*.:。✧*.。:。*.。✱ ───
Data and Goal
- Data → a set of records (examples, instances, cases, rows, etc)
- Attributes → the fields (features, columns) that describe each record
- A class → each example is labelled with a pre-defined class
- Goal → to learn a classification model from the data that can be used to predict the classes of new cases/instances
Classification Process
- Model construction involves describing a set of predetermined classes
- Each record is assumed to belong to one predefined class
- The set of records used for model construction is the training set
- The model is represented as classification rules, decision trees, or probabilistic models
- Model usage is for classifying future or unknown objects
- Estimate the accuracy of the model
- The class predicted by the model is compared with the known label of each test set example
- Accuracy rate is the percentage of test set examples that the model classifies correctly
- Test set must be independent of training set, otherwise the accuracy estimate will be overly optimistic and over-fitting will go undetected
- If the accuracy is acceptable, use the model to classify future data whose class labels are not known
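The accuracy estimate described above can be sketched in a few lines. This is only an illustration: the function names `split_records` and `accuracy`, the 30% test fraction, and the fixed seed are assumptions made for the example, not part of the notes.

```python
import random


def split_records(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training set, test set).

    The two sets share no records, so the test set is independent of the
    training set. test_fraction and seed are illustrative defaults.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Fraction of test examples whose predicted class equals the known label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("the test set is empty")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

Multiply the result of `accuracy` by 100 to get the accuracy rate as a percentage.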
Decision Tree Construction Algorithm
- Attributes are categorical
- Start at the root and work down
- At each level of the tree, the attribute to split on is selected based on some statistical measure (typically Gini index)
- You stop picking attributes when
- All samples for the given node belong to the same class
- There are no remaining attributes for further partitioning
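A minimal sketch of the construction steps above, assuming records are dictionaries of categorical attribute values and that the split measure is the weighted Gini impurity of the resulting partitions. The names `build_tree` and `classify`, the nested-dictionary tree layout, and the majority-class leaf used when attributes run out are assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(rows, labels, attribute):
    """Weighted Gini impurity of the partitions produced by splitting on attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    return sum(len(g) / len(labels) * gini(g) for g in groups.values())


def build_tree(rows, labels, attributes):
    """Grow the tree top-down, starting at the root."""
    # Stop: all samples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes remain. Assumption: label the leaf with the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Select the attribute whose split gives the lowest weighted impurity.
    best = min(attributes, key=lambda a: split_gini(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    # Partition the samples by the value of the chosen attribute.
    partitions = {}
    for row, label in zip(rows, labels):
        part_rows, part_labels = partitions.setdefault(row[best], ([], []))
        part_rows.append(row)
        part_labels.append(label)
    return {
        "attribute": best,
        "branches": {
            value: build_tree(part_rows, part_labels, remaining)
            for value, (part_rows, part_labels) in partitions.items()
        },
    }


def classify(tree, row, default=None):
    """Follow branches from the root until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attribute"]], default)
    return tree
```

An attribute value that never appeared in training has no branch, so `classify` falls back to `default` in that case.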
Gini Impurity
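- Gini impurity measures the likelihood that a randomly chosen instance would be classified incorrectly if it were labelled at random according to the class distribution of the node
- An attribute whose split gives a lower Gini impurity produces purer partitions, so it is preferred over attributes with higher Gini impurity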
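For a node whose classes occur with proportions p_1, ..., p_k, the usual formula is Gini = 1 - (p_1^2 + ... + p_k^2). The sketch below computes it for a list of labels, and also the size-weighted impurity of a split, which is one common way to compare attributes. The function names and the dictionary-per-record layout are assumptions for the example.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the class proportions p_i in labels."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: treat an empty node as pure
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(rows, labels, attribute):
    """Impurity after splitting on attribute: each partition's Gini impurity,
    weighted by the share of samples that fall into that partition."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    n = len(labels)
    return sum(len(g) / n * gini_impurity(g) for g in groups.values())
```

A node holding a single class scores 0, and a node split evenly between two classes scores 0.5, the maximum for two classes.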
───✱*.。:。✱*.:。✧*.。✰*.:。✧*.。:。*.。✱ ───