ML Unit 3
A decision tree is a non-parametric supervised learning method used for classification and
regression tasks. It has a hierarchical structure made up of a root node, branches, internal
(decision) nodes, and leaf nodes.
• A decision tree (DT) is a common machine learning structure that splits a dataset into
subsets to improve purity and reduce entropy.
• It is a feature-based decision-making model that provides transparency and is easy to
interpret.
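To make this concrete, here is a minimal sketch using scikit-learn (the Iris dataset and the hyperparameters are illustrative choices, not part of the notes above); setting criterion="entropy" ties each split directly to the entropy reduction just described.

```python
# Minimal sketch: fit a decision tree classifier with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" chooses splits that maximally reduce entropy (improve purity)
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```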
1. Greedy Nature:
o The feature that yields the purest split is selected at the root node.
o Subsequent feature selection at child nodes depends on the parent node’s
feature.
o This greedy, sequential process may not produce a globally optimal tree.
2. Computational Cost:
o The cost of selecting a feature grows with the number of features and the
complexity of the patterns.
o Keeping the test at each node simple reduces the computational demand.
o High dimensionality makes feature selection computationally expensive.
3. Overfitting Risk:
o Deep decision trees (DTs) tend to overfit training data and perform poorly on
validation/test data.
o Overfitting can be managed through pruning, where deeper subtrees are
trimmed based on performance on validation data (see the sketch after this list).
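As referenced above, here is a sketch of pruning guided by validation data, assuming scikit-learn's minimal cost-complexity pruning (the dataset and split are illustrative): larger values of ccp_alpha trim deeper subtrees, and the value that performs best on the validation set is kept.

```python
# Sketch: post-pruning a decision tree via cost-complexity pruning.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning strengths computed from the training data
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_acc = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    acc = tree.score(X_val, y_val)  # evaluate each pruned tree on validation data
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc

print(f"Best ccp_alpha = {best_alpha:.5f}, validation accuracy = {best_acc:.3f}")
```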
• Root Node: The initial node at the beginning of a decision tree, where the entire population
or dataset starts dividing based on various features or conditions.
• Decision Nodes: Nodes that result from splitting the root node are known as decision
nodes. These nodes represent intermediate decisions or conditions within the tree.
• Leaf Nodes: Nodes where further splitting is not possible, often indicating the final
classification or outcome. Leaf nodes are also referred to as terminal nodes.
• Sub-Tree: Just as a subsection of a graph is called a sub-graph, a subsection of a
decision tree is referred to as a sub-tree. It represents a specific portion of the decision tree.
• Pruning: The process of removing or cutting down specific nodes in a tree to prevent
overfitting and simplify the model.
• Branch / Sub-Tree: A subsection of the entire tree is referred to as a branch or sub-tree. It
represents a specific path of decisions and outcomes within the tree.
• Parent and Child Node: In a decision tree, a node that is divided into sub-nodes is known as
a parent node, and the sub-nodes emerging from it are referred to as child nodes. The
parent node represents a decision or condition, while the child nodes represent the
potential outcomes or further decisions based on that condition.
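The terminology above can be seen directly by printing a small fitted tree. In the sketch below (Iris data, depth capped at 2 for readability, both illustrative choices), the first test printed is the root node, the indented tests are decision nodes, and the "class: ..." lines are leaf nodes.

```python
# Sketch: print a small tree so root, decision, and leaf nodes are visible.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(clf, feature_names=list(iris.feature_names)))
```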
Decision tree: (figure)
Popular Ensemble ML Models (combinations of multiple decision trees):
1. Random Forest:
o A collection of decision trees.
o The final prediction is based on the combined results of multiple trees.
2. AdaBoost:
o Uses multiple weak learners to create a strong model.
o A weighted majority voting approach is used to improve overall accuracy.
3. Gradient Boosting:
o A generalization of the boosting idea behind AdaBoost.
o Fits each new model to the errors (residuals) of the previous models.
These models are preferred in large-scale applications due to their ability to handle high-dimensional
data.
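All three ensembles are available in scikit-learn; the sketch below fits each one on the same (illustrative) dataset so their test accuracies can be compared side by side.

```python
# Sketch: compare the three ensembles named above on one dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```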
Splitting Criteria:
A split is chosen to maximize the reduction in impurity. Two common impurity measures are entropy and the Gini index:

$$H(S) = -\sum_{i=1}^{C} p_i \log_2 p_i, \qquad \mathrm{Gini}(S) = 1 - \sum_{i=1}^{C} p_i^2$$

The information gain of splitting a set $S$ on attribute $A$ is

$$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$

where:
• $p_i$ is the proportion of samples in $S$ belonging to class $i$,
• $C$ is the number of classes, and
• $S_v$ is the subset of $S$ on which attribute $A$ takes value $v$.
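These formulas translate directly into a few lines of NumPy; the helper names `entropy` and `information_gain` below are ours (not a library API), and the toy labels are illustrative.

```python
# Sketch: entropy and information gain, computed from class labels.
import numpy as np

def entropy(labels):
    """H(S) = -sum_i p_i log2 p_i over the class proportions of S."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """IG = H(parent) minus the size-weighted entropies of the child subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

# A perfectly pure split recovers the full parent entropy (1 bit here):
parent = np.array([0, 0, 1, 1])
print(information_gain(parent, [np.array([0, 0]), np.array([1, 1])]))  # -> 1.0
```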
Example: Show that for a C-class problem, the maximum entropy is log C.
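A sketch of the argument, using Jensen's inequality for the concave logarithm:

$$H(p) = \sum_{i=1}^{C} p_i \log \frac{1}{p_i} \;\le\; \log\!\left(\sum_{i=1}^{C} p_i \cdot \frac{1}{p_i}\right) = \log C$$

with equality if and only if $1/p_i$ is constant across classes, i.e. $p_i = 1/C$ for all $i$. Hence the entropy of a C-class problem is maximized by the uniform class distribution, and its maximum value is $\log C$ (equivalently, $\log_2 C$ bits when base-2 logarithms are used).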