Decision Tree in a Cheat Sheet

July 05, 2023

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. An alumnus of IIT and ISB with more than 17 years of experience, he has held prominent positions at leading firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience in easing the IT transition journey for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.

A decision tree is drawn as a set of nodes:

Root Node: represented as a rectangle or a square: ▭ or □
Branch / Internal Node: represented as a circle: ○
Leaf / Terminal Node: represented as a triangle or a dot: △ or ●

Information Gain:

Information gain is the decrease in entropy after the dataset is split on an attribute. For a two-class problem (using base-2 logarithms) it has a value between 0 and 1.

Information Gain (IG) = Entropy before the split - weighted average Entropy after the split.
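
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function names and the toy "yes"/"no" labels are hypothetical, and the entropy after the split is assumed to be weighted by the size of each child node.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted_after = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted_after


# Hypothetical toy data: 10 samples split into two child nodes by some attribute.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

gain = information_gain(parent, [left, right])
print(round(gain, 3))
```

Here the parent node has entropy 1 and each child has entropy of about 0.722, so the split gains roughly 0.278 bits. A split that separated the classes perfectly would gain the full 1 bit.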

Entropy:

It is a measure of impurity, also called a measure of uncertainty.

For a two-class problem its value ranges from 0 (pure node) to 1 (classes evenly mixed). With k classes the maximum is log2(k).
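
To make the range concrete, here is a small sketch that assumes the usual Shannon definition, entropy = -sum(p * log2(p)) over the class proportions p. The helper name `entropy_from_probs` is made up for this example.

```python
from math import log2


def entropy_from_probs(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(-p * log2(p) for p in probs if p > 0)


pure = entropy_from_probs([1.0, 0.0])        # only one class present
balanced = entropy_from_probs([0.5, 0.5])    # two equally likely classes
four_way = entropy_from_probs([0.25] * 4)    # four equally likely classes

print(pure, balanced, four_way)
```

This prints `0.0 1.0 2.0`: a pure node has zero entropy, an even two-class mix reaches 1, and an even four-class mix reaches log2(4) = 2.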

Gini Index:

The Gini Index measures impurity: 0 means a node is pure, and higher values mean the classes are more mixed. It is used by the CART algorithm for decision trees. Its value lies between 0 and 1 (at most 0.5 for two classes).
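
As a rough sketch, the Gini Index of a node can be computed as 1 minus the sum of squared class proportions. The function names and toy labels below are hypothetical and only illustrate the calculation.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(children):
    """Size-weighted Gini impurity of the child nodes produced by a split."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * gini_impurity(child) for child in children)


pure_node = gini_impurity(["yes"] * 6)                 # a single class
mixed_node = gini_impurity(["yes"] * 5 + ["no"] * 5)   # evenly mixed two classes
after_split = weighted_gini([
    ["yes"] * 4 + ["no"] * 1,
    ["yes"] * 1 + ["no"] * 4,
])

print(pure_node, mixed_node, round(after_split, 2))
```

CART-style trees pick the split with the lowest weighted Gini. In this toy case the split lowers the impurity from 0.5 to about 0.32.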

Ensemble Techniques:

Stacking: An ensemble learning approach in which a meta-classifier or meta-regressor learns how to combine the predictions of several base classification or regression models.
Voting: Combines the predictions from multiple machine learning models.
Hard Voting: Each model votes for a class, and the class with the most votes is selected as the output.
Soft Voting: The predicted probabilities for each class are averaged across the models, and the class with the highest average probability is selected.
Bagging: Short for Bootstrap Aggregating. Models are trained on bootstrap samples of the data and their predictions are aggregated, which reduces overfitting and often improves accuracy.
Random Forest: An extension of bagging that also considers a random subset of features at each split, which further reduces overfitting.
AdaBoost: Builds a strong classifier by combining many weak classifiers in sequence, giving more weight to the samples that earlier classifiers got wrong.
Gradient Boosting: Adds models in sequence, each one fitted to reduce a chosen loss function. It is often used with categorical and count data, and some implementations handle missing values natively.
XGBoost: An optimized implementation of gradient boosting that can be used for both classification and regression.
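
Hard and soft voting can disagree on the same predictions. The sketch below uses made-up probabilities from three hypothetical classifiers to show how. It is plain Python for illustration and is not how scikit-learn's VotingClassifier is implemented.

```python
from collections import Counter

# Hypothetical class probabilities from three classifiers for one sample.
probabilities = [
    {"A": 0.90, "B": 0.10},
    {"A": 0.40, "B": 0.60},
    {"A": 0.45, "B": 0.55},
]


def hard_vote(probabilities):
    """Each model votes for its most probable class; the majority class wins."""
    votes = [max(p, key=p.get) for p in probabilities]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average the per-class probabilities and pick the highest average."""
    classes = probabilities[0].keys()
    averages = {
        c: sum(p[c] for p in probabilities) / len(probabilities) for c in classes
    }
    return max(averages, key=averages.get)


hard_result = hard_vote(probabilities)
soft_result = soft_vote(probabilities)
print(hard_result, soft_result)
```

Two of the three models lean towards B, so hard voting returns B. The first model is very confident in A, which pulls the average probability of A to about 0.58, so soft voting returns A.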

Libraries to install in Python for Decision Tree and Ensemble

from sklearn.preprocessing import LabelEncoder - Encodes target labels as integers from 0 to n_classes - 1 (for one-hot encoding of features, use OneHotEncoder)
from sklearn.preprocessing import scale - Standardizes features to zero mean and unit variance
from sklearn.model_selection import train_test_split - Splits the data into train and test sets
from sklearn.tree import DecisionTreeClassifier as DT - Decision tree classifier for binary and multiclass classification
from sklearn import tree - Tree module, used to build and draw trees (for example, tree.plot_tree)
from sklearn.metrics import accuracy_score - Computes classification accuracy (subset accuracy in multilabel classification)
from sklearn.metrics import confusion_matrix - Computes the confusion matrix to evaluate the quality of a classifier's output
from sklearn.ensemble import VotingClassifier - Combines several classifiers by majority (hard) or averaged-probability (soft) voting
from sklearn.ensemble import BaggingClassifier - Fits a base classifier on random subsets of the original dataset and aggregates the individual predictions
from sklearn.ensemble import RandomForestClassifier - Random forest for classification (RandomForestRegressor is the regression counterpart)
from sklearn.ensemble import AdaBoostClassifier - Combines multiple weak classifiers in sequence to increase accuracy
from sklearn.ensemble import GradientBoostingClassifier - Builds trees in sequence to minimize a loss function
import xgboost as xgb - XGBoost, an optimized gradient boosting implementation built for speed and performance
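
As a rough sketch of how several of these imports fit together, the example below trains a single decision tree. It assumes scikit-learn is installed and uses the bundled iris dataset as a stand-in for your own data. The split ratio, `max_depth=3` and `random_state=42` are arbitrary illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier as DT

# Assumption: the bundled iris dataset stands in for your own data.
iris = load_iris()
X = iris.data
species = iris.target_names[iris.target]  # string labels such as "setosa"

# LabelEncoder maps the string labels to integers 0..n_classes-1.
encoder = LabelEncoder()
y = encoder.fit_transform(species)

# Hold out 30% of the rows for testing, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# A shallow tree that splits on entropy (information gain).
model = DT(criterion="entropy", max_depth=3, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
matrix = confusion_matrix(y_test, predictions)

print(f"Test accuracy: {accuracy:.3f}")
print(matrix)
```

Calling `encoder.inverse_transform(predictions)` turns the integer predictions back into the original species names.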

Libraries to install in R for Decision Tree and Ensemble

library(caTools) - Basic utility functions (for example, sample.split for train/test splits)
library(C50) - C5.0 decision tree classification models
library(rpart) - Recursive Partitioning and Regression Trees
library(gmodels) - Tools for model fitting (for example, CrossTable for cross-tabulations)
library(caret) - Classification and Regression Training
library(randomForest) - Random forest algorithm for classification and regression
library(adabag) - AdaBoost and bagging for classification
library(gbm) - Gradient boosting machines (generalized boosted regression models)
library(xgboost) - An extension of gradient boosting that supports both classification and regression models

Hyperparameters in Decision Tree and Ensemble Models
The first seven rows apply to the decision tree classifier, base_estimator through oob_score to bagging and random forest, learning_rate to AdaBoost, and the remaining rows to XGBoost.
Hyperparameter	Input Values	Default Value
max_depth	Integer or None, optional	None
min_samples_split	Integer or float, optional	2
min_samples_leaf	Integer or float, optional	1
min_weight_fraction_leaf	Float, optional	0
max_features	Integer, float, string or None, optional	None
random_state	Integer, RandomState instance or None, optional	None
min_impurity_decrease	Float, optional	0
base_estimator	Estimator object	Decision Tree
n_estimators	Integer	10
random_state	Integer, RandomState instance or None	None
n_jobs	Integer or None	None
criterion	String	gini
min_samples_leaf	Integer or float	1
oob_score	Boolean	False
learning_rate	Float	1
colsample_bylevel	Integer or float	1
colsample_bytree	Integer or float	1
subsample	Integer or float	1
eta	Integer or float	0.3
min_child_weight	Integer or float	1
gamma	Integer or float	0
alpha	Integer or float	0
lambda	Integer or float	1
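
The sketch below shows how a few of these scikit-learn hyperparameters might be passed in practice. It assumes scikit-learn is installed, uses a synthetic dataset from make_classification, and the specific values are illustrative rather than recommended. XGBoost is left out to keep the example limited to one library.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: a small synthetic dataset stands in for real data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "tree": DecisionTreeClassifier(
        max_depth=4, min_samples_split=10, min_samples_leaf=5, random_state=0
    ),
    # The base tree is passed positionally because its keyword name
    # (base_estimator, later estimator) differs across scikit-learn versions.
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(max_depth=4), n_estimators=10, random_state=0
    ),
    "forest": RandomForestClassifier(
        n_estimators=50, criterion="gini", min_samples_leaf=1,
        oob_score=True, n_jobs=1, random_state=0,
    ),
    "adaboost": AdaBoostClassifier(
        n_estimators=50, learning_rate=1.0, random_state=0
    ),
}

# Fit each model and record its accuracy on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

With `oob_score=True`, the fitted forest also exposes `oob_score_`, an accuracy estimate computed from the samples each tree did not see during training.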