Data Mining Lab Manual
INDEX
S.No. Name of the Experiment
1 Fundamentals of Data Mining
2 Introduction to WEKA
3 Attribute Relation File Format (ARFF)
4 Comma Separated Value (CSV)
5 Credit Risk Assessment
LAB CYCLE TASKS
6 Task 1
7 Task 2
8 Task 3
9 Task 4
10 Task 5
11 Task 6
12 Task 7
13 Task 8
14 Task 9
15 Task 10
16 Task 11
17 Task 12
18 Generate Association rules for the given transactional database using Apriori algorithm.
19 Generate classification rules for the given database using decision tree (J48).
Fundamentals of Data Mining
Data mining refers to extracting or mining knowledge from large amounts of data. Data mining can also be referred to as knowledge mining from data, knowledge extraction, data archaeology, or data dredging.
Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. Data mining tasks can be classified into two categories:
Descriptive
Predictive
Frequent pattern mining: frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear frequently in a data set.
Classification: constructs models that describe and distinguish classes or concepts, for use in future prediction.
Prediction: predicts unknown or missing numerical values.
Cluster analysis: groups data objects so that intra-class similarity is maximized and inter-class similarity is minimized.
Outlier analysis:
Outlier: a data object that does not comply with the general behavior of the data. Outliers are often treated as noise or exceptions, but they are quite useful in fraud detection and rare-event analysis.
Introduction to WEKA
WEKA is a collection of open-source implementations of many data mining and machine learning algorithms, including:
data pre-processing
classification
clustering
association rule extraction
It was created by researchers at the University of Waikato in New Zealand and is written in Java (itself open source).
The WEKA GUI Chooser offers four interfaces:
Explorer
Experimenter
Knowledge Flow
Simple Command-line Interface (CLI)
The Explorer provides panels for:
preprocessing filters
attribute selection
classification/regression
clustering
association discovery
visualization
Attribute Relation File Format (ARFF) is WEKA's native data format; WEKA also reads other formats, such as CSV and binary serialized instances. An ARFF file consists of a header section (@RELATION and @ATTRIBUTE declarations) followed by a data section (@DATA, with one comma-separated record per line).
Example:
@RELATION STUDENT
@ATTRIBUTE SNO NUMERIC
@ATTRIBUTE NAME STRING
@ATTRIBUTE AGE NUMERIC
@ATTRIBUTE CITY {HYD,DELHI,MUMBAI}
@ATTRIBUTE BRANCH {CSE,IT,ECE,EEE}
@ATTRIBUTE MARKS NUMERIC
@ATTRIBUTE CLASS {PASS,FAIL}
@DATA
1,DEEPIKA,22,HYD,CSE,76,PASS
2,RADHIKA,23,DELHI,IT,34,FAIL
3,PRADEEP,21,MUMBAI,EEE,45,PASS
4,KRISHNA,22,HYD,ECE,23,FAIL
5,RISHI,21,DELHI,IT,88,PASS
6,SHARAN,21,MUMBAI,EEE,92,PASS
7,SHREYANSH,22,HYD,CSE,26,FAIL
8,SUGUNA,23,MUMBAI,ECE,65,PASS
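The same file can be loaded programmatically through WEKA's Java API. The following is a minimal sketch, assuming weka.jar is on the classpath and the example above is saved as STUDENT.arff (a hypothetical file name):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file (the path is an assumption for this sketch)
        Instances data = new DataSource("STUDENT.arff").getDataSet();
        // Treat the last attribute (CLASS) as the class attribute
        data.setClassIndex(data.numAttributes() - 1);
        // Print per-attribute statistics (type, missing and distinct values)
        System.out.println(data.toSummaryString());
    }
}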
Comma Separated Value (CSV) format stores the same data as plain text: a header row of attribute names followed by one comma-separated record per line. When WEKA loads a CSV file, it infers an attribute from each column.
Example:
SNO,NAME,AGE,CITY,BRANCH,MARKS,CLASS
1,DEEPIKA,22,HYD,CSE,76,PASS
2,RADHIKA,23,DELHI,IT,34,FAIL
3,PRADEEP,21,MUMBAI,EEE,45,PASS
4,KRISHNA,22,HYD,ECE,23,FAIL
5,RISHI,21,DELHI,IT,88,PASS
6,SHARAN,21,MUMBAI,EEE,92,PASS
7,SHREYANSH,22,HYD,CSE,26,FAIL
8,SUGUNA,23,MUMBAI,ECE,65,PASS
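CSV files can likewise be loaded through the API using WEKA's CSVLoader, which infers attribute types from the column values. A minimal sketch, assuming the example above is saved as STUDENT.csv (hypothetical file name):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class LoadCsv {
    public static void main(String[] args) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("STUDENT.csv")); // hypothetical path
        Instances data = loader.getDataSet();      // header row becomes attribute names
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println(data.numInstances() + " instances, "
                + data.numAttributes() + " attributes loaded");
    }
}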
Credit Risk Assessment
Description: The business of banks is making loans. Assessing the creditworthiness of an applicant is of crucial importance. You have to develop a system to help a loan officer decide whether the credit of a customer is good or bad. A bank's business rules regarding loans must consider two opposing factors. On the one hand, a bank wants to make as many loans as possible; interest on these loans is the bank's profit source. On the other hand, a bank cannot afford to make too many bad loans; too many bad loans could lead to the collapse of the bank. The bank's loan policy must therefore involve a compromise: not too strict, and not too lenient.
To do the assignment, you first and foremost need some knowledge about the world of credit. You can
acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview her and try to represent
her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable textbook on finance.
Translate this knowledge from text form to production rule form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable rules which can be used
to judge the credit worthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers correctly judged when,
and when not to, approve a loan application.
The German Credit Data: Actual historical credit data is not always easy to come by because of confidentiality rules. Here is one such dataset, consisting of 1000 actual cases collected in Germany, available as the original credit dataset and as an Excel spreadsheet version of the German credit data (download from the web). In spite of the fact that the data is German, you should probably make use of it for this assignment. (Unless you really can consult a real loan officer!)
A few notes on the German dataset
DM stands for Deutsche Mark, the unit of currency, worth about 90 cents Canadian (but looks and acts like a quarter).
Owns_telephone: German phone rates are much higher than in Canada, so fewer people own telephones.
Foreign_worker: there are millions of these in Germany (many from Turkey). It is very hard to get German citizenship if you were not born of German parents.
There are 20 attributes used in judging a loan applicant. The goal is to classify the applicant into one of two categories, good or bad.
Procedure
Download the German dataset from the internet (save the data in ARFF format).
The description of the data is as follows:
Description of the German credit dataset.
Attribute 1: (qualitative)
Status of existing checking account
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM /
salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical)
Duration in month
Attribute 3: (qualitative)
Credit history
A30 : no credits taken/
all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/
other credits existing (not at this bank)
Attribute 4: (qualitative)
Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others
Attribute 5: (numerical)
Credit amount
Attribute 6: (qualitative)
Savings account/bonds
A61 : ... < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : .. >= 1000 DM
A65 : unknown/ no savings account
Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : ... < 1 year
A73 : 1 <= ... < 4 years
A74 : 4 <= ... < 7 years
A75 : .. >= 7 years
Attribute 8: (numerical)
Installment rate in percentage of disposable income
Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single
Attribute 17: (qualitative)
Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management/ self-employed/
highly qualified employee/ officer
1. List all the categorical (or nominal) attributes and the real-valued attributes separately.
SOLUTION:
Count the qualitative (categorical) and numerical attributes separately.
The solution is:
Number of attributes in the German dataset: 21 (7 numerical, 14 categorical)
Attributes:
1. checking_status
2. duration
3. credit_history
4. purpose
5. credit_amount
6. savings_status
7. employment
8. installment_commitment
9. personal_status
10. other_parties
11. residence_since
12. property_magnitude
13. age
14. other_payment_plans
15. housing
16. existing_credits
17. job
18. num_dependents
19. own_telephone
20. foreign_worker
21. class
The categorical (nominal) attributes are:
1. checking_status
2. credit_history
3. purpose
4. savings_status
5. employment
6. personal_status
7. other_parties
8. property_magnitude
9. other_payment_plans
10. housing
11. job
12. own_telephone
13. foreign_worker
14. class
The real-valued (numerical) attributes are:
1. duration
2. credit_amount
3. installment_commitment
4. residence_since
5. age
6. existing_credits
7. num_dependents
2. What attributes do you think might be crucial in making the credit assessment? Come up with
some simple rules in plain English using your selected attributes.
SOLUTION:
The following attributes may be crucial in making the credit risk assessment:
1. credit_history
2. employment
3. property_magnitude
4. job
5. duration
6. credit_amount
7. installment_commitment
8. existing_credits
Based on these attributes, we can decide whether or not to grant credit. For example, simple rules might be: IF credit_history is good AND employment >= 4 years THEN grant credit; IF duration is long AND credit_amount is high THEN refuse credit.
3. One type of model that you can create is a Decision Tree - train a Decision Tree using the
complete dataset as the training data. Report the model obtained after training.
Procedure:
Step-1: Save credit-g2.xls as a .csv file and put it in one location.
Step-2: Open the WEKA tool and click on Explorer.
Step-3: Click on the Open file tab and load the file from the desired location (the German data of type .csv which was saved before).
Step-4: Click on the Classify tab in the top header, choose the classifier via Choose -> trees -> J48, select Use training set as the test option, and click Start.
Step-5: Right-click the entry in the result list and choose Visualize tree to see the tree.
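The same training run can be reproduced with WEKA's Java API. A minimal sketch, assuming weka.jar is on the classpath and the German data is saved as credit-g.arff (hypothetical path); J48's defaults correspond to the scheme -C 0.25 -M 2:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet(); // hypothetical path
        data.setClassIndex(data.numAttributes() - 1); // 'class' is the last attribute
        J48 tree = new J48();        // default options: -C 0.25 -M 2
        tree.buildClassifier(data); // train on the complete dataset
        System.out.println(tree);   // prints the pruned tree in text form
    }
}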
J48 pruned tree (excerpt):
------------------
checking_status = <0
| foreign_worker = yes
| | duration <= 11
| | | existing_credits <= 1
| | duration > 11
| | | | purpose = furniture/equipment
| purpose = radio/tv
| purpose = vacation: bad (0.0)
job = skilled
| other_parties = none
| | duration <= 30
| | | savings_status = <100
| | | | | own_telephone = none
| | | | | | existing_credits <= 1
| | | savings_status = 100<=X<500
| | | | existing_credits <= 1
checking_status = 0<=X<200
| | savings_status = <100
| | | other_parties = none
| | | | duration <= 42
| | | | | | purpose = furniture/equipment
| | | | | | | duration > 10
| | | | | | purpose = business
| | | other_parties = co applicant: good (2.0)
| | | other_parties = guarantor
| | savings_status = 100<=X<500
| | | purpose = business
| | | | housing = rent
Outputs
4. Suppose you use your above model trained on the complete dataset, and classify credit
good/bad for each of the examples in the dataset. What % of examples can you classify correctly?
(This is also called testing on the training set) Why do you think you cannot get 100 % training
accuracy?
SOLUTION:
Standard test procedures, written as (training set size; test set size):
Resubstitution (N; N)
Leave-one-out (N-1; 1)
REMEMBER: we must know the classification (class attribute values) of all instances (records) used in the test procedure.
Basic Concepts
Error rate: the percentage of errors made over the whole set of instances (records) used for testing.
Predictive accuracy: the percentage of correctly classified data in the testing data set.
In the same way, testing on the training set does not yield 100% accuracy for our data.
In the above model we trained on the complete dataset, and we classified credit as good/bad for each example. For example:
IF purpose = vacation THEN credit = bad
ELSE IF purpose = business THEN credit = good
We classified 85.5% of the examples correctly, and the remaining 14.5% of the examples were classified incorrectly. We cannot get 100% training accuracy because, out of the 20 attributes, some unnecessary attributes are also analyzed and trained on, and the pruned tree cannot perfectly separate noisy or conflicting records. Due to this the accuracy is affected, and hence we cannot get 100% training accuracy.
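Training-set accuracy can be measured programmatically by evaluating the model on the same data it was built from. A minimal sketch (resubstitution), assuming the hypothetical credit-g.arff path used above:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainingSetAccuracy {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        J48 tree = new J48();
        tree.buildClassifier(data);
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data); // resubstitution: test on the training data
        System.out.printf("Training-set accuracy: %.1f%%%n", eval.pctCorrect());
    }
}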
5. Is testing on the training set as you did above a good idea? Why or Why not?
SOLUTION:
No. According to the usual rule of thumb, for a reliable accuracy estimate we should take about 2/3 of the dataset as the training set and the remaining 1/3 as the test set. But in the above model we have taken the complete dataset as both training set and test set.
Testing on the training set also rewards the analysis and training of the unnecessary attributes, which do not play a crucial role in credit risk assessment. By this, complexity increases, and finally it leads to an overly optimistic accuracy estimate.
If some part of the dataset is used as a training set and the remainder as a test set, the results are more realistic and the time for computation will be less.
A better procedure is 10-fold cross-validation:
First step: split the data into 10 subsets of approximately equal size.
Second step: use each subset in turn for testing, and the remainder for training (repeated cross-validation).
The error (predictive accuracy) estimates are averaged to yield an overall error (predictive accuracy) estimate.
Why 10? Extensive experiments have shown that this is the best choice for obtaining an accurate estimate; there is also some theoretical evidence for it.
6. One approach for solving the problem encountered in the previous question is using cross-
validation? Describe what cross-validation is briefly. Train a Decision Tree again using cross-
validation and report your results. Does your accuracy increase/decrease? Why?
Cross-validation:
In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or folds D1, D2, D3, ..., Dk, each of approximately equal size. Training and testing are performed k times. In iteration i, partition Di is reserved as the test set and the remaining partitions are collectively used to train the model. That is, in the first iteration, subsets D2, D3, ..., Dk collectively serve as the training set in order to obtain the first model, which is tested on D1; the second iteration is trained on D1, D3, ..., Dk and tested on D2; and so on.
Procedure: Click on the Classify tab in the top header, choose the classifier via Choose -> trees -> J48, select Cross-validation (10 folds) as the test option, and click Start. The resulting pruned tree (excerpt):
| foreign_worker = yes
| | duration <= 11
| | | existing_credits <= 1
| | | existing_credits > 1: good (14.0)
| | duration > 11
| | | | purpose = furniture/equipment
| | | | purpose = radio/tv
| | | job = skilled
| | | | other_parties = none
| | | | | duration <= 30
| | | | | | savings_status = <100
| | | | | | | | | existing_credits <= 1
| | | | | | | | | | property_magnitude = car
| | | | | | savings_status = 100<=X<500
| | | | | | | credit_history = delayed previously: good (0.0)
| | | | | | | existing_credits <= 1
checking_status = 0<=X<200
| | savings_status = <100
| | | other_parties = none
| | | | duration <= 42
| | | | | | purpose = furniture/equipment
| | | | | | | duration > 10
| | | | | | purpose = business
| | | other_parties = guarantor
| | | | purpose = radio/tv: good (18.0/1.0)
| | | purpose = business
| | | | housing = rent
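In the API, the Evaluation class performs the fold splitting, training, and averaging in one call. A minimal sketch, again assuming the hypothetical credit-g.arff path:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidateJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        Evaluation eval = new Evaluation(data);
        // 10-fold cross-validation with a fixed random seed for reproducibility
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString()); // confusion matrix
    }
}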
7. Check to see if the data shows a bias against "foreign workers" (attribute 20),or "personal-
status"(attribute 9). One way to do this (perhaps rather simple minded) is to remove these
attributes from the dataset and see if the decision tree created in those cases is significantly
different from the full dataset case which you have already done. To remove an attribute you can
use the reprocess tab in Weka's GUI Explorer. Did removing these attributes have any significant
effect? Discuss.
Solution:
Step-1: Open the German credit data file in the Preprocess tab of WEKA's Explorer.
Step-2: Select attributes 9 (personal_status) and 20 (foreign_worker) and click Remove.
Step-3: Then click on the Classify tab, take the test option as cross-validation, and then click on Start.
Step-4: Right-click the entry in the result list to visualize the results.
Removing these attributes slightly increases accuracy, because the two attributes are not very important in training and analyzing; by removing them, the training time is also reduced to some extent, and this results in an increase in accuracy. The decision tree created from the full dataset is very large compared to the decision tree we have trained now. This is the main difference between the two decision trees.
Visualize results:
Visualize classifier errors:
Visualize tree
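Attribute removal can also be done programmatically with the unsupervised Remove filter before training. A minimal sketch, assuming the hypothetical credit-g.arff path (attribute indices are 1-based):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RemoveAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();
        Remove remove = new Remove();
        remove.setAttributeIndices("9,20"); // personal_status and foreign_worker
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);
        reduced.setClassIndex(reduced.numAttributes() - 1);
        Evaluation eval = new Evaluation(reduced);
        eval.crossValidateModel(new J48(), reduced, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}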
8. Another question might be, do you really need to input so many attributes to get good results?
Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17
(and 21, the class attribute (naturally)). Try out some combinations. (You had removed two
attributes in problem 7. Remember to reload the arff data file to get all the attributes initially
before you start selecting the ones you want.)
Output using only attributes 2, 3, 5, 7, 10, 17 and the class attribute 21 (tree excerpt):
| employment = unemployed
| employment = <1
| employment = >=7
| employment = unemployed
| employment = <1
| employment = 4<=X<7
| employment = >=7
Number of Leaves : 27
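To keep only a chosen subset of attributes, the same Remove filter can be used with its selection inverted. A minimal sketch under the same assumptions as before:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KeepSubset {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();
        Remove keep = new Remove();
        // 2=duration, 3=credit_history, 5=credit_amount, 7=employment,
        // 10=other_parties, 17=job, 21=class
        keep.setAttributeIndices("2,3,5,7,10,17,21");
        keep.setInvertSelection(true); // retain the listed indices, delete the rest
        keep.setInputFormat(data);
        Instances subset = Filter.useFilter(data, keep);
        subset.setClassIndex(subset.numAttributes() - 1);
        System.out.println(subset.toSummaryString());
    }
}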
9. Sometimes, the cost of rejecting an applicant who actually has a good credit (case 1) might be
higher than accepting an applicant who has bad credit (case 2). Instead of counting the
misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and lower
cost to the second case. You can do this by using a cost matrix in Weka. Train your Decision Tree
again and report the Decision Tree and cross-validation results. Are they significantly different
from results obtained in problem 6 (using equal cost)?
In Problem 6 we used equal costs when we trained the decision tree. But here we consider unequal costs: misclassifying a good applicant as bad is penalized differently from misclassifying a bad applicant as good.
Case 1:
Procedure:
Step1: Click on weka from desktop.
Step2: Choose explorer from weka GUI chooser.
Step3: Click on open file and open credit_g2.csv data file.
Step4: Click on classify and click on choose button to choose CostSensitiveClassifier
from meta.
Step5: Right click on text box and select show properties.
Step6: Click on classifier button to choose j48 from trees.
Step7: click on costMatrix and set classes to 2 and click on Resize button.
Step8: Change FP value to 5.0 and press enter and click on ok.
Step9: Click on start to see classifier output.
Step10: Right click on result to see the visualization tree.
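The same setup can be scripted with the CostSensitiveClassifier from the meta package. A minimal sketch, assuming the hypothetical credit-g.arff path; note that CostMatrix method names vary slightly across WEKA versions (setCell is the 3.7+/3.8 name), and the matrix below matches the one printed in the output (rows = actual class, columns = predicted class, class order good, bad):

import java.util.Random;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSensitiveJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        CostMatrix costs = new CostMatrix(2); // two classes: good, bad
        costs.setCell(0, 1, 1.0); // actual good predicted bad costs 1
        costs.setCell(1, 0, 5.0); // actual bad predicted good costs 5
        CostSensitiveClassifier csc = new CostSensitiveClassifier();
        csc.setClassifier(new J48());
        csc.setCostMatrix(costs);
        Evaluation eval = new Evaluation(data, costs);
        eval.crossValidateModel(csc, data, 10, new Random(1));
        System.out.println(eval.toMatrixString());
        System.out.println("Total cost: " + eval.totalCost());
    }
}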
Classifier output (the attribute list is truncated; the last attributes are shown):
property_magnitude
age
other_payment_plans
housing
existing_credits
job
num_dependents
own_telephone
foreign_worker
class
Test mode: evaluate on training data
weka.classifiers.trees.J48 -C 0.25 -M 2
Classifier Model
J48 pruned tree
------------------
checking_status = <0
| duration <= 11
| | existing_credits <= 1: bad (21.82/5.91)
| | existing_credits > 1: good (8.64)
| duration > 11: bad (339.55/48.64)
checking_status = 0<=X<200
| other_parties = none
| | savings_status = <100: bad (170.91/30.0)
| | savings_status = 100<=X<500: bad (55.45/10.0)
| | savings_status = 500<=X<1000
| | | installment_commitment <= 2: good (2.27)
| | | installment_commitment > 2: bad (8.18/1.36)
| | savings_status = >=1000
| | | own_telephone = none: bad (9.09/2.27)
| | | own_telephone = yes: good (2.27)
| | savings_status = no known savings
| | | existing_credits <= 1
| | | | credit_history = no credits/all paid: good (0.45)
| | | | credit_history = all paid: good (0.45)
| | | | credit_history = existing paid
| | | | | property_magnitude = real estate: bad (6.36/1.82)
| | | | | property_magnitude = life insurance: bad (8.18/1.36)
| | | | | property_magnitude = car: good (3.64)
| | | | | property_magnitude = no known property: bad (5.91/1.36)
| | | | credit_history = delayed previously: good (2.27)
| | | | credit_history = critical/other existing credit: bad (0.0)
| | | existing_credits > 1: good (4.55)
| other_parties = co applicant: bad (15.0/1.36)
| other_parties = guarantor
| | duration <= 16: good (7.27)
| | duration > 16: bad (10.91/1.82)
checking_status = >=200
| property_magnitude = real estate
| | duration <= 7: good (2.27)
| | duration > 7: bad (21.82/3.64)
| property_magnitude = life insurance: good (5.45)
| property_magnitude = car
| | employment = unemployed: good (0.0)
| | employment = <1: bad (5.0/0.45)
| | employment = 1<=X<4
| | | installment_commitment <= 3: good (4.09)
| | | installment_commitment > 3: bad (2.73/0.45)
| | employment = 4<=X<7: good (0.91)
| | employment = >=7: good (2.27)
| property_magnitude = no known property
| | credit_amount <= 1323: bad (7.27/0.45)
| | credit_amount > 1323: good (2.27)
checking_status = no checking
| other_payment_plans = bank: bad (48.18/14.09)
| other_payment_plans = stores
| | savings_status = <100: bad (11.82/2.73)
| | savings_status = 100<=X<500: good (0.45)
| | savings_status = 500<=X<1000: good (0.45)
| | savings_status = >=1000: good (0.45)
| | savings_status = no known savings: good (2.27)
| other_payment_plans = none
| | credit_history = no credits/all paid: good (1.82)
| | credit_history = all paid: good (0.45)
| | credit_history = existing paid
| | | existing_credits <= 1
| | | | purpose = new car
| | | | | age <= 27: bad (8.64/1.82)
| | | | | age > 27: good (13.18)
| | | | purpose = used car: good (10.45)
| | | | purpose = furniture/equipment
| | | | | personal_status = male div/sep: good (0.45)
| | | | | personal_status = female div/dep/mar
| | | | | | age <= 27: good (2.73)
| | | | | | age > 27: bad (7.27/0.45)
| | | | | personal_status = male single: good (3.18)
| | | | | personal_status = male mar/wid: bad (2.73/0.45)
| | | | | personal_status = female single: bad (0.0)
| | | | purpose = radio/tv
| | | | | age <= 23: bad (5.91/1.36)
| | | | | age > 23: good (18.18)
| | | | purpose = domestic appliance: good (1.36)
| | | | purpose = repairs: bad (2.73/0.45)
| | | | purpose = education: bad (4.09/1.82)
| | | | purpose = vacation: good (0.0)
| | | | purpose = retraining: good (0.91)
| | | | purpose = business: good (2.73)
| | | | purpose = other: good (0.0)
| | | existing_credits > 1
| | | | installment_commitment <= 2: bad (11.82/0.45)
| | | | installment_commitment > 2
| | | | | credit_amount <= 9157: good (4.55)
| | | | | credit_amount > 9157: bad (2.27)
| | credit_history = delayed previously
| | | installment_commitment <= 3
| | | | residence_since <= 1: bad (2.27)
| | | | residence_since > 1: good (7.27)
| | | installment_commitment > 3: bad (17.73/4.09)
| | credit_history = critical/other existing credit: good (66.36/6.82)
Number of Leaves : 65
Cost Matrix
 0 1
 5 0
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.564 0.010 0.992 0.564 0.719 0.519 0.820 0.918 good
0.990 0.436 0.493 0.990 0.659 0.519 0.820 0.565 bad
Weighted Avg. 0.692 0.138 0.843 0.692 0.701 0.519 0.820 0.812
a b <-- classified as
395 305 | a = good
3 297 | b = bad
Repeat the above procedure for case 2 by setting the cost to 2, and compare the output.
10. Do you think it is a good idea to prefer simple decision trees instead of having long complex
decision trees? How does the complexity of a Decision Tree relate to the bias of the model?
Long, complex decision trees fit the training data very closely: they have low bias but high variance, so they tend to overfit, and their accuracy on unseen data suffers. A simple decision tree uses fewer attributes; its bias is somewhat higher, but its variance is lower, so it usually generalizes better and the results are more reliable.
So it is a good idea to prefer simple decision trees instead of long, complex trees.
11. You can make your Decision Trees simpler by pruning the nodes. One approach is to use
Reduced Error Pruning - Explain this idea briefly. Try reduced
error pruning for training your Decision Trees using cross-validation (you can do
this in Weka) and report the Decision Tree you obtain ? Also, report your accuracy using the
pruned model. Does your accuracy increase ?
Reduced-error pruning:
The idea of using a separate pruning set for pruning (which is applicable to decision trees as well as rule sets) is called reduced-error pruning. The variant described previously prunes a rule immediately after it has been grown and is called incremental reduced-error pruning. Another possibility is to build a full, unpruned rule set first and prune it afterwards by discarding individual tests.
In reduced-error pruning, each node in the decision tree is considered for pruning. Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node. Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set.
However, this method is much slower. Of course, there are many different ways to assess the worth of a rule based on the pruning set. A simple measure is to consider how well the rule would do at discriminating the predicted class from other classes if it were the only rule in the theory, operating under the closed-world assumption. Suppose it gets p instances right out of the t instances that it covers, and there are P instances of this class out of a total of T instances altogether. The instances that it does not cover include N - n negative ones, where n = t - p is the number of negative instances that the rule covers and N = T - P is the total number of negative instances. Thus the rule has an overall success ratio of [p + (N - n)] / T, and this quantity, evaluated on the test set, has been used to evaluate the success of a rule when using reduced-error pruning.
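In the API, reduced-error pruning corresponds to J48's -R option, with -N controlling how many folds are set aside for pruning. A minimal sketch, under the same credit-g.arff assumption, matching the scheme -R -N 3 -Q 1 -M 2 shown below:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ReducedErrorPruning {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        J48 tree = new J48();
        tree.setReducedErrorPruning(true); // -R
        tree.setNumFolds(3);               // -N 3: one fold is held out for pruning
        tree.setSeed(1);                   // -Q 1: seed for shuffling the data
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}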
Procedure: In the Classify tab, choose trees -> J48, set reducedErrorPruning to true in the classifier options, and click Start. The resulting scheme and model are:
Scheme: weka.classifiers.trees.J48 -R -N 3 -Q 1 -M 2
Relation: german_credit
Instances: 1000
Attributes: 21
checking_status
duration
credit_history
purpose
credit_amount
savings_status
employment
installment_commitment
personal_status
other_parties
residence_since
property_magnitude
age
other_payment_plans
housing
existing_credits
job
num_dependents
own_telephone
foreign_worker
class
Test mode: evaluate on training data
checking_status = <0
| foreign_worker = yes
| | credit_history = no credits/all paid: bad (11.0/3.0)
| | credit_history = all paid: bad (9.0/1.0)
| | credit_history = existing paid
| | | other_parties = none
| | | | savings_status = <100
| | | | | existing_credits <= 1
| | | | | | purpose = new car: bad (17.0/4.0)
| | | | | | purpose = used car: good (3.0/1.0)
| | | | | | purpose = furniture/equipment: good (22.0/11.0)
| | | | | | purpose = radio/tv: good (18.0/8.0)
| | | | | | purpose = domestic appliance: bad (2.0)
| | | | | | purpose = repairs: bad (1.0)
| | | | | | purpose = education: bad (5.0/1.0)
| | | | | | purpose = vacation: bad (0.0)
| | | | | | purpose = retraining: bad (0.0)
| | | | | | purpose = business: good (3.0/1.0)
| | | | | | purpose = other: bad (0.0)
| | | | | existing_credits > 1: bad (5.0)
| | | | savings_status = 100<=X<500: bad (8.0/3.0)
| | | | savings_status = 500<=X<1000: good (1.0)
| | | | savings_status = >=1000: good (2.0)
| | | | savings_status = no known savings
| | | | | job = unemp/unskilled non res: bad (0.0)
| | | | | job = unskilled resident: good (2.0)
| | | | | job = skilled
| | | | | | own_telephone = none: bad (4.0)
| | | | | | own_telephone = yes: good (3.0/1.0)
| | | | | job = high qualif/self emp/mgmt: bad (3.0/1.0)
| | | other_parties = co applicant: good (4.0/2.0)
| | | other_parties = guarantor: good (8.0/1.0)
| | credit_history = delayed previously: bad (7.0/2.0)
| | credit_history = critical/other existing credit: good (38.0/10.0)
| foreign_worker = no: good (12.0/2.0)
checking_status = 0<=X<200
| other_parties = none
| | credit_history = no credits/all paid
| | | other_payment_plans = bank: good (2.0/1.0)
| | | other_payment_plans = stores: bad (0.0)
| | | other_payment_plans = none: bad (7.0)
| | credit_history = all paid: bad (10.0/4.0)
| | credit_history = existing paid
| | | credit_amount <= 8858: good (70.0/21.0)
| | | credit_amount > 8858: bad (8.0)
| | credit_history = delayed previously: good (25.0/6.0)
| | credit_history = critical/other existing credit: good (26.0/7.0)
| other_parties = co applicant: bad (7.0/1.0)
| other_parties = guarantor: good (18.0/4.0)
checking_status = >=200: good (44.0/9.0)
checking_status = no checking
| other_payment_plans = bank: good (30.0/10.0)
| other_payment_plans = stores: good (12.0/2.0)
| other_payment_plans = none
| | credit_history = no credits/all paid: good (4.0)
| | credit_history = all paid: good (1.0)
| | credit_history = existing paid
| | | existing_credits <= 1: good (92.0/7.0)
| | | existing_credits > 1
| | | | installment_commitment <= 2: bad (4.0/1.0)
| | | | installment_commitment > 2: good (5.0)
| | credit_history = delayed previously: good (22.0/6.0)
| | credit_history = critical/other existing credit: good (92.0/3.0)
Number of Leaves : 47
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.946 0.587 0.790 0.946 0.861 0.447 0.811 0.897 good
0.413 0.054 0.765 0.413 0.537 0.447 0.811 0.649 bad
Weighted Avg. 0.786 0.427 0.783 0.786 0.764 0.447 0.811 0.823
a b <-- classified as
662 38 | a = good
176 124 | b = bad
The pruned tree has 47 leaves and classifies (662 + 124)/1000 = 78.6% of the training instances correctly. The accuracy does not increase: pruning removes subtrees that fit the training data, trading training-set accuracy for a simpler tree that generalizes better.
12. (Extra Credit): How can you convert a Decision Trees into "if-then-else rules". Make up your
own small Decision Tree consisting of 2-3 levels and convert it into a set of rules. There also exist
different classifiers that output the model in the form of rules - one such classifier in Weka is
rules.PART, train this model and report the set of rules obtained. Sometimes just one attribute
can be good enough in making the decision, yes, just one ! Can you predict what attribute that
might be in this dataset? OneR classifier uses a single attribute to make decisions (it chooses the
attribute based on minimum error). Report the rule obtained by training a one R classifier. Rank
the performance of j48, PART and oneR.
In WEKA, rules.PART is one of the classifiers that converts decision trees into IF-THEN-ELSE rules. Trained on the weather data, PART produces a decision list of four rules, ending with a default rule:
outlook = overcast: yes (4.0)
windy = TRUE: no (4.0/1.0)
outlook = sunny: no (3.0/1.0)
: yes (3.0)
Number of Rules : 4
Yes, sometimes just one attribute can be good enough in making the decision. In this dataset (weather), the single attribute used for making the decision is outlook. The OneR rule is:
outlook:
sunny -> no
overcast -> yes
rainy -> yes
(10/14 instances correct)
With respect to time, the OneR classifier ranks first, J48 second, and PART third:
Classifier J48 PART OneR
TIME (sec) 0.12 0.14 0.04
RANK II III I
But if you consider accuracy, the J48 classifier ranks first, PART second, and OneR third:
Classifier J48 PART OneR
RANK I II III
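Both rule learners can be trained through the API as well. A minimal sketch, assuming the weather data is available as weather.nominal.arff (hypothetical path):

import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RuleLearners {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        PART part = new PART(); // decision list built from partial C4.5 trees
        part.buildClassifier(data);
        System.out.println(part);
        OneR oneR = new OneR(); // single-attribute rule with minimum error
        oneR.buildClassifier(data);
        System.out.println(oneR);
    }
}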
Extra Experiments:
13. Generate Association rules for the following transactional database using Apriori algorithm.
Step-1
Create Excel Document of the above data and save it as (.CSV delimited) file type
Tid I1 I2 I3 I4 I5
T100 yes yes no no yes
T200 no yes no yes no
T300 no yes yes no no
T400 yes yes no yes no
T500 yes no yes no no
T600 no yes yes no no
T700 yes no yes no no
T800 yes yes yes no yes
T900 yes yes yes no no
Step-2
Open the WEKA tool and then click on Explorer.
Step-3
Click on open file tab and take the file from the desired location, the file should be of (.csv) which was
saved earlier in one location.
Step-4
Click on the Associate tab in the top header, then choose Apriori and click OK.
Step-5
Then click start button and see the generated association rules results:
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: customer
Instances: 9
Attributes: 6
Tid
I1
I2
I3
I4
I5
=== Associator model (full training set) ===
Apriori
=======
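The same run can be reproduced through the API with weka.associations.Apriori, matching the scheme line above. A minimal sketch, assuming the transactions are saved as customer.arff (hypothetical path) with the item columns as nominal yes/no attributes:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriRules {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("customer.arff").getDataSet();
        Apriori apriori = new Apriori();
        apriori.setNumRules(10);              // -N 10: report the 10 best rules
        apriori.setMinMetric(0.9);            // -C 0.9: minimum confidence
        apriori.setLowerBoundMinSupport(0.1); // -M 0.1: lower bound on support
        apriori.buildAssociations(data);
        System.out.println(apriori); // prints the generated association rules
    }
}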
14. Generate classification rules for the following data base using decision tree (J48).
Step1:
Create Excel Document of the above data and save it as (.CSV delimited) file type
Step-2
Open the WEKA tool and then click on Explorer.
Step-3
Click on open file tab and take the file from the desired location, the file should be of (.csv) which was
saved earlier in one location.
Step-4
Click on classify tab which is on top headers tab then choose classifier as j48 and select test option as
Use training set.
Step-5
Then click start button and see the generated classification rules results:
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: buys
Instances: 14
Attributes: 6
Rid
age
income
student
credit_rating
class:buys_computer
Test mode: evaluate on training data
age = youth
| student = no: no (3.0)
| student = yes: yes (2.0)
age = middle_aged: yes (4.0)
age = senior
| credit_rating = fair: yes (3.0)
| credit_rating = excellent: no (2.0)
Number of Leaves : 5
a b <-- classified as
5 0 | a = no
0 9 | b = yes
Step-6
Right-click the entry in the result list to visualize the tree.
Output:
VIVA QUESTIONS:
4. Define Classification.
It is the process of finding a set of models that describe and distinguish data classes or concepts, for the purpose of using the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data.