Data Mining Lab Manual COMPLETE GMR

INDEX

1. Fundamentals of Data Mining
2. Introduction to WEKA
3. Attribute Relation File Format (ARFF)
4. Comma Separated Value (CSV)
5. Credit Risk Assessment

LAB CYCLE TASKS

6. Task 1
7. Task 2
8. Task 3
9. Task 4
10. Task 5
11. Task 6
12. Task 7
13. Task 8
14. Task 9
15. Task 10
16. Task 11
17. Task 12
18. Generate association rules for the given transactional database using the Apriori algorithm.
19. Generate classification rules for the given database using a decision tree (J48).


Fundamentals of Data Mining


Definition of Data Mining:

Data mining refers to extracting or mining knowledge from large amounts of data.
Data mining is also referred to as knowledge mining from data, knowledge extraction, data
archeology and data dredging.

Applications of Data Mining:

Business Intelligence applications


Insurance
Banking
Medicine
Retail/Marketing etc.

Functionalities of Data Mining:

These functionalities are used to specify the kind of patterns to be found in data mining tasks.
Data mining tasks can be classified into 2 categories:

Descriptive
Predictive

The following are the functionalities of data mining:

Concept/Class description: Characterization and Discrimination:

Generalize, summarize and contrast data characteristics.

Mining frequent patterns, Associations and Correlations

Frequent patterns are patterns (such as itemsets, subsequences, or substructures) that appear in a data
set frequently.

Classification and Prediction:

Construct models that describe and distinguish classes or concepts for future prediction.
Predicts some unknown or missing numerical values.

Cluster analysis:

Class label is unknown. Group data to form new classes.

Maximizing intra-class similarity and minimizing inter-class similarity.

Outlier analysis:

Outlier: a data object that does not comply with the general behavior of the data.
An outlier may be noise or an exception, but outlier analysis is quite useful in fraud detection and rare-event analysis.


Introduction to WEKA
WEKA is an open-source collection of many data mining and machine learning algorithms, including:
pre-processing of data
classification
clustering
association rule extraction
It was created by researchers at the University of Waikato in New Zealand, and is Java based (also open source).

Weka Main Features

49 data preprocessing tools


76 classification/regression algorithms
8 clustering algorithms
15 attribute/subset evaluators + 10 search algorithms for feature selection.
3 algorithms for finding association rules
3 graphical user interfaces
The Explorer (exploratory data analysis)
The Experimenter (experimental environment)
The KnowledgeFlow (new process model inspired interface)

Weka: Download and Installation

Download Weka (the stable version) from http://www.cs.waikato.ac.nz/ml/weka/


Choose the self-extracting executable (which includes a Java VM).
(If you are interested in modifying/extending Weka, there is a developer version that includes
the source code.)
After the download is completed, run the self-extracting file to install Weka, using the default
set-up.

Starting Weka

From the Windows desktop,

click Start, choose All Programs,
choose Weka 3.6 to start Weka.
Then the first interface window appears:

Weka GUI Chooser.


Fig 1. Weka GUI Chooser

Weka Application Interfaces

Explorer

preprocessing, attribute selection, learning, visualization

Experimenter

testing and evaluating machine learning algorithms

Knowledge Flow

visual design of KDD process


Simple Command-line

A simple interface for typing commands


Fig 2. Weka Application Interfaces

Weka Functions and tools

Preprocessing Filters

Attribute selection

Classification/Regression

Clustering

Association discovery

Visualization

Load data file

Load data files in the formats: ARFF, CSV, C4.5, binary

Import from URL or SQL database (using JDBC)
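The same loading step can also be done programmatically. The fragment below is a minimal sketch against Weka's Java API; it assumes weka.jar is on the classpath and a file named credit-g.arff (an illustrative name) in the working directory. ConverterUtils.DataSource picks an appropriate loader from the file extension, so the same call works for ARFF, CSV and the other supported formats.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadData {
    public static void main(String[] args) throws Exception {
        // DataSource selects the right converter (ARFF, CSV, C4.5, ...) by extension
        DataSource source = new DataSource("credit-g.arff");
        Instances data = source.getDataSet();
        // By convention the last attribute is treated as the class attribute
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Loaded " + data.numInstances() + " instances with "
                + data.numAttributes() + " attributes");
    }
}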

WEKA data formats

Data can be imported from a file in various formats:

ARFF (Attribute Relation File Format) has two sections:
the Header section defines attribute names, types and relations.
the Data section lists the data records.

CSV: Comma Separated Values (text file)

C4.5: a format used by the C4.5 decision tree induction algorithm; it requires two separate files:
Names file: defines the names of the attributes
Data file: lists the records (samples)

binary

Data can also be read from a URL or from an SQL database (using JDBC).

Attribute Relation File Format (ARFF)

An ARFF file consists of two distinct sections:

the Header section defines attribute names, types and relations; it starts with the keywords
@Relation <data-name>
@attribute <attribute-name> <type> or {range}

the Data section lists the data records; it starts with
@Data
followed by the list of data instances.

Any line starting with % is a comment.

Data types supported by ARFF:

numeric
string
nominal specification
date

Example:

@RELATION STUDENT
@ATTRIBUTE SNO NUMERIC
@ATTRIBUTE NAME STRING

@ATTRIBUTE AGE NUMERIC
@ATTRIBUTE CITY {HYD,DELHI,MUMBAI}
@ATTRIBUTE BRANCH {CSE,IT,ECE,EEE}
@ATTRIBUTE MARKS NUMERIC
@ATTRIBUTE CLASS {PASS,FAIL}
@DATA
1,DEEPIKA,22,HYD,CSE,76,PASS
2,RADHIKA,23,DELHI,IT,34,FAIL
3,PRADEEP,21,MUMBAI,EEE,45,PASS
4,KRISHNA,22,HYD,ECE,23,FAIL
5,RISHI,21,DELHI,IT,88,PASS
6,SHARAN,21,MUMBAI,EEE,92,PASS
7,SHREYANSH,22,HYD,CSE,26,FAIL
8,SUGUNA,23,MUMBAI,ECE,65,PASS

Write the file in Notepad,
save the file with the .arff extension,
and select "All Files" as the file type when saving.

CSV(Comma Separated Value)

The CSV File Format

Each record is one line.
Fields are separated with commas.
Example: John,Doe,120 any st.,"Anytown, WW",08123
Leading and trailing space characters adjacent to comma field separators are ignored.
So John , Doe ,... resolves to "John" and "Doe", etc. Space characters can be spaces or tabs.
Fields with embedded commas must be delimited with double-quote characters.
In the above example, "Anytown, WW" had to be delimited in double quotes because it has an
embedded comma.
Fields that contain double-quote characters must be surrounded by double quotes, and each
embedded double quote must be represented by a pair of consecutive double quotes.
So John "Da Man" Doe would convert to "John ""Da Man""",Doe,120 any st.,...
A field that contains embedded line breaks must be surrounded by double quotes; such a field is
still a single CSV record, even though it takes up more than one line in the CSV file. This works
because the line breaks are embedded inside the double quotes of the field.
Fields with leading or trailing spaces must be delimited with double-quote characters.
So, to preserve the leading and trailing spaces around the last name above: John ," Doe ",...
The delimiters will always be discarded.
The first record in a CSV file may be a header record containing column (field) names.

Example:
SNO,NAME,AGE,CITY,BRANCH,MARKS,CLASS
1,DEEPIKA,22,HYD,CSE,76,PASS
2,RADHIKA,23,DELHI,IT,34,FAIL
3,PRADEEP,21,MUMBAI,EEE,45,PASS
4,KRISHNA,22,HYD,ECE,23,FAIL
5,RISHI,21,DELHI,IT,88,PASS
6,SHARAN,21,MUMBAI,EEE,92,PASS
7,SHREYANSH,22,HYD,CSE,26,FAIL
8,SUGUNA,23,MUMBAI,ECE,65,PASS

Write the file in Notepad,
save the file with the .csv extension,
and select "All Files" as the file type when saving.

Credit Risk Assessment

Description: The business of banks is making loans. Assessing the creditworthiness of an applicant is
of crucial importance. You have to develop a system to help a loan officer decide whether the credit of
a customer is good or bad. A bank's business rules regarding loans must consider two opposing factors.
On the one hand, a bank wants to make as many loans as possible, since interest on these loans is the
bank's profit source. On the other hand, a bank cannot afford to make too many bad loans; too many bad
loans could lead to the collapse of the bank. The bank's loan policy must therefore involve a compromise:
not too strict, and not too lenient.
To do the assignment, you first and foremost need some knowledge about the world of credit. You can
acquire such knowledge in a number of ways.

1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview her and try to represent
her knowledge in the form of production rules.

2. Books. Find some training manuals for loan officers or perhaps a suitable textbook on finance.
Translate this knowledge from text form to production rule form.

3. Common sense. Imagine yourself as a loan officer and make up reasonable rules which can be used
to judge the credit worthiness of a loan applicant.

4. Case histories. Find records of actual cases where competent loan officers correctly judged when,
and when not to, approve a loan application.

The German Credit Data: Actual historical credit data is not always easy to come by because of
confidentiality rules. Here is one such dataset, consisting of 1000 actual cases collected in Germany.
The credit dataset (original) and an Excel spreadsheet version of the German credit data can be
downloaded from the web. In spite of the fact that the data is German, you should probably make use
of it for this assignment (unless you really can consult a real loan officer!).
A few notes on the German dataset

DM stands for Deutsche Mark, the unit of currency, worth about 90 cents Canadian (but looks and
acts like a quarter).

Owns telephone: German phone rates are much higher than in Canada, so fewer people own
telephones.

Foreign worker: there are millions of these in Germany (many from Turkey). It is very hard to get
German citizenship if you were not born of German parents.

There are 20 attributes used in judging a loan applicant. The goal is to classify the applicant into one
of two categories, good or bad.

Procedure

Download the German dataset from the internet (save the data in ARFF format).
The description of data is as follows:
Description of the German credit dataset.

1. Title: German Credit data

2. Source Information

Professor Dr. Hans Hofmann


Institut für Statistik und Ökonometrie
Universität Hamburg
FB Wirtschaftswissenschaften
Von-Melle-Park 5
2000 Hamburg 13

3. Number of Instances: 1000

4. Number of Attributes german: 21 (7 numerical, 14 categorical)

5. Attribute description for german

Attribute 1: (qualitative)
Status of existing checking account
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM /
salary assignments for at least 1 year
A14 : no checking account

Attribute 2: (numerical)
Duration in month

Attribute 3: (qualitative)
Credit history
A30 : no credits taken/
all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/
other credits existing (not at this bank)

Attribute 4: (qualitative)
Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others

Attribute 5: (numerical)
Credit amount

Attribute 6: (qualitative)
Savings account/bonds
A61 : ... < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : .. >= 1000 DM
A65 : unknown/ no savings account

Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : ... < 1 year
A73 : 1 <= ... < 4 years
A74 : 4 <= ... < 7 years
A75 : .. >= 7 years

Attribute 8: (numerical)
Installment rate in percentage of disposable income

Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single

Attribute 10: (qualitative)


Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor

Attribute 11: (numerical)


Present residence since

Attribute 12: (qualitative)


Property
A121 : real estate
A122 : if not A121 : building society savings agreement/
life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property

Attribute 13: (numerical)


Age in years

Attribute 14: (qualitative)


Other installment plans
A141 : bank
A142 : stores
A143 : none

Attribute 15: (qualitative)


Housing
A151 : rent
A152 : own
A153 : for free

Attribute 16: (numerical)


Number of existing credits at this bank

Attribute 17: (qualitative)
Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management/ self-employed/
highly qualified employee/ officer

Attribute 18: (numerical)


Number of people being liable to provide maintenance for

Attribute 19: (qualitative)


Telephone
A191 : none
A192 : yes, registered under the customer's name

Attribute 20: (qualitative)


foreign worker
A201 : yes
A202 : no
Attribute 21: (qualitative)
class
A211 :Good
A212 :Bad


LAB CYCLE TASKS

1. List all the categorical (or nominal) attributes and the real-valued attributes separately.

SOLUTION:
Count all qualitative and numerical attributes
The solution is:-
Number of Attributes german: 21 (7 numerical, 14 categorical)
Attributes:-
1. checking_status

2. duration

3. credit history

4. purpose

5. credit amount

6. savings_status

7. employment duration

8. installment rate

9. personal status

10. debitors

11. residence_since

12. property

13. age in years

14. installment plans

15. housing

16. existing credits

17. job

18. num_dependents

19. telephone

20. foreign worker

21. class

Categorical or Nominal attributes:-


1. checking_status

2. credit history

3. purpose

4. savings_status

5. employment_since

6. personal status

7. debtors

8. property

9. installment plans

10. housing

11. job

12. telephone

13. foreign worker

14. class label

Real valued attributes:-


1. duration

2. credit amount

3. installment rate

4. residence_since

5. age

6. existing credits

7. num_dependents


2. What attributes do you think might be crucial in making the credit assessment? Come up with
some simple rules in plain English using your selected attributes.

SOLUTION:

In my opinion, the following attributes may be crucial in making the credit risk assessment.

1. Credit_history

2. Employment

3. Property_magnitude

4. job

5. duration

6. credit_amount

7. installment

8. existing credit

Based on the above attributes, we can make a decision whether to grant credit or not.

3. One type of model that you can create is a Decision Tree - train a Decision Tree using the
complete dataset as the training data. Report the model obtained after training.

Procedure:

German Data set

Step-1 Save credit-g2.xls as .csv file type and put it in one location

Step-2 Open WEKA tool and then click on explorer

Step-3 Click on the open file tab and select the file from the desired location (the German data of
type CSV which was saved before).

Step-4 After selecting the file click on open.

Step-5 Select all attributes tab

Step-6 Click on the classify tab in the top header bar, then choose the classifier via Choose
-> trees -> J48 and click OK.

Step-7 Then select "Use training set" in the test options.

Step-8 Then click on start button.

Then we get the following confusion matrix:

=== Confusion Matrix ===


a b <-- classified as
669 31 | a = 1
114 186 | b = 2

Step-9 Right-click on the desired result in the result list, e.g. Visualize tree.
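For reference, the same experiment can be reproduced outside the GUI. This is a minimal sketch against Weka's Java API (the file name credit-g.arff and class name Task3 are illustrative); it trains J48 on the complete dataset, evaluates on the training set itself, and prints the same kind of pruned tree, summary and confusion matrix reported below.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Task3 {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);  // class = last attribute

        J48 tree = new J48();            // C4.5-style pruned decision tree
        tree.buildClassifier(data);      // train on the complete dataset

        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data);  // test on the training set itself
        System.out.println(tree);                    // the pruned tree
        System.out.println(eval.toSummaryString());  // accuracy and error measures
        System.out.println(eval.toMatrixString());   // confusion matrix
    }
}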
J48 pruned tree
------------------
checking_status = <0

| foreign_worker = yes

| | duration <= 11

| | | existing_credits <= 1

| | | | property_magnitude = real estate: good (8.0/1.0)

| | | | property_magnitude = life insurance



| | | | | own_telephone = none: bad (2.0)

| | | | | own_telephone = yes: good (4.0)

| | | | property_magnitude = car: good (2.0/1.0)

| | | | property_magnitude = no known property: bad (3.0)

| | | existing_credits > 1: good (14.0)

| | duration > 11

| | | job = unemp/unskilled non res: bad (5.0/1.0)

| | | job = unskilled resident

| | | | purpose = new car

| | | | | own_telephone = none: bad (10.0/2.0)

| | | | | own_telephone = yes: good (2.0)

| | | | purpose = used car: bad (1.0)

| | | | purpose = furniture/equipment

| | | | | employment = unemployed: good (0.0)

| | | | | employment = <1: bad (3.0)

| | | | | employment = 1<=X<4: good (4.0)

| | | | | employment = 4<=X<7: good (1.0)

| | | | | employment = >=7: good (2.0)

| | | | purpose = radio/tv

| | | | | existing_credits <= 1: bad (10.0/3.0)

| | | | | existing_credits > 1: good (2.0)

| | | | purpose = domestic appliance: bad (1.0)

| | | | purpose = repairs: bad (1.0)

| | | | purpose = education: bad (1.0)

| | | | purpose = vacation: bad (0.0)

| | | | purpose = retraining: good (1.0)

| | | | purpose = business: good (3.0)

| | | | purpose = other: good (1.0)

| | | job = skilled

| | | | other_parties = none

| | | | | duration <= 30

| | | | | | savings_status = <100

| | | | | | | credit_history = no credits/all paid: bad (8.0/1.0)

| | | | | | | credit_history = all paid: bad (6.0)

| | | | | | | credit_history = existing paid

| | | | | | | | own_telephone = none

| | | | | | | | | existing_credits <= 1

| | | | | | | | | | property_magnitude = real estate

| | | | | | | | | | | age <= 26: bad (5.0)

| | | | | | | | | | | age > 26: good (2.0)

| | | | | | | | | | property_magnitude = life insurance: bad (7.0/2.0)

| | | | | | | | | | property_magnitude = car

| | | | | | | | | | | credit_amount <= 1386: bad (3.0)

| | | | | | | | | | | credit_amount > 1386: good (11.0/1.0)

| | | | | | | | | | property_magnitude = no known property: good (2.0)

| | | | | | | | | existing_credits > 1: bad (3.0)

| | | | | | | | own_telephone = yes: bad (5.0)

| | | | | | | credit_history = delayed previously: bad (4.0)

| | | | | | | credit_history = critical/other existing credit: good (14.0/4.0)

| | | | | | savings_status = 100<=X<500

| | | | | | | credit_history = no credits/all paid: good (0.0)

| | | | | | | credit_history = all paid: good (1.0)

| | | | | | | credit_history = existing paid: bad (3.0)

| | | | | | | credit_history = delayed previously: good (0.0)

| | | | | | | credit_history = critical/other existing credit: good (2.0)

| | | | | | savings_status = 500<=X<1000: good (4.0/1.0)

| | | | | | savings_status = >=1000: good (4.0)

| | | | | | savings_status = no known savings

| | | | | | | existing_credits <= 1

| | | | | | | | own_telephone = none: bad (9.0/1.0)

| | | | | | | | own_telephone = yes: good (4.0/1.0)

| | | | | | | existing_credits > 1: good (2.0)

| | | | | duration > 30: bad (30.0/3.0)

| | | | other_parties = co applicant: bad (7.0/1.0)

| | | | other_parties = guarantor: good (12.0/3.0)

| | | job = high qualif/self emp/mgmt: good (30.0/8.0)

| foreign_worker = no: good (15.0/2.0)

checking_status = 0<=X<200

| credit_amount <= 9857

| | savings_status = <100

| | | other_parties = none

| | | | duration <= 42

| | | | | personal_status = male div/sep: bad (8.0/2.0)



| | | | | personal_status = female div/dep/mar

| | | | | | purpose = new car: bad (5.0/1.0)

| | | | | | purpose = used car: bad (1.0)

| | | | | | purpose = furniture/equipment

| | | | | | | duration <= 10: bad (3.0)

| | | | | | | duration > 10

| | | | | | | | duration <= 21: good (6.0/1.0)

| | | | | | | | duration > 21: bad (2.0)

| | | | | | purpose = radio/tv: good (8.0/2.0)

| | | | | | purpose = domestic appliance: good (0.0)

| | | | | | purpose = repairs: good (1.0)

| | | | | | purpose = education: good (4.0/2.0)

| | | | | | purpose = vacation: good (0.0)

| | | | | | purpose = retraining: good (0.0)

| | | | | | purpose = business

| | | | | | | residence_since <= 2: good (3.0)


| | | | | | | residence_since > 2: bad (2.0)

| | | | | | purpose = other: good (0.0)

| | | | | personal_status = male single: good (52.0/15.0)

| | | | | personal_status = male mar/wid

| | | | | | duration <= 10: good (6.0)

| | | | | | duration > 10: bad (10.0/3.0)

| | | | | personal_status = female single: good (0.0)

| | | | duration > 42: bad (7.0)

| | | other_parties = co applicant: good (2.0)

| | | other_parties = guarantor

| | | | purpose = new car: bad (2.0)

| | | | purpose = used car: good (0.0)

| | | | purpose = furniture/equipment: good (0.0)

| | | | purpose = radio/tv: good (18.0/1.0)

| | | | purpose = domestic appliance: good (0.0)

| | | | purpose = repairs: good (0.0)

| | | | purpose = education: good (0.0)

| | | | purpose = vacation: good (0.0)

| | | | purpose = retraining: good (0.0)

| | | | purpose = business: good (0.0)

| | | | purpose = other: good (0.0)

| | savings_status = 100<=X<500

| | | purpose = new car: bad (15.0/5.0)

| | | purpose = used car: good (3.0)


| | | purpose = furniture/equipment: bad (4.0/1.0)
| | | purpose = radio/tv: bad (8.0/2.0)

| | | purpose = domestic appliance: good (0.0)


| | | purpose = repairs: good (2.0)

| | | purpose = education: good (0.0)

| | | purpose = vacation: good (0.0)

| | | purpose = retraining: good (0.0)

| | | purpose = business

| | | | housing = rent

| | | | | existing_credits <= 1: good (2.0)



| | | | | existing_credits > 1: bad (2.0)

| | | | housing = own: good (6.0)

| | | | housing = for free: bad (1.0)

| | | purpose = other: good (1.0)

| | savings_status = 500<=X<1000: good (11.0/3.0)

| | savings_status = >=1000: good (13.0/3.0)

| | savings_status = no known savings: good (41.0/5.0)

| credit_amount > 9857: bad (20.0/3.0)

checking_status = >=200: good (63.0/14.0)

checking_status = no checking: good (394.0/46.0)

Number of Leaves : 103


Size of the tree : 140
Time taken to build model: 0.03 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 855 85.5%
Incorrectly Classified Instances 145 14.5%
Kappa statistic 0.6251
Mean absolute error 0.2312
Root mean squared error 0.34
Relative absolute error 55.0377 %
Root relative squared error 74.2015 %
Total Number of Instances 1000

Outputs

Tree: to visualize the fully grown tree, right-click and choose Auto Scale.


Visualize classifier errors:

Visualize margin curve


Visualize threshold curve

Cost benefit analysis


Visualize cost curve


4. Suppose you use your above model trained on the complete dataset, and classify credit
good/bad for each of the examples in the dataset. What % of examples can you classify correctly?
(This is also called testing on the training set) Why do you think you cannot get 100 % training
accuracy?

SOLUTION :-

Predictive Accuracy Evaluation

The main methods of predictive accuracy evaluations are:

Resubstitution (N ; N)

Holdout (2N/3 ; N/3)

x-fold cross-validation (N-N/x ; N/x)

Leave-one-out (N-1 ; 1)

Where N is the number of records (instances) in the dataset

Training and Testing

REMEMBER: we must know the classification (class attribute values) of all instances (records) used in
the test procedure.
Basic Concepts

Success: instance (record) class is predicted correctly

Error: instance class is predicted incorrectly

Error rate: a percentage of errors made over the whole set of instances (records) used for testing

Predictive Accuracy: a percentage of well classified data in the testing data set.

Training and Testing

Example:

Testing Rules (testing record #1) = record #1.class - Succ

Testing Rules (testing record #2) not= record #2.class - Error

Testing Rules (testing record #3) = record #3.class - Succ


Testing Rules (testing record #4) = record #4.class - Succ

Testing Rules (testing record #5) not= record #5.class - Error

Error rate:

2 errors: #2 and #5

Error rate = 2/5=40%

Predictive Accuracy: 3/5 = 60%
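In Weka these quantities are exposed directly by the weka.classifiers.Evaluation class; a small illustrative fragment (assuming an Evaluation object eval built as in the previous task):

// 'eval' is a weka.classifiers.Evaluation produced by evaluateModel(...)
double accuracy  = eval.pctCorrect();   // predictive accuracy, in percent
double errorRate = eval.errorRate();    // fraction of misclassified instances
System.out.printf("Accuracy: %.1f%%, Error rate: %.1f%%%n",
        accuracy, errorRate * 100);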

In the same way, testing on the training data set will not give 100% accuracy for our data.

In the above model we trained on the complete dataset and classified credit as good/bad for each

of the examples in the dataset.

For example:

IF purpose = vacation THEN credit = bad

ELSE IF purpose = business THEN credit = good

In this way we classified each of the examples in the dataset.

We classified 85.5% of the examples correctly and the remaining 14.5% of the examples are

incorrectly classified. We cannot get 100% training accuracy because, out of the 20 attributes, some

unnecessary attributes are also analyzed and trained on.

Due to this the accuracy is affected, and hence we cannot get 100% training accuracy.


5. Is testing on the training set as you did above a good idea? Why or Why not?

SOLUTION:

According to the rules, for maximum accuracy we have to take 2/3 of the dataset as the

training set and the remaining 1/3 as the test set. But here, in the above model, we have taken the

complete dataset as the training set, which results in only 85.5% accuracy.

This comes from analyzing and training on unnecessary attributes which do not play a crucial

role in credit risk assessment. This increases complexity and finally leads to lower accuracy.

If some part of the dataset is used as the training set and the remainder as the test set, the

results are more accurate and the computation time is less.

This is why we prefer not to take the complete dataset as the training set.

In some cases testing on the training set is acceptable, but it is generally better to go with cross-validation.

X-fold cross-validation (N - N/x ; N/x)

Cross-validation is used to prevent the overlap of the test sets:

First step: split the data into x disjoint subsets of equal size.

Second step: use each subset in turn for testing, the remainder for training (repeating cross-validation).

As the resulting rules (if applicable) we take the sum of all rules.

The error (predictive accuracy) estimates are averaged to yield an overall error (predictive accuracy)

estimate.

Standard cross-validation: 10-fold cross-validation.

Why 10?

Extensive experiments have shown that this is the best choice to get an accurate estimate. There is
also some theoretical evidence for this.

6. One approach for solving the problem encountered in the previous question is using cross-
validation? Describe what cross-validation is briefly. Train a Decision Tree again using cross-
validation and report your results. Does your accuracy increase/decrease? Why?

Cross validation:-

In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or

folds D1, D2, D3, . . ., Dk, each of approximately equal size. Training and testing are performed k

times. In iteration i, partition Di is reserved as the test set and the remaining partitions are collectively

used to train the model. That is, in the first iteration subsets D2, D3, . . ., Dk collectively serve as the

training set in order to obtain the first model, which is tested on D1; the second iteration is trained on

the subsets D1, D3, . . ., Dk and tested on D2; and so on.

Click on the classify tab in the top header bar, then choose the classifier via Choose

-> trees -> J48 and click OK.

Then select the cross-validation test option with Folds = 10.

Then click on start button.
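The same 10-fold cross-validation can also be run through the Java API. A minimal sketch (file and class names are illustrative); note that crossValidateModel trains its own internal copies of the classifier, one per fold:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Task6 {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        // 10 folds: each instance is tested exactly once, by a model
        // trained on the other nine folds
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}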

J48 pruned tree :-


------------------
checking_status = <0

| foreign_worker = yes

| | duration <= 11

| | | existing_credits <= 1

| | | | property_magnitude = real estate: good (8.0/1.0)

| | | | property_magnitude = life insurance

| | | | | own_telephone = none: bad (2.0)

| | | | | own_telephone = yes: good (4.0)

| | | | property_magnitude = car: good (2.0/1.0)

| | | | property_magnitude = no known property: bad (3.0)

| | | existing_credits > 1: good (14.0)

| | duration > 11

| | | job = unemp/unskilled non res: bad (5.0/1.0)

| | | job = unskilled resident

| | | | purpose = new car

| | | | | own_telephone = none: bad (10.0/2.0)

| | | | | own_telephone = yes: good (2.0)

| | | | purpose = used car: bad (1.0)

| | | | purpose = furniture/equipment

| | | | | employment = unemployed: good (0.0)

| | | | | employment = <1: bad (3.0)

| | | | | employment = 1<=X<4: good (4.0)

| | | | | employment = 4<=X<7: good (1.0)

| | | | | employment = >=7: good (2.0)

| | | | purpose = radio/tv

| | | | | existing_credits <= 1: bad (10.0/3.0)

| | | | | existing_credits > 1: good (2.0)

| | | | purpose = domestic appliance: bad (1.0)

| | | | purpose = repairs: bad (1.0)

| | | | purpose = education: bad (1.0)

| | | | purpose = vacation: bad (0.0)

| | | | purpose = retraining: good (1.0)

| | | | purpose = business: good (3.0)

| | | | purpose = other: good (1.0)

| | | job = skilled

| | | | other_parties = none

| | | | | duration <= 30

| | | | | | savings_status = <100

| | | | | | | credit_history = no credits/all paid: bad (8.0/1.0)

| | | | | | | credit_history = all paid: bad (6.0)

| | | | | | | credit_history = existing paid


| | | | | | | | own_telephone = none

| | | | | | | | | existing_credits <= 1

| | | | | | | | | | property_magnitude = real estate

| | | | | | | | | | | age <= 26: bad (5.0)

| | | | | | | | | | | age > 26: good (2.0)

| | | | | | | | | | property_magnitude = life insurance: bad (7.0/2.0)

| | | | | | | | | | property_magnitude = car

| | | | | | | | | | | credit_amount <= 1386: bad (3.0)

| | | | | | | | | | | credit_amount > 1386: good (11.0/1.0)

| | | | | | | | | | property_magnitude = no known property: good (2.0)

| | | | | | | | | existing_credits > 1: bad (3.0)

| | | | | | | | own_telephone = yes: bad (5.0)

| | | | | | | credit_history = delayed previously: bad (4.0)

| | | | | | | credit_history = critical/other existing credit: good (14.0/4.0)

| | | | | | savings_status = 100<=X<500

| | | | | | | credit_history = no credits/all paid: good (0.0)


| | | | | | | credit_history = all paid: good (1.0)

| | | | | | | credit_history = existing paid: bad (3.0)

| | | | | | | credit_history = delayed previously: good (0.0)

| | | | | | | credit_history = critical/other existing credit: good (2.0)

| | | | | | savings_status = 500<=X<1000: good (4.0/1.0)

| | | | | | savings_status = >=1000: good (4.0)

| | | | | | savings_status = no known savings

| | | | | | | existing_credits <= 1

| | | | | | | | own_telephone = none: bad (9.0/1.0)


| | | | | | | | own_telephone = yes: good (4.0/1.0)

| | | | | | | existing_credits > 1: good (2.0)

| | | | | duration > 30: bad (30.0/3.0)

| | | | other_parties = co applicant: bad (7.0/1.0)

| | | | other_parties = guarantor: good (12.0/3.0)

| | | job = high qualif/self emp/mgmt: good (30.0/8.0)

| foreign_worker = no: good (15.0/2.0)

checking_status = 0<=X<200

| credit_amount <= 9857

| | savings_status = <100

| | | other_parties = none

| | | | duration <= 42

| | | | | personal_status = male div/sep: bad (8.0/2.0)

| | | | | personal_status = female div/dep/mar

| | | | | | purpose = new car: bad (5.0/1.0)

| | | | | | purpose = used car: bad (1.0)

| | | | | | purpose = furniture/equipment

| | | | | | | duration <= 10: bad (3.0)



| | | | | | | duration > 10

| | | | | | | | duration <= 21: good (6.0/1.0)

| | | | | | | | duration > 21: bad (2.0)

| | | | | | purpose = radio/tv: good (8.0/2.0)

| | | | | | purpose = domestic appliance: good (0.0)

| | | | | | purpose = repairs: good (1.0)

| | | | | | purpose = education: good (4.0/2.0)


| | | | | | purpose = vacation: good (0.0)

| | | | | | purpose = retraining: good (0.0)

| | | | | | purpose = business

| | | | | | | residence_since <= 2: good (3.0)

| | | | | | | residence_since > 2: bad (2.0)

| | | | | | purpose = other: good (0.0)

| | | | | personal_status = male single: good (52.0/15.0)

| | | | | personal_status = male mar/wid

| | | | | | duration <= 10: good (6.0)

| | | | | | duration > 10: bad (10.0/3.0)

| | | | | personal_status = female single: good (0.0)

| | | | duration > 42: bad (7.0)

| | | other_parties = co applicant: good (2.0)

| | | other_parties = guarantor

| | | | purpose = new car: bad (2.0)

| | | | purpose = used car: good (0.0)

| | | | purpose = furniture/equipment: good (0.0)

| | | | purpose = radio/tv: good (18.0/1.0)

| | | | purpose = domestic appliance: good (0.0)

| | | | purpose = repairs: good (0.0)

| | | | purpose = education: good (0.0)

| | | | purpose = vacation: good (0.0)

| | | | purpose = retraining: good (0.0)

| | | | purpose = business: good (0.0)

| | | | purpose = other: good (0.0)


| | savings_status = 100<=X<500

| | | purpose = new car: bad (15.0/5.0)

| | | purpose = used car: good (3.0)

| | | purpose = furniture/equipment: bad (4.0/1.0)

| | | purpose = radio/tv: bad (8.0/2.0)

| | | purpose = domestic appliance: good (0.0)

| | | purpose = repairs: good (2.0)

| | | purpose = education: good (0.0)

| | | purpose = vacation: good (0.0)

| | | purpose = retraining: good (0.0)

| | | purpose = business

| | | | housing = rent

| | | | | existing_credits <= 1: good (2.0)

| | | | | existing_credits > 1: bad (2.0)

| | | | housing = own: good (6.0)

| | | | housing = for free: bad (1.0)

| | | purpose = other: good (1.0)



| | savings_status = 500<=X<1000: good (11.0/3.0)

| | savings_status = >=1000: good (13.0/3.0)

| | savings_status = no known savings: good (41.0/5.0)

| credit_amount > 9857: bad (20.0/3.0)

checking_status = >=200: good (63.0/14.0)

checking_status = no checking: good (394.0/46.0)

Number of Leaves : 103

Size of the tree : 140


Time taken to build model: 0.07 seconds

=== Stratified cross-validation ===


=== Summary ===

Correctly Classified Instances 705 70.5 %


Incorrectly Classified Instances 295 29.5 %
Kappa statistic 0.2467
Mean absolute error 0.3467
Root mean squared error 0.4796
Relative absolute error 82.5233 %
Root relative squared error 104.6565 %
Total Number of Instances 1000

=== Detailed Accuracy By Class ===


TP Rate FP Rate Precision Recall F-Measure ROC Area Class

0.84 0.61 0.763 0.84 0.799 0.639 good


0.39 0.16 0.511 0.39 0.442 0.639 bad

Weighted Avg. 0.705 0.475 0.687 0.705 0.692 0.639

=== Confusion Matrix ===


a b <-- classified as
588 112 | a = good
183 117 | b = bad

The accuracy decreases, from 85.5% when testing on the training set to 70.5% here, because under cross-validation every instance is classified by a model that did not see it during training, which gives a more realistic estimate of predictive accuracy.

7. Check to see if the data shows a bias against "foreign workers" (attribute 20),or "personal-
status"(attribute 9). One way to do this (perhaps rather simple minded) is to remove these
attributes from the dataset and see if the decision tree created in those cases is significantly
different from the full dataset case which you have already done. To remove an attribute you can
use the Preprocess tab in Weka's GUI Explorer. Did removing these attributes have any significant
effect? Discuss.

The observed increase in accuracy is because these two attributes are not very important in training

and analyzing; by removing them, the training time is reduced to some extent, which results in an

increase in accuracy.

The decision tree created from the full dataset is very large compared to the decision tree we have

trained now. This is the main difference between the two decision trees.

Solution:

Step-1

After generating the above exercise click on preprocess tab.

Step-2

Remove attributes 20 and 9

Step-3

Then click on classify tab and take test options as cross validation and then click on start.

Step-4

Visualize results:-

Visualize classifier errors:

Visualize tree

The difference we observed is that the accuracy improved.
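For reference, the attribute removal itself can be scripted with Weka's Remove filter instead of the Preprocess tab. A minimal sketch (illustrative names; attribute indices are 1-based, as in the GUI):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class Task7 {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();

        Remove remove = new Remove();
        remove.setAttributeIndices("9,20");  // personal_status and foreign_worker
        remove.setInputFormat(data);         // must be called before filtering
        Instances reduced = Filter.useFilter(data, remove);
        reduced.setClassIndex(reduced.numAttributes() - 1);
        System.out.println("Attributes remaining: " + reduced.numAttributes());
    }
}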

8. Another question might be, do you really need to input so many attributes to get good results?
Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17
(and 21, the class attribute (naturally)). Try out some combinations. (You had removed two
attributes in problem 7. Remember to reload the arff data file to get all the attributes initially
before you start selecting the ones you want.)
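One convenient way to try such subsets programmatically is to invert the Remove filter so that only the listed attributes are kept. A minimal sketch (illustrative names; indices are 1-based, 21 being the class attribute); the output reported below corresponds to such a subset:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class Task8 {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();

        Remove keep = new Remove();
        keep.setAttributeIndices("2,3,5,7,10,17,21");
        keep.setInvertSelection(true);   // keep these attributes, drop the rest
        keep.setInputFormat(data);
        Instances subset = Filter.useFilter(data, keep);
        subset.setClassIndex(subset.numAttributes() - 1);
        System.out.println(subset.numAttributes() + " attributes kept");
    }
}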

=== Classifier model (full training set) ===

J48 pruned tree


------------------

credit_history = no credits/all paid: bad (40.0/15.0)

credit_history = all paid

| employment = unemployed

| | duration <= 36: bad (3.0)

| | duration > 36: good (2.0)

| employment = <1

| | duration <= 26: bad (7.0/1.0)

| | duration > 26: good (2.0)

| employment = 1<=X<4: good (15.0/6.0)

| employment = 4<=X<7: bad (10.0/4.0)

| employment = >=7

| | job = unemp/unskilled non res: bad (0.0)

| | job = unskilled resident: good (3.0)

| | job = skilled: bad (3.0)

| | job = high qualif/self emp/mgmt: bad (4.0)

credit_history = existing paid

| credit_amount <= 8648

| | duration <= 40: good (476.0/130.0)

| | duration > 40: bad (27.0/8.0)



| credit_amount > 8648: bad (27.0/7.0)

credit_history = delayed previously

| employment = unemployed

| | credit_amount <= 2186: bad (4.0/1.0)

| | credit_amount > 2186: good (2.0)

| employment = <1

| | duration <= 18: good (2.0)

| | duration > 18: bad (10.0/2.0)

| employment = 1<=X<4: good (33.0/6.0)

| employment = 4<=X<7

| | credit_amount <= 4530

| | | credit_amount <= 1680: good (3.0)

| | | credit_amount > 1680: bad (3.0)

| | credit_amount > 4530: good (11.0)

| employment = >=7

| | job = unemp/unskilled non res: good (0.0)

| | job = unskilled resident: good (2.0/1.0)

| | job = skilled: good (14.0/4.0)

| | job = high qualif/self emp/mgmt: bad (4.0/1.0)

credit_history = critical/other existing credit: good (293.0/50.0)

Number of Leaves : 27

Size of the tree : 40

Time taken to build model: 0.01 seconds

=== Evaluation on training set ===


=== Summary ===

Correctly Classified Instances 764 76.4 %

Incorrectly Classified Instances 236 23.6 %

Kappa statistic 0.3386

Mean absolute error 0.3488

Root mean squared error 0.4176

Relative absolute error 83.0049 %

Root relative squared error 91.1243 %

Total Number of Instances 1000

=== Classifier model (full training set) ===

J48 pruned tree


------------------

credit_history = no credits/all paid: bad (40.0/15.0)

credit_history = all paid

| employment = unemployed

| | duration <= 36: bad (3.0)

| | duration > 36: good (2.0)

| employment = <1

| | duration <= 26: bad (7.0/1.0)

| | duration > 26: good (2.0)

| employment = 1<=X<4: good (15.0/6.0)

| employment = 4<=X<7: bad (10.0/4.0)

| employment = >=7

| | job = unemp/unskilled non res: bad (0.0)

| | job = unskilled resident: good (3.0)



| | job = skilled: bad (3.0)

| | job = high qualif/self emp/mgmt: bad (4.0)

credit_history = existing paid

| credit_amount <= 8648

| | duration <= 40: good (476.0/130.0)

| | duration > 40: bad (27.0/8.0)

| credit_amount > 8648: bad (27.0/7.0)

credit_history = delayed previously

| employment = unemployed

| | credit_amount <= 2186: bad (4.0/1.0)

| | credit_amount > 2186: good (2.0)

| employment = <1

| | duration <= 18: good (2.0)

| | duration > 18: bad (10.0/2.0)

| employment = 1<=X<4: good (33.0/6.0)

| employment = 4<=X<7

| | credit_amount <= 4530

| | | credit_amount <= 1680: good (3.0)

| | | credit_amount > 1680: bad (3.0)

| | credit_amount > 4530: good (11.0)

| employment = >=7

| | job = unemp/unskilled non res: good (0.0)

| | job = unskilled resident: good (2.0/1.0)

| | job = skilled: good (14.0/4.0)



| | job = high qualif/self emp/mgmt: bad (4.0/1.0)

credit_history = critical/other existing credit: good (293.0/50.0)

Number of Leaves : 27

Size of the tree : 40

Time taken to build model: 0.01 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 703 70.3 %

Incorrectly Classified Instances 297 29.7 %

Kappa statistic 0.1759

Mean absolute error 0.3862

Root mean squared error 0.4684

Relative absolute error 91.9029 %

Root relative squared error 102.2155 %

Total Number of Instances 1000

9. Sometimes, the cost of rejecting an applicant who actually has a good credit (case 1) might be
higher than accepting an applicant who has bad credit (case 2). Instead of counting the
misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and lower
cost to the second case. You can do this by using a cost matrix in Weka. Train your Decision Tree
again and report the Decision Tree and cross-validation results. Are they significantly different
from results obtained in problem 6 (using equal cost)?

In problem 6 we used equal costs when training the decision tree. But here we consider

two cases with different costs.

Let us take cost 5 in case 1 and cost 2 in case 2.

Case 1:
Procedure:
Step1: Click on weka from desktop.
Step2: Choose explorer from weka GUI chooser.
Step3: Click on open file and open credit_g2.csv data file.
Step4: Click on classify and click on choose button to choose CostSensitiveClassifier
from meta.
Step5: Right click on text box and select show properties.
Step6: Click on classifier button to choose j48 from trees.
Step7: click on costMatrix and set classes to 2 and click on Resize button.
Step8: Change FP value to 5.0 and press enter and click on ok.
Step9: Click on start to see classifier output.
Step10: Right click on result to see the visualization tree.
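Equivalently, the cost matrix can be supplied through the Java API with CostSensitiveClassifier. This is a hedged sketch: it assumes CostMatrix.parseMatlab is available (as in recent Weka versions) and uses the same matrix string shown in the run information below; by default the classifier reweights the training instances rather than minimizing expected cost, matching the GUI run.

import java.util.Random;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Task9 {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Rows are actual classes, columns predicted classes;
        // one of the two error types is made 5 times as expensive.
        CostMatrix costs = CostMatrix.parseMatlab("[0.0 1.0; 5.0 0.0]");

        CostSensitiveClassifier csc = new CostSensitiveClassifier();
        csc.setClassifier(new J48());
        csc.setCostMatrix(costs);

        Evaluation eval = new Evaluation(data, costs);
        eval.crossValidateModel(csc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}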



=== Run information ===

Scheme: weka.classifiers.meta.CostSensitiveClassifier -cost-matrix "[0.0 1.0; 5.0 0.0]" -S 1 -W


weka.classifiers.trees.J48 -- -C 0.25 -M 2
Relation: german_credit
Instances: 1000
Attributes: 21
checking_status
duration
credit_history
purpose
credit_amount
savings_status
employment
installment_commitment
personal_status
other_parties
residence_since

property_magnitude
age
other_payment_plans
housing
existing_credits
job
num_dependents
own_telephone
foreign_worker
class
Test mode: evaluate on training data

=== Classifier model (full training set) ===

CostSensitiveClassifier using reweighted training instances

weka.classifiers.trees.J48 -C 0.25 -M 2

Classifier Model
J48 pruned tree
------------------

checking_status = <0
| duration <= 11
| | existing_credits <= 1: bad (21.82/5.91)
| | existing_credits > 1: good (8.64)
| duration > 11: bad (339.55/48.64)
checking_status = 0<=X<200
| other_parties = none
| | savings_status = <100: bad (170.91/30.0)
| | savings_status = 100<=X<500: bad (55.45/10.0)
| | savings_status = 500<=X<1000
| | | installment_commitment <= 2: good (2.27)
| | | installment_commitment > 2: bad (8.18/1.36)
| | savings_status = >=1000
| | | own_telephone = none: bad (9.09/2.27)
| | | own_telephone = yes: good (2.27)
| | savings_status = no known savings
| | | existing_credits <= 1
| | | | credit_history = no credits/all paid: good (0.45)
| | | | credit_history = all paid: good (0.45)
| | | | credit_history = existing paid
| | | | | property_magnitude = real estate: bad (6.36/1.82)
| | | | | property_magnitude = life insurance: bad (8.18/1.36)
| | | | | property_magnitude = car: good (3.64)
| | | | | property_magnitude = no known property: bad (5.91/1.36)
| | | | credit_history = delayed previously: good (2.27)
| | | | credit_history = critical/other existing credit: bad (0.0)
| | | existing_credits > 1: good (4.55)
| other_parties = co applicant: bad (15.0/1.36)
| other_parties = guarantor
| | duration <= 16: good (7.27)
| | duration > 16: bad (10.91/1.82)
checking_status = >=200
| property_magnitude = real estate
| | duration <= 7: good (2.27)
| | duration > 7: bad (21.82/3.64)
| property_magnitude = life insurance: good (5.45)
| property_magnitude = car
| | employment = unemployed: good (0.0)
| | employment = <1: bad (5.0/0.45)
| | employment = 1<=X<4
| | | installment_commitment <= 3: good (4.09)
| | | installment_commitment > 3: bad (2.73/0.45)
| | employment = 4<=X<7: good (0.91)
| | employment = >=7: good (2.27)
| property_magnitude = no known property
| | credit_amount <= 1323: bad (7.27/0.45)
| | credit_amount > 1323: good (2.27)
checking_status = no checking
| other_payment_plans = bank: bad (48.18/14.09)
| other_payment_plans = stores
| | savings_status = <100: bad (11.82/2.73)
| | savings_status = 100<=X<500: good (0.45)
| | savings_status = 500<=X<1000: good (0.45)
| | savings_status = >=1000: good (0.45)
| | savings_status = no known savings: good (2.27)
| other_payment_plans = none
| | credit_history = no credits/all paid: good (1.82)
| | credit_history = all paid: good (0.45)
| | credit_history = existing paid
| | | existing_credits <= 1
| | | | purpose = new car
| | | | | age <= 27: bad (8.64/1.82)
| | | | | age > 27: good (13.18)
| | | | purpose = used car: good (10.45)
| | | | purpose = furniture/equipment
| | | | | personal_status = male div/sep: good (0.45)
| | | | | personal_status = female div/dep/mar
| | | | | | age <= 27: good (2.73)
| | | | | | age > 27: bad (7.27/0.45)
| | | | | personal_status = male single: good (3.18)
| | | | | personal_status = male mar/wid: bad (2.73/0.45)
| | | | | personal_status = female single: bad (0.0)
| | | | purpose = radio/tv
| | | | | age <= 23: bad (5.91/1.36)
| | | | | age > 23: good (18.18)
| | | | purpose = domestic appliance: good (1.36)
| | | | purpose = repairs: bad (2.73/0.45)
| | | | purpose = education: bad (4.09/1.82)
| | | | purpose = vacation: good (0.0)
| | | | purpose = retraining: good (0.91)
| | | | purpose = business: good (2.73)
| | | | purpose = other: good (0.0)
| | | existing_credits > 1
| | | | installment_commitment <= 2: bad (11.82/0.45)
| | | | installment_commitment > 2
| | | | | credit_amount <= 9157: good (4.55)
| | | | | credit_amount > 9157: bad (2.27)
| | credit_history = delayed previously
| | | installment_commitment <= 3
| | | | residence_since <= 1: bad (2.27)
| | | | residence_since > 1: good (7.27)
| | | installment_commitment > 3: bad (17.73/4.09)
| | credit_history = critical/other existing credit: good (66.36/6.82)

Number of Leaves : 65

Size of the tree : 94

Cost Matrix
0 1
5 0

Time taken to build model: 0.06 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0 seconds

=== Summary ===

Correctly Classified Instances 692 69.2 %


Incorrectly Classified Instances 308 30.8 %
Kappa statistic 0.4305
Mean absolute error 0.3133
Root mean squared error 0.4638
Relative absolute error 74.5737 %
Root relative squared error 101.2033 %
Coverage of cases (0.95 level) 99.9 %
Mean rel. region size (0.95 level) 86.4 %
Total Number of Instances 1000

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.564 0.010 0.992 0.564 0.719 0.519 0.820 0.918 good
0.990 0.436 0.493 0.990 0.659 0.519 0.820 0.565 bad
Weighted Avg. 0.692 0.138 0.843 0.692 0.701 0.519 0.820 0.812

=== Confusion Matrix ===

a b <-- classified as
395 305 | a = good
3 297 | b = bad

Compared with the equal-cost results of problem 6, the overall accuracy is slightly lower (69.2% against 70.5%), but the error pattern changes sharply: only 3 bad applicants are now classified as good, at the price of rejecting 305 good applicants.

Use the above procedure for case 2 by setting the cost to 2, and observe the output.

10. Do you think it is a good idea to prefer simple decision trees instead of having long complex
decision trees? How does the complexity of a Decision Tree relate to the bias of the model?

When we consider long, complex decision trees, we have many unnecessary attributes in

the tree, which increases the bias of the model. Because of this, the accuracy of the model

can also be affected.

This problem can be reduced by considering a simple decision tree. The attributes will be fewer

and the bias of the model decreases. Due to this the result will be more accurate.

So it is a good idea to prefer simple decision trees instead of long, complex trees.

11. You can make your Decision Trees simpler by pruning the nodes. One approach is to use
Reduced Error Pruning - explain this idea briefly. Try reduced
error pruning for training your Decision Trees using cross-validation (you can do
this in Weka) and report the Decision Tree you obtain. Also, report your accuracy using the
pruned model. Does your accuracy increase?

Reduced-error pruning :-

The idea of using a separate pruning set for pruning, which is applicable to decision trees as

well as rule sets, is called reduced-error pruning. The variant described previously prunes a rule

immediately after it has been grown and is called incremental reduced-error pruning. Another

possibility is to build a full, unpruned rule set first and prune it afterwards by discarding individual

tests.

In reduced-error pruning each of the nodes in the decision tree is considered for pruning. Pruning

a decision node consists of removing the subtree rooted at that node, making it a leaf node, and

assigning it the most common classification of the training examples affiliated with that node. Nodes are

removed only if the resulting pruned tree performs no worse than the original over the validation set.

Pruning of nodes continues until further pruning is harmful.

However, this method is much slower. Of course, there are many different ways to assess the

worth of a rule based on the pruning set. A simple measure is to consider how well the rule would do at

discriminating the predicted class from other classes if it were the only rule in the theory, operating

under the closed-world assumption. If it gets p instances right out of the t instances that it covers, and

there are P instances of this class out of a total of T instances altogether, then it gets p positive

instances right. The instances that it does not cover include N - n negative ones, where n = t - p

is the number of negative instances that the rule covers and N = T - P is the total number of negative

instances. Thus the rule has an overall success ratio of [p + (N - n)] / T, and this quantity, evaluated on

the test set, has been used to evaluate the success of a rule when using reduced-error pruning.


Procedure:

Step1: Select weka from desktop


Step2: Choose explorer from weka GUI chooser
Step3: Click on open file and select credit_g2 .csv
Step4: Click on classify and click on choose button to choose J48 from trees
Step5: Click on text box and a window appears i.e., weka.gui.GenericObjectEditor
Step6: Make reducedErrorPruning true
Step7: Click on OK and click on start to see classifier output
Step8: Right click on results to see visualize tree
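The same setting is available programmatically on J48; a minimal sketch (illustrative names) matching the -R -N 3 options shown in the run information below:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Task11 {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setReducedErrorPruning(true);  // corresponds to the -R option
        tree.setNumFolds(3);                // -N 3: one fold held out for pruning
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}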

=== Run information ===

Scheme: weka.classifiers.trees.J48 -R -N 3 -Q 1 -M 2
Relation: german_credit
Instances: 1000
Attributes: 21
checking_status
duration
credit_history
purpose
credit_amount
savings_status
employment
installment_commitment
personal_status
other_parties
residence_since
property_magnitude
age
other_payment_plans
housing
existing_credits
job
num_dependents
own_telephone
foreign_worker
class
Test mode: evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree


------------------

checking_status = <0
| foreign_worker = yes
| | credit_history = no credits/all paid: bad (11.0/3.0)
| | credit_history = all paid: bad (9.0/1.0)
| | credit_history = existing paid
| | | other_parties = none
| | | | savings_status = <100
| | | | | existing_credits <= 1
| | | | | | purpose = new car: bad (17.0/4.0)
| | | | | | purpose = used car: good (3.0/1.0)
| | | | | | purpose = furniture/equipment: good (22.0/11.0)
| | | | | | purpose = radio/tv: good (18.0/8.0)
| | | | | | purpose = domestic appliance: bad (2.0)
| | | | | | purpose = repairs: bad (1.0)
| | | | | | purpose = education: bad (5.0/1.0)
| | | | | | purpose = vacation: bad (0.0)
| | | | | | purpose = retraining: bad (0.0)
| | | | | | purpose = business: good (3.0/1.0)
| | | | | | purpose = other: bad (0.0)
| | | | | existing_credits > 1: bad (5.0)
| | | | savings_status = 100<=X<500: bad (8.0/3.0)
| | | | savings_status = 500<=X<1000: good (1.0)
| | | | savings_status = >=1000: good (2.0)
| | | | savings_status = no known savings
| | | | | job = unemp/unskilled non res: bad (0.0)
| | | | | job = unskilled resident: good (2.0)
| | | | | job = skilled
| | | | | | own_telephone = none: bad (4.0)
| | | | | | own_telephone = yes: good (3.0/1.0)
| | | | | job = high qualif/self emp/mgmt: bad (3.0/1.0)
| | | other_parties = co applicant: good (4.0/2.0)
| | | other_parties = guarantor: good (8.0/1.0)
| | credit_history = delayed previously: bad (7.0/2.0)
| | credit_history = critical/other existing credit: good (38.0/10.0)
| foreign_worker = no: good (12.0/2.0)
checking_status = 0<=X<200
| other_parties = none
| | credit_history = no credits/all paid
| | | other_payment_plans = bank: good (2.0/1.0)
| | | other_payment_plans = stores: bad (0.0)
| | | other_payment_plans = none: bad (7.0)
| | credit_history = all paid: bad (10.0/4.0)
| | credit_history = existing paid
| | | credit_amount <= 8858: good (70.0/21.0)
| | | credit_amount > 8858: bad (8.0)
| | credit_history = delayed previously: good (25.0/6.0)
| | credit_history = critical/other existing credit: good (26.0/7.0)
| other_parties = co applicant: bad (7.0/1.0)
| other_parties = guarantor: good (18.0/4.0)
checking_status = >=200: good (44.0/9.0)
checking_status = no checking
| other_payment_plans = bank: good (30.0/10.0)
| other_payment_plans = stores: good (12.0/2.0)
| other_payment_plans = none
| | credit_history = no credits/all paid: good (4.0)
| | credit_history = all paid: good (1.0)
| | credit_history = existing paid
| | | existing_credits <= 1: good (92.0/7.0)
| | | existing_credits > 1
| | | | installment_commitment <= 2: bad (4.0/1.0)
| | | | installment_commitment > 2: good (5.0)
| | credit_history = delayed previously: good (22.0/6.0)
| | credit_history = critical/other existing credit: good (92.0/3.0)

Number of Leaves : 47

Size of the tree : 64

Time taken to build model: 0.02 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0.02 seconds

=== Summary ===

Correctly Classified Instances 786 78.6 %


Incorrectly Classified Instances 214 21.4 %
Kappa statistic 0.4134
Mean absolute error 0.2936
Root mean squared error 0.3908
Relative absolute error 69.8833 %
Root relative squared error 85.2771 %
Coverage of cases (0.95 level) 99.1 %
Mean rel. region size (0.95 level) 89.85 %
Total Number of Instances 1000

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.946 0.587 0.790 0.946 0.861 0.447 0.811 0.897 good
0.413 0.054 0.765 0.413 0.537 0.447 0.811 0.649 bad
Weighted Avg. 0.786 0.427 0.783 0.786 0.764 0.447 0.811 0.823

=== Confusion Matrix ===

a b <-- classified as
662 38 | a = good
176 124 | b = bad
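From the confusion matrix, 662 good and 124 bad instances are classified correctly, so the accuracy works out to (662 + 124) / 1000 = 78.6 %, matching the summary above; the 176 bad cases predicted as good account for most of the error.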

12. (Extra Credit): How can you convert a decision tree into "if-then-else" rules? Make up your own small decision tree consisting of 2-3 levels and convert it into a set of rules. There also exist classifiers that output the model directly in the form of rules - one such classifier in Weka is rules.PART; train this model and report the set of rules obtained. Sometimes just one attribute can be good enough to make the decision - yes, just one! Can you predict which attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.

In Weka, rules.PART is one of the classifiers that converts decision trees into IF-THEN-ELSE rules: it builds partial decision trees and extracts an ordered rule list from them.

Converting decision trees into IF-THEN-ELSE rules using the rules.PART classifier (weather dataset):

PART decision list

outlook = overcast: yes (4.0)

windy = TRUE: no (4.0/1.0)

outlook = sunny: no (3.0/1.0)

: yes (3.0)

Number of Rules : 4
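A PART decision list is read top to bottom: the first rule whose condition matches an instance fires, and the final unconditional rule is the default. A hand conversion of the four rules above into nested if-then-else, sketched in Java (the variables outlook and windy are assumed to hold one instance's attribute values):

// Hand conversion of the PART decision list above (weather data).
// Each branch corresponds to one rule, tested in the order PART lists them.
String play;
if (outlook.equals("overcast")) {
    play = "yes";                       // outlook = overcast: yes (4.0)
} else if (windy) {
    play = "no";                        // windy = TRUE: no (4.0/1.0)
} else if (outlook.equals("sunny")) {
    play = "no";                        // outlook = sunny: no (3.0/1.0)
} else {
    play = "yes";                       // default rule: yes (3.0)
}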

Yes, sometimes just one attribute can be good enough to make the decision. In this dataset (weather), that single attribute is outlook. The rule obtained by training a OneR classifier:

outlook:

sunny -> no

overcast -> yes

rainy -> yes

(10/14 instances correct)

With respect to training time, the OneR classifier ranks first, J48 comes second and PART third.

             J48     PART    OneR
TIME (sec)   0.12    0.14    0.04
RANK         II      III     I

But with respect to accuracy, the J48 classifier ranks first, PART comes second and OneR last.

               J48     PART    OneR
ACCURACY (%)   70.5    70.2    66.8
RANK           I       II      III
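Both rankings can be reproduced without the GUI through the Weka Java API. The following is a minimal sketch, assuming the dataset is stored as credit-g.arff with the class as the last attribute (the file name, timing method and 10-fold cross-validation are assumptions, not part of the lab output):

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);      // class is the last attribute
        Classifier[] models = { new J48(), new PART(), new OneR() };
        for (Classifier model : models) {
            long start = System.currentTimeMillis();
            model.buildClassifier(data);                   // time to build the model
            long buildTime = System.currentTimeMillis() - start;
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1)); // accuracy estimate
            System.out.printf("%-5s build time: %d ms, accuracy: %.1f %%%n",
                    model.getClass().getSimpleName(), buildTime, eval.pctCorrect());
        }
    }
}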

Extra Experiments:

13. Generate Association rules for the following transactional database using Apriori algorithm.

TID     List of Items
T100    I1,I2,I5
T200    I2,I4
T300    I2,I3
T400    I1,I2,I4
T500    I1,I3
T600    I2,I3
T700    I1,I3
T800    I1,I2,I3,I5
T900    I1,I2,I3

Step-1
Create an Excel document of the above data (one column per item, as shown below) and save it as a CSV (comma delimited) file.

Tid    I1    I2    I3    I4    I5
T100   yes   yes   no    no    yes
T200   no    yes   no    yes   no
T300   no    yes   yes   no    no
T400   yes   yes   no    yes   no
T500   yes   no    yes   no    no
T600   no    yes   yes   no    no
T700   yes   no    yes   no    no
T800   yes   yes   yes   no    yes
T900   yes   yes   yes   no    no
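The resulting CSV file is plain text: the first line holds the attribute names and each following line one transaction. Its contents look like this:

Tid,I1,I2,I3,I4,I5
T100,yes,yes,no,no,yes
T200,no,yes,no,yes,no
T300,no,yes,yes,no,no
T400,yes,yes,no,yes,no
T500,yes,no,yes,no,no
T600,no,yes,yes,no,no
T700,yes,no,yes,no,no
T800,yes,yes,yes,no,yes
T900,yes,yes,yes,no,no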

Step-2
Open the WEKA tool and click on Explorer.

Step-3
Click on the Open file button and browse to the .csv file saved earlier.

Step-4
Click on the Associate tab at the top, click Choose, select Apriori and click OK.

Step-5
Then click the Start button and examine the generated association rules:

=== Run information ===

Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: customer
Instances: 9
Attributes: 6
Tid
I1
I2
I3
I4
I5
=== Associator model (full training set) ===

Apriori
=======

Minimum support: 0.35 (3 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 13

Generated sets of large itemsets:

Size of set of large itemsets L(1): 7

Size of set of large itemsets L(2): 13

Size of set of large itemsets L(3): 9

Size of set of large itemsets L(4): 2

Best rules found:

1. I3=yes 6 ==> I4=no 6 conf:(1)
2. I4=no I5=no 5 ==> I3=yes 5 conf:(1)
3. I3=yes I5=no 5 ==> I4=no 5 conf:(1)
4. I1=yes I3=yes 4 ==> I4=no 4 conf:(1)
5. I2=yes I3=yes 4 ==> I4=no 4 conf:(1)
6. I1=no 3 ==> I2=yes 3 conf:(1)
7. I1=no 3 ==> I5=no 3 conf:(1)
8. I3=no 3 ==> I2=yes 3 conf:(1)
9. I1=no I5=no 3 ==> I2=yes 3 conf:(1)
10. I1=no I2=yes 3 ==> I5=no 3 conf:(1)
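The same run can be reproduced from code via the Weka Java API. A minimal sketch, assuming the CSV above was saved as transactions.csv (setNumRules corresponds to the -N option and setMinMetric to the -C option seen in the Scheme line):

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("transactions.csv"); // assumed file name
        Apriori apriori = new Apriori();
        apriori.setNumRules(10);     // -N 10: report the 10 best rules
        apriori.setMinMetric(0.9);   // -C 0.9: minimum confidence
        apriori.buildAssociations(data);
        System.out.println(apriori); // prints the large itemsets and best rules
    }
}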


14. Generate classification rules for the following database using a decision tree (J48).

RID   Age           Income   Student   Credit_rating   Class:buys_computer
1     Youth         High     No        Fair            No
2     Youth         High     No        Excellent       No
3     Middle_aged   High     No        Fair            Yes
4     Senior        Medium   No        Fair            Yes
5     Senior        Low      Yes       Fair            Yes
6     Senior        Low      Yes       Excellent       No
7     Middle_aged   Low      Yes       Excellent       Yes
8     Youth         Medium   No        Fair            No
9     Youth         Low      Yes       Fair            Yes
10    Senior        Medium   Yes       Fair            Yes
11    Youth         Medium   Yes       Excellent       Yes
12    Middle_aged   Medium   No        Excellent       Yes
13    Middle_aged   High     Yes       Fair            Yes
14    Senior        Medium   No        Excellent       No
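J48 puts age at the root because, of the four predictive attributes, it reduces class impurity the most. As a check (not part of the Weka output), the classic information-gain calculation for this table (9 yes, 5 no) is:

Info(D)     = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940 bits
Info_age(D) = (5/14)(0.971) + (4/14)(0.000) + (5/14)(0.971) ≈ 0.694 bits
Gain(age)   = 0.940 - 0.694 = 0.246 bits

This exceeds Gain(income) ≈ 0.029, Gain(student) ≈ 0.151 and Gain(credit_rating) ≈ 0.048, so age is selected first (J48 actually ranks attributes by the closely related gain ratio, which gives the same winner here).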

Step-1
Create an Excel document of the above data and save it as a CSV (comma delimited) file.

Step-2
Open the WEKA tool and click on Explorer.

Step-3
Click on the Open file button and browse to the .csv file saved earlier.

Step-4
Click on the Classify tab at the top, click Choose and select the J48 classifier, then set the test option to "Use training set".

Step-5
Then click the Start button and examine the generated classification results:

=== Run information ===

Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: buys
Instances: 14
Attributes: 6
Rid
age
income
student
credit_rating
class:buys_computer
Test mode:evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree
------------------

age = youth
| student = no: no (3.0)
| student = yes: yes (2.0)
age = middle_aged: yes (4.0)
age = senior
| credit_rating = fair: yes (3.0)
| credit_rating = excellent: no (2.0)

Number of Leaves : 5

Size of the tree : 8

Time taken to build model: 0.02 seconds

=== Evaluation on training set ===


=== Summary ===

Correctly Classified Instances 14 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 14

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 0 1 1 1 1 no
1 0 1 1 1 1 yes
Weighted Avg. 1 0 1 1 1 1

=== Confusion Matrix ===

a b <-- classified as
5 0 | a = no
0 9 | b = yes

Step-6
Right-click the entry in the Result list and choose "Visualize tree" to see the decision tree.

Output: (screenshot of the visualized J48 tree)
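The same experiment can also be scripted with the Weka Java API instead of the Explorer. A minimal sketch, assuming the data was saved as buys.csv with class:buys_computer as the last column:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BuysComputerJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("buys.csv");  // assumed file name
        data.setClassIndex(data.numAttributes() - 1);  // class:buys_computer is last
        J48 tree = new J48();                          // default options: -C 0.25 -M 2
        tree.buildClassifier(data);
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data);                // evaluate on the training data
        System.out.println(tree);                      // the pruned tree shown above
        System.out.println(eval.toSummaryString());
    }
}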


VIVA QUESTIONS:

1. What is Data mining?


Data mining refers to extracting or "mining" knowledge from large amounts of data. It is considered a synonym for another popularly used term, Knowledge Discovery in Databases (KDD).

2. Define data warehouse


A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. It is constructed via a process of data cleaning, data transformation, data integration, data loading and periodic data refreshing.

3. What is an association analysis?


Association analysis is the discovery of association rules showing attribute-value conditions that occur
frequently together in a given set of data. It is widely used for market basket or transaction data
analysis.

4. Define Classification.
It is the process of finding a set of models that describe and distinguish data classes or concepts, for the purpose of using the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data.

5. What are the GUIs available in Weka?
6. What are the data formats supported by Weka? (ARFF and CSV.)
7. What are filters in Weka?
8. What is the difference between a categorical and a real-valued attribute?
9. What is a relation?
10. What is data?
11. What are the four data types used in the creation of an ARFF file?
12. What are the .xls, .arff and .csv file formats?
13. How do you convert an Excel document into a CSV document?
14. What are the data preprocessing tools available in Weka?
15. What is a filter?
16. How do you use filters in Weka?
17. What is data discretization?
18. What is the Association functionality of data mining?
19. List out the algorithms used for association.
20. What is a minimum support threshold?
21. What is a minimum confidence threshold?
22. When is an association rule called strong?
23. What is data classification?
24. Which techniques are used to implement data classification?
25. What are training data and test data?
26. What is a decision tree?
