QB 10 Marker
1. Data Cleaning
The first and most important step in data mining is data cleaning. It matters because dirty data, if used directly in mining, can confuse procedures and produce inaccurate results. This step removes noisy or incomplete data from the data collection. Some methods can clean the data themselves, but they are not robust. Data cleaning carries out its work through the following steps:
(i) Filling in the missing data: Missing values can be handled in various ways, such as filling them in manually, using a measure of central tendency (the attribute mean or median), ignoring the tuple, or filling in the most probable value.
(ii) Removing the noisy data: Random error or variance in a measured variable is called noise. Noise can be removed by the method of binning.
Binning methods work by first sorting the values and distributing them into bins, or buckets.
Smoothing is then performed by consulting the neighbouring values in each bin.
Smoothing by bin means: each value in a bin is replaced by the mean of that bin.
Smoothing by bin medians: each value in a bin is replaced by the bin median.
Smoothing by bin boundaries: the minimum and maximum values of a bin form its boundaries, and each value in the bin is replaced by the closest boundary value.
Finally, outliers are identified and inconsistencies are resolved.
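A minimal Python sketch of the binning idea just described; the sample values and the bin size of three are illustrative assumptions, not taken from the text:

    # Minimal sketch of equal-frequency binning with smoothing by bin means.
    # The sample data and bin size are illustrative assumptions.
    values = [4, 8, 15, 21, 21, 24, 25, 28, 34]
    bin_size = 3

    # Step 1: sort the values and partition them into bins (buckets).
    sorted_vals = sorted(values)
    bins = [sorted_vals[i:i + bin_size] for i in range(0, len(sorted_vals), bin_size)]

    # Step 2: smooth each bin by replacing every value with the bin mean.
    smoothed_by_mean = [[sum(b) / len(b)] * len(b) for b in bins]

    # Smoothing by bin boundaries: replace each value with the closest boundary.
    smoothed_by_boundary = [
        [min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins
    ]

    print(bins)                  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
    print(smoothed_by_mean)      # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
    print(smoothed_by_boundary)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]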
2. Data Integration
When multiple data sources such as databases, data cubes, or flat files are combined for analysis, the process is called data integration. This improves the accuracy and speed of the mining process. Different databases often use different naming conventions for the same variables, which causes redundancies and inconsistencies. These redundancies and inconsistencies can be removed by further data cleaning without affecting the reliability of the data. Data integration is performed using migration tools such as Oracle Data Service Integrator and Microsoft SQL Server Integration Services (SSIS).
3. Data Reduction
This technique obtains a reduced representation of the data that is much smaller in volume while maintaining its integrity, so that only data relevant to the analysis is kept. Data reduction is performed using methods such as Naive Bayes, decision trees, neural networks, etc. Some strategies for reducing the data are:
Smoothing: removing noise from the data using methods such as clustering and regression.
Aggregation: applying summary operations to the data.
Normalisation: scaling the data so that it falls within a smaller range.
Discretization: replacing raw values of numeric data with interval labels (a short sketch of normalisation and discretization follows this list).
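A small Python sketch of the last two strategies; the sample values, the [0, 1] target range, and the interval width of 250 are illustrative assumptions:

    # Sketch of two of the strategies above: normalisation and discretization.
    values = [200, 300, 400, 600, 1000]

    # Min-max normalisation: scale values into the range [0, 1].
    lo, hi = min(values), max(values)
    normalised = [(v - lo) / (hi - lo) for v in values]
    print(normalised)  # [0.0, 0.125, 0.25, 0.5, 1.0]

    # Discretization: replace raw numeric values with interval (bin) labels.
    def to_interval(v, width=250):
        low = (v // width) * width
        return f"[{low}, {low + width})"

    print([to_interval(v) for v in values])
    # ['[0, 250)', '[250, 500)', '[250, 500)', '[500, 750)', '[1000, 1250)']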
5. Data Mining
Data mining is the process of identifying interesting patterns and extracting knowledge from an extensive database. Intelligent methods are applied to extract the data patterns. The data is represented in the form of patterns, and models are structured using classification and clustering techniques.
6. Pattern Evaluation
Pattern evaluation is the process of identifying the interesting patterns that represent knowledge, based on interestingness measures. Data summarization and visualisation methods make the data understandable to the user.
7. Knowledge Representation
Data visualisation and knowledge representation tools represent the mined data in this step.
Data is visualised in the form of reports, tables, etc.
Data mining involves extracting valuable and meaningful patterns, insights, and knowledge from large datasets. As the amount of data generated by individuals, organizations, and machines has grown exponentially, data mining, the process of extracting knowledge from data, has become increasingly important. However, data mining is not without its challenges. The main challenges of data mining are explored below.
1]Data Quality
The quality of data used in data mining is one of the most significant challenges. The
accuracy, completeness, and consistency of the data affect the accuracy of the results
obtained. The data may contain errors, omissions, duplications, or inconsistencies,
which may lead to inaccurate results. Moreover, the data may be incomplete,
meaning that some attributes or values are missing, making it challenging to obtain a
complete understanding of the data.
Data quality issues can arise due to a variety of reasons, including data entry errors,
data storage issues, data integration problems, and data transmission errors. To
address these challenges, data mining practitioners must apply data cleaning and
data preprocessing techniques to improve the quality of the data. Data cleaning
involves detecting and correcting errors, while data preprocessing involves
transforming the data to make it suitable for data mining.
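As a hedged illustration of these cleaning and preprocessing steps, here is a small sketch using pandas (assumed to be installed); the DataFrame contents and the column names "age" and "city" are made up for the example:

    # Hedged sketch: basic data cleaning / preprocessing with pandas.
    import pandas as pd

    df = pd.DataFrame({
        "age": [25, None, 31, 25, 200],          # missing value and an implausible outlier
        "city": ["Pune", "pune", None, "Pune", "Mumbai"],
    })

    # Correct obvious inconsistencies (e.g. inconsistent capitalisation).
    df["city"] = df["city"].str.title()

    # Fill missing values: numeric column with the median, categorical with the mode.
    df["age"] = df["age"].fillna(df["age"].median())
    df["city"] = df["city"].fillna(df["city"].mode()[0])

    # Remove duplicates and filter out-of-range values.
    df = df.drop_duplicates()
    df = df[df["age"].between(0, 120)]
    print(df)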
2]Data Complexity
Data complexity refers to the vast amounts of data generated by various sources,
such as sensors, social media, and the internet of things (IoT). The complexity of the
data may make it challenging to process, analyze, and understand. In addition, the
data may be in different formats, making it challenging to integrate into a single
dataset.
To address this challenge, data mining practitioners use advanced techniques such as
clustering, classification, and association rule mining. These techniques help to
identify patterns and relationships in the data, which can then be used to gain
insights and make predictions.
3]Data Privacy and Security
Data privacy and security is another significant challenge in data mining. As more
data is collected, stored, and analyzed, the risk of data breaches and cyber-attacks
increases. The data may contain personal, sensitive, or confidential information that
must be protected. Moreover, data privacy regulations such as GDPR, CCPA, and
HIPAA impose strict rules on how data can be collected, used, and shared.
To address this challenge, data mining practitioners must apply data anonymization
and data encryption techniques to protect the privacy and security of the data. Data
anonymization involves removing personally identifiable information (PII) from the
data, while data encryption involves using algorithms to encode the data to make it
unreadable to unauthorized users.
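A minimal sketch of one anonymization approach, pseudonymising an identifier by hashing and dropping directly identifying columns; the column names and the salt are illustrative assumptions, and real projects must follow the applicable regulations rather than this toy code:

    # Hedged sketch: simple anonymisation by dropping / hashing PII columns.
    import hashlib
    import pandas as pd

    df = pd.DataFrame({
        "name": ["Asha", "Ravi"],
        "email": ["asha@example.com", "ravi@example.com"],
        "purchase_amount": [1200, 450],
    })

    SALT = "replace-with-a-secret-salt"   # assumption: kept secret in practice

    def pseudonymise(value: str) -> str:
        # One-way hash so records can still be linked without exposing the raw PII.
        return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:12]

    df["user_id"] = df["email"].apply(pseudonymise)
    df = df.drop(columns=["name", "email"])   # remove directly identifying fields
    print(df)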
4]Scalability
Data mining algorithms must be scalable to handle large datasets efficiently. As the
size of the dataset increases, the time and computational resources required to
perform data mining operations also increase. Moreover, the algorithms must be able
to handle streaming data, which is generated continuously and must be processed in
real-time.
To address this challenge, data mining practitioners use distributed computing
frameworks such as Hadoop and Spark. These frameworks distribute the data and
processing across multiple nodes, making it possible to process large datasets
quickly and efficiently.
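A hedged sketch of a distributed aggregation with PySpark (assuming Spark and pyspark are available; the HDFS path and the "category" column are made-up examples):

    # Hedged sketch: distributed aggregation with PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("scalable-mining-sketch").getOrCreate()

    # Spark splits the file into partitions and processes them across the cluster.
    df = spark.read.csv("hdfs:///data/transactions.csv", header=True, inferSchema=True)

    # The groupBy/count runs in parallel on the worker nodes and is combined at the end.
    df.groupBy("category").count().orderBy("count", ascending=False).show(10)

    spark.stop()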
5]Interpretability
Data mining algorithms can produce complex models that are difficult to interpret.
This is because the algorithms use a combination of statistical and mathematical
techniques to identify patterns and relationships in the data. Moreover, the models
may not be intuitive, making it challenging to understand how the model arrived at a
particular conclusion.
To address this challenge, data mining practitioners use visualization techniques to
represent the data and the models visually. Visualization makes it easier to
understand the patterns and relationships in the data and to identify the most
important variables.
6]Ethics
Data mining raises ethical concerns related to the collection, use, and dissemination
of data. The data may be used to discriminate against certain groups, violate privacy
rights, or perpetuate existing biases. Moreover, data mining algorithms may not be
transparent, making it challenging to detect biases or discrimination.
OR
Major challenges in Data mining:
1. Security and social challenges
2. Noisy and incomplete data
3. Distributed data
4. Complex data
5. Performance
6. Scalability and efficiency of the algorithm
7. Improvement of mining algorithm
8. Incorporation of background knowledge
Data mining encompasses several key techniques that help extract valuable
patterns, insights, and knowledge from large datasets. These techniques are the
building blocks of the data mining process and are chosen to achieve specific goals
based on the nature of the data and the problem at hand. Here are the main data
mining techniques:
1. Decision Trees:
Decision trees are hierarchical structures that partition data based on a
series of decisions. Each internal node represents a decision based on a
feature, leading to branches representing different outcomes.
They're used for classification and regression tasks and are easy to
understand and interpret. Examples include C4.5, CART, and Random
Forests.
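As a hedged illustration, a minimal decision-tree classifier using scikit-learn (assumed to be installed); scikit-learn's implementation is CART-based, and the Iris dataset and the depth limit are arbitrary choices:

    # Hedged sketch: a decision tree classifier with scikit-learn.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # max_depth limits how many decisions the tree may chain together.
    tree = DecisionTreeClassifier(max_depth=3, random_state=42)
    tree.fit(X_train, y_train)
    print("test accuracy:", tree.score(X_test, y_test))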
2. Random Forest:
Random Forest is an ensemble technique that builds multiple decision
trees and combines their predictions. It reduces overfitting and
improves generalization by averaging the results of individual trees.
It's effective for classification and regression tasks and handles high-
dimensional data well.
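A matching random-forest sketch, again assuming scikit-learn; the number of trees is an arbitrary choice:

    # Hedged sketch: a random forest (ensemble of decision trees) with scikit-learn.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)

    # n_estimators = number of trees whose predictions are averaged.
    forest = RandomForestClassifier(n_estimators=100, random_state=42)
    print("cross-validated accuracy:", cross_val_score(forest, X, y, cv=5).mean())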
3. Naive Bayes:
Naive Bayes is a probabilistic algorithm based on Bayes' theorem. It
assumes that features are independent, even though this assumption
may not hold in reality.
It's particularly useful for text classification, spam detection, and
sentiment analysis.
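A small Naive Bayes text-classification sketch with scikit-learn; the toy sentences and spam/ham labels are invented for illustration:

    # Hedged sketch: Naive Bayes on a tiny toy text-classification problem.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    texts = ["win a free prize now", "limited offer win cash",
             "meeting at noon tomorrow", "please review the report"]
    labels = ["spam", "spam", "ham", "ham"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)          # bag-of-words counts

    model = MultinomialNB()
    model.fit(X, labels)

    print(model.predict(vectorizer.transform(["free cash offer"])))              # likely ['spam']
    print(model.predict(vectorizer.transform(["see the report at the meeting"])))  # likely ['ham']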
4. K-Nearest Neighbors (KNN):
KNN classifies a data point based on the class labels of its k nearest
neighbors in the feature space.
It's simple and intuitive for classification tasks but might not perform
well with high-dimensional data.
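A brief KNN sketch with scikit-learn; the Wine dataset and k = 5 are arbitrary choices, and the features are scaled because KNN depends on distances:

    # Hedged sketch: k-nearest neighbours classification with scikit-learn.
    from sklearn.datasets import load_wine
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_wine(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Scaling matters for KNN because it relies on distances in feature space.
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    knn.fit(X_train, y_train)
    print("test accuracy:", knn.score(X_test, y_test))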
5. Support Vector Machines (SVM):
SVM finds a hyperplane that best separates data into different classes.
It works well for linearly separable and non-linearly separable data
through the use of kernel functions.
SVM is used for classification and regression tasks and is effective in
high-dimensional spaces.
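A short SVM sketch with scikit-learn, using the RBF kernel on data that is not linearly separable; the dataset and parameters are illustrative assumptions:

    # Hedged sketch: support vector machine with an RBF kernel.
    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # make_moons produces two interleaving half-circles: not linearly separable.
    X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The RBF kernel lets the SVM separate classes that no straight line can split.
    svm = SVC(kernel="rbf", C=1.0, gamma="scale")
    svm.fit(X_train, y_train)
    print("test accuracy:", svm.score(X_test, y_test))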
6. Neural Networks:
Neural networks consist of interconnected nodes (neurons) that
process and transmit information. They can model complex
relationships in data.
Deep learning, a subset of neural networks, has achieved remarkable
success in image recognition, natural language processing, and more.
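A minimal neural-network sketch using scikit-learn's MLPClassifier (deep-learning frameworks scale the same idea to many more layers); the layer sizes and dataset are arbitrary choices:

    # Hedged sketch: a small feed-forward neural network with scikit-learn.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Two hidden layers of 64 neurons each.
    mlp = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
    mlp.fit(X_train, y_train)
    print("test accuracy:", mlp.score(X_test, y_test))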
7. Association Rule Mining:
Association rule mining finds relationships between items in
transactional datasets. It's used for market basket analysis and
recommendations.
Apriori and FP-Growth are common algorithms for finding frequent
itemsets and generating association rules.
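A hand-rolled sketch of the support and confidence behind association rules (full Apriori and FP-Growth implementations exist in libraries such as mlxtend; the market baskets below are made up):

    # Hedged sketch: support and confidence of one candidate rule, computed by hand.
    from collections import Counter
    from itertools import combinations

    baskets = [
        {"bread", "milk"},
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"milk", "butter"},
        {"bread", "milk", "butter"},
    ]

    # Count how often every pair of items occurs together (frequent 2-itemsets).
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    n = len(baskets)
    support_bread_milk = pair_counts[("bread", "milk")] / n
    support_bread = sum("bread" in b for b in baskets) / n

    # Rule: bread -> milk
    confidence = support_bread_milk / support_bread
    print(f"support={support_bread_milk:.2f}, confidence={confidence:.2f}")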
8. Clustering:
Clustering groups similar data points together. K-Means and
Hierarchical Clustering are popular methods.
K-Means partitions data into k clusters based on similarity, while
Hierarchical Clustering builds a tree-like structure of clusters.
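A brief K-Means sketch with scikit-learn on synthetic data; the number of clusters is an assumption that happens to match the generated blobs:

    # Hedged sketch: K-Means clustering with scikit-learn.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # k (n_clusters) must be chosen up front; 3 matches the synthetic data here.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)

    print("cluster sizes:", [list(labels).count(c) for c in set(labels)])
    print("cluster centres:\n", kmeans.cluster_centers_)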
9. Regression Analysis:
Regression techniques predict a continuous numeric value. Linear
Regression and Polynomial Regression model relationships between
input features and the target variable.
Other techniques like Ridge Regression and Lasso Regression handle
multicollinearity and feature selection.
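A short regression sketch with scikit-learn, fitting ordinary linear regression and ridge regression to a made-up linear relationship:

    # Hedged sketch: linear and ridge regression on a toy relationship.
    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(100, 1))
    y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1.0, size=100)   # y is roughly 3x + 5 plus noise

    linear = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=1.0).fit(X, y)       # alpha adds a penalty that shrinks coefficients

    print("linear coefficient:", linear.coef_[0], "intercept:", linear.intercept_)
    print("ridge  coefficient:", ridge.coef_[0], "intercept:", ridge.intercept_)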
10. Dimensionality Reduction:
Techniques like Principal Component Analysis (PCA) and t-SNE reduce
the number of features while preserving important information and
patterns.
PCA transforms data into a new coordinate system that maximizes
variance, while t-SNE emphasizes preserving pairwise distances
between data points.
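A minimal PCA sketch with scikit-learn; the digits dataset and the 95% variance target are arbitrary choices:

    # Hedged sketch: dimensionality reduction with PCA.
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    X, _ = load_digits(return_X_y=True)       # 64 pixel features per image

    # Keep enough components to explain roughly 95% of the variance.
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X)

    print("original features:", X.shape[1])
    print("reduced features:", X_reduced.shape[1])
    print("variance explained:", pca.explained_variance_ratio_.sum())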
11. Time Series Analysis:
Time series techniques include Autoregressive Integrated Moving
Average (ARIMA) models, exponential smoothing, and state-space
models.
They're used to analyze and forecast data collected over time, such as
stock prices and weather patterns.
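A hand-written sketch of simple exponential smoothing (libraries such as statsmodels provide full ARIMA and state-space models); the series values and the smoothing factor alpha are illustrative assumptions:

    # Hedged sketch: simple exponential smoothing written out by hand.
    series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
    alpha = 0.3                      # weight given to the newest observation

    smoothed = [series[0]]
    for value in series[1:]:
        # New level = alpha * latest value + (1 - alpha) * previous level.
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])

    one_step_forecast = smoothed[-1]  # naive forecast for the next period
    print(smoothed)
    print("forecast:", one_step_forecast)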
12. Text Mining and Natural Language Processing (NLP):
Text mining techniques involve tasks like tokenization, stemming, and
sentiment analysis. NLP algorithms process and understand human
language.
Named Entity Recognition, Text Classification, and Language
Generation are common NLP tasks.
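A toy sketch of tokenization and lexicon-based sentiment scoring (real NLP work would use libraries such as NLTK or spaCy; the word lists and sentences are made up):

    # Hedged sketch: toy tokenization and lexicon-based sentiment scoring.
    import re

    POSITIVE = {"good", "great", "excellent", "love"}
    NEGATIVE = {"bad", "poor", "terrible", "hate"}

    def tokenize(text: str) -> list[str]:
        # Lowercase and split on anything that is not a letter.
        return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

    def sentiment(text: str) -> str:
        tokens = tokenize(text)
        score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    print(tokenize("The battery life is great, but the camera is poor."))
    print(sentiment("The battery life is great, but the camera is poor."))  # neutral (1 - 1)
    print(sentiment("I love this phone, excellent screen!"))                # positive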
https://www.geeksforgeeks.org/data-mining-techniques/
5. What is machine learning? Why must machine learning be performed? Explain its
types.
Machine learning is the field of study that gives computers the ability to learn from
data and improve their performance on a task without being explicitly programmed. It
must be performed when the rules governing a problem are too complex or too numerous
to code by hand, and when systems need to adapt automatically to large and changing
data. Machine learning can be categorized into several types based on different criteria,
including the learning approach, the nature of the task, and the desired outcome.
Here are some of the main types of machine learning:
1. Supervised Learning:
In supervised learning, the algorithm is trained on a labeled dataset,
where the input data is paired with corresponding correct output
labels.
The goal is to learn a mapping function that can predict the correct
output labels for new, unseen data.
Common tasks include classification (assigning labels to categories)
and regression (predicting numeric values).
2. Unsupervised Learning:
In unsupervised learning, the algorithm is trained on an unlabeled
dataset, where the output labels are not provided during training.
The goal is to discover underlying patterns, structures, or relationships
within the data.
Clustering (grouping similar data points) and dimensionality reduction
(reducing the number of features while retaining information) are
examples of unsupervised tasks.
3. Semi-Supervised Learning:
Semi-supervised learning combines aspects of both supervised and
unsupervised learning. It uses a small amount of labeled data along
with a larger amount of unlabeled data.
This approach can improve the performance of models, especially when
obtaining a large labeled dataset is expensive or time-consuming.
4. Reinforcement Learning:
Reinforcement learning involves training an agent to interact with an
environment to maximize a cumulative reward.
The agent learns through trial and error, receiving feedback in the form
of rewards or penalties based on its actions.
Applications include game playing, robotics, and autonomous systems.
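A hedged sketch of tabular Q-learning on a tiny made-up corridor environment, showing the trial-and-error reward loop; every detail (states, rewards, hyperparameters) is a toy assumption:

    # Hedged sketch: tabular Q-learning on a toy 1-D corridor.
    # States 0..4; the agent starts at state 0 and gets a reward of +1 only on reaching state 4.
    import random

    N_STATES = 5
    ACTIONS = ("left", "right")
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    alpha, gamma, epsilon = 0.5, 0.9, 0.3   # learning rate, discount factor, exploration rate

    def step(state, action):
        nxt = max(state - 1, 0) if action == "left" else min(state + 1, N_STATES - 1)
        reward = 1.0 if nxt == N_STATES - 1 else 0.0
        return nxt, reward

    for episode in range(200):               # episodes of trial and error
        state = 0
        while state != N_STATES - 1:
            if random.random() < epsilon:    # explore
                action = random.choice(ACTIONS)
            else:                            # exploit the current value estimates
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            nxt, reward = step(state, action)
            best_next = max(Q[(nxt, a)] for a in ACTIONS)
            # Q-learning update: nudge Q towards reward + discounted best future value.
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = nxt

    # After training, the greedy policy should be "right" in every non-terminal state.
    print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)})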
5. Deep Learning:
Deep learning is a subset of machine learning that involves neural
networks with multiple layers (deep neural networks).
Deep learning has shown remarkable success in tasks like image and
speech recognition, natural language processing, and more.
Convolutional Neural Networks (CNNs) for images and Recurrent
Neural Networks (RNNs) for sequences are common architectures.
Data mining involves several challenges and issues that need to be addressed to
ensure the successful extraction of valuable insights from large datasets. Here are
some of the key data mining issues:
1. Data Quality: Poor data quality can lead to unreliable results. Inaccurate,
incomplete, or inconsistent data can introduce noise and bias into the
analysis. Data preprocessing techniques are essential to clean and transform
the data into a suitable format.
2. Data Preprocessing: Raw data often needs to be cleaned, transformed, and
integrated from multiple sources before analysis. Data preprocessing includes
handling missing values, dealing with outliers, and normalizing or scaling
features.
3. Dimensionality: High-dimensional data can lead to the "curse of
dimensionality," where algorithms struggle due to the increased
computational complexity and the risk of overfitting. Dimensionality reduction
techniques like PCA and feature selection methods help mitigate this issue.
4. Scalability: Many datasets are large and complex, making it challenging to
apply data mining algorithms efficiently. Scalable algorithms and distributed
computing techniques are necessary to handle big data effectively.
5. Algorithm Selection: Choosing the right algorithm for a specific problem is
crucial. The performance of algorithms can vary based on the characteristics of
the data and the problem domain. Incorrect algorithm choice can lead to
suboptimal results.
6. Interpretability: Some complex algorithms, like deep learning, can be difficult
to interpret. Understanding how a model arrived at a particular decision is
essential, especially in domains where transparency is crucial (e.g., medical
diagnoses).
7. Overfitting and Underfitting: Overfitting occurs when a model captures
noise in the data rather than the underlying patterns. Underfitting occurs
when a model is too simple to capture the complexities in the data. Balancing
between these two extremes is essential for model generalization.
8. Bias and Fairness: Data mining can inherit biases present in the data, leading
to biased decisions and discriminatory outcomes. Ensuring fairness and
reducing bias in algorithms is a critical ethical concern.
9. Privacy and Security: The use of sensitive or personal data raises privacy and
security concerns. Anonymization techniques and data protection mechanisms
are needed to safeguard individuals' privacy.
10. Domain Knowledge: Data mining often requires domain knowledge to
interpret results correctly and guide the analysis. Lack of domain knowledge
can lead to misinterpretation of patterns and incorrect conclusions.
11. Imbalanced Data: Imbalanced datasets, where one class significantly
outnumbers the others, can lead to biased models that perform poorly on
minority classes. Techniques like oversampling, undersampling, and using
different evaluation metrics are used to address this issue.
12. Temporal and Spatial Data: Handling data with temporal or spatial attributes
requires specialized techniques. Time series analysis, geospatial data analysis,
and handling data with changing contexts are challenging tasks.
13. Ethical Considerations: Data mining can raise ethical dilemmas, especially
when dealing with sensitive data or using algorithms that might
disproportionately affect certain groups. Ethical guidelines and regulations
must be followed to ensure responsible data use.
14. Reproducibility and Transparency: To ensure the validity of findings, data
mining processes need to be well-documented and reproducible. Transparent
reporting of methodologies, parameters, and results is important for peer
review and verification.
15. Cost and Resource Constraints: Data mining can be resource-intensive in
terms of computation, time, and expertise. Balancing the costs and benefits of
data mining projects is essential.
KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the
extraction of useful, previously unknown, and potentially valuable information
from large datasets. The KDD process is iterative, and the steps below may need
to be repeated several times to extract accurate knowledge from the data.
The following steps are included in the KDD process:
Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the
collection.
1. Cleaning in case of Missing values.
2. Cleaning noisy data, where noise is a random or variance error.
3. Cleaning with Data discrepancy detection and Data
transformation tools.
Data Integration
Data integration is defined as the combination of heterogeneous data from multiple
sources into a common source (data warehouse). Data integration is performed using
data migration tools, data synchronization tools, and the ETL (Extract-Transform-Load)
process.
Data Selection
Data selection is defined as the process where data relevant to the analysis is
decided upon and retrieved from the data collection. For this, methods such as
neural networks, decision trees, Naive Bayes, clustering, and regression can be used.
Data Transformation
Data transformation is defined as the process of transforming the data into the
appropriate form required by the mining procedure. Data transformation is a two-step
process:
1. Data mapping: assigning elements from the source base to the destination
to capture transformations.
2. Code generation: creation of the actual transformation program.
Data Mining
Data mining is defined as the application of techniques to extract potentially
useful patterns. It transforms task-relevant data into patterns and decides the
purpose of the model, using classification or characterization.
Pattern Evaluation
Pattern evaluation is defined as identifying the interesting patterns
representing knowledge, based on given measures. It finds the interestingness
score of each pattern and uses summarization and visualization to make the
data understandable to the user.
Knowledge Representation
This involves presenting the results in a way that is meaningful and can be
used to make decisions.
Advantages of KDD
1. Improves decision-making: KDD provides valuable insights and
knowledge that can help organizations make better decisions.
2. Increased efficiency: KDD automates repetitive and time-
consuming tasks and makes the data ready for analysis, which saves
time and money.
3. Better customer service: KDD helps organizations gain a better
understanding of their customers’ needs and preferences, which can
help them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities by
identifying patterns and anomalies in the data that may indicate
fraud.
5. Predictive modeling: KDD can be used to build predictive models
that can forecast future trends and patterns.
Disadvantages of KDD
1. Privacy concerns: KDD can raise privacy concerns as it involves
collecting and analyzing large amounts of data, which can include
sensitive information about individuals.
2. Complexity: KDD can be a complex process that requires
specialized skills and knowledge to implement and interpret the
results.
3. Unintended consequences: KDD can lead to unintended
consequences, such as bias or discrimination, if the data or models
are not properly understood or used.
4. Data Quality: The KDD process heavily depends on the quality of the data;
if the data is not accurate or consistent, the results can be misleading.
5. High cost: KDD can be an expensive process, requiring significant
investments in hardware, software, and personnel.
6. Overfitting: KDD process can lead to overfitting, which is a
common problem in machine learning where a model learns the
detail and noise in the training data to the extent that it negatively
impacts the performance of the model on new unseen data.
Difference between KDD and Data Mining
Parameter | KDD | Data Mining
Output | Structured information, such as rules and models, that can be used to make decisions or predictions. | Patterns, associations, or insights that can be used to improve decision-making or understanding.
https://www.geeksforgeeks.org/kdd-process-in-data-mining/