DHV MODEL 1.2 Data Cleaning
SELECT Market_Cap
FROM Companies
WHERE Company_Name = 'Apple'
Number of Rows: 0
Problem: Missing Data
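A minimal sketch of the situation (Python with sqlite3; table contents taken from the slide that follows): the query legitimately returns nothing because Apple was never loaded, and one way to surface the gap is to diff against a reference list of expected entities. The reference list here is an assumption for illustration.

import sqlite3

# Toy Companies table; contents copied from the slide that follows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Companies (Company_Name TEXT, Address TEXT, Market_Cap TEXT)")
conn.executemany(
    "INSERT INTO Companies VALUES (?, ?, ?)",
    [("Google", "Googleplex, Mtn. View, CA", "$210Bn"),
     ("Intl. Business Machines", "Armonk, NY", "$200Bn"),
     ("Microsoft", "Redmond, WA", "$250Bn")],
)

# The slide's query: Apple was never loaded, so zero rows come back.
rows = conn.execute(
    "SELECT Market_Cap FROM Companies WHERE Company_Name = ?", ("Apple",)
).fetchall()
print(len(rows))  # 0 -- missing data, not a market cap of 0

# One way to surface the gap: diff against a reference list of entities
# you expect to be present (the reference list is an assumption here).
expected = {"Apple", "Google", "Microsoft", "Intl. Business Machines"}
present = {name for (name,) in conn.execute("SELECT Company_Name FROM Companies")}
print(expected - present)  # {'Apple'}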
DB-hard Queries
Company_Name             | Address                    | Market_Cap
Google                   | Googleplex, Mtn. View, CA  | $210Bn
Intl. Business Machines  | Armonk, NY                 | $200Bn
Microsoft                | Redmond, WA                | $250Bn
SELECT Market_Cap
FROM Companies
WHERE Company_Name = 'IBM'
Number of Rows: 0
Problem: Entity Resolution ('IBM' is stored as 'Intl. Business Machines')
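Exact string equality cannot see that 'IBM' and 'Intl. Business Machines' name the same entity. Below is a minimal entity-resolution sketch in plain Python (not from the slides) that builds an alias table from word-initial acronyms; real systems use richer similarity functions and curated alias lists.

names = ["Google", "Intl. Business Machines", "Microsoft"]

def acronym(name):
    """Word-initial letters: 'Intl. Business Machines' -> 'IBM'."""
    return "".join(word[0] for word in name.split()).upper()

# Map canonical names and their acronyms back to the canonical form.
aliases = {n.upper(): n for n in names}
aliases.update({acronym(n): n for n in names if len(n.split()) > 1})

def resolve(query):
    return aliases.get(query.upper())

print(resolve("IBM"))        # Intl. Business Machines
print(resolve("Microsoft"))  # Microsoft
print(resolve("Oracle"))     # None: unknown entity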
DB-hard Queries
Company_Name             | Address                | Market_Cap
Google                   | Googleplex, Mtn. View  | $210
Intl. Business Machines  | Armonk, NY             | $200
Microsoft                | Redmond, WA            | $250
Sally’s Lemonade Stand   | Alameda, CA            | $260
SELECT MAX(Market_Cap)
FROM Companies
Number of Rows: 1
Problem: Unit Mismatch (the lemonade stand's $260 is in plain dollars; the other figures are in billions)
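MAX happily crowns the lemonade stand, since 260 compares as larger than 250 once the units are ignored. A sketch of normalizing values to a common unit before aggregating; the suffix table, and the assumption that the big companies' figures carry a 'Bn' suffix (as in the earlier table) while unsuffixed values are plain dollars, are mine, not the slides'.

import re

# Values as recorded; 'Bn' suffixes taken from the earlier table, and the
# lemonade stand's figure is (implicitly) plain dollars.
raw = {"Google": "$210Bn",
       "Intl. Business Machines": "$200Bn",
       "Microsoft": "$250Bn",
       "Sally's Lemonade Stand": "$260"}

SCALE = {"BN": 1e9, "M": 1e6, "": 1}  # assumed meanings of the suffixes

def to_dollars(value):
    # "$210Bn" -> ("210", "Bn") -> 210 * 1e9
    number, suffix = re.fullmatch(r"\$(\d+(?:\.\d+)?)(\w*)", value).groups()
    return float(number) * SCALE[suffix.upper()]

normalized = {company: to_dollars(v) for company, v in raw.items()}
print(max(normalized, key=normalized.get))  # Microsoft, once units agree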
Who’s Calling Whose Data Dirty?
Dirty Data
• The Statistics View:
• There is a process that produces data
• Any dataset is a sample of the output of that process
• Results are probabilistic
• You can correct bias in your sample
Dirty Data
• The Database View:
• I got my hands on this data set
• Some of the values are missing, corrupted, wrong, duplicated
• Results are absolute (relational model)
• You get a better answer by improving the quality of the values in your dataset
Dirty Data
• The Domain Expert’s View:
• This data doesn’t look right
• This answer doesn’t look right
• What happened?
Dirty Data
• The Data Scientist’s View:
• Some combination of all of the above
Data Quality Problems
• Data is dirty on its own
[Diagram: the Extract, Transform, Load pipeline, with Clean and Integrate stages]
Example Data Quality Problems
T.Das|97336o8327|24.95|Y|-|0.0|1000
Ted J.|973-360-8779|2000|N|M|NY|1000
(Note the letter 'o' in place of a zero in the first phone number, and the '-' used as a missing-value placeholder.)
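Many such glitches are mechanically detectable. A minimal validation sketch in Python; the field layout (name|phone|balance|...) is an assumed interpretation of this record format.

import re

records = ["T.Das|97336o8327|24.95|Y|-|0.0|1000",
           "Ted J.|973-360-8779|2000|N|M|NY|1000"]

# Assumed field layout: name|phone|balance|flag|gender|state|limit.
PHONE = re.compile(r"^\d{3}-?\d{3}-?\d{4}$")

for record in records:
    name, phone, *rest = record.split("|")
    if not PHONE.match(phone):
        # '97336o8327' fails: the letter 'o' is standing in for a zero.
        print(f"bad phone for {name!r}: {phone}")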
Data Quality
Meaning of Data Quality (1)
• Generally, you have a problem if the data doesn’t mean what you think it does, or should
• Data not up to spec: garbage in, glitches, etc.
• You don’t understand the spec: complexity, lack of metadata
• Many sources and manifestations
• As we have discussed
• Data quality problems are expensive and pervasive
• Data quality problems cost hundreds of billions of dollars each year
• Resolving data quality problems is often the biggest effort in a data mining study
Conventional Definition of Data Quality
• Accuracy
• The data was recorded correctly.
• Completeness
• All relevant data was recorded.
• Uniqueness
• Entities are recorded once.
• Timeliness
• The data is kept up to date.
• Special problems in federated data: time consistency.
• Consistency
• The data agrees with itself.
Problems …
• Unmeasurable
• Accuracy and completeness are extremely difficult, perhaps impossible, to measure
• Context independent
• No accounting for what is important. E.g., if you are computing aggregates, you can tolerate a lot of inaccuracy
• Incomplete
• What about interpretability, accessibility, metadata, analysis, etc.?
• Vague
• The conventional definitions provide no guidance towards practical improvements of the data
Finding a modern definition
• We need a definition of data quality that
• Reflects the use of the data
• Leads to improvements in processes
• Is measurable (we can define metrics)
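As a sketch of what "measurable" can look like in practice, here are simple proxy metrics (completeness, uniqueness, staleness) computed over toy records; the records, field layout, and reference date are hypothetical.

from datetime import date

# Toy records: (company, market_cap, last_updated); all values hypothetical.
rows = [("Google", "$210Bn", date(2012, 1, 5)),
        ("Microsoft", "$250Bn", date(2012, 1, 5)),
        ("Microsoft", "$250Bn", date(2011, 6, 1)),            # duplicate entity
        ("Intl. Business Machines", None, date(2010, 3, 2))]  # missing value

today = date(2012, 2, 1)  # hypothetical reference date

completeness = sum(cap is not None for _, cap, _ in rows) / len(rows)
uniqueness = len({name for name, _, _ in rows}) / len(rows)
staleness = max((today - updated).days for _, _, updated in rows)

print(f"completeness={completeness:.2f}  uniqueness={uniqueness:.2f}  "
      f"max staleness={staleness} days")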
Physical Data Cleaning

[Figure: raw readings over time passed through a smoothing filter to produce a smoothed output]
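A minimal sketch of such a smoothing filter: a trailing moving average applied to raw readings. The window size and the glitchy reading are illustrative choices, not from the slide.

def moving_average(readings, window=3):
    """Trailing moving average: each output averages the last `window` inputs."""
    smoothed = []
    for i in range(len(readings)):
        lo = max(0, i - window + 1)
        smoothed.append(sum(readings[lo:i + 1]) / (i + 1 - lo))
    return smoothed

raw = [10.1, 10.3, 55.0, 10.2, 9.9, 10.4]  # 55.0 is a sensor glitch
print(moving_average(raw))  # the spike is damped, not removed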