Commit ab7afd7
Add files via upload
1 parent 702a96c commit ab7afd7

1 file changed: +343 -0 lines changed
@@ -0,0 +1,343 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Hands-On Data Preprocessing in Python\n",
"Learn how to effectively prepare data for successful data analytics\n",
" \n",
" AUTHOR: Dr. Roy Jafari \n",
"\n",
"### Chapter 12: Data Fusion & Data Integration \n",
"#### Exercises"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 1\n",
"In your own words, what is the difference between Data Fusion and Data Integration? Give examples other than the ones in this chapter. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 2\n",
"Answer the following question about **Challenge 4: Aggregation mismatch**. Is this challenge a data fusion challenge, a data integration challenge, or both? Explain."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 3\n",
"Why is **Challenge 2: Unwise data collection** both a data cleaning step and a data integration step? Do you think it is essential to categorize unwise data collection under either data cleaning or data integration? "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 4\n",
"In Example 1 of this chapter, we used multi-level indexing with Date and Hour to overcome the index mismatched formatting challenge. For this exercise, repeat this example, but this time use single-level indexing with the Python datetime object."
]
},
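{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the single-level indexing idea, assuming the data has separate *Date* and *Hour* columns as in Example 1 (the column names and values below are placeholders):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical frame with separate Date and Hour columns\n",
"df = pd.DataFrame({'Date': ['2016-01-01', '2016-01-01'],\n",
" 'Hour': [0, 1],\n",
" 'value': [10.0, 12.5]})\n",
"\n",
"# Combine the two columns into one datetime and use it as a single-level index\n",
"df.index = pd.to_datetime(df.Date) + pd.to_timedelta(df.Hour, unit='h')\n",
"df = df.drop(columns=['Date', 'Hour'])\n",
"```"
]
},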
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 5\n",
"Recreate **Figure 5.23** from **Chapter 5, Data Visualization**, but this time, instead of using *WH Report_preprocessed.csv*, integrate the following three files yourself first: *WH Report.csv*, *populations.csv*, and *Countires.csv*. Hint: the happiness indices come from *WH Report.csv*, the countries' continent information comes from *Countires.csv*, and the population information comes from *populations.csv*. "
]
},
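{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the integration step, assuming each file can be joined on a shared country-name column (the key column name 'Name' below is a placeholder; check the actual headers after reading the files):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"report_df = pd.read_csv('WH Report.csv')\n",
"pop_df = pd.read_csv('populations.csv')\n",
"country_df = pd.read_csv('Countires.csv')\n",
"\n",
"# Two left joins on the assumed shared key 'Name'\n",
"merged_df = (report_df\n",
" .merge(country_df, on='Name', how='left')\n",
" .merge(pop_df, on='Name', how='left'))\n",
"```"
]
},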
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 6\n",
"In **Chapter 6, Exercise 2**, we used *ToyotaCorolla_preprocessed.csv* to create a model that predicts the price of cars. In this exercise, we want to do the preprocessing ourselves. Use *ToyotaCorolla.csv* to perform the following steps.\n",
"\n",
" a.\tAre there any concerns regarding Level Ⅰ data cleaning? If yes, address them if necessary. \n",
" b.\tAre there any concerns regarding Level Ⅱ data cleaning? If yes, address them if necessary. \n",
" c.\tAre there any concerns regarding Level Ⅲ data cleaning? If yes, address them if necessary. \n",
" d.\tAre there attributes in ToyotaCorolla.csv that can be considered redundant? \n",
" e.\tApply LinearRegression from sklearn.linear_model. Did you have to remove the redundant attributes? Why/why not?\n",
" f.\tApply MLPRegressor from sklearn.neural_network. Did you have to remove the redundant attributes? Why/why not?\n"
]
},
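{
"cell_type": "markdown",
"metadata": {},
"source": [
"For parts e and f, a minimal sketch of fitting the two sklearn models, assuming your cleaned data is in a DataFrame df with Price as the target column (the names are assumptions; adjust to your own preprocessing):\n",
"\n",
"```python\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.neural_network import MLPRegressor\n",
"\n",
"X = df.drop(columns=['Price']) # predictors\n",
"y = df.Price # target\n",
"\n",
"lm = LinearRegression().fit(X, y)\n",
"mlp = MLPRegressor(max_iter=1000).fit(X, y)\n",
"```"
]
},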
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 7\n",
"We would like to use the file *Universities.csv* to cluster the universities into two meaningful clusters. However, the data source has many issues, including data cleaning issues at Levels Ⅰ-Ⅲ and data redundancy. Perform the following steps.\n",
"\n",
" a.\tDeal with the data cleaning issues.\n",
" b.\tDeal with the data redundancy issues.\n",
" c.\tUse any columns necessary, except State and Public (1)/Private (2), to find the two meaningful clusters.\n",
" d.\tPerform centroid analysis and give a name to each cluster.\n",
" e.\tDetermine whether the newly created categorical attribute cluster has a relationship with either of the two categorical attributes we intentionally did not use for clustering: State or Public (1)/Private (2).\n"
]
},
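{
"cell_type": "markdown",
"metadata": {},
"source": [
"For parts c and d, a minimal sketch of two-cluster KMeans with centroid analysis, assuming the cleaned and normalized numeric columns are in a DataFrame named Xs (a placeholder name):\n",
"\n",
"```python\n",
"from sklearn.cluster import KMeans\n",
"\n",
"# Xs: cleaned, normalized numeric columns (everything except State and Public/Private)\n",
"kmeans = KMeans(n_clusters=2)\n",
"kmeans.fit(Xs)\n",
"\n",
"# Centroid analysis: one row per cluster, one column per attribute\n",
"centroids = pd.DataFrame(kmeans.cluster_centers_, columns=Xs.columns)\n",
"print(centroids)\n",
"```"
]
},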
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise 8\n",
"\n",
"In this exercise, we will see an example of data fusion. The case study we will use was introduced under Data Fusion Example in this chapter; please go back and read it again before continuing with this exercise. \n",
"In short, we would like to integrate Yeild.csv and Treatment.csv to see whether the amount of water dispensed impacts the amount of yield.\n",
"Perform the following steps to make this happen.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" a.\tUse pd.read_csv() to read Yeild.csv into yield_df, and read Treatment.csv into treatment_df."
]
},
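{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the two reads, keeping the file names exactly as given:\n",
"\n",
"```python\n",
"yield_df = pd.read_csv('Yeild.csv')\n",
"treatment_df = pd.read_csv('Treatment.csv')\n",
"```"
]
},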
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" b.\tDraw a scatterplot of the points in treatment_df. Use the color dimension to show the amount of water that has been dispensed from each point. "
]
},
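{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch using Matplotlib, relying on the longitude, latitude, and water columns that the given code in part f also assumes:\n",
"\n",
"```python\n",
"plt.scatter(treatment_df.longitude, treatment_df.latitude,\n",
" c=treatment_df.water)\n",
"plt.colorbar(label='water dispensed')\n",
"plt.xlabel('longitude')\n",
"plt.ylabel('latitude')\n",
"plt.show()\n",
"```"
]
},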
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" c.\tDraw a scatterplot of the points in yield_df. Use the color dimension to show the amount of harvest that has been collected from each point."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" d.\tCreate a scatterplot that combines the visuals in b and c."
]
},
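{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to combine the two views, assuming yield_df has longitude and latitude columns analogous to treatment_df (the marker choices here are an assumption, not the book's method):\n",
"\n",
"```python\n",
"# Harvest points as squares, water stations as outlined circles\n",
"plt.scatter(yield_df.longitude, yield_df.latitude,\n",
" c=yield_df.harvest, marker='s')\n",
"plt.scatter(treatment_df.longitude, treatment_df.latitude,\n",
" c=treatment_df.water, marker='o', edgecolors='black')\n",
"plt.xlabel('longitude')\n",
"plt.ylabel('latitude')\n",
"plt.show()\n",
"```"
]
},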
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" e.\tFrom the scatterplots in the preceding steps, we can deduce that the water stations are spaced at an equal distance from one another. Based on this realization, calculate that equal spacing between the water points and call it radius. We are going to use this variable in the next steps of the calculations."
]
},
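{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to extract the spacing, assuming the stations form a regular grid so the gap between consecutive unique longitudes equals the spacing (a sketch under that assumption, not the only approach):\n",
"\n",
"```python\n",
"lons = np.sort(treatment_df.longitude.unique())\n",
"radius = lons[1] - lons[0] # constant gap on a regular grid\n",
"print(radius)\n",
"```"
]
},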
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" f.\tFirst, use the following code to create the function calculateDistance(). \n",
"\n",
"```python\n",
"import math\n",
"\n",
"def calculateDistance(x1, y1, x2, y2):\n",
" # Euclidean distance between the points (x1, y1) and (x2, y2)\n",
" dist = math.sqrt((x2 - x1)**2 + (y2 - y1)**2)\n",
" return dist\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Then, using the following code and the function we just created, create the function WaterReceived() so we can apply it to the rows of yield_df. \n",
" \n",
"```python\n",
"def WaterReceived(r):\n",
" # Sum the water contributions that the point in row r receives from\n",
" # every station within radius of it, weighted by proximity\n",
" w = 0\n",
" for i, rr in treatment_df.iterrows():\n",
" distance = calculateDistance(rr.longitude,\n",
" rr.latitude,\n",
" r.longitude,\n",
" r.latitude)\n",
" if distance < radius:\n",
" w = w + rr.water * ((radius - distance) / radius)\n",
" return w\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"g.\tApply **WaterReceived()** to the rows of **yield_df**, and add the newly calculated value for each row under the column name water."
]
},
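{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the apply step, using the WaterReceived() function defined in part f:\n",
"\n",
"```python\n",
"# axis=1 passes each row of yield_df to WaterReceived()\n",
"yield_df['water'] = yield_df.apply(WaterReceived, axis=1)\n",
"```"
]
},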
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" h.\tStudy the newly updated yield_df. You were just able to fuse these two data sources. Go back and study these steps, especially the creation of the function WaterReceived(). What are the assumptions that made this data fusion possible?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Answer: "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" i.\tDraw the scatterplot of the two attributes yield_df.harvest and yield_df.water. Do we see an impact of yield_df.water on yield_df.harvest?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"j.\tUse the correlation coefficient to confirm your observation from the previous step. "
]
},
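{
"cell_type": "markdown",
"metadata": {},
"source": [
"A one-line sketch using pandas' built-in correlation:\n",
"\n",
"```python\n",
"print(yield_df.water.corr(yield_df.harvest)) # Pearson correlation coefficient\n",
"```"
]
},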
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
