Commit 7a41e88

Update README.md
1 parent 8231e2b commit 7a41e88

1 file changed

+43
-1
lines changed

README.md

Lines changed: 43 additions & 1 deletion
@@ -27,11 +27,53 @@ For maximum speed in the analysis this is the information you should pass to a s
2727

2828
1. The raw data.
2929
2. A [tidy data set](http://vita.had.co.nz/papers/tidy-data.pdf)
30-
3. An explicit and exact recipe you used to go from 1 -> 2
30+
3. A code book describing each variable and its values in the tidy data set.
31+
4. An explicit and exact recipe you used to go from 1 -> 2,3
3132

3233
Let's look at each part of the data package you will transfer.
3334

3435

36+
### The raw data
37+
38+
It is critical that you include the rawest form of the data that you have access to. Here are some examples of the
39+
raw form of data:
40+
41+
* The strange binary file your measurement machine spits out
42+
* The unformatted Excel file with 10 worksheets the company you contracted with sent you
43+
* The complicated JSON data you got from scraping the Twitter API
44+
* The hand-entered numbers you collected looking through a microscope
45+
46+
You know the raw data is in the right format if you:
47+
48+
1. Ran no software on the data
49+
2. Did not manipulate any of the numbers in the data
50+
3. Did not remove any data from the data set
51+
4. Did not summarize the data in any way
52+
53+
If you manipulated the data in any way, it is not the raw form of the data. Reporting manipulated data
54+
as raw data is a very common way to slow down the analysis process, since the analyst will often have to do a
55+
forensic study of your data to figure out why the raw data looks weird.
56+
57+
### The tidy data set
58+
59+
The general principles of tidy data are laid out by Hadley Wickham in [this paper](http://vita.had.co.nz/papers/tidy-data.pdf)
60+
and [this video](http://vimeo.com/33727555). The paper and the video both focus on the R language, which you
61+
may or may not know how to use. Regardless, the four general principles you should pay attention to are:
62+
63+
1. Each variable you measure should be in one column
64+
2. Each different observation of that variable should be in a different row
65+
3. There should be one table for each "kind" of variable
66+
4. If you have multiple tables, they should include a column in the table that allows them to be linked
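
As a rough illustration of these principles, here is a minimal Python sketch (standard library only) that reshapes a "wide" table into a tidy one and then links it to a second table through a shared key column. The table names, column names (`patient_id`, `visit`, `score`, `age`), values, and helper functions are all invented for the example; the paper and video themselves work in R.

```python
# Hypothetical "wide" table: one row per patient, one column per visit.
# Here the visit number is hidden in the column names, so it is not tidy.
wide = [
    {"patient_id": 1, "visit_1": 5.1, "visit_2": 4.8},
    {"patient_id": 2, "visit_1": 6.3, "visit_2": 6.0},
]


def to_tidy(rows, id_col, var_name, value_name):
    """Turn every non-id column into its own row (one observation per row)."""
    tidy = []
    for row in rows:
        for col, value in row.items():
            if col == id_col:
                continue
            tidy.append({id_col: row[id_col], var_name: col, value_name: value})
    return tidy


# Tidy version: each variable (patient_id, visit, score) is one column.
measurements = to_tidy(wide, "patient_id", "visit", "score")

# A second table for a different "kind" of variable (patient-level facts).
patients = [
    {"patient_id": 1, "age": 34},
    {"patient_id": 2, "age": 51},
]


def link(left, right, key):
    """Join two tables on a shared key column, keeping only matching rows."""
    lookup = {r[key]: r for r in right}
    return [{**l, **lookup[l[key]]} for l in left if l[key] in lookup]


linked = link(measurements, patients, "patient_id")
```

Keeping `patients` separate from `measurements` avoids repeating each patient's age on every visit row, and the shared `patient_id` column is what lets an analyst put the two back together.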
67+
68+
69+
### The code book
70+
71+
72+
### The instruction list/script
73+
74+
75+
76+
3577

3678
What you should expect from a statistician
3779
====================
