@@ -27,11 +27,53 @@ For maximum speed in the analysis this is the information you should pass to a s
2727
28281 . The raw data.
29292 . A [ tidy data set] ( http://vita.had.co.nz/papers/tidy-data.pdf )
30- 3 . An explicit and exact recipe you used to go from 1 -> 2
30+ 3 . A code book describing each variable and its values in the tidy data set.
31+ 4 . An explicit and exact recipe you used to go from 1 -> 2,3
3132
3233Let's look at each part of the data package you will transfer.
3334
3435
36+ ### The raw data
37+
38+ It is critical that you include the rawest form of the data that you have access to. Here are some examples of the
39+ raw form of data:
40+
41+ * The strange binary file your measurement machine spits out
42+ * The unformatted Excel file with 10 worksheets the company you contracted with sent you
43+ * The complicated JSON data you got from scraping the Twitter API
44+ * The hand-entered numbers you collected looking through a microscope
45+
46+ You know the raw data is in the right format if you:
47+
48+ 1 . Ran no software on the data
49+ 2 . Did not manipulate any of the numbers in the data
50+ 3 . Did not remove any data from the data set
51+ 4 . Did not summarize the data in any way
52+
53+ If you did any manipulation of the data at all, it is not the raw form of the data. Reporting manipulated data
54+ as raw data is a very common way to slow down the analysis process, since the analyst will often have to do a
55+ forensic study of your data to figure out why the raw data looks weird.
56+
57+ ### The tidy data set
58+
59+ The general principles of tidy data are laid out by Hadley Wickham in [ this paper] ( http://vita.had.co.nz/papers/tidy-data.pdf )
60+ and [ this video] ( http://vimeo.com/33727555 ) . The paper and the video both focus on R, which you
61+ may or may not know how to use. Regardless, the four general principles you should pay attention to are:
62+
63+ 1 . Each variable you measure should be in one column
64+ 2 . Each different observation of that variable should be in a different row
65+ 3 . There should be one table for each "kind" of variable
66+ 4 . If you have multiple tables, they should include a column in the table that allows them to be linked
67+
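As a rough illustration of the principles above, here is a minimal Python sketch using only the standard library. The data, the column names (`subject_id`, `weight_visit1`, `sex`), and the helper functions `to_tidy` and `link` are all hypothetical and chosen only for this example; they are not part of the paper or of any particular package.

```python
# Hypothetical wide-format measurements: one row per subject, one column per visit.
wide_rows = [
    {"subject_id": "s1", "weight_visit1": 70.2, "weight_visit2": 69.8},
    {"subject_id": "s2", "weight_visit1": 82.5, "weight_visit2": 83.1},
]

# A second table holding a different "kind" of variable (subject-level facts).
subjects = [
    {"subject_id": "s1", "sex": "F"},
    {"subject_id": "s2", "sex": "M"},
]


def to_tidy(rows):
    """Reshape so each row is one observation: (subject_id, visit, weight)."""
    prefix = "weight_visit"
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key.startswith(prefix):
                tidy.append({
                    "subject_id": row["subject_id"],
                    "visit": int(key[len(prefix):]),  # visit number from the column name
                    "weight": value,
                })
    return tidy


def link(observations, lookup_rows, key):
    """Join two tables on the shared column `key`."""
    lookup = {r[key]: r for r in lookup_rows}
    return [{**obs, **lookup[obs[key]]} for obs in observations]


tidy = to_tidy(wide_rows)                    # one variable per column, one observation per row
linked = link(tidy, subjects, "subject_id")  # tables linked through the shared column
```

In this sketch, `subject_id` plays the role of the linking column: it appears in both tables, so the subject-level table can be attached to the measurement table without repeating `sex` in the raw measurements.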
68+
69+ ### The code book
70+
71+
72+ ### The instruction list/script
73+
74+
75+
76+
3577
3678What you should expect from a statistician
3779====================