@@ -63,14 +63,79 @@ may or may not know how to use. Regardless the three general principles you shou
1. Each variable you measure should be in one column
2. Each different observation of that variable should be in a different row
3. There should be one table for each "kind" of variable
- 4. If you have multiple tables, they should include a row in the table that allows them to be linked
+ 4. If you have multiple tables, they should include a column in the table that allows them to be linked
+
+ While these are the hard and fast rules, there are a number of other things that will make your data set much easier
+ to handle. First is to include a row at the top of each data table/spreadsheet that contains full column names.
+ So if you measured age at diagnosis for patients, you would head that column with the name AgeAtDiagnosis instead
+ of something like ADx or another abbreviation that may be hard for another person to understand.
+
+
+ Here is an example of how this would work from genomics. Suppose that for 20 people you have collected gene expression measurements with
+ [RNA-sequencing](http://en.wikipedia.org/wiki/RNA-Seq). You have also collected demographic and clinical information
+ about the patients including their age, treatment, and diagnosis. You would have one table/spreadsheet that contains the clinical/demographic
+ information. It would have four columns (patient id, age, treatment, diagnosis) and 21 rows (a row with variable names, then one row
+ for every patient). You would also have one spreadsheet for the summarized genomic data. Usually this type of data
+ is summarized at the level of the number of counts per exon. Suppose you have 100,000 exons, then you would have a
+ table/spreadsheet that had 21 rows (a row for exon names, and one row for each patient) and 100,001 columns (one column for patient
+ ids and one column for each exon).
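
The two-table layout above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not part of the original guide: the patient ids, column names, exon names, and values are all made up, and the helper functions `read_table` and `link_tables` are hypothetical names chosen for this example. It shows one header row per table and a shared `PatientId` column that lets the tables be linked.

```python
import csv
import io

# Hypothetical clinical table: one header row, then one row per patient.
clinical_csv = """PatientId,Age,Treatment,Diagnosis
P01,54,DrugA,StageII
P02,61,Placebo,StageI
P03,47,DrugA,StageIII
"""

# Hypothetical expression table: one column of patient ids, then one column per exon.
expression_csv = """PatientId,Exon1,Exon2,Exon3
P01,12,0,7
P02,3,5,9
P03,8,1,0
"""


def read_table(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def link_tables(left, right, key):
    """Join two tables on a shared column, keeping rows present in both."""
    right_by_key = {row[key]: row for row in right}
    linked = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Merge the two rows; the key column has the same value in both.
            linked.append({**row, **match})
    return linked


clinical = read_table(clinical_csv)
expression = read_table(expression_csv)
linked = link_tables(clinical, expression, key="PatientId")
```

Because both tables carry the same identifier column, the analyst can combine them without guessing which row belongs to which patient.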
### The code book
+ For almost any data set, the measurements you calculate will need to be described in more detail than you can fit
+ into the spreadsheet. The code book contains this information. At minimum it should contain:
+
+ 1. Information about the variables (including units!) in the data set not contained in the tidy data
+ 2. Information about the summary choices you made
+ 3. Information about the experimental study design you used
+
+ In our genomics example, the analyst would want to know what the unit of measurement for each
+ clinical/demographic variable is (age in years, treatment by name/dose, level of diagnosis and how heterogeneous). They
+ would also want to know how you picked the exons you used for summarizing the genomic data (UCSC/Ensembl, etc.). They
+ would also want to know any other information about how you did the data collection/study design. For example,
+ are these the first 20 patients that walked into the clinic? Are they 20 highly selected patients by some characteristic
+ like age? Are they randomized to treatments?
+
+ A common format for this document is a Word file. There should be a section called "Study design" that has a thorough
+ description of how you collected the data. There should also be a section called "Code book" that describes each variable and its
+ units.
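
One way to keep the code book and the tidy data in step is to check that every column has an entry with units. The sketch below is only an illustration under stated assumptions: the code book is represented as a plain Python dictionary rather than a Word file, the variable names and descriptions are invented, and `undocumented_columns` is a hypothetical helper.

```python
# Hypothetical code book entries: variable name -> description and units.
code_book = {
    "PatientId": {"description": "Anonymized patient identifier", "units": "none"},
    "AgeAtDiagnosis": {"description": "Age when diagnosed", "units": "years"},
    "Treatment": {"description": "Treatment name", "units": "none"},
}


def undocumented_columns(columns, book):
    """Return the columns with no code book entry or no units, in column order."""
    return [name for name in columns if name not in book or not book[name].get("units")]


# Hypothetical header row of the tidy clinical table.
tidy_columns = ["PatientId", "AgeAtDiagnosis", "Treatment", "Diagnosis"]
missing = undocumented_columns(tidy_columns, code_book)
```

Here `missing` lists the columns that still need a description before the data set is handed to the analyst.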
+
+ ### How to code variables
+
+ When you put variables into a spreadsheet there are several main categories you will run into:
+
+ 1. Continuous
+ 2. Ordinal
+ 3. Categorical
+ 4. Missing
+ 5. Censored
+
+ Continuous variables are anything measured on a quantitative scale that could be any fractional number. An example
+ would be something like weight measured in kg. Ordinal data are data that have a fixed, small (< 100) number of levels but are ordered.
+ This could be for example survey responses where the choices are: poor, fair, good. Categorical data are data where there
+ are multiple categories, but they aren't ordered. One example would be sex: male or female. Missing data are data
+ that are missing and for which you don't know the mechanism. You should code missing values as NA. Censored data are data
+ where you know the missingness mechanism on some level. Common examples are a measurement being below a detection limit
+ or a patient being lost to follow-up. They should also be coded as NA when you don't have the data. But you should
+ also add a new column to your tidy data called "VariableNameCensored", which should have values of TRUE if censored
+ and FALSE if not. In the code book you should explain why those values are missing. It is absolutely critical to report
+ to the analyst if there is a reason you know about that some of the data are missing. You should also not impute, make up, or
+ throw away missing observations.
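
The missing/censored convention described above can be sketched as follows. This is a minimal example under assumptions that are not in the original text: the variable name `Biomarker`, the detection limit of 0.5, the readings, and the helper `code_measurement` are all hypothetical. A reading of `None` stands for a value that is missing for an unknown reason.

```python
import csv
import io

DETECTION_LIMIT = 0.5  # hypothetical assay detection limit


def code_measurement(value, detection_limit):
    """Return (coded value, censored flag) for one raw reading.

    None means the reading is missing and the mechanism is unknown.
    """
    if value is None:
        return "NA", "FALSE"  # missing, mechanism unknown
    if value < detection_limit:
        return "NA", "TRUE"   # below the detection limit: censored
    return str(value), "FALSE"


# Hypothetical raw readings: (patient id, measured value or None).
raw = [("P01", 4.2), ("P02", 0.3), ("P03", None)]

rows = []
for patient_id, reading in raw:
    value, censored = code_measurement(reading, DETECTION_LIMIT)
    rows.append({"PatientId": patient_id, "Biomarker": value, "BiomarkerCensored": censored})

# Write the tidy table, header row first, without dropping any observation.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["PatientId", "Biomarker", "BiomarkerCensored"],
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)
tidy_csv = buffer.getvalue()
```

Every patient keeps a row: nothing is imputed or thrown away, and the extra column records which NA values have a known cause that the code book should explain.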
+
+ In general, try to avoid coding categorical or ordinal variables as numbers. When you enter the value for sex in the tidy
+ data, it should be "male" or "female". The ordinal values in the data set should be "poor", "fair", and "good", not 1, 2, 3.
+ This will avoid potential mix-ups about which direction effects go and will help identify coding errors.
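
Text labels also make coding errors easy to detect automatically. The sketch below assumes a hypothetical pair of columns, `Sex` and `Rating`, with the allowed labels taken from the examples above; the helper `invalid_entries` and the sample rows are invented for illustration.

```python
# Allowed labels per column; for the ordinal column the tuple order
# (poor < fair < good) records the ranking.
ALLOWED = {
    "Sex": ("male", "female"),
    "Rating": ("poor", "fair", "good"),
}


def invalid_entries(rows, allowed):
    """Return (row index, column, value) for entries that are not an allowed label.

    "NA" is accepted everywhere because it marks a missing value.
    """
    problems = []
    for index, row in enumerate(rows):
        for column, labels in allowed.items():
            value = row[column]
            if value != "NA" and value not in labels:
                problems.append((index, column, value))
    return problems


# Hypothetical tidy rows containing a numeric code and a typo.
survey = [
    {"Sex": "male", "Rating": "good"},
    {"Sex": "female", "Rating": "2"},
    {"Sex": "femal", "Rating": "NA"},
]
problems = invalid_entries(survey, ALLOWED)
```

A stray numeric code such as "2" or a misspelled label is flagged immediately, whereas a wrong number in a numerically coded column would look like valid data.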
+
### The instruction list/script
+ You may have heard this before, but [reproducibility is kind of a big deal in computational science](http://www.sciencemag.org/content/334/6060/1226).
+ That means, when you submit your paper, the reviewers and the rest of the world should be able to exactly replicate
+ the analyses from raw data all the way to final results. If you are trying to be efficient, you will likely perform
+ some summarization/data analysis steps before the data can be considered tidy.
+
+