@@ -63,14 +63,79 @@ may or may not know how to use. Regardless the three general principles you shou
63631 . Each variable you measure should be in one column
64642 . Each different observation of that variable should be in a different row
65653 . There should be one table for each "kind" of variable
66- 4 . If you have multiple tables, they should include a row in the table that allows them to be linked
66+ 4 . If you have multiple tables, they should include a column in the table that allows them to be linked
67+
68+ While these are the hard and fast rules, there are a number of other things that will make your data set much easier
69+ to handle. First is to include a row at the top of each data table/spreadsheet that contains full column names.
70+ So if you measured age at diagnosis for patients, you would head that column with the name AgeAtDiagnosis instead
71+ of something like ADx or another abbreviation that may be hard for another person to understand.
72+
73+
74+ Here is an example of how this would work from genomics. Suppose that for 20 people you have collected gene expression measurements with
75+ [ RNA-sequencing] ( http://en.wikipedia.org/wiki/RNA-Seq ) . You have also collected demographic and clinical information
76+ about the patients including their age, treatment, and diagnosis. You would have one table/spreadsheet that contains the clinical/demographic
77+ information. It would have four columns (patient id, age, treatment, diagnosis) and 21 rows (a row with variable names, then one row
78+ for every patient). You would also have one spreadsheet for the summarized genomic data. Usually this type of data
79+ is summarized at the level of the number of counts per exon. Suppose you have 100,000 exons; then you would have a
80+ table/spreadsheet that had 21 rows (a row for exon names, and one row for each patient) and 100,001 columns (one column for patient
81+ ids and one column for each exon).
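
As a rough illustration of the two-table layout described above, here is a minimal Python sketch using only the standard library. Everything in it is made up for illustration: the three patients, the two exon columns, the values, and the column names `PatientId`, `Exon1`, and `Exon2` are hypothetical stand-ins for the 20 patients and 100,000 exons in the example.

```python
import csv
import io

# Hypothetical clinical/demographic table: one row per patient, one column per variable.
clinical_csv = """PatientId,Age,Treatment,Diagnosis
P01,54,DrugA,StageII
P02,61,Placebo,StageI
P03,47,DrugA,StageIII
"""

# Hypothetical summarized genomic table: one row per patient, one column per exon.
expression_csv = """PatientId,Exon1,Exon2
P01,120,8
P02,95,14
P03,143,3
"""


def read_table(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


clinical = read_table(clinical_csv)
expression = read_table(expression_csv)

# The shared PatientId column is what allows the two tables to be linked.
expression_by_id = {row["PatientId"]: row for row in expression}
linked = [{**row, **expression_by_id[row["PatientId"]]} for row in clinical]

for row in linked:
    print(row["PatientId"], row["Age"], row["Exon1"], row["Exon2"])
```

The point of the sketch is only that each table keeps its own "kind" of variable, and the identifier column is the single thing they have in common.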
6782
6883
6984### The code book
7085
86+ For almost any data set, the measurements you calculate will need to be described in more detail than you can fit
87+ into the spreadsheet. The code book contains this information. At minimum it should contain:
88+
89+ 1 . Information about the variables (including units!) in the data set not contained in the tidy data
90+ 2 . Information about the summary choices you made
91+ 3 . Information about the experimental study design you used
92+
93+ In our genomics example, the analyst would want to know what the unit of measurement for each
94+ clinical/demographic variable is (age in years, treatment by name/dose, level of diagnosis and how heterogeneous). They
95+ would also want to know how you picked the exons you used for summarizing the genomic data (UCSC/Ensembl, etc.). They
96+ would also want to know any other information about how you did the data collection/study design. For example,
97+ are these the first 20 patients that walked into the clinic? Are they 20 highly selected patients by some characteristic
98+ like age? Are they randomized to treatments?
99+
100+ A common format for this document is a Word file. There should be a section called "Study design" that has a thorough
101+ description of how you collected the data. There should also be a section called "Code book" that describes each variable and its
102+ units.
103+
104+ ### How to code variables
105+
106+ When you put variables into a spreadsheet there are several main categories you will run into:
107+
108+ 1 . Continuous
109+ 2 . Ordinal
110+ 3 . Categorical
111+ 4 . Missing
112+ 5 . Censored
113+
114+ Continuous variables are anything measured on a quantitative scale that could be any fractional number. An example
115+ would be something like weight measured in kg. Ordinal data are data that have a fixed, small (< 100) number of levels but are ordered.
116+ This could be for example survey responses where the choices are: poor, fair, good. Categorical data are data where there
117+ are multiple categories, but they aren't ordered. One example would be sex: male or female. Missing data are data
118+ that are missing and you don't know the mechanism. You should code missing values as NA. Censored data are data
119+ where you know the missingness mechanism on some level. Common examples are a measurement being below a detection limit
120+ or a patient being lost to follow-up. They should also be coded as NA when you don't have the data. But you should
121+ also add a new column to your tidy data called "VariableNameCensored", which should have values of TRUE if censored
122+ and FALSE if not. In the code book you should explain why those values are missing. It is absolutely critical to report
123+ to the analyst if there is a reason you know about that some of the data are missing. You should also not impute/make up/
124+ throw away missing observations.
125+
126+ In general, try to avoid coding categorical or ordinal variables as numbers. When you enter the value for sex in the tidy
127+ data, it should be "male" or "female". The ordinal values in the data set should be "poor", "fair", and "good", not 1, 2, 3.
128+ This will avoid potential mixups about which direction effects go and will help identify coding errors.
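
The coding conventions above can be sketched in a few lines of Python (standard library only). This is an illustration under stated assumptions, not a prescribed implementation: the records, the `Biomarker` variable, and the `below_limit` flag are hypothetical, and `BiomarkerCensored` simply follows the "VariableNameCensored" naming pattern.

```python
import csv
import io

# Hypothetical measurements; None marks a value that was not obtained.
# "below_limit" records whether we know the value fell below a detection limit.
records = [
    {"PatientId": "P01", "Sex": "female", "Health": "good", "Biomarker": 2.7, "below_limit": False},
    {"PatientId": "P02", "Sex": "male", "Health": "poor", "Biomarker": None, "below_limit": True},
    {"PatientId": "P03", "Sex": "female", "Health": "fair", "Biomarker": None, "below_limit": False},
]


def tidy_row(rec):
    """Code one record following the conventions described above."""
    missing = rec["Biomarker"] is None
    return {
        "PatientId": rec["PatientId"],
        "Sex": rec["Sex"],        # categorical: text label, not a number
        "Health": rec["Health"],  # ordinal: "poor", "fair", "good", not 1, 2, 3
        "Biomarker": "NA" if missing else rec["Biomarker"],
        # TRUE only when the value is missing for a known reason
        "BiomarkerCensored": "TRUE" if (missing and rec["below_limit"]) else "FALSE",
    }


fieldnames = ["PatientId", "Sex", "Health", "Biomarker", "BiomarkerCensored"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(tidy_row(rec) for rec in records)
tidy_csv = buffer.getvalue()
print(tidy_csv)
```

Note that both P02 and P03 are written as NA; only the extra column distinguishes the censored value from the one that is missing for an unknown reason, and the explanation of why still belongs in the code book.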
129+
71130
72131### The instruction list/script
73132
133+ You may have heard this before, but [ reproducibility is kind of a big deal in computational science] ( http://www.sciencemag.org/content/334/6060/1226 ) .
134+ That means, when you submit your paper, the reviewers and the rest of the world should be able to exactly replicate
135+ the analyses from raw data all the way to final results. If you are trying to be efficient, you will likely perform
136+ some summarization/data analysis steps before the data can be considered tidy.
137+
138+
74139
75140
76141