Commit ca1ef35

Update README.md
1 parent 7a41e88 commit ca1ef35

README.md

Lines changed: 66 additions & 1 deletion
@@ -63,14 +63,79 @@ may or may not know how to use. Regardless the three general principles you shou
6363
1. Each variable you measure should be in one column
6464
2. Each different observation of that variable should be in a different row
6565
3. There should be one table for each "kind" of variable
66-
4. If you have multiple tables, they should include a row in the table that allows them to be linked
66+
4. If you have multiple tables, they should include a column in the table that allows them to be linked
67+
68+
While these are the hard and fast rules, there are a number of other things that will make your data set much easier
69+
to handle. The first is to include a row at the top of each data table/spreadsheet that contains full variable names.
70+
So if you measured age at diagnosis for patients, you would head that column with the name AgeAtDiagnosis instead
71+
of something like ADx or another abbreviation that may be hard for another person to understand.
72+
73+
74+
Here is an example of how this would work from genomics. Suppose that for 20 people you have collected gene expression measurements with
75+
[RNA-sequencing](http://en.wikipedia.org/wiki/RNA-Seq). You have also collected demographic and clinical information
76+
about the patients including their age, treatment, and diagnosis. You would have one table/spreadsheet that contains the clinical/demographic
77+
information. It would have four columns (patient id, age, treatment, diagnosis) and 21 rows (a row with variable names, then one row
78+
for every patient). You would also have one spreadsheet for the summarized genomic data. Usually this type of data
79+
is summarized at the level of the number of counts per exon. Suppose you have 100,000 exons, then you would have a
80+
table/spreadsheet that had 21 rows (a row for exon names, and one row for each patient) and 100,001 columns (one column for patient
81+
ids and one column for each exon).
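
As a rough illustration of the two-table layout described above, here is a minimal Python sketch that uses only the standard library. The patient ids, column names, and values are made up for the example, and only three exon columns are shown instead of 100,000. The point is that both tables carry the same `PatientId` column, which is what lets them be linked.

```python
import csv
import io

# Hypothetical clinical table: one row per patient, one column per variable.
clinical_csv = """PatientId,Age,Treatment,Diagnosis
P01,54,DrugA,StageII
P02,61,Placebo,StageI
"""

# Hypothetical exon-count table: the shared PatientId column links it
# to the clinical table.
expression_csv = """PatientId,Exon1,Exon2,Exon3
P01,12,0,7
P02,3,5,9
"""


def read_table(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


clinical = read_table(clinical_csv)
expression = read_table(expression_csv)

# Index the clinical rows by the linking column, then join each
# expression row to its matching clinical row.
clinical_by_id = {row["PatientId"]: row for row in clinical}
linked = [{**clinical_by_id[row["PatientId"]], **row} for row in expression]

for row in linked:
    print(row["PatientId"], row["Age"], row["Exon1"])
```

Values stay as strings here because `csv.DictReader` does not convert types; an analyst would normally convert them when loading the data.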
6782

6883

6984
### The code book
7085

86+
For almost any data set, the measurements you calculate will need to be described in more detail than you can fit
87+
into the spreadsheet. The code book contains this information. At minimum it should contain:
88+
89+
1. Information about the variables (including units!) in the data set not contained in the tidy data
90+
2. Information about the summary choices you made
91+
3. Information about the experimental study design you used
92+
93+
In our genomics example, the analyst would want to know what the unit of measurement for each
94+
clinical/demographic variable is (age in years, treatment by name/dose, level of diagnosis and how heterogeneous). They
95+
would also want to know how you picked the exons you used for summarizing the genomic data (UCSC/Ensembl, etc.). They
96+
would also want to know any other information about how you did the data collection/study design. For example,
97+
are these the first 20 patients that walked into the clinic? Are they 20 highly selected patients by some characteristic
98+
like age? Are they randomized to treatments?
99+
100+
A common format for this document is a Word file. There should be a section called "Study design" that has a thorough
101+
description of how you collected the data. There should also be a section called "Code book" that describes each variable and its
102+
units.
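
The code book is usually a prose document, but the idea that every variable needs a description and units can also be checked mechanically. The sketch below is one possible approach, not a prescribed format: the dictionary layout, the `undocumented` helper, and the column names are all assumptions made for illustration.

```python
# Hypothetical code book entries: one per column in the tidy clinical table.
code_book = {
    "PatientId": {"description": "Anonymized patient identifier", "units": None},
    "Age": {"description": "Age at diagnosis", "units": "years"},
    "Treatment": {"description": "Treatment name", "units": None},
    "Diagnosis": {"description": "Diagnosis level", "units": None},
}

# Column names taken from the header row of a (made-up) tidy table.
tidy_columns = ["PatientId", "Age", "Treatment", "Diagnosis", "Weight"]


def undocumented(columns, book):
    """Return the columns that have no entry in the code book."""
    return [name for name in columns if name not in book]


missing = undocumented(tidy_columns, code_book)
print(missing)
```

In this example `Weight` appears in the data but not in the code book, so it is the one column reported as undocumented.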
103+
104+
### How to code variables
105+
106+
When you put variables into a spreadsheet there are several main categories you will run into:
107+
108+
1. Continuous
109+
2. Ordinal
110+
3. Categorical
111+
4. Missing
112+
5. Censored
113+
114+
Continuous variables are anything measured on a quantitative scale that could be any fractional number. An example
115+
would be something like weight measured in kg. Ordinal data are data that have a fixed, small (< 100) number of levels but are ordered.
116+
This could be for example survey responses where the choices are: poor, fair, good. Categorical data are data where there
117+
are multiple categories, but they aren't ordered. One example would be sex: male or female. Missing data are data
118+
that are missing and you don't know the mechanism. You should code missing values as NA. Censored data are data
119+
where you know the missingness mechanism on some level. Common examples are a measurement being below a detection limit
120+
or a patient being lost to follow-up. They should also be coded as NA when you don't have the data. But you should
121+
also add a new column to your tidy data called "VariableNameCensored", which should have values of TRUE if censored
122+
and FALSE if not. In the code book you should explain why those values are missing. It is absolutely critical to report
123+
to the analyst if there is a reason you know about that some of the data are missing. You should also not impute/make up/
124+
throw away missing observations.
125+
126+
In general, try to avoid coding categorical or ordinal variables as numbers. When you enter the value for sex in the tidy
127+
data, it should be "male" or "female". The ordinal values in the data set should be "poor", "fair", and "good", not 1, 2, 3.
128+
This will avoid potential mixups about which direction effects go and will help identify coding errors.
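
The missing versus censored distinction above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the `code_measurement` helper, the detection limit of 0.05, and the raw values are hypothetical, and a value with no recorded reason for being absent is treated as plain missing (NA, with the censored flag FALSE).

```python
def code_measurement(value, detection_limit):
    """Return (coded_value, censored_flag) for one raw measurement.

    value is None when nothing was recorded and the reason is unknown.
    """
    if value is None:
        # Missing with unknown mechanism: NA, not flagged as censored.
        return "NA", "FALSE"
    if value < detection_limit:
        # Below the detection limit: NA, flagged as censored.
        return "NA", "TRUE"
    return str(value), "FALSE"


# Hypothetical raw measurements for one variable.
raw = [0.8, None, 0.02, 1.5]
rows = [code_measurement(v, detection_limit=0.05) for v in raw]

# Each pair would fill the "VariableName" and "VariableNameCensored" columns.
for coded, censored in rows:
    print(coded, censored)
```

The reason behind each TRUE flag (here, the detection limit) still belongs in the code book, since the flag alone does not say why the value is unavailable.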
129+
71130

72131
### The instruction list/script
73132

133+
You may have heard this before, but [reproducibility is kind of a big deal in computational science](http://www.sciencemag.org/content/334/6060/1226).
134+
That means, when you submit your paper, the reviewers and the rest of the world should be able to exactly replicate
135+
the analyses from raw data all the way to final results. If you are trying to be efficient, you will likely perform
136+
some summarization/data analysis steps before the data can be considered tidy.
137+
138+
74139

75140

76141
