AHAH (Alpha-Helices Assessed by Humans) data readme.

This is the data that accompanies the J. Chem. Inf. Model. article "Crowdsourcing yields a new standard for kinks in
protein helices" by Henry R. Wilman, Jean-Paul Ebejer, Jiye Shi, Charlotte M. Deane, and Bernhard Knapp.
The gold standard annotations are given in ahah_gold_standard.csv. The raw annotations are in the file ahah_raw_annotations.csv.



There are two data files:
1. ahah_gold_standard.csv
2. ahah_raw_annotations.csv

1. ahah_gold_standard.csv
This is the gold standard data set. It is the summary data for all the annotations. 
It is divided into four sections, based on the majority view of participants. These sections are separated by lines beginning with '>'.
Each comma-separated row corresponds to a helix, with the following fields:
helix_id        : An underscore-separated string comprising: PDB code, PDB chain, start residue number (PDB numbering), end residue number.
classification  : The consensus of users, one of: KINK, CURVED, STRAIGHT, or UNCLASSIFIED. 
kinked          : }
curved          : } The proportion of respondents who annotated the helix in this manner. For any helix, kinked + curved + straight = 1.
straight        : }
total           : Number of participants who annotated this helix. Hence the number of people who annotated a helix as kinked is kinked*total.
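
The layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes the fields appear in the order listed, that the text after '>' names the section, and that residue numbers are plain integers. The sample rows are made up for illustration and are not taken from the data; the function names are hypothetical.

```python
import csv
import io

# Field order as listed in this readme.
GOLD_FIELDS = ["helix_id", "classification", "kinked", "curved", "straight", "total"]


def parse_helix_id(helix_id):
    """Split a helix_id such as '1abc_A_10_35' into its four parts.

    Assumes exactly four underscore-separated fields and plain integer
    residue numbers; the real files may need more care.
    """
    pdb_code, chain, start, end = helix_id.split("_")
    return {"pdb_code": pdb_code, "chain": chain, "start": int(start), "end": int(end)}


def read_gold_standard(handle):
    """Return one dict per helix, tagged with the '>' section it appears under."""
    helices = []
    section = None
    for row in csv.reader(handle):
        if not row or not row[0].strip():
            continue  # blank line
        if row[0].startswith(">"):
            # Assumption: the rest of the line names the section.
            section = ",".join(row)[1:].strip()
            continue
        if len(row) != len(GOLD_FIELDS):
            continue  # not a helix row
        record = dict(zip(GOLD_FIELDS, (field.strip() for field in row)))
        try:
            total = int(float(record["total"]))
            fractions = {k: float(record[k]) for k in ("kinked", "curved", "straight")}
        except ValueError:
            continue  # e.g. a header row, if the file has one
        record.update(fractions)
        record["total"] = total
        record["section"] = section
        # proportion * total gives the number of participants; round()
        # absorbs any rounding of the stored proportions.
        record["counts"] = {k: round(v * total) for k, v in fractions.items()}
        helices.append(record)
    return helices


# Made-up sample in the assumed layout (not real data).
SAMPLE = """>KINK
1abc_A_10_35,KINK,0.75,0.25,0.0,8
>STRAIGHT
2xyz_B_5_20,STRAIGHT,0.0,0.1,0.9,10
"""

helices = read_gold_standard(io.StringIO(SAMPLE))
for helix in helices:
    print(helix["helix_id"], helix["section"], helix["counts"])
```

To use it on the real file, pass an open file handle instead of the sample, e.g. open("ahah_gold_standard.csv", newline="").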


2. ahah_raw_annotations.csv
This is the raw data generated by the AHAH web application.
This is a comma-delimited file, where each row corresponds to a single annotation.
Each row has the following fields: 
helix_id        : as above 
user            : User id. Each user was assigned a unique integer id. NA indicates an annotation by an unregistered user.
time            : Time in seconds taken to make the annotation. Any value over 100 s is recorded as 100.0. NA indicates either the first annotation by a given user or an annotation by an unregistered user.
annotation      : One of KINK, CURVED, or STRAIGHT, as indicated by the participant.
position        : If the annotation is 'KINK', the PDB residue number of the residue selected as the kink residue by the participant.
education_level : The education level indicated by the participant during registration. One (or none) of: PI, POSTDOC, POSTGRAD_PHD, POSTGRAD_MSC, UNDERGRAD, POSTSEC, SEC, or OTHER. The PI and POSTDOC groups were combined in the analysis (Post-Doctoral and above), as were POSTGRAD_PHD, POSTGRAD_MSC, and UNDERGRAD (Undergraduate and above), and POSTSEC and SEC (Secondary).
pupil           : Boolean. True if this annotation was made by a school pupil, otherwise False.
background(s)   : Semicolon-separated list of participant backgrounds indicated during registration. Any number of: Chem_BioChem, CompSci, Physics_Maths, StrucBio, Humanities, Other_Science, Business.
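
A single raw annotation row can be converted to typed values along the following lines. Again, this is a sketch under assumptions: the fields are taken to appear in the order listed above, an empty field is treated the same as NA, and the Boolean is assumed to be written as True/False in any letter case. The key name "backgrounds", the helper names, and the sample rows are illustrative only and do not come from the data. The education grouping follows the description of the analysis above; OTHER and missing levels are left ungrouped.

```python
import csv
import io

# Field order as listed in this readme ("backgrounds" stands for background(s)).
RAW_FIELDS = ["helix_id", "user", "time", "annotation", "position",
              "education_level", "pupil", "backgrounds"]

# Grouping used in the analysis, as described above.
EDUCATION_GROUPS = {
    "PI": "Post-Doctoral and above",
    "POSTDOC": "Post-Doctoral and above",
    "POSTGRAD_PHD": "Undergraduate and above",
    "POSTGRAD_MSC": "Undergraduate and above",
    "UNDERGRAD": "Undergraduate and above",
    "POSTSEC": "Secondary",
    "SEC": "Secondary",
}


def na_or(value, convert):
    """Return None for NA or empty fields, otherwise convert(value)."""
    value = value.strip()
    return None if value in ("", "NA") else convert(value)


def parse_annotation(row):
    """Turn one CSV row (a list of strings) into a dict of typed values."""
    raw = dict(zip(RAW_FIELDS, row))
    time = na_or(raw["time"], float)
    education = na_or(raw["education_level"], str)
    return {
        "helix_id": raw["helix_id"].strip(),
        "user": na_or(raw["user"], int),          # None = unregistered user
        "time": time,
        # 100.0 means "100 s or longer", so flag it rather than trust it.
        "time_capped": time is not None and time >= 100.0,
        "annotation": raw["annotation"].strip(),
        "position": na_or(raw["position"], str),  # kept as text (PDB numbering)
        "education_level": education,
        "education_group": EDUCATION_GROUPS.get(education),
        "pupil": raw["pupil"].strip().lower() == "true",
        "backgrounds": [b.strip() for b in raw["backgrounds"].split(";") if b.strip()],
    }


# Made-up sample rows in the assumed layout (not real data).
SAMPLE = """1abc_A_10_35,17,12.5,KINK,22,POSTDOC,False,StrucBio;CompSci
1abc_A_10_35,NA,NA,STRAIGHT,NA,,False,
"""

records = [parse_annotation(row) for row in csv.reader(io.StringIO(SAMPLE)) if row]
for record in records:
    print(record["user"], record["annotation"], record["education_group"], record["backgrounds"])
```

If the real file starts with a header row, skip it before calling parse_annotation, since the integer and float conversions would fail on column names.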



For any problems, queries, or comments, please contact either wilman at stats.ox.ac.uk or knapp at stats.ox.ac.uk.
For more information about this work and the group, please visit our website at www.stats.ox.ac.uk/research/proteins, or see our blog at www.blopig.com 
