This package contains data to accompany the paper "Examining the conservation of kinks in alpha helices" by Eleanor C. Law, Henry R. Wilman, Sebastian Kelm, Jiye Shi and Charlotte M. Deane.

PDB codes in the original sets of protein chains are given in the following files:
sol_cullpdb_pc80_res5.0_R0.4_longhelices.csv - Non-redundant soluble
mem_cullpdb_pc80_res5.0_R0.4_longhelices.csv - Non-redundant membrane
mem_cullpdb_pc99_res5.0_R0.4_longhelices.csv - Redundant membrane

------------------------------
Homologous Helix Pairs
------------------------------

The following csv files contain data for all homologous helix pairs:

SolPairs.csv
MemPairs.csv
RMemPairs.csv

Column Headings:
helix1,helix2 - The helix ID for each helix in the pair in the format PDBcode_ChainID_HelixStart_HelixEnd. The helix with the larger angle is first.
tm1,tm2 - The global TM-score between the two chains. tm1 is normalised by the length of the chain containing the first helix, tm2 by the length of the chain containing the second helix.
chain_seqID - The sequence identity between the two complete chains, ignoring gaps.
helix_seqID - The sequence identity between the two helices, ignoring gaps.
angle1 - The angle measured in the first helix at the most disrupted site in the helix pair, i.e. the largest angle in the first helix.
angle2 - The angle measured in the second helix at the most disrupted site in the helix pair, i.e. the largest angle in a window around the location of the largest angle in the other helix.
error1,error2 - The error calculated for each of the measured angles.
proline_type - One letter for each helix. P indicates proline is present at the position of the largest angle or in the following four residues; U indicates proline is not present.
H-score - (angle1 - angle2) / (error1 + error2). An H-score > 1 indicates that the angles are significantly different.
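
The helix ID format and the pair H-score formula above can be illustrated with a short Python sketch. The function names and the sample row are hypothetical and the values are made up for illustration; they are not taken from the data files. The sketch assumes that no field in a helix ID contains an underscore and that the helix start and end are plain integers.

```python
import csv
import io


def parse_helix_id(helix_id):
    """Split 'PDBcode_ChainID_HelixStart_HelixEnd' into its four fields.

    Assumption: fields contain no underscores and the residue numbers
    are plain integers.
    """
    pdb_code, chain_id, start, end = helix_id.split("_")
    return pdb_code, chain_id, int(start), int(end)


def pair_h_score(angle1, angle2, error1, error2):
    """H-score for a helix pair: (angle1 - angle2) / (error1 + error2)."""
    return (angle1 - angle2) / (error1 + error2)


# Made-up row using a subset of the column headings, for illustration only.
sample = io.StringIO(
    "helix1,helix2,angle1,angle2,error1,error2\n"
    "1abc_A_10_35,2xyz_B_12_37,30.0,10.0,5.0,5.0\n"
)

results = []
for row in csv.DictReader(sample):
    score = pair_h_score(
        float(row["angle1"]),
        float(row["angle2"]),
        float(row["error1"]),
        float(row["error2"]),
    )
    # An H-score > 1 marks the two angles as significantly different.
    results.append((parse_helix_id(row["helix1"]), score, score > 1))
```

To apply the same check to SolPairs.csv, MemPairs.csv or RMemPairs.csv, the in-memory sample could be replaced by an opened file, provided the column headings match those listed above.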

------------------------------
Homologous Helix Families
------------------------------

The following csv files contain summary data for all homologous helix families:

MemFams.csv
SolFams.csv

Column Headings:
setfamily - The ID number for the family, for which a file can be found in the corresponding MemFams or SolFams folder.
type - The classification of the family: CK - Conserved Kinked; NC - Not Conserved; CS - Conserved Straight; ot - Other.
site_max_disruption - The most disrupted site (MDS), i.e. the location in the consensus helix which has the highest mean angle across the family.
mean_angle - The mean angle at the MDS.
sd - The standard deviation at the MDS.
mean_error - The mean error for the angles at the MDS.
H-score - sd / mean_error
chain_seqID - The average sequence identity between the whole chains of all pairs in the helix family, ignoring gaps.
helix_seqID - The average sequence identity between the helices of all pairs in the helix family, ignoring gaps.
fraction_pro - The fraction of helices in the family which contain proline at the position of the angle at the most disrupted site after smoothing or in the following four residues.
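
As a rough illustration of how the family summary columns relate to each other, the following Python sketch picks the most disrupted site and computes the family H-score from per-site angles and errors. The function name, the input layout and the numbers are hypothetical. The use of the sample standard deviation (statistics.stdev) is an assumption; the values in the csv files may have been computed with a different estimator.

```python
from statistics import mean, stdev


def family_summary(angles_by_site, errors_by_site):
    """Summarise one helix family.

    angles_by_site / errors_by_site map a consensus helix position to the
    list of angles / errors measured there across the family
    (hypothetical layout, chosen for this sketch).
    """
    # Most disrupted site: the position with the highest mean angle.
    mds = max(angles_by_site, key=lambda site: mean(angles_by_site[site]))
    angles = angles_by_site[mds]
    errors = errors_by_site[mds]

    sd = stdev(angles)  # assumption: sample standard deviation
    mean_error = mean(errors)
    return {
        "site_max_disruption": mds,
        "mean_angle": mean(angles),
        "sd": sd,
        "mean_error": mean_error,
        "H-score": sd / mean_error,
    }


# Made-up values for two consensus positions, for illustration only.
example_angles = {6: [5.0, 6.0, 7.0], 7: [10.0, 20.0, 30.0]}
example_errors = {6: [5.0, 5.0, 5.0], 7: [5.0, 5.0, 5.0]}
summary = family_summary(example_angles, example_errors)
```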

The following folders contain a file for each helix family:

MemFams/
SolFams/

Each file has the following csv format:

helix_ID,start,end,sequence
A list of each helix ID in the family in the format PDBcode_ChainID_HelixStart_HelixEnd, the start and end residues in the consensus helix for the family, and the sequence at the consensus helix location.

The positions in the family consensus helix, which serve as column headings for the angle and error measurements below. Numbering of the consensus helix starts at 1; the angle measurements therefore generally start at position 6, unless there were fewer than five angles at this site, in which case the first site is the first position where there were at least five angles.

Angle measurements at each site for each member of the family, after smoothing.

Estimated error for the angle measurements at each site.

------------------------------
GPCRs
------------------------------

crystal-structures-201411041603.csv - The file downloaded from the GPCRDB on 4 Nov 2014, containing a list of all PDB codes used.
gpcrtools_export_20141104155712.csv - The file downloaded from the GPCRDB on 4 Nov 2014, containing a sequence alignment and numbering for the helices.

The following files have the same format and headings as those for MemFams and SolFams, except that the numbering is that used by the GPCRDB:
GPCRs.csv
GPCRs/
