CSV Output
This document describes the columns in the CSV output.
sex_check
Sex check performs a comparison between the sex reported in the ped file and that inferred from the genotypes on the non-PAR regions of the X chromosome.
1 row per sample with columns of:
- sample_id: sample from ped.
- error: boolean indicating whether there is a mismatch between X genotypes and ped sex.
- het_count: number of heterozygote calls
- hom_alt_count: number of homozygous-alternate calls
- hom_ref_count: number of homozygous-reference calls
- het_ratio: het_count divided by hom_alt_count. Low for males, high for females.
- ped_sex: sex from .ped file
- predicted_sex: sex predicted from rate of hets on chrX.
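As a minimal sketch of how these columns might be consumed, the snippet below parses a small in-memory table and lists the samples flagged in the error column. The rows are invented for illustration, and the serialization of booleans as True/False and of sex as male/female is an assumption, not something this document specifies.

```python
import csv
import io

# Hypothetical sex_check rows; all values are invented for illustration.
SAMPLE_CSV = """sample_id,error,het_count,hom_alt_count,hom_ref_count,het_ratio,ped_sex,predicted_sex
s1,False,10,500,900,0.02,male,male
s2,True,400,450,800,0.889,male,female
"""


def flagged_samples(csv_text):
    """Return sample_ids whose error column is true (assumes 'True'/'False' text)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["sample_id"]
        for row in reader
        if row["error"].strip().lower() == "true"
    ]


def het_ratio(het_count, hom_alt_count):
    """het_count / hom_alt_count, following the het_ratio description above."""
    if hom_alt_count == 0:
        return float("nan")  # ratio is undefined without hom-alt calls
    return het_count / hom_alt_count


print(flagged_samples(SAMPLE_CSV))
```

Here s2 is flagged because its ped sex is male while the high het_ratio on chrX points to a female sample.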
het_check
Het check does general QC including rate of het calls, allele-balance at het calls, mean and median depth, and a PCA projection onto thousand genomes.
1 row per sample with columns of:
- sample_id: sample from ped.
- sampled_sites: number of sites sampled (sufficient call-rate across samples and depth in this sample)
- mean/median_depth: mean/median depths for the sites tested.
- depth_outlier: boolean indicating that this sample’s depth is considered an outlier relative to the other samples.
- het_count: number of heterozygote calls in sampled sites.
- het_ratio: proportion of sites that were heterozygous.
- ratio_outlier: boolean indicating that the het_ratio was outside what is normally seen.
- idr_baf: inter-decile range (90th percentile - 10th percentile) of b-allele frequency. We build the distribution of alts / (ref + alts) across all sites and report the difference between its 90th and 10th percentiles. Large values indicate likely sample contamination.
- p10/p90: the numbers used to calculate idr_baf.
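The idr_baf calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's own code: the function name and the per-site ref/alt count inputs are hypothetical, and the percentile interpolation method (here the "inclusive" method from Python's statistics module) may differ from what the tool uses.

```python
import statistics


def idr_baf(ref_counts, alt_counts):
    """Return (p10, p90, idr_baf) for alts / (ref + alts) across sites.

    ref_counts and alt_counts are hypothetical per-site read counts.
    """
    # b-allele frequency per site, skipping sites with no reads
    bafs = [a / (r + a) for r, a in zip(ref_counts, alt_counts) if r + a > 0]
    # n=10 yields the 9 decile cut points: index 0 is p10, index 8 is p90
    deciles = statistics.quantiles(bafs, n=10, method="inclusive")
    p10, p90 = deciles[0], deciles[8]
    return p10, p90, p90 - p10


# Invented counts giving allele fractions 0.0, 0.1, ..., 1.0
refs = list(range(10, -1, -1))
alts = list(range(0, 11))
p10, p90, idr = idr_baf(refs, alts)
```

A wider spread between p10 and p90 means allele fractions stray further from the expected clusters, which is why large values suggest contamination.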
And the PCA columns:
- PC1/PC2/PC3/PC4: the first 4 values after this sample was projected onto the thousand genomes principal components.
- ancestry-prediction: one of AFR, AMR, EAS, EUR, SAS, or UNKNOWN. UNKNOWN is reported when ancestry-prob < 0.65 for the most probable population.
- ancestry-prob: the highest probability from the SVM for any ancestry (between 0 and 1).
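The relationship between ancestry-prob and ancestry-prediction can be illustrated with a short sketch. It assumes the per-population probabilities are available as a dictionary, which is a hypothetical input; only the 0.65 cutoff and the population labels come from the descriptions above.

```python
POPULATIONS = ("AFR", "AMR", "EAS", "EUR", "SAS")


def ancestry_call(probs, threshold=0.65):
    """Return (ancestry-prediction, ancestry-prob) from per-population probabilities.

    probs is a hypothetical dict mapping each population label to a
    probability between 0 and 1.
    """
    best = max(POPULATIONS, key=lambda pop: probs.get(pop, 0.0))
    prob = probs.get(best, 0.0)
    # Below the cutoff, the label is replaced by UNKNOWN but the
    # highest probability is still reported.
    label = best if prob >= threshold else "UNKNOWN"
    return label, prob


confident = ancestry_call({"AFR": 0.02, "AMR": 0.03, "EAS": 0.01, "EUR": 0.9, "SAS": 0.04})
ambiguous = ancestry_call({"AFR": 0.1, "AMR": 0.4, "EAS": 0.05, "EUR": 0.4, "SAS": 0.05})
```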
ped_check
Ped check compares the relatedness of 2 samples as reported in a .ped file to the relatedness inferred from the genotypes at ~25K sites in the genome.
This contains 1 row per sample-pair: n_samples * (n_samples - 1) / 2 rows, or roughly (n_samples * n_samples) / 2 for large cohorts.
- sample_a/sample_b: the samples indicating the pair in question.
- n: the number of sites that were used to predict the relatedness.
- rel: the relatedness calculated from the genotypes.
- pedigree_relatedness: the relatedness reported in the ped file.
- rel_difference: difference between the preceding 2 columns.
- ibs0: the number of sites at which the 2 samples shared no alleles (should approach 0 for parent-child pairs).
- ibs2: the number of sites at which the 2 samples were both hom-ref, both het, or both hom-alt.
- shared_hets: the number of sites at which both samples were hets.
- hets_a/b: the number of sites at which sample_a/b was het.
- pedigree_parents: boolean indicating that this pair is a parent-child pair according to the ped file.
- predicted_parents: boolean indicating that this pair is expected to be a parent-child pair according to the ibs0 (< 0.012) calculated from the genotypes.
- parent_error: boolean indicating that the preceding 2 columns don’t match
- sample_duplication_error: boolean indicating that rel > 0.75 and ibs0 < 0.012
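The three boolean columns above follow from simple thresholds, sketched below. This is a hedged illustration rather than the tool's implementation: the function name is hypothetical, and the ibs0 value is assumed to already be on the scale to which the 0.012 cutoff applies.

```python
def ped_check_flags(rel, ibs0, pedigree_parents,
                    ibs0_cutoff=0.012, dup_rel_cutoff=0.75):
    """Derive the boolean ped_check columns for one sample pair.

    Assumes ibs0 is expressed on the scale the 0.012 cutoff applies to.
    """
    # Parent-child pairs should share at least one allele at nearly every site.
    predicted_parents = ibs0 < ibs0_cutoff
    # Error when the genotype-based call disagrees with the ped file.
    parent_error = predicted_parents != pedigree_parents
    # Very high relatedness with almost no ibs0 sites suggests a duplicate.
    sample_duplication_error = rel > dup_rel_cutoff and ibs0 < ibs0_cutoff
    return {
        "predicted_parents": predicted_parents,
        "parent_error": parent_error,
        "sample_duplication_error": sample_duplication_error,
    }


# Invented values: a consistent parent-child pair and a likely duplicate.
trio_pair = ped_check_flags(rel=0.5, ibs0=0.001, pedigree_parents=True)
dup_pair = ped_check_flags(rel=0.98, ibs0=0.0, pedigree_parents=False)
```

Note that a duplicated sample also satisfies the ibs0 cutoff, so under this sketch it is flagged with a parent_error as well whenever the ped file does not list the pair as parent-child.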