CSV Output

This document describes the columns in the CSV output.

sex_check

Sex check performs a comparison between the sex reported in the ped file and that inferred from the genotypes on the non-PAR regions of the X chromosome.

1 row per sample with columns of:

  • sample_id: sample from ped.
  • error: boolean indicating whether there is a mismatch between X genotypes and ped sex.
  • het_count: number of heterozygote calls
  • hom_alt_count: number of homozygous-alternate calls
  • hom_ref_count: number of homozygous-reference calls
  • het_ratio: ratio of het_count / hom_alt_count. Low for males, high for females
  • ped_sex: sex from .ped file
  • predicted_sex: sex predicted from rate of hets on chrX.
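
As a rough illustration of how the derived columns relate to the counts, the sketch below computes het_ratio and error for one sample. The function name, the sex labels, and the rule that error is simply "ped_sex differs from predicted_sex" are assumptions made for this example. The way predicted_sex is derived from het_ratio is not specified here, so it is taken as an input.

```python
def sex_check_row(sample_id, het_count, hom_alt_count, ped_sex, predicted_sex):
    """Build the derived sex_check columns for one sample (illustrative only)."""
    # het_ratio is het_count / hom_alt_count; guard against division by zero.
    if hom_alt_count:
        het_ratio = het_count / hom_alt_count
    else:
        het_ratio = float("nan")
    return {
        "sample_id": sample_id,
        "het_ratio": het_ratio,
        # Assumption: an error is any disagreement between the two sex labels.
        "error": ped_sex != predicted_sex,
    }


# Hypothetical sample: few hets on chrX (male-like) but reported as female.
row = sex_check_row("s1", het_count=10, hom_alt_count=200,
                    ped_sex="female", predicted_sex="male")
```

Here the low het_ratio is what one would expect for a male, so a ped file listing this sample as female would be flagged.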

het_check

Het check does general QC including rate of het calls, allele-balance at het calls, mean and median depth, and a PCA projection onto thousand genomes.

1 row per sample with columns of:

  • sample_id: sample from ped.
  • sampled_sites: number of sites sampled (sufficient call-rate across samples and depth in this sample)
  • mean/median_depth: mean/median depths for the sites tested.
  • depth_outlier: boolean indicating that this sample’s depth is considered an outlier relative to the other samples.
  • het_count: number of heterozygote calls in sampled sites.
  • het_ratio: proportion of sites that were heterozygous.
  • ratio_outlier: boolean indicating that the het_ratio was outside what is normally seen.
  • idr_baf: inter-decile range (90th percentile - 10th percentile) of b-allele frequency. We build the distribution of alts / (ref + alts) across all sites and report the difference between its 90th and 10th percentiles. Large values indicate likely sample contamination.
  • p10/p90: the numbers used to calculate idr_baf.
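
The idr_baf calculation can be sketched as follows. This is a minimal illustration, assuming per-site ref and alt read depths are available as two lists and that percentiles use linear interpolation. The helper names and the interpolation method are assumptions, not necessarily what the tool does internally.

```python
def percentile(sorted_vals, q):
    """Percentile with linear interpolation; q is in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def idr_baf(ref_depths, alt_depths):
    """Return (p10, p90, idr_baf) from per-site ref/alt depths."""
    # b-allele frequency per site: alts / (ref + alts); skip zero-depth sites.
    bafs = sorted(a / (r + a) for r, a in zip(ref_depths, alt_depths) if r + a > 0)
    p10 = percentile(bafs, 10)
    p90 = percentile(bafs, 90)
    return p10, p90, p90 - p10


# Hypothetical depths giving b-allele frequencies 0.0, 0.1, ..., 1.0.
alts = list(range(11))
refs = [10 - a for a in alts]
p10, p90, idr = idr_baf(refs, alts)
```

A clean sample would be expected to have b-allele frequencies clustered tightly, giving a small p90 - p10; a wide spread like the one in this toy input is the pattern the column is meant to expose.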

And the PCA columns:

  • PC1/PC2/PC3/PC4: the first 4 values after this sample was projected onto the thousand genomes principal components.
  • ancestry-prediction: one of AFR, AMR, EAS, EUR, SAS, or UNKNOWN. UNKNOWN is reported when ancestry-prob for the most likely population is < 0.65.
  • ancestry-prob: the highest probability from the SVM for any ancestry (between 0 and 1).
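
The relationship between ancestry-prob and ancestry-prediction can be sketched like this. The function name and the dictionary of per-population probabilities are hypothetical; only the 0.65 cutoff and the population labels come from the column descriptions above.

```python
def call_ancestry(probs, min_prob=0.65):
    """Pick the most likely population, or UNKNOWN if its probability is low.

    probs maps population label -> probability (hypothetical SVM output).
    Returns (ancestry_prediction, ancestry_prob).
    """
    pop, prob = max(probs.items(), key=lambda kv: kv[1])
    label = pop if prob >= min_prob else "UNKNOWN"
    return label, prob


# Hypothetical probabilities for two samples.
confident = call_ancestry({"AFR": 0.02, "AMR": 0.05, "EAS": 0.01, "EUR": 0.90, "SAS": 0.02})
ambiguous = call_ancestry({"AFR": 0.05, "AMR": 0.45, "EAS": 0.05, "EUR": 0.40, "SAS": 0.05})
```

Note that ancestry-prob still carries the highest probability even when the prediction is UNKNOWN, so the two columns can be read together.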

ped_check

Ped check compares the relatedness of 2 samples as reported in a .ped file to the relatedness inferred from the genotypes at ~25K sites in the genome.

This contains 1 row per sample-pair: n_samples * (n_samples - 1) / 2 rows.
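
To get a feel for how quickly the row count grows, the sketch below enumerates pairs for a small cohort. It assumes each unordered pair of distinct samples appears exactly once. The sample names are made up.

```python
from itertools import combinations


def sample_pairs(sample_ids):
    """Return every unordered pair of distinct samples, one tuple per pair."""
    return list(combinations(sample_ids, 2))


# Hypothetical 4-sample family.
pairs = sample_pairs(["dad", "mom", "kid", "sib"])
n_rows = len(pairs)
```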

  • sample_a/sample_b: the samples indicating the pair in question.
  • n: the number of sites that were used to predict the relatedness.
  • rel: the relatedness calculated from the genotypes.
  • pedigree_relatedness: the relatedness reported in the ped file.
  • rel_difference: difference between the preceding 2 columns.
  • ibs0: the number of sites at which the 2 samples shared no alleles (should approach 0 for parent-child pairs).
  • ibs2: the number of sites at which the 2 samples were both hom-ref, both het, or both hom-alt.
  • shared_hets: the number of sites at which both samples were hets.
  • hets_a/b: the number of sites at which sample_a/b was het.
  • pedigree_parents: boolean indicating that this pair is a parent-child pair according to the ped file.
  • predicted_parents: boolean indicating that this pair is expected to be a parent-child pair according to the ibs0 (< 0.012) calculated from the genotypes.
  • parent_error: boolean indicating that the preceding 2 columns don’t match
  • sample_duplication_error: boolean indicating that rel > 0.75 and ibs0 < 0.012
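
The boolean columns above can be reproduced from the numeric ones roughly as follows. This is a sketch under stated assumptions: the function name is hypothetical, ibs0 is passed as a value on the same scale as the 0.012 cutoff, and rel_difference is taken as rel minus pedigree_relatedness (the sign convention is a guess).

```python
def pair_flags(rel, pedigree_relatedness, ibs0, pedigree_parents,
               ibs0_cutoff=0.012, dup_rel_cutoff=0.75):
    """Derive the ped_check flag columns for one sample pair (illustrative)."""
    # Parent-child pairs should share at least one allele at nearly every site.
    predicted_parents = ibs0 < ibs0_cutoff
    return {
        # Assumed sign convention: inferred minus reported relatedness.
        "rel_difference": rel - pedigree_relatedness,
        "predicted_parents": predicted_parents,
        "parent_error": predicted_parents != pedigree_parents,
        "sample_duplication_error": rel > dup_rel_cutoff and ibs0 < ibs0_cutoff,
    }


# Hypothetical pair listed as unrelated in the ped file but genetically
# near-identical, as would happen with a duplicated sample.
dup = pair_flags(rel=0.98, pedigree_relatedness=0.0, ibs0=0.001,
                 pedigree_parents=False)

# Hypothetical true parent-child pair that matches the ped file.
trio = pair_flags(rel=0.5, pedigree_relatedness=0.5, ibs0=0.002,
                  pedigree_parents=True)
```

With these cutoffs a duplicated sample also satisfies the ibs0 test for predicted_parents, so it can raise parent_error and sample_duplication_error at the same time.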