Methodology

How we compute polygenic risk scores

Pipeline · data sources · statistical method · known limits

The Genetic Psychiatric Insights polygenic risk score (PRS) is a sum of a user's genotype dosages at a trait's genome-wide significant loci, weighted by the log-odds-ratio effect sizes reported by the published GWAS. This page documents the full pipeline: which GWAS summary statistics we use, how we filter variants, how we handle strand ambiguity, how we convert a raw score into a population percentile, and the specific failure modes you should know about.

5×10⁻⁸GWS p-threshold
500 kbLD-clumping window
INFO ≥ 0.8Imputation quality filter
~650KInput variants (typical)
The PRS engine is a pure function: (parsed_dna, sumstats, method) → result. No I/O, no side effects, identical output on identical input. Every decision below is in the open-source code, unit-tested, and auditable.

At a glance: the seven-step pipeline

  1. Parse the raw DNA file (AncestryDNA or 23andMe) into a normalized genotype map keyed by both rsID and chr:pos.
  2. Load the trait's GWAS summary statistics from the Psychiatric Genomics Consortium (PGC), filtered to a single primary meta-analysis source file and to variants with imputation quality INFO ≥ 0.8.
  3. Filter to genome-wide significant variants at p < 5×10⁻⁸.
  4. Drop ambiguous A/T and C/G variants (they cannot be unambiguously strand-flipped without an allele-frequency reference).
  5. Clump to independent loci using a 500 kb physical window as an LD proxy — keep the lowest-p variant per window.
  6. Score by summing dosage × β across matched variants, where dosage is validated against both genotype alleles (not just one).
  7. Compare the raw score to an analytical reference distribution built from Hardy-Weinberg expectations at the same variants, producing a percentile.

Data sources: PGC GWAS summary statistics

Every scored trait ties to a single, named publication from the Psychiatric Genomics Consortium or an affiliated consortium (for example, iPSYCH/deCODE for ADHD). We use the published summary statistics directly — not a polygenic score file, not a curated SNP list — so any methodological choice below is reproducible from the same inputs available to the broader research community.

Sumstats are loaded from one of two sources:

  • OpenMed Hugging Face mirror (most traits). The OpenMed collection mirrors PGC sumstats as sharded Parquet. Each configuration may bundle multiple sub-analyses (per-ancestry, per-sex, chrX-only, UKBB-dedupe). Our loader filters every shard to the canonical primary meta-analysis _source_file before normalization, so variants are never double-counted across sub-cohorts.
  • Direct PGC mirror (traits where OpenMed drops statistical power). OCD, MDD, Bipolar Disorder, and a few others have OpenMed configurations that exclude 23andMe or UKBB partial data. For these we host the full-power distributable from PGC in our blob storage and parse it in its native format (tab-delimited, whitespace-delimited daner, or VCF with PGC metadata lines).

After loading, each sumstats frame is normalized to a canonical schema with columns snp, chrom, pos, a1, a2, beta, se, p, info, freq_effect, freq_other, source_file. When the published file reports odds ratios instead of log-odds β, we derive β = ln(OR). Variants with INFO < 0.8 are dropped as poorly imputed, and duplicates on rsID are resolved by keeping the lowest-p row.

Parsing consumer DNA files

AncestryDNA and 23andMe raw exports are both tab-delimited flat files on build 37 (GRCh37/hg19). We read them line by line and produce a ParsedDNA object that exposes two lookup dictionaries: by_rsid and by_position, so the PRS engine can fall back from rsID to chr:pos when a GWAS reports only one identifier.

Parsing rules, shared across both formats:

  • Comment lines (#) and the column header row are skipped.
  • Allele tokens are validated against {A, C, G, T, I, D}. Non-SNV calls (indel codes I/D) are stored verbatim and naturally ignored later — PGC psychiatric GWAS sumstats use A/C/G/T alleles.
  • Hemizygous single-letter calls on X, Y, or mitochondrial chromosomes (common in 23andMe male samples) are stored as (a, a) so dosage counting works uniformly.
  • No-call tokens (--, 00, -, 0) are counted separately and excluded from the lookup maps.
  • Files with a call rate below 90% are refused with a user-facing error — this threshold is the standard clinical-genotyping cutoff for a usable file.

A typical 23andMe v5 or AncestryDNA file contains roughly 620,000–700,000 called variants. Chip version (v3 / v4 / v5) is inferred from total variant count and reported on the finished report as a diagnostic.

Genome-wide significance and ambiguous variants

We keep only variants with p < 5×10⁻⁸, the community-standard Bonferroni-adjusted threshold for genome-wide significance. Sub-threshold variants carry real signal but are mixed with noise; including them requires calibration (pruning + thresholding, PRS-CS) that depends on a held-out cohort we do not yet have for every trait. A future version of this methodology will add P+T and PRS-CS as selectable methods.

A/T and C/G variants are dropped from the scoring set. These are "strand-ambiguous" SNPs: because the two alleles are their own Watson-Crick complements, there is no way to tell from the nucleotide letters alone whether the user's chip reports the forward or reverse strand. The standard disambiguation relies on comparing allele frequencies to a reference panel, which we will add when 1000 Genomes imputation lands. For now, a conservative skip is the only statistically defensible choice — a silent strand flip at an ambiguous variant inverts the sign of its contribution, and averaging inverted contributions with correct ones is worse than omitting them.

Two-allele strand-flip validation

For each surviving GWAS variant we look up the user's genotype and classify the match into exactly one of three cases:

  • Direct match. Both user alleles lie in {a1, a2}. Dosage is the count of a1 (the effect allele) in the user's pair.
  • Strand-flip match. Both user alleles lie in {complement(a1), complement(a2)}. Dosage is computed on the complement of a1. We count these separately for reporting; a few percent of variants on any chip will match this way and it is expected.
  • Mismatch. The user's genotype exists at this position but its alleles are consistent with neither the direct nor the complement allele set. This indicates a multi-allelic site, a chip/reference-coding mismatch, or a sumstats coding error. The variant is skipped and counted in the diagnostics — the alternative (counting the effect allele in the user's pair without validating the other allele) silently includes junk data.

The two-allele check is what separates a defensible PRS engine from a naïve one. Without it, a variant whose user-side call is a completely different SNP from the one the GWAS reported can still accumulate dosage if the effect letter happens to appear in the user's pair — a bug we found and fixed during our own correctness review.

500 kb physical-window LD clumping

Raw genome-wide significant hits are not independent. A single causal variant creates a peak of dozens-to-hundreds of correlated, sub-threshold-after-adjustment neighbors through linkage disequilibrium (LD). Summing effect sizes across the whole peak double-, triple-, and tenfold-counts the same biological signal and inflates the score by orders of magnitude on traits with clean associations (schizophrenia is the canonical example, with 287 reported loci and often thousands of GWS neighbors).

The correct fix is LD clumping against a matched-ancestry reference panel (typically plink's --clump --clump-r2 0.1 against 1000 Genomes). Until we have imputation and a local LD reference panel, we use a deterministic physical-window proxy:

  1. For each chromosome, walk variants in ascending p-value order.
  2. Keep the first (lowest-p) variant you encounter. Add its base-pair position to a per-chromosome sorted list.
  3. For each subsequent variant, if its position is within 500,000 bp of any already-kept variant on the same chromosome, drop it.
  4. The algorithm is implemented with bisect insertion for O(N log N) total work on a per-chromosome basis.

500 kb was chosen because European LD at r² > 0.1 typically decays within ~500 kb for common variants (pairwise LD at longer distances is dominated by rare haplotypes). In our validation the resulting "independent loci" count matches published per-trait loci counts within about 10% across the PGC psychiatric traits we score.

The two known failure modes are documented on every report: the proxy is coarser than plink's r²-based clumping in MHC regions on chromosome 6 (where LD extends further), and it can drop one of two true independent causal variants in gene-dense regions where two genuine signals sit within 500 kb. Both are rare on psychiatric traits.

Scoring: dosage-weighted sum of effect sizes

The raw polygenic score is the familiar sum:

raw_score = Σᵢ (βᵢ × dosageᵢ)

over the clumped, validated, user-covered variants. βᵢ is the log-odds-ratio from the GWAS; dosageᵢ is 0, 1, or 2, the number of copies of the effect allele in the user's genotype (after strand-flip validation). Variants the chip does not cover contribute zero and do not affect the sum. Strand-flipped matches contribute at the complement effect allele, which is statistically equivalent.

A typical finished report uses 60–400 variants per trait after clumping, and shows the top 15 by absolute contribution — along with the gene they fall in when we have a curated annotation for it.

From raw score to percentile: analytical reference distribution

A raw PRS of +0.347 is not interpretable on its own — it's a sum of log odds ratios whose scale depends entirely on which variants happened to be in the scoring set. To say "75th percentile" we need a reference distribution over the same variants, in a population of people who did not contribute their DNA to us.

Until we can run empirical scoring on 1000 Genomes samples (planned), we compute the reference analytically from Hardy-Weinberg expectations at the same variants the user was scored on:

E[PRS] = Σᵢ βᵢ × 2 × pᵢ
Var[PRS] = Σᵢ βᵢ² × 2 × pᵢ × (1 − pᵢ)

where pᵢ is the effect-allele frequency in the GWAS control cohort (freq_effect in our canonical schema). Under Hardy-Weinberg equilibrium and between-variant independence (which clumping approximates), the PRS is a sum of independent scaled Bernoulli pairs and these formulas are exact. By the central limit theorem, the PRS itself is approximately normal for N ≳ 20 independent loci.

The user's percentile is then Φ((raw − E[PRS]) / √Var[PRS]) × 100, clamped to the [0.01, 99.99] interval so we never claim certainty. We compute the normal CDF in closed form using the error function, avoiding any scipy dependency in the scoring pipeline.

When a trait's source sumstats do not publish control-allele frequencies (historically this affected autism and early substance-use meta-analyses), the engine returns has_reference=False and the report falls back to raw- number framing with a clear note. When fewer than 20 effective loci are available, the percentile is returned but flagged as a weak normal approximation.

What this methodology does not do

  • Non-European-ancestry calibration. Every scored trait's GWAS was conducted predominantly in European-ancestry samples. Effect sizes, frequencies, and LD structure differ across ancestries, so percentiles derived from European frequencies are not directly portable. Every report carries this caveat; users of non-European ancestry should interpret results with extra caution.
  • Rare-variant contributions. Consumer chips genotype ~650K common variants. The rare-variant component of heritability is invisible to this pipeline by construction.
  • Sub-threshold variants. Pruning + thresholding and PRS-CS capture sub-GWS signal that GWS-only scoring misses. These methods require a held-out calibration cohort and are on the roadmap, not yet shipped.
  • Exact LD clumping. The 500 kb physical-window proxy is accurate to within ~10% on per-trait loci counts, but real r²-based clumping against a 1000 Genomes LD panel is the standard.
  • Diagnosis. A polygenic score summarizes common-variant liability only. Clinical psychiatric diagnoses depend on lifetime symptoms, functional impairment, and clinical assessment, not on a number derived from saliva. This is not a medical device. Reports are educational.

Reproducibility and source code

The scoring engine is implemented in Python as a pure function with no I/O and no side effects. Parsers, sumstats loaders, report renderers, and the web layer are plumbing around a single compute_prs(parsed_dna, sumstats, method) → TraitResult call. The engine is unit-tested against synthetic fixtures where every allele is controlled, and against real PGC sumstats where our per-trait independent-loci counts are compared to the published numbers.

Identical inputs always produce identical outputs: the engine is deterministic, contains no random-number generation, and uses a stable mergesort for p-value tie-breaking. When we update a methodology decision (a wider clumping window, a different sumstats source, a new PRS method), we version the change, note it in a changelog, and re-render affected reports only on user re-upload.

See this methodology applied to your own DNA

Upload an AncestryDNA or 23andMe raw data file to get the ADHD and Anxiety preview free. Unlock all 13 traits for $29 (launch pricing).

Upload your file