G-L1264 Live

Interactive visualization of genetic density.

Methodology & Data

What this map shows

This tool visualizes the geographic footprint of the G-L1264 Y-DNA subclade using both STR (Short Tandem Repeat) and SNP (Single Nucleotide Polymorphism) phylogenetic data.

It explores a deceptively simple question:
Where has the target subclade been present the longest?

A region where a subclade is common today does not necessarily indicate long-term presence — observed frequency may reflect a recent population expansion or a founder effect. Conversely, a region with fewer but highly diverse carriers may be consistent with deeper historical roots. The map combines both signals to identify areas where long-term, stable presence is most plausible.

"The difference in repeat score between alleles carries information about the amount of time that has passed since they shared a common ancestral allele. [...] We show analytically that the expectation of this distance is a linear function of time."
— Goldstein et al., 1995

1. Haplotype Frequency (Density)

Geographic density of the target subclade. Sample counts per geographic cluster are smoothed with KDE (Kernel Density Estimation) and rescaled to [0, 1].
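As a rough illustration of this layer, the sketch below evaluates a count-weighted Gaussian kernel at each cluster centre and divides by the maximum so the result lies in [0, 1]. The Gaussian kernel, the fixed `bandwidth_km`, the planar km coordinates, and the divide-by-maximum scaling are all assumptions made for the example; the tool's actual kernel and scaling may differ.

```python
import math

def kde_density(points, counts, bandwidth_km):
    """Count-weighted Gaussian KDE at each cluster centre, scaled to [0, 1].

    points: list of (x_km, y_km) planar coordinates (assumed pre-projected).
    counts: raw sample count per cluster.
    bandwidth_km: kernel width (hypothetical parameter for this sketch).
    """
    raw = []
    for x, y in points:
        total = 0.0
        for (px, py), w in zip(points, counts):
            d2 = (x - px) ** 2 + (y - py) ** 2
            total += w * math.exp(-d2 / (2.0 * bandwidth_km ** 2))
        raw.append(total)
    peak = max(raw)
    # Divide by the peak so the densest cluster maps to exactly 1.0.
    return [v / peak for v in raw]

# Two nearby clusters reinforce each other; the isolated one stays low.
density = kde_density([(0, 0), (10, 0), (500, 0)], [5, 3, 1], bandwidth_km=50)
```

Because neighbouring clusters contribute to each other's score, a dense group of small clusters can outrank a single large but isolated one.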

2. SNP Lineage Diversity (Phylogenetic Richness)

Instead of counting mutations, this layer computes phylogenetic richness from the hierarchical YFull/FTDNA paths of modern samples. For a local cluster, the diversity index quantifies the uniqueness ($U$) of terminal sub-branches. With $\Delta p_{ij}$ denoting the length of the non-shared path between samples $i$ and $j$, the index separates regions of deep branching from regions dominated by a single recent founder lineage.
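The non-shared path length can be sketched as follows. The function strips the common prefix of two hierarchical paths and counts the remaining nodes on both sides. Averaging over all pairs is only one plausible way to turn $\Delta p_{ij}$ into a cluster-level index and is an assumption here, as are the branch names `G-BranchA` and `G-BranchB`, which are hypothetical placeholders.

```python
from itertools import combinations

def non_shared_path_length(path_i, path_j):
    """Number of nodes below the most recent shared node, summed over both paths."""
    shared = 0
    for a, b in zip(path_i, path_j):
        if a != b:
            break
        shared += 1
    return (len(path_i) - shared) + (len(path_j) - shared)

def mean_pairwise_divergence(paths):
    """Average non-shared path length over all sample pairs (illustrative index)."""
    pairs = list(combinations(paths, 2))
    if not pairs:
        return 0.0
    return sum(non_shared_path_length(a, b) for a, b in pairs) / len(pairs)

# Hypothetical sub-branches below G-L1264, for illustration only.
founder_cluster = [["G-M201", "G-L1264", "G-BranchA"]] * 3
mixed_cluster = [
    ["G-M201", "G-L1264", "G-BranchA"],
    ["G-M201", "G-L1264", "G-BranchB"],
    ["G-M201", "G-L1264"],
]
```

A cluster made of identical terminal branches scores 0, while a cluster that splits directly below the target subclade scores higher.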

3. Haplotype Diversity (STR)

The diversity score is based on pairwise $\mu$-normalized ASD (Average Squared Distance; Goldstein et al., 1995; Slatkin, 1995).

For a pair of haplotypes $(i, j)$ sharing $L$ comparable loci:

$$\text{ASD}_\mu(i, j) = \frac{1}{L} \sum_{l=1}^{L} \left( x_{il} - x_{jl} \right)^2 \cdot \frac{\mu_{\text{median}}}{\mu_l}$$

$x_{il}$ — allele value of sample $i$ at locus $l$
$\mu_l$ — per-locus mutation rate

Weighting each locus by $\mu_{\text{median}} / \mu_l$, where $\mu_{\text{median}}$ is the median of the per-locus rates, amplifies slow-mutating markers (more informative about deep divergence) and dampens fast-mutating ones.
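A minimal sketch of the formula above, assuming haplotypes are stored as marker-to-integer mappings and that $\mu_{\text{median}}$ is taken over the whole rate table (the source does not say whether it is global or per pair). The mutation rates in the example are round illustrative numbers, not published values.

```python
from statistics import median

def asd_mu(hap_i, hap_j, mu):
    """Mu-normalized average squared distance between two STR haplotypes.

    hap_i, hap_j: dicts mapping marker name -> integer repeat count.
    mu: dict mapping marker name -> per-generation mutation rate.
    Returns None when the pair shares no comparable locus.
    """
    loci = [m for m in hap_i if m in hap_j and m in mu]
    if not loci:
        return None
    mu_median = median(mu.values())  # assumption: median over the full rate table
    total = 0.0
    for m in loci:
        diff = hap_i[m] - hap_j[m]
        total += diff * diff * (mu_median / mu[m])
    return total / len(loci)

# Illustrative rates only.
rates = {"DYS393": 0.001, "DYS390": 0.002, "DYS19": 0.004}
h1 = {"DYS393": 14, "DYS390": 23, "DYS19": 15}
h2 = {"DYS393": 14, "DYS390": 25, "DYS19": 16}
score = asd_mu(h1, h2, rates)  # (0 + 4 * 1.0 + 1 * 0.5) / 3 = 1.5
```

The one-step difference at the fast marker contributes only 0.5, half of what the same difference would contribute at a marker with the median rate.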

Spatial Sampling Pipeline

For each geographic cluster, the $K$ nearest unique locations are found via KDTree. Pairwise ASDs within the KNN neighborhood are weighted by geographic proximity:

$$\text{ASD}_{\text{w}}(i, j) = \text{ASD}_\mu(i, j) \cdot e^{-d_{ij} / \lambda}$$

where $d_{ij}$ is the geographic distance (km), and $\lambda$ is the characteristic decay length. Small clusters receive a Bayesian reliability penalty (shrinkage) based on total raw sample count $n_{\text{raw}}$.
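The distance weighting and the small-cluster penalty might look like the sketch below. The weighting follows the formula above directly. The shrinkage form $n / (n + k)$ and the `prior_strength` parameter are assumptions chosen for illustration, since the source only states that a Bayesian shrinkage penalty based on $n_{\text{raw}}$ is applied.

```python
import math

def weighted_asd(asd, d_km, lam_km):
    """Down-weight a pairwise ASD by exponential distance decay."""
    return asd * math.exp(-d_km / lam_km)

def shrink(score, n_raw, prior_strength=5.0):
    """Pull scores from small clusters toward zero.

    prior_strength is a hypothetical pseudo-count: a cluster with
    n_raw == prior_strength keeps half of its score.
    """
    return score * n_raw / (n_raw + prior_strength)

# A pair one decay length apart keeps about 37% (1/e) of its ASD.
w = weighted_asd(2.0, d_km=100.0, lam_km=100.0)
```

With this form, a cluster of 95 samples keeps 95% of its score, while a cluster of 5 keeps only 50%.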

4. Frequency × Diversity (Combined)

Candidate regions of long-term continuous presence, reflecting the diversity-peak hypothesis (Ramachandran et al., 2005):

$$W_{\text{comb}} = \hat{F} \cdot \hat{D} \cdot \sqrt{n_{\text{div}}}$$
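A short worked example of the combined score, reading $\hat{F}$ and $\hat{D}$ as the normalized frequency and diversity scores and $n_{\text{div}}$ as the number of samples behind the diversity estimate. That reading of $n_{\text{div}}$ is an assumption, and the input values are invented for illustration.

```python
import math

def combined_weight(f_hat, d_hat, n_div):
    """W_comb = F_hat * D_hat * sqrt(n_div)."""
    return f_hat * d_hat * math.sqrt(n_div)

# Region A: very common today, but nearly uniform haplotypes.
region_a = combined_weight(f_hat=1.0, d_hat=0.1, n_div=100)
# Region B: less common, but highly diverse.
region_b = combined_weight(f_hat=0.4, d_hat=0.9, n_div=16)
```

Here region B scores about 1.44 against 1.0 for region A, so the diverse region ranks higher despite having fewer samples.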

Adaptive Spatial Calibration

Nearest-neighbor distance analysis (Clark & Evans, 1954) is used to infer all spatial parameters from the input dataset's coordinate distribution. For each unique sample location, the Euclidean distance to its closest neighbor is computed in km to form the $\text{NND}$ array.

Clustering Radius ($r_{\text{cluster}}$)

$$r_{\text{cluster}} = P_{78}(\text{NND}) \times 2$$

The 78th percentile captures the typical gap between distinct sampling regions, excluding outliers.

Distance-Decay Length ($\lambda$)

$$\lambda = \text{IQR}(\text{NND}) \times 2$$

The characteristic length scale for the exponential decay $e^{-d/\lambda}$. The Interquartile Range (IQR) captures the spread of inter-sample spacings without sensitivity to extremes.

KNN Neighborhood Size ($K$)

$$K = \lfloor N^{0.6} \rceil$$

Sub-linear scaling: with exponent 0.6, $K$ grows faster than $\sqrt{N}$ but slower than any fixed fraction of $N$.
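The three calibration rules can be sketched together. The example assumes the locations are already projected to planar km coordinates, uses a brute-force neighbor search instead of a KDTree for brevity, takes $N$ to be the number of unique locations, and uses linear-interpolation percentiles. Each of these is an assumption of the sketch, not a confirmed detail of the tool.

```python
import math

def percentile(values, q):
    """Percentile with linear interpolation between closest ranks."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def nearest_neighbor_distances(points):
    """Distance from each point to its closest other point (brute force)."""
    return [
        min(math.hypot(x - px, y - py)
            for j, (px, py) in enumerate(points) if j != i)
        for i, (x, y) in enumerate(points)
    ]

def calibrate(points):
    """Return (r_cluster, lambda, K) derived from the NND array."""
    nnd = nearest_neighbor_distances(points)
    r_cluster = 2.0 * percentile(nnd, 78)
    lam = 2.0 * (percentile(nnd, 75) - percentile(nnd, 25))
    k = math.floor(len(points) ** 0.6 + 0.5)  # round half up to nearest integer
    return r_cluster, lam, k

# Five locations on a line, spaced 10, 20, 40 and 80 km apart.
r_cluster, lam, k = calibrate([(0, 0), (10, 0), (30, 0), (70, 0), (150, 0)])
```

For this toy input the NND array is [10, 10, 20, 40, 80], which gives $r_{\text{cluster}} \approx 89.6$ km, $\lambda = 60$ km and $K = 3$.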

Ancient DNA Integration

The ancient DNA (aDNA) layer serves as a visual temporal anchor. Ancient and modern layers are rendered independently and do not influence each other's KDE calculations.

Why ancient samples are excluded from KDE

Integrating aDNA into KDE-based diversity or frequency calculations would introduce systematic bias:

Geographic bias: Vast areas have never been sampled for aDNA — absence on the map does not mean absence in history.
Lineage bias: Whether a subclade appears in the archaeological record depends on which sites preserved DNA and were published.
Temporal bias: Older periods are dramatically underrepresented.

Metadata and Outliers

Each ancient sample is displayed with its sample_age_str (e.g., "1200 BC"), YBP (years before present), and culture. Samples marked as Genetic Outliers (e.g., culturally assimilated individuals whose ancestry diverges from the local population) are framed with a high-contrast red border to highlight potential migrations.

Database Architecture & Pipeline

Unlike flat-file datasets, this project runs on a live Supabase (PostgreSQL) database that feeds directly into the Deck.gl WebGL engine.

Sample Records (`dna_samples`)

The primary table integrates both modern and ancient DNA. Key columns include:

latitude / longitude — WGS84 coordinates.
str_values — JSONB structure holding values for 111 FTDNA Y-STR markers.
haplogroup_paths — JSONB hierarchical SNP path (e.g. ["G-M201", "G-L1264", "G-FGC21495"]).
study_link, ybp, is_outlier — Contextual fields for paleogenomics.
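For orientation, a row of `dna_samples` could be modelled on the client side roughly as below. The column names come from the list above; the Python types, the defaults, and the `terminal_branch` helper are assumptions added for the sketch, and the coordinates are illustrative.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DnaSample:
    latitude: float                                        # WGS84
    longitude: float                                       # WGS84
    str_values: dict = field(default_factory=dict)         # marker -> repeat count (assumed shape)
    haplogroup_paths: list = field(default_factory=list)   # root-to-terminal SNP path
    study_link: Optional[str] = None
    ybp: Optional[int] = None                              # years before present; None for modern samples (assumed)
    is_outlier: bool = False

    @property
    def terminal_branch(self):
        """Deepest confirmed branch, or None when no path is stored (hypothetical helper)."""
        return self.haplogroup_paths[-1] if self.haplogroup_paths else None

sample = DnaSample(
    latitude=42.0,
    longitude=44.0,
    haplogroup_paths=["G-M201", "G-L1264", "G-FGC21495"],
)
```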

Automated SNP Enrichment (Valalav API)

Instead of manually curating terminal SNPs, samples are continuously enriched via an automated Edge Function pipeline connected to the Valalav phylogeny API. This reconstructs the full ancestral path for every confirmed SNP, enabling the hierarchical SNP Lineage Diversity calculations.

Mutation Rates ($\mu$)

Per-marker mutation rates (mutations per generation) for each of the 111 Y-STR markers are used in diversity weighting. Assumed sources:

96 of the 111 markers use rates from Marko Heinilä (ISOGG dataset), which provides the baseline coverage.
The remaining 15 markers are estimated using median-ratio calibration across alternative sources (Chandler, Little, SMGF).
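One plausible reading of median-ratio calibration is sketched below: compute the ratio between the primary and an alternative source on the markers they share, take the median of those ratios, and use it to rescale the alternative rates for markers the primary source lacks. This interpretation is an assumption, and the marker names and rates are placeholders rather than published values.

```python
from statistics import median

def fill_missing_rates(primary, alternative):
    """Fill markers absent from `primary` with rescaled `alternative` rates.

    The scale factor is the median of primary/alternative ratios over
    the markers present in both sources.
    """
    shared = [m for m in alternative if m in primary]
    if not shared:
        raise ValueError("no shared markers to calibrate against")
    ratio = median(primary[m] / alternative[m] for m in shared)
    filled = dict(primary)
    for marker, rate in alternative.items():
        if marker not in filled:
            filled[marker] = rate * ratio
    return filled

# Placeholder markers and rates, for illustration only.
primary = {"M1": 0.002, "M2": 0.004, "M3": 0.003}
alternative = {"M1": 0.001, "M2": 0.002, "M3": 0.003, "M4": 0.005}
rates = fill_missing_rates(primary, alternative)  # M4 scaled by the median ratio 2.0
```

Using the median rather than the mean keeps a single badly mismatched marker from distorting the scale factor.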