G-L1264 Live
Interactive visualization of genetic density.
This tool visualizes the geographic footprint of the G-L1264 Y-DNA subclade using both STR (Short Tandem Repeat) and SNP (Single Nucleotide Polymorphism) phylogenetic data.
It explores a deceptively simple question:
Where has the target subclade been present the longest?
A region where a subclade is common today does not necessarily indicate long-term presence — observed frequency may reflect a recent population expansion or a founder effect. Conversely, a region with fewer but highly diverse carriers may be consistent with deeper historical roots. The map combines both signals to identify areas where long-term, stable presence is most plausible.
"The difference in repeat score between alleles carries information about the amount of time that has passed since they shared a common ancestral allele. [...] We show analytically that the expectation of this distance is a linear function of time."
— Goldstein et al., 1995
Geographic density of the target subclade. Sample counts per geographic cluster are normalized via KDE (Kernel Density Estimation) to [0, 1].
Instead of counting mutations, this layer computes the phylogenetic richness based on the hierarchical YFull/FTDNA paths of modern samples. For a local cluster, the diversity index quantifies the Uniqueness ($U$) of terminal sub-branches. If $\Delta p_{ij}$ is the length of the non-shared path between samples $i$ and $j$, the index isolates regions of deep branching versus regions dominated by a single recent founder lineage.
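The non-shared path length $\Delta p_{ij}$ can be sketched as follows. This is an illustration only: it assumes each sample's path is a root-to-tip list of branch names, as in the haplogroup_paths example, and that $\Delta p_{ij}$ counts the branches on both paths below the last shared ancestor. The function name and the placeholder branches `BRANCH-X` and `BRANCH-Y` are hypothetical, and the aggregation of these lengths into the index $U$ is not shown.

```python
def non_shared_path_length(path_i, path_j):
    """Count the branches that two root-to-tip SNP paths do NOT share.

    Assumption: both paths start at the same root and are ordered
    from the oldest branch to the terminal branch.
    """
    shared = 0
    for a, b in zip(path_i, path_j):
        if a != b:
            break
        shared += 1
    # Branches below the last common ancestor, summed over both samples.
    return (len(path_i) - shared) + (len(path_j) - shared)


sample_a = ["G-M201", "G-L1264", "G-FGC21495"]
sample_b = ["G-M201", "G-L1264"]
# BRANCH-X and BRANCH-Y are placeholders, not real SNP names.
sample_c = ["G-M201", "G-L1264", "BRANCH-X", "BRANCH-Y"]

delta_ab = non_shared_path_length(sample_a, sample_b)
delta_ac = non_shared_path_length(sample_a, sample_c)
```

Under this definition, a cluster whose samples split early and run along long separate branches yields large pairwise values, while a cluster dominated by one recent founder lineage yields values near zero.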
The diversity score is based on the pairwise Average Squared Distance (ASD) between STR haplotypes, normalized by the per-marker mutation rate $\mu$ (Goldstein et al., 1995; Slatkin, 1995).
For a pair of haplotypes $(i, j)$ sharing $L$ comparable loci:
$$\text{ASD}_\mu(i, j) = \frac{1}{L} \sum_{l=1}^{L} \left( x_{il} - x_{jl} \right)^2 \cdot \frac{\mu_{\text{median}}}{\mu_l}$$
Inverse weighting by $\mu_{\text{median}} / \mu_l$ amplifies slow-mutating markers (more informative about deep divergence) and dampens fast-mutating ones.
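The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: haplotypes are assumed to be dictionaries mapping marker name to repeat count, $\mu_{\text{median}}$ is assumed to be the median over the whole rate table, and the rates in the example are made-up values chosen for easy arithmetic, not published mutation rates.

```python
from statistics import median


def asd_mu(hap_i, hap_j, mu):
    """Mutation-rate-normalized average squared distance for one pair.

    hap_i, hap_j: dicts of marker name -> repeat count.
    mu: dict of marker name -> mutation rate per generation.
    Only markers present in both haplotypes and in `mu` are compared.
    """
    shared = [m for m in hap_i if m in hap_j and m in mu]
    if not shared:
        raise ValueError("no comparable loci")
    # Assumption: the median is taken over the full rate table.
    mu_median = median(mu.values())
    total = 0.0
    for m in shared:
        diff = hap_i[m] - hap_j[m]
        total += diff * diff * (mu_median / mu[m])
    return total / len(shared)


# Illustrative rates only (hypothetical values).
rates = {"DYS393": 0.001, "DYS390": 0.002, "DYS19": 0.004}
hap_a = {"DYS393": 13, "DYS390": 24, "DYS19": 14}
hap_b = {"DYS393": 14, "DYS390": 22, "DYS19": 14}

score = asd_mu(hap_a, hap_b, rates)
```

In this example the one-step difference at the slowest marker is doubled by its weight, while the two-step difference at the median-rate marker is left unchanged, which is the amplification described above.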
For each geographic cluster, the $K$ nearest unique locations are found via KDTree. Pairwise ASDs within the KNN neighborhood are weighted by geographic proximity:
$$\text{ASD}_{\text{w}}(i, j) = \text{ASD}_\mu(i, j) \cdot e^{-d_{ij} / \lambda}$$
where $d_{ij}$ is the geographic distance (km), and $\lambda$ is the characteristic decay length. Small clusters receive a Bayesian reliability penalty (shrinkage) based on total raw sample count $n_{\text{raw}}$.
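The proximity weighting for a single pair can be sketched as below. The function name is hypothetical and the pair's normalized ASD is passed in as a plain number. The shrinkage penalty is omitted because its exact form is not given here.

```python
import math


def asd_weighted(asd_value, d_km, lam_km):
    """Scale a pairwise ASD by exp(-d / lambda).

    asd_value: mutation-rate-normalized ASD of the pair.
    d_km: geographic distance between the two samples, in km.
    lam_km: characteristic decay length, in km (must be positive).
    """
    if lam_km <= 0:
        raise ValueError("decay length must be positive")
    return asd_value * math.exp(-d_km / lam_km)


# How the weight falls off for a pair with ASD = 2.0 and lambda = 100 km.
for distance in (0, 50, 100, 300):
    print(distance, round(asd_weighted(2.0, distance, 100.0), 4))
```

At $d_{ij} = \lambda$ the pair contributes $1/e$ (about 37%) of its unweighted value, so pairs separated by several decay lengths have almost no influence on the local score.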
Candidate regions of long-term continuous presence, reflecting the diversity-peak hypothesis (Ramachandran et al., 2005):
$$W_{\text{comb}} = \hat{F} \cdot \hat{D} \cdot \sqrt{n_{\text{div}}}$$
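A small sketch of the combined score, assuming $\hat{F}$ and $\hat{D}$ are the frequency and diversity scores already normalized to [0, 1] and $n_{\text{div}}$ is the number of samples that contributed to the diversity estimate. These readings of the symbols are assumptions, and the function name is hypothetical.

```python
import math


def combined_weight(f_hat, d_hat, n_div):
    """W_comb = F_hat * D_hat * sqrt(n_div)."""
    if n_div < 0:
        raise ValueError("n_div must be non-negative")
    return f_hat * d_hat * math.sqrt(n_div)


# Dense but genetically uniform cluster vs. a moderately dense, diverse one.
uniform_hotspot = combined_weight(0.9, 0.1, 25)
diverse_region = combined_weight(0.5, 0.6, 25)
```

Because the score is a product, a cluster needs both signals to rank highly: the dense but uniform hotspot scores well below the moderately dense, diverse region, and the $\sqrt{n_{\text{div}}}$ factor rewards estimates backed by more samples without letting sample count dominate.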
Nearest-neighbor distance analysis (Clark & Evans, 1954) is used to infer all spatial parameters from the input dataset's coordinate distribution. For each unique sample location, the Euclidean distance to its closest neighbor is computed in km to form the $\text{NND}$ array.
$$r_{\text{cluster}} = P_{78}(\text{NND}) \times 2$$
The 78th percentile captures the typical gap between distinct sampling regions, excluding outliers.
$$\lambda = \text{IQR}(\text{NND}) \times 2$$
$\lambda$ is the characteristic length scale for the exponential decay $e^{-d/\lambda}$. The Interquartile Range (IQR) captures the spread of inter-sample spacings without being sensitive to extremes.
$$K = \lfloor N^{0.6} \rceil$$
Sub-linear scaling, rounded to the nearest integer: the exponent 0.6 sits between $\sqrt{N}$ (exponent 0.5) and a fixed fraction of $N$ (exponent 1).
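The three parameter rules above can be sketched with the standard library alone. Several things here are assumptions made for illustration: coordinates are taken to be already projected onto a plane measured in km, $N$ is read as the number of unique locations, nearest neighbors are found by brute force rather than with a KDTree, and percentiles use linear interpolation. All function names are hypothetical.

```python
import math
from statistics import quantiles


def nearest_neighbor_distances(points_km):
    """Distance from each point to its closest other point (brute force)."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points_km) if j != i)
        for i, p in enumerate(points_km)
    ]


def spatial_parameters(points_km):
    """Return (r_cluster, lambda, K) derived from the NND distribution."""
    unique = sorted(set(points_km))
    if len(unique) < 3:
        raise ValueError("need at least three unique locations")
    nnd = nearest_neighbor_distances(unique)
    # pct[k - 1] is the k-th percentile (linear interpolation).
    pct = quantiles(nnd, n=100, method="inclusive")
    r_cluster = pct[77] * 2            # P78(NND) x 2
    lam = (pct[74] - pct[24]) * 2      # IQR(NND) x 2
    k = round(len(unique) ** 0.6)      # nearest integer to N^0.6
    return r_cluster, lam, k


# Five locations on a line, in km; the duplicate is dropped before analysis.
locations = [(0, 0), (1, 0), (3, 0), (6, 0), (10, 0), (10, 0)]
r_cluster, lam, k = spatial_parameters(locations)
```

For real data the brute-force search would be replaced by a spatial index, but the parameter rules stay the same: all three values follow from the coordinate distribution with no hand-tuned constants beyond the percentile, the multipliers, and the exponent.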
The ancient DNA (aDNA) layer serves as a visual temporal anchor. Ancient and modern layers are rendered independently, and ancient samples do not enter the KDE calculations of the modern layers.
Integrating aDNA into KDE-based diversity or frequency calculations would introduce systematic bias:
Each ancient sample is displayed with its sample_age_str (e.g., "1200 BC"), YBP, and culture. Samples marked as Genetic Outliers (e.g., culturally assimilated individuals structurally diverging from local ancestry) are framed with a distinct high-contrast red border to highlight potential migrations.
Unlike projects built on flat-file datasets, this project operates on a live Supabase (PostgreSQL) database that feeds directly into the Deck.gl WebGL engine.
The primary table (dna_samples) integrates both modern and ancient DNA. Key columns include:

- latitude / longitude: WGS84 coordinates.
- str_values: JSONB structure holding values for 111 FTDNA Y-STR markers.
- haplogroup_paths: JSONB hierarchical SNP path (e.g. ["G-M201", "G-L1264", "G-FGC21495"]).
- study_link, ybp, is_outlier: Contextual fields for paleogenomics.

Instead of manually curating terminal SNPs, samples are continuously enriched via an automated Edge Function pipeline connected to the Valalav phylogeny API. This reconstructs the full ancestral path for every confirmed SNP, enabling the hierarchical SNP Lineage Diversity calculations.
Per-marker mutation rates (mutations per generation) for each of the 111 Y-STR markers are used in the diversity weighting. Assumed sources: