The University of Melbourne

Functional-And-Evolutionary Multi-trait Importance (FAEMI) score for 16 million sequence variants

Download (536.16 MB)
dataset
posted on 2025-10-30, 12:57 authored by RUIDONG XIANG
<p dir="ltr"><b>Summary:</b></p><p dir="ltr">This webpage contains the <b>Functional-And-Evolutionary Multi-trait Importance (FAEMI) score</b> for 16,035,444 sequence variants, as reported in the manuscript "<b>Integrating extensive functional annotations and multiomics of cattle enhances climate-resilience prediction and mapping</b>", <i>In Press</i> in <i>PNAS</i>. The FAEMI score combines functional annotations to predict the probability that a variable genomic site causes variation in 16 complex traits of 103K cattle. The details of this score and its potential future uses are listed below.</p><p dir="ltr">The FAEMI score file faemidata_NOV2025.gz is a gzip-compressed file, best opened on Linux. For example, you can view the first few lines by running zcat faemidata_NOV2025.gz|head on a Linux machine. You can also read it in R with the data.table package, for example, dt <- data.table::fread("faemidata_NOV2025.gz").</p><p dir="ltr"><b>The last 4 columns are the FAEMI scores</b> (rankings) estimated for the 16 million sequence variants. The <b>23rd column</b>, labelled "<b>faemisc</b>", is the FAEMI score used in the manuscript. The <b>24th column "faemisc_noSNPDens"</b> is an older version of the FAEMI score that was not corrected for SNP density, i.e., the number of nearby SNPs for each SNP (column "SNPDensity"). <b>The 25th column "faemisc_RF"</b> is the FAEMI score trained using Random Forest. For completeness, we published these two additional scores in case they are useful in future studies.</p><p dir="ltr"><b>The 26th column "faemisc_class" categorises</b> the <b>23rd column</b>, labelled "<b>faemisc</b>", into 3 discrete ranking classes: "FAEMI_high" (highly informative), "FAEMI_med" (moderately informative) and "FAEMI_low" (lowly informative). 
This ranking is used in the manuscript and can be used as a prior for genomic prediction or for functional annotation in your own studies.</p><p dir="ltr"><b>Usage</b>:</p><p dir="ltr"><b>Biological prior for genomic prediction</b>. You can use the FAEMI score as an SNP prior for genomic prediction. Use the <b>23rd column</b>, labelled "<b>faemisc</b>", as a <b>quantitative</b> prior, or use <b>the 26th column "faemisc_class"</b> as a <b>categorical</b> prior.</p><p dir="ltr"><b>SNP annotation</b>. You can use <b>the 26th column "faemisc_class"</b> to annotate your SNPs [e.g., "FAEMI_high" (highly informative), "FAEMI_med" (moderately informative) and "FAEMI_low" (lowly informative)]. You can also use the collated functional categories from the annotation file (columns 1-22, see below and the manuscript) to annotate your GWAS results or SNP sets. For example, you can find out whether your GWAS hits are eQTL, sQTL, conserved across species or differentiated between Bos taurus and Bos indicus.</p><p dir="ltr"><b>Train your own score using FAEMI data</b>. You may also use the functional categories to train your own FAEMI score. This requires jointly estimated SNP effects on the phenotypes of interest to you (you should use BayesR or GBLUP; GWAS estimates are not reliable because they test SNPs one at a time). If you have jointly estimated SNP effects, e.g., the Posterior Inclusion Probability (PIP) of each SNP, you can regress your PIPs on the functional categories, quantify the contribution of each annotation to your phenotypes, and predict a FAEMI score from your own data.</p><p dir="ltr">For example, suppose you have a subset (say 5 million) of the 16 million SNPs with jointly estimated PIPs. 
You can join that data (5 million SNPs with PIPs) with the 1st to 22nd columns of the original FAEMI score file (faemidata_NOV2025.gz, from the 1st column through the column "SNPDensity").</p><p dir="ltr">Suppose you now have a new file of 5 million SNPs with 23 columns, where the first 22 columns are the same as in the original FAEMI score file and the 23rd column holds your PIP values. You can then regress your own PIPs on the 4th to the 22nd columns (excluding the first 3 columns of SNP, chr and bp) using a linear model (e.g., lm() in R):</p><p dir="ltr">PIP = intercept + VEPc + ChIPseq + CAGE + HiC + AnmQTLdb + eQTL + sQTL + aseQTL + asbQTL + hQTL + ausLCMS_mQTL + nzNMR_mQTL + Conserved + BiBtDiff + youngSNP + TopCADD + MAF + LDscore + SNPDensity + e</p><p dir="ltr">You can also try non-linear models (e.g., Random Forest, as tested in the manuscript) for training. You can then obtain the coefficient of each functional category and use these coefficients to predict the FAEMI score for the remaining SNPs. In our study, we included MAF, LD score and SNP density when training the model. However, when predicting, we dropped the coefficients of MAF, LD score and SNP density, because we think the predicted score should not be dominated by these three covariates. 
We recommend you do the same when you train your own FAEMI score.</p><p dir="ltr"><b>Other columns of the file:</b></p><p dir="ltr">SNP: SNP ID (chromosome:position based on the ARS-UCD1.2 genome).</p><p dir="ltr">Chr: chromosome.</p><p dir="ltr">bp: position.</p><p dir="ltr">VEPc: merged Variant Effect Predictor annotation.</p><p dir="ltr">ChIPseq: ChIP-seq annotation using 3 FAANG studies.</p><p dir="ltr">CAGE: variants under CAGE annotation (enhancers, super enhancers, etc.).</p><p dir="ltr">HiC: variants under HiC annotations.</p><p dir="ltr">AnmQTLdb: variants annotated as cattle QTL in the Animal QTL database.</p><p dir="ltr">eQTL: meta-analysis of 16-tissue expression quantitative trait loci.</p><p dir="ltr">sQTL: meta-analysis of 16-tissue splicing quantitative trait loci.</p><p dir="ltr">aseQTL: allele-specific expression QTL.</p><p dir="ltr">asbQTL: allele-specific binding QTL.</p><p dir="ltr">hQTL: ChIP-seq height QTL.</p><p dir="ltr">ausLCMS_mQTL: Australian LCMS-assayed metabolomic QTL.</p><p dir="ltr">nzNMR_mQTL: New Zealand NMR-assayed metabolomic QTL.</p><p dir="ltr">Conserved: sites conserved across different species.</p><p dir="ltr">BiBtDiff: differentially selected sites between Bos taurus and Bos indicus.</p><p dir="ltr">youngSNP: variants that are relatively young.</p><p dir="ltr">TopCADD: human CADD annotation lifted over to the cattle genome.</p><p dir="ltr">MAF: minor allele frequency.</p><p dir="ltr">LDscore: LD-score estimate for each variant.</p><p dir="ltr">SNPDensity: the number of nearby SNPs for each SNP.</p>
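<p dir="ltr"><b>Illustrative sketch of training and prediction.</b> The re-training recipe described above (regress PIP on the annotation columns, then predict without the MAF, LD score and SNP density coefficients) can be sketched as follows. This is a minimal, dependency-free Python stand-in for lm() in R, not the implementation used in the manuscript. The toy data, the helper names solve, fit_ols and predict_score, the five-column subset of annotations, and the choice to keep the intercept when predicting are all assumptions made for illustration.</p>

```python
import random

# Covariates fitted during training but left out when predicting (as described above).
NUISANCE = {"MAF", "LDscore", "SNPDensity"}


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, columns, target="PIP"):
    """Least-squares fit of target on columns plus an intercept (normal equations)."""
    names = ["intercept"] + list(columns)
    X = [[1.0] + [float(r[c]) for c in columns] for r in rows]
    y = [float(r[target]) for r in rows]
    k = len(names)
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return dict(zip(names, solve(xtx, xty)))


def predict_score(row, coef, drop=NUISANCE):
    """Predict a score from fitted coefficients, skipping the dropped covariates.

    Keeping the intercept here is an assumption; it does not change SNP ranking.
    """
    return coef["intercept"] + sum(
        b * float(row[name])
        for name, b in coef.items()
        if name != "intercept" and name not in drop
    )


# Toy, noise-free training set standing in for the joined "annotations + PIP" file.
random.seed(1)
columns = ["eQTL", "Conserved", "MAF", "LDscore", "SNPDensity"]
train = []
for _ in range(200):
    r = {
        "eQTL": random.randint(0, 1),
        "Conserved": random.randint(0, 1),
        "MAF": random.uniform(0.01, 0.5),
        "LDscore": random.uniform(1, 50),
        "SNPDensity": random.randint(1, 100),
    }
    r["PIP"] = (0.01 + 0.05 * r["eQTL"] + 0.02 * r["Conserved"]
                + 0.03 * r["MAF"] + 0.0002 * r["LDscore"] + 0.0001 * r["SNPDensity"])
    train.append(r)

coef = fit_ols(train, columns)  # all covariates are included when training

# A SNP without a PIP: its score uses only the functional-annotation coefficients.
new_snp = {"eQTL": 1, "Conserved": 0, "MAF": 0.4, "LDscore": 30, "SNPDensity": 80}
score = predict_score(new_snp, coef)
print(coef)
print(score)
```

<p dir="ltr">With real data, train would hold the joined rows (annotation columns plus PIP) and columns would list the names of the 4th to 22nd columns. A library routine such as lm() would normally replace the hand-written fit_ols.</p>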

Funding

Australian Research Council’s Discovery Projects (DP160101056, DP200100499 and DP230101352); DairyBio, a joint venture project of Agriculture Victoria (Melbourne, Australia), Dairy Australia (Melbourne, Australia), and the Gardiner Foundation (Melbourne, Australia), funded computing resources used in the analysis.
