The Mastermind Indexed Variants File is a resource for quickly identifying which genomic variants have been cited in the scientific and medical literature. It lets you automate the prioritization of variants that have been reported in peer-reviewed evidence.
It is delivered as a VCF (Variant Call Format) or CSV (Comma-Separated Values) file for easy integration into existing analysis and bioinformatics pipelines for sample sequence data. Files are supplied for pipelines built on GRCh37 and for pipelines built on GRCh38. For more information about Genomenon’s Indexed Variant File formats, please see the documentation.
Easy Integration
VCF file format displays results at the genomic coordinate level to easily integrate into your existing pipelines
Over 26 Million Variants
The most comprehensive variant list available, with 3 different citation counts per variant to aid in prioritization
Updated Quarterly
More current than any other resource, with one-click access directly to the relevant literature within Mastermind
What information is in the Indexed Variant File?
The Mastermind Cited Variants Reference contains the number of articles for each variant cited in the medical literature. This information is most commonly used in two ways: as an evidence filter for clinical actionability in genomic analysis pipelines (based on the presence or absence of evidence in the literature), and as a quick way to gain insight into the literature for variant curation through links into the Mastermind Genomic Intelligence Platform.
Specifically, the Cited Variants Reference includes three literature counts for each variant, denoted as MMCNT1, MMCNT2, and MMCNT3. These counts range from highly specific (MMCNT1) to highly sensitive (MMCNT3):
MMCNT1 (most specific) – cDNA-level exact matches. This is the number of articles that mention the variant at the nucleotide level in either the title/abstract or the full text.
MMCNT2 – cDNA-level possible matches. This is the number of articles with nucleotide-level matches (from MMCNT1) plus articles with protein-level matches in which the publication did not specify the cDNA-level change. These articles could be referring to this nucleotide-level variant, but they contain insufficient data to determine this conclusively.
MMCNT3 (most sensitive) – This is the number of articles citing any variant resulting in the same biological effect as this variant. This includes the articles from MMCNT1 and MMCNT2 plus articles with alternative cDNA-level variants that result in the same protein effect.
MMURL3 – This is a deep-link into Mastermind for the selected variant, which shows all articles from MMCNT3, in order to investigate and explore the evidence in the literature.
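As a minimal sketch of how these fields might be read, the snippet below parses a VCF-style INFO string into the three counts and the URL. It assumes the fields appear as `KEY=value` pairs separated by semicolons, as is usual for VCF INFO columns; the example string, the `GENE` key, and the placeholder URL are hypothetical and are not taken from the actual file.

```python
def parse_mastermind_info(info: str) -> dict:
    """Extract the Mastermind fields from a semicolon-delimited INFO string."""
    fields = {}
    for entry in info.split(";"):
        if "=" not in entry:
            continue  # flag-style INFO keys carry no value
        key, value = entry.split("=", 1)  # split once so URLs containing '=' survive
        fields[key] = value
    return {
        "MMCNT1": int(fields.get("MMCNT1", 0)),
        "MMCNT2": int(fields.get("MMCNT2", 0)),
        "MMCNT3": int(fields.get("MMCNT3", 0)),
        "MMURL3": fields.get("MMURL3"),
    }


# Hypothetical INFO string; the real file's layout may differ.
example_info = "GENE=SLC4A11;MMCNT1=0;MMCNT2=1;MMCNT3=1;MMURL3=https://example.invalid/variant?id=1"
record = parse_mastermind_info(example_info)

# Each count includes the articles of the previous one, so they should never decrease.
counts_are_nested = record["MMCNT1"] <= record["MMCNT2"] <= record["MMCNT3"]
```

The nesting check is a cheap sanity test that follows directly from the definitions above.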
How can I use this data?
A common use-case would be to integrate this information into a genomic analysis pipeline for NGS (next-generation sequencing) data. For example, the variant citation counts can be used to annotate a patient VCF file in order to prioritize those variants with clinical evidence, while the URL can be used to speed up the variant curation process.
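The annotation step described above could look roughly like the following sketch. It assumes the reference file has already been loaded into a dictionary keyed by (chromosome, position, ref, alt) and that chromosome names match between the two files; the rows and the second patient variant are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

VariantKey = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)

# Hypothetical rows standing in for records read from the reference file.
reference_index: Dict[VariantKey, Dict[str, int]] = {
    ("20", 3208945, "T", "C"): {"MMCNT1": 0, "MMCNT2": 1, "MMCNT3": 1},
}

NO_COUNTS = {"MMCNT1": 0, "MMCNT2": 0, "MMCNT3": 0}


def annotate(patient_variants: List[VariantKey],
             index: Dict[VariantKey, Dict[str, int]]) -> List[dict]:
    """Attach citation counts to each patient variant by exact coordinate match."""
    annotated = []
    for key in patient_variants:
        counts: Optional[Dict[str, int]] = index.get(key)
        row = {"variant": key, "in_reference": counts is not None}
        row.update(counts if counts is not None else NO_COUNTS)
        annotated.append(row)
    return annotated


patient = [("20", 3208945, "T", "C"), ("1", 12345, "A", "G")]
annotated = annotate(patient, reference_index)

# Keep only variants with at least one article at the most sensitive level.
with_evidence = [row for row in annotated if row["MMCNT3"] > 0]
```

Keeping an explicit `in_reference` flag separates "not in the file" from "in the file with low counts", which matters because the file does not cover every variant type.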
To further improve the curation process, you may prioritize variants relative to one another by number of articles, prioritizing those with more citations more highly. Preference may also be given to those variants with more exact cDNA-level citations (MMCNT1).
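One possible way to express this ordering is a sort that prefers exact cDNA-level citations first and uses the broader counts as tie-breakers. This is only an illustrative choice of ranking, not a recommended scoring scheme; the variant identifiers and counts below are hypothetical.

```python
# Hypothetical annotated variants; counts are invented for illustration.
variants = [
    {"id": "var_a", "MMCNT1": 0, "MMCNT2": 1, "MMCNT3": 1},
    {"id": "var_b", "MMCNT1": 3, "MMCNT2": 5, "MMCNT3": 9},
    {"id": "var_c", "MMCNT1": 0, "MMCNT2": 4, "MMCNT3": 12},
    {"id": "var_d", "MMCNT1": 3, "MMCNT2": 5, "MMCNT3": 20},
]


def priority_key(variant: dict) -> tuple:
    """Sort by MMCNT1, then MMCNT2, then MMCNT3, each in descending order."""
    return (-variant["MMCNT1"], -variant["MMCNT2"], -variant["MMCNT3"])


ranked = sorted(variants, key=priority_key)
ranked_ids = [v["id"] for v in ranked]
```

A pipeline that values sensitivity over specificity could reverse the key so that MMCNT3 is compared first.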
Is all Mastermind data contained in this reference file?
No. While the file does contain over 26 million variants seen in the medical literature, it doesn’t include everything in Mastermind’s ever-expanding database.
Providing variant counts by genomic coordinate, as is standard in the VCF specification, has some technical limitations, because authors don’t always provide enough information to translate protein-level changes into exact gDNA-level variants.
This is why the Indexed Variant Reference includes three separate levels of specificity for each genomic-level variant. In order to provide both MMCNT2 and MMCNT3 in the file, we must expand each protein-level change, such as amino acid substitutions, into all nucleotide-level changes that could result in that change at the protein level.
For example, an article may cite p.M856V in the SLC4A11 gene with no mention of the gDNA-level change. From the gene’s transcript, we can determine that there are four nucleotide-level changes which can result in this amino acid substitution:
• NC_000020.10:g.3208945T>C
• NC_000020.10:g.3208943_3208945delinsAAC
• NC_000020.10:g.3208943_3208945delinsGAC
• NC_000020.10:g.3208943_3208945delinsTAC
Listing all four nucleotide-level variants in the reference file allows maximum sensitivity; absent any other articles, each would have counts of MMCNT1=0, MMCNT2=1, and MMCNT3=1.
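The expansion in this example can be reproduced with the standard genetic code. The sketch below enumerates every codon that encodes the target amino acid and, assuming the gene is transcribed from the reverse strand (which is consistent with the four alleles listed above), reverse-complements them to obtain genomic alternate alleles. It is an illustration of the idea only, not Genomenon's implementation, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def expand_substitution(ref_codon: str, alt_aa: str, minus_strand: bool = False) -> list:
    """List the 3-base alleles that turn ref_codon into a codon for alt_aa."""
    alts = [c for c, aa in CODON_TABLE.items() if aa == alt_aa and c != ref_codon]
    if minus_strand:
        # Genomic coordinates are on the forward strand, so flip coding-strand codons.
        alts = [reverse_complement(c) for c in alts]
    return sorted(alts)


# p.M856V: methionine is encoded only by ATG; valine by GTT, GTC, GTA, GTG.
genomic_ref = reverse_complement("ATG")
genomic_alts = expand_substitution("ATG", "V", minus_strand=True)
```

Here `genomic_ref` is `CAT`, and the alternate `CAC` differs from it only in the last base, which corresponds to the single-nucleotide T>C entry; the other three alternates correspond to the `delins` entries.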
However, some protein-level changes cannot feasibly be normalized into all possible nucleotide-level changes, as there would be too many possibilities to list. These variants may be queried in the Mastermind user interface or API, but are not contained in the Cited Variants Reference.
To summarize:
Substitutions, intronic and splice-site variants, and UTR variants are in the reference file.
Frameshift variants, duplications, deletions, insertions, indels, and inversions may be in the reference file, depending on the complexity of the variation and the level of nomenclature used within the literature. For these, we recommend querying the Mastermind API for maximum sensitivity.
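A pipeline might encode this summary as a simple rule for deciding when a file lookup is enough and when an API query is advisable. The sketch below assumes the caller already supplies a variant-type label; the label strings and the function name are hypothetical, and no actual API call is shown.

```python
# Categories taken from the summary above; the label strings are hypothetical.
IN_REFERENCE_FILE = {"substitution", "intronic", "splice_site", "utr"}
MAY_BE_IN_REFERENCE_FILE = {
    "frameshift", "duplication", "deletion", "insertion", "indel", "inversion",
}


def recommend_api_query(variant_type: str) -> bool:
    """Return True when the file alone may miss literature for this variant type."""
    if variant_type in IN_REFERENCE_FILE:
        return False
    # Partially covered and unrecognized types are treated conservatively.
    return True


checks = {t: recommend_api_query(t) for t in ("substitution", "frameshift", "inversion")}
```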
PLEASE NOTE: If you embed the Indexed Variant Reference (IVR) data in user-facing applications, we strongly recommend that you clearly inform users of these limitations. Mastermind has found variants in the published literature that are not in the IVR (refer to the summary above). In these cases, the Mastermind API can be queried directly to find relevant literature.
For more information, contact Genomenon at support@genomenon.com.
*Updated July 10, 2024