This page documents the meaning of each field (column) used in the variant databases available in PubMind.
A column whose value contains | holds the aggregated values from each record of a variant. The order of values is consistent across these columns. For example, the order of entries in formatted_reference matches the order of entries in LLM_reasoning and pathogenicity.
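The alignment rule described above can be sketched as follows. This is a minimal illustration, not PubMind's own tooling: the helper name split_records, the choice of aligned columns, the assumption that | never appears inside a single value, and the example row are all hypothetical.

```python
from typing import Dict, List, Sequence

# Assumption: these three columns are "|"-aggregated and aligned record by record.
# Extend the tuple with any other column that contains "|" in your export.
ALIGNED_COLUMNS = ("formatted_reference", "LLM_reasoning", "pathogenicity")


def split_records(
    row: Dict[str, str],
    columns: Sequence[str] = ALIGNED_COLUMNS,
    sep: str = "|",
) -> List[Dict[str, str]]:
    """Split aligned, separator-aggregated columns of one variant row into per-record dicts."""
    # Split every column on the separator; surrounding whitespace is stripped defensively.
    parts = {col: [value.strip() for value in row[col].split(sep)] for col in columns}

    # The columns can only be zipped together if they hold the same number of entries.
    counts = {col: len(values) for col, values in parts.items()}
    if len(set(counts.values())) != 1:
        raise ValueError(f"columns are not aligned: {counts}")

    n_records = next(iter(counts.values()))
    return [{col: parts[col][i] for col in columns} for i in range(n_records)]


# Hypothetical row invented for illustration only (not real PubMind data).
example_row = {
    "PVID": "PV_EXAMPLE",
    "formatted_reference": "Reference A | Reference B",
    "LLM_reasoning": "Reasoning for A | Reasoning for B",
    "pathogenicity": "pathogenic | likely pathogenic",
}

records = split_records(example_row)
for record in records:
    print(record["formatted_reference"], "->", record["pathogenicity"])
```

Raising an error on mismatched lengths, rather than silently truncating, makes it obvious when a column was not aggregated the way this sketch assumes.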
PVID: Unique PubMind Variant ID.
gene: Name of the gene extracted by the LLM.
formatted_reference: Citation(s) or source(s) of the variant.
MONDO_name_09: Normalized disease name assigned based on 90% similarity with the MONDO human disease database.
MONDO_ID_09: Corresponding MONDO ID for MONDO_name_09.
LLM_reasoning: Reasoning provided by the LLM for the pathogenicity classification.
pathogenicity: LLM-assigned classification (i.e., pathogenic, likely pathogenic, benign, likely benign, unknown).
disease: Disease name extracted from the literature paragraph by the LLM before normalization (for SNVs).
related_disease: Disease name extracted from the literature paragraph by the LLM before normalization (for complex variants).
Num_of_record_used: Number of individual records (paragraphs) supporting or mentioning this variant.
Num_of_paper_used: Number of individual papers supporting or mentioning this variant.
pathogenicity_sum: Final pathogenicity assignment across all records.
pathogenicity_score: Quantitative score summarizing the pathogenicity across all records (range: 0–1).
confidence: Confidence level for a variant entry, based on how much evidence was collected (range: 0–3).
confidence_criteria: Reasoning or rules behind the confidence level.
RSID: Reference SNP ID from dbSNP.
dna_change: cDNA-level mutation (e.g., 123A>T).
aa_change: Protein-level mutation (e.g., Glu41Lys).
genomic_coord_result: Genomic coordinate(s) inferred from Ensembl transcript mapping. One variant can correspond to multiple transcripts.
parsed_variants: Parsed, structured representation of the variant based on genomic_coord_result.
phenotype: Phenotype terms extracted from the literature paragraph by the LLM before normalization.
HPO_term_09: Normalized phenotype term assigned based on 90% similarity with the HPO database.
HPO_ID_09: Corresponding HPO phenotype ID for HPO_term_09.
MONDO_name_counted: Aggregated number of MONDO disease names mentioned across records.
HPO_term_counted: Aggregated number of HPO terms mentioned across records.
MONDO_ID_counted: Aggregated number of MONDO disease IDs mentioned across records.
HPO_ID_counted: Aggregated number of HPO IDs mentioned across records.
PMCID_PMID_counted: Aggregated number of references (PMCID/PMID) for the variant.
gene_fusion: Formatted fusion (e.g., EML4::ALK).
first_gene: First gene of the fusion as assigned by the LLM; this is normally the driver gene in the fusion.
partner_gene: Second gene of the fusion as assigned by the LLM; this is normally the partner gene.
protein_domains_affected: Functional domains impacted, normally for first_gene.
type: Structural variant type (deletion, duplication, inversion, etc.).
chr_start_end: Genomic coordinates.
gene_affected: Genes overlapped or affected by the SV.

If you have questions or suggestions, please visit the GitHub repo.