SOP: finding protein and gene data online

For any protein/gene page, pull from these primary sources. Cross-reference UniProt + NCBI Gene at minimum.

Primary databases

UniProt — `https://www.uniprot.org/`

Best for: Curated protein-level information — sequence, function, domains, post-translational modifications, disease associations, all with PubMed-cited evidence.

What to extract:

UniProt accession (e.g., P04637 for human p53)
Recommended name + alternative names
Functional summary (the “Function” section)
Subcellular location
Domain structure
Key PTMs (phosphorylation sites, etc.)
Disease/phenotype associations

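Reviewed vs unreviewed: Prefer reviewed (Swiss-Prot, manually curated) entries, which are marked “Reviewed” on the entry page. Unreviewed (TrEMBL, automatically annotated) entries are less reliable. Do not judge status from the accession format: short six-character accessions occur in both sections.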

Citation: UniProt P04637 (P53_HUMAN), accessed YYYY-MM-DD
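
The fields listed above can also be pulled programmatically. The following is a minimal sketch, assuming the UniProt REST endpoint `https://rest.uniprot.org/uniprotkb/<accession>.json` and the JSON field names `entryType`, `primaryAccession`, `uniProtkbId`, `proteinDescription`, and `comments`. The helper names (`fetch_entry`, `summarize_entry`, `format_citation`) are hypothetical, and the field layout should be checked against a live response before relying on it.

```python
import json
import urllib.request
from datetime import date

# Assumed REST endpoint; verify against the current UniProt API documentation.
UNIPROT_URL = "https://rest.uniprot.org/uniprotkb/{accession}.json"


def fetch_entry(accession: str) -> dict:
    """Download one UniProtKB entry as a dict (requires network access)."""
    with urllib.request.urlopen(UNIPROT_URL.format(accession=accession)) as resp:
        return json.load(resp)


def summarize_entry(entry: dict) -> dict:
    """Pull the SOP's core fields out of an entry dict.

    All key names below are assumptions about the UniProt JSON layout.
    """
    name = (
        entry.get("proteinDescription", {})
        .get("recommendedName", {})
        .get("fullName", {})
        .get("value")
    )
    # Collect the free-text "Function" comments, if any.
    function_texts = [
        text["value"]
        for comment in entry.get("comments", [])
        if comment.get("commentType") == "FUNCTION"
        for text in comment.get("texts", [])
    ]
    return {
        "accession": entry.get("primaryAccession"),
        "entry_name": entry.get("uniProtkbId"),
        "recommended_name": name,
        # Reviewed status is read from the entryType string.
        "reviewed": "Swiss-Prot" in entry.get("entryType", ""),
        "function": " ".join(function_texts),
    }


def format_citation(summary: dict, accessed: date) -> str:
    """Render the citation in the format this SOP uses."""
    return (
        f"UniProt {summary['accession']} ({summary['entry_name']}), "
        f"accessed {accessed.isoformat()}"
    )
```

A typical call would be `format_citation(summarize_entry(fetch_entry("P04637")), date.today())`. Keeping the parsing separate from the download makes it possible to check the parser against a saved JSON file without network access.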

NCBI Gene — `https://www.ncbi.nlm.nih.gov/gene/`

Best for: Gene-level info — genomic location, exon structure, RNA isoforms, expression data (linked to GTEx), orthologs across species.

What to extract:

NCBI Gene ID (e.g., 7157 for human TP53)
HGNC symbol (TP53)
Official full name
Genomic location and size
Orthologs (especially mouse, useful for cross-species claims)
RefSeq transcripts

Citation: NCBI Gene ID 7157, accessed YYYY-MM-DD
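
For batch lookups, the same gene-level fields can be requested through NCBI E-utilities. This is a sketch under stated assumptions: it uses the `esummary.fcgi` endpoint with `db=gene` and `retmode=json`, and it assumes the record keys `uid`, `name`, `description`, `chromosome`, and `maplocation`. The function names are hypothetical, and the keys should be confirmed against a real response.

```python
import json
import urllib.parse
import urllib.request

EUTILS_ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(gene_id: int) -> str:
    """Build the ESummary request URL for one NCBI Gene ID."""
    params = {"db": "gene", "id": str(gene_id), "retmode": "json"}
    return f"{EUTILS_ESUMMARY}?{urllib.parse.urlencode(params)}"


def fetch_gene_record(gene_id: int) -> dict:
    """Download the summary record for one gene (requires network access)."""
    with urllib.request.urlopen(esummary_url(gene_id)) as resp:
        payload = json.load(resp)
    # Assumed layout: records are keyed by Gene ID under "result".
    return payload["result"][str(gene_id)]


def extract_gene_fields(record: dict) -> dict:
    """Map assumed ESummary keys onto the fields this SOP asks for."""
    return {
        "gene_id": record.get("uid"),
        "symbol": record.get("name"),
        "full_name": record.get("description"),
        "chromosome": record.get("chromosome"),
        "map_location": record.get("maplocation"),
    }


def format_citation(gene_id: int, accessed_iso: str) -> str:
    """Render the citation in the format this SOP uses."""
    return f"NCBI Gene ID {gene_id}, accessed {accessed_iso}"
```

Ortholog and RefSeq transcript details are not covered by this sketch and would still need to be read from the gene page or a separate query.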

HGNC — `https://www.genenames.org/`

Best for: Authoritative human gene nomenclature. Use HGNC symbols in filenames and frontmatter to avoid synonym ambiguity.

Interaction databases

STRING — `https://string-db.org/`

Best for: Protein-protein interaction network with confidence scoring (text-mining, experimental, database, co-expression evidence channels).

What to extract:

Top 10–20 interactors at confidence ≥0.7
Distinguish “experimental” from “predicted” (text-mining-only) edges

Caveat: STRING’s text-mining channel is noisy. For mechanistic claims, require either “experimental” or “database” evidence.
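
The threshold and the evidence-channel rule above can be applied mechanically once the edges are exported. The sketch below assumes each edge is a dict shaped like the STRING API's JSON output for interaction partners, with a combined `score` and per-channel `escore` (experimental), `dscore` (database), and `tscore` (text-mining) on a 0 to 1 scale. Those key names are assumptions to verify against an actual export, and `mechanistic_edges` is a hypothetical helper.

```python
from typing import Iterable, List

MIN_COMBINED = 0.7  # combined-score threshold from this SOP


def has_direct_evidence(edge: dict) -> bool:
    """True if the edge has experimental or database support."""
    return edge.get("escore", 0.0) > 0 or edge.get("dscore", 0.0) > 0


def mechanistic_edges(edges: Iterable[dict], limit: int = 20) -> List[dict]:
    """Keep high-confidence edges that are not text-mining-only.

    Returns at most `limit` edges, highest combined score first.
    """
    kept = [
        edge
        for edge in edges
        if edge.get("score", 0.0) >= MIN_COMBINED and has_direct_evidence(edge)
    ]
    kept.sort(key=lambda edge: edge["score"], reverse=True)
    return kept[:limit]
```

Edges that pass the combined-score threshold but fail `has_direct_evidence` can still be listed on a page as predicted interactors, as long as they are not used to support mechanistic claims.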

BioGRID — `https://thebiogrid.org/`

Best for: Manually curated physical and genetic interactions, with the original publication for each.

What to extract: Direct interactors with PubMed IDs for each interaction → ingest the most-cited interaction studies into studies/.

IntAct — `https://www.ebi.ac.uk/intact/`

Best for: Detailed interaction evidence with experimental method (Y2H, AP-MS, IP, etc.). More structured than STRING for evaluating interaction quality.

Function and ontology

AmiGO / Gene Ontology — `https://amigo.geneontology.org/`

Best for: Standardized functional annotations (molecular function, biological process, cellular component). Use for “what does this protein do” at a controlled-vocabulary level.

InterPro — `https://www.ebi.ac.uk/interpro/`

Best for: Protein domains and motifs.

CAZy — `https://www.cazy.org/`

Best for: Carbohydrate-active enzyme (CAZyme) families and modules, especially when evaluating glycoside hydrolases (GHs), glycosyltransferases (GTs), polysaccharide lyases (PLs), carbohydrate esterases (CEs), auxiliary activities (AAs), and carbohydrate-binding modules (CBMs). Use for enzyme-family scouting when a hypothesis involves carbohydrate or glycoconjugate recognition/catalysis, such as engineered glucosepane-cleavage scaffolds.

What to extract:

CAZy family or module class (e.g., GH, GT, PL, CE, AA, CBM)
Family number and clan where available
Known activities and EC numbers
Mechanism annotation if listed
Organism distribution and candidate microbial/fungal homologs
Cross-references to UniProt, GenBank, PDB, EC, and PubMed

Caveat: CAZy family membership supports fold/function hypothesis generation, not proof that a specific enzyme cleaves a proposed substrate. For noncanonical substrates such as glucosepane, require direct biochemical validation on authentic substrate; do not infer activity from GH/CBM family membership alone.

Citation: CAZy family GHxx / CBMxx, accessed YYYY-MM-DD
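
When CAZy labels are copied from papers or database pages, normalizing them before they go into a citation helps keep pages consistent. The following sketch parses labels such as `GH13` or `CBM20` into class and family number. The helper names are hypothetical, and subfamily suffixes (for example `GH13_1`) are deliberately not handled.

```python
import re
from typing import NamedTuple, Optional

# Class abbreviations as listed in this SOP.
CAZY_CLASSES = {
    "GH": "glycoside hydrolase",
    "GT": "glycosyltransferase",
    "PL": "polysaccharide lyase",
    "CE": "carbohydrate esterase",
    "AA": "auxiliary activity",
    "CBM": "carbohydrate-binding module",
}

_FAMILY_RE = re.compile(r"^(GH|GT|PL|CE|AA|CBM)(\d+)$")


class CazyFamily(NamedTuple):
    cls: str
    number: int
    description: str


def parse_family(label: str) -> Optional[CazyFamily]:
    """Parse a label like 'GH13'; return None if it is not a plain family label."""
    match = _FAMILY_RE.match(label.strip().upper())
    if match is None:
        return None
    cls, number = match.group(1), int(match.group(2))
    return CazyFamily(cls, number, CAZY_CLASSES[cls])


def cazy_citation(family: CazyFamily, accessed_iso: str) -> str:
    """Render the citation in the format this SOP uses."""
    return f"CAZy family {family.cls}{family.number}, accessed {accessed_iso}"
```

Parsing a label only normalizes its spelling. It says nothing about whether the enzyme acts on a given substrate, so the caveat above still applies.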

Workflow for a new protein page

UniProt → accession, function summary, domains, PTMs, disease associations.
NCBI Gene → gene ID, HGNC symbol, mouse/rat orthologs.
InterPro / GO → domains and controlled-vocabulary function.
CAZy → enzyme-family/module annotations when the protein is a CAZyme or candidate carbohydrate/glycoconjugate-active enzyme.
STRING (confidence ≥0.7, experimental or database evidence) → top interactors.
List pathways the protein participates in → wikilink to pathway pages (run sops/finding-pathway-data.md if pathway pages don’t exist yet).
Map to relevant Hallmarks of Aging.
Use archive search to find the 3–5 most-cited reviews of the protein’s role in aging → create study pages.

Frontmatter example

---
type: protein
aliases: [TP53, tumor protein p53, "guardian of the genome"]
uniprot: P04637
ncbi-gene: 7157
hgnc: 11998
mouse-ortholog: Trp53
ensembl: ENSG00000141510
key-domains: [DNA-binding, tetramerization, transactivation]
key-ptms: [Ser15-phos, Ser20-phos, Lys382-acetyl]
pathways: ["[[p53-pathway]]", "[[dna-damage-response]]", "[[apoptosis-pathway]]"]
hallmarks: ["[[genomic-instability]]", "[[cellular-senescence]]"]
known-interactors: ["[[mdm2]]", "[[cbp-p300]]", "[[atm]]"]
---
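
Identifier typos in frontmatter are easy to miss, so a quick format check before committing a page can help. The sketch below assumes the frontmatter has already been parsed into a dict. The UniProt pattern is the accession regular expression published in the UniProt help pages. The choice of required keys and the rule that `ncbi-gene` and `hgnc` are bare integers are assumptions based on the example above, not stated requirements of this wiki.

```python
import re

# Accession pattern from the UniProt help pages.
UNIPROT_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)
ENSEMBL_GENE_RE = re.compile(r"^ENSG\d{11}$")

# Hypothetical: which keys are mandatory is an assumption.
REQUIRED_KEYS = ("type", "uniprot", "ncbi-gene", "hgnc")


def validate_frontmatter(meta: dict) -> list:
    """Return a list of human-readable problems; empty means no issues found."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in meta:
            problems.append(f"missing key: {key}")
    if "uniprot" in meta and not UNIPROT_RE.match(str(meta["uniprot"])):
        problems.append(f"uniprot does not look like an accession: {meta['uniprot']}")
    for key in ("ncbi-gene", "hgnc"):
        if key in meta and not str(meta[key]).isdigit():
            problems.append(f"{key} should be a bare integer: {meta[key]}")
    if "ensembl" in meta and not ENSEMBL_GENE_RE.match(str(meta["ensembl"])):
        problems.append(f"ensembl does not look like a gene ID: {meta['ensembl']}")
    return problems
```

The check is purely syntactic: it catches a gene symbol pasted into the `uniprot` field, but it cannot tell whether a well-formed accession belongs to the right protein.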

What NOT to trust

Wikipedia protein boxes (good starting point only).
“Predicted” annotations not backed by experimental evidence.
STRING text-mining-only edges for mechanistic claims.
Old NCBI Gene records: check the “updated on” date shown beside the Gene ID at the top of the record.
