About — AI Co-Scientist

Biomedical knowledge is scattered across dozens of siloed sources: literature indexes, curated knowledge bases, ontologies, trial registries, molecular databases, and dataset repositories. Each comes with its own schema, terminology, and interface. Answering a single research question, such as “Is this gene a viable drug target?”, often requires moving across these systems by hand, normalizing identifiers, comparing conflicting evidence, and tracing claims back to the original source.

AI Co-Scientist brings 50+ biomedical information resources into one conversational workflow. Ask a question in plain language and the agent plans a multi-step investigation, queries the most relevant sources, cross-checks the results, and synthesizes the findings into a cited research report with provenance.
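
The plan, query, cross-check, and synthesize steps described above can be sketched roughly as follows. This is a minimal illustration, not the product's actual implementation: `Finding`, `investigate`, `render_report`, and the stub sources are hypothetical names, and the references are placeholders rather than real records.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Finding:
    source: str      # which resource produced the claim
    claim: str       # normalized statement the finding supports
    reference: str   # provenance pointer (accession, PMID, URL, ...)


# Hypothetical adapter type: takes a question, returns findings.
SourceFn = Callable[[str], List[Finding]]


def investigate(question: str, sources: Dict[str, SourceFn],
                plan: List[str]) -> Dict[str, List[Finding]]:
    """Query the planned sources in order and group findings by claim."""
    grouped: Dict[str, List[Finding]] = {}
    for name in plan:
        for finding in sources[name](question):
            grouped.setdefault(finding.claim, []).append(finding)
    return grouped


def render_report(grouped: Dict[str, List[Finding]]) -> str:
    """Cross-check each claim and emit one cited line per claim."""
    lines = []
    for claim, findings in grouped.items():
        distinct_sources = {f.source for f in findings}
        # A claim backed by more than one source is marked as corroborated.
        status = "corroborated" if len(distinct_sources) > 1 else "single-source"
        cites = "; ".join(f"{f.source}: {f.reference}" for f in findings)
        lines.append(f"{claim} [{status}] ({cites})")
    return "\n".join(lines)


# Stub sources with placeholder data, for illustration only.
def stub_a(question: str) -> List[Finding]:
    return [Finding("source_a", "claim X", "ref-a1")]


def stub_b(question: str) -> List[Finding]:
    return [Finding("source_b", "claim X", "ref-b1"),
            Finding("source_b", "claim Y", "ref-b2")]


report = render_report(
    investigate("example question",
                {"source_a": stub_a, "source_b": stub_b},
                plan=["source_a", "source_b"])
)
print(report)
```

Keeping the provenance pointer on every finding is what lets the final report cite each claim back to the source that produced it.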

Data sources

Organized from the broadest research questions to the most specialized analyses.

Literature & Clinical Evidence

Starting point for any biomedical question.

PubMed: Over 38 million biomedical abstracts and citations from MEDLINE, life science journals, and online books. The primary literature search engine for biomedicine.
OpenAlex: Broad scholarly knowledge graph covering 250M+ works including preprints, conference papers, and datasets. Useful for researcher discovery and citation analysis.
Europe PMC: European life-science literature platform with PubMed records, full-text links, preprints, citation counts, grants, and text-mined metadata. Useful when PubMed alone is too narrow.
ClinicalTrials.gov: Registry of 500K+ clinical studies worldwide. Search active, recruiting, and completed trials by condition, intervention, or sponsor.

Target-Disease Associations

Connecting genes and proteins to diseases with aggregated evidence.

Open Targets Platform: Integrates genetics, genomics, transcriptomics, drugs, and literature into scored target-disease associations. The go-to resource for target validation.
GWAS Catalog: Curated collection of published genome-wide association studies. Returns specific variants, p-values, odds ratios, and mapped genes for any trait or disease.
CIViC: Community-curated clinical interpretations of cancer variants. Expert-reviewed evidence linking specific mutations to diagnosis, prognosis, and treatment response.
ClinGen: Curated gene-disease validity and dosage sensitivity resource. Adds expert-panel classifications and dosage evidence beyond variant-level pathogenicity databases.

Drug Discovery & Safety

The therapeutic landscape — from compounds to post-marketing surveillance.

PubChem: Open chemistry database with 116M+ compounds. Molecular properties, SMILES structures, InChIKeys, drug-likeness descriptors (XLogP, H-bond donors/acceptors, polar surface area), synonyms, and compound descriptions.
ChEMBL: Bioactivity database with 2M+ compounds, binding affinities, functional assays, and ADMET properties. Essential for understanding the pharmacological landscape around a target.
DGIdb: Drug-Gene Interaction Database aggregating 40+ sources. Returns druggability categories (kinase, clinically actionable, etc.) and known drug interactions for any gene.
Guide to Pharmacology: Curated target-ligand pharmacology from IUPHAR/BPS. Useful for mechanism-of-action summaries, action types, and affinity evidence, and as a more tightly curated pharmacology layer than general chemistry databases provide.
GDSC / CancerRxGene: Genomics of Drug Sensitivity in Cancer pharmacogenomics screens across hundreds of cancer cell lines. Useful for compound sensitivity patterns, tissue-specific response, and in vitro IC50/AUC context.
PRISM Repurposing: Broad Institute pooled-cell-line repurposing screen with single-dose log2-fold-change viability readouts across large cancer cell-line panels. Useful for broad viability patterns and fast repurposing-style response scans.
PharmacoDB: Cross-dataset pharmacogenomics portal harmonizing compound-response data from GDSC, PRISM, CTRPv2, and related public screens. Useful for comparing drug response across multiple public screening resources in one place.
FDA FAERS: Post-marketing adverse event reports from the FDA. Analyze safety signals, compare adverse event profiles across drugs, and identify drug class effects.
RxNorm: Standardized drug nomenclature from the NLM. Maps between brand names, generics, ingredients, and clinical drug forms.
DailyMed: Current FDA Structured Product Labels (SPLs) for US drugs. Useful for boxed warnings, indications, contraindications, and warnings/precautions straight from the label.
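
The log2-fold-change viability readout mentioned for PRISM Repurposing can be read as in the sketch below. It assumes the value compares abundance under treatment with a vehicle control. The function names are hypothetical, and real screen processing involves normalization and quality-control steps that are not shown here.

```python
import math


def log2_fold_change(treated: float, control: float) -> float:
    """log2 of treated abundance relative to control (both must be positive)."""
    if treated <= 0 or control <= 0:
        raise ValueError("abundances must be positive")
    return math.log2(treated / control)


def relative_viability(lfc: float) -> float:
    """Convert a log2 fold change back to a treated/control ratio."""
    return 2.0 ** lfc


# Illustrative numbers only: a quarter of the control signal remains.
lfc = log2_fold_change(treated=25.0, control=100.0)
print(lfc)                      # -2.0
print(relative_viability(lfc))  # 0.25
```

Under this reading, a value of 0 means no change relative to control, and each step of -1 halves the remaining signal.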

Gene & Protein Biology

Understanding molecular function, expression, and interactions.

UniProt: Comprehensive protein knowledgebase with sequences, domains, post-translational modifications, subcellular localization, and functional annotations for every known protein.
GTEx: Tissue-level gene expression from the Genotype-Tissue Expression project. Median TPM values across 54 human tissues from 948 donors — critical for target safety assessment.
Human Protein Atlas: Protein-level tissue specificity, single-cell specificity, protein class, and subcellular localization summaries for human genes. Useful for target validation beyond RNA-only evidence.
Reactome: Curated biological pathways and reactions. Understand the signaling cascades, metabolic pathways, and cellular processes a gene participates in.
STRING: Protein-protein interaction networks combining experimental data, text mining, and computational predictions. Identify interaction partners and functional modules.
IntAct: Curated experimental molecular interactions from EMBL-EBI. Useful when you want publication-backed interaction records and detection methods rather than integrated network predictions.
BioGRID: Large experimental interaction archive covering both physical and genetic interactions. Useful for broader publication-backed interaction coverage, throughput context, and partner evidence than a narrower curated interaction subset provides.
Pathway Commons: Integrated pathway and interaction resource aggregating multiple providers into a single queryable graph. Useful for widening pathway context beyond any one source.

Protein Structure

3D structural insights — predicted and experimental.

AlphaFold: AI-predicted protein structures from DeepMind covering 200M+ proteins. Returns pLDDT confidence scores and downloadable PDB/CIF structure files.
RCSB PDB: Experimentally determined structures from X-ray crystallography, cryo-EM, and NMR. Search by gene or UniProt ID to find resolved structures with bound ligands.

Genomic Variation, Phenotypes & Ontologies

Population genetics, phenotype normalization, and rare-disease grounding.

gnomAD: Population variant frequencies from 76K+ genomes and 125K+ exomes. Essential for distinguishing rare pathogenic variants from common benign polymorphisms.
1000 Genomes: Reference catalog of human genetic variation across 26 populations. Foundation for understanding population-level diversity and ancestry-specific variants.
ClinVar: Clinical significance classifications for human variants — pathogenic, benign, uncertain significance. Links variants to conditions with submitter-level evidence.
Ensembl VEP: Variant Effect Predictor returning functional consequence types, SIFT and PolyPhen deleteriousness scores, and AlphaMissense pathogenicity predictions.
MyVariant.info: Aggregated variant annotations pulling from ClinVar, CADD, dbSNP, gnomAD, and COSMIC in a single lookup. Quick comprehensive view of any variant.
MyGene.info: Fast gene identifier normalization service for symbols, aliases, Entrez IDs, Ensembl IDs, and UniProt IDs. Useful for joining evidence across heterogeneous APIs.
RefSeq: NCBI curated reference sequences for transcripts, non-coding RNAs, chromosomes, and proteins (NM/NR/NC/NG/NP accessions). Tools search nuccore and protein indices with refseq[filter] and return accession-level metadata plus links.
UCSC Genome Browser: Interactive reference genome assemblies (hg38, hg19, mouse, and more) with a public REST API for search, interval sequence, and track rows. Tools resolve gene symbols to coordinates, fetch DNA for a locus, and query named tracks within bounded windows.
ENCODE: Encyclopedia of DNA Elements. Functional genomics metadata and files (ChIP-seq, DNase-seq, RNA-seq, and more). Tools query the ENCODE REST API for experiments and related objects, then fetch accession-level JSON with portal links.
OxO: Ontology cross-reference service from EMBL-EBI. Bridges MONDO, EFO, DOID, MeSH, OMIM, UMLS, and related identifier systems for safer cross-database joins.
QuickGO: Gene Ontology term search and annotation service. Supports GO term discovery plus reviewed GO annotations for gene products with evidence codes and references.
Human Phenotype Ontology (HPO): Standard phenotype vocabulary for rare disease and clinical genomics. Useful for normalizing phenotype terms like ataxia, microcephaly, and seizures before cross-source joins.
Orphanet / ORDO: Reference rare-disease catalog and ontology with disease definitions, xrefs, inheritance, age of onset, phenotype annotations, and curated disease-gene links.
Monarch Initiative: Phenotype-centric knowledge graph spanning genes, diseases, phenotypes, and model-organism evidence. Useful for phenotype-to-gene and disease-to-phenotype association queries.
Alliance Genome Resources: Integrated cross-species knowledge platform spanning human and model-organism genomes. Useful for orthologs, disease and phenotype summaries, and model-organism disease evidence that complements human-only sources.
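
As a rough sketch of the kinds of requests described above for MyGene.info, RefSeq, and the UCSC Genome Browser, the helpers below only build request URLs and make no network calls. The function names and the `max_window` limit are illustrative assumptions, not the tool's actual code. The endpoints and parameters follow the public MyGene.info, NCBI E-utilities, and UCSC REST API documentation as far as we know and should be checked against those docs before use.

```python
from urllib.parse import urlencode, urlparse, parse_qs

MYGENE_API = "https://mygene.info/v3/query"
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
UCSC_API = "https://api.genome.ucsc.edu"


def mygene_query_url(symbol: str, species: str = "human") -> str:
    """URL asking MyGene.info to map a gene symbol to other identifier types."""
    params = {
        "q": f"symbol:{symbol}",
        "species": species,
        "fields": "symbol,entrezgene,ensembl.gene,uniprot",
    }
    return f"{MYGENE_API}?{urlencode(params)}"


def refseq_search_url(gene: str, db: str = "nuccore", retmax: int = 20) -> str:
    """E-utilities esearch URL restricted to RefSeq records via refseq[filter]."""
    if db not in ("nuccore", "protein"):
        raise ValueError("db must be 'nuccore' or 'protein'")
    params = {
        "db": db,
        "term": f"{gene}[Gene Name] AND refseq[filter]",
        "retmode": "json",
        "retmax": retmax,
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def ucsc_sequence_url(genome: str, chrom: str, start: int, end: int,
                      max_window: int = 100_000) -> str:
    """UCSC getData/sequence URL for a bounded interval.

    max_window is a hypothetical safety limit, not a UCSC-defined value.
    """
    if not 0 <= start < end:
        raise ValueError("require 0 <= start < end")
    if end - start > max_window:
        raise ValueError("interval exceeds max_window")
    return (f"{UCSC_API}/getData/sequence?"
            f"genome={genome};chrom={chrom};start={start};end={end}")


print(mygene_query_url("TP53"))
print(refseq_search_url("BRCA1"))
print(ucsc_sequence_url("hg38", "chrM", 4321, 5678))
```

Bounding the interval before building the UCSC request mirrors the "bounded windows" behaviour described for the track and sequence tools.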

Specialized Domains

Advanced and niche applications.

Allen Brain Atlas: Reference neuroanatomy and gene expression atlases, including region-level structure ontology and in situ hybridization expression profiles for mouse brain. Supports differential expression and structure-focused queries.
EBRAINS Knowledge Graph: Curated neuroscience knowledge graph spanning datasets, models, software, workflows, and contributors. Useful for discovering reusable brain research assets with rich metadata and provenance.
Zenodo: General-purpose open repository for datasets, software, posters, and publications with DOIs (often 10.5281/zenodo.*). Tools query the public JSON API for discovery and retrieve record metadata with file links.
CONP Datasets: Canadian Open Neuroscience Platform datasets discoverable through the CONP ecosystem and the `conpdatasets` public catalog. Useful for finding reusable neuroscience repositories and linking to dataset documentation and terms.
Neurobagel: Federated cohort discovery ecosystem with harmonized phenotype and imaging metadata. The public node API enables filtered cohort queries across indexed datasets without requiring direct data download.
OpenNeuro: Primary open platform for sharing neuroimaging data in BIDS format. Search datasets by modality (MRI, MEG, EEG, PET) and retrieve metadata, DOIs, and snapshot information.
DANDI Archive: BRAIN Initiative archive for cellular neurophysiology: electrophysiology, calcium imaging, behavioral time-series, and immunostaining. NWB/BIDS format with metadata searchable by keyword.
NEMAR: NeuroElectroMagnetic data Archive for EEG, MEG, and iEEG from OpenNeuro. BIDS format, HED event descriptions, NSG compute integration. Hosted at SDSC.
Brain-CODE: Ontario Brain Institute neuroinformatics platform with clinical, MRI, EEG, and genomic data for epilepsy, depression, neurodegenerative disease, cerebral palsy, and concussion. Public and controlled releases via braincode.ca and CONP.
ENIGMA Consortium: Imaging genetics meta-analysis with 100+ case-control summary statistics for schizophrenia, depression, ADHD, bipolar disorder, OCD, autism, epilepsy, and Parkinson's disease. Cortical thickness, subcortical volume, and surface area via the ENIGMA Toolbox.
cBioPortal: Cancer genomics from the TCGA Pan-Cancer Atlas (32 tumor types, ~10K samples). Mutation frequencies, hotspot protein changes, and mutation type breakdowns by cancer.
DepMap: Cancer Dependency Map target-vulnerability resource. Public releases expose CRISPR/RNAi dependency fractions, pan-dependency/selectivity metrics, and predictive biomarkers for target prioritization.
BioGRID ORCS: Open Repository of CRISPR Screens from BioGRID. Useful for published screen-level hit status, phenotype labels, cell-line context, methodologies, and score summaries that complement release-level dependency resources.
CELLxGENE Discover / Census: Public single-cell dataset catalog and Census ecosystem from CZ CELLxGENE. Useful for discovering datasets by disease, tissue, assay, organism, and annotated cell types.
IEDB: Immune Epitope Database with experimentally characterized B-cell and T-cell epitopes, MHC binding data, and T-cell receptor sequences for immunology research.
LINCS L1000: Library of Integrated Network-based Cellular Signatures. Gene expression profiles measured after chemical and genetic perturbations — for drug repurposing and mechanism of action studies.
SureChEMBL: Chemical structures automatically extracted from patent literature. Search the patent landscape for compounds related to your target or chemical series.