Biomedical knowledge is scattered across dozens of siloed sources: literature indexes, curated knowledge bases, ontologies, trial registries, molecular databases, and dataset repositories. Each of these comes with its own schema, terminology, and interface. Answering a single research question - “Is this gene a viable drug target?” - often requires moving across these systems by hand, normalizing identifiers, comparing conflicting evidence, and tracing claims back to the original source.
AI Co-Scientist brings 50+ biomedical information resources into one conversational workflow. Ask a question in plain language and the agent plans a multi-step investigation, queries the most relevant sources, cross-checks the results, and synthesizes the findings into a cited research report with provenance.
Data sources
Organized from the broadest research questions to the most specialized analyses.
Literature & Clinical Evidence
Starting point for any biomedical question.
- PubMed
- Over 38 million biomedical abstracts and citations from MEDLINE, life science journals, and online books. The primary literature search engine for biomedicine.
- OpenAlex
- Broad scholarly knowledge graph covering 250M+ works including preprints, conference papers, and datasets. Useful for researcher discovery and citation analysis.
- Europe PMC
- European life-science literature platform with PubMed records, full-text links, preprints, citation counts, grants, and text-mined metadata. Useful when PubMed alone is too narrow.
- ClinicalTrials.gov
- Registry of 500K+ clinical studies worldwide. Search active, recruiting, and completed trials by condition, intervention, or sponsor.
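To make a literature lookup concrete, the sketch below builds a PubMed search URL for the NCBI E-utilities `esearch` endpoint. It assumes PubMed is reached through E-utilities, which this page does not state, and `pubmed_search_url` is a hypothetical helper name. The function only constructs the URL and makes no network call.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities base URL.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def pubmed_search_url(term: str, max_results: int = 20) -> str:
    """Build an esearch URL that returns PubMed IDs as JSON.

    Hypothetical helper for illustration; not the agent's actual code.
    """
    params = {
        "db": "pubmed",         # search the PubMed index
        "term": term,           # free-text or fielded query
        "retmode": "json",      # ask for a JSON response
        "retmax": max_results,  # cap the number of returned IDs
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(pubmed_search_url("BRCA1 breast cancer"))
```

Keeping URL construction separate from the HTTP call makes the query easy to log alongside the results, which helps when a claim has to be traced back to the search that produced it.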
Target-Disease Associations
Connecting genes and proteins to diseases with aggregated evidence.
- Open Targets Platform
- Integrates genetics, genomics, transcriptomics, drugs, and literature into scored target-disease associations. The go-to resource for target validation.
- GWAS Catalog
- Curated collection of published genome-wide association studies. Returns specific variants, p-values, odds ratios, and mapped genes for any trait or disease.
- CIViC
- Community-curated clinical interpretations of cancer variants. Expert-reviewed evidence linking specific mutations to diagnosis, prognosis, and treatment response.
- ClinGen
- Curated gene-disease validity and dosage sensitivity resource. Adds expert-panel classifications and dosage evidence beyond variant-level pathogenicity databases.
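Association results such as those described for the GWAS Catalog are typically filtered by p-value before interpretation. The sketch below keeps only hits below the conventional genome-wide significance threshold of 5e-8. The record shape (`variant`, `p_value`, `odds_ratio`, `mapped_gene`) is an assumption made for illustration, not the catalog's actual response schema, and the identifiers in the usage example are placeholders.

```python
from typing import Dict, Iterable, List

# Conventional genome-wide significance threshold.
GENOME_WIDE_SIGNIFICANCE = 5e-8


def significant_hits(
    associations: Iterable[Dict],
    threshold: float = GENOME_WIDE_SIGNIFICANCE,
) -> List[Dict]:
    """Return associations with p_value below the threshold, strongest first.

    The dictionary keys are a hypothetical record layout.
    """
    hits = [a for a in associations if a["p_value"] < threshold]
    return sorted(hits, key=lambda a: a["p_value"])


if __name__ == "__main__":
    # Placeholder records, not real catalog entries.
    records = [
        {"variant": "variant_A", "p_value": 3e-12, "odds_ratio": 1.4, "mapped_gene": "GENE_A"},
        {"variant": "variant_B", "p_value": 2e-5, "odds_ratio": 1.1, "mapped_gene": "GENE_B"},
        {"variant": "variant_C", "p_value": 1e-9, "odds_ratio": 0.8, "mapped_gene": "GENE_C"},
    ]
    for hit in significant_hits(records):
        print(hit["variant"], hit["p_value"])
```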
Drug Discovery & Safety
The therapeutic landscape — from compounds to post-marketing surveillance.
- PubChem
- Open chemistry database with 116M+ compounds. Molecular properties, SMILES structures, InChIKeys, drug-likeness descriptors (XLogP, H-bond donors/acceptors, polar surface area), synonyms, and compound descriptions.
- ChEMBL
- Bioactivity database with 2M+ compounds, binding affinities, functional assays, and ADMET properties. Essential for understanding the pharmacological landscape around a target.
- DGIdb
- Drug-Gene Interaction Database aggregating 40+ sources. Returns druggability categories (kinase, clinically actionable, etc.) and known drug interactions for any gene.
- Guide to Pharmacology
- Curated target-ligand pharmacology from IUPHAR/BPS. Useful for mechanism-of-action summaries, action types, affinity evidence, and a cleaner curated pharmacology layer than general chemistry databases alone.
- GDSC / CancerRxGene
- Genomics of Drug Sensitivity in Cancer: pharmacogenomics screens across hundreds of cancer cell lines. Useful for compound sensitivity patterns, tissue-specific response, and in vitro IC50/AUC context.
- PRISM Repurposing
- Broad Institute pooled-cell-line repurposing screen with single-dose log2-fold-change viability readouts across large cancer cell-line panels. Useful for broad viability patterns and fast repurposing-style response scans.
- PharmacoDB
- Cross-dataset pharmacogenomics portal harmonizing compound-response data from GDSC, PRISM, CTRPv2, and related public screens. Useful when response needs to be compared across multiple public drug-response resources in one layer.
- FDA FAERS
- Post-marketing adverse event reports from the FDA. Analyze safety signals, compare adverse event profiles across drugs, and identify drug class effects.
- RxNorm
- Standardized drug nomenclature from the NLM. Maps between brand names, generics, ingredients, and clinical drug forms.
- DailyMed
- Current FDA Structured Product Labels (SPLs) for US drugs. Useful for boxed warnings, indications, contraindications, and warnings/precautions straight from the label.
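The log2-fold-change viability readouts mentioned for PRISM Repurposing can be read as relative viability by exponentiating: a log2 fold change of -1 corresponds to half the control signal. The sketch below shows that conversion. The 0.5 viability cutoff and the cell-line names are illustrative assumptions, not thresholds used by PRISM or by the agent.

```python
from typing import Dict, List


def lfc_to_relative_viability(log2_fold_change: float) -> float:
    """Convert a log2 fold change (treated vs. control) to relative viability."""
    return 2.0 ** log2_fold_change


def sensitive_lines(lfc_by_cell_line: Dict[str, float], max_viability: float = 0.5) -> List[str]:
    """Return cell lines whose relative viability is at or below the cutoff.

    The default cutoff of 0.5 is an arbitrary illustrative choice.
    """
    return sorted(
        name
        for name, lfc in lfc_by_cell_line.items()
        if lfc_to_relative_viability(lfc) <= max_viability
    )


if __name__ == "__main__":
    # Placeholder values, not real screen data.
    example = {"LINE_A": -2.0, "LINE_B": -0.1, "LINE_C": -1.0}
    print(sensitive_lines(example))
```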
Gene & Protein Biology
Understanding molecular function, expression, and interactions.
- UniProt
- Comprehensive protein knowledgebase with sequences, domains, post-translational modifications, subcellular localization, and functional annotations for the proteins it catalogs.
- GTEx
- Tissue-level gene expression from the Genotype-Tissue Expression project. Median TPM values across 54 human tissues from 948 donors — critical for target safety assessment.
- Human Protein Atlas
- Protein-level tissue specificity, single-cell specificity, protein class, and subcellular localization summaries for human genes. Useful for target validation beyond RNA-only evidence.
- Reactome
- Curated biological pathways and reactions. Understand the signaling cascades, metabolic pathways, and cellular processes a gene participates in.
- STRING
- Protein-protein interaction networks combining experimental data, text mining, and computational predictions. Identify interaction partners and functional modules.
- IntAct
- Curated experimental molecular interactions from EMBL-EBI. Useful when you want publication-backed interaction records and detection methods rather than integrated network predictions.
- BioGRID
- Large experimental interaction archive covering both physical and genetic interactions. Useful when you want broader publication-backed interaction coverage, throughput context, and partner evidence beyond a narrower curated interaction subset.
- Pathway Commons
- Integrated pathway and interaction resource aggregating multiple providers into a single queryable graph. Useful for widening pathway context beyond any one source.
Protein Structure
3D structural insights — predicted and experimental.
- AlphaFold
- AI-predicted protein structures from DeepMind covering 200M+ proteins. Returns pLDDT confidence scores and downloadable PDB/CIF structure files.
- RCSB PDB
- Experimentally determined structures from X-ray crystallography, cryo-EM, and NMR. Search by gene or UniProt ID to find resolved structures with bound ligands.
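pLDDT scores run from 0 to 100 and are commonly read in four bands: very high (above 90), confident (70 to 90), low (50 to 70), and very low (below 50). The sketch below applies those bands to per-residue scores. How exact boundary values are assigned is a choice made here, and the function names are hypothetical.

```python
from typing import Iterable


def plddt_band(score: float) -> str:
    """Map a pLDDT score (0-100) to its commonly used confidence band.

    Boundary values (exactly 90, 70, 50) fall into the lower band here;
    that tie-breaking rule is an assumption of this sketch.
    """
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"pLDDT must be between 0 and 100, got {score}")
    if score > 90:
        return "very high"
    if score > 70:
        return "confident"
    if score > 50:
        return "low"
    return "very low"


def fraction_confident(scores: Iterable[float], cutoff: float = 70.0) -> float:
    """Fraction of residues with pLDDT strictly above the cutoff."""
    values = list(scores)
    if not values:
        raise ValueError("at least one pLDDT score is required")
    return sum(1 for s in values if s > cutoff) / len(values)


if __name__ == "__main__":
    per_residue = [95.0, 80.0, 60.0, 40.0]  # placeholder scores
    print([plddt_band(s) for s in per_residue])
    print(fraction_confident(per_residue))
```

A summary like `fraction_confident` is a quick way to decide whether a predicted model is worth using for a region of interest or whether an experimental structure from RCSB PDB should be preferred.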
Genomic Variation, Phenotypes & Ontologies
Population genetics, phenotype normalization, and rare-disease grounding.
- gnomAD
- Population variant frequencies from 76K+ genomes and 125K+ exomes. Essential for distinguishing rare pathogenic variants from common benign polymorphisms.
- 1000 Genomes
- Reference catalog of human genetic variation across 26 populations. Foundation for understanding population-level diversity and ancestry-specific variants.
- ClinVar
- Clinical significance classifications for human variants — pathogenic, benign, uncertain significance. Links variants to conditions with submitter-level evidence.
- Ensembl VEP
- Variant Effect Predictor returning functional consequence types, SIFT and PolyPhen deleteriousness scores, and AlphaMissense pathogenicity predictions.
- MyVariant.info
- Aggregated variant annotations pulling from ClinVar, CADD, dbSNP, gnomAD, and COSMIC in a single lookup. Quick comprehensive view of any variant.
- MyGene.info
- Fast gene identifier normalization service for symbols, aliases, Entrez IDs, Ensembl IDs, and UniProt IDs. Useful for joining evidence across heterogeneous APIs.
- RefSeq
- NCBI curated reference sequences for transcripts, non-coding RNAs, chromosomes, and proteins (NM/NR/NC/NG/NP accessions). Tools search nuccore and protein indices with refseq[filter] and return accession-level metadata plus links.
- UCSC Genome Browser
- Interactive reference genome assemblies (hg38, hg19, mouse, and more) with a public REST API for search, interval sequence, and track rows. Tools resolve gene symbols to coordinates, fetch DNA for a locus, and query named tracks within bounded windows.
- ENCODE
- Encyclopedia of DNA Elements: functional genomics metadata and files (ChIP-seq, DNase-seq, RNA-seq, and more). Tools query the ENCODE REST API for experiments and related objects, then fetch accession-level JSON with portal links.
- OxO
- Ontology cross-reference service from EMBL-EBI. Bridges MONDO, EFO, DOID, MeSH, OMIM, UMLS, and related identifier systems for safer cross-database joins.
- QuickGO
- Gene Ontology term search and annotation service. Supports GO term discovery plus reviewed GO annotations for gene products with evidence codes and references.
- Human Phenotype Ontology (HPO)
- Standard phenotype vocabulary for rare disease and clinical genomics. Useful for normalizing phenotype terms like ataxia, microcephaly, and seizures before cross-source joins.
- Orphanet / ORDO
- Reference rare-disease catalog and ontology with disease definitions, xrefs, inheritance, age of onset, phenotype annotations, and curated disease-gene links.
- Monarch Initiative
- Phenotype-centric knowledge graph spanning genes, diseases, phenotypes, and model-organism evidence. Useful for phenotype-to-gene and disease-to-phenotype association queries.
- Alliance Genome Resources
- Integrated cross-species knowledge platform spanning human and model-organism genomes. Useful for orthologs, disease and phenotype summaries, and model-organism disease evidence that complements human-only sources.
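Cross-source joins usually start by normalizing a gene symbol to stable identifiers. As a minimal sketch of that step, the helper below builds a query URL for the public MyGene.info v3 `query` endpoint, requesting Entrez, Ensembl, and UniProt identifiers for a symbol. The helper name and the choice of fields are assumptions for illustration; the code only builds the URL and does not call the service.

```python
from urllib.parse import urlencode

# Public MyGene.info v3 query endpoint.
MYGENE_QUERY = "https://mygene.info/v3/query"


def mygene_symbol_query_url(symbol: str, species: str = "human") -> str:
    """Build a MyGene.info query URL that resolves a gene symbol to IDs.

    Hypothetical helper; the requested fields are an illustrative choice.
    """
    params = {
        "q": f"symbol:{symbol}",  # fielded query on the official symbol
        "species": species,       # restrict matches to one species
        "fields": "symbol,entrezgene,ensembl.gene,uniprot.Swiss-Prot",
        "size": 1,                # keep only the top hit
    }
    return f"{MYGENE_QUERY}?{urlencode(params)}"


if __name__ == "__main__":
    print(mygene_symbol_query_url("TP53"))
```

Resolving identifiers once, up front, means every later lookup can use the ID each source expects instead of relying on symbol matching, which is prone to alias collisions.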
Specialized Domains
Advanced and niche applications.
- Allen Brain Atlas
- Reference neuroanatomy and gene expression atlases, including region-level structure ontology and in situ hybridization expression profiles for mouse brain. Supports differential expression and structure-focused queries.
- EBRAINS Knowledge Graph
- Curated neuroscience knowledge graph spanning datasets, models, software, workflows, and contributors. Useful for discovering reusable brain research assets with rich metadata and provenance.
- Zenodo
- General-purpose open repository for datasets, software, posters, and publications with DOIs (often 10.5281/zenodo.*). Tools query the public JSON API for discovery and retrieve record metadata with file links.
- CONP Datasets
- Canadian Open Neuroscience Platform datasets discoverable through the CONP ecosystem and the `conpdatasets` public catalog. Useful for finding reusable neuroscience repositories and linking to dataset documentation and terms.
- Neurobagel
- Federated cohort discovery ecosystem with harmonized phenotype and imaging metadata. The public node API enables filtered cohort queries across indexed datasets without requiring direct data download.
- OpenNeuro
- Primary open platform for sharing neuroimaging data in BIDS format. Search datasets by modality (MRI, MEG, EEG, PET) and retrieve metadata, DOIs, and snapshot information.
- DANDI Archive
- BRAIN Initiative archive for cellular neurophysiology: electrophysiology, calcium imaging, behavioral time-series, immunostaining. NWB/BIDS format with searchable metadata by keyword.
- NEMAR
- NeuroElectroMagnetic data Archive for EEG, MEG, and iEEG from OpenNeuro. BIDS format, HED event descriptions, NSG compute integration. Hosted at SDSC.
- Brain-CODE
- Ontario Brain Institute neuroinformatics platform: clinical, MRI, EEG, genomic data for epilepsy, depression, neurodegenerative disease, cerebral palsy, concussion. Public and controlled releases via braincode.ca and CONP.
- ENIGMA Consortium
- Imaging genetics meta-analysis: 100+ case-control summary statistics for schizophrenia, depression, ADHD, bipolar, OCD, autism, epilepsy, Parkinson's. Cortical thickness, subcortical volume, surface area via ENIGMA Toolbox.
- cBioPortal
- Cancer genomics from the TCGA Pan-Cancer Atlas (32 tumor types, ~10K samples). Mutation frequencies, hotspot protein changes, and mutation type breakdowns by cancer.
- DepMap
- Cancer Dependency Map target-vulnerability resource. Public releases expose CRISPR/RNAi dependency fractions, pan-dependency/selectivity metrics, and predictive biomarkers for target prioritization.
- BioGRID ORCS
- Open Repository of CRISPR Screens from BioGRID. Useful for published screen-level hit status, phenotype labels, cell-line context, methodologies, and score summaries that complement release-level dependency resources.
- CELLxGENE Discover / Census
- Public single-cell dataset catalog and Census ecosystem from CZ CELLxGENE. Useful for discovering datasets by disease, tissue, assay, organism, and annotated cell types.
- IEDB
- Immune Epitope Database with experimentally characterized B-cell and T-cell epitopes, MHC binding data, and T-cell receptor sequences for immunology research.
- LINCS L1000
- Library of Integrated Network-based Cellular Signatures. Gene expression profiles measured after chemical and genetic perturbations — for drug repurposing and mechanism of action studies.
- SureChEMBL
- Chemical structures automatically extracted from patent literature. Search the patent landscape for compounds related to your target or chemical series.
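Repository identifiers often need light parsing before a record can be fetched. As one example, Zenodo DOIs of the form `10.5281/zenodo.<number>` carry the record number, which can be turned into a record URL for the public JSON API. The sketch below assumes that DOI pattern only (other DOI forms return `None`); the function names are hypothetical and the DOI in the usage example is a placeholder.

```python
import re
from typing import Optional

# Matches DOIs such as 10.5281/zenodo.1234567 and captures the record number.
ZENODO_DOI = re.compile(r"^10\.5281/zenodo\.(\d+)$")


def zenodo_record_id(doi: str) -> Optional[int]:
    """Extract the numeric record ID from a Zenodo-style DOI, or None."""
    match = ZENODO_DOI.match(doi.strip())
    return int(match.group(1)) if match else None


def zenodo_api_url(doi: str) -> Optional[str]:
    """Build the JSON API URL for a Zenodo record, or None if not a Zenodo DOI."""
    record_id = zenodo_record_id(doi)
    if record_id is None:
        return None
    return f"https://zenodo.org/api/records/{record_id}"


if __name__ == "__main__":
    print(zenodo_api_url("10.5281/zenodo.1234567"))  # placeholder DOI
```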