To form novel hypotheses, each day we query the D2H2 gene sets, created from the collection of bulk RNA-seq studies, against >600,000 gene sets extracted from the supporting materials of research publications listed in PubMed Central (see rummagene.com). We then identify studies whose gene sets overlap strongly with a D2H2 gene set but whose abstracts have low similarity to the abstract of the D2H2 study. The two dissimilar abstracts are passed to GPT-4 for hypothesis generation. The LLM attempts to reason about possible explanations for why studies with such dissimilar abstracts have such similar gene sets.
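The selection step described above (high gene set overlap, low abstract similarity) can be illustrated with a minimal Python sketch. It is not the actual D2H2 implementation: the Jaccard index as the overlap measure, the threshold values, the function names, and the candidate keys `id`, `genes`, and `abstract_sim` are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard index of two gene sets (an assumed overlap measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def surprising_matches(query_genes, candidates,
                       min_overlap=0.3, max_abstract_sim=0.2):
    """Keep candidates whose gene sets overlap the query strongly while
    their abstracts are dissimilar.

    Each candidate is a dict with hypothetical keys:
      'id'           - identifier of the published gene set
      'genes'        - iterable of gene symbols
      'abstract_sim' - precomputed abstract similarity in [0, 1]
    Both thresholds are illustrative assumptions, not values from the source.
    """
    hits = []
    for c in candidates:
        overlap = jaccard(query_genes, c["genes"])
        if overlap >= min_overlap and c["abstract_sim"] <= max_abstract_sim:
            hits.append((c["id"], overlap, c["abstract_sim"]))
    # Strongest gene set overlap first
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

Each returned tuple holds the gene set identifier, its overlap with the query, and the abstract similarity, so the pair of abstracts behind the top hit could then be handed to the language model.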
Today's Hypothesis
File formats accepted: csv, tsv, or txt file with one Entrez gene symbol per line
Perform enrichment analysis on hundreds of thousands of gene sets extracted from the supporting tables of over one hundred thousand articles to find the gene sets most similar to your query. The top 100 are returned, ranked by the cosine similarity between your entered abstract and the abstracts associated with the enriched gene sets.
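As a rough illustration of the ranking step, the Python sketch below scores each enriched gene set by the cosine similarity between the query abstract and the abstract associated with that gene set, then keeps the top results. The text does not say how abstracts are turned into vectors, so the bag-of-words representation here is a simple stand-in, and the function names and the `(set_id, abstract)` input shape are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a stand-in for whatever text
    representation the service actually uses (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_abstract(query_abstract, enriched, top_n=100):
    """Rank enriched gene sets by abstract similarity to the query.

    `enriched` is a list of (set_id, abstract) pairs (hypothetical shape).
    Returns up to `top_n` (set_id, score) pairs, most similar first.
    """
    q = bag_of_words(query_abstract)
    scored = [(set_id, cosine_similarity(q, bag_of_words(abstract)))
              for set_id, abstract in enriched]
    scored.sort(key=lambda s: s[1], reverse=True)
    return scored[:top_n]
```

The default `top_n=100` mirrors the number of results mentioned above; a smaller value can be passed when only the closest matches are of interest.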