The Data
The Dataset
Data FAQ
Where is our data from?
Our data comes from Semantic Scholar’s Open Corpus, an open-access database supported by the Allen Institute for AI. The Semantic Scholar corpus is an AI-powered database built on scientific literature: it takes research papers from a wide range of large scientific journals, cleans them, and aggregates relevant attributes of each paper into a semi-structured dataset. We use this data source because, in our assessment, it is the most reliable and up-to-date source available. Some of the sources Semantic Scholar pulls from are: WHO, PubMed, bioRxiv, medRxiv, and arXiv.
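For readers who want to explore the underlying source themselves, the short sketch below shows one way to request a single paper's metadata from the Semantic Scholar Graph API. The corpus ID and field list are illustrative assumptions, not values taken from our pipeline, and the available fields should be checked against the current API documentation.

# Minimal sketch: fetch one paper's metadata from the Semantic Scholar Graph API.
# The corpus ID and requested fields below are hypothetical examples.
import requests

PAPER_ID = "CorpusId:215416146"  # placeholder identifier
FIELDS = "title,abstract,authors,year,externalIds,journal"

resp = requests.get(
    f"https://api.semanticscholar.org/graph/v1/paper/{PAPER_ID}",
    params={"fields": FIELDS},
    timeout=30,
)
resp.raise_for_status()
paper = resp.json()
print(paper["title"], paper.get("year"))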
What scientific journals are included in the app?
We provide data from roughly 12,000 scientific journals. The most notable include bioRxiv, the Proceedings of the National Academy of Sciences of the United States of America, the British Medical Journal, the Journal of Bacteriology, the Journal of Virology, the Journal of Neuroscience, the Journal of Clinical Microbiology, and the International Journal of Molecular Sciences.
What attributes do we provide for a given paper?
We provide a long list of attributes for a given corpus ID, collected from the paper itself as well as from external resources, so that users get the full scope of information on a given paper. Key attributes include: Paper ID (Corpus, Medline, PubMed), Title, Abstract, Full Text, Authors, Publication Date, References, Citations, Publication Journal, Affiliations, Field of Study Classification, and URL.
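To make the attribute list concrete, here is a sketch of how a single paper record could be represented. The field names mirror the attributes listed above but are assumptions for illustration, not the app's actual schema.

# Illustrative record structure for one paper; field names are hypothetical.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PaperRecord:
    corpus_id: int
    pubmed_id: Optional[str] = None
    medline_id: Optional[str] = None
    title: str = ""
    abstract: Optional[str] = None
    full_text: Optional[str] = None
    authors: list[str] = field(default_factory=list)
    publication_date: Optional[str] = None          # e.g. "2021-03-15"
    references: list[int] = field(default_factory=list)   # corpus IDs of cited papers
    citations: list[int] = field(default_factory=list)    # corpus IDs of citing papers
    journal: Optional[str] = None
    affiliations: list[str] = field(default_factory=list)
    fields_of_study: list[str] = field(default_factory=list)
    url: Optional[str] = None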
What subjects are included in the app?
Our database of papers is built around medical topics, so the broader subject areas we cover are Chemistry, Biology, Psychology, and Medicine.
The Data Pipeline
The data pipeline starts by importing paper attributes and full text from Semantic Scholar, both via their API and through an S3 file export. We then clean, process, and embed the data before loading it into our development database (PostgresML). From there, we vectorize, index, and organize the data further into various tables based on app use case. This cleaned and organized version of the data is then stored in our production database (PostgresML), which backs the app.
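As a rough illustration of the embed-and-store step, the sketch below inserts a cleaned title and abstract into a PostgresML database and generates the embedding in-database with the pgml extension. The connection string, table, columns, and embedding model name are all assumptions made for the example, not the pipeline's actual configuration.

# Minimal sketch of embedding and storing one cleaned record in PostgresML.
# DSN, schema, and model name are placeholders.
import psycopg2

conn = psycopg2.connect("postgresql://user:pass@localhost:5432/dev_db")  # placeholder DSN
cur = conn.cursor()

title = "Example paper title"
abstract = "Example abstract text after cleaning."

# pgml.embed() runs the embedding model inside the database; the model name
# here is an assumption for illustration.
cur.execute(
    """
    INSERT INTO papers (title, abstract, abstract_embedding)
    VALUES (%s, %s, pgml.embed('intfloat/e5-small', %s))
    """,
    (title, abstract, abstract),
)
conn.commit()
cur.close()
conn.close()

Running the embedding inside the database keeps the text and its vector in one place, which is one reason a setup like PostgresML can simplify the vectorize-and-index step described above.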