Our Models

We provide three main applications that users can take advantage of during their research: similarity search, summarization, and Q&A. Our models build on the LLM architecture best suited to each use case and are then fine-tuned on the papers within our corpus. We provide a brief summary of the models here, as well as an in-depth explanation of each model in the pages below.

Similarity Search

The similarity model gives the user the five most similar papers based on a text search or a selected paper. This list of topically similar papers allows users to quickly and efficiently expand their research on a topic in seconds, without relying on subject matter experts.

Summarization

The summarization model gives the user a summary of the abstract or full text of their selected paper. Summaries allow the user to speed-read through a paper and quickly decide whether it is relevant to their research.

Q & A

The Q&A model gives the user relevant answers to their questions based on a selected paper. This model takes in the paper as well as the five most similar papers as context to answer the question, allowing users to get answers grounded in this research domain.

Similarity 

Our similarity model is built on the Specter 1 embeddings model. The similarity algorithm takes a paper’s CorpusID and returns the top five most similar papers from our database by minimizing the distance between the embedded vector of the input paper and the embedded vectors of the papers in our database. From there, we re-rank the top 1,000 most similar papers using a trigram similarity model, re-weighting the algorithm’s output to emphasize a paper’s unique terms when selecting the most similar papers. Combining Specter embeddings with trigrams is a flexible method that surfaces the most relevant papers based on the specific details (methods, diseases, assays, etc.) of a paper.
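As an illustration of this two-stage retrieval, the sketch below uses the publicly available SPECTER checkpoint and a character-trigram TF-IDF re-ranker. The model name, stand-in data, candidate cutoff, and the use of plain cosine similarity are assumptions for the example rather than our exact production code.

```python
# A minimal sketch of two-stage similarity search: SPECTER embeddings for
# candidate retrieval, then a character-trigram re-rank that rewards overlap
# on a paper's specific terms (methods, diseases, assays, ...).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

def embed(texts):
    # SPECTER embeds title + abstract; the [CLS] token is the paper vector.
    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    return model(**inputs).last_hidden_state[:, 0, :].detach().numpy()

corpus = ["Paper A title. Paper A abstract ...",
          "Paper B title. Paper B abstract ..."]      # stand-in for our database
query = "Selected paper title. Selected paper abstract ..."

# Stage 1: rank the corpus by embedding similarity, keep the top candidates.
emb_scores = cosine_similarity(embed([query]), embed(corpus))[0]
candidates = np.argsort(-emb_scores)[:1000]

# Stage 2: re-rank the candidates by character-trigram similarity.
tri = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
tri_matrix = tri.fit_transform([query] + [corpus[i] for i in candidates])
tri_scores = cosine_similarity(tri_matrix[0], tri_matrix[1:])[0]
top_5 = [candidates[i] for i in np.argsort(-tri_scores)[:5]]
```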

The Specter 1 embeddings model is built off an existing language model, SciBERT, and additionally trained on paper citations with the goal of adapting the output representations so that they are more similar for papers that share a citation link. The training objective is a triplet margin loss that pushes a paper’s embedding closer to a related (cited) paper than to an unrelated paper. Additional training that contrasts direct citations with secondary citations is also done to further enhance the model’s ability to distinguish papers.
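Following the SPECTER paper, this objective can be written as a triplet margin loss over a query paper P^Q, a related (cited) paper P^+, and an unrelated paper P^-, where d is the L2 distance between paper embeddings and m is a margin hyperparameter:

```latex
\mathcal{L} = \max\left( d\left(P^{Q}, P^{+}\right) - d\left(P^{Q}, P^{-}\right) + m,\; 0 \right)
```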

Q&A

Our Q&A model and abstract summarization model are built on T5. The T5 model (Text-To-Text Transfer Transformer) is a versatile language model built on the transformer architecture. Its text-to-text approach, where different NLP tasks are converted into a unified text generation problem, provides a flexible and efficient model for Q&A, where the user’s question can warrant different types of responses.

The T5 encoder converts and embeds text into numerical representations. These are then passed through a sequence of attention and normalization layers to extract the most relevant information for text prediction or generation. The decoder then generates the text response based on the specified task (in this case, answering a question or writing a summary).
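As a concrete illustration, the snippet below runs a question through a T5 checkpoint using the Hugging Face Transformers library. The checkpoint name, prompt format, and generation length are assumptions for the example; our deployed model is fine-tuned on our corpus and also takes the five similar papers as additional context.

```python
# A minimal sketch of T5-style question answering over a paper's text.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "t5-base"  # assumption: any T5 checkpoint can stand in here
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

context = "..."   # abstract or full text of the selected paper
question = "What assay was used to measure gene expression?"

# T5 casts every task as text-to-text: the task is encoded in the input prompt.
inputs = tokenizer(f"question: {question} context: {context}",
                   return_tensors="pt", truncation=True, max_length=512)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```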

Full Text & Abstract Summarization

Our text summarization model uses the LangChain framework. Longer text blocks often contain nuanced information, which LLMs typically have difficulty grasping and filtering for important details. In addition, LLMs have length limits on input and output text, which restrict the amount of information they can handle. The LangChain framework allows us to process large documents by “chaining” together LLM calls. First, the model breaks a document into multiple chunks and applies embeddings to each section. Then, it creates a semantic index across all the sections. This index is fed into the LLM (an OpenAI GPT model). By breaking a document into chunks, the model has pieces of text small enough to ingest efficiently, capturing the details within each section. By creating a semantic index over the chunks, the model can also consider context across the document, capturing its larger themes. We found this approach generates the most meaningful and concise summaries of full text.
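To illustrate the chunk-and-combine pattern, the sketch below splits a long document and summarizes it with a map-reduce chain over an OpenAI model, using the 2023-era LangChain API. The chunk sizes, model name, and chain type are assumptions for the example, not our exact production configuration.

```python
# A minimal sketch of chunking a long paper and summarizing it with LangChain.
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

full_text = "..."  # full text of the selected paper

# Break the document into overlapping chunks small enough for the LLM to ingest.
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
chunks = [Document(page_content=c) for c in splitter.split_text(full_text)]

# Map-reduce: summarize each chunk, then combine the chunk summaries into one.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
chain = load_summarize_chain(llm, chain_type="map_reduce")
print(chain.run(chunks))
```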

Model Evaluation

We evaluate our models on a mix of factors: fluency and coherence (ROUGE scores), precision (F1), and human-verified factual accuracy and judgements of relevance and meaningfulness.

ROUGE is a set of metrics mainly used for evaluating the quality of a summary, measuring how many of the n-grams in the reference summaries appear in the machine-generated summaries; higher ROUGE scores are better. In the case of evaluating our summaries, due to the high cost and time of producing human-written summaries of papers and abstracts, we compared both abstract summaries and full text summaries to the papers’ abstracts. While this approach is imperfect, it does allow us to benchmark our models in a low-cost and standardized way.
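For concreteness, the snippet below scores a generated summary against a paper’s abstract with the rouge_score package; the metric names and example strings are illustrative.

```python
# A minimal sketch of benchmarking a generated summary against a paper's abstract.
from rouge_score import rouge_scorer

reference_abstract = "..."   # the paper's published abstract (proxy reference)
generated_summary = "..."    # output of our summarization model

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference_abstract, generated_summary)
for name, s in scores.items():
    print(f"{name}: precision={s.precision:.3f} recall={s.recall:.3f} f1={s.fmeasure:.3f}")
```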

Similarity:

F1 score on a classification task + human verification. The Specter 1 model achieved a higher F1 score than a variety of language models on scientific classification and citation prediction tasks when evaluated on SCIDOCS (an evaluation benchmark for scientific documents).

Abstract & Full Text Summarization:

Abstract summarization: ROUGE scores + human verification. ROUGE scores are calculated between the summarized abstract and the original abstract.

Full text summarization: ROUGE scores + human verification. ROUGE scores are calculated between the summarized full text and the abstract.

While T5 full text summarizations scored higher than LangChain on ROUGE, we use LangChain summaries in our product because they scored higher on human verification and are faster to generate. In all cases, the ROUGE F1 scores are low. This is expected, as information is lost between an abstract and a summary of an abstract or full text article (low recall). We suspect that using human-generated summaries as the benchmark for model evaluation would increase the scores.

Q&A: Human verification only, due to the high cost of generating comparable human responses and the lack of a proxy reference for Q&A.
