Scholarly is a web application that assists researchers in conducting literature reviews by providing detailed summaries of papers relevant to their search queries. It's intended to demonstrate the use of Apache Spark, a big data tool that processes large datasets in parallel by distributing tasks across a cluster of computers. In this way, Spark lets us summarize papers more quickly than a single computer could.
- Make sure to have Python 3.10 and Docker installed on your local machine
- Clone the repository: `git clone https://github.com/allan-jt/Scholarly.git`
- Go into the `src` directory: `cd Scholarly/src`
- Create an `.env` file following the structure of `.env.example` and insert your OpenAI API Key
- Go into the root directory (`cd Scholarly`)
- Enter `make all` in the terminal to start the project
- Access the web app at `http://localhost:5173/`
Note: The `.env.example` configures Spark to simulate a cluster of computers using the local machine's processes, since a real cluster wasn't available to us. Consequently, summarizing a paper is relatively slow.
A bottleneck in research is the literature review, as it requires sifting through numerous papers to understand their key findings and results. This time-consuming process is necessary for formulating hypotheses and designing experiments. Scholarly aims to streamline this process by providing researchers with concise yet detailed summaries of relevant papers. To achieve this, we implemented the following features:
- Advanced Search: Narrow results with filters such as publication date, sorting order, and abstract keywords
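These filters map naturally onto the parameters of arXiv's public query API, which the backend calls to retrieve papers. As a hedged illustration, the request below is ours, not necessarily the query Scholarly builds:

```python
import requests

# Abstract keywords plus a submitted-date range, in arXiv's query syntax;
# sortBy accepts "relevance", "lastUpdatedDate", or "submittedDate".
params = {
    "search_query": 'abs:"graph neural networks" AND submittedDate:[202301010000 TO 202412312359]',
    "sortBy": "submittedDate",
    "sortOrder": "descending",
    "start": 0,
    "max_results": 10,
}
response = requests.get("http://export.arxiv.org/api/query", params=params)
print(response.text)  # an Atom XML feed of matching papers
```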
- React (with Vite): A frontend framework using TypeScript
- FastAPI: A Python-based backend framework designed for high performance
- Redis: A caching layer for storing and quickly retrieving paper summaries
- Docker: Used for containerizing the frontend, backend, and database to ensure consistency across environments
- arXiv: An external repository used to retrieve research papers via API calls
- PySpark: The Python API for Apache Spark, used for distributed data processing
- Unstructured.io: A Python library for chunking research papers by section to facilitate efficient processing
- LangChain (with GPT-4o mini): A framework we use with OpenAI's gpt-4o-mini model to summarize individual chunks of research papers
A typical research paper, averaging around 10,000 tokens, often exceeds the context windows of many LLMs (e.g., 2,048 tokens for GPT-3 and 8,192 tokens for GPT-4). Even when a context window is large enough to fit an entire paper, the resulting summary often lacks the depth and detail achievable by focusing on specific sections. To address these issues, we implemented a MapReduce-style process using Spark and LangChain: the paper is split into sections, the sections are summarized in parallel, and the results are collated into a comprehensive and detailed summary.
- Chunking: The selected paper is first chunked by section, using Unstructured.io to parse the PDF from arXiv. In a minority of cases this fails, owing to the difficulty of parsing PDFs; the paper is then chunked into blocks of 10,000 characters instead (see the chunking sketch after this list).
- RDD Creation: The resulting array of chunks is converted into a Resilient Distributed Dataset (RDD), PySpark’s core data structure for distributed computing.
- MapReduce Architecture: Spark consists of a master node and worker nodes: the master allocates tasks to the workers (map) for independent processing, and once the workers finish, it aggregates their results (reduce) for final processing. Worker nodes are typically separate machines in a cluster, but for demonstration purposes, and because we had limited access to real hardware, the worker nodes here are simulated using the local machine's processes.
- Partitioning the Chunks (map): The chunks from the RDD are grouped into a smaller number of partitions, each of which is assigned to a worker node.
- Processing on Worker Nodes: Each worker node processes the chunks in its partition sequentially. A LangChain summarization pipeline, instantiated on each node, generates the summary for each chunk (see the pipeline sketch after this list).
- Handling Large Sections: If a chunked section is too large for the context window, it is further divided into sub-chunks. These sub-chunks are summarized in parallel within the same worker node using LangChain's threading capabilities, ensuring that even the largest sections are handled effectively (see the sub-chunking sketch after this list).
- Collecting and Combining Summaries (reduce): Once all the worker nodes complete their assigned tasks, the generated summaries are collected by the master node. These summaries are then combined into a final comprehensive summary, which is delivered to the user.
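To make the chunking step concrete, here is a minimal sketch built on Unstructured.io's `partition_pdf` and `chunk_by_title`. The function name, the error handling, and the use of pypdf to extract fallback text are our assumptions rather than Scholarly's actual code:

```python
from pypdf import PdfReader
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

FALLBACK_BLOCK_SIZE = 10_000  # characters, matching the fallback described above

def chunk_paper(pdf_path: str) -> list[str]:
    """Split a paper into section-level chunks, falling back to fixed-size blocks."""
    try:
        elements = partition_pdf(filename=pdf_path)  # parse the PDF into elements
        sections = chunk_by_title(elements)          # group elements under section titles
        return [section.text for section in sections]
    except Exception:
        # Section-aware parsing failed: extract plain text (here with pypdf, an
        # assumed substitute) and cut it into 10,000-character blocks instead.
        reader = PdfReader(pdf_path)
        text = "".join(page.extract_text() or "" for page in reader.pages)
        return [text[i:i + FALLBACK_BLOCK_SIZE]
                for i in range(0, len(text), FALLBACK_BLOCK_SIZE)]
```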
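The RDD creation, partitioning, and map/reduce stages might then look roughly like the following PySpark sketch. The partition count, prompt wording, and final combining step are assumptions; note how `local[*]` stands in for a real cluster by using the local machine's processes, as described in the note under the setup instructions:

```python
from pyspark.sql import SparkSession
from langchain_openai import ChatOpenAI  # assumes OPENAI_API_KEY is set in the environment

NUM_PARTITIONS = 4  # assumed number of simulated workers

def summarize_partition(chunks):
    """Runs on a worker node: summarize the chunks in one partition sequentially."""
    # The LangChain model is instantiated inside the partition so each worker
    # builds its own client instead of serializing one from the driver.
    llm = ChatOpenAI(model="gpt-4o-mini")
    for chunk in chunks:
        message = llm.invoke(f"Summarize this section of a research paper:\n\n{chunk}")
        yield message.content

spark = SparkSession.builder.master("local[*]").appName("scholarly-demo").getOrCreate()

chunks = chunk_paper("paper.pdf")  # from the chunking sketch above
rdd = spark.sparkContext.parallelize(chunks, NUM_PARTITIONS)  # RDD creation + partitioning (map)
section_summaries = rdd.mapPartitions(summarize_partition).collect()  # gather on the master (reduce)

# Combine the per-section summaries; Scholarly may instead run one more
# LLM pass here to produce a single cohesive summary.
final_summary = "\n\n".join(section_summaries)
print(final_summary)
```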
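For the large-section fallback, one way to realize threaded sub-chunk summarization is LangChain's `.batch()`, which dispatches a list of calls concurrently on a thread pool. The character threshold and naive splitting below are our assumptions:

```python
from langchain_openai import ChatOpenAI

MAX_CHARS = 8_000  # assumed per-call limit chosen to fit the model's context window

def summarize_section(llm: ChatOpenAI, section: str) -> str:
    """Summarize one section, splitting into threaded sub-chunk calls if it is too large."""
    if len(section) <= MAX_CHARS:
        return llm.invoke(f"Summarize:\n\n{section}").content
    # Too large for one call: split into sub-chunks and summarize them in parallel.
    sub_chunks = [section[i:i + MAX_CHARS] for i in range(0, len(section), MAX_CHARS)]
    prompts = [f"Summarize:\n\n{chunk}" for chunk in sub_chunks]
    # .batch() runs the calls concurrently on a thread pool within the same worker.
    sub_summaries = [message.content for message in llm.batch(prompts)]
    return " ".join(sub_summaries)
```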
- Timothy Cao
- Chase Huang
- Ahhyun Moon
- Allan Thekkepeedika