ScholarVista analyses research papers and extracts/plots information about them. It uses Grobid to extract all the content of the research papers. Then all this data is plotted and displayed using Python.

mciccale/ScholarVista

ScholarVista

ScholarVista is a tool that extracts and plots information from a set of academic research papers in PDF or TEI XML format. To process PDFs, it uses Grobid to generate the TEI XML files; ScholarVista then extracts the relevant information from those files and generates the following outputs:

  1. Keyword Cloud for each paper's abstract and for all abstracts combined.
  2. Links List containing the links found in each paper.
  3. Figures Histogram comparing the number of figures per paper.
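
To make the three outputs concrete, here is a minimal sketch of how the underlying data could be pulled out of a TEI XML file with Python's standard library. It is not ScholarVista's actual implementation. The function name `summarize_tei` and the sample document are hypothetical, and the sketch assumes that Grobid marks tables as `figure` elements with `type="table"` and stores URLs in `target` attributes.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI XML documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily trimmed stand-in for a Grobid-generated TEI file.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <profileDesc>
      <abstract><p>Graph neural networks for citation analysis.</p></abstract>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Code: <ref type="url" target="https://example.org/code">link</ref></p>
      <figure xml:id="fig_0"><head>Figure 1</head></figure>
      <figure xml:id="fig_1"><head>Figure 2</head></figure>
      <figure type="table" xml:id="tab_0"><head>Table 1</head></figure>
    </body>
  </text>
</TEI>
"""


def summarize_tei(xml_text):
    """Return the abstract text, figure count and links of one TEI document."""
    root = ET.fromstring(xml_text.strip())

    # Abstract: join all nested text and normalise whitespace.
    node = root.find(".//tei:abstract", TEI_NS)
    abstract = " ".join("".join(node.itertext()).split()) if node is not None else ""

    # Figures: skip elements flagged as tables (assumption about Grobid output).
    figures = [
        fig for fig in root.findall(".//tei:figure", TEI_NS)
        if fig.get("type") != "table"
    ]

    # Links: any element whose target attribute looks like a web URL.
    links = [
        el.get("target") for el in root.iter()
        if el.get("target", "").startswith(("http://", "https://"))
    ]

    return {"abstract": abstract, "figure_count": len(figures), "links": links}
```

Calling `summarize_tei(SAMPLE_TEI)` yields the abstract text that would feed a keyword cloud, a figure count for the histogram, and the list of links.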

Requirements

Python >=3.12 is required to install the ScholarVista package. It is not required when using the Docker Image.

If you want to generate the results from a set of PDF academic papers, you must ensure that the Grobid Service is installed and running on your machine. See Grobid installation instructions here.

The most straightforward way of starting and running the Grobid Service is by running a Docker image. Make sure you have Docker installed on your system.

docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0

This command runs Grobid and exposes its web client on port 8070.
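
Before processing PDFs, it can help to confirm that the service is reachable. The sketch below probes Grobid's `/api/isalive` health endpoint using only the standard library. The helper names `isalive_url` and `grobid_is_alive` are hypothetical and not part of ScholarVista, and the default base URL assumes the port mapping from the command above.

```python
import urllib.request


def isalive_url(base_url):
    """Build the health-check URL for a Grobid base URL."""
    return base_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(base_url="http://localhost:8070", timeout=5.0):
    """Return True if the Grobid health endpoint answers with HTTP 200.

    Connection errors, timeouts and HTTP error statuses all count as
    "not alive" (urllib's URLError and HTTPError are OSError subclasses).
    """
    try:
        with urllib.request.urlopen(isalive_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```

Running `grobid_is_alive()` after starting the container should return `True` once Grobid has finished loading; until then it returns `False`.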

If you already have the TEI XML files generated from Grobid saved in a folder, you can directly generate the information from them.

Note: The TEI XML files MUST be obtained using Grobid, as this tool is intended to work only with Grobid generated TEI XML files.

Install ScholarVista

From Source

To install ScholarVista from source, clone the repository and install the package using pip. When using pip, it is good practice to use a virtual environment. Check out the official documentation on virtual environments here.

Conda

git clone https://github.com/mciccale/ScholarVista
cd ScholarVista
conda create -n scholarvista-env-3.12 python=3.12
conda activate scholarvista-env-3.12
pip install .

Note: You can use PyEnv to create a virtual environment. But since ScholarVista needs Python >=3.12, it is more suitable to use Conda, where you can select the Python version to use.

Docker Container

If you prefer running ScholarVista from a Docker Container, you can build the Docker Image with the following commands.

git clone https://github.com/mciccale/ScholarVista
cd ScholarVista
docker build -t scholarvista-app .

This will create an image called scholarvista-app.

Execution Instructions

From Source

CLI Tool

The most convenient way of using ScholarVista is by using its CLI.

The CLI Tool generates and saves to a directory a keyword cloud of each paper's abstract and a list of URLs for each PDF analyzed, together with a histogram comparing the number of figures in each PDF and a general keyword cloud of all abstracts.

Usage: scholarvista [OPTIONS] COMMAND [ARGS]...

  ScholarVista's CLI main entry point.

Options:
  --input-dir PATH   Directory containing PDF files.  [required]
  --output-dir PATH  Directory to save results. Defaults to current directory.
  --help             Show this message and exit.

Commands:
  process-pdfs  Process all PDFs in the given directory.
  process-xmls  Process all TEI XMLs in the given directory.

Example

  1. Start Grobid service using the container.
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
  2. Run ScholarVista's CLI to process all the PDFs in a given directory and leave the results in another directory.
# Process PDF files and save the results to a specified directory
scholarvista --input-dir ./pdfs --output-dir ./output process-pdfs
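
If you want to drive the CLI from a script, the following sketch assembles the same command line as an argument list, following the usage text above (options first, then the command). The helper `build_scholarvista_argv` is hypothetical and only wraps the documented CLI; it does not use ScholarVista's Python modules.

```python
import shlex

# The two commands listed in the CLI help text.
COMMANDS = ("process-pdfs", "process-xmls")


def build_scholarvista_argv(input_dir, output_dir=".", command="process-pdfs"):
    """Return the argument list for one ScholarVista CLI invocation."""
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command!r}")
    return [
        "scholarvista",
        "--input-dir", str(input_dir),
        "--output-dir", str(output_dir),
        command,
    ]


argv = build_scholarvista_argv("./pdfs", "./output")
# Shell-safe rendering of the argument list, handy for logging.
command_line = shlex.join(argv)
```

The list can be passed to `subprocess.run(argv, check=True)` once the package is installed and, for `process-pdfs`, the Grobid service is running.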

Python Modules

ScholarVista provides a set of classes and modules that let you use all of its functionality from your own Python code. For an example, see example.py

Docker Container

If you prefer running ScholarVista with Docker, you can make use of ScholarVista CLI directly from the Docker Image you created following these instructions.

  1. Start Grobid service using the container.
docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0
  2. Run ScholarVista's container with 2 mounted volumes for input and output directories and connected to the host network.
docker run -it --rm --network=host -v /path/to/input/dir:/input -v /path/to/output/dir:/output scholarvista-app

Note: By default, ScholarVista's Docker Image processes PDF files. You can override this by providing the process-xmls argument after the image name.

Example

Here's an example that uses the Docker Image to process a set of PDFs contained in the foo directory and leave the results in bar. It assumes the Grobid service is running at localhost:8070.

docker run -it --rm --network=host -v foo:/input -v bar:/output scholarvista-app process-pdfs

Docker Compose (Experimental)

You can try to run ScholarVista through Docker Compose. However, this feature is still in development and may not work as expected. ScholarVista tries to connect to Grobid before it has started, so it is restarted until the Grobid service is up and running. To try it, run:

sh-like shells

INPUT_DIR=/path/to/input/dir OUTPUT_DIR=/path/to/output/dir COMMAND='process-pdfs' docker-compose up

PowerShell

$env:INPUT_DIR="/path/to/input/dir"; $env:OUTPUT_DIR="/path/to/output/dir"; $env:COMMAND="process-pdfs"; docker-compose up

Note: The COMMAND variable can be either process-pdfs or process-xmls. INPUT_DIR and OUTPUT_DIR are the host machine directories that the input files are read from and the results are written to, respectively.
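
The restart behaviour described in this section amounts to retrying until Grobid answers. As an illustration only, the sketch below shows the same idea as an explicit polling loop with exponential backoff. The function `wait_until_ready` and its parameters are hypothetical and are not part of ScholarVista or its Compose setup; `check` is any zero-argument callable, for instance an HTTP probe of the Grobid service.

```python
import time


def wait_until_ready(check, attempts=10, delay=1.0, backoff=2.0,
                     max_delay=30.0, sleep=time.sleep):
    """Call check() until it returns a truthy value or attempts run out.

    The pause between attempts starts at `delay` seconds and is multiplied
    by `backoff` after each failure, capped at `max_delay`. The `sleep`
    parameter is injectable so the loop can be exercised without waiting.
    """
    for attempt in range(attempts):
        if check():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay)
            delay = min(delay * backoff, max_delay)
    return False
```

A script could call `wait_until_ready(probe)` before starting the processing step and exit with an error if it returns `False`, instead of relying on container restarts.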

License

Please refer to the LICENSE file.

Where to Get Help

For further assistance or to contribute to the project, please refer to the CONTRIBUTING.md file.
