KenHBS/pdf_to_text

From PDF to text

pdf_to_csv.py takes a folder of PDFs and saves the text of each PDF as CSV (either separately, one CSV per PDF, or all together in a single CSV). LvB_PDF_2_CSV.R downloads the academic articles of the Humboldt University Statistics department, extracts their text, and saves it in a CSV document. LvB_PDF_2_CSV.R is very specific and unlikely to be of general use.

Usage

Prerequisite: a version of Python with PyPDF2 (version 1.5.3) installed.

$ python3 pdf_to_csv.py --help

Usage: pdf_to_csv.py [options]
Options:
  -h, --help  show this help message and exit
  -f FOLDER   absolute folder path with PDF files (required)
  -o OPT      0: create CSV for each PDF (default)
              1: generate a single CSV for the LDA
                    thesis data preparation (including JEL code and DOI extraction)
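
The two options above could be declared roughly as follows. This is a minimal sketch using the standard-library argparse module, not the script's actual parsing code; the function name build_parser and the destination names folder and opt are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the -f and -o options listed above (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="pdf_to_csv.py")
    # -f is required: the folder that holds the PDF files.
    parser.add_argument("-f", dest="folder", required=True,
                        help="absolute folder path with PDF files (required)")
    # -o selects the output mode; 0 (one CSV per PDF) is the default.
    parser.add_argument("-o", dest="opt", type=int, choices=[0, 1], default=0,
                        help="0: create CSV for each PDF (default); "
                             "1: single CSV for the LDA thesis data preparation")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.folder, args.opt)
```

Restricting -o to the choices 0 and 1 makes an unsupported value fail at parse time instead of being silently ignored.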

Example

If the PDF files are located at /Users/Ken/MyPDFs, then:

$ python3 pdf_to_csv.py -f /Users/Ken/MyPDFs
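
For readers who want to see what the default mode (one CSV per PDF) amounts to, here is a minimal sketch. It is not the code in pdf_to_csv.py: the function names extract_text and pdfs_to_csv, the two-column CSV layout, and the use of the newer PyPDF2 PdfReader API (PyPDF2 2.0 or later, not the version named under Usage) are all assumptions.

```python
import csv
from pathlib import Path


def extract_text(pdf_path):
    """Return the text of all pages of one PDF, joined by newlines.

    Assumes PyPDF2 >= 2.0 (PdfReader / extract_text); older releases
    use PdfFileReader / extractText instead.
    """
    from PyPDF2 import PdfReader  # lazy import: only needed for real PDFs

    reader = PdfReader(str(pdf_path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def pdfs_to_csv(folder, extractor=extract_text):
    """Write one CSV next to each PDF in `folder`; return the CSV paths."""
    written = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        csv_path = pdf_path.with_suffix(".csv")
        with open(csv_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            # Hypothetical layout: a header row, then one row per PDF.
            writer.writerow(["filename", "text"])
            writer.writerow([pdf_path.name, extractor(pdf_path)])
        written.append(csv_path)
    return written
```

Calling pdfs_to_csv("/Users/Ken/MyPDFs") would then write one .csv file beside each .pdf file in that folder. The extractor parameter exists only so the text extraction step can be swapped out, for example when PyPDF2 is not installed.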

Note

The option -o 1 is a special use case I needed for my thesis on Latent Dirichlet Allocation (LDA). It is probably of no use to anyone else.

About

Extracts text from PDF documents using PyPDF2
