zudefoque/extrair-urls-pdf

lemonpdf

Python 3 library for extracting URLs from PDF files.

Install

sudo apt install tesseract-ocr poppler-utils
pip install lemonpdf
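
Before running the library, it can help to confirm that the system packages above are actually installed. The sketch below assumes that `tesseract-ocr` provides the `tesseract` executable and `poppler-utils` provides `pdftoppm`; which of these lemonpdf calls internally is not stated here, so treat the tool list as an assumption. The helper name `missing_tools` is hypothetical.

```python
import shutil

# Executables the apt packages above are expected to provide (assumption):
# tesseract-ocr -> `tesseract`, poppler-utils -> `pdftoppm`.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of executables that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("All system tools found.")
```

If anything is reported missing, re-running the `apt install` line above is the likely fix on Debian or Ubuntu.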

Quickstart

Command-line interface (CLI)

Get URLs

lemonpdf -u file.pdf

Save the URL list to a text file

lemonpdf -u file.pdf -o urls.txt -s

Get domains

lemonpdf -d file.pdf

Save the domains to a text file

lemonpdf -d file.pdf -o domains.txt -s
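
The CLI calls above can also be driven from another Python program. This is a minimal sketch that uses only the flags shown in this README (`-u`, `-d`, `-o`, `-s`); the helper names `build_command` and `run_lemonpdf` are hypothetical, and the format of the command's standard output is not documented here, so it is returned as raw text.

```python
import subprocess


def build_command(pdf_path, mode="urls", output_path=None):
    """Build the lemonpdf argument list from the flags shown in the README."""
    flag = {"urls": "-u", "domains": "-d"}[mode]
    cmd = ["lemonpdf", flag, pdf_path]
    if output_path is not None:
        # -o names the output file and -s asks the tool to save to it.
        cmd += ["-o", output_path, "-s"]
    return cmd


def run_lemonpdf(pdf_path, mode="urls", output_path=None):
    """Run the CLI and return whatever it prints, unparsed."""
    result = subprocess.run(
        build_command(pdf_path, mode, output_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

For example, `run_lemonpdf("file.pdf", mode="domains", output_path="domains.txt")` runs the same command as the last CLI example.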

Scripts

Get URLs and save them to a text file

from lemonpdf import Extractor

pdf_path = 'file.pdf'
output_txt_path = 'out_file.txt'

extractor = Extractor(pdf_path=pdf_path, output_txt_path=output_txt_path)

urls = extractor.extract_urls(save=True)

print(urls)
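
The same link often appears several times in a PDF, so the result may need tidying before use. The sketch below assumes `extract_urls` returns an iterable of strings, which the README does not confirm; `unique_urls` is a hypothetical helper, not part of lemonpdf.

```python
def unique_urls(urls):
    """Strip whitespace, drop empty entries, and remove duplicates in order."""
    seen = set()
    result = []
    for url in urls:
        cleaned = url.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


if __name__ == "__main__":
    # Stand-in data; in practice this would be the value of `urls` above.
    sample = [" https://example.com ", "https://example.com", "", "https://example.org"]
    print(unique_urls(sample))
```

Keeping the first occurrence preserves the order in which links appear in the document.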

Get domains and save them to a text file

from lemonpdf import Extractor

pdf_path = 'file.pdf'
output_txt_path = 'domains.txt'

extractor = Extractor(pdf_path=pdf_path, output_txt_path=output_txt_path)

domains = extractor.extract_domains(save=True)

print(domains)
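
To illustrate how a domain list relates to a URL list, here is a minimal sketch built on the standard library's `urllib.parse`. It is not lemonpdf's implementation, which is not shown in this README; `domains_from_urls` is a hypothetical helper.

```python
from urllib.parse import urlparse


def domains_from_urls(urls):
    """Return the unique host names found in `urls`, in first-seen order."""
    domains = []
    for url in urls:
        # urlparse only fills in the host when a scheme or a leading "//"
        # is present, so bare entries like "www.example.com/x" get one added.
        parsed = urlparse(url if "://" in url else "//" + url)
        host = parsed.hostname
        if host and host not in domains:
            domains.append(host)
    return domains


if __name__ == "__main__":
    print(domains_from_urls(["https://example.com/a", "www.python.org/doc"]))
```

Note that `hostname` lower-cases the host and drops any port, so `HTTP://Example.com:8080/` yields `example.com`.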
