Gh214 add table detection model #34

Open
wants to merge 10 commits into
base: 330-release-candidate
from
4 changes: 2 additions & 2 deletions README.md
@@ -44,7 +44,7 @@ set JSL_OCR_LICENSE=license_key
* Select `SparkOcrSimpleExample.ipynb` notebook.
* Set `secret` and `license` variables to valid values in first cell.
* Run all cells: Runtime -> Run all.
- * Restart runtime: Runtime -> Resturt runtime (Need restart first time after installing new packages).
+ * Restart runtime: Runtime -> Restart runtime (Need restart first time after installing new packages).
* Run all cellls again.

### Run notebooks locally using jupyter
@@ -66,5 +66,5 @@ jupyter-notebook
* Open `jupyter/SparkOcrSimpleExample.ipynb` notebook.
* Set `secret` and `license` variables to valid values in first cell.
* Run all cells: Cell -> Run all.
- * Restart runtime: Kernel -> Resturt (Need restart first time after installing new packages).
+ * Restart runtime: Kernel -> Restart (Need restart first time after installing new packages).
* Run all cellls again.
262 changes: 1 addition & 261 deletions databricks/python/SparkOcrPdfProcessing.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion databricks/python/SparkOcrSimpleExample.ipynb

Large diffs are not rendered by default.

25 changes: 18 additions & 7 deletions databricks/scala/SparkOcrSimpleExample.scala
@@ -30,15 +30,26 @@ def pipeline() = {
val binaryToImage = new BinaryToImage()
.setInputCol("content")
.setOutputCol("image")

val transformer = new GPUImageTransformer()
.addHuangTransform()
.addScalingTransform(2)
.addDilateTransform(2,2)
.addErodeTransform(2,2)
.setInputCol("image")
.setOutputCol("transformed_image")

// Run OCR
val ocr = new ImageToText()
- .setInputCol("image")
+ .setInputCol("transformed_image")
.setOutputCol("text")
.setConfidenceThreshold(65)

.setModelType("best")
.setLanguage("eng")

new Pipeline().setStages(Array(
binaryToImage,
transformer,
ocr
))
}
@@ -50,25 +61,25 @@ def pipeline() = {
// COMMAND ----------

// MAGIC %sh
- // MAGIC OCR_DIR=/dbfs/tmp/ocr
+ // MAGIC OCR_DIR=/dbfs/tmp/ocr_1
// MAGIC if [ ! -d "$OCR_DIR" ]; then
// MAGIC mkdir $OCR_DIR
// MAGIC cd $OCR_DIR
- // MAGIC wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/ocr/datasets/images.zip
- // MAGIC unzip images.zip
+ // MAGIC wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/ocr/datasets/news.2B.0.png.zip
+ // MAGIC unzip news.2B.0.png.zip
// MAGIC fi

// COMMAND ----------

- display(dbutils.fs.ls("dbfs:/tmp/ocr/images/"))
+ display(dbutils.fs.ls("dbfs:/tmp/ocr_1/0/"))

// COMMAND ----------

// MAGIC %md ## Read images as binary files from DBFS

// COMMAND ----------

- val imagesPath = "/tmp/ocr/images/*.tif"
+ val imagesPath = "/tmp/ocr_1/0/*.png"
val imagesExampleDf = spark.read.format("binaryFile").load(imagesPath).cache()
display(imagesExampleDf)

411 changes: 411 additions & 0 deletions jupyter/SparkOcrImageTableCellRecognition.ipynb

Large diffs are not rendered by default.

375 changes: 375 additions & 0 deletions jupyter/SparkOcrImageTableDetection.ipynb

Large diffs are not rendered by default.

598 changes: 598 additions & 0 deletions jupyter/SparkOcrImageTableRecognition.ipynb

Large diffs are not rendered by default.

Binary file added jupyter/data/tab_images/cTDaR_t10011.jpg
Binary file added jupyter/data/tab_images/cTDaR_t10168.jpg