Added ray version of the html2parquet transform #666

Merged
merged 22 commits into from
Oct 7, 2024
Changes from 20 commits
5 changes: 2 additions & 3 deletions transforms/language/html2parquet/Makefile
Original file line number Diff line number Diff line change
@@ -53,7 +53,7 @@ docker-save-image::
@# Help: Recursively make $@ in all subdirs
$(MAKE) RULE=$@ .recurse

.PHONY: workflow-venv
.PHONY: workflow-vent
Review comment (Collaborator):

There is no kfp support yet. Why is this in here?

workflow-venv:
if [ -e kfp_ray ]; then \
$(MAKE) -C kfp_ray workflow-venv; \
@@ -75,5 +75,4 @@ workflow-upload:
workflow-build:
if [ -e kfp_ray ]; then \
$(MAKE) -C kfp_ray workflow-build; \
fi

fi
42 changes: 42 additions & 0 deletions transforms/language/html2parquet/ray/Dockerfile
@@ -0,0 +1,42 @@
ARG BASE_IMAGE=docker.io/rayproject/ray:2.24.0-py310

FROM ${BASE_IMAGE}

RUN pip install --upgrade --no-cache-dir pip

# install pytest
RUN pip install --no-cache-dir pytest

# Copy and install data processing libraries
# These are expected to be placed in the docker context before this is run (see the make image).
COPY --chown=ray:users data-processing-lib-python/ data-processing-lib-python/
RUN cd data-processing-lib-python && pip install --no-cache-dir -e .
COPY --chown=ray:users data-processing-lib-ray/ data-processing-lib-ray/
RUN cd data-processing-lib-ray && pip install --no-cache-dir -e .
COPY --chown=ray:users python-transform/ python-transform/
RUN cd python-transform && pip install --no-cache-dir -e .

# END OF STEPS destined for a data-prep-kit base image

COPY --chown=ray:users pyproject.toml pyproject.toml
COPY --chown=ray:users requirements.txt requirements.txt
RUN pip install --no-cache-dir -e .

# copy transform main() entry point to the image
COPY --chown=ray:users src/html2parquet_transform_ray.py ./

# copy some of the samples in
COPY --chown=ray:users ./src/html2parquet_local_ray.py local/

# copy test
COPY test/ test/
COPY test-data/ test-data/

# Set environment
ENV PYTHONPATH=/home/ray

# Put these at the end since they seem to upset the docker cache.
ARG BUILD_DATE
ARG GIT_COMMIT
LABEL build-date=$BUILD_DATE
LABEL git-commit=$GIT_COMMIT
61 changes: 61 additions & 0 deletions transforms/language/html2parquet/ray/Makefile
@@ -0,0 +1,61 @@
# Define the root of the local git clone for the common rules to be able
# to know where they are running from.
REPOROOT=../../../..
# Include a library of common .transform.* targets which most
# transforms should be able to reuse. However, feel free
# to override/redefine the rules below.

# $(REPOROOT)/.make.versions file contains the versions

TRANSFORM_NAME= html2parquet

include $(REPOROOT)/transforms/.make.transforms

touma-I marked this conversation as resolved.
venv:: .transforms.ray-venv

test:: .transforms.ray-test

clean:: .transforms.clean

image:: .transforms.ray-image

test-src:: .transforms.test-src

setup:: .transforms.setup

build:: build-dist image

publish: publish-image

publish-image:: .transforms.publish-image-ray

# The distribution version is the same as the image version.
set-versions:
	$(MAKE) TRANSFORM_PYTHON_VERSION=$(HTML2PARQUET_PYTHON_VERSION) TOML_VERSION=$(HTML2PARQUET_PYTHON_VERSION) .transforms.set-versions

build-dist:: set-versions .defaults.build-dist

publish-dist:: .defaults.publish-dist

test-image:: .transforms.ray-test-image

run-cli-sample:
	$(MAKE) RUN_FILE=$(TRANSFORM_NAME)_transform_ray.py \
		RUN_ARGS=" \
		--run_locally True \
		--data_local_config \" { 'input_folder' : '../test-data/input', 'output_folder' : '../output' } \" \
		--data_files_to_use \"['.html','.zip']\" \
		" .transforms.run-src-file

run-local-sample: .transforms.run-local-ray-sample


#run-s3-ray-sample: .transforms.run-s3-ray-sample

minio-start: .minio-start

kind-load-image:: .transforms.kind-load-image
Review comment (Member):

Please update the following rules to match those in other transforms from the latest dev branch. For example,

Reply (Collaborator, Author):

The Makefile you are referring to is located in the ray folder, while the example provided is directly under the ededup folder. I checked the Makefile for the ededup transform within the ray folder, and it matches the same structure.


docker-load-image: .defaults.docker-load-image

docker-save-image: .defaults.docker-save-image
16 changes: 16 additions & 0 deletions transforms/language/html2parquet/ray/README.md
@@ -0,0 +1,16 @@
# html2parquet Ray Transform

This module implements the ray version of the [html2parquet transform](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/html2parquet/python/README.md).

HTML conversion uses [Trafilatura](https://trafilatura.readthedocs.io/en/latest/usage-python.html).

## Prometheus metrics

The transform produces the following metrics:

| metric name                       | Description                                                      |
|-----------------------------------|------------------------------------------------------------------|
| worker_html_doc_count             | Number of HTML documents converted by the worker                 |
| worker_html_pages_count           | Number of HTML pages converted by the worker                     |
| worker_html_page_avg_convert_time | Average time for converting a single HTML page on each worker    |
| worker_html_convert_time          | Time spent converting a single document                          |

Review comment (Member):

These are not part of the output parquet file. I think you need to tell the reader where these can be viewed. Also, please link from here to the python readme for annotations and configuration.

Reply (Collaborator, Author):

Actually, I'm not entirely sure where these metrics can be viewed. In the Ray implementation, I borrowed the pdf2parquet code, but I’m not very familiar with Prometheus metrics. These metrics are generated from the Html2ParquetRayTransform class, and they can be viewed through logging. After some investigation, it appears that you can access these metrics in the Prometheus dashboard under the relevant namespace for your application. However, my experience with this is limited. I would greatly appreciate your assistance in adding more detailed descriptions!

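
The metrics listed in this README are created through `ray.util.metrics`, and the discussion mentions viewing them via Prometheus. As a rough sketch of how one might pick these values out of a scraped payload in the Prometheus text exposition format, the snippet below uses only the standard library. The sample payload is made up for illustration, and the `ray_` name prefix and the helper name `extract_html_metrics` are assumptions, not output or API taken from this transform.

```python
from collections import defaultdict

# Metric names as listed in the README table.
HTML_METRICS = (
    "worker_html_doc_count",
    "worker_html_pages_count",
    "worker_html_page_avg_convert_time",
    "worker_html_convert_time",
)


def extract_html_metrics(exposition_text: str, prefix: str = "ray_") -> dict[str, list[float]]:
    """Collect sample values for the html2parquet metrics from exposition text.

    `prefix` is an assumption; adjust it to whatever your deployment prepends.
    """
    found: dict[str, list[float]] = defaultdict(list)
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like: name{label="x"} value [timestamp]
        if "{" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :]
        else:
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue  # malformed line without a value
        short_name = name.removeprefix(prefix)
        if short_name in HTML_METRICS:
            found[short_name].append(float(fields[0]))
    return dict(found)


# Made-up payload for illustration only (hypothetical label names and values).
SAMPLE = """\
# HELP ray_worker_html_doc_count Number of HTML documents converted by the worker
# TYPE ray_worker_html_doc_count counter
ray_worker_html_doc_count{WorkerId="w1"} 3.0
ray_worker_html_doc_count{WorkerId="w2"} 5.0
ray_worker_html_convert_time{WorkerId="w1"} 0.25
some_other_metric 12.5
"""

metrics = extract_html_metrics(SAMPLE)
print(metrics)
```

Each metric maps to a list because every worker reports its own sample; summing or averaging across workers is left to the caller.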
46 changes: 46 additions & 0 deletions transforms/language/html2parquet/ray/pyproject.toml
@@ -0,0 +1,46 @@
[project]
name = "dpk_html2parquet_transform_ray"
version = "0.2.2.dev0"
requires-python = ">=3.10"
description = "HTML2PARQUET Ray Transform"
license = {text = "Apache-2.0"}
readme = {file = "README.md", content-type = "text/markdown"}
authors = [
{ name = "Sungeun An", email = "[email protected]" },
{ name = "Syed Zawad", email = "[email protected]" },
]
dynamic = ["dependencies"]

[build-system]
requires = ["setuptools>=68.0.0", "wheel", "setuptools_scm[toml]>=7.1.0"]
build-backend = "setuptools.build_meta"

[tool.setuptools.dynamic]
dependencies = {file = ["requirements.txt"]}

[project.optional-dependencies]
dev = [
"twine",
"pytest>=7.3.2",
"pytest-dotenv>=0.5.2",
"pytest-env>=1.0.0",
"pre-commit>=3.3.2",
"pytest-cov>=4.1.0",
"pytest-mock>=3.10.0",
"moto==5.0.5",
"markupsafe==2.0.1",
]

[options]
package_dir = ["src","test"]

[options.packages.find]
where = ["src/"]

[tool.pytest.ini_options]
# Currently we use low coverage since we have to run tests separately (see makefile)
#addopts = "--cov --cov-report term-missing --cov-fail-under 25"
markers = ["unit: unit tests", "integration: integration tests"]

[tool.coverage.run]
include = ["src/*"]
3 changes: 3 additions & 0 deletions transforms/language/html2parquet/ray/requirements.txt
@@ -0,0 +1,3 @@
dpk-html2parquet-transform-python==0.2.2.dev0
data-prep-toolkit-ray==0.2.2.dev0
trafilatura==1.12.0
55 changes: 55 additions & 0 deletions transforms/language/html2parquet/ray/src/html2parquet_local_ray.py
@@ -0,0 +1,55 @@
# (C) Copyright IBM Corp. 2024.
# Licensed under the Apache License, Version 2.0 (the “License”);
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an “AS IS” BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################


import ast
import os
import sys

from data_processing.utils import ParamsUtils
from data_processing_ray.runtime.ray import RayTransformLauncher
from html2parquet_transform_ray import Html2ParquetRayTransformConfiguration


# create parameters
input_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "test-data", "input"))
output_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "output"))
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus": 0.8}
code_location = {"github": "github", "commit_hash": "12345", "path": "path"}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "data_files_to_use": ast.literal_eval("['.html','.zip']"),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": 3,
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    "runtime_creation_delay": 0,
    "runtime_code_location": ParamsUtils.convert_to_ast(code_location),
    # html2parquet params
}

html2parquet_params = {}
if __name__ == "__main__":
    # Set the simulated command line args
    sys.argv = ParamsUtils.dict_to_req(d=params)
    # create launcher
    launcher = RayTransformLauncher(Html2ParquetRayTransformConfiguration())
    # Launch the ray actor(s) to process the input
    launcher.launch()
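
The script above builds a parameter dictionary and overwrites `sys.argv` so that the launcher can parse it as if it came from the command line. `ParamsUtils` is not part of this diff, so the following is only a sketch of that idea using the standard library; `to_literal` and `dict_to_argv` are hypothetical stand-ins, and the real helpers may format arguments differently.

```python
import argparse
import ast


def to_literal(value: dict) -> str:
    """Render a dict as a Python-literal string (hypothetical stand-in for convert_to_ast)."""
    return repr(value)


def dict_to_argv(params: dict, program: str = "prog") -> list[str]:
    """Turn {'name': value} into ['prog', '--name=value', ...] (hypothetical stand-in)."""
    return [program] + [f"--{key}={value}" for key, value in params.items()]


params = {
    "run_locally": True,
    "data_local_config": to_literal({"input_folder": "in", "output_folder": "out"}),
    "data_files_to_use": [".html", ".zip"],
    "runtime_num_workers": 3,
}
argv = dict_to_argv(params)

# Parse the simulated command line the way a launcher might.
parser = argparse.ArgumentParser()
parser.add_argument("--run_locally", type=lambda s: s == "True")
parser.add_argument("--data_local_config", type=ast.literal_eval)
parser.add_argument("--data_files_to_use", type=ast.literal_eval)
parser.add_argument("--runtime_num_workers", type=int)
args = parser.parse_args(argv[1:])

print(args.data_local_config["input_folder"], args.data_files_to_use)
```

Passing structured values as Python-literal strings and reading them back with `ast.literal_eval` keeps the command line flat while still allowing dictionaries and lists, which is why the sample wraps its folder configuration before handing it to the launcher.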
57 changes: 57 additions & 0 deletions transforms/language/html2parquet/ray/src/html2parquet_s3_ray.py
@@ -0,0 +1,57 @@
# (C) Copyright IBM Corp. 2024.
# Licensed under the Apache License, Version 2.0 (the “License”);
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an “AS IS” BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################

import ast
import os
import sys

from data_processing.utils import ParamsUtils
from data_processing_ray.runtime.ray import RayTransformLauncher
from html2parquet_transform_ray import Html2ParquetRayTransformConfiguration


# create launcher
launcher = RayTransformLauncher(Html2ParquetRayTransformConfiguration())
# create parameters
s3_cred = {
    "access_key": "localminioaccesskey",
    "secret_key": "localminiosecretkey",
    "url": "http://localhost:9000",
}

s3_conf = {
    "input_folder": "test/html2parquet/input",
    "output_folder": "test/html2parquet/output",
}
worker_options = {"num_cpus": 0.8}
code_location = {"github": "github", "commit_hash": "12345", "path": "path"}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_s3_cred": ParamsUtils.convert_to_ast(s3_cred),
    "data_s3_config": ParamsUtils.convert_to_ast(s3_conf),
    "data_files_to_use": ast.literal_eval("['.html','.zip']"),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": 3,
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    "runtime_creation_delay": 0,
    "runtime_code_location": ParamsUtils.convert_to_ast(code_location),
}
sys.argv = ParamsUtils.dict_to_req(d=params)
# for arg in sys.argv:
# print(arg)

# launch
launcher.launch()
70 changes: 70 additions & 0 deletions transforms/language/html2parquet/ray/src/html2parquet_transform_ray.py
@@ -0,0 +1,70 @@
# (C) Copyright IBM Corp. 2024.
# Licensed under the Apache License, Version 2.0 (the “License”);
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an “AS IS” BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################

import time
from argparse import ArgumentParser, Namespace
from typing import Any

import pyarrow as pa
from data_processing.runtime.pure_python.runtime_configuration import (
    PythonTransformRuntimeConfiguration,
)
from data_processing.transform import AbstractTableTransform, TransformConfiguration
from data_processing.utils import CLIArgumentProvider, get_logger
from data_processing_ray.runtime.ray import RayTransformLauncher
from data_processing_ray.runtime.ray.runtime_configuration import (
    RayTransformRuntimeConfiguration,
)
from html2parquet_transform import (
    Html2ParquetTransform,
    Html2ParquetTransformConfiguration,
)
from ray.util.metrics import Counter, Gauge

logger = get_logger(__name__)

class Html2ParquetRayTransform(Html2ParquetTransform):
    def __init__(self, config: dict):
        """
        Initialize the base transform and register the per-worker metrics.
        :param config: transform configuration dictionary
        """
        super().__init__(config)

        self.doc_counter = Counter("worker_html_doc_count", "Number of HTML documents converted by the worker")
        self.page_counter = Counter("worker_html_pages_count", "Number of HTML pages converted by the worker")
        self.page_convert_gauge = Gauge(
            "worker_html_page_avg_convert_time", "Average time for converting a single HTML page on each worker"
        )
        self.doc_convert_gauge = Gauge("worker_html_convert_time", "Time spent converting a single document")

    def _update_metrics(self, num_pages: int, elapse_time: float):
        # Skip the page-level metrics when no pages were converted: this avoids
        # a division by zero, and there is nothing to add to the page counter.
        if num_pages > 0:
            self.page_convert_gauge.set(elapse_time / num_pages)
            self.page_counter.inc(num_pages)
        self.doc_convert_gauge.set(elapse_time)
        self.doc_counter.inc(1)


class Html2ParquetRayTransformConfiguration(RayTransformRuntimeConfiguration):
    """
    Implements the RayTransformConfiguration for HTML2PARQUET as required by the RayTransformLauncher.
    """

    def __init__(self):
        """
        Initialization: wraps the base html2parquet transform configuration.
        """
        super().__init__(transform_config=Html2ParquetTransformConfiguration())


if __name__ == "__main__":
    launcher = RayTransformLauncher(Html2ParquetRayTransformConfiguration())
    logger.info("Launching html2parquet transform")
    launcher.launch()
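
`_update_metrics` above feeds two counters and two gauges from a page count and an elapsed time. To make the arithmetic concrete without a running Ray cluster, here is a minimal sketch that swaps `ray.util.metrics.Counter` and `Gauge` for tiny in-memory stand-ins. `FakeCounter`, `FakeGauge`, and `MetricsRecorder` are hypothetical names used only for this illustration, and the sketch skips the page-level updates when no pages were converted so that it never divides by zero.

```python
class FakeCounter:
    """In-memory stand-in for a monotonically increasing counter."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount <= 0:
            raise ValueError("counter increments must be positive")
        self.value += amount


class FakeGauge:
    """In-memory stand-in for a gauge that keeps only the last value set."""

    def __init__(self) -> None:
        self.value = None

    def set(self, value: float) -> None:
        self.value = value


class MetricsRecorder:
    """Mirrors the four metrics of the transform using the stand-ins above."""

    def __init__(self) -> None:
        self.doc_counter = FakeCounter()
        self.page_counter = FakeCounter()
        self.page_convert_gauge = FakeGauge()
        self.doc_convert_gauge = FakeGauge()

    def update(self, num_pages: int, elapsed: float) -> None:
        # Page-level metrics only make sense when at least one page was converted.
        if num_pages > 0:
            self.page_convert_gauge.set(elapsed / num_pages)
            self.page_counter.inc(num_pages)
        self.doc_convert_gauge.set(elapsed)
        self.doc_counter.inc(1)


recorder = MetricsRecorder()
recorder.update(num_pages=4, elapsed=2.0)  # 0.5 s per page
recorder.update(num_pages=0, elapsed=0.5)  # empty document: page metrics untouched
print(recorder.doc_counter.value, recorder.page_counter.value)
```

Note that the gauges hold the most recent observation rather than a running mean, so the "average" gauge reflects the per-page time of the last document with pages, not an average over all documents.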

Binary file not shown.