Added ray version of the html2parquet transform #666
Changes from 20 commits
6b6cc62
e692433
efc7271
84dfa95
da9ce93
347be6f
51ff6b9
1189666
0a14f3b
c436eed
9cad376
15fc047
9cc0e8f
c338204
3512d1e
a01a529
f42fa54
dc9cde3
f1842b2
82020fa
7e5619c
b671761
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
ARG BASE_IMAGE=docker.io/rayproject/ray:2.24.0-py310 | ||
|
||
FROM ${BASE_IMAGE} | ||
|
||
RUN pip install --upgrade --no-cache-dir pip | ||
|
||
# install pytest | ||
RUN pip install --no-cache-dir pytest | ||
|
||
# Copy and install data processing libraries | ||
# These are expected to be placed in the docker context before this is run (see the make image). | ||
COPY --chown=ray:users data-processing-lib-python/ data-processing-lib-python/ | ||
RUN cd data-processing-lib-python && pip install --no-cache-dir -e . | ||
COPY --chown=ray:users data-processing-lib-ray/ data-processing-lib-ray/ | ||
RUN cd data-processing-lib-ray && pip install --no-cache-dir -e . | ||
COPY --chown=ray:users python-transform/ python-transform/ | ||
RUN cd python-transform && pip install --no-cache-dir -e . | ||
|
||
# END OF STEPS destined for a data-prep-kit base image | ||
|
||
COPY --chown=ray:users pyproject.toml pyproject.toml | ||
COPY --chown=ray:users requirements.txt requirements.txt | ||
RUN pip install --no-cache-dir -e . | ||
|
||
# copy transform main() entry point to the image | ||
COPY --chown=ray:users src/html2parquet_transform_ray.py ./ | ||
|
||
# copy some of the samples in | ||
COPY --chown=ray:users ./src/html2parquet_local_ray.py local/ | ||
|
||
# copy test | ||
COPY test/ test/ | ||
COPY test-data/ test-data/ | ||
|
||
# Set environment | ||
ENV PYTHONPATH=/home/ray | ||
|
||
# Put these at the end since they seem to upset the docker cache. | ||
ARG BUILD_DATE | ||
ARG GIT_COMMIT | ||
LABEL build-date=$BUILD_DATE | ||
LABEL git-commit=$GIT_COMMIT |
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
@@ -0,0 +1,61 @@ | ||||
# Define the root of the local git clone so that the common rules | ||||
# know where they are running from. | ||||
REPOROOT=../../../.. | ||||
# Include a library of common .transform.* targets which most | ||||
# transforms should be able to reuse. However, feel free | ||||
# to override/redefine the rules below. | ||||
|
||||
# $(REPOROOT)/.make.versions file contains the versions | ||||
|
||||
TRANSFORM_NAME= html2parquet | ||||
|
||||
include $(REPOROOT)/transforms/.make.transforms | ||||
|
||||
touma-I marked this conversation as resolved.
|
||||
venv:: .transforms.ray-venv | ||||
|
||||
test:: .transforms.ray-test | ||||
|
||||
clean:: .transforms.clean | ||||
|
||||
image:: .transforms.ray-image | ||||
|
||||
test-src:: .transforms.test-src | ||||
|
||||
setup:: .transforms.setup | ||||
|
||||
build:: build-dist image | ||||
|
||||
publish: publish-image | ||||
|
||||
publish-image:: .transforms.publish-image-ray | ||||
|
||||
setup:: .transforms.setup | ||||
|
||||
# The distribution version is the same as the image version. | ||||
set-versions: | ||||
	$(MAKE) TRANSFORM_PYTHON_VERSION=$(HTML2PARQUET_PYTHON_VERSION) TOML_VERSION=$(HTML2PARQUET_PYTHON_VERSION) .transforms.set-versions | ||||
|
||||
build-dist:: set-versions .defaults.build-dist | ||||
|
||||
publish-dist:: .defaults.publish-dist | ||||
|
||||
test-image:: .transforms.ray-test-image | ||||
|
||||
run-cli-sample: | ||||
	$(MAKE) RUN_FILE=$(TRANSFORM_NAME)_transform_ray.py \ | ||||
		RUN_ARGS=" \ | ||||
		--data_local_config \" { 'input_folder' : '../test-data/input', 'output_folder' : '../output' } \" \ | ||||
		--data_files_to_use \"['.html','.zip']\" " \ | ||||
		.transforms.run-src-file | ||||
|
||||
run-local-sample: .transforms.run-local-ray-sample | ||||
|
||||
|
||||
#run-s3-ray-sample: .transforms.run-s3-ray-sample | ||||
|
||||
minio-start: .minio-start | ||||
|
||||
kind-load-image:: .transforms.kind-load-image | ||||
Please update the following rules to match those in other transforms from the latest dev branch. For example,
The Makefile you are referring to is located in the ray folder, while the example provided is directly under the ededup folder. I checked the Makefile for the ededup transform within the ray folder, and it matches the same structure.
|
||||
|
||||
docker-load-image: .defaults.docker-load-image | ||||
|
||||
docker-save-image: .defaults.docker-save-image |
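The `--data_local_config` and `--data_files_to_use` values in `run-cli-sample` are Python literals passed as strings. A minimal sketch of how such strings can be turned back into Python objects, assuming they are parsed with `ast.literal_eval` (the sample scripts in this PR use that call for `data_files_to_use`; whether the launcher does the same internally is an assumption here):

```python
import ast

# Strings as they would arrive from the command line in run-cli-sample.
raw_config = "{ 'input_folder' : '../test-data/input', 'output_folder' : '../output' }"
raw_extensions = "['.html','.zip']"

# literal_eval accepts only Python literals, so it does not execute arbitrary code.
config = ast.literal_eval(raw_config)
extensions = ast.literal_eval(raw_extensions)

print(config["input_folder"])
print(extensions)
```

This is why the Makefile has to escape the inner double quotes: the whole literal must reach the program as a single argument.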
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
# html2parquet Ray Transform | ||
|
||
This module implements the Ray version of the [html2parquet transform](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/html2parquet/python/README.md). | ||
|
||
The HTML conversion uses [Trafilatura](https://trafilatura.readthedocs.io/en/latest/usage-python.html). | ||
|
||
## Prometheus metrics | ||
|
||
The transform registers the following metrics through `ray.util.metrics`: | ||
|
||
| metric name | Description | | ||
|----------------------------------|------------------------------------------------------------------| | ||
| worker_html_doc_count | Number of HTML documents converted by the worker | | ||
| worker_html_pages_count | Number of HTML pages converted by the worker | | ||
| worker_html_page_avg_convert_time | Average time for converting a single HTML page on each worker | | ||
| worker_html_convert_time | Time spent converting a single document | |
These are not part of the output parquet file. I think you need to tell the reader where these can be viewed. Also, please link from here to the python readme for annotations and configuration.
Actually, I'm not entirely sure where these metrics can be viewed. In the Ray implementation, I borrowed the pdf2parquet code, but I’m not very familiar with Prometheus metrics. These metrics are generated from the
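The metrics in this table come in two kinds: counters, which only accumulate, and gauges, which keep the last value written. A minimal sketch of that difference, using hypothetical stand-in classes (`SketchCounter`, `SketchGauge`) rather than the real `ray.util.metrics` objects so that it runs without a Ray cluster. The sketch skips the per-page average when no pages were converted, to avoid dividing by zero:

```python
class SketchCounter:
    """Hypothetical stand-in for a counter metric: values only accumulate."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount


class SketchGauge:
    """Hypothetical stand-in for a gauge metric: keeps only the last value set."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


doc_count = SketchCounter()
pages_count = SketchCounter()
page_avg_convert_time = SketchGauge()
convert_time = SketchGauge()


def update_metrics(num_pages: int, elapsed: float) -> None:
    # Only compute the per-page average when there is at least one page.
    if num_pages > 0:
        page_avg_convert_time.set(elapsed / num_pages)
    convert_time.set(elapsed)
    doc_count.inc(1)
    pages_count.inc(num_pages)


# Two documents: 4 pages in 2.0 s, then 1 page in 0.5 s.
update_metrics(num_pages=4, elapsed=2.0)
update_metrics(num_pages=1, elapsed=0.5)
print(doc_count.value, pages_count.value, page_avg_convert_time.value, convert_time.value)
```

After both calls the counters hold totals across documents, while the gauges reflect only the most recent document.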
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
[project] | ||
name = "dpk_html2parquet_transform_ray" | ||
version = "0.2.2.dev0" | ||
requires-python = ">=3.10" | ||
description = "HTML2PARQUET Ray Transform" | ||
license = {text = "Apache-2.0"} | ||
readme = {file = "README.md", content-type = "text/markdown"} | ||
authors = [ | ||
{ name = "Sungeun An", email = "[email protected]" }, | ||
{ name = "Syed Zawad", email = "[email protected]" }, | ||
] | ||
dynamic = ["dependencies"] | ||
|
||
[build-system] | ||
requires = ["setuptools>=68.0.0", "wheel", "setuptools_scm[toml]>=7.1.0"] | ||
build-backend = "setuptools.build_meta" | ||
|
||
[tool.setuptools.dynamic] | ||
dependencies = {file = ["requirements.txt"]} | ||
|
||
[project.optional-dependencies] | ||
dev = [ | ||
"twine", | ||
"pytest>=7.3.2", | ||
"pytest-dotenv>=0.5.2", | ||
"pytest-env>=1.0.0", | ||
"pre-commit>=3.3.2", | ||
"pytest-cov>=4.1.0", | ||
"pytest-mock>=3.10.0", | ||
"moto==5.0.5", | ||
"markupsafe==2.0.1", | ||
] | ||
|
||
[options] | ||
package_dir = ["src","test"] | ||
|
||
[options.packages.find] | ||
where = ["src/"] | ||
|
||
[tool.pytest.ini_options] | ||
# Currently we use low coverage since we have to run tests separately (see makefile) | ||
#addopts = "--cov --cov-report term-missing --cov-fail-under 25" | ||
markers = ["unit: unit tests", "integration: integration tests"] | ||
|
||
[tool.coverage.run] | ||
include = ["src/*"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
dpk-html2parquet-transform-python==0.2.2.dev0 | ||
data-prep-toolkit-ray==0.2.2.dev0 | ||
trafilatura==1.12.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
# (C) Copyright IBM Corp. 2024. | ||
# Licensed under the Apache License, Version 2.0 (the “License”); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an “AS IS” BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
################################################################################ | ||
|
||
|
||
import ast | ||
import os | ||
import sys | ||
|
||
from data_processing.utils import ParamsUtils | ||
from data_processing_ray.runtime.ray import RayTransformLauncher | ||
from html2parquet_transform_ray import Html2ParquetRayTransformConfiguration | ||
|
||
|
||
# create parameters | ||
input_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "test-data", "input")) | ||
output_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "output")) | ||
local_conf = { | ||
"input_folder": input_folder, | ||
"output_folder": output_folder, | ||
} | ||
worker_options = {"num_cpus": 0.8} | ||
code_location = {"github": "github", "commit_hash": "12345", "path": "path"} | ||
params = { | ||
# where to run | ||
"run_locally": True, | ||
# Data access. Only required parameters are specified | ||
"data_local_config": ParamsUtils.convert_to_ast(local_conf), | ||
"data_files_to_use": ast.literal_eval("['.html','.zip']"), | ||
# orchestrator | ||
"runtime_worker_options": ParamsUtils.convert_to_ast(worker_options), | ||
"runtime_num_workers": 3, | ||
"runtime_pipeline_id": "pipeline_id", | ||
"runtime_job_id": "job_id", | ||
"runtime_creation_delay": 0, | ||
"runtime_code_location": ParamsUtils.convert_to_ast(code_location), | ||
# html2parquet params | ||
} | ||
|
||
html2parquet_params = {} | ||
if __name__ == "__main__": | ||
    # Set the simulated command line args | ||
    sys.argv = ParamsUtils.dict_to_req(d=params) | ||
    # create launcher | ||
    launcher = RayTransformLauncher(Html2ParquetRayTransformConfiguration()) | ||
    # Launch the ray actor(s) to process the input | ||
    launcher.launch() |
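The sample above builds a `params` dict and installs it as `sys.argv` before launching. A rough sketch of that idea, using a hypothetical `dict_to_argv` helper written for this illustration; it is not `ParamsUtils.dict_to_req`, whose exact output format is not shown in this PR, and the program name `sample_program` is likewise made up:

```python
import argparse


def dict_to_argv(params: dict) -> list[str]:
    """Hypothetical helper: render each key/value pair as a --key=value argument."""
    argv = ["sample_program"]  # placeholder for argv[0]
    for key, value in params.items():
        argv.append(f"--{key}={value}")
    return argv


params = {"runtime_num_workers": 3, "runtime_pipeline_id": "pipeline_id"}
argv = dict_to_argv(params)

# A parser on the receiving side recovers typed values from the simulated argv.
parser = argparse.ArgumentParser()
parser.add_argument("--runtime_num_workers", type=int)
parser.add_argument("--runtime_pipeline_id", type=str)
args = parser.parse_args(argv[1:])

print(args.runtime_num_workers, args.runtime_pipeline_id)
```

Simulating the command line this way lets the same launcher code path serve both the local sample and a real CLI invocation.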
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
# (C) Copyright IBM Corp. 2024. | ||
# Licensed under the Apache License, Version 2.0 (the “License”); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an “AS IS” BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
################################################################################ | ||
|
||
import ast | ||
import os | ||
import sys | ||
|
||
from data_processing.utils import ParamsUtils | ||
from data_processing_ray.runtime.ray import RayTransformLauncher | ||
from html2parquet_transform_ray import Html2ParquetRayTransformConfiguration | ||
|
||
|
||
# create launcher | ||
launcher = RayTransformLauncher(Html2ParquetRayTransformConfiguration()) | ||
# create parameters | ||
s3_cred = { | ||
"access_key": "localminioaccesskey", | ||
"secret_key": "localminiosecretkey", | ||
"url": "http://localhost:9000", | ||
} | ||
|
||
s3_conf = { | ||
"input_folder": "test/html2parquet/input", | ||
"output_folder": "test/html2parquet/output", | ||
} | ||
worker_options = {"num_cpus": 0.8} | ||
code_location = {"github": "github", "commit_hash": "12345", "path": "path"} | ||
params = { | ||
# where to run | ||
"run_locally": True, | ||
# Data access. Only required parameters are specified | ||
"data_s3_cred": ParamsUtils.convert_to_ast(s3_cred), | ||
"data_s3_config": ParamsUtils.convert_to_ast(s3_conf), | ||
"data_files_to_use": ast.literal_eval("['.html','.zip']"), | ||
# orchestrator | ||
"runtime_worker_options": ParamsUtils.convert_to_ast(worker_options), | ||
"runtime_num_workers": 3, | ||
"runtime_pipeline_id": "pipeline_id", | ||
"runtime_job_id": "job_id", | ||
"runtime_creation_delay": 0, | ||
"runtime_code_location": ParamsUtils.convert_to_ast(code_location), | ||
} | ||
sys.argv = ParamsUtils.dict_to_req(d=params) | ||
# for arg in sys.argv: | ||
# print(arg) | ||
|
||
# launch | ||
launcher.launch() |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,70 @@ | ||
# (C) Copyright IBM Corp. 2024. | ||
# Licensed under the Apache License, Version 2.0 (the “License”); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an “AS IS” BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
################################################################################ | ||
|
||
import time | ||
from argparse import ArgumentParser, Namespace | ||
from typing import Any | ||
|
||
import pyarrow as pa | ||
from data_processing.runtime.pure_python.runtime_configuration import ( | ||
PythonTransformRuntimeConfiguration, | ||
) | ||
from data_processing.transform import AbstractTableTransform, TransformConfiguration | ||
from data_processing.utils import CLIArgumentProvider, get_logger | ||
from data_processing_ray.runtime.ray import RayTransformLauncher | ||
from data_processing_ray.runtime.ray.runtime_configuration import ( | ||
RayTransformRuntimeConfiguration, | ||
) | ||
from html2parquet_transform import ( | ||
Html2ParquetTransform, | ||
Html2ParquetTransformConfiguration, | ||
) | ||
from ray.util.metrics import Counter, Gauge | ||
|
||
logger = get_logger(__name__) | ||
|
||
class Html2ParquetRayTransform(Html2ParquetTransform): | ||
    def __init__(self, config: dict): | ||
        """Initialize the base transform and register the Ray metrics for this worker.""" | ||
        super().__init__(config) | ||
|
||
        self.doc_counter = Counter("worker_html_doc_count", "Number of HTML documents converted by the worker") | ||
        self.page_counter = Counter("worker_html_pages_count", "Number of HTML pages converted by the worker") | ||
        self.page_convert_gauge = Gauge( | ||
            "worker_html_page_avg_convert_time", "Average time for converting a single HTML page on each worker" | ||
        ) | ||
        self.doc_convert_gauge = Gauge("worker_html_convert_time", "Time spent converting a single document") | ||
|
||
    def _update_metrics(self, num_pages: int, elapse_time: float): | ||
        # Skip the per-page average when no pages were converted (avoids dividing by zero). | ||
        if num_pages > 0: | ||
            self.page_convert_gauge.set(elapse_time / num_pages) | ||
        self.doc_convert_gauge.set(elapse_time) | ||
        self.doc_counter.inc(1) | ||
        self.page_counter.inc(num_pages) | ||
|
||
class Html2ParquetRayTransformConfiguration(RayTransformRuntimeConfiguration): | ||
    """ | ||
    Implements the RayTransformConfiguration for HTML2PARQUET as required by the RayTransformLauncher. | ||
    """ | ||
|
||
    def __init__(self): | ||
        """ | ||
        Initialization. Takes no parameters; wraps the base html2parquet transform configuration. | ||
        """ | ||
        super().__init__(transform_config=Html2ParquetTransformConfiguration()) | ||
|
||
|
||
if __name__ == "__main__": | ||
    launcher = RayTransformLauncher(Html2ParquetRayTransformConfiguration()) | ||
    logger.info("Launching html2parquet transform") | ||
    launcher.launch() | ||
|
There is no kfp support yet. Why is this in here?