Support object_name in all from_ formats #14

Merged · 12 commits merged into main on Jul 15, 2024

Conversation

dberenbaum (Collaborator)

Closes https://github.com/iterative/dvcx/issues/1710

  • Replaces parse_parquet/parse_csv with from_parquet/from_csv (replacing other from_csv)
  • Keeps parse_tabular() as a public method for more complex logic (like doing complex filtering before parsing parquet)
  • Adds object_name (taken from from_json but name suggestions welcome) to from_storage/from_features/from_pandas/from_parquet/from_csv

Comment on lines +130 to +131
else obj
if isinstance(obj, tuple)
Collaborator Author:

If expected output was of type Feature and return value was a tuple, it was being wrapped in another tuple instead of returned as is
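A toy sketch of the behavior described above (the function and variable names are assumed, not the actual datachain code):

def normalize_result(obj):
    # tuples pass through unchanged; single results are wrapped exactly once
    return obj if isinstance(obj, tuple) else (obj,)

assert normalize_result((1, 2)) == (1, 2)          # previously this became ((1, 2),)
assert normalize_result("feature") == ("feature",)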

dberenbaum changed the title to "Support object_name in all from_ formats" on Jul 11, 2024
print(static_csv_ds.to_pandas())

uri = "gs://datachain-demo/laion-aesthetics-csv"
print()
print("========================================================================")
print("dynamic CSV with header schema test parsing 3M objects")
print("========================================================================")
dynamic_csv_ds = DataChain.from_csv(uri, object_name="laion", show_schema=True)
dynamic_csv_ds = DataChain.from_csv(uri, object_name="laion")
dynamic_csv_ds.print_schema()
Contributor:

ds.print_schema() and the argument show_schema=True serve different purposes.

The former just prints the entire schema for the datachain.
The latter checks whether the CSV/JSON schema was auto-inferred correctly; if not, the user will need to specify a schema manually.

Collaborator Author:

We can add show_schema to all these methods, but I would like some consensus on whether it's desirable, since it seems a bit superfluous given the schema-related methods added recently by @dmpetrov. Previously, these methods always printed the schema of the table if it was inferred, but @dmpetrov commented here that it looked like a strange side effect, so I dropped it.

Contributor:

There are important differences there.

If you are trying to read a new CSV/JSON file, you likely want to check the schema, but you will also probably copy-paste the output if you want to modify it.

ds.print_schema() does not serve this purpose.

I can talk to Dmitry if needed as we already had this convo.

Collaborator Author:

You can get back the actual schema, though, which seems more useful than a print statement since you can parse it, validate it, and modify it. For example, you could do DataChain.from_csv(uri, object_name="csv").schema["csv"] to get the pydantic-like feature class (I think having something like schema.to_dict() might also be nice).
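For illustration, that kind of inspection might look roughly like the sketch below (model_fields assumes the feature class behaves like a pydantic v2 model; to_dict() is just the suggestion above, not an existing method):

from datachain.lib.dc import DataChain

uri = "gs://datachain-demo/laion-aesthetics-csv"
csv_feature = DataChain.from_csv(uri, object_name="csv").schema["csv"]
print(csv_feature)                # pydantic-like feature class
print(csv_feature.model_fields)   # field names and types, ready to inspect or tweak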

Contributor:

DataChain.from_csv(uri, object_name="csv").schema["csv"]

How does that help with copy-pasting the schema to correct the errors?

Also, I think you don't want to parse a 500 GB CSV file just to discover your auto-schema was wrong.
Maybe we should have show_csv_schema() as a companion to show_json_schema() to address this.

Collaborator Author:

how does it help to copy-paste the schema to correct the errors?

That's why I mentioned schema.to_dict() would be nice. Then you could edit the dict and pass it back.

Also I think you don't want to parse a 500Gb CSV file just to discover your auto-schema was wrong. Maybe we should have show_csv_schema() as a companion to show_json_schema() to address this.

Only a single block (10k lines) will be parsed to infer the schema (this is how arrow does it). Like all other chain operations, the processing happens lazily.
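For context, roughly how that block-based inference looks when using pyarrow's CSV reader directly (a sketch assuming a local sample file):

import pyarrow.csv as pv

# Only the first block is decoded to infer column types; remaining blocks are
# read lazily as batches are consumed.
reader = pv.open_csv(
    "laion-sample.csv",                               # assumed local file
    read_options=pv.ReadOptions(block_size=1 << 20),  # ~1 MB block (the default)
)
print(reader.schema)  # schema inferred from the first block only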

Member:

DataChain.from_csv(uri, object_name="csv").schema["csv"]

This seems to be the right approach for examining the schema, alongside print_schema(). Otherwise, we would need to introduce similar side effects in all parsing methods, which explodes the API and is not considered good practice.

We could potentially improve print_schema(), for example by limiting the number of outputs, like print_schema("csv", "file").

Contributor:

Only a single block (10k lines) will be parsed to infer the schema (this is how arrow does it). Like all other chain operations, the processing happens lazily.

This is not intuitive but might be just okay.

What's not okay is not offering a copy-pasteable Feature or Pydantic-to-feature schema output.
Note that neither print_schema() nor schema() gets us there (the latter should actually be renamed model()):

>>> chain.schema["mistral_response"]
<class '__main__.MistralModel'>
>>> chain.schema["mistral_response"].schema()
{'$defs': {'CompletionResponseChoice': {'properties': {'message': {'allOf': [{'$ref': '#/$defs/MyChatMessage'}], 'default': {'role': '', 'content': ''}}}, 'title': 'CompletionResponseChoice', 'type': 'object'}, 'MyChatMessage': {'properties': {'role': {'default': '', 'title': 'Role', 'type': 'string'}, 'content': {'default': '', 'title': 'Content', 'type': 'string'}}, 'title': 'MyChatMessage', 'type': 'object'}, 'Usage': {'properties': {'prompt_tokens': {'default': 0, 'title': 'Prompt Tokens', 'type': 'integer'}, 'completion_tokens': {'default': 0, 'title': 'Completion Tokens', 'type': 'integer'}}, 'title': 'Usage', 'type': 'object'}}, 'properties': {'id': {'default': '', 'title': 'Id', 'type': 'string'}, 'choices': {'items': {'$ref': '#/$defs/CompletionResponseChoice'}, 'title': 'Choices', 'type': 'array'}, 'usage': {'allOf': [{'$ref': '#/$defs/Usage'}], 'default': {'prompt_tokens': 0, 'completion_tokens': 0}}}, 'required': ['choices'], 'title': 'MistralModel', 'type': 'object'}

>>> chain.print_schema()
 source: str
 parent: str
 name: str
 version: str
 etag: str
 size: int
 vtype: str
 location: dict
 file: File
     source: str
     parent: str
     name: str
     size: int
     version: str
     etag: str
     is_latest: bool
     last_modified: datetime
     location: Union[dict, list[dict], NoneType]
     vtype: str
 mistral_response: MistralModel
     id: str
     choices: list[CompletionResponseChoice]
         * list of: CompletionResponseChoice
             message: MyChatMessage
                 role: str
                 content: str
     usage: Usage
         prompt_tokens: int
         completion_tokens: int

Member:

a copy-paste Feature or Pydantic-to-feature

Generating code for the user to copy-paste is uncommon because it's error-prone (it depends on the environment) and not secure.

It might be better to handle this differently, possibly with a dedicated helper function like pydantic_to_feature().

@volkfox if you think it's an important use case - please share more context. However, it's unlikely it should be part of the data readers.

dmpetrov (Member) left a comment:

@dreadatour great change! Thank you.

A couple of comments are inline, mostly about the naming.

PS: It seems I didn't click the Approve button 😅

@@ -208,6 +208,7 @@ def from_storage(
    catalog: Optional["Catalog"] = None,
    recursive: Optional[bool] = True,
    anon: bool = False,
    object_name: str = "file",
Member:

object_name is a bit too long a name, and object never looks good in naming.

How about output or just name?

src/datachain/lib/dc.py (outdated, resolved)
@@ -208,6 +208,7 @@ def from_storage(
    catalog: Optional["Catalog"] = None,
Member:

Could you please replace catalog with session? We shouldn't show catalog to users, only session.

Collaborator Author:

I replaced it but also added kwargs to pass catalog=catalog because it looks like it's not that simple to replace it in tests and probably needs to happen in another PR.
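Roughly the shape of that compromise, as a standalone sketch (the kwargs wiring and the session.catalog attribute are assumptions, not the merged code):

from typing import Any, Optional

def resolve_catalog(session: Optional[Any] = None, **kwargs: Any) -> Optional[Any]:
    # Prefer an explicit catalog passed via kwargs (keeps existing tests working),
    # otherwise fall back to the session's catalog when a session is given.
    catalog = kwargs.pop("catalog", None)
    if catalog is None and session is not None:
        catalog = getattr(session, "catalog", None)
    return catalog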

@@ -208,6 +208,7 @@ def from_storage(
    catalog: Optional["Catalog"] = None,
    recursive: Optional[bool] = True,
    anon: bool = False,
Member:

I wish we could move anon into session and use only session params instead of catalog + anon.

Member:

It's probably not for this PR, though.

Collaborator Author:

Dropped it as a named arg but left it possible to pass via kwargs.

src/datachain/lib/dc.py (outdated, resolved)
@@ -706,6 +663,7 @@ def from_features(
    ds_name: str = "",
    session: Optional[Session] = None,
    output: Union[None, FeatureType, Sequence[str], dict[str, FeatureType]] = None,
Member:

We should probably rename output to something like schema or spec, and rename object_name -> output.

Member:

Pandas calls it dtype. I'm not sure it's the best name tho.

def from_csv(
    cls,
    path,
    anon: bool = False,
    delimiter: str = ",",
    header: bool = True,
    column_names: Optional[list[str]] = None,
    output: Optional[dict[str, FeatureType]] = None,
Member:

The same - schema/spec instead?

def from_csv(
    cls,
    path,
    anon: bool = False,
Member:

It would be great to get rid of anon as well. Not sure it belongs in this PR.


volkfox (Contributor) commented Jul 11, 2024 via email

dmpetrov (Member):

OTOH, crafting these schemas by hand is a major drawback.

It looks like we need a dedicated method for this, similar to print_schema() but for printing classes, like print_features().

dberenbaum (Collaborator Author):

@dmpetrov Re: output and object_name, I agree the names are not ideal, but I want to provide more context for why it's this way in the PR before deciding on the final API:

  1. The output argument exists to be consistent with output in map/gen/agg/batch_map and is treated identically to those args (with one minor difference: those support Sequence[str], although I am not sure that actually works or makes sense to keep). Do we want to rename it in those as well and make it consistent across the API? (See the sketch after this list for how the two args currently line up.)
  2. object_name was taken from from_json, which already has the args spec, schema_from, object_name, and model_name, so we need an arg name that works there. It might also be worth considering adding the arg to map/gen/agg/batch_map in addition to/instead of the signal map since it does the same thing. Also, I think we need to establish some terminology for what this is so we can name it. Is it an object, signal, feature, output, or something else?
  3. If we make object_name required, we make it impossible to have a "flat" table with more typical column names that people expect. It's also different from map/gen/agg/batch_map, where a grouped signal name is optional.
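For reference, a sketch of how the two args currently sit next to each other in this PR (the column names and types are illustrative):

from datachain.lib.dc import DataChain

uri = "gs://datachain-demo/laion-aesthetics-csv"
chain = DataChain.from_csv(
    uri,
    output={"caption": str, "pwatermark": float},  # same role as output in map/gen/agg
    object_name="laion",                           # nests the parsed columns under laion.*
)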

volkfox (Contributor) commented Jul 12, 2024:

@dberenbaum we should also include the model_name argument in the signature.
The reason is that you might want different objects with the same schema pointing to the same model. For example, you are reading two annotation variants with the same schema, from CSV1 and CSV2. You want the resulting objects to have different names but the model to be named identically.

Unrelated to this issue, but a good idea: add an n_rows argument, which would enable an optimization for code like

DataChain.from_csv(uri).limit(5).show()

so that it does not process 3 million rows to extract 5.
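A sketch of what such an n_rows optimization could build on (the n_rows argument itself is only a proposal above; this uses pyarrow's streaming reader and assumes a local file path):

import pyarrow as pa
import pyarrow.csv as pv

def head_csv(path: str, n_rows: int) -> pa.Table:
    # Stream record batches and stop as soon as n_rows rows have been read,
    # instead of materializing all 3M rows.
    batches, seen = [], 0
    reader = pv.open_csv(path)
    for batch in reader:
        batches.append(batch)
        seen += batch.num_rows
        if seen >= n_rows:
            break
    return pa.Table.from_batches(batches, schema=reader.schema).slice(0, n_rows)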

@volkfox volkfox self-requested a review July 12, 2024 18:37
    output = schema_to_output(schema)
except ValueError as e:
    raise DatasetPrepareError(self.name, e) from e

if object_name:
    if isinstance(output, dict):
        output = dict_to_feature(object_name, output)
Contributor:

You don't always want the Feature class name to be the same as the object name.

This is why from_json() also has a model_name argument.

Contributor:

One more note on the proposed change:

It would be great to have some way to mark the fields of the auto-schema as 'Optional'.
Without this, the last example from the test script examples/json-csv-reader.py will fail at the first reference:

from datachain.lib.dc import C, DataChain
uri = "gs://datachain-demo/laion-aesthetics-csv"
print()
print("========================================================================")
print("dynamic CSV with header schema test parsing 3M objects")
print("========================================================================")
dynamic_csv_ds = DataChain.from_csv(uri, object_name="laion")
dynamic_csv_ds.collect()

dynamic_csv_ds = DataChain.from_csv(uri, object_name="laion")
Processed: 1 rows [00:00, 1088.02 rows/s]
dynamic_csv_ds.collect()
Processed: 1 rows [00:00, 1221.05 rows/s]
Processed: 1 rows [01:57, 117.96s/ rows]
Generated: 3303469 rows [01:57, 28004.44 rows/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dkh/datachain/src/datachain/lib/dc.py", line 544, in collect
    return list(self.iterate(*cols))
  File "/Users/dkh/datachain/src/datachain/lib/dc.py", line 537, in iterate
    yield chain.signals_schema.row_to_features(row, chain.session.catalog)
  File "/Users/dkh/datachain/src/datachain/lib/signal_schema.py", line 189, in row_to_features
    obj = fr_cls(**json)
  File "/Users/dkh/datachain/venv/datachain/lib/python3.9/site-packages/pydantic/main.py", line 193, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for laion
pwatermark
  Input should be a valid number [type=float_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.8/v/float_type
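Until that is addressed, one possible workaround is to pass an explicit output with Optional types instead of relying on auto-schema (a sketch; only pwatermark comes from the error above, the other column is illustrative, and it assumes the output mapping accepts Optional[...] types):

from typing import Optional
from datachain.lib.dc import DataChain

uri = "gs://datachain-demo/laion-aesthetics-csv"
laion_schema = {
    "caption": str,
    "pwatermark": Optional[float],  # nullable column that breaks the auto-schema
}
ds = DataChain.from_csv(uri, output=laion_schema, object_name="laion")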

dmpetrov (Member) commented Jul 13, 2024:

  1. (with one minor difference that those support Sequence[str], although I am not sure that is actually working or makes sense to keep)

Yeah. It was introduced for tuples that are flattened (def func(file) -> tuple[int, int, str]:), which is a similar use case. So there is a clear similarity with output.

2. adding the arg to map/gen/agg/batch_map in addition to/instead of signal map since it does the same thing. Also, I think we need to establish some terminology for what this is so we can name it.

Right, it feels like a lack of terminology and unification! SignalSchema seems like the right data structure. The problem is, it was not designed for user-facing APIs. But it can be done with simple tricks: making it a child of dict (no .values) and having it accept Sequence[str]. The user still should not see it though - just pass Sequence[str].
If we go this route, output here will have a similar meaning as in map() and such.

WDYT?

3. If we make object_name required ...

💯
My idea was using empty string as a default value (for flattening).

@dberenbaum we also should include the model_name argument in the signature.

It makes sense. @dberenbaum WDYT?

PS: @dberenbaum please do not wait long for all the design decisions - we can redesign on top of this.


codecov bot commented Jul 15, 2024

The author of this PR, dberenbaum, is not an activated member of this organization on Codecov.
Please activate this user on Codecov to display this PR comment.
Coverage data is still being uploaded to Codecov.io for purposes of overall coverage calculations.
Please don't hesitate to email us at [email protected] with any questions.

dberenbaum (Collaborator Author):

Right, it feels like a lack of terminology and unification! SignalSchema seems like the right data structure. The problem is, it was not designed for user-facing APIs. But it can be done with simple tricks: making it a child of dict (no .values) and having it accept Sequence[str]. The user still should not see it though - just pass Sequence[str]. If we go this route, output here will have a similar meaning as in map() and such.

WDYT?

Sequence[str] is supported in the latest update. I think it makes sense to keep the separation between the user-facing arg and the SignalSchema data structure. Not sure how it would work to pass something like Sequence[str] directly to SignalSchema since it will need data types, but let me know if I'm misunderstanding you.

@dberenbaum we also should include the model_name argument in the signature.

It makes sense. @dberenbaum WDYT?

Added it to unblock and move forward here. TBH I'm not sure this should be exposed to the user, and this goes back to the discussion above about whether we should print pydantic schemas or provide an editable dict-like schema.

It would be great to have some way to mark the fields of the auto-schema as 'Optional'
Without this, the last example from the test set examples/json-csv-reader.py will fail at first reference:

Yes, this seems like it's the same issue as https://github.com/iterative/dvcx/issues/1697, so let's follow up there.

PS: @dberenbaum please do not wait long for all the design decisions - we can redesign on top of this.

Yup, I'm merging and we can follow up on these questions and how to name these args.

dberenbaum merged commit 3aa4031 into main on Jul 15, 2024
12 checks passed
dberenbaum deleted the unify_from_formats branch July 15, 2024 17:09
mnrozhkov pushed a commit that referenced this pull request Jul 16, 2024
* support object_name in all from_ formats

* refactor arrow infer schema

* use session instead of catalog in datachain

* fix tests

* Revert "fix tests"

This reverts commit 4ca848f.

* hide anon arg in datachain

* drop optional from object_name

* accept list of col names

* fix csv col names and add model_name