Early wrangling #1

Draft
wants to merge 70 commits into
base: main

@bobular commented Nov 27, 2024

No description provided.

- Added type detection for variables
- Introduced warnings for duplicate column names
- Integrated `type_convert` for robust type inference
- Added `data_type` auto-detection, including a new `id` type for primary keys.
  - Dates are explicitly excluded from being detected as `id`.
  - Detection of `id` applies only to primary keys (parent IDs are not detected by the current simple logic).
- Added `data_shape` inference logic:
  - Variables with `number`, `integer`, or `date` types default to `continuous`.
  - All other types default to `categorical`.
- Updated test fixture data to include an additional row, ensuring non-date variables have non-unique values.

This implementation improves the metadata generation for variables, aligning with EDA requirements.
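
The defaults described in this commit can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the package's R code. The `data_shape` rule follows the bullets above; the uniqueness test for `id` and the fallback type name `string` are assumptions that the commit message does not state, and both function names are hypothetical.

```python
from datetime import date

# Types that default to a continuous shape, per the commit message.
CONTINUOUS_TYPES = {"number", "integer", "date"}


def infer_data_shape(data_type):
    """Return the default data_shape for a given data_type."""
    return "continuous" if data_type in CONTINUOUS_TYPES else "categorical"


def infer_data_type(values):
    """Guess a data_type for a column (hypothetical helper).

    Assumption: a non-date column whose non-missing values are all
    unique is treated as an `id`. Dates are checked first so that a
    column of unique dates is never labelled `id`.
    """
    present = [v for v in values if v is not None]
    if not present:
        return "string"  # assumed fallback name, not from the source
    if all(isinstance(v, date) for v in present):
        return "date"
    if len(set(present)) == len(present):
        return "id"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, int) and not isinstance(v, bool) for v in present):
        return "integer"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "number"
    return "string"
```

Under the assumed uniqueness rule, a column with repeated values is never labelled `id`, which would explain the fixture row added to give non-date variables non-unique values.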
…and tests

- Introduced the `preprocess_fn` argument for user-defined data cleanup
  before type inference. This allows handling edge cases like correcting
  invalid dates (e.g., changing '2021-02-29' to '2021-03-01').
- Enhanced type inference warnings:
  - Invalid date warnings from `type_convert` are now intercepted and
    embellished with a note about using `preprocess_fn`.
  - Suppressed propagation of handled warnings to avoid duplicates.
- Added a test for handling an invalid leap-day date ('2021-02-29'; 2021 is not a leap year).
  - The test adds the invalid date using `preprocess_fn`.
  - Invalid dates are converted to `NA`, as per `type_convert` behavior,
    with appropriate warnings and user guidance issued.
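
The flow described in this commit (optional cleanup hook first, then date parsing, with invalid dates becoming missing and a warning that points at the hook) can be sketched as follows. This is a Python illustration under stated assumptions, not the package's R implementation. `parse_dates` and `fix_leap_day` are hypothetical names, and the hook here receives a list of strings rather than whatever the real `preprocess_fn` is given.

```python
import warnings
from datetime import date


def parse_dates(raw_values, preprocess_fn=None):
    """Parse ISO date strings, turning invalid ones into None."""
    # Run the user-supplied cleanup before any parsing is attempted.
    if preprocess_fn is not None:
        raw_values = preprocess_fn(raw_values)
    parsed = []
    for raw in raw_values:
        try:
            parsed.append(date.fromisoformat(raw))
        except ValueError:
            # Embellish the warning with guidance about the hook.
            warnings.warn(
                f"Invalid date {raw!r} converted to missing; "
                "consider correcting it with preprocess_fn",
                UserWarning,
            )
            parsed.append(None)
    return parsed


def fix_leap_day(values):
    """Example hook: move the invalid 2021-02-29 to 2021-03-01."""
    return ["2021-03-01" if v == "2021-02-29" else v for v in values]
```

Calling `parse_dates(["2021-02-29"])` yields a missing value plus a warning, while passing `preprocess_fn=fix_leap_day` corrects the value before parsing, so no warning is raised.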
- Added functionality to display metadata for ID and variable columns separately:
  - Prints a concise summary of ID columns.
  - Includes detailed metadata for variable columns (`data_type`, `data_shape`).
- Integrated `skimr::skim()` for summarizing variable data.
  - Excludes ID columns from the summary.
- Added a placeholder note for a future entity-level metadata summary.
- Provides an intuitive way to inspect Entity objects, including column metadata and data summaries.
- Introduced an S4 method `inspect_variable()` to inspect a single variable in detail:
  - Validates the presence of the variable in the Entity's metadata.
  - Displays metadata for the specified variable in a vertical format using `pivot_longer()`.
  - Summarizes the variable's data using `skim()`, pivoted for readability.
- Ensures robust handling of mixed types in skim output by converting all values to character before pivoting.
- Complements the `inspect()` method for Entity-wide inspection by focusing on individual variables.
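
The validate-then-pivot steps described for single-variable inspection can be sketched like this. It is a Python illustration only; the real method is an S4 method in R using `pivot_longer()`. The function name `inspect_variable_sketch` and the dict-of-dicts metadata layout are assumptions made for the example.

```python
def inspect_variable_sketch(metadata, variable_name):
    """Return one variable's metadata as vertical (field, value) rows.

    `metadata` is assumed to map variable names to a dict of fields.
    """
    # Fail early if the variable is not described in the metadata.
    if variable_name not in metadata:
        raise KeyError(f"Variable {variable_name!r} not found in metadata")
    # Convert every value to text first so that mixed types
    # (strings, integers, floats) can share a single value column.
    return [(field, str(value)) for field, value in metadata[variable_name].items()]
```

Converting to text before pivoting mirrors the commit's approach to mixed types: a long-format value column can only hold one type, so everything is rendered as character.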
…them in a few places - fix tests that were broken by using the kable simple format