* add to_df, from_df for data records: help users easily use DataRecord (and the chosen plan), meaning they can execute a plan on their own data, e.g. data coming from a df
* fix some includes
* fix imports in pytest
* fix imports for test
* centralize hash functions into a helper lib
* add as_df, from_df methods for DataRecord, so users can easily use their own data in computations. Next step: make the df format more natural for the computation engines; a DataRecord is just a special format of df, so it should be possible
* New features (#57): merge of the commits above (to_df/from_df for data records, import fixes, centralized hash helpers, as_df/from_df for DataRecord)
* I think we may still want to finish early for LimitScans; will check w/ Mike
* Dataset accepts list, df as inputs (see the sketch at the end of this entry)

# Change Description

### datasources.py
1. Auto-generate schema for MemorySource output_schema
2. MemorySource accepts more types

### record.py
1. Move build_schema_from_df to the Schema class
2. Make as_df() take a fields_in_schema parameter

### schema.py
1. Add a new DefaultSchema

### datamanager.py
1. Generate an ID for MemorySource instead of requiring a dataset_id input from users
   - Future plan: consider auto-generating IDs for the whole data registry
   - Rationale: users shouldn't need to manage data IDs manually

### datasource.py
1. Add a get_datasource() method to centralize source retrieval in DataSource

### logical.py
1. Remove input_schema from BaseScan init() when not supported
   - LogicalOperator now accepts exactly two params, for a cleaner BaseScan

### sets.py
1. Dataset supports in-memory data as input

* resolve merge conflicts
* Update sets.py
* demo doesn't write to disk
* add pz update --name for pz config
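A minimal sketch of the DataFrame round-trip these commits describe, assuming pandas. `DataRecord` here is a simplified stand-in, and the `from_df`/`to_df` signatures are illustrative rather than the exact Palimpzest API:

```python
import pandas as pd

class DataRecord:
    """Simplified stand-in for PZ's DataRecord (illustrative only)."""
    def __init__(self, data: dict):
        self.data = data

    @classmethod
    def from_df(cls, df: pd.DataFrame) -> list["DataRecord"]:
        # one record per row; a schema would be derived from the df columns
        return [cls(row.to_dict()) for _, row in df.iterrows()]

    @staticmethod
    def to_df(records: list["DataRecord"]) -> pd.DataFrame:
        return pd.DataFrame([r.data for r in records])

df = pd.DataFrame({"title": ["a", "b"], "year": [2023, 2024]})
records = DataRecord.from_df(df)        # df -> records, usable as plan input
round_trip = DataRecord.to_df(records)  # records -> df after execution
assert round_trip.equals(df)
```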
* Refactor ExecutionEngine into modular strategies. The changes move away from a single class handling all responsibilities to a more modular design with separate strategies for optimization, execution, and processing. Key improvements include better configuration management through a dedicated Config class, clearer separation of concerns, and more maintainable strategy implementations. Future plans include improving the Optimizer and QueryProcessor interfaces. The refactoring keeps the existing clean interface while setting up a more scalable foundation for future development.

# Rationale

## Why this change

### Previously
ExecutionEngine took on too many responsibilities. To give users a single interface, we put everything into one class, which is not good practice. Different concepts are coupled together in ExecutionEngine, e.g. execute_plan() vs. execute(), which is confusing and easy to get wrong. It is also not good practice to instantiate Execute() just to run a plan: initializing an instance and running it are separate concerns, e.g. when testing, or when we want to pass instances to different places and run them in different modules.

### After this change
Core concepts are separated into dedicated modules whose names speak for themselves. Long parameter lists move into a Config, which is easier to maintain. Each module takes care of one thing, so the team can work on the codebase together more easily. OptimizerStrategy, ExecutionStrategy, ProcessorStrategy, and QueryProcessor are split out of ExecutionEngine, so it is clear where a new strategy belongs and we no longer have to extend one huge class every time. The interface is still clean. In the future we'll add an "auto" option for strategies, i.e. the system figures out the best strategies based on the Dataset and the params in Config, which will be easy to do in the Factory.

### Important Notes
This is the first infra update, and I expect we can further refine the infrastructure so that PZ will be easier to scale in the future. I didn't change any code inside functions, to keep this change easy to review; I mostly just moved things around. If you see that I deleted something, 99% of the time it is because I moved it elsewhere.

## Next steps
After this change looks good to you, I'll refactor all the demos. I'll see how to improve the Optimizer interface. The QueryProcessor class can be improved further: currently a new class inherits from both ExecutionStrategy and ProcessorStrategy to make the processor run correctly, and I feel this can be improved. Some strategies don't seem to be working; I'll dig into those functionalities and further improve how we define and use strategies.

# Core Classes and Their Relationships

## QueryProcessor (Base Class)
Abstract base class that defines the query processing pipeline.

Dependencies:
- Optimizer - for plan optimization
- Dataset - for data source handling
- QueryProcessorConfig - for configuration, including policy

Implementations include:
- NoSentinelSequentialSingleThreadProcessor
- NoSentinelPipelinedSinglelProcessor
- NoSentinelPipelinedParallelProcessor
- MABSentinelSequentialSingleThreadProcessor
- MABSentinelPipelinedParallelProcessor
- StreamingQueryProcessor
- RandomSamplingSentinelSequentialSingleThreadProcessor
- RandomSamplingSentinelPipelinedProcessor

## QueryProcessorFactory
Creates specific QueryProcessor implementations, configured based on:
- ProcessingStrategyType
- ExecutionStrategyType
- OptimizationStrategyType

(A schematic sketch of this layout follows after the plan.run() note below.)

## Optimizer
Creates an OptimizerStrategy to implement optimize() for the different OptimizerStrategyTypes.

## OptimizationStrategy (Abstract Base Class)
Defines the interface for different optimization strategies. Implementations include:
- GreedyStrategy
- ParetoStrategy
- SentinelStrategy
- ConfidenceIntervalStrategy
- NoOptimizationStrategy
- AutoOptimizationStrategy

* Add an option to use plan.run() to execute the plan (Dataset). Supporting plan.run() to execute the data pipeline offers a streamlined way to build and process the data. This design centralizes all operations around a single Plan (or Dataset) object, making it more intuitive to construct the pipeline and keep the focus on the data workflow.
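A schematic, self-contained sketch of the factory-and-strategies layout and the plan.run() entry point described above. Class names follow the commit notes, but every body, field, and signature is an illustrative assumption, not Palimpzest's actual implementation:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class ProcessingStrategyType(Enum):
    NO_SENTINEL = auto()
    MAB_SENTINEL = auto()

class ExecutionStrategyType(Enum):
    SEQUENTIAL = auto()
    PIPELINED_PARALLEL = auto()

class OptimizationStrategyType(Enum):
    GREEDY = auto()
    PARETO = auto()

@dataclass
class QueryProcessorConfig:
    # long parameter lists live here instead of on ExecutionEngine
    processing: ProcessingStrategyType = ProcessingStrategyType.NO_SENTINEL
    execution: ExecutionStrategyType = ExecutionStrategyType.SEQUENTIAL
    optimization: OptimizationStrategyType = OptimizationStrategyType.PARETO

class QueryProcessor(ABC):
    """Base class defining the query processing pipeline."""
    def __init__(self, config: QueryProcessorConfig):
        self.config = config

    @abstractmethod
    def execute(self) -> str: ...

class NoSentinelSequentialSingleThreadProcessor(QueryProcessor):
    def execute(self) -> str:
        return "ran sequentially without sentinel plans"

class QueryProcessorFactory:
    """Creates a concrete QueryProcessor from the three strategy types."""
    @staticmethod
    def create(config: QueryProcessorConfig) -> QueryProcessor:
        # dispatch on the strategy types; only one combination shown here
        if (config.processing is ProcessingStrategyType.NO_SENTINEL
                and config.execution is ExecutionStrategyType.SEQUENTIAL):
            return NoSentinelSequentialSingleThreadProcessor(config)
        raise NotImplementedError("other strategy combinations elided")

class Dataset:
    """Stand-in for the Plan/Dataset object; run() is the new entry point."""
    def run(self, config: Optional[QueryProcessorConfig] = None) -> str:
        processor = QueryProcessorFactory.create(config or QueryProcessorConfig())
        return processor.execute()

print(Dataset().run())  # plan.run(): build and execute from one object
```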
* improve the readability of get_champion_model()
* leave TODO for Gerardo
* add field types when possible for schemas derived from dataframes
* use for record field access
* map both json str methods --> to_json_str; keep indentation consistent w/ indent=2 (see the sketch below)
* map all as_* verb fcns --> to_* for consistency
* lint check for List --> list, Tuple --> tuple; tried addressing TODO in code
* assert the first physical operator is a DataSourcePhysicalOp & use source_operator.get_datasource()
* remove typo
* add newline at end of file
* mostly lint changes
* rename optimize --> get_optimal_plan to clarify its purpose a bit
* add a note about None for the optimization strategy
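A minimal sketch of the to_json_str convention noted above; this free function is an illustrative stand-in for the methods that were unified on the record/schema classes:

```python
import json

def to_json_str(obj: dict) -> str:
    # both former JSON-string methods converge on one to_* name,
    # with a consistent two-space indentation per the commit
    return json.dumps(obj, indent=2)

print(to_json_str({"field": "value"}))
```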
---------

Co-authored-by: Michael Cafarella <[email protected]>
Co-authored-by: Matthew Russo <[email protected]>