[Good First Issue]🎆Add npz format output in transform module #33

Open
MooooCat opened this issue Oct 26, 2023 · 0 comments
Labels
difficulty-easy good first issue Good for newcomers help wanted Extra attention is needed

MooooCat commented Oct 26, 2023

🚅Search before asking

I have searched for issues similar to this one.

🚅Description

In the transformer_opt module, add an option to the transform method that writes the module's np.ndarray output to disk in npz format.

Compared with writing the entire output as a single csv file, npz output can save disk space. Since the transformer_opt module already processes the csv file in batches, writing one npz file per batch means other modules will not have to split the data into batches again later. Per-batch files are also more convenient for parallel processing.
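As a rough illustration of the per-batch idea, the sketch below writes each batch to its own compressed npz file and reads it back. The helper names `save_chunk_npz` and `load_chunk_npz`, the `chunk_XXXXX.npz` file naming, and the archive key `data` are assumptions made for this example; none of them come from the sdgx code base.

```python
import os
import tempfile

import numpy as np


def save_chunk_npz(chunk_array, output_dir, chunk_index):
    """Write one transformed batch to its own compressed .npz file (hypothetical helper)."""
    path = os.path.join(output_dir, f"chunk_{chunk_index:05d}.npz")
    # The key name "data" is an arbitrary choice for this sketch.
    np.savez_compressed(path, data=chunk_array)
    return path


def load_chunk_npz(path):
    """Read one batch back as an np.ndarray (hypothetical helper)."""
    with np.load(path) as archive:
        return archive["data"]


with tempfile.TemporaryDirectory() as tmp_dir:
    # Two small fake batches standing in for the module's transformed output.
    batches = [np.arange(12, dtype=float).reshape(4, 3), np.ones((2, 3))]
    paths = [save_chunk_npz(b, tmp_dir, i) for i, b in enumerate(batches)]
    restored = [load_chunk_npz(p) for p in paths]
    roundtrip_ok = all(np.array_equal(a, b) for a, b in zip(batches, restored))
```

Because every batch lives in a separate file, a downstream consumer could load the files independently, for example one file per worker.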

🏕Solution

Changes for this issue should be made in sdgx/transform/transformer_opt.py.

Find the _synchronous_transform method in class DataTransformer and add the parameter output_type to determine the storage type.

🍰Detail

For implementation details, please refer to the comments in the following code block:

# ISSUE DESCRIPTION: add the parameter `output_type` to determine the storage type
def _synchronous_transform(self, input_data_path,
                           column_transform_info_list, 
                           output_path,
                           output_type): # ISSUE DESCRIPTION new args
    """Method Description ... """
    
    loop = True
    # has_write_header = True
    # Read the csv in chunks through an iterator.
    reader = pd.read_csv(input_data_path, iterator=True, chunksize=1000000)

    while loop:
        # Existing code ...
        # ISSUE DESCRIPTION: some code is omitted here for brevity

        # ISSUE DESCRIPTION: add your code here
        chunk_array = np.concatenate(column_data_list, axis=1).astype(float)
        # Append this chunk to the output file as csv rows.
        with open(output_path, "a") as f:
            np.savetxt(f, chunk_array, fmt="%g", delimiter=",")
    # end while
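
One possible shape for the new branch is sketched below as a standalone helper, so it can be run outside the class. The function name `write_chunk`, the accepted values `"csv"` and `"npz"`, the `chunk_index` argument, and the per-chunk file naming are all assumptions for illustration; the issue only specifies that an `output_type` parameter should select the storage type.

```python
import os

import numpy as np


def write_chunk(chunk_array, output_path, output_type="csv", chunk_index=0):
    """Persist one batch; the `output_type` values "csv" and "npz" are assumptions."""
    if output_type == "csv":
        # Same behaviour as the template above: append rows to a single text file.
        with open(output_path, "a") as f:
            np.savetxt(f, chunk_array, fmt="%g", delimiter=",")
        return output_path
    if output_type == "npz":
        # np.savez_compressed overwrites its target instead of appending,
        # so this sketch gives each batch its own file (hypothetical naming).
        root, _ = os.path.splitext(output_path)
        chunk_path = f"{root}_{chunk_index}.npz"
        np.savez_compressed(chunk_path, data=chunk_array)
        return chunk_path
    raise ValueError(f"unsupported output_type: {output_type!r}")
```

Inside the `while` loop, the call could replace the csv-only write, with a counter incremented per batch supplying `chunk_index`. Raising on an unknown `output_type` is a design choice in this sketch, made so that a typo does not silently fall back to csv.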

🍰Example

TBD

@MooooCat MooooCat added good first issue Good for newcomers help wanted Extra attention is needed difficulty-easy labels Oct 26, 2023
@MooooCat MooooCat changed the title 🎆 Add npz format output in transform module [Good First Issue]🎆Add npz format output in transform module Oct 26, 2023