Refactoring: Weird steps in the pipeline (cleaning at various steps that could be streamlined) #61

MinaAlmasi · 2024-08-06T09:05:51Z

The current pipeline is displayed in the image below.

Some steps that may need to be reconsidered:

When extracting metrics (step 3) for both human and AI text, the AI text is lowercased / cleaned at that point. This could instead be done in a separate step, with the cleaned text saved. The reason I haven't done this is that the repo would end up a little bigger.
When using the metrics for classification (step 4B), only then do I remove the few faulty generations that are below the minimum length. They should ideally be removed before steps 4A and 4B to avoid mistakes (e.g. accidentally including them in other analysis work).
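
A rough sketch of what a separate cleaning step could look like, run once before metric extraction so that steps 3, 4A and 4B all see the same filtered data. Everything here is an assumption for illustration: the function name `clean_generations`, the `min_words` parameter, and measuring "minimum length" as a whitespace-split word count are hypothetical and not taken from the repo. Collapsing whitespace is also an assumed part of "cleaning".

```python
def clean_generations(texts, min_words=5):
    """Lowercase each AI generation and drop the ones that are too short.

    Hypothetical helper: `min_words` and the word-count definition of
    "minimum length" are assumptions, not the repo's actual threshold.
    """
    cleaned = []
    for text in texts:
        # Lowercase and collapse runs of whitespace (assumed cleaning rules).
        normalized = " ".join(text.lower().split())
        # Filter faulty generations here, once, before any later step.
        if len(normalized.split()) >= min_words:
            cleaned.append(normalized)
    return cleaned


if __name__ == "__main__":
    raw = ["  Hello World, this IS a Test  ", "Too short"]
    print(clean_generations(raw, min_words=5))
```

Saving the returned list to disk once would let every later step read the same cleaned data, at the cost of a somewhat larger repo.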

MinaAlmasi · 2024-10-14T11:48:12Z

Fixed this in #73 (AI data is now cleaned in a separate script before extracting metrics). Might want to redraw the pipeline digitally for the README.

MinaAlmasi self-assigned this Aug 6, 2024

MinaAlmasi mentioned this issue Oct 14, 2024

major refactoring: process and split data properly, save splitted data #73

Merged

MinaAlmasi closed this as completed Oct 14, 2024
