A Python utility to consolidate multiple CSV files into a single file while intelligently handling header mismatches and metadata. Designed to work with CSV files that may have metadata headers before the actual CSV data.
- Merges multiple CSV files into one
- Handles CSV files with metadata headers
- Detects and handles header mismatches
- Interactive prompt for handling new headers
- Intelligent date-based output filename generation
- Maintains separate directories for unprocessed and processed files
- Detailed logging of the consolidation process
csv_consolidator/
├── data/
│ ├── unprocessed/ (put your CSV files here)
│ └── processed/ (consolidated output goes here)
├── csv_consolidator.py
├── requirements.txt
└── README.md
- Clone or download this repository
- Install the required dependencies: `pip install -r requirements.txt`
- Place your CSV files in the `data/unprocessed` directory
- Run the script: `python csv_consolidator.py [output_filename]`
The script can be run in two ways:
- Without arguments: `python csv_consolidator.py`
  - Generates the filename from the date range found in the files
  - Format: `consolidated_MM-DD-YYYY_thru_MM-DD-YYYY.csv`
  - Example: `consolidated_04-09-2021_thru_03-02-2022.csv`
- With a filename: `python csv_consolidator.py myoutput.csv`
  - Uses the filename you specify
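
The date-based naming can be sketched as follows. This is a minimal illustration, not the script's actual code: `build_output_filename` and the `consolidated.csv` fallback for files without dates are hypothetical names chosen for the example.

```python
from datetime import date


def build_output_filename(dates):
    """Return a consolidated_<start>_thru_<end>.csv name from an iterable of dates."""
    dates = list(dates)
    if not dates:
        # Hypothetical fallback; the real script's behavior here is not documented.
        return "consolidated.csv"
    fmt = "%m-%d-%Y"  # MM-DD-YYYY, as in the documented format
    start, end = min(dates), max(dates)
    return f"consolidated_{start.strftime(fmt)}_thru_{end.strftime(fmt)}.csv"


# The earliest and latest dates define the range, regardless of input order.
print(build_output_filename([date(2021, 12, 25), date(2021, 4, 9), date(2022, 3, 2)]))
# consolidated_04-09-2021_thru_03-02-2022.csv
```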
The script is designed to handle CSV files that have metadata headers before the actual CSV data. It will:
- Skip metadata headers at the top of the file
- Find the actual CSV headers (starting with "ID,Timestamp,Transaction Type")
- Process the data from that point onward
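
One way to implement the steps above is to scan for the first line that starts with the expected header prefix and parse everything from there. The sketch below uses only the standard library; `read_rows_after_metadata` is a hypothetical helper, and the sample metadata lines are made up for illustration.

```python
import csv
import io

HEADER_PREFIX = "ID,Timestamp,Transaction Type"


def read_rows_after_metadata(text, header_prefix=HEADER_PREFIX):
    """Return (headers, rows), parsing from the first line that starts with header_prefix."""
    lines = text.splitlines(keepends=True)
    for index, line in enumerate(lines):
        if line.startswith(header_prefix):
            # Everything above this line is treated as metadata and ignored.
            reader = csv.reader(io.StringIO("".join(lines[index:])))
            headers = next(reader)
            return headers, [row for row in reader if row]  # drop blank lines
    raise ValueError("no CSV header line found")


# Invented sample: two metadata lines and a blank line before the real header.
sample = (
    "Account Statement\n"
    "Generated: 2022-03-02\n"
    "\n"
    "ID,Timestamp,Transaction Type,Amount\n"
    "1,2021-04-09 10:00:00,Deposit,25.00\n"
    "2,2021-04-10 11:30:00,Withdrawal,10.00\n"
)
headers, rows = read_rows_after_metadata(sample)
print(headers)
print(len(rows))
```

Raising an error when no header line is found lets the caller decide whether to log and skip the file.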
When different CSV files have different headers:
- You'll be shown which files have different headers
- You'll be prompted whether to include the new columns
- If you choose 'no', only common headers will be included
- If you choose 'yes', all unique headers will be included (with NaN for missing values)
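
The two choices correspond to taking the intersection or the union of the headers. The following is a rough standard-library sketch under stated assumptions: `consolidate` is a hypothetical function, rows are plain dictionaries, and the string `"NaN"` stands in for the missing-value marker the script writes.

```python
def consolidate(tables, include_new_columns):
    """Merge (headers, rows) pairs, where each row is a dict keyed by header name.

    include_new_columns=True keeps the union of headers; False keeps only
    the headers present in every table.
    """
    all_headers = []
    for headers, _ in tables:
        for name in headers:
            if name not in all_headers:
                all_headers.append(name)  # preserve first-seen order

    if include_new_columns:
        final_headers = all_headers
    else:
        final_headers = [
            h for h in all_headers if all(h in headers for headers, _ in tables)
        ]

    merged = []
    for _, rows in tables:
        for row in rows:
            # Missing values are filled with a placeholder (assumed to be "NaN").
            merged.append({h: row.get(h, "NaN") for h in final_headers})
    return final_headers, merged


file_a = (["ID", "Amount"], [{"ID": "1", "Amount": "25.00"}])
file_b = (["ID", "Amount", "Fee"], [{"ID": "2", "Amount": "10.00", "Fee": "0.50"}])

print(consolidate([file_a, file_b], include_new_columns=True))
print(consolidate([file_a, file_b], include_new_columns=False))
```

With `include_new_columns=True` the row from `file_a` gains a `Fee` entry holding the placeholder; with `False` the `Fee` column is dropped from every row.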
The script looks for dates in the following column names (case-insensitive):
- date
- timestamp
- created_at
- datetime
- time
These dates are used to generate the output filename when no filename is specified.
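
A case-insensitive lookup over the names listed above might look like the sketch below. It assumes an exact match on the whole column name after trimming whitespace; whether the script also accepts partial matches is not stated here. `find_date_columns` is a hypothetical helper.

```python
DATE_COLUMN_NAMES = {"date", "timestamp", "created_at", "datetime", "time"}


def find_date_columns(headers):
    """Return the headers whose names match a known date column, ignoring case."""
    return [h for h in headers if h.strip().lower() in DATE_COLUMN_NAMES]


print(find_date_columns(["ID", "Timestamp", "Transaction Type", "Created_At"]))
# ['Timestamp', 'Created_At']
```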
The repository is configured to:
- Ignore all files in `data/unprocessed` and `data/processed`
- Keep the directory structure using `.gitkeep` files
- Ignore Python virtual environment and cache files
- Ignore system files like `.DS_Store`
- Invalid CSV files are logged and skipped
- Detailed error messages for file reading issues
- Verification of input parameters
- Graceful handling of missing date information
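
The log-and-skip behavior can be illustrated with a small loop. This is a sketch under assumptions, not the script's implementation: the logger name, the `load_valid_files` helper, and the rule that an empty file counts as invalid are all hypothetical.

```python
import csv
import logging
from pathlib import Path

logger = logging.getLogger("csv_consolidator_sketch")  # hypothetical logger name


def load_valid_files(paths):
    """Return {filename: rows} for readable files; log and skip the rest."""
    loaded = {}
    for path in map(Path, paths):
        try:
            with path.open(newline="", encoding="utf-8") as handle:
                rows = list(csv.reader(handle))
            if not rows:
                raise ValueError("file is empty")  # assumed to count as invalid
        except (OSError, UnicodeDecodeError, csv.Error, ValueError) as exc:
            # One bad file should not abort the whole consolidation run.
            logger.warning("Skipping %s: %s", path, exc)
            continue
        loaded[path.name] = rows
    return loaded
```

Catching errors per file keeps the run going, and the warning records which file was skipped and why.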
The consolidated CSV file will:
- Include all specified headers
- Maintain data integrity
- Be placed in the `data/processed` directory
- Include a summary of:
  - Number of files processed
  - Total rows
  - Total columns
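
As a rough illustration of the output step, the sketch below writes merged rows with `csv.DictWriter` and returns the three summary figures. The function name `write_consolidated` and the dictionary keys are hypothetical, and an in-memory buffer stands in for a file under `data/processed`.

```python
import csv
import io


def write_consolidated(handle, headers, rows, files_processed):
    """Write rows (dicts) under the given headers and return summary counts."""
    writer = csv.DictWriter(handle, fieldnames=headers, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return {
        "files_processed": files_processed,
        "total_rows": len(rows),
        "total_columns": len(headers),
    }


buffer = io.StringIO()  # stands in for an open file in data/processed
summary = write_consolidated(
    buffer,
    ["ID", "Amount"],
    [{"ID": "1", "Amount": "25.00"}, {"ID": "2", "Amount": "10.00"}],
    files_processed=2,
)
print(summary)
```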