A Python tool for efficiently transferring Imaging FlowCytobot (IFCB) data files to and from Amazon S3.
The tool uploads, downloads, lists, and deletes IFCB data files in Amazon S3, with performance tuned for large file sets. It includes concurrent processing, progress tracking, and detailed reporting.
- AWS credentials validation
- Support for IFCB data file uploads and downloads
- Recursive directory processing
- Colorized console output
- Concurrent file transfers (up to 32 workers; see the sketch after this list)
- Connection pool optimization (100 connections)
- Automatic retry on failures (3 attempts)
- Batched file submission (1000 files per batch)
- Pre-computed paths for improved performance
- Overall progress tracking with tqdm
- Detailed summary report with:
  - Total files processed
  - Total data transferred
  - Transfer duration
  - Average transfer rate
  - Files processed per second
- Dry-run mode for testing
- Environment variable configuration (automatically used when no args provided)
- Detailed logging
- S3 bucket listing capabilities
- File filtering options for downloads
- Fast bulk deletion of S3 objects
- Simplified S3 URI handling with the --destination parameter
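For illustration, the concurrency, connection-pool, and retry settings listed above map naturally onto boto3's botocore configuration and a thread pool. The sketch below approximates those settings; it is not the tool's actual implementation, and the function and variable names are placeholders.

import concurrent.futures
import boto3
from botocore.config import Config
from tqdm import tqdm

# Illustrative mapping of the settings listed above onto boto3/botocore.
s3 = boto3.client(
    "s3",
    config=Config(
        max_pool_connections=100,       # connection pool optimization
        retries={"max_attempts": 3},    # automatic retry on failures
    ),
)

def upload_batch(files, bucket):
    # files: pre-computed (local_path, s3_key) pairs
    with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
        futures = [pool.submit(s3.upload_file, path, bucket, key)
                   for path, key in files]
        for future in tqdm(concurrent.futures.as_completed(futures),
                           total=len(futures), unit="file"):
            future.result()  # surface any transfer error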
- Python 3.6 or higher
- Sufficient system resources for concurrent processing
- Recommended: 4+ CPU cores and 8GB+ RAM for large file sets
- AWS credentials with appropriate S3 permissions
- Clone the repository:
git clone https://github.com/yourusername/pt5_s3_tool.git
cd pt5_s3_tool
- Install required packages:
pip install -r requirements.txt
- Configure AWS credentials:
  - Create a .env file in the project root
  - Add your AWS credentials:
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_UPLOAD_URL=s3://your-bucket/path
IFCB_DATA_DIR=/path/to/ifcb/data
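At runtime the tool reads these values from the environment. A minimal sketch of how the .env file might be loaded, assuming python-dotenv is used (the tool's actual loading code may differ):

import os
from dotenv import load_dotenv  # assumption: python-dotenv reads the .env file

load_dotenv()  # populates os.environ from .env in the project root
upload_url = os.environ.get("AWS_UPLOAD_URL")
data_dir = os.environ.get("IFCB_DATA_DIR")
# boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment.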
# Using the new --destination parameter (recommended)
python pt5_s3_tool.py --source /path/to/files \
--destination s3://your-bucket/path/in/bucket \
--recursive
# Using legacy parameters
python pt5_s3_tool.py --mode upload \
--source /path/to/files \
--bucket your-bucket \
--prefix path/in/bucket \
--recursive
# Using the new --destination parameter (recommended)
python pt5_s3_tool.py --mode download \
--source s3://your-bucket/path/in/bucket \
--destination /local/path \
--recursive \
--filter "*.png"
# Using legacy parameters
python pt5_s3_tool.py --mode download \
--bucket your-bucket \
--prefix path/in/bucket \
--destination /local/path \
--recursive \
--filter "*.png"
# Using the new --destination parameter (recommended)
python pt5_s3_tool.py --mode list \
--destination s3://your-bucket/path/in/bucket \
--recursive
# Using legacy parameters
python pt5_s3_tool.py --mode list \
--bucket your-bucket \
--prefix path/in/bucket \
--recursive
# Using the new --destination parameter (recommended)
python pt5_s3_tool.py --mode delete \
--destination s3://your-bucket/path/in/bucket \
--recursive
# Alternative syntax with --delete flag
python pt5_s3_tool.py --destination s3://your-bucket/path/in/bucket \
--delete \
--recursive \
--filter "*.tmp"
# Using legacy parameters
python pt5_s3_tool.py --mode delete \
--bucket your-bucket \
--prefix path/in/bucket \
--recursive
If you've set up the .env file with AWS_UPLOAD_URL and IFCB_DATA_DIR, you can run the tool without arguments:
# Uses environment variables for source and destination
python pt5_s3_tool.py
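One way this fallback can be implemented is by using the environment variables as argparse defaults. The snippet below is a sketch under that assumption; the argument names follow this README, but the implementation details are not taken from the tool's source.

import argparse
import os

# Sketch: fall back to AWS_UPLOAD_URL / IFCB_DATA_DIR when flags are omitted.
parser = argparse.ArgumentParser(description="IFCB S3 transfer tool")
parser.add_argument("--source", default=os.environ.get("IFCB_DATA_DIR"),
                    help="local source (defaults to IFCB_DATA_DIR)")
parser.add_argument("--destination", default=os.environ.get("AWS_UPLOAD_URL"),
                    help="S3 destination (defaults to AWS_UPLOAD_URL)")
args = parser.parse_args()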
- --mode: Operation mode (upload, download, list, or delete)
- --destination: S3 destination in the format s3://bucket/prefix for uploads, or a local directory for downloads
- --source: Source file or directory to upload; for downloads, the S3 prefix to download from (can be in s3://bucket/prefix format)
- --recursive: Process directories recursively
- --dry-run: Show what would be transferred without actually transferring
- --verbose: Enable verbose logging
- --validate: Only validate AWS credentials and exit
- --bucket: Target S3 bucket name (legacy)
- --prefix: S3 key prefix (legacy)
- --overwrite: Overwrite existing files when downloading
- --filter: Filter pattern for files to download or delete (e.g., "*.png", "*.tmp"); see the sketch after this list
- --delete: Alternative to --mode delete; confirms deletion intent
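The --filter examples suggest shell-style glob patterns. The helper below is a hypothetical illustration of how such a pattern could be matched against S3 keys; it is not taken from the tool's code, and the example key names are made up.

import fnmatch

def matches_filter(key, pattern):
    # Match the object's file name against a shell-style glob such as "*.png".
    return pattern is None or fnmatch.fnmatch(key.rsplit("/", 1)[-1], pattern)

print(matches_filter("ifcb/2023/D20230101T000000_IFCB123.png", "*.png"))  # True
print(matches_filter("ifcb/2023/D20230101T000000_IFCB123.adc", "*.png"))  # False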
The tool is optimized for large file sets with the following features:
- Batched file submission (1000 files per batch)
- Pre-computed paths
- Optimized connection pooling
- Concurrent processing
- S3 bulk delete API for fast deletions
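As an illustration of the last point, boto3's delete_objects call (the S3 bulk delete API) removes up to 1,000 keys per request. The sketch below shows batched deletion with placeholder bucket and prefix names; the tool's own deletion code may differ.

import boto3

s3 = boto3.client("s3")
bucket = "your-bucket"        # placeholder
prefix = "path/in/bucket/"    # placeholder

# List all keys under the prefix, then delete them in batches of 1,000,
# the maximum accepted by a single delete_objects request.
paginator = s3.get_paginator("list_objects_v2")
keys = [obj["Key"]
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
        for obj in page.get("Contents", [])]

for i in range(0, len(keys), 1000):
    s3.delete_objects(
        Bucket=bucket,
        Delete={"Objects": [{"Key": k} for k in keys[i:i + 1000]],
                "Quiet": True},
    )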
For optimal performance, ensure your system has:
- Sufficient CPU cores for concurrent processing
- Adequate memory for handling large file sets
- Fast storage for file operations
- Reliable network connection to AWS
The tool includes comprehensive error handling for:
- AWS credential validation
- File system operations
- Network connectivity issues
- S3 transfer failures
Failed operations are logged with detailed error messages.
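For example, AWS credentials can be validated cheaply with an STS identity call before any transfer starts. The function below is a sketch of that kind of check, not necessarily the tool's exact implementation.

import boto3
from botocore.exceptions import ClientError, NoCredentialsError

def validate_aws_credentials():
    # Fails fast if credentials are missing, expired, or invalid.
    try:
        boto3.client("sts").get_caller_identity()
        return True
    except (ClientError, NoCredentialsError) as err:
        print(f"AWS credential validation failed: {err}")
        return False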
- Follow PEP 8 guidelines
- Maximum line length: 79 characters
- Maximum function length: 35 lines
- Include docstrings for all functions
- Run tests before submitting changes
- Include new tests for new features
- Maintain test coverage above 80%
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Robert D. Currier ([email protected])
- AWS Boto3 team for the excellent S3 client library
- tqdm team for the progress bar implementation