diff --git a/Docs/Knn.md b/Docs/Knn.md new file mode 100644 index 0000000..f186a3e --- /dev/null +++ b/Docs/Knn.md @@ -0,0 +1,53 @@ +# K-Nearest Neighbors (KNN) - Documentation + +## 📋 Overview + +KNN is a simple, instance-based learning algorithm that classifies data points based on the classes of their k nearest neighbors[web:100][web:102]. + +**Key Characteristics:** +- **Type**: Instance-based Learning +- **Algorithm**: Distance-based classification +- **Output**: Class based on neighbor voting +- **Best For**: Small to medium datasets, pattern recognition + +## 🎯 Purpose and Use Cases + +- **Recommendation Systems**: Similar user preferences +- **Pattern Recognition**: Handwriting, image recognition +- **Anomaly Detection**: Identifying outliers +- **Medical Diagnosis**: Similar patient cases +- **Text Classification**: Document similarity + +## 📊 Key Parameters + +| Parameter | Description | Default | Recommendation | +|-----------|-------------|---------|----------------| +| **n_neighbors (k)** | Number of neighbors | 5 | 3-15 (odd numbers) | +| **weights** | Vote weighting | uniform | uniform/distance | +| **metric** | Distance measure | euclidean | euclidean/manhattan | + +## 💡 Choosing K Value + +- **Small k (3-5)**: More sensitive to noise, complex boundaries +- **Large k (10-20)**: Smoother boundaries, may miss patterns +- **Rule of thumb**: √n where n = number of samples +- **Use odd k**: Avoids tie votes in binary classification + +## 🐛 Common Issues + +### Slow Prediction +- Reduce training data size +- Use approximate methods +- Try other algorithms for large datasets + +### Poor Performance +- Scale features (very important for KNN!) +- Try different k values +- Check for irrelevant features + +--- + +**Last Updated**: October 13, 2025 +**Version**: 1.0 +**Author**: Akshit +**Hacktoberfest 2025 Contribution** 🎃 diff --git a/Docs/Readme.md b/Docs/Readme.md new file mode 100644 index 0000000..812566d --- /dev/null +++ b/Docs/Readme.md @@ -0,0 +1,44 @@ +# ML Simulator - Model Documentation + +Welcome to the ML Simulator documentation! This directory contains comprehensive guides for each machine learning model available in the simulator. + +## 📚 Available Models + +| Model | Type | Documentation | Use Case | +|-------|------|---------------|----------| +| [Logistic Regression](logistic_regression.md) | Classification | Binary classification | Disease prediction, spam detection | +| [Linear Regression](linear_regression.md) | Regression | Continuous prediction | Price prediction, trend analysis | +| [Decision Tree](decision_tree.md) | Classification/Regression | Tree-based decisions | Credit scoring, diagnosis | +| [Random Forest](random_forest.md) | Ensemble | Multiple trees | Complex classification tasks | +| [K-Nearest Neighbors](knn.md) | Classification/Regression | Instance-based | Pattern recognition | +| [Support Vector Machine](svm.md) | Classification | Maximum margin | Text classification, image recognition | + +## 🚀 Quick Start + +Each model documentation includes: +- ✅ **Overview**: What the model does and when to use it +- ✅ **How to Run**: Step-by-step instructions +- ✅ **Parameter Explanations**: What each setting means +- ✅ **Plot Interpretations**: Understanding visualizations +- ✅ **Performance Metrics**: Evaluating model quality +- ✅ **Troubleshooting**: Common issues and solutions +- ✅ **Examples**: Real-world use cases + +## 📖 How to Use This Documentation + +1. Select the model you want to learn about from the table above +2. 
Click on the documentation link +3. Follow the step-by-step guide +4. Review the screenshot examples +5. Apply to your own dataset + +## 🎯 Contributing + +Found an error or want to improve the documentation? See our [CONTRIBUTING.md](../CONTRIBUTING.md) for guidelines. + +--- + +**Last Updated**: October 13, 2025 +**Version**: 1.0 +**Author**: Akshit +**Hacktoberfest 2025 Contribution** 🎃 diff --git a/Docs/decision_tree.md b/Docs/decision_tree.md new file mode 100644 index 0000000..1a3629e --- /dev/null +++ b/Docs/decision_tree.md @@ -0,0 +1,184 @@ +# Decision Tree - Documentation + +## 📋 Overview + +Decision Tree is a supervised learning algorithm that creates a tree-like model of decisions. It splits data based on feature values to make predictions for both classification and regression tasks[web:100][web:102]. + +**Key Characteristics:** +- **Type**: Supervised Learning - Classification or Regression +- **Output**: Class label or continuous value +- **Algorithm**: Recursive splitting based on information gain +- **Best For**: Non-linear relationships, interpretable models + +## 🎯 Purpose and Use Cases + +### Primary Use +Creating interpretable models that make decisions through a series of yes/no questions. + +### Common Applications +- **Medical Diagnosis**: Decision pathways for treatment +- **Credit Approval**: Loan decision logic +- **Customer Segmentation**: Marketing strategy decisions +- **Fraud Detection**: Rule-based fraud identification +- **Product Recommendations**: Decision logic for suggestions + +## 🚀 How to Run + +### Step 1: Access the Model +1. Navigate to ML Simulator +2. Select **"Decision Tree"** from sidebar + +### Step 2: Choose Data Source +- Upload CSV or use sample dataset +- For classification: binary or multi-class target +- For regression: continuous target + +### Step 3: Configure Parameters + +| Parameter | Description | Default | Range | +|-----------|-------------|---------|-------| +| **Max Depth** | Maximum tree depth | 5 | 1-20 | +| **Min Samples Split** | Minimum samples to split | 2 | 2-20 | +| **Min Samples Leaf** | Minimum samples in leaf | 1 | 1-10 | +| **Criterion** | Splitting metric | gini/mse | gini/entropy | + +### Step 4: Train and Visualize +1. Configure parameters +2. Click **Train Model** +3. View tree structure and results + +## 📊 What Each Plot Shows + +### 1. Tree Visualization + +**What You See:** +Visual representation of the decision tree structure. + +**Components:** +- **Root node**: Top of tree (all data) +- **Internal nodes**: Decision points +- **Leaf nodes**: Final predictions +- **Branches**: Decision paths + +**How to Read:** +- Each node shows: + - Feature and threshold used for split + - Number of samples + - Class distribution or value +- Follow branches from top to bottom +- Leaf nodes contain predictions + +### 2. Feature Importance + +**What You See:** +Bar chart showing which features are most important[web:99][web:101]. + +**Interpretation:** +- Longer bars: More important for decisions +- Features at top of tree: Usually most important +- Zero importance: Feature not used + +### 3. Confusion Matrix (Classification) + +**Same as Logistic Regression** +Shows prediction accuracy breakdown. + +### 4. 
Performance Metrics + +**Classification:** +- Accuracy, Precision, Recall, F1-Score + +**Regression:** +- R², MSE, RMSE, MAE + +## 🔧 Model Parameters Explained + +### max_depth +**Purpose**: Limit tree depth to prevent overfitting +**Lower values**: Simpler, more general model +**Higher values**: More complex, may overfit +**Recommendation**: Start with 3-7 + +### min_samples_split +**Purpose**: Minimum samples required to split a node +**Lower values**: More splits, complex tree +**Higher values**: Fewer splits, simpler tree +**Recommendation**: 2-10 depending on data size + +### min_samples_leaf +**Purpose**: Minimum samples required in leaf node +**Effect**: Smooths model, prevents overfitting +**Recommendation**: 1-5 + +### criterion +**Classification:** +- **gini**: Gini impurity (default, faster) +- **entropy**: Information gain (more precise) + +**Regression:** +- **mse**: Mean squared error (default) +- **mae**: Mean absolute error (robust to outliers) + +## 💡 Tips and Best Practices + +### Advantages +✅ Easy to understand and interpret +✅ Handles non-linear relationships +✅ No feature scaling required +✅ Handles mixed data types +✅ Provides feature importance + +### Limitations +❌ Prone to overfitting +❌ Unstable (small data changes affect tree) +❌ Biased toward dominant classes +❌ Not optimal for linear relationships + +### Best Practices +- **Start shallow**: Begin with max_depth=3-5 +- **Prune the tree**: Use min_samples parameters +- **Cross-validate**: Check performance on multiple splits +- **Ensemble methods**: Consider Random Forest for better stability +- **Visualize tree**: Understand decision logic + +## 🐛 Troubleshooting + +### Issue: Perfect Training Accuracy, Poor Test Accuracy + +**Diagnosis:** Severe overfitting + +**Solutions:** +1. Reduce max_depth (try 3-7) +2. Increase min_samples_split (try 10-20) +3. Increase min_samples_leaf (try 5-10) +4. Use Random Forest instead + +### Issue: Tree Too Large to Visualize + +**Solutions:** +1. Reduce max_depth +2. Export tree to graphical format +3. Focus on top levels only + +### Issue: Low Accuracy + +**Solutions:** +1. Increase max_depth (try up to 15) +2. Check feature quality +3. Add more relevant features +4. Try ensemble methods + +## 📚 Additional Resources + +- [Scikit-learn Decision Trees](https://scikit-learn.org/stable/modules/tree.html) +- [Understanding Decision Trees](https://developers.google.com/machine-learning/decision-forests/decision-trees) +- [Tree Visualization Guide](https://mljar.com/blog/visualize-decision-tree/) + +## 🎯 Example Use Case + +### Scenario: Loan Approval System + +**Features:** +- income, credit_score, debt_ratio, employment_years + +**Tree might learn:** diff --git a/Docs/linear_regression.md b/Docs/linear_regression.md new file mode 100644 index 0000000..2672a11 --- /dev/null +++ b/Docs/linear_regression.md @@ -0,0 +1,207 @@ +# [Model Name] - Documentation + +## 📋 Overview + +Brief description of what this model does and its use cases. + +## 🎯 Purpose and Use Cases + +- **Primary Use**: [e.g., Binary classification, regression, clustering] +- **Common Applications**: + - Use case 1 + - Use case 2 + - Use case 3 + +## 🚀 How to Run + +### Step 1: Access the Model +Navigate to the [Model Name] page in the ML Simulator application. 
+ +### Step 2: Data Input +Choose one of the following options: +- **Upload CSV**: Upload your own dataset in CSV format +- **Use Sample Dataset**: Use the built-in sample dataset + +### Step 3: Configure Parameters + +| Parameter | Description | Default Value | Range/Options | +|-----------|-------------|---------------|---------------| +| Test Size | Percentage of data for testing | 20% | 10-50% | +| Feature Selection | Choose features for training | First 5 | All available | +| [Other params] | Description | Default | Options | + +### Step 4: Train the Model +Click the **Train Model** button to start training. + +## 📊 What Each Plot Shows + +### Training Results Dashboard +- **Accuracy Metric**: Shows the percentage of correct predictions +- **Training Samples**: Number of samples used for training +- **Test Samples**: Number of samples used for testing +- **Features Used**: Number of features selected for the model + +**Screenshot**: [Include screenshot here] + +**Interpretation**: Higher accuracy indicates better model performance. Aim for >80% for good results. + +--- + +### Predictions Table +- **Actual**: The true label from the dataset +- **Predicted**: The label predicted by the model +- **Probability**: Confidence score of the prediction (0-1) + +**Screenshot**: [Include screenshot here] + +**How to Read**: +- Probability close to 1 = high confidence in positive class +- Probability close to 0 = high confidence in negative class +- Probability around 0.5 = model is uncertain + +--- + +### Confusion Matrix +A heatmap showing the model's prediction accuracy across classes. + +**Screenshot**: [Include screenshot here] + +**Components**: +- **True Positives (TP)**: Correctly predicted positive cases +- **True Negatives (TN)**: Correctly predicted negative cases +- **False Positives (FP)**: Incorrectly predicted as positive +- **False Negatives (FN)**: Incorrectly predicted as negative + +**Interpretation**: +- Diagonal elements (TP, TN) should be high +- Off-diagonal elements (FP, FN) should be low + +--- + +### ROC Curve +Shows the trade-off between True Positive Rate and False Positive Rate. + +**Screenshot**: [Include screenshot here] + +**Components**: +- **Blue Line**: Your model's performance +- **Red Dashed Line**: Random classifier baseline +- **AUC Score**: Area Under the Curve (0-1) + +**Interpretation**: +- AUC = 1.0: Perfect classifier +- AUC > 0.8: Excellent model +- AUC > 0.7: Good model +- AUC = 0.5: No better than random guessing + +--- + +### Feature Importance +Bar chart showing which features have the most impact on predictions. + +**Screenshot**: [Include screenshot here] + +**How to Read**: +- Longer bars = more important features +- Positive values = increases probability of positive class +- Negative values = decreases probability of positive class + +## 🔧 Model Parameters Explained + +### Algorithm-Specific Parameters + +| Parameter | Description | When to Adjust | +|-----------|-------------|----------------| +| max_iter | Maximum iterations for training | Increase if model doesn't converge | +| C (regularization) | Controls model complexity | Lower for simpler models | +| solver | Optimization algorithm | Change based on dataset size | + +## 📈 Performance Metrics + +### Accuracy +Percentage of correct predictions out of total predictions. +- **Formula**: (TP + TN) / (TP + TN + FP + FN) +- **Good Range**: >70% + +### Precision +Of all positive predictions, how many were correct? 
+- **Formula**: TP / (TP + FP) +- **Use When**: False positives are costly + +### Recall (Sensitivity) +Of all actual positives, how many did we catch? +- **Formula**: TP / (TP + FN) +- **Use When**: False negatives are costly + +### F1-Score +Harmonic mean of precision and recall. +- **Formula**: 2 × (Precision × Recall) / (Precision + Recall) +- **Use When**: Need balance between precision and recall + +## 💡 Tips and Best Practices + +### Data Preparation +- ✅ Ensure your CSV has a clear binary target column (0/1) +- ✅ Remove or handle missing values before upload +- ✅ Normalize features if they have different scales +- ❌ Avoid datasets with too few samples (<100) + +### Feature Selection +- Select features that are relevant to your prediction task +- Avoid highly correlated features (redundant information) +- Start with 3-10 features for interpretability + +### Model Tuning +- Adjust test size based on dataset size (smaller datasets need smaller test size) +- If accuracy is low, try selecting different features +- Check for class imbalance in your target variable + +## 🐛 Troubleshooting + +### Issue: Low Accuracy (<60%) +**Solutions**: +- Check if features are relevant to the target +- Try different feature combinations +- Ensure data quality (no missing/corrupted values) +- Check for class imbalance + +### Issue: Model Takes Too Long to Train +**Solutions**: +- Reduce number of features +- Use smaller dataset for testing +- Check your data for unnecessary large values + +### Issue: Upload Error +**Solutions**: +- Ensure CSV format is correct +- Check for special characters in column names +- Verify file size is reasonable (<10MB) + +## 📚 Additional Resources + +- [Scikit-learn Documentation](https://scikit-learn.org/) +- [Understanding Logistic Regression](https://link-to-resource) +- [ROC Curves Explained](https://link-to-resource) + +## 🎯 Example Use Case + +**Scenario**: Predicting customer churn + +1. Upload customer data CSV with features like age, tenure, monthly charges +2. Select target column: 'churn' (0 = stayed, 1 = left) +3. Choose relevant features: tenure, monthly_charges, total_charges +4. Set test size to 20% +5. Train model and analyze results +6. Use confusion matrix to understand prediction errors +7. Check ROC curve to ensure AUC > 0.7 + +**Expected Results**: +- Accuracy: 75-85% +- AUC: 0.8-0.9 +- High precision on predicting churners + +--- + +**Last Updated**: October 2025 +**Version**: 1.0 +**Maintainer**: [Akshit] diff --git a/Docs/logistic_regression.md b/Docs/logistic_regression.md new file mode 100644 index 0000000..8e19d4e --- /dev/null +++ b/Docs/logistic_regression.md @@ -0,0 +1,78 @@ +# Logistic Regression - Documentation + +## 📋 Overview + +Logistic Regression is a statistical method for binary classification that predicts the probability of an outcome belonging to one of two classes (0 or 1). Despite its name, it's a classification algorithm, not a regression algorithm[web:102][web:103]. + +**Key Characteristics:** +- **Type**: Supervised Learning - Binary Classification +- **Output**: Probability score between 0 and 1 +- **Algorithm**: Uses sigmoid function to map predictions to probabilities +- **Best For**: Linearly separable data with binary outcomes + +## 🎯 Purpose and Use Cases + +### Primary Use +Binary classification problems where you need to predict one of two possible outcomes. 
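The workflow this guide walks through can be reproduced directly with scikit-learn. The snippet below is a minimal sketch, not the simulator's own code; it assumes the built-in Breast Cancer sample dataset and the defaults documented later in this guide (20% test split, `max_iter=1000`, scaled features).

```python
# Minimal sketch of the documented workflow (assumes the sample Breast Cancer dataset
# and the guide's defaults: 20% test split, max_iter=1000, standardized features).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)            # 569 samples, 30 features, binary target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()                              # scale features before fitting
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]              # probability of the positive class
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

The values returned by `predict_proba` correspond to the **Probability** column shown in the simulator's predictions table.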
+ +### Common Applications +- **Medical Diagnosis**: Disease prediction (positive/negative) +- **Spam Detection**: Email classification (spam/not spam) +- **Customer Churn**: Will customer leave? (yes/no) +- **Credit Scoring**: Loan approval (approve/reject) +- **Marketing**: Click prediction (will click/won't click) + +## 🚀 How to Run + +### Step 1: Access the Model +1. Navigate to the ML Simulator application +2. Open the sidebar menu +3. Select **"Logistic Regression"** from the available models + +### Step 2: Choose Data Source +You have two options for providing data: + +**Option A: Upload CSV File** +- Click "Upload CSV" in the sidebar +- Select your CSV file (must contain binary target column with 0/1 values) +- Ensure your data has: + - At least 100 rows + - Numerical features + - A binary target column (0 or 1) + +**Option B: Use Sample Dataset** +- Select "Use Sample Dataset" radio button +- The Breast Cancer dataset will be loaded automatically +- Contains 569 samples with 30 features + +### Step 3: Configure Parameters + +| Parameter | Description | Default Value | Recommended Range | +|-----------|-------------|---------------|-------------------| +| **Target Column** | Column to predict (must be 0/1) | First binary column | Any binary column | +| **Test Size** | Percentage of data for testing | 20% | 10-30% | +| **Feature Selection** | Choose features for training | First 5 features | 3-10 features | +| **max_iter** | Maximum training iterations | 1000 | 500-2000 | + +### Step 4: Train the Model +1. Select your target column from the dropdown +2. Choose features you want to use for prediction +3. Adjust test size slider if needed +4. Click the **🚀 Train Model** button +5. Wait for training to complete (usually 1-5 seconds) + +## 📊 What Each Plot Shows + +### 1. Training Results Dashboard + +**What You See:** +Four gradient-colored metric cards displaying key performance indicators[web:99][web:102]. + +**Components:** +- **Accuracy**: Overall percentage of correct predictions +- **Training Samples**: Number of data points used for training +- **Test Samples**: Number of data points used for testing +- **Features Used**: Number of features selected for the model + +**How to Interpret:** +- diff --git a/Docs/random_forest.md b/Docs/random_forest.md new file mode 100644 index 0000000..c69a1ec --- /dev/null +++ b/Docs/random_forest.md @@ -0,0 +1,59 @@ +# Random Forest - Documentation + +## 📋 Overview + +Random Forest is an ensemble learning method that combines multiple decision trees to make more accurate and stable predictions[web:100][web:102]. 
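As a quick, hedged illustration of the idea (not the simulator's own code), the sketch below fits a forest with the default number of trees listed in the parameter table further down (`n_estimators=100`) and compares it to a single decision tree; the dataset choice is an assumption made only for demonstration.

```python
# Illustrative sketch (not the simulator's code): a forest of 100 trees, the documented default,
# compared against a single decision tree; the dataset choice is an assumption.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("Single tree accuracy:", tree.score(X_test, y_test))
print("Random forest accuracy:", forest.score(X_test, y_test))
```

Averaging many randomized trees is what gives the forest the more stable predictions described below.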
+ +**Key Characteristics:** +- **Type**: Ensemble - Classification/Regression +- **Algorithm**: Bagging + Random feature selection +- **Output**: Averaged predictions from multiple trees +- **Best For**: Complex patterns, high-dimensional data + +## 🎯 Purpose and Use Cases + +- **Credit Risk Assessment**: More robust than single tree +- **Disease Diagnosis**: Reduces false positives/negatives +- **Image Classification**: Feature extraction +- **Stock Market Prediction**: Complex patterns +- **Customer Churn**: Better generalization + +## 🚀 How to Run + +[Follow same structure as previous models] + +## 📊 Key Parameters + +| Parameter | Description | Default | Recommendation | +|-----------|-------------|---------|----------------| +| **n_estimators** | Number of trees | 100 | 50-500 | +| **max_depth** | Depth per tree | None | 10-30 | +| **min_samples_split** | Samples to split | 2 | 2-10 | +| **max_features** | Features per split | sqrt | sqrt/log2 | + +## 💡 Advantages Over Single Decision Tree + +✅ Reduces overfitting +✅ More stable predictions +✅ Better accuracy +✅ Handles missing values better +✅ Less sensitive to outliers + +## 🐛 Troubleshooting + +### Slow Training +- Reduce n_estimators +- Reduce max_depth +- Use smaller dataset for testing + +### Still Overfitting +- Reduce max_depth +- Increase min_samples_split +- Reduce max_features + +--- + +**Last Updated**: October 13, 2025 +**Version**: 1.0 +**Author**: Akshit +**Hacktoberfest 2025 Contribution** 🎃 diff --git a/Docs/svm.md b/Docs/svm.md new file mode 100644 index 0000000..bd27e3a --- /dev/null +++ b/Docs/svm.md @@ -0,0 +1,66 @@ +# Support Vector Machine (SVM) - Documentation + +## 📋 Overview + +SVM finds the optimal hyperplane that maximally separates different classes in the feature space[web:100][web:102]. + +**Key Characteristics:** +- **Type**: Supervised Learning - Classification +- **Algorithm**: Maximum margin classifier +- **Output**: Class label +- **Best For**: High-dimensional data, clear margins + +## 🎯 Purpose and Use Cases + +- **Text Classification**: Spam detection, sentiment analysis +- **Image Recognition**: Face detection, object classification +- **Bioinformatics**: Protein classification, gene expression +- **Financial**: Stock trend prediction +- **Medical**: Disease classification + +## 📊 Key Parameters + +| Parameter | Description | Default | Recommendation | +|-----------|-------------|---------|----------------| +| **C** | Regularization | 1.0 | 0.1-100 | +| **kernel** | Kernel type | rbf | linear/rbf/poly | +| **gamma** | Kernel coefficient | scale | scale/auto | + +## 💡 Kernel Selection + +- **linear**: Linearly separable data, large features +- **rbf** (radial basis function): Default, most cases +- **poly** (polynomial): Specific polynomial relationships +- **sigmoid**: Neural network-like behavior + +## 🔧 Parameter Tuning + +### C (Regularization) +- **Low C**: Wider margin, more errors (underfitting) +- **High C**: Narrow margin, fewer errors (overfitting) +- **Start with**: 1.0, then try 0.1, 10, 100 + +### Gamma (RBF kernel) +- **Low gamma**: Far-reaching influence, smooth decision boundary +- **High gamma**: Close influence, complex decision boundary +- **Use**: 'scale' (default) or 'auto' + +## 🐛 Troubleshooting + +### Slow Training +- Use linear kernel for large datasets +- Reduce training data +- Scale features first + +### Poor Performance +- Try different kernels +- Tune C and gamma +- Scale features (mandatory for SVM!) 
+- Check if data is separable + +--- + +**Last Updated**: October 13, 2025 +**Version**: 1.0 +**Author**: Akshit +**Hacktoberfest 2025 Contribution** 🎃 diff --git a/pages/Linear_Regression.py b/pages/Linear_Regression.py index 38e6e94..b60335f 100644 --- a/pages/Linear_Regression.py +++ b/pages/Linear_Regression.py @@ -1,22 +1,380 @@ +# pages/Logistic_Regression.py import streamlit as st +import pandas as pd import numpy as np -from sklearn.linear_model import LinearRegression -from utils.plot_helpers import plot_regression_line +import matplotlib.pyplot as plt +import seaborn as sns +from sklearn.model_selection import train_test_split +from sklearn.linear_model import LogisticRegression +from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc, accuracy_score +from sklearn.preprocessing import StandardScaler +import plotly.graph_objects as go +import plotly.express as px +from io import StringIO -st.header("📈 Linear Regression Simulator") +# Page configuration +st.set_page_config(page_title="Logistic Regression Simulator", layout="wide", page_icon="📊") -# Sample data -X = np.array([[1], [2], [3], [4], [5]]) -y = np.array([2, 4, 5, 4, 5]) +# Custom CSS for better styling +st.markdown(""" + +""", unsafe_allow_html=True) -# Train model -model = LinearRegression() -model.fit(X, y) +# Header +st.markdown('
<div class="main-header">📊 Logistic Regression Simulator</div>', unsafe_allow_html=True)  # header markup reconstructed; the class name is an assumption

# Intro box (HTML wrapper reconstructed; the class name is an assumption)
st.markdown("""
<div class="info-box">
Logistic Regression is a statistical method for binary classification that predicts the probability
of an outcome belonging to a particular class. It's widely used in medical diagnosis, credit scoring,
and spam detection.
</div>
""", unsafe_allow_html=True)

st.markdown('<h2 class="sub-header">📁 Dataset Overview</h2>
', unsafe_allow_html=True) + + col1, col2, col3 = st.columns(3) + with col1: + st.metric("Total Rows", df.shape[0]) + with col2: + st.metric("Total Columns", df.shape[1]) + with col3: + st.metric("Missing Values", df.isnull().sum().sum()) + + with st.expander("👀 View Dataset"): + st.dataframe(df.head(10), use_container_width=True) + + # Feature selection + st.markdown('🎯 Model Configuration
', unsafe_allow_html=True) + + col1, col2 = st.columns([2, 1]) + + with col1: + target_column = st.selectbox("Select Target Column (0/1):", df.columns) + + with col2: + test_size = st.slider("Test Size (%)", 10, 50, 20) / 100 + + # Select features + available_features = [col for col in df.columns if col != target_column] + selected_features = st.multiselect( + "Select Features for Training:", + available_features, + default=available_features[:min(5, len(available_features))] + ) + + if len(selected_features) > 0 and st.button("🚀 Train Model"): + # Prepare data + X = df[selected_features] + y = df[target_column] + + # Handle missing values + X = X.fillna(X.mean()) + + # Split data + X_train, X_test, y_train, y_test = train_test_split( + X, y, test_size=test_size, random_state=42 + ) + + # Scale features + scaler = StandardScaler() + X_train_scaled = scaler.fit_transform(X_train) + X_test_scaled = scaler.transform(X_test) + + # Train model + with st.spinner('🔄 Training model...'): + model = LogisticRegression(max_iter=1000, random_state=42) + model.fit(X_train_scaled, y_train) + + # Predictions + y_pred = model.predict(X_test_scaled) + y_pred_proba = model.predict_proba(X_test_scaled)[:, 1] + + # Store in session state + st.session_state['model'] = model + st.session_state['scaler'] = scaler + st.session_state['features'] = selected_features + + st.success("✅ Model trained successfully!") + + # ==================== TRAINING RESULTS ==================== + st.markdown('📈 Training Results
', unsafe_allow_html=True) + + col1, col2, col3, col4 = st.columns(4) + + accuracy = accuracy_score(y_test, y_pred) + + with col1: + st.markdown(f""" +🔮 Predictions
', unsafe_allow_html=True) + + predictions_df = pd.DataFrame({ + 'Actual': y_test.values, + 'Predicted': y_pred, + 'Probability': y_pred_proba + }) + + col1, col2 = st.columns([1, 1]) + + with col1: + st.write("**Sample Predictions:**") + st.dataframe(predictions_df.head(10), use_container_width=True) + + with col2: + # Prediction distribution + fig_pred = px.histogram( + predictions_df, + x='Probability', + color='Actual', + nbins=30, + title='Prediction Probability Distribution', + labels={'Probability': 'Predicted Probability', 'count': 'Frequency'}, + color_discrete_map={0: '#ff7675', 1: '#74b9ff'} + ) + fig_pred.update_layout(height=400) + st.plotly_chart(fig_pred, use_container_width=True) + + # ==================== CONFUSION MATRIX ==================== + st.markdown('🎯 Confusion Matrix
', unsafe_allow_html=True) + + col1, col2 = st.columns([1, 1]) + + with col1: + # Create confusion matrix + cm = confusion_matrix(y_test, y_pred) + + # Plot using plotly for better interactivity + fig_cm = go.Figure(data=go.Heatmap( + z=cm, + x=['Predicted 0', 'Predicted 1'], + y=['Actual 0', 'Actual 1'], + text=cm, + texttemplate='%{text}', + textfont={"size": 20}, + colorscale='Blues', + showscale=True + )) + + fig_cm.update_layout( + title='Confusion Matrix', + xaxis_title='Predicted Label', + yaxis_title='True Label', + height=400 + ) + + st.plotly_chart(fig_cm, use_container_width=True) + + with col2: + # Classification report + st.write("**Classification Report:**") + report = classification_report(y_test, y_pred, output_dict=True) + report_df = pd.DataFrame(report).transpose() + st.dataframe(report_df.style.background_gradient(cmap='RdYlGn', subset=['precision', 'recall', 'f1-score']), + use_container_width=True) + + # ==================== ROC CURVE ==================== + st.markdown('📉 ROC Curve
', unsafe_allow_html=True) + + # Calculate ROC curve + fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba) + roc_auc = auc(fpr, tpr) + + col1, col2 = st.columns([2, 1]) + + with col1: + # Plot ROC curve + fig_roc = go.Figure() + + fig_roc.add_trace(go.Scatter( + x=fpr, y=tpr, + mode='lines', + name=f'ROC Curve (AUC = {roc_auc:.3f})', + line=dict(color='#0984e3', width=3) + )) + + fig_roc.add_trace(go.Scatter( + x=[0, 1], y=[0, 1], + mode='lines', + name='Random Classifier', + line=dict(color='#d63031', width=2, dash='dash') + )) + + fig_roc.update_layout( + title='Receiver Operating Characteristic (ROC) Curve', + xaxis_title='False Positive Rate', + yaxis_title='True Positive Rate', + height=500, + hovermode='x', + legend=dict(x=0.6, y=0.1) + ) + + fig_roc.update_xaxes(range=[0, 1]) + fig_roc.update_yaxes(range=[0, 1]) + + st.plotly_chart(fig_roc, use_container_width=True) + + with col2: + st.markdown(f""" +⭐ Feature Importance
', unsafe_allow_html=True) + + feature_importance = pd.DataFrame({ + 'Feature': selected_features, + 'Coefficient': model.coef_[0] + }).sort_values('Coefficient', key=abs, ascending=False) + + fig_importance = px.bar( + feature_importance, + x='Coefficient', + y='Feature', + orientation='h', + title='Feature Coefficients', + color='Coefficient', + color_continuous_scale='RdBu_r' + ) + fig_importance.update_layout(height=max(300, len(selected_features) * 30)) + st.plotly_chart(fig_importance, use_container_width=True) + +else: + st.info("👆 Please upload a dataset or select the sample dataset to get started!") + + st.markdown(""" + ### 📋 Instructions: + 1. Choose a data source from the sidebar (Upload CSV or use sample dataset) + 2. Select your target column (binary: 0/1) + 3. Choose features for training + 4. Adjust the test size if needed + 5. Click **Train Model** to see results + + ### ✨ Features: + - 📊 Interactive confusion matrix + - 📈 ROC curve with AUC score + - 🎯 Detailed predictions with probabilities + - ⭐ Feature importance visualization + - 📉 Model performance metrics + """) + +# Footer +st.markdown("---") +st.markdown(""" +🎃 Hacktoberfest Contribution | Built with Streamlit & Scikit-learn
+