A collection of projects covering tasks such as classification, time series forecasting and regression on a number of datasets, using machine learning algorithms such as random forest, SVM, Naive Bayes, ensembles and the perceptron, in addition to data cleaning and preparation.

AlrikF/Data-science-statistical-modelling-projects

Collection of Data Science Notebooks

Exploratory analysis and data curation, coupled with the use of different machine learning algorithms on a number of datasets.

1) Housing

2) Power plant data: (Regression)

Power Plant Dataset :: The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. The features are the hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V), which are used to predict the net hourly electrical energy output (EP) of the plant. A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another.

The notebook uses EDA to visualize and analyze the data and to find the significant features. The algorithms applied to the data are Linear Regression, Multiple Regression and KNN, followed by a comparative analysis.
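As a rough illustration of the kind of comparison described above, the sketch below fits a one-feature linear regression and a KNN regressor and compares their test RMSE. It is not the notebook's code: the data is synthetic (a made-up linear relation between temperature and output plus noise), and the helper names `fit_simple_linear`, `knn_predict` and `rmse` are hypothetical.

```python
import math
import random


def fit_simple_linear(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def knn_predict(train_x, train_y, x, k=5):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


# Synthetic stand-in data, NOT the real power plant measurements (assumption).
rng = random.Random(0)
temps = [rng.uniform(2.0, 37.0) for _ in range(300)]
output = [495.0 - 2.0 * t + rng.gauss(0.0, 4.0) for t in temps]

# Simple hold-out split: first 240 points for training, last 60 for testing.
train_x, test_x = temps[:240], temps[240:]
train_y, test_y = output[:240], output[240:]

slope, intercept = fit_simple_linear(train_x, train_y)
linear_pred = [slope * x + intercept for x in test_x]
knn_pred = [knn_predict(train_x, train_y, x, k=5) for x in test_x]

linear_rmse = rmse(test_y, linear_pred)
knn_rmse = rmse(test_y, knn_pred)
print(f"linear RMSE: {linear_rmse:.2f}, KNN RMSE: {knn_rmse:.2f}")
```

Because the synthetic relation is linear by construction, the linear model should do at least as well as KNN here; on real data the ranking can differ, which is the point of a comparative analysis.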

3) Random Forest and Trees (Classification)

APS Failure Dataset :: The dataset's positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS.

The notebook covers data analysis and data preparation: exploring different methods for imputing missing data, checking measures of central tendency and dispersion, and checking for class imbalance and outliers. Several classification algorithms were tried, such as Random Forest and XGBoost, along with SMOTE to resample the data and tackle class imbalance.
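The core idea behind SMOTE can be sketched in a few lines: each synthetic minority sample is placed at a random point on the segment between a real minority sample and one of its nearest minority neighbours. The function below is a simplified illustration of that idea only, not the notebook's code and not a library implementation (in practice a maintained package such as imbalanced-learn would normally be used). The name `smote_like_oversample` and the toy points are hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical toy data: 4 minority samples against 10 majority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority_count = 10

new_points = smote_like_oversample(minority, n_new=majority_count - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority))  # 10
```

Resampling of this kind is usually applied to the training split only, so that the evaluation data keeps the original class proportions.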

4) Urinary Tract Infection Diagnosis and Crime Rate in Communities datasets (Using Decision Trees and Regularisation)

Acute Inflammations Dataset :: The main idea of this dataset is to prepare the algorithm of an expert system that performs the presumptive diagnosis of two diseases of the urinary system: acute inflammation of the urinary bladder and acute nephritis.

Communities and Crime Dataset :: The data covers communities within the United States and combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR.

NOTEBOOK Part 1 (For Urinary tract infection diagnosis) ::

The notebook consists of EDA as well as decision trees that split on the features based on Gini values, with cost complexity pruning used to find a decision tree that is highly interpretable.
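To make the Gini-based splitting concrete, the sketch below computes the Gini impurity of a label set and searches for the threshold on one numeric feature that minimises the weighted impurity of the two child nodes. It is a minimal illustration, not the notebook's code: the function names and the toy temperature values are hypothetical, and pruning is not covered.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(rows, labels, feature, threshold):
    """Weighted Gini impurity of the two children of a threshold split."""
    left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
    right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(rows, labels, feature):
    """Try midpoints between consecutive distinct values; keep the purest split."""
    values = sorted({row[feature] for row in rows})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: split_gini(rows, labels, feature, t))


# Hypothetical toy data (made-up patient temperatures in degrees Celsius).
rows = [
    {"temperature": 36.0}, {"temperature": 36.6}, {"temperature": 37.4},
    {"temperature": 38.5}, {"temperature": 40.1}, {"temperature": 41.0},
]
labels = ["no", "no", "no", "yes", "yes", "yes"]

threshold = best_threshold(rows, labels, "temperature")
print(threshold, split_gini(rows, labels, "temperature", threshold))
```

In this toy example the classes are perfectly separable, so the chosen split has zero weighted impurity, compared with an impurity of 0.5 for the unsplit labels.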

NOTEBOOK Part 2 (For Crime and communities regression) ::

Performed EDA and a comparative analysis of Linear Regression, Ridge Regression (L2), Lasso Regression (L1), Principal Component Regression and Boosting on the data.
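The practical difference between the two penalties is easiest to see in the special case of an orthonormal design (an assumption made here purely for illustration, with the objectives ½‖y − Xβ‖² + (λ/2)‖β‖² for ridge and ½‖y − Xβ‖² + λ‖β‖₁ for lasso). In that case each penalised coefficient is a simple function of the least-squares coefficient: ridge (L2) rescales it, while lasso (L1) soft-thresholds it. The coefficient values below are made up, and the function names are hypothetical.

```python
def ridge_shrink(ols_coef, lam):
    """Ridge (L2): scale the coefficient towards zero by 1 / (1 + lam)."""
    return ols_coef / (1.0 + lam)


def lasso_shrink(ols_coef, lam):
    """Lasso (L1): soft-threshold, so small coefficients become exactly zero."""
    if ols_coef > lam:
        return ols_coef - lam
    if ols_coef < -lam:
        return ols_coef + lam
    return 0.0


# Made-up least-squares coefficients, for illustration only.
ols_coefs = [3.0, -1.5, 0.4, -0.2]
lam = 0.5

ridge = [ridge_shrink(b, lam) for b in ols_coefs]
lasso = [lasso_shrink(b, lam) for b in ols_coefs]

print(ridge)  # every coefficient shrinks but stays non-zero
print(lasso)  # [2.5, -1.0, 0.0, 0.0]
```

This is why lasso is often used for feature selection, while ridge keeps all features with reduced weights. With correlated features, as in real data, the closed forms above no longer apply and the coefficients must be found numerically.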
