Implementing Logistic Regression from Scratch
In statistics and machine learning, classification is a type of supervised learning: training data with known class labels are used to develop a classification rule for assigning new, unlabeled data to one of the classes. A special case of this task is binary classification, which involves only two classes. Some examples:
- Classifying an email as *spam* or *non-spam*
- Classifying a tumor as *benign* or *malignant*
The algorithms that sort unlabeled data into labeled classes are called classifiers. Loosely speaking, the sorting hat from Hogwarts can be thought of as a classifier that sorts incoming students into four distinct houses. In real life, some common classifiers are logistic regression, k-nearest neighbors, decision tree, random forest, support vector machine, naive Bayes, linear discriminant analysis, stochastic gradient descent, XGBoost, AdaBoost and neural networks.
Many advanced libraries, such as scikit-learn, let us train various models on labeled training data and predict on unlabeled test data with a few lines of code. While this is very convenient for day-to-day practice, it gives little insight into what really happens under the hood when we run that code. In the present notebook, we implement a logistic regression model from scratch, without using any advanced library, to understand how it works in the context of binary classification. The basic idea is to break the computation into pieces and write a function for each piece in sequence, so that later functions build on the ones defined before them. Wherever applicable, we complement a function written with for loops with a much faster vectorized implementation of the same computation, as illustrated in the sketch below.
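As a minimal illustration of what "complementing a loop with a vectorized implementation" means, the sketch below contrasts two ways of computing the linear combination w·x + b for every row of a data matrix. The function names `weighted_sum_loop` and `weighted_sum_vectorized` are hypothetical and chosen only for this example; the functions built later in the notebook may be named and structured differently.

```python
import numpy as np

def weighted_sum_loop(X, w, b):
    # Hypothetical example: compute w·x + b for each row of X with explicit loops.
    m, n = X.shape
    z = np.zeros(m)
    for i in range(m):
        total = 0.0
        for j in range(n):
            total += X[i, j] * w[j]
        z[i] = total + b
    return z

def weighted_sum_vectorized(X, w, b):
    # Same computation expressed as a single matrix-vector product.
    return X @ w + b

# Quick check that both versions agree on small random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
w = rng.normal(size=3)
b = 0.5
assert np.allclose(weighted_sum_loop(X, w, b), weighted_sum_vectorized(X, w, b))
```

The two functions return the same result, but the vectorized version hands the inner loops to NumPy's compiled routines, which is typically much faster on larger datasets.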