Password Strength Classifier: Is it really safe?

Password strength measures how well a password can withstand being guessed or hacked. External attacks can come in the form of password cracking or brute-force attacks, which aim to gain unauthorized access to a computer system or network.
The strength of a password is typically measured by its complexity, length, and unpredictability. Unfortunately, cybercrimes and data breaches have been increasing, and a big reason for this is that passwords are often weak and easily compromised.
Please visit the project’s GitHub repository to view the full project.
Objective
This project focuses on building a password strength classification model with high accuracy using Supervised Machine Learning. The model will output the strength of a password on a scale from 0 (lowest strength) to 2 (highest strength).
Data
The dataset is downloadable for free from Kaggle and contains passwords from the 000webhost leak that is available online. The dataset has been pre-classified by Georgia Tech University’s PARS tool, which rates passwords as weak, medium, or strong. There are 700k records in the dataset.
Below is a view of the dataset.
Data Wrangling
Cleaning
Missing values make the dataset incomplete, which leads to inaccurate results. Similarly, duplicates can skew the analysis. The dataset was therefore checked for any missing or duplicate values.
There was one missing value in the `password` column and no duplicates.
It is obvious that we cannot predict the strength of a password without having a password. Therefore, I dropped the missing observation.
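Below is a minimal sketch of this cleaning step, assuming the data is loaded into a pandas DataFrame `df` with `password` and `strength` columns (the file name and read options are illustrative):

```python
import pandas as pd

# File name is hypothetical; the Kaggle CSV may be named differently.
# on_bad_lines="skip" guards against malformed rows (e.g. commas in passwords).
df = pd.read_csv("data.csv", on_bad_lines="skip")

print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of exact duplicate rows

# Drop the single observation with a missing password.
df = df.dropna(subset=["password"]).reset_index(drop=True)
```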
Dealing with Class Imbalance
In order to return an unbiased classification, it is important for the training data to be balanced prior to model training. Upon initial inspection, it was noticed that our dataset was heavily biased towards the medium (1) strength class.
The below graph shows the heavy imbalance of the dataset.
We can either upsample the minority classes or downsample the majority class to get an even, unbiased distribution. One can argue that downsampling is not the best option, as it removes information from the dataset. However, I downsampled the majority class using the `imblearn` Python package, as that still left me with fewer, but more than enough, good observations for the analysis.
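A minimal sketch of the down-sampling step; the write-up doesn’t name the exact sampler, so `RandomUnderSampler` is shown as one reasonable choice:

```python
from imblearn.under_sampling import RandomUnderSampler

# Column names ("password", "strength") are assumed from the description.
# The sampler expects a 2-D feature array, so keep a one-column frame.
X = df[["password"]]
y = df["strength"]

rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print(y_res.value_counts())  # all three classes now have equal counts
```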
This is what the balanced distribution looks like after down-sampling.
The dataset now has 255,975 rows and 2 columns.
Preprocessing
First, I converted the dataframe to an array, as it is much easier to work with. In order to apply ML algorithms, we first need to tokenize the data. Tokenization simply breaks text down into smaller units, called tokens; here, each password is broken into its individual characters. I created a custom function to split each password into character tokens, which were then used as inputs for the model. This introduced considerably high dimensionality to our model. The dataset now has 255,975 rows and 144 columns.
Machine learning models do not comprehend text. We therefore need to further convert the tokens to numeric data. Scikit-learn’s `TfidfVectorizer` was used to achieve this, creating vectors based on the frequency of occurrence of each token.
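A sketch of the tokenization and vectorization, assuming the custom tokenizer simply splits each password into its individual characters (consistent with the 144 resulting columns):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def char_tokens(password):
    # Split a password into single-character tokens, e.g. "ab1" -> ["a", "b", "1"]
    return list(password)

# lowercase=False keeps case information, which matters for password strength.
vectorizer = TfidfVectorizer(tokenizer=char_tokens, lowercase=False)
X = vectorizer.fit_transform(X_res["password"].values)
print(X.shape)  # (255975, 144) for this dataset
```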
Model Development
Logistic Regression Model
After splitting the dataset into 80% train and 20% test sets, the first algorithm I tried was Logistic Regression, as it is always a good baseline to start with. An 80% accuracy score was observed for both train and test sets, with no evidence of overfitting. The model was further evaluated using a confusion matrix and a classification report.
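A sketch of the split and baseline fit, carrying over the names from the preprocessing sketches (`X` is the TF-IDF matrix, `y_res` the balanced labels; `random_state` and `stratify` are illustrative choices):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y_res, test_size=0.2, random_state=42, stratify=y_res
)

# max_iter raised in case the default (100) does not converge on TF-IDF features.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print("train accuracy:", logreg.score(X_train, y_train))
print("test accuracy:", logreg.score(X_test, y_test))
```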
A confusion matrix, also known as an error matrix, is a summarized table used to assess the performance of a classification model. The number of correct and incorrect predictions are summarized with count values and broken down by each class.
The classification report records three important metrics to evaluate the classification models on.
- Precision: the percentage of correct positive predictions relative to total positive predictions.
- Recall: the percentage of correct positive predictions relative to total actual positives.
- F1 Score: a weighted harmonic mean of precision and recall. The closer to 1, the better the model.
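Both the confusion matrix and the classification report can be produced directly with scikit-learn, for example:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = logreg.predict(X_test)
print(confusion_matrix(y_test, y_pred))       # rows: actual class, columns: predicted class
print(classification_report(y_test, y_pred))  # precision, recall, and F1 per class
```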
The below graph shows the classification report and the confusion matrix for the Logistic Regression.
To interpret the results, let’s look at the medium strength class (1).
- Precision: Out of all the passwords the model predicted to be medium in strength, only 73% were truly medium in strength.
- Recall: Out of all the passwords that were truly medium in strength, the model only predicted 67% correctly as medium in strength.
- The F1 Score is 0.7, which isn’t very close to 1. It tells us that the model performs poorly in predicting the medium strength class.
Random Forest Model
The second algorithm I tried was Random Forest, a tree-based ensemble classifier that naturally performs well.
The default model gave a 99.9% accuracy on the train set and 95.1% on the test set. Due to the discrepancy between the accuracy scores, I concluded that the model was overfitting to the train data. Hyperparameter optimization was performed using grid search cross-validation to overcome this.
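A minimal sketch of the baseline fit and the train/test gap check described above, using default hyperparameters:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

# A large gap between these two scores is the overfitting signal.
print("train accuracy:", rf.score(X_train, y_train))  # ~99.9% reported here
print("test accuracy:", rf.score(X_test, y_test))     # ~95.1% reported here
```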
It took 2 hours and 43 minutes to complete the grid search for the chosen hyperparameter ranges. Yet the overfitting was still present, with only a slight improvement in test accuracy. Further improvement is possible through searching wider ranges and using other tuning methods, such as random search cross-validation or Bayesian optimization. However, that can be a computationally exhaustive task.
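A sketch of the grid search; the actual hyperparameter ranges aren’t given in the write-up, so the grid below is purely illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid; the real search space was likely different.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 20, 40],
    "min_samples_leaf": [1, 5],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid,
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
best_rf = grid.best_estimator_
```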
The below graph shows the classification report and the confusion matrix for the Random Forest.
Let’s again use the medium strength class (1) to interpret our final Random Forest model.
- Precision: Out of all the passwords that the model predicted to be medium in strength, only 95% were truly medium in strength.
- Recall: Out of all the passwords that were truly medium in strength, the model only predicted 91% correctly as medium in strength.
- F1 Score is 0.93, which is closer to 1 and therefore indicates a good predictor.
Further Evaluating the Random Forest Using the Receiver Operating Characteristic (ROC) Curve
The ROC curve is an evaluation tool for classification problems, available in the sklearn library, that plots the true positive rate against the false positive rate at various threshold values. The Area Under the Curve (AUC) measures the ability of a classifier to distinguish between classes. The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes.
The best model will never make a misclassification, so it will always have a true positive rate of 1.0 and a false positive rate of 0. This means that its ROC curve is a flat horizontal line at 1, and the area under this curve is 1.
From the above graph, we have near-perfect AUC (area = 1) for all three classes. Therefore, our model is doing extremely well in distinguishing between positive and negative classes.
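For reference, the per-class curves can be produced with a one-vs-rest binarization of the labels, for example:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

# One-vs-rest: treat each strength class in turn as the positive class.
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
y_score = best_rf.predict_proba(X_test)  # fitted model from the grid search

for i in range(3):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    plt.plot(fpr, tpr, label=f"class {i} (AUC = {auc(fpr, tpr):.2f})")

plt.plot([0, 1], [0, 1], "k--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```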
Conclusion
We trained two password-strength classifier models using machine learning. The Random Forest is capable of classifying password strength across all classes with a high F1-score. This model can now be deployed and used in real applications. Social media companies and other web platforms could incorporate it as a password-strength suggestion feature when users sign up for new accounts. It could also power an automatic password generator that produces stronger, more secure passwords. And, if deployed as a public web app, the public could use it to check various password combinations.
Future improvement
The project can be further improved by adding more observations and retraining with more advanced ML and DL algorithms, as well as by fine-tuning the hyperparameters further.
- Python
- Classification
- Data Visualization
- Supervised Learning
- Natural Language Processing
- Multi Class