Customer Segmentation using supervised and unsupervised learning

Ricardo Rosas
6 min read · Nov 17, 2019

In this post I will walk through a project I have been working on over the past few days as part of my Data Science Nanodegree from Udacity.

The question that I tackled was: How can a mail-order company acquire new clients more effectively?

In order to tackle this question I applied both supervised and unsupervised learning. Using unsupervised learning, I created 15 clusters of the population and determined in which of those clusters the customers are over- or under-represented. This clustering can help the company target more effectively or discover segments of untapped potential. Using data on past responses, I then trained supervised machine learning algorithms to better predict who is likely to respond.

In this post I will go over a description of the data, the unsupervised learning approach, and the supervised machine learning models.

Understanding the data

The data provided by the company shows:

  • Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891,211 persons (rows) x 366 features (columns).
  • Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191,652 persons (rows) x 369 features (columns).
  • Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42,982 persons (rows) x 367 features (columns).
  • Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42,833 persons (rows) x 366 features (columns).

Each row of the demographics files represents a single person, but also includes information beyond the individual, such as details about their household, building, and neighborhood.

Although this seems like a data paradise, some cleaning was needed. In fact, roughly 200,000 values were NA in the AZDIAS data set, and more than 6% of all rows had over 50% missing values.

In the end, I chose to remove more than 40 columns, either because of their high number of missing values or because they posed challenges for the machine learning algorithms.
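As a rough sketch of this cleaning step, the high-missing columns and rows can be identified and dropped with pandas. The file path, the separator, and the 30% / 50% thresholds below are assumptions for illustration, not necessarily the exact values used in the project.

```python
import pandas as pd

# Load the general-population demographics file (semicolon separator is an assumption).
azdias = pd.read_csv("Udacity_AZDIAS_052018.csv", sep=";")

# Share of missing values per column and per row
col_missing = azdias.isnull().mean()
row_missing = azdias.isnull().mean(axis=1)

# Drop columns with more than 30% missing values (assumed threshold)
# and rows that are more than 50% empty, as described above.
keep_cols = col_missing[col_missing <= 0.30].index
azdias_clean = azdias.loc[row_missing <= 0.50, keep_cols]
```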

Principal Component Analysis

According to [Wikipedia](https://en.wikipedia.org/wiki/Principal_component_analysis):

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components

In plain English, I used this method to reduce the number of columns / dimensions in my analysis. This is possible because different variables may relate to the same latent features.

The chart below shows that a majority of the variance in the data set can be explained with far fewer dimensions: fewer than half of the original variables explain over 90% of the variance, and roughly 90 dimensions already account for 80% of it.
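As a minimal sketch of how that explained-variance curve can be produced with scikit-learn (assuming the cleaned data frame `azdias_clean` from the snippet above, with remaining gaps imputed by the column mean):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is sensitive to scale, so standardize first.
X_scaled = StandardScaler().fit_transform(azdias_clean.fillna(azdias_clean.mean()))

pca = PCA()
pca.fit(X_scaled)

# Cumulative explained variance: how many components cover 80% / 90%?
cum_var = np.cumsum(pca.explained_variance_ratio_)
print("Components for 80% of variance:", np.argmax(cum_var >= 0.80) + 1)
print("Components for 90% of variance:", np.argmax(cum_var >= 0.90) + 1)
```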

In principal component analysis, multiple variables are merged into components based on their correlation. For example:

The first component combines information about the number of houses in the same postal code and car ownership. It could be interpreted as a measure of density.
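To interpret a component in this way, you can look at the features with the largest absolute weights (loadings). A short sketch, assuming the fitted `pca` and `azdias_clean` from the previous snippet:

```python
import pandas as pd

# Map each principal component back to the original features.
loadings = pd.DataFrame(pca.components_, columns=azdias_clean.columns)

# Strongest negative and positive weights on the first component
first_component = loadings.iloc[0].sort_values()
print(first_component.head(5))  # strongest negative weights
print(first_component.tail(5))  # strongest positive weights
```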

Clustering

After the principal component analysis, I proceeded to cluster, first on the general population, and then applied the same clusters to the company's customers to study differences and similarities.

Clusters can be difficult to visualize because of the number of dimensions. To illustrate clustering at a small scale, I visualized 3 clusters in 3 dimensions:

Clustering based on 3 dimensions

The different colors show the different clusters along the main variables of the principal components. However, this is simply illustrative, as the number of clusters I considered adequate was larger than three.

In order to determine a good number of clusters, I plotted the sum of squared errors (SSE) on the Y axis against the number of clusters (K) on the X axis.

Based on this plot, I chose 15 clusters: a number large enough to significantly reduce the SSE while still being a manageable number of clusters.
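A sketch of that elbow plot with scikit-learn's KMeans, assuming the standardized data from the PCA snippet; the 90 retained components, the range of K, and the random seed are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Project the data onto the retained components before clustering.
X_pca = PCA(n_components=90).fit_transform(X_scaled)

# Sum of squared errors (inertia) for a range of cluster counts
ks = range(1, 21)
sse = [KMeans(n_clusters=k, random_state=42).fit(X_pca).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("SSE (inertia)")
plt.show()
```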

After this analysis, I compared the percentage of the general population and of the customer population that falls into each cluster.

The customer population is over-represented in clusters 13, 11, 5, and 4, which the company could evaluate in more detail. After a little exploring, some of the things that set “Cluster 13” apart from the average population are that its members live in more residential areas and are more frequent buyers of technology products.

Largest positive differences between cluster 13 and the general population
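The comparison itself can be done by fitting the final model on the general population and assigning customers to the same clusters. A sketch, assuming the customers have gone through the identical cleaning, scaling, and PCA pipeline (`X_customers_pca` is a hypothetical name for that result):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Fit the final 15-cluster model on the general population.
kmeans = KMeans(n_clusters=15, random_state=42).fit(X_pca)

# Share of each population falling into each cluster
general_share = pd.Series(kmeans.labels_).value_counts(normalize=True)
customer_share = pd.Series(kmeans.predict(X_customers_pca)).value_counts(normalize=True)

# Positive values = clusters where customers are over-represented
print((customer_share - general_share).sort_values(ascending=False))
```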

Supervised Learning

The company provided data from past mail-order campaigns. In this data set, we can see that only a small minority of people responded.

Response rates from training data

This poses a challenge for machine learning models, as there is a large class imbalance. This means that any algorithm will be tempted to always predict a non-response, because it would be right the vast majority of the time.

But even before getting into machine learning, we can already see which dimensions are correlated with responding:

In order to avoid over-fitting and to be able to assess the results of the ML algorithms, I first split the data into training and test sets, holding out 25% of the data for testing. Because the scikit-learn algorithms cannot handle missing values, I imputed them, replacing each missing value with the column average. Additionally, I used StandardScaler to standardize the data.
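A minimal sketch of that preprocessing, assuming the cleaned TRAIN file is loaded as `mailout_train` and the target column is named `RESPONSE`:

```python
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = mailout_train.drop(columns=["RESPONSE"])
y = mailout_train["RESPONSE"]

# Hold out 25% for testing, keeping the class ratio with stratify.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Replace missing values with the column mean, then standardize.
imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()
X_train = scaler.fit_transform(imputer.fit_transform(X_train))
X_test = scaler.transform(imputer.transform(X_test))
```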

I tested out different models:

a) Random Forest Classifier: Did not produce good results due to the class imbalance (even after using grid search)

b) Naive Bayes: Performed better, in particular capturing a large share of the people who actually responded (see image below)

c) AdaBoostClassifier: Performed quite well, particularly when measured by precision

Result for Naive Bayes
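The three models can be fit and compared on the held-out test set roughly as follows; this is a sketch with default hyper-parameters, assuming the preprocessed arrays from the previous snippet, not the exact configuration used in the project:

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.naive_bayes import GaussianNB

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "AdaBoost": AdaBoostClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print(name)
    print(classification_report(y_test, pred, digits=3))  # precision/recall per class
    print("ROC AUC:", roc_auc_score(y_test, proba))
```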

Optimizing the models

In order to get the best results, I used GridSearch, a tool that tries out different combinations of each model's hyper-parameters and evaluates them against a metric you choose to optimize. In my case, I optimized for both precision and ROC AUC.

Below you can see the output of a sample grid search.

Results from AdaBoostClassifier
Most important features of the AdaBoost Classifier
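A grid search over an AdaBoost model might look like the sketch below; the parameter grid here is hypothetical and not necessarily the one used in the project:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.1, 0.5, 1.0],
}

grid = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    param_grid=param_grid,
    scoring="roc_auc",  # could also be "precision", as mentioned above
    cv=5,
)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)
```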

Potential for improvement

Although this work can already help the company target its audience significantly better, there are several steps that could improve the targeting even further. For example:

  • Using K-fold cross-validation: this would allow all the data to be used for both training and evaluation, which could improve the results (see the sketch after this list)
  • Testing more models, such as Support Vector Machines, Gradient Boosting, or even neural networks
  • Using the clusters from the unsupervised learning step as input features for the supervised learning algorithms
  • Altering the class balance within the training data set. Because of the class imbalance, the people who responded to the campaign have a low representation in the training data. Rebalancing can push the algorithm to predict more responses; this would likely reduce accuracy but improve other metrics we may care more about, such as precision.
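For the cross-validation idea, a stratified K-fold keeps the small share of responders roughly constant in every fold. A sketch, assuming the preprocessed training data from above:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified folds preserve the (very small) proportion of responders in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    AdaBoostClassifier(random_state=42),
    X_train, y_train,
    scoring="roc_auc",
    cv=cv,
)
print("Mean ROC AUC:", scores.mean(), "±", scores.std())
```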

Conclusion

With this analysis, the company can target customers in its mailing campaigns much more effectively, and it also gains an overall framework for thinking about its customers through clusters.
