Arvato Project Report
This blog post is about the Udacity Arvato project. The data is provided by the company Bertelsmann Arvato Analytics. The goal of the project is to identify people in the general population who are likely to become customers. Additionally, the project predicts which people are likely to become customers after being targeted by a mailout campaign.
Description of Input Data
The Arvato project consists of a large dataset where each row represents a single person. Each person has various attributes such as number of kids, type of car and other personal and demographic details. The data also records each customer's customer group, whether the purchase was made online, and the product group. A separate dataset covers the general population rather than only customers and therefore lacks these customer-related columns. Furthermore, there is a mailout dataset indicating whether a mail recipient later decided to make a purchase or not. This last dataset is used for the classification task.
Strategy for solving the problem
In a first step, the customers should be grouped into meaningful clusters so that targeted advertising can be performed. This can be achieved via clustering techniques on the customer and the general population data. Furthermore, a classification model should be trained on the mailout data, where the response can be taken as a label. With this model, we can later decide whether it makes sense to address specific people in a mailout campaign.
To measure the performance of the clusters there is no ground truth, so we minimize the intra-cluster distance instead. For the classification task it is straightforward to use accuracy as a metric. Furthermore, we check the data for class imbalance and, where necessary, use appropriate measures such as Cohen's kappa score.
Discussion of the expected solution
The solution should provide at least one clearly separated cluster representing people from the population who are likely to become customers. Furthermore, a classification model is provided that predicts which people are likely to respond to a mailout campaign.
Metrics with justification
As clustering is an ill-posed problem, we can only use heuristics to evaluate the clusters. We use the intra-cluster distance as well as visualization techniques: in detail, we use PCA to project the data into a two-dimensional space and then visualize it in a plot.
For the classification task, we report accuracy as well as Cohen's kappa score, which compares the performance of our classifier against a majority-class classifier. This allows for meaningful interpretation on imbalanced data.
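A minimal sketch of why the kappa score complements accuracy on imbalanced labels; the label vectors below are made up for illustration:

```python
# Sketch: accuracy vs. Cohen's kappa on imbalanced labels (made-up data).
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true = [0] * 95 + [1] * 5              # 95 % majority class
y_majority = [0] * 100                   # classifier that always predicts 0
y_better = [0] * 95 + [1, 1, 0, 0, 1]    # catches 3 of the 5 positives

print(accuracy_score(y_true, y_majority))     # 0.95 despite learning nothing
print(cohen_kappa_score(y_true, y_majority))  # 0.0: no better than chance
print(accuracy_score(y_true, y_better))
print(cohen_kappa_score(y_true, y_better))
```

The majority-class classifier scores 95 % accuracy but a kappa of zero, which is exactly the failure mode we need to detect here.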
Exploratory Data Analysis
As all three datasets share the same attributes except for a few columns, we focus on the largest dataset to derive the preprocessing steps. At first glance, we can see a few non-numeric attributes that will need preprocessing later on.
We also have an Excel file with value encodings, which reveals that some coded values actually mean that the entry for this attribute is missing.
We recode all attribute values that carry the meaning "information is missing" as missing values.
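A short sketch of this recoding step. The per-attribute code lists would come from the provided Excel file; the column names and codes below are made-up examples:

```python
# Sketch: replace values that encode "unknown" with NaN.
# The mapping below is illustrative, not the real encoding file.
import numpy as np
import pandas as pd

unknown_codes = {
    "HEALTH_TYP": [-1, 0],   # hypothetical: -1 and 0 mean "unknown"
    "AGER_TYP": [-1],
}

df = pd.DataFrame({"HEALTH_TYP": [1, -1, 3, 0], "AGER_TYP": [2, -1, 1, 3]})
for col, codes in unknown_codes.items():
    df[col] = df[col].replace(codes, np.nan)

print(df.isna().sum())
```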
Now that these values are recoded as missing, we check for missing values in our dataset.
We can see that some columns have a lot of missing values. Based on the graphic, we draw a line at 25 % and drop all columns with more than 25 % missing values during preprocessing.
Furthermore, we also look at missing values per row. Here, some rows also have many missing values, so it makes sense to remove these persons from the data. The curve flattens at around 8 %, so we use this as the per-row threshold. We lose around 140,000 of our 890,000 rows, which seems like an acceptable trade-off.
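The two thresholds can be sketched as one small helper; the tiny frame below is made up to show the effect:

```python
# Sketch: drop columns with > 25 % missing values, then rows with > 8 %.
import numpy as np
import pandas as pd

def drop_sparse(df, col_thresh=0.25, row_thresh=0.08):
    """Remove columns, then rows, exceeding a missing-value ratio."""
    keep_cols = df.columns[df.isna().mean() <= col_thresh]
    df = df[keep_cols]
    keep_rows = df.isna().mean(axis=1) <= row_thresh
    return df[keep_rows]

# Made-up frame: column "c" is 50 % missing, row 3 is mostly missing.
df = pd.DataFrame({
    "a": [1, 2, 3, np.nan],
    "b": [1, 2, 3, np.nan],
    "c": [np.nan, np.nan, 3, 4],
})
cleaned = drop_sparse(df)
print(cleaned.shape)
```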
Regarding preprocessing, we removed missing values according to the thresholds mentioned in the previous section. Furthermore, the non-numeric attributes are one-hot encoded, as they have no ordinal relationship to each other; only those with exactly two categories are encoded as binary. To impute missing values, we used a different technique per type of column: binary and categorical columns are imputed with the mode (most frequent value), and numerical columns with the per-column median. After imputation, the categorical values are one-hot encoded and the numerical values are z-transformed.
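These steps can be sketched with scikit-learn pipelines; the column names here are placeholders, not the project's actual column split:

```python
# Sketch of the preprocessing: mode imputation + one-hot encoding for
# categorical columns, median imputation + z-transform for numerical ones.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_cols = ["CJT_TYP"]    # hypothetical column names
numerical_cols = ["ANZ_KINDER"]

preprocess = ColumnTransformer([
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numerical_cols),
])

df = pd.DataFrame({"CJT_TYP": [1, 2, np.nan, 2],
                   "ANZ_KINDER": [0, np.nan, 2, 1]})
X = preprocess.fit_transform(df)
print(X.shape)  # two one-hot columns plus one z-scaled numeric column
```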
To solve the clustering problem, we use k-means as a straightforward and easily interpretable clustering algorithm. We use the inertia to find the best number of clusters: we try different numbers of clusters and look for a reasonable value of k.
Note that the elbow criterion is a heuristic and clustering is an ill-posed problem. However, the curve seems to flatten at around k = 10, so we choose 10 clusters.
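The elbow heuristic boils down to recording the inertia for several values of k; the synthetic data below stands in for the preprocessed population data:

```python
# Sketch of the elbow heuristic: fit k-means for several k and record
# the inertia (sum of squared intra-cluster distances).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Made-up data with three well-separated groups.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 5)) for c in (0, 5, 10)])

inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

# Inertia always decreases with k; we look for where the drop flattens.
print(inertias)
```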
For the classification task, we first tried a simple RSLVQ, as it is nicely interpretable through its prototypes in Euclidean space. However, at our first evaluation we noticed that while accuracy is around 99 %, the algorithm mostly returns the majority class. The dataset is highly imbalanced: we only have a few responses in a large dataset, so most of the labels are 0. We then tried techniques like SMOTE combined with RSLVQ, which also led to a bad kappa score, so we focused on classifiers for imbalanced data, like RUSBoost, GradientBoosting, XGBoost and Balanced Random Forest, all with and without SMOTE. According to the kappa score, the Balanced Random Forest performed best by a small margin, but still has problems distinguishing the classes.
We thus used a grid search with cross-validation to tune the hyperparameters of the Balanced Random Forest, which led us to our final model.
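A sketch of this search, scored by Cohen's kappa instead of accuracy. The project used imblearn's BalancedRandomForestClassifier; here sklearn's RandomForestClassifier with balanced class weights stands in to keep the example self-contained, and the grid and data are illustrative:

```python
# Sketch: GridSearchCV scored by Cohen's kappa on an imbalanced dataset.
# RandomForestClassifier with class_weight="balanced" is a stand-in for
# the Balanced Random Forest actually used in the project.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, weights=[0.95], random_state=0)

param_grid = {                      # illustrative grid, not the real one
    "n_estimators": [50, 100],
    "max_depth": [5, None],
}
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring=make_scorer(cohen_kappa_score),  # optimize kappa, not accuracy
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```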
The image above shows the hyperparameter grid we used. We tried a broad variety of parameters that make sense according to the algorithm specification.
As shown above, we performed the grid search with respect to the kappa score instead of accuracy, because the dataset is highly imbalanced. On the test dataset, the kappa score improved from 1.18 % to 2.2 % due to the grid search.
Results & Conclusion
To interpret the results of the clustering, we look at the members per cluster in the population data compared to our customer data.
We can see that the population data has far more members per cluster, simply because this dataset is much larger. Hence, in a next step we normalize and look at each cluster's share relative to its dataset size.
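The normalization step can be sketched as follows; the cluster labels are made up:

```python
# Sketch: compare per-cluster shares within each dataset rather than
# raw member counts (labels below are made up).
import pandas as pd

population_labels = pd.Series([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
customer_labels = pd.Series([0, 2, 2, 2, 2])

pop_share = population_labels.value_counts(normalize=True)
cust_share = customer_labels.value_counts(normalize=True)

# Positive difference: cluster is overrepresented among customers.
diff = cust_share.sub(pop_share, fill_value=0).sort_values(ascending=False)
print(diff)
```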
We can now see that many clusters are around the same size, but some are over- or underrepresented; these are the important clusters for our customer segmentation. Hence, we plot the differences between the clusters in a more direct way.
Green means the cluster's share among our customers is higher than in the population, so the green clusters contain the people who are interested in our products. Red clusters contain people who are not interested in our products. We can now look at the cluster means.
Our customers seem to be segmented by car-related attributes by the clustering algorithm. Our customers tend to prefer Customer Journey types 1 and 2, while non-buyers prefer type 5 more than customers do. Our customers also like to save money traditionally or as investors, while non-customers tend to be the prepared or minimalist types. Our customers are also younger than non-customers. Thus, we should address people who have chosen CJT type 1 or 2, are of younger age, and like to invest money.
Looking at the PCA plot, the clusters seem to be well separated; in particular, cluster 3 differs a lot from the rest of the data when projected onto the eigenvectors corresponding to the two largest eigenvalues.
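The visualization amounts to projecting onto the two leading principal components and coloring by cluster; the data below is synthetic:

```python
# Sketch of the PCA visualization: project onto the two leading
# principal components and scatter-plot per cluster (made-up data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 10)) for c in (0, 4, 8)])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (300, 2)

# Plotting (matplotlib assumed available):
# import matplotlib.pyplot as plt
# plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=5)
# plt.show()
```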
Regarding the classification task, we optimized with respect to the kappa score: while simple models reach 99 % accuracy, this is not useful because the dataset is highly imbalanced.
Thus, we show the results of our grid-search-optimized Balanced Random Forest classifier. Unfortunately, even the optimized model only reaches a kappa score of 2.2 %, which is only slightly better than a majority-class classifier.
The final confusion matrix also shows that even the algorithm optimized for imbalanced data mostly predicts the 0 class label.
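A confusion matrix makes this bias visible in a way a single accuracy number does not; the labels below are made up:

```python
# Sketch: a confusion matrix exposes the majority-class bias
# (made-up labels, not the project's actual predictions).
from sklearn.metrics import confusion_matrix

y_true = [0] * 97 + [1] * 3
y_pred = [0] * 99 + [1]          # almost always predicts the 0 class

cm = confusion_matrix(y_true, y_pred)
print(cm)
# Rows: true class, columns: predicted class.
```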
Below you can find an overall comparison between all tested algorithms.
Note that the above comparison shows the results for each algorithm before the grid search, as grid-searching all algorithms would be too expensive. Based on accuracy and kappa score, we chose Balanced Random Forest over RUSBoost, as the latter has a much worse accuracy score.
While the unsupervised part of the project cannot be strictly evaluated because of its unsupervised nature, it produced understandable clusters. We can interpret the clusters as follows:
Our customers seem to be segmented by car-related attributes via the clustering algorithm. They tend to prefer Customer Journey types 1 and 2, while non-buyers prefer type 5 more than customers do. Our customers also like to save money traditionally or as investors, while non-customers tend to be the prepared or minimalist types. Our customers are also younger than non-customers.
Thus, we should address people who have chosen CJT type 1 or 2, are of younger age, and like to invest money.
The supervised part provides very good results with respect to accuracy, but we have to account for the imbalanced nature of the dataset. The dataset is huge, yet only a small fraction of people became customers, so a good accuracy is achieved simply by always predicting 0. Hence, we focused on algorithms designed for imbalanced data. Unfortunately, even after a grid search with respect to the kappa score with the most promising algorithm, the performance stays at a low level.
In general, further improvements could focus on using embeddings or projections to reduce the dimensionality of the data before applying clustering or classification algorithms. To further improve the classification task, one could try to resample the data even better than with SMOTE during preprocessing to give more weight to the minority class.
Both the unsupervised and the supervised parts of this project have been really interesting and also challenging. In general, the preprocessing needed some additional cleanup logic, like recoding certain given values back to NaN based on a separate encoding file. Furthermore, unsupervised tasks are always hard to handle, as there is no ground truth: other clustering algorithms or other hyperparameters could lead to different clusters with different characteristics, and you would need a domain expert to check whether the clusters really make sense. The supervised task is also very challenging due to the imbalance in the labels. As previously mentioned, even with specialized techniques for imbalanced data, the performance with respect to imbalance-aware measures stays at a low level, while a high accuracy could easily be achieved.
We want to thank Bertelsmann Arvato Analytics for providing the dataset, and we are also thankful that Udacity set up this interesting project. Huge thanks go to the open source community for providing data science libraries and the knowledge around them.