Cosine Similarity and Handling Categorical Variables

Machine learning recommender systems depend heavily on the ability to compare multiple characteristics of a large pool of candidates and identify the ones that best match the required characteristics. Comparing characteristics (or parameters) is fairly straightforward when they are numeric. But many characteristics are ‘categorical variables’ that take on a value from a limited, usually fixed, set of values. For example, the categorical variable ‘eye color’ can be blue, brown, amber, hazel, green, or grey. Comparing categorical variables can be tricky because it is not easy to map them onto a numeric model that represents them accurately. This article shares the approach our team followed to model categorical variables when using cosine similarity to build a recommender system.

Background

A web portal offers nonprofits the ability to post their projects and solicit crowdfunding. When donors make donations on the portal, the transaction information is stored. Therefore, the web portal maintains the following datasets:

  1. A list of donors and their donation history, and
  2. A list of projects that need to be funded (initiatives) and information about them

Our charter was to develop a recommender system that compares a donor’s previous donation history to the characteristics of the nonprofit projects soliciting funding and recommends the projects that best match the donor’s preferences. For example, if ‘Donor A’ donated to ‘Project X’ in the past, we can find the projects most similar to ‘Project X’ and recommend them to ‘Donor A’. We can use ‘Donor A’s donation history to analyze their donation preferences. See Figure 1 for an overview.

Figure 1: Overview of Recommender System

Analyzing Donor’s Preferences

The sample dataset below (Table 1) describes the donation transactions: each row records that a donor donated some amount of money to a project that has a given “Project Cost” and belongs to a “Project Category”.

Table 1: Donation History Table

Finding Similarity between Projects

This can be done by using a distance measure to compute the distance between project vectors. Each project vector consists of two variables: ‘Project Category’ and ‘Project Cost’. Project Category is a categorical variable, so it must be mapped to numeric values before the cosine similarity measure can be applied.

Categorical Variable Handling

There are two popular methods to convert categorical data to numeric values:

  1. Label encoding: assigns each category value an integer in the range 0 to n-1, where n is the number of possible values the variable can take.
  2. Dummies method: turns each category value into a binary vector of size n, with a 1 in the position corresponding to that value and 0 everywhere else.
    Applying the Dummies method to the “Project Category” column turns “Table 2” below into “Table 3”; a code sketch of both encodings follows this list.
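
As a minimal sketch of the two encodings, assuming pandas is available (the variable names are illustrative, and the data matches Table 2 below):

    import pandas as pd

    # New projects and their attributes (the same data as Table 2 below)
    projects = pd.DataFrame({
        "Project": ["newp1", "newp2", "newp3", "newp4"],
        "Project Category": ["Education", "Education", "Music", "Literacy"],
        "Project Cost": [100, 200, 135, 250],
    })

    # 1. Label encoding: each category value becomes an integer in 0..n-1
    label_codes = projects["Project Category"].astype("category").cat.codes

    # 2. Dummies method: one binary column per category value
    one_hot = pd.get_dummies(projects, columns=["Project Category"])
    print(one_hot)

We proceed with the Dummies method, since label encoding would impose an artificial ordering on the categories, which can distort a distance calculation.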

Table 2: New Projects Table

    Project   Project Category   Project Cost
    newp1     Education          100
    newp2     Education          200
    newp3     Music              135
    newp4     Literacy           250

Table 3 is created by applying the Dummies method to the “Project Category” feature in Table 2: the feature is replaced by three new binary columns, ‘Education’, ‘Literacy’, and ‘Music’, as shown in the table below.

Table 3: Similarity Calculation Table

    Project   Education   Literacy   Music   Project Cost
    p1        1           0          0       150
    newp1     1           0          0       100
    newp2     1           0          0       200
    newp3     0           0          1       135
    newp4     0           1          0       250

Our aim is to determine which of the projects in Table 2 are most similar to project p1, the project that donor D1 funded. So, using the information in Table 1, we enter project p1’s data as the first row of Table 3.

Now, using the cosine similarity method, let us find the projects similar to the one to which donor D1 made a donation (p1).

Finding Similarity using the Cosine Method

Cosine similarity measures the similarity between two non-zero vectors of n variables by taking the cosine of the angle between them. A value close to 1 indicates that the vectors point in nearly the same direction, meaning the items are very similar; a value near zero indicates that the vectors are orthogonal, meaning the items are dissimilar.
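
Formally, for vectors A and B:

    similarity(A, B) = (A · B) / (‖A‖ × ‖B‖)

where A · B is the dot product and ‖A‖ is the Euclidean norm (length) of A.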

Now, from Table 3, we find the similarity between p1 and each of the other projects. The project vectors are:

p1 = (1,0,0,150), newp1 = (1,0,0,100), newp2 = (1,0,0,200), newp3 = (0,0,1,135) and newp4 = (0,1,0,250)
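
As a worked example, for p1 and newp1 the dot product is (1×1) + (0×0) + (0×0) + (150×100) = 15001, and the norms are √22501 ≈ 150.003 and √10001 ≈ 100.005, so the similarity is 15001 / (150.003 × 100.005) ≈ 0.999994. Computing this for every pair gives: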

  • Similarity(p1,newp1) = 0.999994
  • Similarity(p1,newp2) = 0.999998
  • Similarity(p1,newp3) = 0.99995
  • Similarity(p1,newp4) = 0.99997

In code, the similarity can be computed with the cosine_similarity() function from scikit-learn (in sklearn.metrics.pairwise). All of the similarity values are close to 1, but the projects most similar to project p1 are newp2 and newp1. These projects can thus be recommended to donor D1.
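
As a minimal sketch, assuming scikit-learn and NumPy are installed (the variable names are illustrative), the values above can be reproduced as follows:

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Encoded project vectors: (Education, Literacy, Music, Project Cost)
    p1 = np.array([[1, 0, 0, 150]])
    candidates = np.array([
        [1, 0, 0, 100],  # newp1
        [1, 0, 0, 200],  # newp2
        [0, 0, 1, 135],  # newp3
        [0, 1, 0, 250],  # newp4
    ])

    # One row of pairwise similarities between p1 and each candidate project
    scores = cosine_similarity(p1, candidates)[0]
    for name, score in zip(["newp1", "newp2", "newp3", "newp4"], scores):
        print(name, round(score, 6))

Sorting the scores in descending order yields the recommendation list for donor D1.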

Conclusion

The human brain can differentiate between two scenarios based on a few differences in attributes. We built a mathematical tool that models the way the brain perceives differences, even when those differences are not numeric. In our example, we expected project p1 to be most similar to projects newp2 and newp1; running the data through the model and calculating the cosine similarity values confirmed that they are indeed the most similar. When the cosine similarity value is close to 1, the projects are very similar; when it is near zero, they are dissimilar.

This mathematical model’s ability to handle non-numeric differences through categorical variables, and its ability to scale to a large number of attributes, make it extremely useful in recommender systems.

Contributor: Rahul Kuntala