{"id":21772,"date":"2018-07-12T14:49:11","date_gmt":"2018-07-12T18:49:11","guid":{"rendered":"https:\/\/www.crim.ca\/blogue\/manipuler-les-variables-categoriques-dans-un-jeu-de-donnees\/"},"modified":"2023-05-25T12:22:04","modified_gmt":"2023-05-25T16:22:04","slug":"manipulating-categorical-variables-in-a-dataset","status":"publish","type":"blogue","link":"https:\/\/www.crim.ca\/en\/blogue\/manipulating-categorical-variables-in-a-dataset\/","title":{"rendered":"Manipulating categorical variables in a dataset"},"content":{"rendered":"<p id=\"062a\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">Broadly speaking, a dataset (excluding textual data and images) has two types of variables: quantitative and qualitative.<\/p>\n<figure class=\"la lb lc ld hf le gt gu paragraph-image\">\n<div class=\"gt gu kz\"><img fetchpriority=\"high\" decoding=\"async\" class=\"cf lf lg\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/512\/1*k6yfyTO1pLGKXXcpLxuAFw.jpeg\" alt=\"\" width=\"256\" height=\"343\" \/><\/div><figcaption class=\"lh bm gv gt gu li lj bn b bo bp co\" data-selectable-paragraph=\"\">As early as antiquity, the concept of categories was formalized by Aristotle in his book, Categories.<\/figcaption><\/figure>\n<p id=\"e2ba\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">A quantitative variable is a variable that admits numerical values, continuous or discrete. For example, the height of an individual, the salary of an employee, and the speed of a car are quantitative variables. As these variables are numerical, their treatment by machine learning algorithms is more straightforward, i.e. 
they can be used directly without requiring a prior transformation.<\/p>\n<p id=\"5331\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">A qualitative variable takes on values called categories, modalities, or levels that have no quantitative meaning. For example, the gender of an individual is a categorical variable with two (or more) modalities: male and female. Also, statistics such as the mean are not meaningful for such data. The presence of these variables in the data generally complicates learning. Indeed, most machine learning algorithms take numerical values as input. Thus, we must find a way to transform our modalities into numerical data.<\/p>\n<p id=\"6cec\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">Moreover, how this transformation is carried out is very important. Indeed, the coding of categorical variables generally affects the performance of learning algorithms, and one type of coding may be more appropriate than another. For example, random forests, a type of machine learning algorithm, have difficulty capturing the information in categorical variables with a large number of modalities when these are treated with the one-hot encoding technique presented in the next section. 
Thus, more specific learning algorithms, such as Catboost, which we describe below, have emerged.<\/p>\n<p id=\"8ef5\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">This article presents different methods and tips for managing categorical variables.<\/p>\n<p id=\"6023\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">The example below serves as an illustration of some methods throughout the article.<\/p>\n<figure class=\"la lb lc ld hf le gt gu paragraph-image\">\n<div class=\"gt gu ll\"><img decoding=\"async\" class=\"alignnone wp-image-21777 size-full\" src=\"https:\/\/www.crim.ca\/wp-content\/uploads\/2018\/07\/manipulating-categorial-variables-1.jpg\" alt=\"Manipulating categorical variables in a dataset\" width=\"291\" height=\"202\" \/><\/div><figcaption class=\"lh bm gv gt gu li lj bn b bo bp co\" data-selectable-paragraph=\"\"><em>Example of a dataset<\/em><\/figcaption><\/figure>\n<h2 id=\"5145\" class=\"lm ln jd bn lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj gi\">One-hot encoding<\/h2>\n<p id=\"2c19\" class=\"pw-post-body-paragraph ka kb jd kc b kd mk kf kg kh ml kj kk kl mm kn ko kp mn kr ks kt mo kv kw kx iw gi\" data-selectable-paragraph=\"\"><em>One-hot encoding<\/em> is the most popular method for transforming a categorical variable into a numerical variable. Its popularity lies mainly in the ease of application. Moreover, for many tasks, it gives good results. Its principle is as follows:<\/p>\n<p id=\"7963\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">Let us consider a categorical variable X which admits <em>K<\/em> modalities<em> m1, m2<\/em>, &#8230;, <em>mK<\/em>. 
One-hot encoding consists in creating <em>K<\/em> indicator variables, i.e. a vector of size <em>K<\/em> which has 0&#8217;s everywhere and a 1 at position i corresponding to the modality <em>mi<\/em>. We thus replace the categorical variable with<em> K<\/em> numerical variables.<\/p>\n<p id=\"7d4c\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">If we consider the previous example and assume that the available categories are only those displayed, we then have:<\/p>\n<figure class=\"la lb lc ld hf le gt gu paragraph-image\">\n<div class=\"gt gu mp\"><img decoding=\"async\" class=\"alignnone wp-image-21779 size-full\" src=\"https:\/\/www.crim.ca\/wp-content\/uploads\/2018\/07\/manipulating-categorial-variables-2.jpg\" alt=\"Manipulating categorical variables in a dataset\" width=\"642\" height=\"202\" srcset=\"https:\/\/www.crim.ca\/wp-content\/uploads\/2018\/07\/manipulating-categorial-variables-2.jpg 642w, https:\/\/www.crim.ca\/wp-content\/uploads\/2018\/07\/manipulating-categorial-variables-2-300x94.jpg 300w\" sizes=\"(max-width: 642px) 100vw, 642px\" \/><\/div><figcaption class=\"lh bm gv gt gu li lj bn b bo bp co\" data-selectable-paragraph=\"\"><em>Result of one-hot encoding.<\/em><\/figcaption><\/figure>\n<p id=\"eb44\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\"><strong>Advantages:<\/strong> simple, intuitive, and quick set up.<\/p>\n<p id=\"628c\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\"><strong>Disadvantages:<\/strong> When the number of modalities is high (more than 100 for example), the number of new variables created is also high. 
Thus, we end up with a much larger dataset, which occupies more memory space and whose processing by the learning algorithms becomes more difficult. Also, some algorithms, in particular some implementations of decision tree forests, do not make the best use of the information contained in these variables when the number of modalities is too great (see [1] for more details).<\/p>\n<h2 id=\"cd9f\" class=\"lm ln jd bn lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj gi\">Reduction in the number of modalities<\/h2>\n<p id=\"c8a7\" class=\"pw-post-body-paragraph ka kb jd kc b kd mk kf kg kh ml kj kk kl mm kn ko kp mn kr ks kt mo kv kw kx iw gi\" data-selectable-paragraph=\"\">Domain knowledge can help reduce the number of modalities. Indeed, an understanding of the categories can enable them to be grouped efficiently. A natural grouping occurs when the categories are hierarchical, i.e. it is possible to define a new category that includes other categories. Consider a variable whose categories are a city&#8217;s neighbourhoods: these categories can, for example, be grouped by borough, i.e., the neighbourhoods of the same borough will share the same modality. This is a relatively common situation. However, we should point out that such groupings may introduce a bias in the model.<\/p>\n<p id=\"9282\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">A second way to deal with a high number of categories is to merge rare modalities. Modalities that appear very infrequently in the data can be combined: a frequency table of the modalities is computed, and those whose frequency is lower than a certain threshold are grouped into a single category, &#8220;other&#8221; for example. 
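A minimal sketch of this rare-modality grouping with pandas (the neighbourhood values and the threshold are illustrative, not taken from the example dataset):

```python
import pandas as pd

# Hypothetical neighbourhood variable with a long tail of rare modalities
s = pd.Series(["plateau", "plateau", "plateau", "mile-end", "mile-end", "griffintown"])

# Frequency table of the modalities
counts = s.value_counts()

# Modalities below the threshold are grouped into a single "other" category
threshold = 2
grouped = s.where(s.map(counts) >= threshold, "other")
```

`pd.get_dummies(grouped)` would then produce the one-hot encoding of the reduced variable.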
A one-hot encoding can then be applied to the new variable.<\/p>\n<h2 id=\"d134\" class=\"lm ln jd bn lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj gi\">Ordinal variables<\/h2>\n<p id=\"da2e\" class=\"pw-post-body-paragraph ka kb jd kc b kd mk kf kg kh ml kj kk kl mm kn ko kp mn kr ks kt mo kv kw kx iw gi\" data-selectable-paragraph=\"\">Ordinal variables are categorical variables that show a notion of order, i.e. a ranking of their modalities is available. For example, the variable\u00a0Age Range, which would take the values<em>\u00a0baby, teenager, child, adult, and elderly<\/em>, is an ordinal variable. Indeed, we can order the modalities ascendingly as follows:<em>\u00a0baby &lt; child &lt; teenager &lt; adult &lt; elderly.<\/em><\/p>\n<p id=\"7ee2\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">In the case of such variables, an alternative to <em>one-hot encoding<\/em> is to use the rank to encode the modalities, which then makes the variable quantitative. In the <em>Age Range<\/em>\u00a0example, we would have, for instance,\u00a0<em>baby = 1, child = 2, teenager = 3<\/em>, etc.<\/p>\n<p id=\"7597\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">Knowing the modalities can allow us to associate numerical values other than the rank. In the case of\u00a0Age Range, we know that adolescence runs from about 12 to 17 years and adulthood from 25 to 65 years. 
Thus the mean ((12 + 17) \/ 2 = 14.5) of the range of values can be used instead of the rank.<\/p>\n<p id=\"be6c\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">Days of the week and days of the month can also be treated as ordinal variables.<\/p>\n<h2 id=\"a3d4\" class=\"lm ln jd bn lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj gi\"><em>Impact encoding<\/em><\/h2>\n<p id=\"807e\" class=\"pw-post-body-paragraph ka kb jd kc b kd mk kf kg kh ml kj kk kl mm kn ko kp mn kr ks kt mo kv kw kx iw gi\" data-selectable-paragraph=\"\">Encoding by indicator variables may become inconvenient when the number of categories becomes very large. An alternative to grouping or truncating the categories consists of characterizing each category by its link with the target variable\u00a0<em>y<\/em>: this is called\u00a0<em>impact encoding<\/em>.<\/p>\n<p id=\"73c3\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">This method is also known as\u00a0<em>likelihood encoding<\/em>,<em>\u00a0target coding,\u00a0conditional-probability encoding,\u00a0and weight of evidence<\/em>\u00a0([9]).<\/p>\n<p id=\"c676\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\"><strong>Definition:<\/strong> For a regression problem with target variable\u00a0<em>y<\/em>, let\u00a0<em>X<\/em> be a categorical variable with\u00a0<em>K<\/em> categories <em>m1<\/em>, &#8230;, <em>mK<\/em>. 
Each category\u00a0mk\u00a0is encoded by its impact value:<\/p>\n<figure class=\"la lb lc ld hf le gt gu paragraph-image\">\n<div class=\"gt gu mq\"><img loading=\"lazy\" decoding=\"async\" class=\"cf lf lg\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/534\/1*C3ULJRD5KpCh874mu-Qq9g.png\" alt=\"\" width=\"267\" height=\"19\" \/><\/div>\n<\/figure>\n<p id=\"2b08\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\"><em>\ud835\udd3c[y|X=mk]<\/em> is the expectation of the target\u00a0<em>y<\/em>, knowing that the variable\u00a0<em>X<\/em>\u00a0is set to modality\u00a0<em>mk<\/em>. For a training set of size\u00a0n\u00a0containing independent and identically distributed samples\u00a0<em>{(xi, yi), 1\u2264i\u2264n}<\/em>, the estimator of this expectation is the average of the values of\u00a0<em>yi<\/em>\u00a0for which the modality\u00a0<em>xi\u00a0<\/em>is equal to<em>\u00a0mk<\/em>:<\/p>\n<figure class=\"la lb lc ld hf le gt gu paragraph-image\">\n<div class=\"gt gu mr\"><img loading=\"lazy\" decoding=\"async\" class=\"cf lf lg\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/396\/1*EhparLW2K8aDfeId1zevTQ.png\" alt=\"\" width=\"198\" height=\"48\" \/><\/div>\n<\/figure>\n<p id=\"541e\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">Where<em> Sk<\/em> is the set of indices i of the observations such that xi is equal to <em>mk<\/em> and <em>nk<\/em> is the cardinality of this set.<\/p>\n<p id=\"af96\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">The estimator of the expectation of y is simply its empirical mean:<\/p>\n<figure class=\"la lb lc ld hf le gt gu paragraph-image\">\n<div class=\"gt gu ms\"><img loading=\"lazy\" decoding=\"async\" 
class=\"cf lf lg\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/max\/232\/1*waM1SCPlHyuzfhZhr9qxhg.png\" alt=\"\" width=\"116\" height=\"51\" \/><\/div>\n<\/figure>\n<p id=\"283a\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">Note that there are other variants of the definition given above. For example, in [8], both expressions are weighted by a parameter between 0 and 1.<\/p>\n<p id=\"cdc7\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">To illustrate impact encoding, let us consider the product variable in our previous example. It has three modalities, namely shirts, shoes and perfumes. The empirical mean is (50.99 + 44 + 120 + 45)\/4 = 64.9975. We then have:<\/p>\n<p id=\"889f\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">Impact (shirts) = (50.99 + 45)\/2 - 64.9975 \u2248 -17<\/p>\n<p id=\"05d6\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">Impact (shoes) = 44 - 64.9975 \u2248 -21<\/p>\n<p id=\"9349\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">Impact (perfume) = 120 - 64.9975 \u2248 55<\/p>\n<figure class=\"la lb lc ld hf le gt gu paragraph-image\">\n<div class=\"mu mv dq mw cf mx\" tabindex=\"0\" role=\"button\">\n<div class=\"gt gu mt\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-21781 size-full\" src=\"https:\/\/www.crim.ca\/wp-content\/uploads\/2018\/07\/manipulating-categorial-variables-3.jpg\" alt=\"Manipulating categorical variables in a dataset\" width=\"713\" height=\"234\" 
srcset=\"https:\/\/www.crim.ca\/wp-content\/uploads\/2018\/07\/manipulating-categorial-variables-3.jpg 713w, https:\/\/www.crim.ca\/wp-content\/uploads\/2018\/07\/manipulating-categorial-variables-3-300x98.jpg 300w\" sizes=\"(max-width: 713px) 100vw, 713px\" \/><\/div>\n<\/div><figcaption class=\"lh bm gv gt gu li lj bn b bo bp co\" data-selectable-paragraph=\"\">Calculation of impact encoding of the shirt modality.<\/figcaption><\/figure>\n<p id=\"cc80\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">This encoding has the advantage of being very compact:\u00a0<strong>the number of descriptors of a variable is constant<\/strong> compared to the number of categories. However, this encoding leads to a<strong> loss of information:<\/strong> only a &#8220;correlation&#8221; value with the target is retained. This means that if two categories have close values for the target y on average, the model cannot distinguish them.<\/p>\n<p id=\"8570\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">To remedy this loss of information, indicator variables can be kept for the most dominant categories. Alternatively, it is possible to encode combinations of variables. Sometimes, a category alone will not be correlated with the target\u00a0y, but the combination of two categorical variables can be &#8220;predictive&#8221; of the target. In practice, this consists in identifying the pairs (or even triplets) of categorical variables that can interact with the target variable and encode the concatenation of the instances of these variables; in-depth knowledge often helps to identify these interesting combinations of variables. 
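The impact values above can be reproduced with a short pandas sketch; the DataFrame simply restates the four rows of the example (the column names `product` and `price` are illustrative):

```python
import pandas as pd

# The four observations from the example dataset
df = pd.DataFrame({
    "product": ["shirts", "shoes", "perfume", "shirts"],
    "price":   [50.99, 44.0, 120.0, 45.0],
})

# Impact of a modality: mean of the target within that modality minus the global mean
global_mean = df["price"].mean()                        # 64.9975
impact = df.groupby("product")["price"].mean() - global_mean

# Replace each modality by its impact value
df["product_impact"] = df["product"].map(impact)
```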
This method can even be extended to the encoding of non-categorical variables by transforming continuous variables into binary variables, by discretization and binarization ([6], [7]).<\/p>\n<p id=\"f598\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\"><em>Impact encoding<\/em><strong> is a form of model stacking<\/strong>: first, simple models (<em>\ud835\udd3c[y|X=mk]<\/em> calculations) are trained on each of the categorical variables. The output predictions of these models are then used as descriptors in a second model.<\/p>\n<figure class=\"la lb lc ld hf le gt gu paragraph-image\">\n<div class=\"mu mv dq mw cf mx\" tabindex=\"0\" role=\"button\">\n<div class=\"gt gu my\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-21783 size-full\" src=\"https:\/\/www.crim.ca\/wp-content\/uploads\/2018\/07\/manipulating-categorial-variables-4.jpg\" alt=\"Manipulating categorical variables in a dataset\" width=\"911\" height=\"372\" srcset=\"https:\/\/www.crim.ca\/wp-content\/uploads\/2018\/07\/manipulating-categorial-variables-4.jpg 911w, https:\/\/www.crim.ca\/wp-content\/uploads\/2018\/07\/manipulating-categorial-variables-4-300x123.jpg 300w, https:\/\/www.crim.ca\/wp-content\/uploads\/2018\/07\/manipulating-categorial-variables-4-768x314.jpg 768w\" sizes=\"(max-width: 911px) 100vw, 911px\" \/><\/div>\n<\/div><figcaption class=\"lh bm gv gt gu li lj bn b bo bp co\" data-selectable-paragraph=\"\">Illustration of impact encoding.<\/figcaption><\/figure>\n<div><\/div>\n<p id=\"746e\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">As for any stacking method, there is a substantial risk of overfitting. Indeed, the new descriptors can be highly correlated with the target, giving overly optimistic results. 
It is then necessary to calculate the impact on a small data set distinct from the training set used for the model. Usually, a cross-validation method is used.<\/p>\n<p id=\"9364\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">These methods are applied in <strong>Catboost<\/strong>, an implementation of the gradient boosting algorithm maintained by Yandex. By default, all categorical variables with more than two categories are encoded by impact encoding (or variants).<\/p>\n<p id=\"cc31\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">In Catboost, impact encoding is only performed with binary target variables (0\/1). In a case where the problem is not a binary classification, the target variable is transformed into several binary variables, and, for each of these variables, an impact encoding is performed. 
For example, if, for a classification problem, the target variable is categorical, it is converted into K binary variables through one-hot encoding.<\/p>\n<p id=\"7b8c\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\"><strong>Advantages:<\/strong> The number of descriptors produced does not depend on the number of categories; this is a relatively simple form of encoding.<\/p>\n<p id=\"728b\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\"><strong>Disadvantages:<\/strong> Complex implementation due to the risk of overfitting (libraries like vtreat or catboost are recommended).<\/p>\n<h2 id=\"ad03\" class=\"lm ln jd bn lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj gi\">Embedding methods<\/h2>\n<p id=\"ad4b\" class=\"pw-post-body-paragraph ka kb jd kc b kd mk kf kg kh ml kj kk kl mm kn ko kp mn kr ks kt mo kv kw kx iw gi\" data-selectable-paragraph=\"\">This method uses deep learning techniques; it draws inspiration from models like <a href=\"https:\/\/arxiv.org\/pdf\/1301.3781.pdf\" target=\"_blank\" rel=\"noopener\">word2vec<\/a>\u00a0on textual data and yields very impressive results.<\/p>\n<p id=\"64a8\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">It consists of representing each modality of a categorical variable as a numerical vector of fixed size e. Thus, for e = 2, the Dior modality of the brand variable in our example could be represented by the vector (0.54, 0.28). The use of embeddings allows, among other things, a reduction in dimensionality, since the size e of the vector can be selected to be very small compared to the number of modalities. 
Indeed, rather than creating a variable for each modality, only e variables are created.<\/p>\n<p id=\"7453\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\">Concretely, these embeddings are obtained by training a neural network (often a multilayer perceptron) with only the categorical variables as input. First, a one-hot encoding is applied to the variable so that it can be fed to the network&#8217;s input. Generally, one or two hidden layers are sufficient, and the first hidden layer has e neurons. The network is trained on the same task as the one initially defined, and the output of the first hidden layer constitutes the vector of embeddings ([10]). This vector is subsequently concatenated with the initial data (creating e new variables), which are then used to fit the final model. There are different variants in the literature on how to obtain these embeddings. Moreover, nothing prevents one from using more than two hidden layers and retaining the output of the second rather than the first. 
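The role of the first hidden layer can be seen in a small numpy sketch: with a one-hot input, the layer's weight matrix acts as a lookup table of embeddings. The weights here are random for illustration only; in practice they are learned by training the network on the task, as described above:

```python
import numpy as np

rng = np.random.default_rng(0)

K, e = 5, 2                      # 5 modalities, embedding size e = 2 (illustrative)
W = rng.normal(size=(K, e))      # first hidden layer weights, shape (K, e)

# One-hot vector for the modality with index 3
x = np.zeros(K)
x[3] = 1.0

# Pre-activation of the first hidden layer: for a one-hot input,
# this is simply row 3 of W, i.e. the embedding of that modality
embedding = x @ W
```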
Also, the network can be trained on a task other than the initial one.<\/p>\n<figure class=\"la lb lc ld hf le gt gu paragraph-image\">\n<div class=\"gt gu na\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-21785 size-full\" src=\"https:\/\/www.crim.ca\/wp-content\/uploads\/2018\/07\/manipulating-categorial-variables-5.jpg\" alt=\"Manipulating categorical variables in a dataset\" width=\"482\" height=\"373\" srcset=\"https:\/\/www.crim.ca\/wp-content\/uploads\/2018\/07\/manipulating-categorial-variables-5.jpg 482w, https:\/\/www.crim.ca\/wp-content\/uploads\/2018\/07\/manipulating-categorial-variables-5-300x232.jpg 300w\" sizes=\"(max-width: 482px) 100vw, 482px\" \/><\/div><figcaption class=\"lh bm gv gt gu li lj bn b bo bp co\" data-selectable-paragraph=\"\">Multilayer perceptron with two hidden layers (image drawn from [11]).<\/figcaption><\/figure>\n<p id=\"18fb\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\"><strong>Advantages:<\/strong> Embeddings allow a reduction of the dimension; in some cases, their use gives significantly better performances. In addition, they can avoid introducing bias during the reduction of modalities through in-depth knowledge.<\/p>\n<p id=\"3c57\" class=\"pw-post-body-paragraph ka kb jd kc b kd ke kf kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx iw gi\" data-selectable-paragraph=\"\"><strong>Disadvantages:<\/strong> The need to train a neural network may slow down some users (unfamiliar with deep learning), especially if the final model selected is a simple model such as linear regression. 
Also, we lose the interpretability of categorical variables.<\/p>\n<h2 id=\"1a77\" class=\"lm ln jd bn lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj gi\">Conclusion<\/h2>\n<div><\/div>\n<p id=\"ddf8\" class=\"pw-post-body-paragraph ka kb jd kc b kd mk kf kg kh ml kj kk kl mm kn ko kp mn kr ks kt mo kv kw kx iw gi\" data-selectable-paragraph=\"\">Categorical variables are widespread in data and should receive careful attention. Indeed, their proper treatment can significantly improve the performance of a model. It should be noted that the techniques presented here are only a few strategies to explore and that there are many others. Also, when faced with a new task, it is necessary to test the different encodings to see which ones give the best results.<\/p>\n<h2 id=\"c726\" class=\"lm ln jd bn lo lp lq lr ls lt lu lv lw lx ly lz ma mb mc md me mf mg mh mi mj gi\">References<\/h2>\n<ol class=\"\">\n<li id=\"390e\" class=\"nb nc jd kc b kd mk kh ml kl nd kp ne kt nf kx ng nh ni nj gi\" data-selectable-paragraph=\"\"><a class=\"au mz\" href=\"https:\/\/roamanalytics.com\/2016\/10\/28\/are-categorical-variables-getting-lost-in-your-random-forests\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/roamanalytics.com\/2016\/10\/28\/are-categorical-variables-getting-lost-in-your-random-forests\/<\/a><\/li>\n<li id=\"ffdb\" class=\"nb nc jd kc b kd nk kh nl kl nm kp nn kt no kx ng nh ni nj gi\" data-selectable-paragraph=\"\"><em>vtreat:<\/em> a data.frame Processor for Predictive Modeling, an article associated with the vtreat descriptor engineering library, used as the primary reference for this post: <a class=\"au mz\" href=\"https:\/\/arxiv.org\/pdf\/1611.09477.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/arxiv.org\/pdf\/1611.09477.pdf<\/a><\/li>\n<li id=\"b230\" class=\"nb nc jd kc b kd nk kh nl kl nm kp nn kt no kx ng nh ni nj gi\" data-selectable-paragraph=\"\"><em>Transforming categorical features to numerical 
features<\/em>, from Catboost documentation: <a class=\"au mz\" href=\"https:\/\/tech.yandex.com\/catboost\/doc\/dg\/concepts\/algorithm-main-stages_cat-to-numberic-docpage\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/tech.yandex.com\/catboost\/doc\/dg\/concepts\/algorithm-main-stages_cat-to-numberic-docpage\/<\/a><\/li>\n<li id=\"7a3e\" class=\"nb nc jd kc b kd nk kh nl kl nm kp nn kt no kx ng nh ni nj gi\" data-selectable-paragraph=\"\">Tutorial on impact encoding for categorical variables: <a class=\"au mz\" href=\"https:\/\/github.com\/Dpananos\/Categorical-Features\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/github.com\/Dpananos\/Categorical-Features<\/a><\/li>\n<li id=\"7ccd\" class=\"nb nc jd kc b kd nk kh nl kl nm kp nn kt no kx ng nh ni nj gi\" data-selectable-paragraph=\"\"><em class=\"ky\">Efficient Estimation of Word Representations in Vector Space<\/em>:\u00a0<a class=\"au mz\" href=\"https:\/\/arxiv.org\/pdf\/1301.3781.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/arxiv.org\/pdf\/1301.3781.pdf<\/a><\/li>\n<li id=\"efcb\" class=\"nb nc jd kc b kd nk kh nl kl nm kp nn kt no kx ng nh ni nj gi\" data-selectable-paragraph=\"\">\u201cBinarization\u201d page of the Catboost documentation: <a class=\"au mz\" href=\"https:\/\/tech.yandex.com\/catboost\/doc\/dg\/concepts\/binarization-docpage\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/tech.yandex.com\/catboost\/doc\/dg\/concepts\/binarization-docpage\/<\/a><\/li>\n<li id=\"3e9f\" class=\"nb nc jd kc b kd nk kh nl kl nm kp nn kt no kx ng nh ni nj gi\" data-selectable-paragraph=\"\"><a class=\"au mz\" href=\"https:\/\/books.google.fr\/books?id=MBPaDAAAQBAJ&amp;pg=PT102#v=onepage&amp;q&amp;f=false\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/books.google.fr\/books?id=MBPaDAAAQBAJ&amp;pg=PT102#v=onepage&amp;q&amp;f=false<\/a><\/li>\n<li id=\"3aa9\" class=\"nb nc jd kc b kd nk kh nl kl nm kp nn kt no kx ng nh ni nj gi\" 
data-selectable-paragraph=\"\"><a class=\"au mz\" href=\"https:\/\/kaggle2.blob.core.windows.net\/forum-message-attachments\/225952\/7441\/high%20cardinality%20categoricals.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/kaggle2.blob.core.windows.net\/forum-message-attachments\/225952\/7441\/high%20cardinality%20categoricals.pdf<\/a><\/li>\n<li id=\"bc8e\" class=\"nb nc jd kc b kd nk kh nl kl nm kp nn kt no kx ng nh ni nj gi\" data-selectable-paragraph=\"\"><em>Weight of Evidence<\/em>, a tool mainly used in econometrics, is similar to impact encoding: <a class=\"au mz\" href=\"https:\/\/www.listendata.com\/2015\/03\/weight-of-evidence-woe-and-information.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/www.listendata.com\/2015\/03\/weight-of-evidence-woe-and-information.html<\/a><\/li>\n<li id=\"f6d5\" class=\"nb nc jd kc b kd nk kh nl kl nm kp nn kt no kx ng nh ni nj gi\" data-selectable-paragraph=\"\">Entity Embeddings of Categorical Variables: <a class=\"au mz\" href=\"https:\/\/arxiv.org\/pdf\/1604.06737.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/arxiv.org\/pdf\/1604.06737.pdf<\/a><\/li>\n<li id=\"73ac\" class=\"nb nc jd kc b kd nk kh nl kl nm kp nn kt no kx ng nh ni nj gi\" data-selectable-paragraph=\"\"><a class=\"au mz\" href=\"https:\/\/commons.wikimedia.org\/wiki\/File:Perceptron_4layers.png\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/commons.wikimedia.org\/wiki\/File:Perceptron_4layers.png<\/a><\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>Broadly speaking, a dataset (excluding textual data and images) has two types of variables: quantitative and qualitative. As early as antiquity, the concept of categories was formalized by Aristotle in his book, Categories. A quantitative variable is a variable that admits numerical values, continuous or discrete. 
For example, the height of an individual, the salary [&hellip;]<\/p>\n","protected":false},"author":18,"featured_media":16720,"menu_order":0,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":"","_links_to":"","_links_to_target":""},"mots_cles":[529,528,527],"categorie_blogue":[457],"class_list":["post-21772","blogue","type-blogue","status-publish","format-standard","has-post-thumbnail","hentry","mots_cles-categorical-variables","mots_cles-coding","mots_cles-structured-data","categorie_blogue-data-science"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.crim.ca\/en\/wp-json\/wp\/v2\/blogue\/21772","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.crim.ca\/en\/wp-json\/wp\/v2\/blogue"}],"about":[{"href":"https:\/\/www.crim.ca\/en\/wp-json\/wp\/v2\/types\/blogue"}],"author":[{"embeddable":true,"href":"https:\/\/www.crim.ca\/en\/wp-json\/wp\/v2\/users\/18"}],"version-history":[{"count":6,"href":"https:\/\/www.crim.ca\/en\/wp-json\/wp\/v2\/blogue\/21772\/revisions"}],"predecessor-version":[{"id":21800,"href":"https:\/\/www.crim.ca\/en\/wp-json\/wp\/v2\/blogue\/21772\/revisions\/21800"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.crim.ca\/en\/wp-json\/wp\/v2\/media\/16720"}],"wp:attachment":[{"href":"https:\/\/www.crim.ca\/en\/wp-json\/wp\/v2\/media?parent=21772"}],"wp:term":[{"taxonomy":"mots_cles","embeddable":true,"href":"https:\/\/www.crim.ca\/en\/wp-json\/wp\/v2\/mots_cles?post=21772"},{"taxonomy":"categorie_blogue","embeddable":true,"href":"https:\/\/www.crim.ca\/en\/wp-json\/wp\/v2\/categorie_blogue?post=21772"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}