# Data Preparation for Machine Learning: Vector Spaces

Machine learning algorithms often rely on certain assumptions about the data being mined. Ensuring data meet these assumptions can be a significant portion of the preparation work before model training and predicting begins. When I began my data science journey, I was blissfully unaware of this and thought my preparation was done just because I had stuffed everything into a table. Feeding these naïvely compiled tables into learners I wondered why some algorithms never seemed to perform well for me. As I began digging into the algorithms themselves, many referred to vector space operations as being fundamental to their function. Answering the simple question of “what is a vector space” took me on quite an enlightening journey. Vector spaces are really a neat tool for organizing data and require more than getting all your data into a table. Perhaps most mind-bending is that a vector space can be constructed from anything, even if mortals like myself tend to just use numbers and categorical values.

I believe part of the reason I was late to understanding vector spaces is that vectors are a much easier concept, and it was an easy progression to assume that a vector space is just a pile of vectors (in the same coordinate system) which can be represented as a table. But not every collection of vectors conforms to the rules of a vector space. The rules are what make vector spaces useful – if you don’t conform to the rules, downstream transformations distort the “space” incorrectly. That means the relationships within the vector space will be inconsistent following transformation, and therefore any machine learning algorithm which performs transformations will introduce error into the relationships it is trying to learn.

Let’s use an example. Perhaps we have data about countries over time for a few measures, like agricultural yield and precipitation. We can represent this data as vectors defined by using the attributes as the coordinate system (or dimensions). Some sample vectors might be:

Vector 1 –
Country: Atlantis
Year: 2015
Agricultural Yield: 5 tons
Total Precipitation: 100 cm/m2

Vector 2 –
Country: Avalon
Year: 2015
Agricultural Yield: 20 tons
Total Precipitation: 50 cm/m2

These vectors can be easily written as a table (I added two additional vectors for our example):

The danger is assuming that we now have a vector space. This is where the rules of a vector space start to matter. There are a bunch of rules, but I don’t think folks dealing with empirical data samples often encounter situations where you need to worry about most of them. The ones you do need to worry about I believe aren’t hard to understand with a little “common sense” interpretation. You must be able to add any combination of vectors (any rows in our table) in any order, and the result must “make sense” and be consistent (i.e. changing the order of addition doesn’t change the result).

Since I am not dealing with strictly mathematical concepts in my interpreted rules, we should spend some time defining terms. Specifically, what I mean by having a result “make sense.” I like to use this phrasing because data science is often as much about perspective and how data is represented as it is about mathematical precision. It turns out that you can have a vector space which follows all the mathematical rules but which doesn’t make as much sense as the same data represented in a different way in another vector space. Even though both vector spaces represent the exact same data, machine learning algorithms will usually be more accurate when trained on the space that “makes the best sense.” So, what makes sense? Consider the result of adding vectors 1 and 2 from the above table (recall that to add any two vectors just add the values in the same column):

The new vector that is created by the addition of vectors 1 and 2 still lives in the same coordinate system, but gives us a strange result. The addition seems to have predicted the rain and yield in the year 4030 for a strange combined country entity called “Atlantis+Avalon”. This doesn’t align with my expectation that when adding the yearly production of two countries I should just get their combined production for the year. The result from adding two rows from our current table doesn’t make sense, so the table can’t be a vector space. Even more telling, if you reverse the order of addition the combined country entity changes, becoming “Avalon+Atlantis”, again violating the vector space rules.
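To make the failure concrete, here is a minimal Python sketch of that column-by-column addition using only the two sample vectors listed earlier. The `add_values` and `add_rows` helpers are hypothetical names of my own, and joining country names with "+" is an assumption about what "adding" two names means.

```python
# The two sample vectors from the text, stored as they sit in the raw table.
atlantis = {"Country": "Atlantis", "Year": 2015, "Yield": 5, "Precipitation": 100}
avalon = {"Country": "Avalon", "Year": 2015, "Yield": 20, "Precipitation": 50}


def add_values(x, y):
    # Assumption: "adding" two names means joining them with "+".
    if isinstance(x, str):
        return x + "+" + y
    return x + y


def add_rows(a, b):
    """Add two rows column by column, as the text describes."""
    return {column: add_values(a[column], b[column]) for column in a}


forward = add_rows(atlantis, avalon)
backward = add_rows(avalon, atlantis)

print(forward)              # Year is 4030, Country is "Atlantis+Avalon"
print(backward)             # Country is now "Avalon+Atlantis"
print(forward == backward)  # False: the order of addition changed the result
```

Yield and Precipitation add up sensibly, while Year and Country are the columns that break both the "makes sense" test and the order-of-addition test.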

There are several methods to get the vector space we need. Looking at the bad result in the previous example, we see that the issue is with the combined country entity and the future year. The Yield and Precipitation columns seem to have combined in an understandable way. So, our task is to represent the Country and Year dimensions of our vectors differently. Let’s walk through 4 options that will help with our example.

Option 1: One-hot Encoding

One representation option for the problem columns in our example is called “one-hot encoding,” which is a way of taking any enumerated list of values and pivoting the values into dimensions. This means creating a column for every possible value, and using a true/false flag (or 1/0) to represent which value is “active” for a row. Using this method, the prepared table would look like:

And the addition of the first two rows becomes:

This result is mathematically valid, and makes sense… with a little study. We can see that the combined result is from 2 different countries, both from 2015. If we reverse the order of the addition the result is the same. So, this is a vector space. Since this still takes a bit of study to really interpret, perhaps another representation should be considered to get the highest-performing model. In this case I don’t like how this space keeps countries and years separate from each other with purely binary on/off values. Without discussing the specifics, this makes it harder for an algorithm to learn about relationships between countries in the same year, or between years for the same country. Keep in mind that one-hot encoding creates lots of dimensions, which can dramatically increase model processing resource requirements (time and/or computation) – the curse of dimensionality.
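A rough sketch of one-hot encoding in plain Python, again using only the two sample vectors from earlier. The `one_hot` and `add_vectors` helpers and the `Column=value` naming are my own illustrative choices, not a particular library's API.

```python
rows = [
    {"Country": "Atlantis", "Year": 2015, "Yield": 5, "Precipitation": 100},
    {"Country": "Avalon", "Year": 2015, "Yield": 20, "Precipitation": 50},
]


def one_hot(rows, categorical):
    """Pivot every value of each categorical column into its own 0/1 column."""
    levels = {c: sorted({r[c] for r in rows}) for c in categorical}
    encoded = []
    for r in rows:
        out = {}
        for c in categorical:
            for level in levels[c]:
                out[f"{c}={level}"] = 1 if r[c] == level else 0
        for key, value in r.items():
            if key not in categorical:
                out[key] = value  # numeric measures pass through unchanged
        encoded.append(out)
    return encoded


def add_vectors(a, b):
    """Plain coordinate-wise addition; every column is now numeric."""
    return {key: a[key] + b[key] for key in a}


v1, v2 = one_hot(rows, ["Country", "Year"])
total = add_vectors(v1, v2)

print(total)
print(add_vectors(v1, v2) == add_vectors(v2, v1))  # True: order no longer matters
```

The sum carries a 1 under each country flag and a 2 under `Year=2015`, which is the "two different countries, both from 2015" reading described above.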

Option 2: Change Coordinate System

Another option could be using a different coordinate system for some of your data points. In our example, the Year is a good candidate for such a transformation if our analysis is really interested in estimating the production from a period of time. In this case, we may not care about the time period’s “name” (e.g. “2015” or “2014”), as long as we can predict the correct amount of production from our model for a time period given the other attributes (i.e. “Country” and the other columns from our example). A table with this representation might convert the Year into a duration, and one-hot encode the country:

We no longer know which named year each vector is representing, but we do know the amount of time which is represented by each vector. When we add the first two vectors we now get:

This makes sense, but a machine will still have some trouble sorting out the contribution to production from each country during the time period since there isn’t a one-to-one alignment. This issue is mitigated by having lots of year-duration input vectors which, when combined in different groupings, inform the model on individual country contributions to production.
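Here is one way the duration idea could be sketched in Python. The column name `Duration (years)` and the assumption that each source row covers exactly one year are mine; the original table may have been laid out differently.

```python
rows = [
    {"Country": "Atlantis", "Year": 2015, "Yield": 5, "Precipitation": 100},
    {"Country": "Avalon", "Year": 2015, "Yield": 20, "Precipitation": 50},
]
countries = sorted({r["Country"] for r in rows})


def to_duration_vector(row):
    """One-hot encode the country and swap the named year for a duration."""
    vector = {f"Country={c}": int(row["Country"] == c) for c in countries}
    vector["Duration (years)"] = 1  # assumption: each source row spans one year
    vector["Yield"] = row["Yield"]
    vector["Precipitation"] = row["Precipitation"]
    return vector


def add_vectors(a, b):
    return {key: a[key] + b[key] for key in a}


v1, v2 = (to_duration_vector(r) for r in rows)
total = add_vectors(v1, v2)

print(total)
print(total == add_vectors(v2, v1))  # True
```

The summed vector reports 25 tons over a duration of 2, with both country flags set, so nothing in it says how much each country contributed. That is the alignment problem mentioned above.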

Option 3: Combine Dimensions

Another option could be to combine our columns. For example, the enumerated value columns could have their values distributed to the other columns, creating columns for every combination. Doing this just with Country in our example, and one-hot encoding the Year, this looks like:

This representation is already adding vectors together and mapping multiple vectors from my original table into a single vector in this table. When we add two vectors together from this latest table we get:

This addition makes sense and the order of addition can be reversed (note that the “Vector ID” is not actually part of our data, just a tracking column). What I like about this representation is that the relationships between countries in a given year can now be learned more easily. We still have trouble learning relationships across years since they are one-hot encoded. A downside of this method is that if you keep creating “combination columns” for every combination of interest you wind up with all of your data in very few really long rows, which is useless since you tend to want lots of rows for an algorithm to train on. The curse of dimensionality rears its ugly head with this method as well.
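The pivot behind this option can be sketched as follows, using only the two sample vectors from earlier. The `combine_by_year` helper and the "Country Measure" column naming are hypothetical; the point is just that several original rows collapse into one wide row per year.

```python
rows = [
    {"Country": "Atlantis", "Year": 2015, "Yield": 5, "Precipitation": 100},
    {"Country": "Avalon", "Year": 2015, "Yield": 20, "Precipitation": 50},
]
years = sorted({r["Year"] for r in rows})
measures = ("Yield", "Precipitation")


def combine_by_year(rows):
    """Build one wide row per year with a column per country/measure pair."""
    wide = {}
    for row in rows:
        # Start each year's row with its one-hot year flags.
        target = wide.setdefault(
            row["Year"], {f"Year={y}": int(y == row["Year"]) for y in years}
        )
        for measure in measures:
            target[f"{row['Country']} {measure}"] = row[measure]
    return [wide[y] for y in years]


wide_rows = combine_by_year(rows)
print(len(rows), "->", len(wide_rows))  # 2 -> 1
print(wide_rows[0])
```

Two input rows became a single five-column row, which shows both the appeal (countries sit side by side within a year) and the cost (fewer, longer rows) described above.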

Option 4: Tensor Representation

Tensors can loosely be thought of as vector spaces of vector spaces (remember how I mentioned a vector space can be made of anything?). Much like folks will mistake any table for a vector space, people often assume that any multi-dimensional matrix or array is a tensor. This is not the case, because vector space rules must be followed. Vector spaces alone are powerful, but tensors really get to a whole new level of power and are needed to study relativity and quantum mechanics. I am glad we have folks like Ricci-Curbastro and Levi-Civita who could invent such math, and Einstein who put it to work in relativity, because the leap from a vector space to a dual space would never have occurred to me. For those who find math’s abstraction beautiful, I recommend watching Pavel Grinfeld’s lecture introducing tensors (https://youtu.be/e0eJXttPRZI) and the series of lectures on tensors from “XylyXylyX” (https://youtu.be/_pKxbNyjNe8).

The downside of using tensors is that most machine learning algorithms are designed to handle low-rank tensors. A tensor’s rank, or order, is the number of indices needed to reference a single value in the tensor. A scalar is a rank-0 tensor, a vector is rank-1, a matrix (such as our table of vectors) is rank-2, and beyond this tensors are referred to only by their rank and are considered high-rank tensors. The word “dimension” gets overused in data science, referring to both the number of coordinates in a vector and the number of directions needed to describe a tensor. This is not a big deal because ultimately both concepts refer to an indexing system and one can be converted into the other. Squeezing high-rank tensors into low-rank algorithms is an active area of study, as is the creation of new algorithms to work on high-rank tensors directly.
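As a small illustration of rank as "how many indices you need", the sketch below counts the nesting depth of plain Python lists. The `order` helper is hypothetical and assumes regular, non-empty nesting; it only measures the shape of an array, so it says nothing about whether the vector space rules are actually followed.

```python
def order(value):
    """Count how many indices are needed to reach a single number."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        value = value[0]  # assumption: nesting is regular and non-empty
    return depth


scalar = 5                            # one yield value
vector = [5, 100]                     # yield and precipitation for one row
table = [[5, 100], [20, 50]]          # one row per sample vector
stack = [[[5, 100]], [[20, 50]]]      # country x year x measure

for name, value in [("scalar", scalar), ("vector", vector),
                    ("table", table), ("stack", stack)]:
    print(name, order(value))
```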

A tensor representation of our example data can be visually depicted as a collection of tables. This is because a table can only easily show two directions (dimensions) at a time if we don’t want to have arrays in each cell. In our example, we have 3 dimensions – Countries, Years, and Measures. If we make our tables “Measures by Years”, then we will need to have a table for each country. This looks like:

To validate that we are following vector space rules, we can add two vectors specified by any combination of our 3 dimensions. For example, adding (Country=Atlantis, Year=2015, Measure=Yield) and (Country=Avalon, Year=2015, Measure=Yield) results in 5+20, or 25. Since our dimensions are no longer cells in our table, they don’t combine using the same addition that caused our original table problems. However, you might ask where the “Atlantis and Avalon” combination lives in our tensor. The answer is that “Atlantis and Avalon” is an index within our Country dimension, and since the value for that combination doesn’t change with the order they are added, vector space rules are maintained. When materialized using the “Measures by Years” table for each country, the way we visualized our tensor previously, this looks like:
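A minimal sketch of that rank-3 layout with nested lists, using only the two sample vectors (so the Year axis has a single entry). The `cell` and `combine` helpers and the axis ordering (country, then year, then measure) are my own assumptions for illustration.

```python
# Axis labels act as the indexing system for each of the three directions.
countries = ["Atlantis", "Avalon"]
years = [2015]
measures = ["Yield", "Precipitation"]

# tensor[country][year][measure]
tensor = [
    [[5, 100]],   # Atlantis
    [[20, 50]],   # Avalon
]


def cell(country, year, measure):
    """Look up one value by its three labels."""
    return tensor[countries.index(country)][years.index(year)][measures.index(measure)]


def combine(country_a, country_b):
    """Element-wise sum of two country slices, giving the combined country's table."""
    a = tensor[countries.index(country_a)]
    b = tensor[countries.index(country_b)]
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


total_yield = cell("Atlantis", 2015, "Yield") + cell("Avalon", 2015, "Yield")
print(total_yield)  # 25

combined = combine("Atlantis", "Avalon")
print(combined)                                   # [[25, 150]]
print(combined == combine("Avalon", "Atlantis"))  # True
```

Because the country and year labels live in the indexing system rather than in the cells, only the measures are ever added, and the result is the same in either order.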

Everything makes sense, at least until you try working with a low-rank machine learning algorithm. Methods for getting this rank-3 tensor into machine learning algorithms that operate on rank-2 tensors are outside the scope of this discussion. This is an area NuWave specializes in, and there are some tremendous benefits to working with tensors, including being robust against sparsity and dirty data.

Each of the 4 options discussed helps a data scientist get from raw source data to a vector space more compatible with machine learning algorithms, and each has pros and cons. The right method will depend on your data, use case, resources (time, compute, and storage availability), and goals. These methods are not mutually exclusive, and can be combined in different ways. This is why the data scientist needs to be creative and inquisitive – there might be a different way to look at your problem to find a better solution.

Be aware that getting a vector space is not always a requirement (e.g. decision trees can work without one), and that data preparation is rarely finished once you have a vector space (scaling values, normalizing distributions, and many other feature engineering and selection tasks should be considered). Foundations run deep, and a well-built vector space will hold up when we place lots of machine learning machinery on top of it.
