By popular demand from friends and coworkers, here is the most basic introduction to machine learning I can muster. I assume only a passing familiarity with math, and have tried to keep things basic enough that everyone can follow the thought process. Once you understand the principles, you can begin to learn more math (you will need a lot), and ultimately discover the genuinely mind-blowing intellectual bliss that is statistical learning theory.

When I first began to explore the field of machine learning, the subject was shrouded in mystery. It brought to mind sophisticated algorithms, closely-guarded secrets of technology companies and hedge funds, capable of extracting incredible meaning from large volumes of seemingly useless data. It almost seemed like magic to me, as I couldn’t begin to wrap my head around what exactly machine learning was, how it worked, and how I could learn about it. This post is intended to help you understand exactly what machine learning is, and therefore, how it might be useful to you.

**Introduction**

My personal journey into machine learning began in eighth grade, with a half-baked idea to construct, essentially, a chatbot — a small computer program that could simulate another person having a (rather shallow) conversation with the user. I quickly began to realize I was treading in unfamiliar waters — every Wikipedia page I went to, the math looked like alien chicken scratch. I decided my best chance was to try out different approaches until I found something that worked well enough to impress an eighth grade science teacher (read: not that well).

And what I came up with was awful. It basically scanned your sentence for words that it recognized, assigning “happy or sad points” to each word based on whether it held positive or negative connotations, and would respond in a certain way by summing over these values. I did this in anticipation of the kinds of things a bunch of eighth grade kids would likely request I put into the program — computerized insults, I figured, would be expected and appreciated by that audience.

All said and done, it was a pretty pitiful attempt at “machine learning.” There was no learning taking place — I now know, it was a handwritten linear bag-of-words model, and as such perhaps the simplest form of what we might call “artificial intelligence.” It couldn’t learn, it didn’t perform well, and it generated rather silly answers sometimes.

But there are some things to learn from this "model." It's really the mental process that's important. The first thing you have to do is look at your problem on its face and figure out how to convert it into math. Here we have text documents, but you can imagine a similar situation if you are working with images, time series, or anything else. You need to transform your object of study into something we can use to do math.

In this case, I did it by simply looking at each word individually, no matter where it was in the sentence. As it turns out, this is a very popular way for machine learning researchers to transform text (our object of study) into math (which has tools of value to them). It’s called “bag of words.”

**The Representation**

What is bag of words? Well, first we have to talk about what a vector is. A vector is a list of numbers. Each number comes from a particular set (e.g. the reals **R**, the integers **Z**, the complex numbers **C**, the rationals **Q**, etc.). A superscript tells you how many numbers there are in the vector. So, for example, [1, 3, 5] is a vector in **Z**^{3}, and [2.2222…, 3.5, pi] is a vector in **R**^{3}. In the bag-of-words model, a whole text document is represented by a vector of nonnegative integers, with dimension equal to the number of words in the "vocabulary" (that is, the words significant enough to keep track of). The representation is simple: each spot in the vector corresponds to a particular word in the vocabulary, and the number in that spot is the count of that word in the document. So there we have it, one vector per document: we have transformed our object of study into something we can work with. It's a simple representation, but for now, it's the one we'll work with. (And by the way, converting the object of interest into a vector in **R**^{n} of some sort is far and away the most common choice.)

Let’s start with an example of three sentences.

- Sentence A: “The cat chased the mouse.”
- Sentence B: “The cat ate the mouse.”
- Sentence C: “The dog ate the cat.”

Let’s choose a four-word vocabulary V = {“dog”, “cat”, “mouse”, “ate”}. Now our three vectors are as follows:

- Sentence A: [0, 1, 1, 0]
- Sentence B: [0, 1, 1, 1]
- Sentence C: [1, 1, 0, 1]
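The counting above is easy to sketch in code. Here is a minimal bag-of-words vectorizer over the same four-word vocabulary (the function name and the crude punctuation handling are my own illustrative choices):

```python
# A minimal bag-of-words vectorizer over the four-word vocabulary above.
vocabulary = ["dog", "cat", "mouse", "ate"]

def bag_of_words(sentence, vocab):
    # Lowercase, strip periods, and count occurrences of each vocabulary word.
    words = sentence.lower().replace(".", "").split()
    return [words.count(term) for term in vocab]

print(bag_of_words("The cat chased the mouse.", vocabulary))  # [0, 1, 1, 0]
print(bag_of_words("The cat ate the mouse.", vocabulary))     # [0, 1, 1, 1]
print(bag_of_words("The dog ate the cat.", vocabulary))       # [1, 1, 0, 1]
```

Words outside the vocabulary ("the", "chased", "trash") simply vanish, which is exactly the simplification bag of words makes.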

Now that we have vectors, there is a lot we can do with them. For one, we can compare them to each other: since vectors can be thought of as points in a space of the corresponding dimension, we can compare two vectors by looking at how close those points are to each other! That way, we can quantitatively compare text documents. Now we're getting somewhere!

**Getting to the Mathematics**

We’ve figured out how to turn our text documents into vectors, which are mathematical objects of value to us. Now we have tools at our disposal to try to start solving problems involving these text documents.

*Task #1: Searching*

Say we’re given a new document: “The dog ate the trash.” The user wants to find the document from our corpus with the closest subject matter. You can think of this as a “More Like This” feature, such as you would see on search engine results.

So let’s convert that into a bag-of-words vector using the above vocabulary: [1, 0, 0, 1]. Now we need an intuitive way to compare these document vectors to each other. We alluded above to the fact that, since each vector represents a point in 4-dimensional Euclidean space, it would be natural to consider our document similarity to be related to how far these points lie from each other. When I say “far,” I truly mean “distance.” If these were two-dimensional vectors, we would indeed be talking about the distance between the two points when plotted on the graph. To do this, we need the Euclidean distance formula. Yes, it’s basically the same as the Pythagorean Theorem you used in grade school (for obvious reasons).

Now we can create our first model truly capable of learning. It takes a document and finds similar documents by calculating the Euclidean distance between the training points (the first three documents) and the test point (the new one). How do we know this is true "learning"? One common definition of machine learning is a computer program that gets better at performing a task the more training data it's given. And indeed, the more documents you gave this model, the higher the quality of its matches would be.
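Here is a minimal sketch of that search, using the three training vectors and the query vector from above (names like `corpus` and `euclidean` are my own):

```python
import math

# Training documents as bag-of-words vectors (vocabulary: dog, cat, mouse, ate).
corpus = {
    "A": [0, 1, 1, 0],  # "The cat chased the mouse."
    "B": [0, 1, 1, 1],  # "The cat ate the mouse."
    "C": [1, 1, 0, 1],  # "The dog ate the cat."
}
query = [1, 0, 0, 1]    # "The dog ate the trash."

def euclidean(u, v):
    # The Pythagorean-style distance formula, extended to n dimensions.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Find the closest training document to the query.
best = min(corpus, key=lambda name: euclidean(corpus[name], query))
print(best)  # C: "The dog ate the cat." is the nearest match
```

The distances work out to 2 for A, about 1.73 for B, and 1 for C, so "The dog ate the cat." wins, which matches intuition.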

*Task #2: Supervised Learning*

Now, let’s take one more leap. Say each of our three training documents were also “labeled.” They could be labeled however you want: by topic or subject matter, by sentiment (positive or negative), or anything else. Then, when we did the same searching as described above, we would get another result: the labels of the nearest *k* matching objects.

Not surprisingly, this algorithm is called the *k*-nearest neighbor algorithm. Below is a simple visual layout of how the decision process would look in two dimensions:

The dotted and solid circle represent different choices of the parameter *k*, which is essentially a smoothing parameter over the decision space. The green circle is the test point, and the others are the training set. The classification is made on a “majority vote” basis — for this reason, small odd numbers are usually used for *k* (e.g. 1, 3, 5).
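A bare-bones version of that voting procedure might look like this; the 2-d training points below are made up purely for illustration:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest labeled points.
    `train` is a list of (vector, label) pairs."""
    dist = lambda u, v: math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    # Sort the training set by distance to the query and keep the k nearest.
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-d training set: two classes on either side of x = 0.
train = [([-2, 0], "blue"), ([-1, 1], "blue"), ([-1, -1], "blue"),
         ([2, 0], "red"), ([1, 1], "red"), ([1, -1], "red")]
print(knn_predict(train, [0.5, 0.2], k=3))  # red
```

Using an odd k, as the text suggests, guarantees the two-class vote can never tie.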

Despite its disarming simplicity, the *k*-nearest neighbor algorithm is actually quite effective under certain circumstances. In fact, I do know of a hedge fund that uses this algorithm to drive their quantitative investment process. In their particular case, the data has certain properties that make kNN a reasonable choice. Their data are all of the same magnitude, and in the same units, and there is little to no covariance structure between explanatory variables. This prevents many of the usual kNN pitfalls from negatively affecting their results. But while kNN is a good benchmark, machine learning is generally based on much more sophisticated optimization problems representing some intuition about the data.

*Task #3: Unsupervised Learning*

There’s one more machine learning task that is worth mentioning, although unlike the previous two examples, there is no exceedingly simple solution to the problem. It’s called unsupervised learning.

Unsupervised learning solves a task that you may not be immediately familiar with. The most common example is called “clustering.” The idea here is, given *unlabeled* data, to break the set of objects into logical groups. Clustering has a number of different relevant applications. For example, say we have the results to a web search, and we want to group the documents together by topic to make them easier to navigate. We would need a clustering algorithm. Or say we had a list of stocks and we wanted to split them into groups that tend to move together. Again, clustering is a good solution.

The equivalent of k-NN for clustering (read: the simplest possible algorithm) is called k-means. It is an iterative algorithm that starts with an arbitrary set of cluster labels for each object. It then makes pass after pass, reassigning the objects to a new group in order to improve the objective function. In this case, the objective function refers to the average distance between a cluster’s elements and its *centroid*, which is just the average of all vectors in the cluster. After enough rounds have passed or the objective function has reached a sufficiently low level, it terminates.
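The loop described above can be sketched in a few lines. This is a deliberately simplified k-means (it seeds the centroids with the first k points and runs a fixed number of passes rather than testing convergence; the toy 2-d points are invented for illustration):

```python
import math

def kmeans(points, k, rounds=10):
    """Bare-bones k-means: assign each point to its nearest centroid,
    recompute each centroid as the average of its cluster, repeat."""
    centroids = [list(p) for p in points[:k]]  # naive seeding choice
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign the point to the nearest current centroid.
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        for i, members in enumerate(clusters):
            if members:  # centroid = component-wise average of the cluster
                centroids[i] = [sum(xs) / len(members) for xs in zip(*members)]
    return centroids, clusters

# Two obvious groups: one near the origin, one near (10, 10).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Real implementations add smarter seeding (e.g. k-means++) and a convergence check on the objective, but the assign-and-average loop is the whole idea.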

The other common form of unsupervised learning is called “dimensionality reduction” and it solves a problem that may be very new to you. Say each of our feature vectors, unlike the above example, had 1,000 dimensions. In this case, there are so many numbers to consider that we can’t even start analyzing it.

Let’s consider a new example, because it will be easier to mentally visualize. Say we gave a poll to a group of registered voters, asking them to rate how they felt about 1,000 different political issues on a scale from -10 to 10. So each person hands you a vector of dimension 1,000, with each component being a number between -10 and 10.

However, there is definitely more structure to this data set than to, say, a data set of 1,000-dimensional vectors of independently and identically distributed random numbers. Such a data set would have very little structure. But with our political survey, we expect the opposite. Most people will fall into an obvious political party category, either Democrat or Republican. People who are pro-choice are also likely to be anti-war, pro-immigration and pro-gun control.

Dimensionality reduction methods discover and exploit this covariance structure between the variables in the training vectors. It allows you to convert your vectors of 1,000 dimensions into vectors of, say, 4 dimensions, by projecting the data onto a new basis (a linear algebra term) that most efficiently represents the data.

The most common example of this is called Principal Components Analysis, and it is widely considered one of the most exciting results of modern linear algebra. The method proceeds by running an eigenvector decomposition on the covariance matrix of a data set.
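To make that concrete, here is a toy PCA on 2-d data, with the eigen-decomposition of the 2x2 covariance matrix worked out by hand (the data set is invented for illustration):

```python
import math

# Toy 2-d data lying roughly along the line y = x (strong covariance).
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.0)]
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n

# Sample covariance matrix [[a, b], [b, c]].
a = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
c = sum((y - my) ** 2 for _, y in data) / (n - 1)
b = sum((x - mx) * (y - my) for x, y in data) / (n - 1)

# Closed-form eigen-decomposition for a symmetric 2x2 matrix.
mean_diag = (a + c) / 2
spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1 = mean_diag + spread     # largest eigenvalue: the most variance
v = (b, lam1 - a)             # its eigenvector
norm = math.hypot(*v)
principal = (v[0] / norm, v[1] / norm)
print(principal)  # roughly (0.71, 0.70): the y = x direction
```

The first principal component comes out pointing along y = x, exactly the direction in which this data actually varies. In 1,000 dimensions the arithmetic is done by a linear algebra library rather than by hand, but the idea is the same.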

If, for example, the covariance matrix were the sample covariances between a set of traded assets, the eigenvectors represent components of the market that move together. The vectors themselves can be interpreted intuitively, and will often represent sector groups, market capitalization levels, and so on.

**Getting to the Real Thing**

I hope these examples helped you get a better understanding of what machine learning is, and how it could help you. I think looking at these very simple examples helped me understand the big picture, making the rest a lot more logical. The main thought process, to recap, is simply a.) represent your data as mathematical objects, b.) translate an intuition about the data into a mathematical optimization problem, and c.) solve this problem to get your model parameters, and use these to make predictions about new data.

In reality, machine learning researchers use much more complicated algorithms than the ones we went over here. I will touch on just one example, to give you a taste of the kind of thinking we’re dealing with.

It's called a support vector machine, and it's probably my favorite. Much like k-NN, the intuition is simple: find a separating hyperplane (a line, in 2-d) that best separates the data of one category from the data of another. Solving for it requires quadratic programming, and usually transforming the optimization problem into its dual for efficiency.

But here’s the kicker. Since SVMs only rely on dot products between samples, researchers can apply a technique known as the *kernel trick*. This basically means that they can transform a linear algorithm like SVM into a nonlinear one by transforming the data set prior to learning using something called a *kernel*. Kernels take data that may not be linearly separable in the space they live in, and project them into a much higher dimensional space in which the data can be separated linearly. In fact, the space is often infinite dimensional! In college, I wrote a paper on the application of SVMs to long-term investing based on fundamentals, and the SVM performed incredibly well due to its use of kernels.
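The kernel idea can be illustrated without a full SVM. The toy example below uses an explicit feature map, x -> (x, x*x), which is the simplest stand-in for what a kernel does implicitly: data that no single threshold can separate on the line becomes linearly separable in a higher-dimensional space (the data and labels are invented for illustration):

```python
# One-dimensional data that no single threshold can separate:
# class "+" when |x| <= 1, class "-" otherwise.
data = [(-3, "-"), (-2, "-"), (-1, "+"), (0, "+"), (1, "+"), (2, "-"), (3, "-")]

# The feature map x -> (x, x*x) lifts each point into 2-d. There, every
# "+" point has second coordinate <= 1 and every "-" point has second
# coordinate >= 4, so the horizontal line y = 2 separates them cleanly.
mapped = [((x, x * x), label) for x, label in data]
separable = all((y <= 2) == (label == "+") for (_, y), label in mapped)
print(separable)  # True
```

An actual kernel never computes the lifted coordinates explicitly; it only supplies the dot products between lifted points, which is why even infinite-dimensional feature spaces remain tractable.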

**Future Reading**

Fellow blogger Quantivity put together a very nice list of suggested reading, available here, here and here. Those lists should give you some solid background reading, after which I have one simple suggestion: *start reading papers*. Even if you don’t understand them, reading a lot of papers (usually found on Google as PDFs) will help you hone your skills and learn the scientific thought process.

**Conclusion**

Machine learning algorithms represent one of the most exciting advancements in technological capability in a long time, and for the most part they haven’t made their way into the mainstream. In fact, when they do, people rarely realize that machine learning powers what they’re using (perhaps the most famous example of this is Google, powered by an algorithm surprisingly similar to PCA, described above).

The problem for most is that the algorithms often require sophisticated mathematical tools that make them inaccessible to most users. Some solutions currently exist, like MATLAB and Mathematica, but even those require serious math and programming expertise. That’s why I’ll be releasing a product in the next few weeks that will allow financial professionals and investors to swiftly apply machine learning, as well as a large number of other quantitative financial models, to researching, building, optimizing, and managing risk in their portfolios. But more on that later!