Last week we talked about what machine learning is and some of the basics of how it works. The essential feature of machine learning is that computers take the information you present and learn from it. Last time we left that tricksy learning bit a mystery; today we dig into it a bit more. In other words, how do machines learn, and what training is needed, both for the machines and for us humans? For simplicity, we will divide this topic into the three major modes or categories of training: supervised, unsupervised, and semi-supervised learning.
When the machine is instructed, we call that supervised learning. When a machine studies on its own, we call it unsupervised learning. When the machine receives a little instruction, then ventures out to do more in-depth studying, we call it semi-supervised learning. By way of comparison, these could map to a child going to school, a child being homeschooled (with two working parents), and a child enrolled in a self-guided study program where teachers give basic guidance and the student chooses their own path within the framework. Just as these situations work with varying success depending on the predilections of the child, each mode of training works with varying success depending on the goal of the program. Furthermore, each requires a different level of commitment from a teacher.
Supervised learning is a process wherein an algorithm is trained to perform a function based on labeled example data. The machine takes the input data and the desired output and figures out a mapping such that inputs lead to the requested answers. Subsequently, any new item can be ingested and assigned to one of the outputs, based on its similarity to the derived connections. This class of algorithms includes many machine learning staples such as support vector machines, linear regression, logistic regression, Naive Bayes, decision trees, k-nearest neighbors, and more! While this whole process sounds great and is great in practice, it often requires a significant investment in data tagging, the process by which the data are labeled for training.
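To make the "labeled examples in, answers out" idea concrete, here is a minimal sketch of one of the staples named above, k-nearest neighbors, written in plain Python. The customer data, the labels, and the function name `knn_predict` are all made up for illustration; a real project would more likely reach for an established library.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k closest labeled examples."""
    # Distance from the query to every labeled training example.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k nearest neighbors and count how often each label appears.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical tagged data: (visits per month, average spend) -> customer type.
train_points = [(1, 10), (2, 12), (1, 15), (9, 80), (10, 95), (8, 90)]
train_labels = ["occasional", "occasional", "occasional", "loyal", "loyal", "loyal"]

print(knn_predict(train_points, train_labels, (2, 14)))  # occasional
print(knn_predict(train_points, train_labels, (9, 85)))  # loyal
```

Notice that every training point needed a human-supplied label before the machine could do anything useful. That is the data tagging investment in miniature.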
Unsupervised learning has the distinct advantage that tagged and labeled data are not needed. Indeed, there is no right answer in unsupervised learning; there is only the pattern in the data that the algorithm finds. Think of a random lady at a salad bar asking Al Gore to bring her something to eat. He goes to the salad bar and lets the rhythm guide him. The lady, as a result, gets what she gets. She didn’t say “I want a selection of cheeses,” so she can’t be mad at Al when he dances back with a pile of tapenade and a single cracker. All data can be categorized by some structure. Does that structure make sense? Does it help us? Sometimes yes, yes it does; and other times, it’s a solid nope. This type of training is useful on data sets such as transaction records, where similar customers can be grouped and targeted in campaigns. The oft-heard jargon associated with this class of algorithms includes, but is not limited to, clustering, hierarchical clustering, k-means, anomaly detection, some neural networks, and generative adversarial networks. Here you have the benefit of eliminating the need for a human to tag the data, but at the cost of control over the types of answers that may be derived and how those are interpreted.
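For a feel of what "finding the pattern" looks like, here is a small sketch of k-means, one of the clustering methods mentioned above, in plain Python. The transaction-style data and the `kmeans` function are hypothetical, and the random starting centroids are just one simple way to initialize.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Group `points` into k clusters; returns (assignments, centroids)."""
    rng = random.Random(seed)
    # Start from k randomly chosen data points (an assumption of this sketch).
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Nothing moved, so the clustering has settled.
        centroids = new_centroids
    return assignments, centroids


# Hypothetical unlabeled data: (visits per month, average spend).
customers = [(1, 10), (2, 12), (1, 15), (9, 80), (10, 95), (8, 90)]
assignments, centroids = kmeans(customers, k=2)
for customer, cluster in zip(customers, assignments):
    print(customer, "-> cluster", cluster)
```

The algorithm hands back "cluster 0" and "cluster 1", not "occasional" and "loyal". Deciding what each group means, and whether it is useful, is still up to the humans.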
Semi-supervised learning takes a little bit of supervised learning and a smidge of unsupervised learning, shakes them up with some blue curaçao, and tops it all with a garnish (note that the curaçao and garnish are optional). One can think of this as using a small amount of labeled data to train a model that then runs through and classifies unlabeled data. Alternatively, you could envision running an unsupervised algorithm on a large set of unlabeled data and then associating the labeled data with the found patterns. Reinforcement learning is sometimes grouped with this form of training as well, although it is more often treated as a category of its own.
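The first flavor described above, training on a few labeled examples and then labeling the rest, is often called self-training. Below is a minimal plain-Python sketch under some loud assumptions: the model is just a nearest-centroid rule, the data are invented, and the names `centroid` and `self_train` are hypothetical. Each round, the unlabeled point the model is most sure about gets a pseudo-label and joins the training set.

```python
import math


def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def self_train(labeled, unlabeled):
    """Pseudo-label `unlabeled` points, most confident first.

    `labeled` maps each label to a list of example points.
    Returns a dict mapping each unlabeled point to its assigned label.
    """
    labeled = {label: list(points) for label, points in labeled.items()}
    remaining = list(unlabeled)
    pseudo_labels = {}
    while remaining:
        # Re-fit the (very simple) model on everything labeled so far.
        centroids = {label: centroid(pts) for label, pts in labeled.items()}
        # Most confident guess: the point sitting closest to any class centroid.
        best_point, best_label = min(
            ((p, label) for p in remaining for label in centroids),
            key=lambda pair: math.dist(pair[0], centroids[pair[1]]),
        )
        labeled[best_label].append(best_point)
        pseudo_labels[best_point] = best_label
        remaining.remove(best_point)
    return pseudo_labels


# Hypothetical data: only two customers were tagged by a human.
seed_labels = {"occasional": [(1, 10)], "loyal": [(10, 95)]}
untagged = [(2, 12), (1, 15), (3, 30), (9, 80), (8, 90), (7, 70)]

for point, label in self_train(seed_labels, untagged).items():
    print(point, "->", label)
```

Two human-tagged points were enough here because the groups are well separated. With messier data, an early wrong pseudo-label can snowball, which is why real self-training setups usually only accept guesses above a confidence threshold.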
Each of these approaches needs a different setup and a different way of preparing the data used to train the machine. Thus, when choosing between training approaches, the main question to ask yourself is actually the relevant business question. That question, not the resources available, should drive the methodology selection. Once you have some understanding of what data science is, what can be accomplished, and how to approach data sets for training, what remains is a realistic view of where you are in your current process.
So, next read the exciting conclusion to our series, Lesson 5: Where are you on Dr. Croxall’s Scale of Advanced Analytics?!
Kevin Croxall is Director of Data Science for Expeed Software. He is a data and research scientist with more than a decade of comprehensive experience in data science project design and implementation. He has a broad range of experience in software development geared toward pipeline development, statistical analysis, and data visualization and presentation.