What is the newfangled topic of data science really about? If you don’t know, check out (or go reread) Lesson 1: What is Data Science? What do I look for to spot one of these reclusive data scientists? If you’re unsure, check out (or go reread) Lesson 2: What Makes a Data Scientist? If you do know the answers to these questions, you have internalized the previous content from this series and are ready to start learning about some of the tools and sorcery that data scientists use to bamboozle unbelievers. Our first foray into this mysticism lies in the misunderstood realm of Machine Learning.
According to Merriam-Webster, a machine is “a constructed thing whether material or immaterial” and learning is “knowledge or skill acquired by instruction or study.” We can logically join these two disparate linguistic machinations to deduce that Machine Learning is when a constructed thing gains in skill through instruction or study. Perhaps surprisingly, that is a fairly robust description of what machine learning is. Even though we may use different technical terms, for example referring to accuracy and precision rather than skill, the gist of the language is the same.
On a more pedantic level, machine learning, often abbreviated ML by those who enjoy shorthand, is a subfield of artificial intelligence that focuses on statistical models and algorithms that perform a specific task without using explicit instructions. To do this, a sample of data is used to train the model to come to a conclusion without the need for step-by-step instructions. This process is often advantageous when the steps to arrive at a conclusion are not necessarily straightforward or even known. In Deep Learning with Python, François Chollet gives the wonderful comparison:
In classical programming, the paradigm of symbolic AI, humans input rules (a program) and data to be processed according to these rules, and out come answers. With machine learning, humans input data as well as the answers expected from the data, and out come the rules.
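To make that contrast concrete, here is a minimal sketch in Python. The temperature example and the function names are illustrative assumptions of ours, not something taken from Chollet's book. The first function is classical programming: a human writes the rule down. The second is handed only data and the expected answers, and it recovers the rule by fitting a straight line with ordinary least squares.

```python
from statistics import mean

def classical_program(celsius):
    """Classical programming: a human supplies the rule, data goes in, answers come out."""
    return celsius * 9 / 5 + 32

def learn_rule(inputs, answers):
    """Machine learning in miniature: data and answers go in, a rule comes out.

    Fits answer = slope * input + intercept by ordinary least squares.
    """
    x_bar, y_bar = mean(inputs), mean(answers)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(inputs, answers))
    variance = sum((x - x_bar) ** 2 for x in inputs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Illustrative data: Celsius readings and the Fahrenheit answers we expect.
celsius_readings = [-10.0, 0.0, 15.0, 30.0, 100.0]
fahrenheit_answers = [14.0, 32.0, 59.0, 86.0, 212.0]

slope, intercept = learn_rule(celsius_readings, fahrenheit_answers)
print(f"learned rule: fahrenheit = {slope:.2f} * celsius + {intercept:.2f}")
```

The learned slope and intercept should land very close to the 9/5 and 32 that a human would have typed in by hand, which is the point: the rule came out of the data rather than going in with it.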
Let us pull a page from Einstein’s book and perform, via this email, a Gedankenexperiment (“thought experiment”). One day you are sitting in a supermarket crying because you can never choose a good cantaloupe for purchase. A wizened figure in robes tells you, “By their sound, ye shall know them.” So, you thank Jeffrey the store clerk and do the logical thing: you purchase every cantaloupe in the store.
Arriving home with your mountain of fruit, you construct a knocking device that records the sound each melon makes when tapped. Jeffrey didn’t tell you what sound good melons make. So, after thumping each melon, you taste it and file the recording under delicious or nasty, which you do know how to judge.
Once you have eaten lots of cantaloupes, you are ready to listen to the sounds. How do you categorize the sound of a good melon? Hollow? Round? Muffled? Metallic? No universal word fully captures how a good melon sounds to you compared with a bad one. So, you tell a computer to listen. It measures the sounds as it perceives them. You tell it which recordings came from good melons and which from bad, then let it decide which category a new sound belongs in. Your thumping machine can then tell you whether a cantaloupe in the store will taste good. This is machine learning, or more specifically, a form of machine learning referred to as a classification algorithm.
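For the curious, the thumping machine can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the two sound measurements (pitch and ring time), the invented numbers, and the choice of a nearest-centroid classifier, which is just one simple classification method among many. The sketch averages the recordings in each pile to get a "typical" delicious thump and a "typical" nasty thump, then assigns a new thump to whichever typical sound it most resembles.

```python
from math import dist
from statistics import mean

# Hypothetical recordings: (dominant pitch in Hz, ring duration in seconds).
# The numbers are invented for illustration, not measured from real melons.
labeled_thumps = [
    ((180.0, 0.40), "delicious"),
    ((175.0, 0.45), "delicious"),
    ((190.0, 0.42), "delicious"),
    ((260.0, 0.15), "nasty"),
    ((250.0, 0.20), "nasty"),
    ((270.0, 0.18), "nasty"),
]

def train(examples):
    """Average the measurements in each pile to get one typical sound per label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }

def classify(centroids, features):
    """Return the label whose typical sound is closest to the new thump."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

centroids = train(labeled_thumps)

# A new, untasted melon thumped in the store (again, made-up measurements).
store_melon = (185.0, 0.41)
prediction = classify(centroids, store_melon)
print(prediction)
```

Because pitch is measured in hundreds and ring time in fractions of a second, pitch dominates the distance in this toy version; a real system would usually rescale the measurements first.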
Now, what if you thump a squash? An avocado? Well, we didn’t train on those, and they sound different from a cantaloupe. So, our system can’t tell you anything reliable about them. What if your friend thinks you always pick cantaloupes that are overly ripe? In this case, the judgment of which cantaloupes were good was made by you, so “good” was a subjective term. If someone views the term differently, the system may not work for them. The same is true of any machine learning algorithm. It learns only from what you give it. If you need it to answer a broader or more specific question, you need to give it data that covers that question so it can acquire the necessary skill or knowledge.
Machine learning has become a staple of data science as faster hardware and larger datasets have become widely available for minimal cost. Indeed, given the wealth of complex datasets, understanding the patterns without a computer would be ridiculously impractical. This challenge also means, as Chollet points out, that machine learning is largely a hands-on endeavor that gives little thought to the theoretical expanse it empirically explores.
Now that you understand what machine learning is, you may wonder: how would I go about impressing friends and family with my fruit-picking prowess? Well, stay tuned for the next action-packed email, Lesson 4: How Machines Learn, where we discuss methods of training a model, which model is right for you, and how to get ready to train that model!
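The point about subjective labels can be shown with a small Python sketch. The ripeness scores, both sets of labels, and the 1-nearest-neighbor method (copy the label of the most similar melon you have already tasted) are all assumptions chosen for illustration. The melons are identical in both cases; only the taster's opinion changes, and the prediction changes with it.

```python
# Hypothetical ripeness scores (0 = rock hard, 10 = very soft), invented for illustration.
ripeness = [2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]

# The same seven melons, judged by two different tasters.
your_labels = ["nasty", "nasty", "nasty", "good", "good", "good", "good"]
friend_labels = ["nasty", "nasty", "good", "good", "good", "nasty", "nasty"]

def nearest_label(scores, labels, new_score):
    """1-nearest-neighbor: copy the label of the most similar training melon."""
    closest = min(range(len(scores)), key=lambda i: abs(scores[i] - new_score))
    return labels[closest]

new_melon = 8.4  # a fairly ripe melon neither of you has tasted

your_verdict = nearest_label(ripeness, your_labels, new_melon)
friend_verdict = nearest_label(ripeness, friend_labels, new_melon)
print("trained on your taste:  ", your_verdict)
print("trained on friend's taste:", friend_verdict)
```

Neither model is wrong. Each one faithfully reproduces the judgment it was trained on, which is why the definition of “good” has to be settled before the data is labeled.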
Kevin Croxall is Director of Data Science for Expeed Software. He is a data and research scientist with more than a decade of comprehensive experience in data science project design and implementation. He has a broad range of experience in software development geared toward pipeline development, statistical analysis, and data visualization and presentation.