Terms like AI, Machine Learning, and Data Science are often used interchangeably in newspapers, businesses, and job advertisements. This creates false expectations about what the underlying algorithms can really do.
It started a few years ago with Data Science, which at first meant little more than performing traditional data analysis on increasingly large datasets.
Most industries already had a Data Science team back then, but the tools they used pertained more to the field of statistics than to computer science. Soon, attention shifted towards computer science, mostly due to the technical challenge of analysing large datasets. Nevertheless, predictive modelling in Data Science, which uses statistics to predict outcomes, remained largely the same as in traditional statistics, relying on methods like principal component analysis, regression, K-means, random forests, and XGBoost. Since these algorithms learn a set of “rules” from data, they have been dubbed Machine Learning algorithms to emphasise that these rules are not hard-coded by the programmer.
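A minimal sketch of this idea, assuming scikit-learn is installed: a random forest induces its decision rules from example data rather than having them written out by the programmer.

```python
# Sketch: a random forest "learns" its rules from data instead of
# having them hard-coded (dataset and model choices are illustrative).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)  # the decision rules are induced from the data
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

The same code with a different dataset learns entirely different rules, which is exactly the point: the programmer specifies the learning procedure, not the rules themselves.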
Machine Learning is not as new as people think: it spans at least 60 years of research and development. Some people think of Machine Learning algorithms as traditional software; in reality, a Machine Learning project is built around data. For this reason, the development process in Machine Learning resembles a research project more than a software development one.
Imagine that you are in your lab doing an experiment; you have an initial idea, a hypothesis or a question you need to verify by collecting data, but you don’t know whether your picture is right, or whether the steps you’re going to follow are correct. By measuring the outcome of the experiment, you may need to change your approach, revise your initial idea, or even challenge established knowledge. This process may be long or short, successful or not, and it is very difficult to tell before you start collecting and analysing data. In the Data Science world, this phase is called “exploratory analysis”, followed by data analysis and modelling (these last two may be part of the same process).
While natural-science experiments can be performed in a controlled environment, the data we work with in the Data Science domain are often “dirty”: they contain a substantial portion of random or systematic errors, they are often incomplete, and they are generally collected by third parties under unknown conditions. While the former sources of error may be accounted for within the statistical analysis (systematic errors are trickier and often not correctly taken into account), the latter are impossible to consider. Data may have been collected under a set of hypotheses that led to biased results. This makes the development part of the Data Science workflow heavily dependent on the data and often difficult to formalise at the very beginning of the project.
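A small hypothetical illustration of what “dirty” means in practice: a column of third-party sensor readings containing both a missing value and a sentinel code. The readings and the -999.0 sentinel convention are invented for this sketch.

```python
# Sketch: incomplete third-party data. A naive mean is either undefined
# (NaN propagates) or badly skewed by the -999.0 "missing" sentinel;
# cleaning must happen before any statistics are trusted.
import numpy as np

readings = np.array([21.3, 20.9, np.nan, 22.1, -999.0, 21.7])

clean = np.where(readings == -999.0, np.nan, readings)  # normalise sentinels to NaN
mean_naive = np.mean(readings)    # NaN propagates: result is not a number
mean_clean = np.nanmean(clean)    # ignores the missing entries
print(mean_naive, mean_clean)
```

Real datasets hide such problems in subtler ways (wrong units, duplicated rows, silently truncated fields), which is why exploratory analysis comes first.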
All these issues are absent in a traditional software development workflow, which does not depend on the quality or structure of data but only on the project's needs, e.g. when the user interacts in this way, the software responds in that way.
Some of the products that come out of Machine Learning include email spam detectors, Google’s search recommendations, and business insights on how to efficiently store goods, deliver orders, or take you from one place to another with services like Uber, Grab or Lyft.
Neural Networks (NN) were originally born as a simplified representation of how the human brain works.
The basic computational unit of a NN is a simplified version of a neuron (a nerve cell): it collects data from external sources, processes them, and returns an answer. The way a network learns also defines its structure, traditionally classified as supervised or unsupervised, although intermediate options exist. The former learns from examples, i.e. we give the NN both the question and its answer to “train it” until it is able to generalise to new questions for which the answer is not known. So far, this has been the most successful form of learning, especially in practical applications. The latter form of learning is given only an input dataset and must find structure by itself. Although unsupervised learning is arguably more interesting, it is still largely at the research stage.
A NN has many parameters that need to be fixed (even millions!); those that are learned during training are called “learnable parameters”, while those that need to be set “by hand” (or using some other software) are called hyperparameters. The latter are not really part of the model, but they define how learning is going to work; they are found through a lengthy exploration that can take up a considerable share of the whole Data Science process.
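As a minimal sketch of that exploration, assuming scikit-learn: a grid search tries each combination of hyperparameters (here, tree depth and number of trees, chosen for illustration) and keeps the best one, while the splits inside each tree remain the learnable parameters found during training.

```python
# Sketch: hyperparameter exploration via grid search with
# cross-validation (model and grid values are illustrative).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [2, 4, None]},
    cv=3,  # each combination is scored by 3-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Even this tiny grid trains 6 × 3 = 18 models; with millions of parameters and dozens of hyperparameters, it is easy to see why this phase dominates many projects.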
Often, the initial path we try ends up failing and new directions need to be explored. Sometimes, the client’s expectations in terms of model accuracy are based on results coming from different datasets, but the reality is that data structure and quality change the performance of a model in non-trivial ways.
Finally, models that perform well sometimes cannot be used in production due to other constraints: Do the results need to be produced in real time? How fast is fast? Where is the data going to be stored, and how is the model going to be deployed? Ultimately, these aspects can have a big impact on the model selection process.
While Machine Learning works well for many traditional domains, it does not fare as well in computer vision, where information must be gathered from raw visual imagery, or in Natural Language Processing tasks (understanding and generating language). This is where Neural Networks really outperform traditional statistical analysis. These algorithms try to mimic the way neurons in the brain work, so they are better able to gather information from such sources. This is why many people in the field think that they are the gateway to AI.
Although Neural Networks have impressive generalisation capabilities compared to traditional Machine Learning algorithms, they still fail to generalise as well as humans do. For example, in image recognition tasks, we train the Neural Network to recognise whether an object is a car, a dog or something else. Although we may obtain very good prediction accuracy on the test set, we usually observe a drop in accuracy when the model is asked to infer on data coming from a different distribution, or if the data have been slightly but deliberately corrupted (a so-called adversarial attack). This is also true for speech recognition tasks, and in particular for conversational AI, where most algorithms do not really understand the meaning of a sentence or, more importantly, how to link sentences used at different stages of the conversation.
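The effect of a shifted distribution can be illustrated with a sketch, assuming scikit-learn: a classifier trained on clean digit images is evaluated once on the held-out test set and once on the same images with heavy noise added. (This simulates distribution shift with random corruption; a true adversarial attack would craft the perturbation deliberately.)

```python
# Sketch: accuracy on a clean test set vs. the same images corrupted
# by strong random pixel noise (noise level chosen for illustration).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)  # 8x8 digit images, pixel values 0-16
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=2000)
model.fit(X_train, y_train)

rng = np.random.default_rng(0)
X_noisy = X_test + rng.normal(0.0, 8.0, X_test.shape)  # heavy pixel noise

print("clean accuracy:", model.score(X_test, y_test))
print("noisy accuracy:", model.score(X_noisy, y_test))
```

A human can still read most of the noisy digits, but the model's accuracy drops sharply, which is the gap in generalisation the paragraph describes.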
Senior Data Scientist | Former Data and Machine Learning Scientist