Know the limitations of your machine



More and more data is collected today at faster and faster
rates: currently more than 2.5 exabytes (the equivalent of 5 million
laptop hard drives) per day, according to recent estimates by Domo,
an American software company. Alongside these ever-growing data
resources, technologies and methodologies are being developed to
extract “actionable insight”. One successful approach is machine
learning (ML), which extracts structure and relationships from the
data. One common setting, so-called ‘supervised learning’,
establishes a mapping from independent variables, typically called
features, to a dependent variable, typically called the target.
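As a concrete illustration of that setting, the sketch below fits a simple regression model that learns a mapping from features X to a target y; the dataset and model choice are illustrative assumptions, not specifics from this article.

```python
# A minimal sketch of supervised learning: learn a mapping from
# features X to a target y on an illustrative public dataset.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # estimate the mapping
print("Held-out R^2:", model.score(X_test, y_test))
```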

There have been considerable developments in the field of ML.
For instance, in 2013, speech recognition technology still
delivered error rates of 1 in 5 words, but today it gives largely
accurate results (< 5% error), as we ourselves can easily
experience through our smartphones. Image classification is
on a par with human performance. Tech optimists argue that ML and
Artificial Intelligence (AI) will soon be solving most of our
problems.

I will argue here that whilst it is certainly true that
considerable progress has been made in the field, certain
challenges will remain. These are fundamental in nature and cannot
be easily overcome.

Structuring problems and providing interpretations

How should a problem be framed in a given context? What data should
be used, and which relationships should we expect? Answers to these
questions are typically provided by humans with relevant domain
expertise, and it is currently not clear how they could realistically
be answered by ML. Complex machine learning models deliver accurate
performance in many contexts. This does not, however, mean that it
is easy to understand and communicate how predictions are made. For
human decision makers, it is often important to understand why a
certain result is the outcome of an algorithm. In certain cases,
such as in the credit space, it is even mandatory to provide
explanations, for example for the rejection of a mortgage
application. There are tools which locally approximate complex
models by simpler ones, or which use other criteria to measure the
impact of features on predictions. Interpretability, however,
remains a challenge that is unlikely to be easily overcome.
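To make the idea of a local approximation concrete, here is a minimal sketch in the spirit of LIME-style surrogate models: a complex model is probed around a single observation and approximated there by a simple linear model. The dataset, the perturbation scheme and the proximity weighting are illustrative assumptions, not a prescription from this article.

```python
# A minimal sketch of a local surrogate: approximate a complex model
# around one observation with an interpretable linear model.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

X, y = load_breast_cancer(return_X_y=True)
complex_model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

rng = np.random.default_rng(0)
x0 = X[0]                                   # the instance we want to explain
perturbed = x0 + rng.normal(scale=X.std(axis=0) * 0.1,
                            size=(500, X.shape[1]))
proba = complex_model.predict_proba(perturbed)[:, 1]

# Weight perturbed points by proximity to x0, then fit a simple linear model.
weights = np.exp(-np.linalg.norm((perturbed - x0) / X.std(axis=0), axis=1))
surrogate = Ridge(alpha=1.0).fit(perturbed, proba, sample_weight=weights)

# Feature indices with the largest local influence on the prediction.
top = np.argsort(np.abs(surrogate.coef_))[::-1][:5]
print("Locally most influential features:", top)
```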


Separating signal from noise

How many people came to a large event? What is the true value of
a company? How many aeroplanes are currently in the air? Complex
questions like these often produce imprecise answers, manifested as
noisy data. Noise is therefore not restricted to imprecise sensor
measurements; it is a ubiquitous feature of many data sets. In
certain settings, such as financial data, noise even dominates the
signal. This does not mean, however, that nothing can be done.
Indeed, a small edge in understanding can lead to formidable
returns, as certain hedge funds have demonstrated. In situations
where the objective is to anticipate a share price one month ahead,
filtering techniques and a careful selection of features can help
substantially to extract the signal. The challenge of separating
signal from noise, however, will persist.
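As a toy illustration of filtering, the sketch below smooths a synthetic, noise-dominated price series with an exponentially weighted moving average; the series and the smoothing window are assumptions chosen purely for illustration.

```python
# A minimal sketch of noise filtering: smooth a synthetic price series
# whose noise dominates the slow underlying trend.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
trend = np.linspace(100, 110, 250)                 # slow "signal"
noise = rng.normal(scale=2.0, size=250)            # dominant noise
price = pd.Series(trend + noise,
                  index=pd.bdate_range("2019-01-01", periods=250))

smoothed = price.ewm(span=21).mean()               # roughly one trading month
print(smoothed.tail())
```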

Managing low-frequency or small data sets and rare events

Machine learning is a greedy animal, hungry for data. If not fed
properly, it can behave in an uncontrolled fashion; this is
referred to as a model with high variance, or simply overfitting.
Some argue that this no longer occurs in the information age, where
data is plentiful, but this is certainly not true. Sometimes data
naturally arrives at low frequency (quarterly or annually), as with
economic figures, company accounts, defects/accidents or default
events. In such small-data situations, ML researchers should
remember that this is exactly the setting in which many techniques
in statistics were developed. Statisticians of the past were not
blessed with large data sets and had to come up with an ingenious
toolbox to deal with this challenge. For instance, Bayesian
approaches can incorporate prior knowledge or carefully framed
assumptions, which are then updated on the basis of only a few data
points.
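A minimal sketch of that idea, assuming a Beta-Binomial model for a rare annual default event: a prior belief about the default rate is updated with only a handful of observations. Both the prior and the data are illustrative assumptions.

```python
# A minimal sketch of a Bayesian update with very few observations:
# a Beta-Binomial model for an annual default rate.
from scipy import stats

# Prior belief: defaults are rare, roughly 2% a year (Beta(2, 98)).
alpha_prior, beta_prior = 2.0, 98.0

# Only five years of observations: 1 default out of 5 company-years.
defaults, non_defaults = 1, 4

posterior = stats.beta(alpha_prior + defaults, beta_prior + non_defaults)
print(f"Posterior mean default rate: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```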


Dealing with regime shifts or non-stationarity

Many machine learning applications implicitly assume a static
relationship between a set of features and a target. Even if time
is explicitly included in the modelling process, the underlying
assumption is that of recurring patterns. Regime shifts and
non-stationarity are genuinely hard to deal with. There are
techniques to identify and model such situations, such as the use
of latent state variables to encode regime switches; however, no
generic solution exists.
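As one possible illustration of the latent-state idea, the sketch below fits a two-state Gaussian hidden Markov model (using the hmmlearn package, an assumed tool choice) to a synthetic return series that switches from a calm to a volatile regime.

```python
# A minimal sketch of encoding regime switches with a latent state
# variable: a two-state Gaussian HMM on synthetic returns.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
calm = rng.normal(0.0, 0.5, size=300)       # low-volatility regime
volatile = rng.normal(0.0, 2.0, size=300)   # high-volatility regime
returns = np.concatenate([calm, volatile]).reshape(-1, 1)

model = GaussianHMM(n_components=2, covariance_type="diag",
                    n_iter=100, random_state=0).fit(returns)
states = model.predict(returns)             # inferred latent regime per step
print("First estimated switch point:", np.argmax(states != states[0]))
```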

As we develop and apply machine learning more and more at IHS
Markit, we too face these challenges as we further expand our
product portfolio, whether forecasting dividends, predicting global
vehicle prices or demand, or estimating oil production curves, to
name just a few examples. Boundaries are
being pushed and many exciting developments are on the way, but
these fundamental issues will remain. For me personally, this is a
welcome challenge and it makes working in this space more
interesting, since creative thought and critical thinking, in
combination with ML, can lead to very powerful applications. A
machine can only be operated safely and successfully when its
limitations are clear.



Posted 11 July 2019 by Johannes Bauer, Ph.D., Data Analytics Director, Advanced Analytics, IHS Markit


