On algorithmic bias

The latest entrant to the increasingly shrill debate on artificial intelligence is Alexandria Ocasio-Cortez, first-time Congresswoman from New York. At a recent event, Ocasio-Cortez sounded a warning bell about AI, noting that “[a]lgorithms are still made by human beings, and those algorithms are still pegged to basic human assumptions.” She went on to say: “They’re just automated assumptions. And if you don’t fix the bias, then you are just automating the bias.” Given to hyperbole (and enticed by prospects of a handsome royalty check), one American humanities professor even published a book titled “Algorithms of Oppression.” Closer to home, two Indian experts noted in a pioneering 2017 article: “As coders and consumers of technology are largely male, they are crafting algorithms that absorb existing gender and racial prejudices.”

While well-meaning and presumably driven by social concerns, these statements, taken at face value, are inaccurate. The fallacy in all of them lies in an incorrect characterization of what algorithms are and what they do. To see this, let us recap some basic notions from computing.

There are essentially three related but distinct notions in computer science and software engineering:

  1. The first step is fixing the application objective. This is what businesses, governments, computer scientists and software engineers have in mind when they set out on a project. For example, this could be a high-level objective such as making a computer sort through a large number of digital images and identify the ones with cats in them and the ones with dogs. This is the practical top-level goal.
  2. This application objective is then transformed into a well-defined computational problem, if that is indeed possible. So the cats/dogs problem hopefully becomes a (very complicated) mathematical poser that a computer can understand. This is, more often than not, extremely hard and something most computer scientists spend their energy on.
  3. Finally, to solve this mathematical problem one uses algorithms, which are essentially recipes that take a set of inputs and convert them into outputs. Algorithms were around long before computers were: the term itself derives from the name of the ninth-century mathematician al-Khwarizmi, and the concept goes back to ancient Greece. When you add two numbers by hand, you are implementing an algorithm: the inputs are the two numbers and the output is a single one. When you divide two numbers through long division, you are implementing another (which, incidentally, has the algorithm for subtraction as a component). Not all computational problems can be solved using a single fixed algorithm. You may need a large set of them and then creatively choose or combine them to solve your problem. Even when you can in principle solve a computational problem with a given algorithm, it may not be the most efficient way of doing so, in which case you may choose another or (and this is much more difficult) create a new one.
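
To make the long-division remark concrete, here is a minimal Python sketch of the schoolbook procedure for non-negative whole numbers. The function name `long_division` and its error handling are illustrative choices, not something taken from any source; the point is only that the recipe is fixed, the numbers are inputs, and subtraction shows up as a component.

```python
def long_division(dividend: int, divisor: int) -> tuple[int, int]:
    """Schoolbook long division; returns (quotient, remainder).

    Illustrative sketch: only non-negative dividends and positive
    divisors are handled.
    """
    if dividend < 0 or divisor <= 0:
        raise ValueError("expects dividend >= 0 and divisor > 0")

    quotient = 0
    remainder = 0
    for digit in str(dividend):
        # "Bring down" the next digit of the dividend.
        remainder = remainder * 10 + int(digit)

        # Subtraction as a component: count how many times the
        # divisor can be taken away from the running remainder.
        quotient_digit = 0
        while remainder >= divisor:
            remainder -= divisor
            quotient_digit += 1

        # Append the digit just found to the quotient.
        quotient = quotient * 10 + quotient_digit

    return quotient, remainder


if __name__ == "__main__":
    print(long_division(1234, 7))
```

Nothing in the function refers to any particular pair of numbers: the same steps run whatever inputs are supplied.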

To reiterate, in the context of Ocasio-Cortez’s remarks:

  1. An algorithm does not have data in it. In an algorithm, data is fed as an input and retrieved as an output. Mathematically, an algorithm is a function.
  2. Algorithms are mathematical and logical recipes. Euclid’s algorithm to find the greatest common divisor of two numbers is a basic example. That said, making a computer carry out an algorithm can be tricky.
  3. Algorithms are not, as the congresswoman from New York claims, “pegged to basic human assumptions.” The only thing assumed in writing an algorithm is that basic principles of logic hold.
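
Euclid’s algorithm, mentioned in the list above, can be sketched in a few lines of Python. The function name `euclid_gcd` is an illustrative choice, and the sketch uses the remainder form of the algorithm rather than repeated subtraction. Note that the recipe contains no data of its own: the two numbers come in as inputs and the divisor goes out as the output.

```python
def euclid_gcd(a: int, b: int) -> int:
    """Greatest common divisor of two integers by Euclid's algorithm."""
    a, b = abs(a), abs(b)
    while b != 0:
        # Replace the pair (a, b) with (b, a mod b); the greatest
        # common divisor of the pair is unchanged by this step.
        a, b = b, a % b
    return a


if __name__ == "__main__":
    print(euclid_gcd(48, 18))
```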

To see this, let us go back to our example of cats/dogs pictures. A commonly used machine-learning (ML) algorithm for such classification problems is the ‘support vector machine’ (SVM). Such algorithms have purely mathematical statements. One online tutorial puts this very succinctly: “The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N — the number of features) that distinctly classifies the data points.” This is a purely mathematical statement, with no “human assumptions” (unless you count the veracity of mathematics as one). The inputs to an SVM are data points and the output is their classification. Ditto for other ML algorithms like logistic regression. Indeed, in many cases, these algorithms, as mathematical recipes, date back to the 1950s, long before they had any chance of being implemented on modern computers.
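
For readers who want the mathematics spelled out, one standard way of writing the SVM objective is the so-called hard-margin form, which assumes the two classes can be separated perfectly by a hyperplane. The notation is an assumption made here, not taken from the tutorial quoted above: the x_i are the n training points in N-dimensional space, each y_i is +1 or -1 depending on the class, and w and b describe the hyperplane.

```latex
\min_{w,\,b} \ \frac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,\bigl( w^{\top} x_i + b \bigr) \ge 1,
\qquad i = 1, \dots, n
```

Every symbol on the page is either an input (the x_i and y_i) or a quantity to be computed (w and b). Practical variants relax the constraints to tolerate overlapping classes, but they remain optimization problems of the same kind.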

So what gives? The issue lies with the data that is fed into the algorithm, and therefore with the application objective. ML algorithms take training data as input, which enables them to extract patterns and answer specific queries. (These patterns are then checked against test data to see how well the algorithm is doing its job, which in many cases means its predictive capacity.) If your training data is problematic, such as historical crime records from countries known to have disproportionately incarcerated minorities, your algorithm will misbehave. But this is not the algorithm’s problem.
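
The point can be illustrated with a toy sketch in Python. Everything here is hypothetical: the nearest-centroid classifier is a deliberately simple stand-in for a real ML algorithm (it is not an SVM), the function names are made up, and the numbers are synthetic values chosen for illustration, not real records. The same unchanged code is trained twice, once on cleanly labelled data and once on data in which some low values have been recorded under the wrong label.

```python
def train_centroids(points, labels):
    """Return the mean of the training points for each label."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Assign x to the label whose centroid is closest."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


# Synthetic, hypothetical training sets (one feature per point).
clean_points = [1, 2, 3, 7, 8, 9]
clean_labels = ["a", "a", "a", "b", "b", "b"]

# Same kind of data, but the values 3 and 4 were recorded as "b".
skewed_points = [1, 1, 2, 3, 4, 8, 9]
skewed_labels = ["a", "a", "a", "b", "b", "b", "b"]

clean_model = train_centroids(clean_points, clean_labels)
skewed_model = train_centroids(skewed_points, skewed_labels)

if __name__ == "__main__":
    # Identical algorithm, identical query, different training data.
    print(predict(clean_model, 4), predict(skewed_model, 4))
```

The two models disagree on the query value 4 even though not a single line of `train_centroids` or `predict` differs between them; the only thing that changed is what was fed in.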

It must be noted that these issues, in various guises, have been around since the beginning of computing, where they are colloquially known as “garbage in, garbage out.” Statisticians have also long grappled with the problems of sampling, inference, selection bias, and errors. In other words, this stuff is old hat and not specific to AI. A final related remark: machine learning and other branches of applied statistics are fundamentally probabilistic. By their very nature, you will never be able to obtain fool-proof answers to all your questions. With skill, you can only minimize your error.

All of the above may seem like semantic hairsplitting, but it is not. In this post-truth era, words and definitions are being increasingly stretched. Influencers and politicians, especially those interested in STEM policy, must stay faithful to both.

Author: Abhijnan Rej

Analyst, researcher, and consultant