Data, Algorithms, and Philosophy

Posted On: 2020-05-11

By Mark

One of my eternal fascinations is the ambiguity between data and algorithms. This ambiguity is central to the concept of software - without which we could not have the extraordinary computing advances of the past century. Perhaps more fascinating still, however, are the many places where we can see this ambiguity leaking out of otherwise pristine edifices, reminding us that everything we know is wrong.

But, I am getting a bit ahead of myself. To understand my fascination, you need to start with understanding data and algorithms.

What is data?

In computing, data is (basically) any information that is stored. Data can be anything from a text file saved on your computer to the background color of a website. Data is (generally) considered inert: it doesn't do anything; it is simply a record. It is only when data is combined with algorithms* that one is able to actually see it - whether that be opening the text file to view it or displaying the web page in a browser.

As a real-world example, consider a grocery list written on a piece of paper. The list is data: a series of items that should be purchased. The list itself doesn't do anything (it won't walk to the store and buy things for you). When a human uses the grocery list in the store, however, the correct goods are purchased. What's more, it doesn't matter who uses the list: the same goods will be purchased*.
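
To make that a bit more concrete, here's a minimal sketch in Python (the file name and list contents are, of course, just made up for illustration):

```python
# A grocery list represented as plain data: a record of what to buy.
# On its own, it does nothing.
grocery_list = ["milk", "eggs", "bread", "apples"]

# Even written to a file on disk, it remains inert - just stored information
# waiting for some algorithm to come along and use it.
with open("groceries.txt", "w") as f:
    f.write("\n".join(grocery_list))
```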

What about algorithms?

Algorithms are instructions, or rules, about what to do. In computing, an algorithm is most commonly a sequence of instructions that are followed, in order, to produce some desired outcome*. Importantly, most algorithms are designed to use data in some way: for example, when you opened this web page, your browser used the algorithm it has for displaying any page, together with the data for this specific page, to show you what you're seeing now.

To return to the real-world example of shopping for groceries, the sequence of steps (going to the store, picking up items, paying for them, etc.) makes up the algorithm of shopping for groceries. Following those steps, in order, produces the desired outcome (purchased groceries). However, many times one will want to modify what is purchased - in that case, the shopper can use a list to guide the shopping. If they only pick up the items on the list, then they will always get the correct items*. Thus, data (the list) can be used by one algorithm (shopping for groceries) to produce the desired outcome (the correct groceries for that particular day).
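
Here's roughly how I'd sketch that in Python - the function and step names are my own invention, but the shape of the idea is the same: one fixed algorithm, steered by whatever data it is handed.

```python
def shop_for_groceries(shopping_list):
    """Follow the same steps every time; the list (data) decides what gets bought."""
    cart = []
    # Step 1: go to the store (not much to simulate here).
    # Step 2: pick up only the items that appear on the list.
    for item in shopping_list:
        cart.append(item)
    # Step 3: pay for everything in the cart and head home.
    return cart

# The same algorithm, driven by different data, buys different groceries.
print(shop_for_groceries(["milk", "eggs"]))       # ['milk', 'eggs']
print(shop_for_groceries(["coffee", "bananas"]))  # ['coffee', 'bananas']
```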

Where's the ambiguity?

Using data to make algorithms more flexible is great, but it eventually comes up against limitations. At some point, the algorithm itself needs to change: perhaps the process has changed, or some new flexibility is needed*. The solution to such a problem, however, is both simple and transformative: make the algorithm itself into data. Then, when the algorithm needs to change, one simply changes the data that describes it**.
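
A small, hypothetical sketch of what that looks like in Python: the procedure itself is now just a list of step names (data), and a tiny interpreter follows whatever steps the data describes.

```python
# The "algorithm" is now itself data: an ordered list of step names.
shopping_procedure = ["go_to_store", "pick_up_items", "pay"]

# The small set of steps the interpreter knows how to perform.
steps = {
    "go_to_store":   lambda state: print("Walking to the store..."),
    "pick_up_items": lambda state: print("Picking up: " + ", ".join(state["list"])),
    "pay":           lambda state: print("Paying at the register."),
}

def run(procedure, state):
    """Follow whatever steps the data describes, in order."""
    for step_name in procedure:
        steps[step_name](state)

# Changing the procedure now means editing data, not rewriting the interpreter.
run(shopping_procedure, {"list": ["milk", "eggs", "bread"]})
```

Want to reorder the trip or drop a step? Edit the list; the interpreter stays exactly the same.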

In computing, the concept of software is built upon treating algorithms as data: computer programs (or "apps") are stored as data, and when "run", the computer treats them as algorithms and begins following their instructions. Operating systems (the algorithms that are responsible for running programs) are, themselves, also stored as data. What's more, many of these algorithms are sophisticated enough to modify their own data while running, thereby enabling the "auto-update" world of computing that we see today.
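
As a minimal illustration of that idea (using Python's built-in exec purely for demonstration - it is exactly the sort of thing the next section cautions against doing carelessly):

```python
# A "program" stored as ordinary data: just a string of text. It could as
# easily have been read from a file on disk.
program = '''
greeting = "Hello from a program that was, moments ago, only data."
print(greeting)
'''

# Up to this point the program is as inert as a grocery list. Handing it to
# the interpreter is what turns that data into a running algorithm.
exec(program)
```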

Who cares?

So, in computers, algorithms are really just data about how to use other data. So what?

Well, the first (and perhaps most unsettling) reason you should care is that it has serious software security implications. Programmers and security experts work tirelessly to limit who is permitted to modify data that will be executed. If a malicious actor gains the ability to modify an algorithm as it's running, that is known as an arbitrary code execution vulnerability - one of the most severe categories of vulnerability*.
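
A tiny Python sketch of why that boundary matters - the input string here is invented, but the contrast is the heart of it: treat untrusted input as data and the damage is contained; treat it as code and you've handed over the keys.

```python
import json

# Input that arrived from the outside world - we don't control its contents.
untrusted = '{"item": "milk", "quantity": 2}'

# Treated strictly as data: json.loads can only ever produce inert values
# (dicts, lists, strings, numbers), no matter what the sender wrote.
order = json.loads(untrusted)
print(order["item"], order["quantity"])

# Treated as code: eval() would run whatever expression it is given, which is
# why feeding it untrusted input is a textbook arbitrary code execution risk.
# eval(untrusted_expression)  # never do this with data you don't control
```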

The second (and likely more enjoyable) reason is that it can be used to create some rather impressive spectacles. Speedrunners* frequently use bugs/exploits in the games they run, and some of the most spectacular of these bugs are built upon the player modifying the game's data while it is running. When these bugs include arbitrary code execution, some truly absurd things can happen, such as turning the 1990s hit "Super Mario World" into "Flappy Bird".

Why do I care?

While the ambiguity of algorithms and data has many practical implications, I am personally most interested in its philosophical implications. As a programmer, I meticulously plan and write out my code, thinking of it only as an algorithm. Likewise, when I use software (as a consumer), I also approach it as an algorithm - ultimately coding my own behavior with the software based upon my understanding of the rules that control it.

Yet, regardless of how I think about it, all software is actually stored as data. The systems and patterns that I perceive reflect the data, true, but I often forget the data even exists. Even if something were to come along and change that data right out from under me, I would likely continue thinking in terms of algorithms and systems, perhaps by assuming that I found some new edge case or some such.

Thus, the models I use for thinking about software, while useful, are fundamentally inaccurate. Yet, to think about the algorithm in terms of its data would make it much more difficult to reason about the correct behavior of the system(s). I therefore find myself compromising: thinking systemically when things go right, and only thinking about the data after exhausting all other explanations.

Transferring those lessons into other aspects of life, I find myself wondering how much of what I know is mere shorthand for a reality that is too complex to be practically reasoned about. Certainly there is some truth to that: facial recognition, for example, provides far more value than directly perceiving a face without it. What's more, it's been argued that such simplification should be expected from evolutionary biological processes.

The wrinkle in all this is that my knowledge about the data underlying the computing algorithms I use is based upon lessons I've been taught and diagnostic tools I have access to. In any other aspect of my life that I might transfer this onto, I have no such background or tools. What's more, as the embodied perceiver of the world around me, I am likely very poorly equipped to even perceive inaccuracies in my perception - much the way we only know illusions by how they betray our expectations. Thus, I find that I don't know whether or not I know, and, in all likelihood, I genuinely can't know either way.

Conclusion

Storing algorithms as data is foundational to modern programming and yet so easy to forget - as convenience and layers of abstraction hide the need to ever deal with such things. Yet, whenever I do pause to remember, I find myself wondering: what else have I forgotten, and, more perplexingly, what else might I take for granted without even knowing that it is otherwise?