Posted On: 2022-06-27
One of the topics I am passionate about is data serialization - that is, saving and loading. It is an essential part of nearly every digital task, and many languages/frameworks provide built-in support for object serialization, yet it remains surprisingly difficult to implement from a technical standpoint. I've written previously about the save system used in my current project, but for today's post I thought I'd look at serialization more generally - to explain why it's important, and why it always seems to be so difficult.
I am writing this blog post inside an application: as I type, words appear on the screen. Yet, before you can read any of this, I have to save it to a file (and upload that file to my website). This process - saving a document - is ubiquitous in digital life: writers save their writing, artists save their digital paintings, programmers save their software code, and, of course, players save their games. Even when users don't strictly need to save (e.g. a game that auto-saves progress upon exit), the program is still performing that same essential process: saving.
The technical term for saving is "Serialization", and it refers to converting (usually in-memory) data into a format that can be persisted. As a general rule, data that is being used (e.g. displayed on a screen, modified as you type, etc.) is organized in a way that facilitates and supports that particular use case. In order to store that data, however, it must be reorganized - such as writing all the data as a series of bytes (hence the term, "serialization").
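To make the idea of "a series of bytes" concrete, here is a minimal sketch in Python. The Document class and the serialize/deserialize helpers are hypothetical names invented for illustration, and JSON is just one of many possible formats; this is not the save system of any particular application.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class Document:
    # Hypothetical in-memory representation, organized for editing.
    title: str
    lines: List[str]


def serialize(doc: Document) -> bytes:
    # Flatten the object into one series of bytes (UTF-8 encoded JSON).
    return json.dumps(asdict(doc)).encode("utf-8")


def deserialize(data: bytes) -> Document:
    # Rebuild the in-memory object from the stored bytes.
    return Document(**json.loads(data.decode("utf-8")))


doc = Document(title="Serialization", lines=["first line", "second line"])
stored = serialize(doc)        # what would be written to a file
restored = deserialize(stored)  # what loading that file would produce
```

The round trip only works because both functions agree on the layout of the bytes; a mismatch between the two is exactly the kind of mistake that loses or corrupts data.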
As a metaphor, you can think of having in-progress work spread out on a desk. Everything on the desk is organized to make working with it easier (e.g. spread out, with few documents overlapping), but in order to file those documents away (so that others can use them), they must first be reorganized: open books closed, loose pages placed one on top of another, and everything filed away according to the storage system (e.g. alphabetically). Making a mistake during this process can be catastrophic: documents filed incorrectly will be lost, open books placed on shelves will be damaged, and documents that are mistaken for temporary notes will be thrown away - permanently losing that information. Serialization is the same way: it only works if everything goes in the proper place - mistakes and omissions can easily lead to lost or corrupt data.
For a developer intending to use serialization, there are two main categories of challenges: technical and design. Technical challenges pertain to how data is serialized - what tools/methods are used, what format it is converted to, etc. Design challenges, by contrast, deal with decisions about what to save - which portions of the application's current state should be saved, and how they should be organized when using multiple files/storage locations.
Fortunately, modern software development has (generally) solved the fundamental technical challenges for the developer. Most file formats (text files, images, etc.) are standardized and libraries are readily available* to help developers get their data into these standard formats. Additionally, modern hardware is so powerful that concerns about storage size or write speeds are largely irrelevant**, meaning developers don't need to think about tradeoffs when picking which library to use.
The design challenges for serialization, however, have become a bigger obstacle on modern systems. As the tools for serialization have become ubiquitous, developers have to put an increasing amount of thought into which data to feed into those tools. Consider, for example, writing a text editing application: the text being edited obviously needs to be saved, but what about other details, like the current cursor position? Text formatting (bolds, italics, etc.)? Open windows, window positions, background colors? While it's often possible to design a system in such a way that serialization tools will automatically do what you want, deciding which details should be saved versus omitted can be a significant effort.
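The text editor example can be sketched in a few lines of Python. Everything here (EditorState, PERSISTED_FIELDS, to_save_data) is a hypothetical name chosen for illustration, and the choice to keep the cursor position but drop the window position is an assumption rather than a recommendation.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class EditorState:
    # Hypothetical state of a text editing application.
    text: str
    cursor_position: int
    window_x: int
    window_y: int


# The design decision, made explicit: only these fields are saved.
PERSISTED_FIELDS = ("text", "cursor_position")


def to_save_data(state: EditorState) -> str:
    # The serialization tool (json) does the easy part; the filter above
    # is where the actual thinking went.
    full = asdict(state)
    return json.dumps({name: full[name] for name in PERSISTED_FIELDS})


state = EditorState(text="Hello", cursor_position=5, window_x=120, window_y=80)
saved = to_save_data(state)
```

Note that the technical work is a single call to json.dumps; the part that takes effort is deciding what belongs in PERSISTED_FIELDS.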
As an added wrinkle on top of all of this: when using third-party code (as is common in modern software), a mismatch between design goals can generate technical challenges where they wouldn't otherwise exist. Consider, for example, Unity's animation system. The developers of that system anticipated serializing animation data, so it's (relatively) simple to save whole new animations. They did not, however, design the system to support serializing the current animation state: if a game developer wants to preserve the exact frame of animation during a save, it'll be a fight against the animation system with every line of code.
Deserialization is the opposite side of the process: loading save data. For the most part, serialization and deserialization challenges are paired together: if it's easy to save, it's easy to load, while the parts that are hard to save are often even harder to load. There are, however, two challenges that are unique to deserialization: order and versioning.
At the most fundamental level, the order in which data is deserialized shouldn't make a difference: you can read the data files in any order, after all. Once the data is loaded, however, the program needs to actually use it: to move windows into position or update the text displayed on the screen. For many systems, the order in which these operations occur can impact the stability and consistency of the overall program. If, for example, a window is displayed first, and then moved, the user may see a flash as the window briefly appears in the wrong spot before moving to the right one. When dealing with more interdependent systems, there is a much higher risk of introducing instability - especially when those systems don't provide for the ability to initialize state prior to use*.
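The window example can be simulated with a small Python sketch. The Window class below is a hypothetical stand-in, not a real GUI toolkit; its frames list simply records every position at which the user would have seen the window.

```python
class Window:
    # Hypothetical stand-in for a GUI window (not a real toolkit class).
    def __init__(self):
        self.x = 0
        self.y = 0
        self.visible = False
        self.frames = []  # positions at which the user actually saw the window

    def move(self, x, y):
        self.x, self.y = x, y
        if self.visible:
            self.frames.append((x, y))

    def show(self):
        self.visible = True
        self.frames.append((self.x, self.y))


def restore_show_first(saved):
    # Wrong order: the window appears at its default position, then jumps.
    window = Window()
    window.show()
    window.move(saved["x"], saved["y"])
    return window


def restore_move_first(saved):
    # Better order: state is initialized before the window is ever displayed.
    window = Window()
    window.move(saved["x"], saved["y"])
    window.show()
    return window


saved = {"x": 300, "y": 200}
flashing = restore_show_first(saved)  # frames: [(0, 0), (300, 200)]
clean = restore_move_first(saved)     # frames: [(300, 200)]
```

Both functions read the same save data and end in the same final state; only the order of operations differs, and only one of them shows the user a window in the wrong place.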
Versioning is another challenge unique to deserialization. As software changes over its lifespan, design decisions change and new features creep in. When data is serialized in one version of the application and deserialized in another, one often needs to perform some corrective action, to update the data to account for those changes. Keeping track of those changes across multiple versions can require significant development effort, and can massively increase the complexity of any deserialization code. As frequent software updates become more common, accurately handling data from older versions becomes an increasingly heavy burden.
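One common way to structure that corrective action is a chain of small migration steps, each upgrading save data by exactly one version. The sketch below is a hypothetical illustration in Python: the version numbers, the field names, and the changes made between versions are all invented for the example.

```python
CURRENT_VERSION = 3


def _v1_to_v2(data):
    # Hypothetical change: version 2 started saving the cursor position.
    upgraded = dict(data)
    upgraded["cursor_position"] = 0  # sensible default for old saves
    upgraded["version"] = 2
    return upgraded


def _v2_to_v3(data):
    # Hypothetical change: version 3 renamed "text" to "content".
    upgraded = dict(data)
    upgraded["content"] = upgraded.pop("text")
    upgraded["version"] = 3
    return upgraded


# Each entry upgrades save data from the keyed version to the next one.
MIGRATIONS = {1: _v1_to_v2, 2: _v2_to_v3}


def load(data):
    # Saves written before versioning existed are assumed to be version 1.
    version = data.get("version", 1)
    if version > CURRENT_VERSION:
        raise ValueError("save data is newer than this application")
    while version < CURRENT_VERSION:
        data = MIGRATIONS[version](data)
        version = data["version"]
    return data


old_save = {"version": 1, "text": "Hello"}
loaded = load(old_save)
```

Even in this toy form, the cost is visible: every released version adds another step that has to be written, tested, and kept around for as long as old save files might exist.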
As you can see, serialization is a fundamental part of most (if not all) software, and, while improved tools have reduced some of the technical challenges, it's still by no means easy to accomplish. Hopefully this post has increased your appreciation for serialization and its importance in our digital lives (even though it didn't dive into why I personally find it interesting). As always, if you have any thoughts or feedback, please let me know.