Pitfalls of Alternate String Implementations

Posted On: 2021-03-08

By Mark

Strings in C# are ubiquitous: take any library and chances are good you'll find at least one function (and likely many more) that uses or returns a string. This makes sense; strings are essentially representations of human language: series of letters forming words, sentences, etc. As such, it's reasonable to expect that most (if not all) libraries and programs will, at some point, need to communicate something - whether that's displaying a message to a user or reporting the details of an error.

For the most part, the ubiquity of strings is actually quite a good thing. The C# language designers put a lot of effort into making them user-friendly - so developers can effortlessly create, access, and modify* strings. This does, however, come with some interesting problems: if, for some reason, one needs to use an alternate implementation for strings (such as the ill-fated SecureString), doing so represents a daunting (if not impossible) engineering task.

What is a string?

In non-technical terms, a string is a bunch of letters strung together in order. This very sentence, for example, is a string: it's a bunch of letters (and punctuation symbols) that are stored and displayed in a specific order to convey their meaning. Strings can be very short (the shortest is an empty string) or very long (entire books can be stored in a string) - yet developers (generally) don't need to do anything different for either of these: it just works.

No, really, what is a string?

Technically, a string is a reference to an immutable object whose location and lifetime are managed by the C# runtime*. Alongside that, C# provides a number of conveniences to simplify working with such objects - for example, when two strings are "combined" using the + operator, a whole new object is created rather than either operand being modified**. The reasons for this approach are many (and interesting in their own right), but they are often summarized as providing the best performance based on the language designers' understanding of how developers use strings.
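
A minimal sketch of that behavior (the names here are illustrative, but the semantics are standard C#):

    using System;

    class StringImmutabilityDemo
    {
        static void Main()
        {
            string first = "Hello, ";
            string second = "world";

            // "Combining" the strings allocates a brand-new object;
            // neither input is modified.
            string combined = first + second;

            Console.WriteLine(first);                            // still "Hello, "
            Console.WriteLine(combined);                         // "Hello, world"
            Console.WriteLine(ReferenceEquals(combined, first)); // False - a new object
        }
    }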

What's the problem?

Sometimes, the implementation details of strings can be problematic - for example, the lifetime of a string being arbitrary (ie. controlled by garbage collection) is often considered a security issue for sensitive data. At first glance, this may seem like a straightforward problem to solve: instead of using strings themselves, use an alternative implementation that does give you fine-grained control over how long the data remains in memory*. In practice, however, that rarely works out.
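
To illustrate the kind of control one might want, here is a minimal sketch (my own illustration, not a vetted security implementation) using a pinned char[] so the data can be wiped deterministically instead of lingering until garbage collection:

    using System;
    using System.Runtime.InteropServices;

    class SensitiveBufferSketch
    {
        static void Main()
        {
            char[] secret = new char[32];
            // Pin the array so the GC cannot relocate it and leave
            // stray copies of the data behind in memory.
            GCHandle handle = GCHandle.Alloc(secret, GCHandleType.Pinned);
            try
            {
                // ... fill `secret` directly from a secure source
                //     (never via a string) and use it here ...
            }
            finally
            {
                // Deterministically zero the memory the moment we're done,
                // rather than waiting on the garbage collector.
                Array.Clear(secret, 0, secret.Length);
                handle.Free();
            }
        }
    }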

There are two main reasons why using an alternate string implementation is impractical. The first of these is that, for any non-trivial problem, one will likely need to make use of functions in an existing library - whether that be a built-in one or third-party. Unfortunately, nearly all existing libraries only support regular strings* (they are ubiquitous, after all.) Thus one must either convert from the alternative's type into a regular string** - which reintroduces the very problem the alternative was meant to solve - or re-implement the desired function so that it supports the alternative. The latter, being the only real solution, represents a substantially larger engineering investment than simply using strings.
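
To make the first option concrete, here is a minimal sketch using SecureString and the Marshal helpers (SomeLibraryFunction is a hypothetical stand-in for a third-party API). Note how the conversion places an ordinary, GC-managed string right back on the heap - the very thing the alternative was meant to prevent:

    using System;
    using System.Runtime.InteropServices;
    using System.Security;

    class ConversionSketch
    {
        static void CallLibrary(SecureString secure)
        {
            IntPtr ptr = Marshal.SecureStringToGlobalAllocUnicode(secure);
            try
            {
                // The library only accepts string, so an unencrypted,
                // GC-managed copy ends up on the heap anyway.
                string plain = Marshal.PtrToStringUni(ptr);
                SomeLibraryFunction(plain);
            }
            finally
            {
                Marshal.ZeroFreeGlobalAllocUnicode(ptr);
            }
        }

        // Hypothetical stand-in for a third-party function that, like
        // nearly everything else, only accepts a regular string.
        static void SomeLibraryFunction(string s) { }
    }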

The second reason is, perhaps, more dire. Just as any libraries you call only use strings, any libraries/frameworks that call into your code will also only use strings. This is particularly important since most software solutions are built on top of other frameworks - whether that's a GUI library for client applications or some kind of web server implementation (cloud or otherwise.) In such cases, rewriting that other framework is simply not realistic: the amount of work that goes into securing and stabilizing such foundational systems far exceeds what is reasonable to ask of any development team (either your team makes such things its core competency, or you don't make them at all.) There's really nothing you can do about this: like it or not, the data will be in a string before your code ever receives it.
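
For example, here is a hedged sketch assuming ASP.NET Core (the LoginRequest and LoginController types are illustrative): by the time your action method runs, the framework's model binding has already materialized the request body as regular strings.

    using Microsoft.AspNetCore.Mvc;

    public class LoginRequest
    {
        public string Username { get; set; }
        public string Password { get; set; }
    }

    [ApiController]
    public class LoginController : ControllerBase
    {
        [HttpPost("/login")]
        public IActionResult Login([FromBody] LoginRequest request)
        {
            // request.Password arrived as a plain string - the framework
            // created it long before this code had any say in the matter.
            return Ok();
        }
    }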

Is this always a problem?

It is worth mentioning that, sometimes, this isn't necessarily a problem. Unity's ECS, for example, is designed to maximize performance (and multi-threading). As a part of this, it requires that developers make use of alternative string implementations (such as FixedString) in order to store or use text content inside ECS. While this is subject to all the constraints mentioned above (no support from other libraries, and upstream text sources still arrive as strings), that's actually something one would expect by default: Unity's ECS is subject to so many other constraints that using pre-existing libraries to solve problems was already an unrealistic prospect. Likewise, getting data into/out of ECS is expected to be an onerous experience - doubly so when working with any of the (many) non-blittable types. In essence, when working with ECS, you're expected to be reinventing everything - so all the engineering hurdles associated with an alternate string implementation are also expected*.
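
As a sketch of what this looks like (assuming the Unity.Entities and Unity.Collections packages; the exact FixedString type names have varied across package versions):

    using Unity.Collections;
    using Unity.Entities;

    // Components must be blittable, so text is stored inline in a
    // fixed-capacity value type rather than as a heap-allocated string.
    public struct DisplayName : IComponentData
    {
        public FixedString64Bytes Value;
    }

    // Assigning from a string copies the characters into the component's
    // fixed-size buffer (capacity permitting), e.g.:
    //     entityManager.SetComponentData(entity,
    //         new DisplayName { Value = "Player One" });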

Conclusion

As you can see, while strings have provided a lot of value by being a common data type for text storage, all those benefits vanish the moment one needs to use an alternate implementation. The ubiquity of strings thus becomes a double-edged sword: it makes conforming to the status quo easier while also making it harder for alternatives to gain traction. In many cases, one simply can't make the whole stack conform to the alternative - it's too much work and too much risk. Yet, as we see with Unity's approach, some compromises can be made - provided that the developers saddled with the burden of using these alternate implementations think the benefits are worth it.