Imagist Imagist - 1 month ago 12
C Question

Why null-terminated strings? Or: null-terminated vs. characters + length storage

I'm writing a language interpreter in C, and my

string
type contains a
length
attribute, like so:

struct String
{
char* characters;
size_t length;
};


Because of this, I have to spend a lot of time in my interpreter handling this kind of string manually since C doesn't include built-in support for it. I've considered switching to simple null-terminated strings just to comply with the underlying C, but there seem to be a lot of reasons not to:

Bounds-checking is built-in if you use "length" instead of looking for a null.

You have to traverse the entire string to find its length.

You have to do extra stuff to handle a null character in the middle of a null-terminated string.

Null-terminated strings deal poorly with Unicode.

Non-null-terminated strings can intern more, i.e. the characters for "Hello, world" and "Hello" can be stored in the same place, just with different lengths. This can't be done with null-terminated strings.

String slice (note: strings are immutable in my language). Obviously the second is slower (and more error-prone: think about adding error-checking of
begin
and
end
to both functions).

struct String slice(struct String in, size_t begin, size_t end)
{
struct String out;
out.characters = in.characters + begin;
out.length = end - begin;

return out;
}

char* slice(char* in, size_t begin, size_t end)
{
char* out = malloc(end - begin + 1);

for(int i = 0; i < end - begin; i++)
out[i] = in[i + begin];

out[end - begin] = '\0';

return out;
}


After all this, my thinking is no longer about whether I should use null-terminated strings: I'm thinking about why C uses them!

So my question is: are there any benefits to null-termination that I'm missing?

Answer

The usual solution is to do both - keep the length and maintain the null terminator. It's not much extra work and means that you are always ready to pass the string to any function.

Null-terminated strings are often a drain on performance, for the obvious reason that the time taken to discover the length depends on the length. On the plus side, they are the standard way of representing strings in C, so you have little choice but to support them if you want to use most C libraries.