String Initialization is Tricksy

A friend of mine had to hand in an assignment for a Computer Science course at university. As I understood it, it was a relatively easy assignment. The point is that this friend is a very experienced programmer who knows a thing or two about it. Anyway, he had a line in the code which went like this:
char buf[1024] = "abc";

You don’t even need to know C in order to understand that line, right? I assume we all agree on that. It simply initializes the buffer with a constant string literal. So his lecturer asked him what this line does, precisely, and to his surprise his answer was incorrect. The correct answer is that the whole buffer is initialized and then the string constant is copied (this can be done in a few ways, for example by copying a buffer that already has the zeros at the end of it).

So today another friend called me on the phone to ask about this, and why our first friend was wrong. Now, as a reverser, I suppose I need to know the answer to such a simple matter as well. The sad part is that I was just as wrong as the two of them. I fired up the C standard and started to search for the solution. I wanted living proof of the matter at hand. Looking here and there, it took me around 15 minutes to lay my hands on the sentence that settled the whole matter. And I quote:

“If there are fewer initializers in a brace-enclosed list than there are elements or members of an aggregate, or fewer characters in a string literal used to initialize an array of known size than there are elements in the array, the remainder of the aggregate shall be initialized implicitly the same as objects that have static storage duration.”

The last part of that sentence is the answer: if there are fewer characters than elements in the array being initialized, the remainder has to be initialized as well. There is another clause which explains how that initialization is done, but for now, let it be ‘zeroing’.
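A quick way to convince yourself is to count the non-zero bytes after the literal. This is only a minimal sketch; the function name is made up for illustration, and it assumes that "initialized the same as static objects" means zero for a char array.

```c
#include <stddef.h>

/* Hypothetical helper: counts the non-zero bytes that follow the
   characters of the literal inside a partially initialized array. */
size_t nonzero_bytes_after_literal(void)
{
    char buf[1024] = "abc";   /* the line from the assignment */
    size_t count = 0;

    /* buf[0..2] hold 'a', 'b', 'c'; everything from index 3 onward
       has to be zero according to the quoted sentence. */
    for (size_t i = 3; i < sizeof buf; i++) {
        if (buf[i] != 0)
            count++;
    }
    return count;
}
```

Calling nonzero_bytes_after_literal() from a small main should give 0 on a conforming compiler, however the compiler chooses to produce those zeros.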

Now, the reason I was wrong about it is that I have seen many sequences like this one (for example):
mov byte ptr [buf+0], 'a'
mov byte ptr [buf+1], 'b'
mov byte ptr [buf+2], 'c'
mov byte ptr [buf+3], '\0'

in lots of functions, which means the C source code is:

char buf[] = "abc";

The standard says that in this case the size of the buffer is taken from the size of the string literal, and don’t forget the null terminator as well. So that’s why I never saw a memset coming in to initialize the whole buffer. Besides, maybe most people code it this way:
char buf[1024];
strcpy(buf, "abc");
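The size difference between the two declarations is easy to check with sizeof. A small sketch, with helper names that are mine and only there for illustration:

```c
#include <stddef.h>

/* Hypothetical helpers for illustration only. */
size_t unsized_array_size(void)
{
    char buf[] = "abc";       /* size is deduced from the literal */
    return sizeof buf;        /* 'a', 'b', 'c' and the '\0' */
}

size_t sized_array_size(void)
{
    char buf[1024] = "abc";   /* size is fixed by the declaration */
    return sizeof buf;
}
```

The first helper returns 4 and the second 1024, so with the unsized form there is simply no remainder left to zero.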

Which doesn’t lead to a memset or any other initialization of the rest of the array.
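To see that strcpy touches only the string and its terminator, one option is to pre-fill the buffer with a marker byte and look at what survives. This is a sketch under assumptions: the 0xAA marker and the function name are my own, and the memset is only a stand-in for whatever garbage happens to be on the stack, since reading a truly uninitialized buffer would not be a reliable test.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns how many bytes past the copied string
   still hold the marker value after strcpy. */
size_t untouched_bytes_after_strcpy(void)
{
    const unsigned char marker = 0xAA;  /* arbitrary, assumed marker value */
    char buf[1024];
    size_t untouched = 0;

    memset(buf, marker, sizeof buf);    /* stand-in for stack garbage */
    strcpy(buf, "abc");                 /* writes 'a', 'b', 'c', '\0' only */

    /* Bytes 0..3 were written by strcpy; the rest should be unchanged. */
    for (size_t i = 4; i < sizeof buf; i++) {
        if ((unsigned char)buf[i] == marker)
            untouched++;
    }
    return untouched;
}
```

All 1020 bytes after the terminator keep the marker, which is exactly the part that the initializer form would have had to zero.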

This entry was posted on Monday, December 29th, 2008 at 1:43 pm and is filed under Assembly, C++.

10 Responses to “String Initialization is Tricksy”

AmiRach says:

December 29, 2008 at 2:16 pm

I think that, the C standard aside, all “normal” compilers compile it the way all three of you thought at the beginning. Now, why is that?
arkon says:

December 29, 2008 at 3:38 pm

I checked other compilers, they all initialize the buffer.

That’s why I added, further down, another code sample which shows how people usually write it; a different approach would be to not specify a size for the buffer at all.
Yoni says:

December 29, 2008 at 3:53 pm

There is a glaring discrepancy between the two words in the post that are in bold. So – what’s initialized? The whole buffer before copying, or the remainder after copying?

(I hope it’s the latter. Seems to be an awful waste to zero-then-write. But then, I used to cringe when memsetting structs to 0 right before filling some of their fields. Maybe I should accept that this is the way to be safe and sound.)
arkon says:

December 29, 2008 at 3:56 pm

Hey Yoni,
well, sorry for the confusion. What I tried to say is that it is up to the implementation how the object, after initialization, ends up holding its string with the zeros appended.
So to speak, some implementations copy the buffer from a data section with the zeros already inside of it. Some memset the whole buffer and only then strcpy the string, etc…

Eventually the rest of it has to be initialized to zero.
arkon says:

December 29, 2008 at 3:57 pm

Oh, and about you filling zeros first, it’s part of defensive programming. Sometimes, for the sake of optimization, I don’t zero buffers when the next line fills them in with other data. But otherwise, I totally agree with that.
StatusReport says:

December 29, 2008 at 4:12 pm

:)

The funny part here is that:
char buf[1024];
strcpy(buf, "abc");

Is actually faster than
char buf[1024] = "abc";

And because this zero padding is required by the ANSI C standard, compilers cannot optimize it away to achieve better performance.

Oh well, only 2 points were reduced :P
arkon says:

December 29, 2008 at 4:47 pm

Well yeah specifically in that case it’s faster. Mind you it can substitute the strcpy with a simple DWORD copy.
But anyway, it’s *stupid* to code a predefined size for a specific constant string.
Aram Havarneanu says:

December 30, 2008 at 8:46 am

Heh, I always initialize arrays with:

type buf[bsize] = {0};

I think it is quite elegant. :).
arkon says:

December 30, 2008 at 9:23 am

Usually I, personally, do it too, but again, for performance it sucks.
Chris says:

May 12, 2009 at 10:04 pm

> Well yeah specifically in that case it’s faster. Mind you it can substitute the strcpy
> with a simple DWORD copy.
> But anyway, it’s *stupid* to code a predefined size for a specific constant string.

This may be for writing to a file something like:
write(fp, buf, 1024);

Insanely Low-Level
