Alex Kale - 1 month ago
C Question

Unicode string comparison in C

I'm learning UNIX system programming and writing a simple shell application for UNIX (I'm on OS X Yosemite 10.10.5 and use Xcode). I have some experience with C, but not much.

Utility programs work fine and print Unicode characters (though ls prints '????' instead in the Xcode console, but that seems to be a problem with the debugger itself).

I did a little research and found out that strcmp should work fine with it too, since it just compares bytes and looks for a zero byte at the end. Reading input should be OK too, as you just read bytes.

I've also read that a Unicode string shouldn't contain null bytes. However, some input causes EXC_BAD_ACCESS when doing strcmp.

Code:


1) Reading user input:

char* readCommand(void) {
    int buffer_size = LINE_BUFFER_SIZE;
    char *buffer = malloc(sizeof(char) * buffer_size);
    int position = 0;
    int character;

    if(!buffer)
    {
        fprintf(stderr, "readCommand failed: memory allocation error");
        exit(ALLOCATION_ERROR);
    }

    while (1) {
        character = getchar();
        if(character == EOF || character == '\n')
        {
            buffer[position] = '\0';
            char* cmd = buffer;
            free(buffer);
            return cmd;
        }
        else {
            buffer[position] = character;
        }
        if(++position >= sizeof(buffer))
        {
            buffer_size += LINE_BUFFER_SIZE;
            buffer = realloc(buffer, sizeof(char) * buffer_size);
            if(!buffer) {
                fprintf(stderr, "readCommand failed: memory reallocation error");
                free(buffer);
                exit(ALLOCATION_ERROR);
            }
        }
    }
    return NULL;
}




2) Split args:

int split_string_quotes(char* source, char** argv, size_t arg_count)
{
    enum split_states state = DULL;
    char* p, *word_start = NULL;
    int character;
    int argc = 0;
    for(p = source; argc < arg_count && *p != '\0'; p++)
    {
        character = (unsigned char) *p;
        switch (state) {
            case DULL:
                if(isspace(character))
                {
                    continue;
                }
                if(character == '"')
                {
                    state = IN_STRING;
                    word_start = p+1;
                    continue;
                }
                state = IN_WORD;
                word_start = p;
                continue;

            case IN_WORD:
                if(isspace(character))
                {
                    state = DULL;
                    *p = 0;
                    argv[argc++] = word_start;
                }
                continue;

            case IN_STRING:
                if(character == '"')
                {
                    state = DULL;
                    *p = 0;
                    argv[argc++] = word_start;
                }
                continue;
        }
    }

    if(state != DULL && argc < arg_count)
    {
        argv[argc++] = word_start;
    }
    argv[argc] = NULL;
    return argc;
}


3) That's where strcmp is:

int shell_execute(char **args)
{
    for(int i = 0; i < 3; i++)
    {
        if(strcmp(args[0], commands[i]) == 0)
        {
            return (*standardFuncs[i])(args);
        }
    }
    shell_launch(args);
    return 0;
}




4) And the main loop:

char* current_dir = malloc(sizeof(char)*PATH_MAX);
char* args[MAX_ARGS];
char* command;
printf("dolphinShell (c) Alex Kale 2016\n");
while (1)
{
    getwd(current_dir);
    printf("dsh: %s-> ", current_dir);
    command = readCommand();
    printf("%s\n", command);
    split_string_quotes(command, args, MAX_ARGS);
    if(shell_execute(args) == -1) break;
}
free(current_dir);
return 0;




So the problem is the following: some Unicode strings I type work fine and never cause EXC_BAD_ACCESS, but when I type 'фывпфвыапы', for example, it breaks. I think the problem is with accessing args[0], but here's the debugger's output:



Printing description of args:
(char **) args = 0x00007fff5fbff900
*args char * 0x101800a00 0x0000000101800a00
Printing description of *(*(args)):
(char) **args = '\xd1'


So it thinks that args[0] is empty, but is it empty? Or is it confused by all the zeroes?


I'm really confused, I've spent a lot of time researching and seem to be stuck here.

I have also tried using wchar_t and wcscmp, but that doesn't work well with execvp() and doesn't solve the problem.

I have also tried gcc -Wall -Wextra, and here's the output:



main.c:53:26: warning: comparison of integers of different signs: 'int' and
'size_t' (aka 'unsigned long') [-Wsign-compare]
for(p = source; argc < arg_count && *p != '\0'; p++)
~~~~ ^ ~~~~~~~~~
main.c:92:30: warning: comparison of integers of different signs: 'int' and
'size_t' (aka 'unsigned long') [-Wsign-compare]
if(state != DULL && argc < arg_count)
~~~~ ^ ~~~~~~~~~
main.c:124:23: warning: comparison of integers of different signs: 'int' and
'unsigned long' [-Wsign-compare]
if(++position >= sizeof(buffer))
~~~~~~~~~~ ^ ~~~~~~~~~~~~~~
main.c:180:18: warning: unused parameter 'args' [-Wunused-parameter]
int dHelp(char **args)
^
main.c:203:18: warning: unused parameter 'args' [-Wunused-parameter]
int dExit(char **args)
^
main.c:210:14: warning: unused parameter 'argc' [-Wunused-parameter]
int main(int argc, const char** argv)
^
main.c:210:33: warning: unused parameter 'argv' [-Wunused-parameter]
int main(int argc, const char** argv)
^
7 warnings generated.


But I don't think these warnings are the cause (correct me if I'm wrong).

This is my first question on SO, so I'm sorry for any mistakes I've made.

Answer

There are multiple bugs in the shown code.

        char* cmd = buffer;
        free(buffer);
        return cmd;

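This returns a pointer to a buffer that has already been freed. Any further use of this pointer, such as the strcmp call in shell_execute, is undefined behavior, which would explain why some inputs appear to work while others crash.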
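A minimal sketch of the usual fix, assuming the function hands ownership of the buffer to its caller instead of freeing it before returning. The name copy_line is hypothetical and only stands in for any function that returns a malloc-ed string:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap-allocated copy of src.
   The buffer is NOT freed here; ownership passes to the caller,
   who must call free() after the last use. */
char *copy_line(const char *src)
{
    size_t len = strlen(src);
    char *buffer = malloc(len + 1);

    if (!buffer) {
        return NULL;
    }
    memcpy(buffer, src, len + 1);  /* copies the terminating '\0' too */
    return buffer;                 /* still valid when the caller sees it */
}
```

Applied to the question's code, this would mean returning buffer directly from readCommand and calling free(command) at the end of each iteration of the main loop, after split_string_quotes and shell_execute are done with it.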

        if(++position >= sizeof(buffer))

buffer is a char *, so this is equivalent to:

        if(++position >= sizeof(char *))

sizeof(char *) is either 4 or 8 bytes, depending on your platform. As a result, once position reaches that value, the buffer is needlessly reallocated on every subsequent character.

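You seem to believe that sizeof() gives the size of the malloc-ed buffer. It does not. It gives the size of the pointer itself; the allocated size has to be tracked separately, which is what your buffer_size variable is for.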
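One way to make this hard to get wrong is to keep the capacity right next to the pointer. The following is only a sketch under that assumption; struct line_buffer and its two functions are hypothetical names, not part of the question's code:

```c
#include <stdlib.h>

/* Hypothetical growable buffer: the capacity is stored explicitly,
   because sizeof(data) would only give the size of a pointer. */
struct line_buffer {
    char  *data;
    size_t capacity;  /* bytes allocated */
    size_t length;    /* bytes used */
};

/* initial_capacity must be nonzero. Returns 0 on success, -1 on failure. */
int line_buffer_init(struct line_buffer *lb, size_t initial_capacity)
{
    lb->data = malloc(initial_capacity);
    if (!lb->data) {
        return -1;
    }
    lb->capacity = initial_capacity;
    lb->length = 0;
    return 0;
}

/* Appends one byte, growing by `step` bytes (nonzero) only when the
   buffer is actually full. Returns 0 on success, -1 on failure. */
int line_buffer_push(struct line_buffer *lb, char c, size_t step)
{
    if (lb->length >= lb->capacity) {
        size_t new_capacity = lb->capacity + step;
        /* Use a temporary so the old block is not lost if realloc fails. */
        char *grown = realloc(lb->data, new_capacity);
        if (!grown) {
            return -1;  /* lb->data is still valid and still owned by lb */
        }
        lb->data = grown;
        lb->capacity = new_capacity;
    }
    lb->data[lb->length++] = c;
    return 0;
}
```

Assigning the result of realloc to a temporary also avoids a second problem in the original: writing buffer = realloc(buffer, ...) overwrites the only pointer to the old block when realloc returns NULL, so the following free(buffer) frees nothing.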

In conclusion: your overall approach here is to write a big pile of code, and then try to see if it works correctly. This is the wrong way to do it. You need to write one small function, such as the one that reads a line into a buffer. Test it. Verify that it works. Now that you know it works, move on and write the next small part of your overall program.
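As an illustration of that workflow, here is a sketch of a line reader together with a small check for it. It makes a few assumptions that are not in the question: the reader takes a FILE * so a test can feed it a temporary file instead of the terminal, it returns NULL at end of input, and LINE_CHUNK is a made-up growth step standing in for LINE_BUFFER_SIZE (whose value is not shown). All names are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LINE_CHUNK 16  /* assumed growth step, in bytes */

/* Reads one line from `in`, without the trailing '\n'.
   Returns a malloc-ed string that the caller must free, or NULL on
   allocation failure or when EOF is reached before any byte is read. */
char *read_line(FILE *in)
{
    size_t capacity = LINE_CHUNK;
    size_t position = 0;
    char *buffer = malloc(capacity);
    int character;

    if (!buffer) {
        return NULL;
    }
    while ((character = fgetc(in)) != EOF && character != '\n') {
        buffer[position++] = (char)character;
        if (position >= capacity) {  /* compare with the tracked capacity */
            char *grown = realloc(buffer, capacity + LINE_CHUNK);
            if (!grown) {
                free(buffer);
                return NULL;
            }
            buffer = grown;
            capacity += LINE_CHUNK;
        }
    }
    if (character == EOF && position == 0) {
        free(buffer);
        return NULL;
    }
    buffer[position] = '\0';  /* position < capacity always holds here */
    return buffer;
}

/* Test helper: writes `text` plus a newline to a temporary file, reads it
   back with read_line, and returns 1 if the bytes match, 0 otherwise. */
int read_line_roundtrips(const char *text)
{
    FILE *tmp = tmpfile();
    char *line;
    int ok;

    if (!tmp) {
        return 0;
    }
    fputs(text, tmp);
    fputc('\n', tmp);
    rewind(tmp);
    line = read_line(tmp);
    ok = line != NULL && strcmp(line, text) == 0;
    free(line);  /* free(NULL) is a no-op */
    fclose(tmp);
    return ok;
}
```

Calling read_line_roundtrips with a short ASCII command, an empty string, and a multi-byte UTF-8 string longer than LINE_CHUNK bytes exercises the growth path and confirms that strcmp handles UTF-8 input as plain bytes, before any of the argument-splitting code is written.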