Franky Franky - 7 days ago 5
C Question

Implementing a matcher for the regex '[ab][^r]+r]' in assembly

I need help with my assembly code. I need to use write code, that will find range, that suit to my regex expression.

My regex:

[ab][^r]+r
, so first i'm looking for if there is 'a' or 'b' and jump to 'Start' section. Now i have a problem how to save only first occurrence of this letter.
Program should display:
5, 10
- it means, that matching string starts at 5 position and it has 10 length. Result of program i want to save in
'ecx'
and
'edx'
registries(or can i simplify it?)

I would appreciate all suggestions and help :)

Here is a code:

#include <stdio.h>

int main(void)
{
char *s = "fqr b qabxx xryc pqr"; // string length= 22, first occurence: 5 position, second one: 9
int x, y;

asm volatile (
".intel_syntax noprefix;"
"mov eax, %2;"

"xor ecx, ecx;"
"xor edx, edx;"

"lea ebx, [eax];"

"loop:"
"mov al, [ebx];"
"or al, al;"
"jz Finish;"
"inc edx;"

"cmp al, 'a';"
"je Start;"

"cmp al, 'b';"
"je Start;"

"jmp JumpNext;"

"Start:"
"mov ecx, edx;"
"jmp loop;"

"JumpNext:"
"inc ebx;"
"jmp loop;"

"Finish:"
"mov %0, ecx;"
"mov %1, edx;"

".att_syntax prefix;"
:"=r" (x), "=r" (y)
:"r" (s)
:"eax", "ebx", "ecx", "edx"
);

printf("%d, %d \n", x, y);
return 0;
}





EDIT: Here is a finished code:

#include <stdio.h>


int main(void)
{
char *s = "fqr b qabxx xryc pqr";
int x, y;

asm volatile (
".intel_syntax noprefix;"

"xor ecx, ecx;" // First occurrence of letter 'a' or 'b'
"mov edx, 1;" // Matching string length

"lea ebx, [%2];"

"loop:"
"mov al, [ebx];"
"cmp al, 0;"
"jz ThereisNoChars;"

"cmp al, 'a';"
"je FoundAorB;"

"cmp al, 'b';"
"je FoundAorB;"

"inc ecx;"
"inc ebx;"
"jmp loop;"

"FoundAorB:"
"inc ebx;"
"inc edx;"

"mov al, [ebx];"
"or al, al;"
"jz ThereisNoChars;"

"cmp al, 'r';"
"je isRafterAorB;"
"jne ThereIsNoR;"

"isRafterAorB:"
"mov edx, 1;"
"inc ebx;"
"inc ecx;"
"jmp loop;"

"ThereIsNoR:"
"inc ebx;"
"inc edx;"
"mov al,[ebx];"
"or al, al;"
"jz ThereisNoChars;"
"cmp al, 'r';"
"je Finish;"
"jmp ThereIsNoR;"

"ThereisNoChars:"
"mov ecx, 0;"
"mov edx, 0;"
"jmp Finish;"

"Finish:"
"mov %0, ecx;"
"mov %1, edx;"

".att_syntax prefix;"
:"=r" (x), "=r" (y)
:"r" (s)
:"eax", "ebx", "ecx", "edx"
);

printf("%d, %d \n", x, y);
return 0;
}


It displays expected result(5, 10). It means, that matching regex starts at 5 position and it has 10 length

Answer

First of all, you are a bit unclear about your requirements. When I first read your post, it looked like you were trying to write a complete "regex" routine in assembler (<blech>). But looking closer, it appears that all you are really doing is "hardcoding" this one, very specific regex search in assembler. This misunderstanding is probably why this question isn't getting any responses.

Second, you should talk to this guy who is apparently in the same class as you. Perhaps you two can share notes.

Third, someone should talk to your instructor about his assignments. Using gcc's "inline asm" to teach asm is probably the hardest possible approach. Does he hate his students? And I'm not impressed with the "template" that he provides which (apparently?) you are not permitted to change. I can see at least a half dozen things I'd change here.

Fourth, you say that the regex string "[ab][^r]+r" should print out 5, 10 for "fqr b qabxx xryc pqr". I'm not sure how that can be. The match does start at (zero-based) 5, but it does not end at position 10:

          1         2
0123456789012345678901
fqr  b qabxx  xryc pqr
     ^         ^
   start      end

The end is position 15. The matching string (b qabxx xr) is 11 characters long, so apparently you aren't looking for the length. There is a second 'start' that occurs at position 8, a third at position 9, and there are multiple possible endpoints as well. None of which explains where you are supposed to get "10". I'm going to assume you meant 5, 15.

All that said, processing [ab][^r]+r breaks down into essentially 3 parts:

  1. [ab] Find either 'a' or 'b'. Exit with error upon encountering end of string if you can't find them.
  2. [^r] If the letter immediately after (1) is 'r', goto 1.
  3. +r Walk the rest of the string and exit with success on next 'r', or exit with error at end of string.

If you don't understand why these are the parts, try playing with https://regex101.com/r/E3nI1F/1 (it lets you try various regex search strings to see what it finds).

Looking at your current code, I don't think you handle either (2) or (3) correctly (actually, I don't think you handle them at all). While there are other things I would change in your code, perhaps tuning should wait until the thing works correctly.

Given that this is homework, I'm not keen on just posting my code. If you just copy/paste my work, you aren't going to learn anything.

If you want to edit your question once you have added the work for 2 and 3, I could review this again or give more suggestions. If you post a working copy, I can share mine and we can compare them.

----------- Edit 1 --------------

my teacher does not seem to hate us

Oh? Consider this code (a simplified version of yours):

asm volatile (
   "xor %0, %0;"
   "mov %1, %2"
   :"=r" (x), "=r" (y)
   :"r" (s));

Seems pretty straight-forward, right? Zero out x, and copy s to y. However, due to something called "early clobber" (see '&' on https://gcc.gnu.org/onlinedocs/gcc/Modifiers.html), it is possible (not guaranteed) that when optimizing, gcc will choose the same register for both %0 and %2 (or maybe %1 and %2). So when you zero out %0, you could also be zeroing out %2.

This can be fixed by adding ampersands to ensure there's no overlap:

:"=&r" (x), "=&r" (y)

But how are you expected to know this? And knowing this detail doesn't help you learn assembler. It's just a weird quirk about how gcc's inline asm works. If you were writing an actual asm routine (which is what I'd recommend), you'd never need to know this.

And wouldn't this be easier to read if you used symbolic names?

asm volatile (
   "xor %[x], %[x];"
   "mov %[y], %[s]"
   : [x] "=&r" (x), [y] "=&r" (y)
   : [s] "r" (s));

I find it easier to read. But this is another thing that isn't really about assembly language. It's just a trick about how to shove inline asm into c code when using gcc (something you should almost never do).

What else? Some other issues with this template: The volatile qualifier doesn't belong here. It's missing the "cc" clobber. And the "memory" clobber. And you end up clobbering more registers than you need. Oh and why not just tell people to compile with -masm=intel and avoid that ".intel_syntax noprefix;" and ".att_syntax prefix;" junk (yet more gcc quirks).

Using assembly language can be useful. I'm not trying to say that it isn't. But trying to use gcc's inline asm is filled with quirks. Since functions written in pure assembler can be called from C code, and since that method has none of these issues, I can only conclude that you are being forced to do this because you were mean to him/her and (s)he hates you.

Comments