SpliFF SpliFF - 9 months ago 54
C Question

How to create a lightweight C code sandbox?

I'd like to build a C pre-processor / compiler that allows functions to be collected from local and online sources. ie:

#fetch MP3FileBuilder http://scripts.com/MP3Builder.gz
#fetch IpodDeviceReader http://apple.com/modules/MP3Builder.gz

void mymodule_main() {
MP3FileBuilder(&some_data);
}


That's the easy part.

The hard part is I need a reliable way to "sandbox" the imported code from direct or unrestricted access to disk or system resources (including memory allocation and the stack). I want a way to safely run small snippets of untrusted C code (modules) without the overhead of putting them in separate process, VM or interpreter (a separate thread would be acceptable though).

REQUIREMENTS


  • I'd need to put quotas on its access to data and resources including CPU time.

  • I will block direct access to the standard libraries

  • I want to stop malicious code that creates endless recursion

  • I want to limit static and dynamic allocation to specific limits

  • I want to catch all exceptions the module may raise (like divide by 0).

  • Modules may only interact with other modules via core interfaces

  • Modules may only interact with the system (I/O etc..) via core interfaces

  • Modules must allow bit ops, maths, arrays, enums, loops and branching.

  • Modules cannot use ASM

  • I want to limit pointer and array access to memory reserved for the module (via a custom safe_malloc())

  • Must support ANSI C or a subset (see below)

  • The system must be lightweight and cross-platform (including embedded systems).

  • The system must be GPL or LGPL compatible.



I'm happy to settle for a subset of C. I don't need things like templates or classes. I'm primarily interested in the things high-level languages don't do well like fast maths, bit operations, and the searching and processing of binary data.

It is not the intention that existing C code can be reused without modification to create a module. The intention is that modules would be required to conform to a set of rules and limitations designed to limit the module to basic logic and transformation operations (like a video transcode or compression operations for example).

The theoretical input to such a compiler/pre-processor would be a single ANSI C file (or safe subset) with a module_main function, NO includes or pre-processor directives, no ASM, It would allow loops, branching, function calls, pointer maths (restricted to a range allocated to the module), bit-shifting, bitfields, casts, enums, arrays, ints, floats, strings and maths. Anything else is optional.

EXAMPLE IMPLEMENTATION

Here's a pseudo-code snippet to explain this better. Here a module exceeds it's memory allocation quota and also creates infinite recursion.

buffer* transcodeToAVI_main( &in_buffer ) {
int buffer[1000000000]; // allocation exceeding quota
while(true) {} // infinite loop
return buffer;
}


Here's a transformed version where our preprocessor has added watchpoints to check for memory usage and recursion and wrapped the whole thing in an exception handler.

buffer* transcodeToAVI_main( &in_buffer ) {
try {
core_funcStart(__FILE__,__FUNC__); // tell core we're executing this function
buffer = core_newArray(1000000000, __FILE__, __FUNC__); // memory allocation from quota
while(true) {
core_checkLoop(__FILE__, __FUNC__, __LINE__) && break; // break loop on recursion limit
}
core_moduleEnd(__FILE__,__FUNC__);
} catch {
core_exceptionHandler(__FILE__, __FUNC__);
}
return buffer;
}


I realise performing these checks impact the module performance but I suspect it will still outperform high-level or VM languages for the tasks it is intended to solve. I'm not trying to stop modules doing dangerous things outright, I'm just trying to force those dangerous things to happen in a controlled way (like via user feedback). ie: "Module X has exceeded it's memory allocation, continue or abort?".

UPDATE

The best I've got so far is to use a custom compiler (Like a hacked TCC) with bounds checking and some custom function and looping code to catch recursions. I'd still like to hear thoughts on what else I need to check for or what solutions are out there. I imagine that removing ASM and checking pointers before use solves a lot of the concerns expressed in previous answers below. I added a bounty to pry some more feedback out of the SO community.

For the bounty I'm looking for:


  • Details of potential exploits against the theoretical system defined above

  • Possible optimisations over checking pointers on each access

  • Experimental open-source implementations of the concepts (like Google Native Client)

  • Solutions that support a wide range of OS and devices (no OS/hardware based solutions)

  • Solutions that support the most C operations, or even C++ (if that's possible)



Extra credit for a method that can work with GCC (ie, a pre-processor or small GCC patch).

I'll also give consideration to anyone who can conclusively prove what I'm attempting cannot be done at all. You will need to be pretty convincing though because none of the objections so far have really nailed the technical aspects of why they think it's impossible. In the defence of those who said no this question was originally posed as a way to safely run C++. I have now scaled back the requirement to a limited subset of C.

My understanding of C could be classed as "intermediate", my understanding of PC hardware is maybe a step below "advanced". Try to coach your answers for that level if you can. Since I'm no C expert I'll be going largely based on votes given to an answer as well as how closely the answer comes to my requirements. You can assist by providing sufficient evidence for your claims (respondents) and by voting (everyone else). I'll assign an answer once the bounty countdown reaches 6 hours.

Finally, I believe solving this problem would be a major step towards maintaining C's relevance in an increasingly networked and paranoid world. As other languages close the gap performance-wise and computing power grows it will be harder and harder to justify the added risk of C development (as it is now with ASM). I believe your answers will have a much greater relevance than scoring a few SO points so please contribute what you can, even if the bounty has expired.

Answer Source

Since the C standard is much too broad to be allowed, you would need to go the other way around: specify the minimum subset of C which you need, and try to implement that. Even ANSI C is already too complicated and allows unwanted behaviour.

The aspect of C which is most problematic are the pointers: the C language requires pointer arithmitic, and those are not checked. For example:

char a[100];
printf("%p %p\n", a[10], 10[a]);

will both print the same address. Since a[10] == 10[a] == *(10 + a) == *(a + 10).

All these pointer accesses cannot be checked at compile time. That's the same complexity as asking the compiler for 'all bugs in a program' which would require solving the halting problem.

Since you want this function to be able to run in the same process (potentially in a different thread) you share memory between your application and the 'safe' module since that's the whole point of having a thread: share data for faster access. However, this also means that both threads can read and write the same memory.

And since you cannot prove compile time where pointers end up, you have to do that at runtime. Which means that code like 'a[10]' has to be translated to something like 'get_byte(a + 10)' at which point I wouldn't call it C anymore.

Google Native Client

So if that's true, how does google do it then? Well, in contrast to the requirements here (cross-platform (including embedded systems)), Google concentrates on x86, which has in additional to paging with page protections also segment registers. Which allows it to create a sandbox where another thread does not share the same memory in the same way: the sandbox is by segmentation limited to changing only its own memory range. Furthermore:

  • a list of safe x86 assembly constructs is assembled
  • gcc is changed to emit those safe constructs
  • this list is constructed in a way that is verifiable.
  • after loading a module, this verification is done

So this is platform specific and is not a 'simple' solution, although a working one. Read more at their research paper.

Conclusion

So whatever route you go, you need to start out with something new which is verifiable and only then you can start by adapting an existing a compiler or generating a new one. However, trying to mimic ANSI C requires one to think about the pointer problem. Google modelled their sandbox not on ANSI C but on a subset of x86, which allowed them to use existing compilers to a great extend with the disadvantage of being tied to x86.