Uberfuzzy Uberfuzzy - 14 days ago 12
PHP Question

parsing raw email in php

I'm looking for good/working/simple to use php code for parsing raw email into parts.

I've written a couple of brute force solutions, but every time, one small change/header/space/something comes along and my whole parser fails and the project falls apart.

And before I get pointed at PEAR/PECL, I need actual code. My host has some screwy config or something, I can never seem to get the .so's to build right. If I do get the .so made, some difference in path/environment/php.ini doesn't always make it available (apache vs cron vs cli).

Oh, and one last thing, I'm parsing the raw email text, NOT POP3, and NOT IMAP. It's being piped into the php script via a .qmail email redirect.

I'm not expecting SOF to write it for me, I'm looking for some tips/starting points on doing it "right". This is one of those "wheel" problems that I know has already been solved.

Answer

What are you hoping to end up with at the end? The body, the subject, the sender, an attachment? You should spend some time with RFC2822 to understand the format of the mail, but here's the simplest rules for well formed email:

HEADERS\n
\n
BODY

That is, the first blank line (double newline) is the separator between the HEADERS and the BODY. A HEADER looks like this:

HSTRING:HTEXT

HSTRING always starts at the beginning of a line and doesn't contain any white space or colons. HTEXT can contain a wide variety of text, including newlines as long as the newline char is followed by whitespace.

The "BODY" is really just any data that follows the first double newline. (There are different rules if you are transmitting mail via SMTP, but processing it over a pipe you don't have to worry about that).

So, in really simple, circa-1982 RFC822 terms, an email looks like this:

HEADER: HEADER TEXT
HEADER: MORE HEADER TEXT
  INCLUDING A LINE CONTINUATION
HEADER: LAST HEADER

THIS IS ANY
ARBITRARY DATA
(FOR THE MOST PART)

Most modern email is more complex than that though. Headers can be encoded for charsets or RFC2047 mime words, or a ton of other stuff I'm not thinking of right now. The bodies are really hard to roll your own code for these days to if you want them to be meaningful. Almost all email that's generated by an MUA will be MIME encoded. That might be uuencoded text, it might be html, it might be a uuencoded excel spreadsheet.

I hope this helps provide a framework for understanding some of the very elemental buckets of email. If you provide more background on what you are trying to do with the data I (or someone else) might be able to provide better direction.