Apache access log regex parsing

I have a custom access log format for Apache:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{JSESSIONID}C %D %V" mylog

I am trying to parse the generated logs from Python, but I have two problems:

  • Requests without a protocol version (HTTP/1.0 or HTTP/1.1) are not parsed correctly.

  • Requests with spaces in the requested path are not parsed correctly (I don't know whether Apache normally stores this path encoded or keeps the spaces, but I was able to generate such a log line by making a request by hand with telnet).

Using this regex:

(?P<ip>.*) (?P<remote_log_name>.*) (?P<userid>.*) \[(?P<date>.*)(?= ) (?P<timezone>.*?)\] \"(?P<request_method>.*) (?P<path>.*)(?P<request_version> HTTP/.*)\" (?P<status>.*) (?P<length>.*) \"(?P<referrer>.*)\" \"(?P<user_agent>.*)\" (?P<session_id>.*) (?P<generation_time_micro>.*) (?P<virtual_host>.*)

The parsing fails with the first 3 lines of this log:

- - [11/Nov/2016:03:04:55 +0100] "GET /" 200 83 "-" "-" - 9221
- - [11/Nov/2016:14:24:21 +0100] "GET /uno dos" 404 298 "-" "-" - 400233
- - [11/Nov/2016:14:23:37 +0100] "GET /uno dos HTTP/1.0" 404 298 "-" "-" - 385111
- - [11/Nov/2016:00:00:11 +0100] "GET /icc HTTP/1.1" 302 - "-" "XXX XXX XXX" - 6160
- - [11/Nov/2016:00:00:11 +0100] "GET /icc/ HTTP/1.1" 302 - "-" "XXX XXX XXX" - 2981

The regex can be tested here: https://regex101.com/r/xDfSqj/2


Try this solution: https://regex101.com/r/xDfSqj/4

It's the same regex you had, with a few small changes:

(?P<ip>.*?) (?P<remote_log_name>.*?) (?P<userid>.*?) \[(?P<date>.*?)(?= ) (?P<timezone>.*?)\] \"(?P<request_method>.*?) (?P<path>.*?)(?P<request_version> HTTP/.*)?\" (?P<status>.*?) (?P<length>.*?) \"(?P<referrer>.*?)\" \"(?P<user_agent>.*?)\" (?P<session_id>.*?) (?P<generation_time_micro>.*?) (?P<virtual_host>.*)

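The request_version group (the part that matches " HTTP/1.0") has been given the ? quantifier, which makes it optional, so requests without a protocol version still match. A ? has also been added after the .* in your other groups to make them lazy instead of greedy. That is what stops request_method from swallowing part of a path that contains spaces.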
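Since you are parsing from Python, here is a minimal sketch of how the regex above could be used with the standard re module. The parse_line helper and the sample lines (including the 127.0.0.1 address and the example.com virtual host) are made up for illustration and are not taken from your log.

```python
import re

# Same pattern as above, split across lines for readability.
LOG_PATTERN = re.compile(
    r'(?P<ip>.*?) (?P<remote_log_name>.*?) (?P<userid>.*?) '
    r'\[(?P<date>.*?)(?= ) (?P<timezone>.*?)\] '
    r'\"(?P<request_method>.*?) (?P<path>.*?)(?P<request_version> HTTP/.*)?\" '
    r'(?P<status>.*?) (?P<length>.*?) '
    r'\"(?P<referrer>.*?)\" \"(?P<user_agent>.*?)\" '
    r'(?P<session_id>.*?) (?P<generation_time_micro>.*?) (?P<virtual_host>.*)'
)


def parse_line(line):
    """Return a dict of the named groups, or None if the line does not match.

    Hypothetical helper, not part of any library.
    """
    match = LOG_PATTERN.match(line.rstrip("\n"))
    if match is None:
        return None
    fields = match.groupdict()
    # The optional group captures its leading space; it is None when absent.
    if fields["request_version"] is not None:
        fields["request_version"] = fields["request_version"].strip()
    return fields


# Made-up sample lines in the same shape as the custom LogFormat.
samples = [
    '127.0.0.1 - - [11/Nov/2016:14:24:21 +0100] "GET /uno dos" 404 298 "-" "-" - 400233 example.com',
    '127.0.0.1 - - [11/Nov/2016:14:23:37 +0100] "GET /uno dos HTTP/1.0" 404 298 "-" "-" - 385111 example.com',
]

for sample in samples:
    fields = parse_line(sample)
    print(fields["request_method"], repr(fields["path"]), fields["request_version"])
```

Note that request_version comes back as None for requests sent without a protocol version, so check for that before using it.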

Is this what you were trying to achieve?