thunder1123 - 1 month ago
Python Question

Parse a WhatsApp conversation log

I am trying to write a parser for WhatsApp conversation logs. A minimal log file is included at the end of the question.

In this log, there are two kinds of messages. The normal ones have the syntax

date time: Name: Message

As you can see, the Message can continue on a new line, and the Name can contain a colon (:).

The second kind of messages are "event" messages, which could be of the following types:

date time: Name joined
date time: Name left
date time: Name was removed
date time: Name changed the subject to “GroupName”
date time: Name changed the group icon

I tried to write some regexes, but I ran into several difficulties: how to handle multiline messages; how to parse the Name field (because splitting on : does not work); how to build a regex that recognizes messages only from senders who are currently in the group; and finally how to parse the special messages (for example, searching for joined as the last word is not a good idea).

How can I parse such a log file and move everything to a dictionary?

More precisely, to answer the question in the comment, the output I was thinking about was something like a nested dict, where the first-level keys are the senders, the second level distinguishes between 'Events' (such as joined, left, etc.) and 'Message', and everything is stored as a list of tuples.



But if you could think of a more intelligent format, go for it!

29/03/14 15:48:05: John Smith changed the subject to “Test”

29/03/14 16:10:39: John Smith joined

29/03/14 16:10:40: Person:2 joined

29/03/14 16:10:40: John Smith: Hello!

29/03/14 16:11:40: Person:2: some random words,

29/03/14 16:12:40: Person3 joined

29/03/14 16:13:40: John Smith: Hello!Test message with newline
Another line of the same message
Another line.

29/03/14 16:14:43: Person:2: Test message using as last word joined

29/03/14 16:15:57: Person3 left

29/03/14 16:17:16: Person3 joined

29/03/14 16:18:21: Person:2 changed the group icon

29/03/14 16:19:16: Person3 was removed

29/03/14 16:20:43: Person:2: Test message using as last word left


You can use this pattern:

(?P<datetime>\d{2}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2}): (?P<name>\w+(?::\s*\w+)*|[\w\s]+?)(?:\s+(?P<action>joined|left|was removed|changed the (?:subject to “\w+”|group icon))|:\s(?P<message>(?:.+|\n(?!\n))+))
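
A minimal sketch of how this pattern might be applied in Python to build the nested dict described in the question. The names PATTERN, LOG, and parse_log are illustrative assumptions, and the sample log is shortened; the pattern itself is unchanged apart from being split across string literals.

```python
import re
from collections import defaultdict

# The pattern above, split over several raw string literals for readability.
PATTERN = re.compile(
    r"(?P<datetime>\d{2}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2}): "
    r"(?P<name>\w+(?::\s*\w+)*|[\w\s]+?)"
    r"(?:\s+(?P<action>joined|left|was removed|"
    r"changed the (?:subject to “\w+”|group icon))"
    r"|:\s(?P<message>(?:.+|\n(?!\n))+))"
)

# Shortened sample log (assumption: entries are separated by a blank line).
LOG = """29/03/14 15:48:05: John Smith changed the subject to “Test”

29/03/14 16:10:39: John Smith joined

29/03/14 16:10:40: Person:2 joined

29/03/14 16:13:40: John Smith: Hello!Test message with newline
Another line of the same message
Another line.

29/03/14 16:14:43: Person:2: Test message using as last word joined

29/03/14 16:15:57: Person3 left
"""


def parse_log(text):
    """Return {sender: {"Events": [(datetime, action)], "Message": [(datetime, text)]}}."""
    result = defaultdict(lambda: {"Events": [], "Message": []})
    for m in PATTERN.finditer(text):
        # Exactly one of the two groups is set for every match.
        if m.group("action") is not None:
            kind, payload = "Events", m.group("action")
        else:
            kind, payload = "Message", m.group("message")
        result[m.group("name")][kind].append((m.group("datetime"), payload))
    return dict(result)


parsed = parse_log(LOG)
for sender, entries in parsed.items():
    print(sender, entries)
```

The message ending in "joined" lands under "Message" because the action alternative is only tried directly after the name. The name and message remain ambiguous in some edge cases: a line such as "Person:2: Test joined" would be read as the sender "Person:2: Test" joining.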


To deal with multiline messages, I use a negative lookahead to forbid consecutive newline characters. However, you can make the pattern more tolerant by adding the start of the next block or the end of the string in the lookahead after the \n.
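
One possible reading of that suggestion, sketched in Python: the lookahead rejects a newline only when the blank line after it is followed by a new entry header or by the end of the string, so a blank line inside a message no longer cuts it short. The names HEADER, build_pattern, STRICT, and TOLERANT are assumptions made for this illustration.

```python
import re

# Hypothetical helper: the start of a log entry, "dd/mm/yy hh:mm:ss: ".
HEADER = r"\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}: "


def build_pattern(stop):
    """Build the pattern with `stop` as the body of the lookahead after the newline."""
    return re.compile(
        r"(?P<datetime>\d{2}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2}): "
        r"(?P<name>\w+(?::\s*\w+)*|[\w\s]+?)"
        r"(?:\s+(?P<action>joined|left|was removed|"
        r"changed the (?:subject to “\w+”|group icon))"
        r"|:\s(?P<message>(?:.+|\n(?!" + stop + r"))+))"
    )


# Original behaviour: a message stops at any two consecutive newlines.
STRICT = build_pattern(r"\n")
# Tolerant variant: stop only if the blank line is followed by a new
# entry header or by the end of the string.
TOLERANT = build_pattern(r"\n(?:" + HEADER + r"|\Z)")

LOG = (
    "29/03/14 16:13:40: John Smith: First paragraph\n"
    "\n"
    "Second paragraph after a blank line\n"
    "\n"
    "29/03/14 16:15:57: Person3 left\n"
)


def messages(pattern, text):
    """Collect the text of every normal (non-event) message."""
    return [m.group("message") for m in pattern.finditer(text)
            if m.group("message") is not None]


print(messages(STRICT, LOG))
# ['First paragraph']
print(messages(TOLERANT, LOG))
# ['First paragraph\n\nSecond paragraph after a blank line']
```

With the strict lookahead the second paragraph is silently dropped, while the tolerant variant keeps it as part of the same message and still recognizes the "left" event that follows.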