drishit96 - 4 months ago
Java Question

Reading chat message using regex

I am using regex in Java to read the contents of messages from a text file, where each message has the format:

02/05/16, 12:05 AM - ‪+91 00580 00000: Hello

02/05/16, 12:06 AM - ‪Ross Clark‬: Hello there!

I have formed the following pattern:

\d\d/\d\d/\d\d,\s\d{1,2}:\d\d(\s\w\w)?\s-\s((\w+\s?\w+)|(\+\d{2}\s\d{5}\s\d{5})): ((.*)(\n)*(.*))+


The problem is that only the chats that have the sender's name are matched. For example, in the sample above, the message sent by 'Ross Clark' is matched, but the message from the number +91 00580 00000 is not. However, there are also some rare cases where messages with a number do match.

Please help, I am new to this.

EDIT: I want to know whether the sender is a name or a number, i.e. I want the name to be captured by one group and the number by another, so I can differentiate between them.

oak
Answer

If you know the format of the message and it looks like:

 <Date>, <Time> - ‪<NameOrNumber>‬: <Message>

Then you can search for the text between the - and the : in either of two ways:

  1. String functions
  2. Regex
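The string-function option above can be sketched in Java as follows. This is only a minimal illustration, assuming the header and sender are separated by " - " and the sender and message by ": "; the class and method names are made up for the example.

```java
import java.util.Optional;

class SenderByIndexOf {
    /**
     * Returns the text between the first " - " and the next ": ",
     * or an empty Optional if the line does not fit the assumed format.
     */
    static Optional<String> sender(String line) {
        int dash = line.indexOf(" - ");
        if (dash < 0) {
            return Optional.empty();
        }
        int start = dash + 3; // skip over " - "
        // Search from 'start' so the colon inside the time (12:05) is ignored.
        int colon = line.indexOf(": ", start);
        if (colon < 0) {
            return Optional.empty();
        }
        return Optional.of(line.substring(start, colon));
    }

    public static void main(String[] args) {
        System.out.println(sender("02/05/16, 12:05 AM - +91 00580 00000: Hello"));
        System.out.println(sender("02/05/16, 12:06 AM - Ross Clark: Hello there!"));
    }
}
```

Starting the second search after the dash is what keeps the colon in the time from being mistaken for the end of the sender.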

Version 1

Regex version based on your solution:

\d\d\/\d\d\/\d\d,\s\d{1,2}:\d\d(\s\w\w)\s-\s(.+?): ((.*)(\n)*(.*))+

In this case the 2nd group holds the name or the phone number. Note that the forward slashes in the date are escaped in this version, so you may need to change that.

Version 2

:.+?-\s(.+?):

Searches for the text between the - and the :. The first group holds the name or the phone number, assuming the message format mentioned above.

Version 2+

:[^-]+-\s([^:]+):

Searches for the text between the - and the :. The first group holds the name or the phone number, assuming the message format mentioned above.

Version 3

:.+?-\s(.+?):(.+)

  • First group holds the NameOrNumber
  • 2nd group holds the message

Version 3+

:[^-]+-\s([^:]+):(.+)

  • First group holds the NameOrNumber
  • 2nd group holds the message
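As a rough sketch of how the Version 3+ pattern might be used from Java: the backslash has to be doubled inside the string literal, and find() is used because the pattern does not cover the date at the start of the line. The class and method names are hypothetical, and the sketch assumes single-line messages.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SenderAndMessage {
    // Version 3+ from the answer, written as a Java string literal.
    private static final Pattern LINE = Pattern.compile(":[^-]+-\\s([^:]+):(.+)");

    /** Returns {sender, message}, or an empty Optional if nothing matches. */
    static Optional<String[]> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        // Group 2 starts with the space that follows the colon, so trim it.
        return Optional.of(new String[] {m.group(1).trim(), m.group(2).trim()});
    }

    public static void main(String[] args) {
        parse("02/05/16, 12:06 AM - Ross Clark: Hello there!")
            .ifPresent(p -> System.out.println(p[0] + " | " + p[1]));
    }
}
```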

Version 4, assuming the number starts with + and the name does not

:[^-]+-\s([^:\+]*)(\+*[^:]+):(.+)

  • First group holds the name if any
  • 2nd group holds the number (starting with +), if any
  • 3rd group holds the message
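A sketch of the name-versus-number split in Java follows. It is a variation on Version 4, not the same regex: because (\+*[^:]+) must match at least one character, a plain name such as "Ross Clark" would be split there as "Ross Clar" and "k", so this version uses an alternation with named groups instead. Stripping Unicode format characters (\p{Cf}) first is a defensive assumption: some chat exports wrap the sender in invisible directional marks, which the question does not confirm. All names in the code are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SenderKind {
    // Either a sender starting with '+' (number) or one that does not (name).
    private static final Pattern LINE = Pattern.compile(
        "^[^-]+-\\s(?:(?<number>\\+[^:]+)|(?<name>[^:+][^:]*)): ?(?<message>.*)$");

    /** Returns {"number" or "name", sender, message}, or null if no match. */
    static String[] classify(String line) {
        // Assumption: invisible format characters may surround the sender.
        String cleaned = line.replaceAll("\\p{Cf}", "");
        Matcher m = LINE.matcher(cleaned);
        if (!m.matches()) {
            return null;
        }
        String number = m.group("number"); // null when the name branch matched
        if (number != null) {
            return new String[] {"number", number.trim(), m.group("message")};
        }
        return new String[] {"name", m.group("name").trim(), m.group("message")};
    }

    public static void main(String[] args) {
        String[] r = classify("02/05/16, 12:05 AM - +91 00580 00000: Hello");
        System.out.println(r[0] + " | " + r[1] + " | " + r[2]);
    }
}
```

Only one of the two named groups is set for a given line, so checking which one is non-null tells a name from a number.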

Version 5 - multi-line support for the message (date-based delimiter)

(\d{2}\/\d{2}\/\d{2}),\s([^-]+)+-\s([^:\+]*)(\+*[^:]+):((.|\n(?!\d{2}\/\d{2}\/\d{2},[^-]+))+)

  • First group holds the day
  • 2nd group holds the time
  • 3rd group holds the name if any
  • 4th group holds the number (starting with +), if any
  • 5th group holds the message
  • 6th group is a by-product of the repetition and holds only the last character of the message

How does version 5 work?

  1. (\d{2}\/\d{2}\/\d{2}) looks for the dd/mm/yy format
  2. ,\s([^-]+)+-\s looks for the time, which should come after the , and before the -
  3. ([^:\+]*)(\+*[^:]+): looks for the text before the next :. If it contains a +, it is a number; if not, it is a name
  4. ((.|\n(?!\d{2}\/\d{2}\/\d{2},[^-]+))+) is the tricky part, because . matches any character except a newline. This part matches either any character, or a \n that is not followed by dd/mm/yy,<anything here> -. In simple terms, if a new line starts with a date, it is not captured as part of the message.
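The date-based delimiter idea from the steps above can be sketched in Java like this. It is a variation, not Version 5 verbatim: the message is matched one line at a time (.* followed by optional continuation lines) rather than one character at a time, the time uses a single lazy quantifier, and name and number are kept in one group for brevity. It assumes \n line endings; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiLineChat {
    // Header: "dd/mm/yy, <time> - <sender>: ". The message then continues
    // across every \n that is NOT followed by a new "dd/mm/yy, " header.
    private static final Pattern MESSAGE = Pattern.compile(
        "^(\\d{2}/\\d{2}/\\d{2}),\\s([^-]+?)\\s-\\s([^:]+): ?"
        + "(.*(?:\\n(?!\\d{2}/\\d{2}/\\d{2},\\s).*)*)",
        Pattern.MULTILINE);

    /** Returns one {date, time, sender, message} array per chat message. */
    static List<String[]> parse(String chat) {
        List<String[]> out = new ArrayList<>();
        Matcher m = MESSAGE.matcher(chat);
        while (m.find()) {
            out.add(new String[] {m.group(1), m.group(2), m.group(3), m.group(4)});
        }
        return out;
    }

    public static void main(String[] args) {
        String chat = "02/05/16, 12:05 AM - +91 00580 00000: Hello\n"
            + "second line\n"
            + "02/05/16, 12:06 AM - Ross Clark: Hello there!";
        for (String[] msg : parse(chat)) {
            System.out.println(String.join(" | ", msg));
        }
    }
}
```

The negative lookahead after each \n is the same trick as in step 4: a line that starts with a date ends the current message instead of being appended to it.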

Notes

\d{2}\/\d{2}\/\d{2} allows illegal dates like 99/99/99. It is possible to prevent this, but the solution is a very large regex.
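One way to avoid the large regex, offered here as an alternative technique rather than part of the answer, is to keep the simple pattern and validate the captured date afterwards with java.time. The sketch below assumes day/month/two-digit-year order and years in the 2000s; the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.Optional;

class ChatDate {
    // STRICT rejects impossible dates such as 31/02/16 instead of adjusting
    // them; "uu" is used for the year because STRICT with "yy" needs an era.
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("dd/MM/uu").withResolverStyle(ResolverStyle.STRICT);

    /** Returns the parsed date, or an empty Optional if the text is not a real date. */
    static Optional<LocalDate> parse(String text) {
        try {
            return Optional.of(LocalDate.parse(text, FORMAT));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("02/05/16"));
        System.out.println(parse("99/99/99"));
    }
}
```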

,\s([^-]+)+-\s - This matches the time, assuming it sits between the , and the -. It can be made stricter, depending on the real needs.