Dan Dan - 3 months ago 19
C# Question

Search for some phrases in a text file using Regex C#

The task:

Write a program that counts the occurrences of phrases in a text file. Any sequence of characters may be given as a phrase to count, even sequences containing separators. For instance, in the text "I am a student in Sofia", the phrases "s", "stu", "a" and "I am" are found 2, 1, 3 and 1 times respectively.

I know the solutions using string.IndexOf, LINQ, or an algorithm such as Aho-Corasick. I want to do the same thing with Regex.

This is what I've done so far:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text.RegularExpressions;

    namespace CountThePhrasesInATextFile
    {
        class Program
        {
            static void Main(string[] args)
            {
                string input = ReadInput("file.txt");
                input.ToLower();
                List<string> phrases = new List<string>();
                using (StreamReader reader = new StreamReader("words.txt"))
                {
                    string line = reader.ReadLine();
                    while (line != null)
                    {
                        phrases.Add(line.Trim());
                        line = reader.ReadLine();
                    }
                }
                foreach (string phrase in phrases)
                {
                    Regex regex = new Regex(String.Format(".*" + phrase.ToLower() + ".*"));
                    int mathes = regex.Matches(input).Count;
                    Console.WriteLine(phrase + " ----> " + mathes);
                }
            }

            private static string ReadInput(string fileName)
            {
                string output;
                using (StreamReader reader = new StreamReader(fileName))
                {
                    output = reader.ReadToEnd();
                }
                return output;
            }
        }
    }


I know my regular expression is incorrect but I don't know what to change.

The output:

Word ----> 2
S ----> 2
MissingWord ----> 0
DS ----> 2
aa ----> 0


The correct output:

Word --> 9
S --> 13
MissingWord --> 0
DS --> 2
aa --> 3


file.txt contains:

Word? We have few words: first word, second word, third word.
Some passwords: PASSWORD123, @PaSsWoRd!456, AAaA, !PASSWORD


words.txt contains:

Word
S
MissingWord
DS
aa

Answer

You need to post the file.txt contents first; otherwise it's difficult to verify whether the regex is working correctly.

That being said, check out the Regex answer in "Finding ALL positions of a substring in a large string in C#" and see if that helps with your code in the meantime.

edit:

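There's a simple solution: wrap each of your phrases in "(?=(" and "))". This is a lookahead assertion in regex. It matches at a position without consuming any characters, so every position where the phrase starts is counted, including overlapping occurrences. Your original pattern, ".*" + phrase + ".*", consumes the whole line in a single match, which is why it found at most one match per line. The following code handles what you want.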

        foreach (string phrase in phrases) {
            // Regex.Escape keeps characters such as "?" or "." in a phrase literal.
            string matchPhrase = "(?=(" + Regex.Escape(phrase.ToLower()) + "))";
            int matches = Regex.Matches(input, matchPhrase).Count;
            Console.WriteLine(phrase + " ----> " + matches);
        }
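To illustrate why the lookahead matters, here is a minimal sketch comparing a plain pattern with the lookahead version. The class and method names (OverlapDemo, CountPlain, CountWithLookahead) are made up for this example and are not part of the original code.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class OverlapDemo
{
    // Plain pattern: each match consumes the characters it covers,
    // so overlapping occurrences are skipped.
    public static int CountPlain(string text, string phrase)
    {
        return Regex.Matches(text, Regex.Escape(phrase)).Count;
    }

    // Lookahead pattern: each match is zero-width, so the engine
    // tests every position and counts overlapping occurrences too.
    public static int CountWithLookahead(string text, string phrase)
    {
        return Regex.Matches(text, "(?=(" + Regex.Escape(phrase) + "))").Count;
    }
}
```

For the text "aaaa" and the phrase "aa", CountPlain returns 2 while CountWithLookahead returns 3, which is the count the expected output asks for.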

You also had an issue with

    input.ToLower();

which should instead be

    input = input.ToLower();

because strings in C# are immutable: ToLower returns a new string and leaves the original unchanged. In total, your code should be:

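    static void Main(string[] args) {
        string input = ReadInput("file.txt");
        // ToLower returns a new string, so the result must be assigned back.
        input = input.ToLower();
        List<string> phrases = new List<string>();
        using (StreamReader reader = new StreamReader("words.txt")) {
            string line = reader.ReadLine();
            while (line != null) {
                phrases.Add(line.Trim());
                line = reader.ReadLine();
            }
        }
        foreach (string phrase in phrases) {
            // The zero-width lookahead matches at every position where the
            // phrase starts, so overlapping occurrences are counted.
            // Regex.Escape keeps characters such as "?" or "." literal.
            string matchPhrase = "(?=(" + Regex.Escape(phrase.ToLower()) + "))";
            int matches = Regex.Matches(input, matchPhrase).Count;
            Console.WriteLine(phrase + " ----> " + matches);
        }
        // Keeps the console window open for 50 seconds; fully qualified
        // because System.Threading is not among the using directives.
        System.Threading.Thread.Sleep(50000);
    }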

    private static string ReadInput(string fileName) {
        string output;
        using (StreamReader reader = new StreamReader(fileName)) {
            output = reader.ReadToEnd();
        }
        return output;
    }
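
Since the task allows any sequence of characters as a phrase, a phrase may contain regex metacharacters such as "?" or ".". The following is a sketch of one possible variation, not the only way to do it: a helper (the names PhraseCounter and CountPhrase are assumptions for this example) that escapes the phrase with Regex.Escape and uses RegexOptions.IgnoreCase instead of lowercasing both strings.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class PhraseCounter
{
    // Counts case-insensitive, possibly overlapping occurrences of
    // phrase in text, treating the phrase as literal characters.
    public static int CountPhrase(string text, string phrase)
    {
        // An empty lookahead would match at every position, so return 0 early.
        if (string.IsNullOrEmpty(phrase))
        {
            return 0;
        }

        // Regex.Escape turns metacharacters such as "?" into literals.
        string pattern = "(?=" + Regex.Escape(phrase) + ")";
        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase).Count;
    }
}
```

With the file.txt contents from the question, PhraseCounter.CountPhrase(input, "Word") gives 9 and PhraseCounter.CountPhrase(input, "aa") gives 3, matching the expected output. A phrase like "Word?" is counted once, as literal text, rather than being read as "Wor" followed by an optional "d".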