Exagon Exagon - 3 months ago 11
C++ Question

How to cut off parts of a string, which every string in a collection has

My currently problem is the following:
I have a

std::vector
of full path names to files.
Now i want to cut off the common prefix of all string.

Example



If I have these 3 strings in the vector:

/home/user/foo.txt
/home/user/bar.txt
/home/baz.txt


I would like to cut off
/home/
from every string in the vector.

Question



Is there any method to achieve this in general?
I want an algorithm that drops the common prefix of all string.
I currently only have an idea which solves this problem in O(n m) with n strings and m is the longest string length, by just going through every string with every other string char by char.
Is there a faster or more elegant way solving this?

Answer

This can be done entirely with std:: algorithms.

synopsis:

  1. sort the input range if not already sorted. The first and last paths in the sorted range will be the most dissimilar. Best case is O(N), worst case O(N + N.logN)

  2. use std::mismatch to determine the larges common sequence between the two most dissimilar paths [insignificant]

  3. run through each path erasing the first COUNT characters where COUNT is the number of characters in the longest common sequence. O (N)

Best case time complexity: O(2N), worst case O(2N + N.logN) (can someone check that?)

#include <iostream>
#include <algorithm>
#include <string>
#include <vector>

std::string common_substring(const std::string& l, const std::string& r)
{
    return std::string(l.begin(),
                       std::mismatch(l.begin(), l.end(),
                                     r.begin(), r.end()).first);
}

std::string mutating_common_substring(std::vector<std::string>& range)
{
    if (range.empty())
        return std::string();
    else
    {
        if (not std::is_sorted(range.begin(), range.end()))
            std::sort(range.begin(), range.end());
        return common_substring(range.front(), range.back());
    }
}

std::vector<std::string> chop(std::vector<std::string> samples)
{
    auto str = mutating_common_substring(samples);
    for (auto& s : samples)
    {
        s.erase(s.begin(), std::next(s.begin(), str.size()));
    }
    return samples;
}

int main()
{
    std::vector<std::string> samples = {
        "/home/user/foo.txt",
        "/home/user/bar.txt",
        "/home/baz.txt"
    };

    samples = chop(std::move(samples));

    for (auto& s : samples)
    {
        std::cout << s << std::endl;
    }   
}

expected:

baz.txt
user/bar.txt
user/foo.txt

Here's an alternate `common_substring' which does not require a sort. time complexity is in theory O(N) but whether it's faster in practice you'd have to check:

std::string common_substring(const std::vector<std::string>& range)
{
    if (range.empty())
    {
        return {};
    }

    return std::accumulate(std::next(range.begin(), 1), range.end(), range.front(),
                           [](auto const& best, const auto& sample)
                           {
                               return common_substring(best, sample);
                           });
}

update:

Elegance aside, this is probably the fastest way since it avoids any memory allocations, performing all transformations in-place. For most architectures and sample sizes, this will matter more than any other performance consideration.

#include <iostream>
#include <vector>
#include <string>

void reduce_to_common(std::string& best, const std::string& sample)
{
    best.erase(std::mismatch(best.begin(), best.end(),
                             sample.begin(), sample.end()).first,
               best.end());

}

void remove_common_prefix(std::vector<std::string>& range)
{
    if (range.size())
    {
        auto iter = range.begin();
        auto best = *iter;
        for ( ; ++iter != range.end() ; )
        {
            reduce_to_common(best, *iter);
        }

        auto prefix_length = best.size();

        for (auto& s : range)
        {
            s.erase(s.begin(), std::next(s.begin(), prefix_length));
        }
    }
}


int main()
{
    std::vector<std::string> samples = {
        "/home/user/foo.txt",
        "/home/user/bar.txt",
        "/home/baz.txt"
    };

    remove_common_prefix(samples);

    for (auto& s : samples)
    {
        std::cout << s << std::endl;
    }   
}