tonfagun - 14 days ago
Linux Question

How to determine if the content of one file is included in the content of another file

First, my apologies for what is perhaps a rather stupid question that doesn't quite belong here.

Here's my problem: I have two large text files, call them A and B, each containing many file names. I want to determine whether A is a subset of B, disregarding order: every file name in A must also appear in B; otherwise A is not a subset.

I know how to preprocess the files (stripping everything but the file name itself and normalizing capitalization), but I wonder whether there is a simple way to perform the check with a shell command.

diff probably doesn't work, right? Even if I sort both files first, so that the names present in both appear in the same order, A is probably a proper subset of B, so I expect diff to just tell me that every line is different.

Again, my apologies if the question doesn't belong here. If there is no easy way to do it, I will just write a small program to do the job, but since I'm trying to get a better handle on shell commands, I thought I'd ask here first.

Answer

Do this:

sort -u b | wc -l
sort -u a b | wc -l

If both commands print the same number, a is a subset of b: adding the lines of a to b introduces no new unique lines.
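The count comparison tells you whether a is a subset of b, but not which names are missing when it isn't. As a minimal sketch of one alternative, assuming the files are already preprocessed to one name per line, grep can compare whole lines as fixed strings. The function names missing_lines and is_subset are made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helpers, not from the original answer.
# Assumes both files hold one file name per line, already normalized.

# missing_lines A B: print every line of A that does not occur in B.
#   -F  treat patterns as fixed strings, not regular expressions
#   -x  a pattern must match the whole line
#   -v  select the lines that do NOT match
#   -f  read the patterns (one per line) from file B
missing_lines() {
    grep -Fxv -f "$2" "$1"
}

# is_subset A B: succeed (exit 0) only if every line of A occurs in B.
# grep exits 0 if it selected a line (something is missing from B),
# 1 if it selected nothing (A is a subset), and >1 on an error.
is_subset() {
    status=0
    grep -Fxvq -f "$2" "$1" || status=$?
    [ "$status" -eq 1 ]
}
```

Usage would look like `is_subset a b && echo "a is a subset of b"`, with `missing_lines a b` listing the offending names otherwise. No sorting is needed for this approach. If both files are already sorted, `comm -23 a b` should print the same set of lines.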
