I recently had occasion to refactor a script (not mine) that used some convoluted logic to ensure that the contents of one ASCII file (File B) were fully contained within another ASCII file (File A). Lots of looping and grepping were the order of the day. I imagined that there had to be a better way, and first went down the path of using grep with a file as the source of the patterns I was looking for. However, while that would get me partially there, it would not tell me whether File B was wholly contained within File A. I am sure I could have made it work (somehow) using this idea, but again, I thought that there had to be a better way.
With the help of Google (because of course), I stumbled across an HP-UX command (not unique to HP-UX of course) that, in my 25 years of scripting on the HP-UX platform, I had never encountered or used before: the comm command. This command lets one implement set logic on sorted ASCII files.
Let’s have a look . . .
From the man page for comm:
NAME
     comm - select or reject lines common to two sorted files

SYNOPSIS
     comm [-[123]] file1 file2

DESCRIPTION
     comm reads file1 and file2, which should be ordered in increasing
     collating sequence (see sort(1) and Environment Variables below), and
     produces a three-column output:

          Column 1: Lines that appear only in file1,
          Column 2: Lines that appear only in file2,
          Column 3: Lines that appear in both files.

     If - is used for file1 or file2, the standard input is used.

     Options 1, 2, or 3 suppress printing of the corresponding column.
     Thus comm -12 prints only the lines common to the two files; comm -23
     prints only lines in the first file but not in the second; comm -123
     does nothing useful.
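To make the three columns concrete, here is a small sketch using two made-up files. The file names and their contents are hypothetical, and it assumes a mktemp that supports -d (true of the GNU and BSD versions; check your platform). With this data, plain comm should print apple in column 1, date in column 2 (indented by one tab), and banana and cherry in column 3 (indented by two tabs).

```shell
# Hypothetical sample data; comm expects both inputs to be sorted already.
workdir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$workdir/file1"
printf '%s\n' banana cherry date > "$workdir/file2"

# All three columns at once.
comm "$workdir/file1" "$workdir/file2"

# Suppressing columns turns comm into a set operator.
only_in_file1=$(comm -23 "$workdir/file1" "$workdir/file2")   # file1 minus file2
only_in_file2=$(comm -13 "$workdir/file1" "$workdir/file2")   # file2 minus file1
in_both=$(comm -12 "$workdir/file1" "$workdir/file2")         # intersection

printf 'only in file1:\n%s\n' "$only_in_file1"
printf 'only in file2:\n%s\n' "$only_in_file2"
printf 'in both:\n%s\n' "$in_both"

rm -rf "$workdir"
```

The order of the two file arguments matters for -23 and -13: swapping the files swaps which side of the difference you get.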
So, “comm -12” performs the intersection operation on the files. But “comm -23” should do the trick for what I was after: subtracting File A from File B, that is, listing the lines of File B that do not appear in File A. In my use case File B should be a subset of File A, so that difference should be empty. I can test for that via:
checkCount=$(comm -23 "$FILE_B_SORTED" "$FILE_A_SORTED" | wc -l)
If the resulting count is “0”, I know that File B is wholly contained within File A. If the count is not “0”, then I can consume what is in File B but not in File A and take appropriate action.
Note that the two ASCII files need to be in sorted order, which is easy enough to arrange. Here is an example of all of this put together to accomplish “subtracting File A from File B”:
FILE_B_SORTED=$(mktemp)
FILE_A_SORTED=$(mktemp)
sort -u "$FILE_B" > "$FILE_B_SORTED"
sort -u "$FILE_A" > "$FILE_A_SORTED"
#
# 'comm -23 file1 file2' says show me all the lines that appear in
# file1 but not in file2. Thus, here, we are ensuring that there is
# nothing in File B that is NOT in File A.
#
# We count how many such lines there are - we expect there to be zero.
#
checkCount=$(comm -23 "$FILE_B_SORTED" "$FILE_A_SORTED" | wc -l)
if (( checkCount == 0 )); then
    echo "All entries in File B are in File A (goodness)"
else
    # Do something with the entries in File B that are not in File A
    checkList=$(mktemp)
    comm -23 "$FILE_B_SORTED" "$FILE_A_SORTED" > "$checkList"
    exec 4< "$checkList"
    while read -r entry <&4; do
        # Do whatever it is that needs to be done. The ':' is a no-op
        # placeholder, because a loop body cannot be empty.
        :
    done
    exec 4<&-    # close file descriptor 4
    rm -f "$checkList"
fi
# Clean up the temporary sorted copies
rm -f "$FILE_B_SORTED" "$FILE_A_SORTED"
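If this check is needed in more than one place, the same idea can be wrapped in a function. The following is only a sketch: the name is_subset, the demo files, and the choice to pin LC_ALL=C are my assumptions, not something taken from the original script. Pinning the locale is one way to make sure sort and comm agree on the collating sequence the man page mentions. The function runs comm once, saves its output, and uses the exit status to report the result.

```shell
# Hypothetical helper: returns 0 when every line of $1 also appears in $2;
# otherwise prints the lines missing from $2 and returns 1.
# Note: POSIX sh has no local variables, so these names are global.
is_subset() {
    sub_sorted=$(mktemp) && super_sorted=$(mktemp) && missing_file=$(mktemp) || return 2

    # Sort both inputs under the same collation that comm will use.
    LC_ALL=C sort -u "$1" > "$sub_sorted"
    LC_ALL=C sort -u "$2" > "$super_sorted"

    # Lines in $1 that are not in $2.
    LC_ALL=C comm -23 "$sub_sorted" "$super_sorted" > "$missing_file"

    subset_status=0
    if [ -s "$missing_file" ]; then      # non-empty file: something is missing
        cat "$missing_file"
        subset_status=1
    fi
    rm -f "$sub_sorted" "$super_sorted" "$missing_file"
    return "$subset_status"
}

# Usage with made-up data.
demo_dir=$(mktemp -d)
printf '%s\n' red green blue > "$demo_dir/file_a"
printf '%s\n' blue red       > "$demo_dir/file_b"
printf '%s\n' red purple     > "$demo_dir/file_c"

if is_subset "$demo_dir/file_b" "$demo_dir/file_a"; then
    result_b="subset"
else
    result_b="not a subset"
fi

status_c=0
missing_c=$(is_subset "$demo_dir/file_c" "$demo_dir/file_a") || status_c=$?

echo "file_b: $result_b"
echo "file_c: status $status_c, missing: $missing_c"
rm -rf "$demo_dir"
```

Testing the saved output with -s rather than counting lines avoids the second comm run and the extra wc process; whether that matters depends on how large the files are.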
All good stuff 🙂
Tags: "set logic on files", "subtracting one file from another", hp-ux script