Tips and Tricks site for advanced HP-UX Engineers

16 Dec 21 “Subtracting” one file from another

I recently had occasion to refactor a script (not mine) in which there was some convoluted logic to ensure that the contents of file A (an ASCII file) were fully contained within file B (another ASCII file). Lots of looping and grepping were the order of the day. I imagined that there had to be a better way, and first went down the path of using grep with a file as the source of the patterns I was looking for. However, while that would get me partially there, it would not tell me whether File A was wholly contained within File B. I am sure I could have made it work (somehow) using this idea, but again, I thought that there had to be a better way.

With the help of Google (because of course), I stumbled across an HP-UX command (not unique to HP-UX of course) that in my 25 years of scripting on the HP-UX platform, I had never encountered nor used before: the comm command. This command lets one implement set logic on ascii files.

Let’s have a look . . .

From the man page for comm:

 NAME
      comm - select or reject lines common to two sorted files

 SYNOPSIS
      comm [-[123]] file1 file2

 DESCRIPTION
      comm reads file1 and file2, which should be ordered in increasing
      collating sequence (see sort(1) and Environment Variables below), and
      produces a three-column output:

           Column 1:   Lines that appear only in file1,
           Column 2:   Lines that appear only in file2,
           Column 3:   Lines that appear in both files.

      If - is used for file1 or file2, the standard input is used.

      Options 1, 2, or 3 suppress printing of the corresponding column.
      Thus comm -12 prints only the lines common to the two files; comm -23
      prints only lines in the first file but not in the second; comm -123
      does nothing useful.
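
To see the column options in action, here is a minimal sketch using two made-up, already-sorted files. The file contents and variable names are mine, purely for illustration; only the comm options come from the man page above.

```shell
# Hypothetical sample data, already in sorted order
f1=$(mktemp)
f2=$(mktemp)
printf '%s\n' apple banana cherry > "$f1"
printf '%s\n' banana cherry date  > "$f2"

only_in_1=$(comm -23 "$f1" "$f2")   # suppress columns 2 and 3: lines unique to file1
only_in_2=$(comm -13 "$f1" "$f2")   # suppress columns 1 and 3: lines unique to file2
in_both=$(comm -12 "$f1" "$f2")     # suppress columns 1 and 2: the intersection

echo "only in file1: $only_in_1"
echo "only in file2: $only_in_2"
echo "in both:"
echo "$in_both"

rm -f "$f1" "$f2"
```

With this sample data, "apple" is unique to the first file, "date" is unique to the second, and "banana" and "cherry" are common to both.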

So, “comm -12” performs the intersection operation on the files. But “comm -23” does the trick for what I was after: subtracting one file from another, leaving only the lines unique to the first file named. In my use case File B should be a subset of File A, so subtracting File A from File B should leave nothing. I can test for that via:

checkCount=$(comm -23 "$FILE_B_SORTED" "$FILE_A_SORTED" | wc -l)

If the resulting count is “0”, I know that File B is wholly contained within File A. If the count is not “0”, then I can consume what is in File B but not in File A and take appropriate action.

Note that the two ASCII files need to be in sorted order, which is easy enough. Here is an example of all of this put together to accomplish the subtraction (the lines of File B, minus anything that also appears in File A):

FILE_B_SORTED=$(mktemp)
FILE_A_SORTED=$(mktemp)

sort -u "$FILE_B" > "$FILE_B_SORTED"
sort -u "$FILE_A" > "$FILE_A_SORTED"

#
# 'comm -23 file1 file2' says show me all the lines that appear in
# file1 but not in file2.  Thus, here, we are ensuring that there is
# nothing in File B (file1) that is NOT in File A (file2).
#
# We count how many such lines there are - we expect there to be zero.
#

checkCount=$(comm -23 "$FILE_B_SORTED" "$FILE_A_SORTED" | wc -l)
if (( checkCount == 0 )); then
   echo "All entries in File B are in File A (goodness)"
else
   # Do something with the entries in File B that are not in File A
   checkList=$(mktemp)
   comm -23 "$FILE_B_SORTED" "$FILE_A_SORTED" > "$checkList"
   exec 4< "$checkList"
   while read -r entry <&4; do
     # Do whatever needs to be done with each entry.
     # ':' is a no-op placeholder; an empty loop body is a syntax error.
     :
   done
   exec 4<&-            # close the file descriptor
   rm -f "$checkList"
fi

# Clean up the temporary sorted copies
rm -f "$FILE_B_SORTED" "$FILE_A_SORTED"
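
If this check is needed in more than one place, the same idea can be wrapped in a small function. What follows is only a sketch under my own assumptions: the function name subtract_file and the sample host names are hypothetical, not from the original script.

```shell
# Hypothetical helper: print the lines of $1 that do not appear in $2.
# Both inputs are sorted (and de-duplicated) here, so callers may pass
# unsorted files.
subtract_file() {
    _left=$(mktemp)
    _right=$(mktemp)
    sort -u "$1" > "$_left"
    sort -u "$2" > "$_right"
    comm -23 "$_left" "$_right"
    rm -f "$_left" "$_right"
}

# Made-up sample data
FILE_A=$(mktemp)
FILE_B=$(mktemp)
printf '%s\n' web01 db01 app01 app02 > "$FILE_A"
printf '%s\n' db01 app02 cache01     > "$FILE_B"

# Lines in File B that are not in File A
missing=$(subtract_file "$FILE_B" "$FILE_A")
if [ -z "$missing" ]; then
    echo "All entries in File B are in File A"
else
    echo "In File B but not in File A:"
    echo "$missing"
fi

rm -f "$FILE_A" "$FILE_B"
```

One caveat with this variant: testing for an empty string would miss the edge case where a blank line is the only difference between the files. Counting lines with wc -l, as in the example above, does not have that problem.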

All good stuff 🙂
