dmgrep

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Copyright (c) 2015, 2018 Karl Fogel
# 
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
# 
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# General Public License for more details.
# 
# If you did not receive a copy of the GNU General Public License
# along with this program, see <http://www.gnu.org/licenses/>.

"""Print delimited sections of input that match a given regexp.  

Usage:

  $ CMD | dmgrep [-d|--delimiter DELIM_REGEXP] [-i|--ignore-case] REGEXP
  -OR-
  $ dmgrep [...] < FILE

Use the '-v' ('--inverse') flag to invert the sense of the match.

Detailed explanation:

"Delimited" means there is some sort of single-line delimiter
separating sections from each other.  This is common in many output
formats.  For example, 'svn log' puts a line of 72 hyphens between
each revision's log entry, and 'git log' output starts each commit
message with a line matching the regexp "^commit [a-z0-9]{40}$".

You might use this script to, say, find all the 'git log' entries that
match a given regular expression.  The match is performed against the
leading delimiter as well as against the body text between delimiters
(this is to handle cases like 'git log' output, where you might want
to match a pattern against both commit messages themselves and against
the commit IDs appearing in the delimiters -- see an example below).

Wherever there is a match, the entire matching entry is printed to
standard output, including the leading delimiter if there was one
(whether or not the match matched the delimiter itself), but not the
trailing delimiter, whether or not there was one.  If you understood
that sentence on the first reading, this program is probably for you.

You can specify a custom delimiter regexp at run-time, but if you
don't then a default delimiter regexp is used, matching a set of
well-known delimiters that includes at least the two mentioned above.

Examples:

  $ git clone git@github.com:OpenTechStrategies/smoke-alarm-portal.git
  $ cd smoke-alarm-portal
  $ git log --name-status | dmgrep "[Rr]einstate"
  commit 52964c56964cc454b4239d3cada61e0577db897b
  Author: Noel Taylor <example@opentechstrategies.com>
  Date:   Thu Jun 25 15:26:42 2015 -0500
  
      #22 Reinstate a couple of DCSOps style rules
      
      Commit 3ac3645 commented out a link to the DCSOps stlyesheet as a rule
      there was breaking bootstrap grid behavior. This commit reinstates two
      of those rules (for font family and button color) into our own
      stylesheet.
  
  M	public/smokeAlarmPortal_files/base.css
  $
 
  ### Hmm, interesting -- the commit message mentions another
  ### commit, 3ac3645.  Let's see all commit messages that either
  ### are that commit or mention it.
 
  $ git log --name-status | dmgrep "3ac3645"
  commit 52964c56964cc454b4239d3cada61e0577db897b
  Author: Noel Taylor <example@opentechstrategies.com>
  Date:   Thu Jun 25 15:26:42 2015 -0500
  
      #22 Reinstate a couple of DCSOps style rules
      
      Commit 3ac3645 commented out a link to the DCSOps stlyesheet as a rule
      there was breaking bootstrap grid behavior. This commit reinstates two
      of those rules (for font family and button color) into our own
      stylesheet.
  
  M	public/smokeAlarmPortal_files/base.css
  
  commit 3ac3645bac42f598911c9fc3829202a246bb0218
  Author: Noel Taylor <example@opentechstrategies.com>
  Date:   Thu Jun 25 14:52:51 2015 -0500
  
      #22 Errors now more obvious, yet less disruptive
      
      Red asterisks now mark required form fields. If validation script
      returns error messages, these now appear in place of the asterisks and
      the errored fields gain a red outline. I have also commented out the
      link to the DCSOps stylesheet, as one of the rules was breaking the
      behavior of the bootstrap grid especially for extra-small screen sizes
      but in other ways also.
  
  M	public/smokeAlarmPortal_files/base.css
  M	views/index.jade
  M	views/layout.jade
  $

  ### Good -- we now have an overview of that commit and its followups.

Another example, a bit more contrived this time:

  $ cat delimited_file.txt
  This file has some delimited text
  by which I mean it is divided into sections
  divided by some known delimiter.
  ----------
  In this case the delimiter is a line of dashes,
  although it could have been anything.
  By the way, this section contains the word "fish".
  ----------
  The delimiter in this example isn't very interesting.
  ----------
  How odd -- this section also contains the word "fish"!
  What a coincidence.
  ----------
  The leading delimiter is printed, but not the trailing.
  ----------
  EOF is treated as another occurrence of the delimiter.
  By the way -- fish!

  $ dmgrep -d '-{10}' fish
  In this case the delimiter is a line of dashes,
  although it could have been anything.
  By the way, this section contains the word "fish".
  ----------
  How odd -- this section also contains the word "fish"!
  What a coincidence.
  ----------
  EOF is treated as another occurrence of the delimiter.
  By the way -- fish!

Related programs and historical notes:

I scripted this up at a time when I didn't have easy access to my
preferred Internet search engines, thanks to the Great Firewall of
China, and originally called it 'dgrep'.  But later when I could
easily search the Internet again, I found there was already a program
named 'dgrep'.  So next I tried 'sgrep' (for "sectioned grep"), but
not only is that also already taken, the 'sgrep' ("structured grep")
program by Jani Jaakkola and Pekka Kilpeläinen (btw, what is it with
the over-representation of Finns among authors of useful utilities?)
actually does something pretty close to what this script does --
'sgrep' may even be a superset.  In desperation, I resorted next to my
first initial, but 'kgrep' is already used too.  Hence 'dmgrep', a
slightly odd abbreviation for "delimited grep".

After writing this script, I realized it is a generalization of
http://svn.apache.org/viewvc/subversion/trunk/contrib/client-side/search-svnlog.pl?view=co,
which has been around for many years.

Run 'dmgrep --summary' to see a compact summary of usage.
"""

import sys
import re
import getopt


def match_maybe_output(body, regexp, leading_delim, inverse, out):
  """If LEADING_DELIM or BODY match REGEXP, print them to OUT.
BODY does not include any delimiters; REGEXP is a compiled regexp;
LEADING_DELIM is a string; OUT is a file handle (e.g., sys.stdout).
If INVERSE is true, invert the sense of the match."""
  match = regexp.search(body) or regexp.search(leading_delim)
  if (match and not inverse) or (inverse and not match):
    out.write(leading_delim)
    out.write(body)


def main():
  delim_re = None
  inverse = False
  search_re = None
  search_flags = re.MULTILINE | re.DOTALL

  try:
    (opts, args) = getopt.getopt(sys.argv[1:], "ih?d:v", 
                                 ["ignore-case", 
                                  "help",  # same as '--usage'
                                  "usage", 
                                  "summary",
                                  "inverse",
                                  "delimiter="])
  except getopt.GetoptError as err:
    sys.stderr.write(str(err))
    sys.stderr.write("\n")
    sys.exit(1)

  for opt, optarg in opts:
    if opt in ("-d", "--delimiter"):
      delim_re = re.compile(optarg)
    elif opt in ("-h", "-?", "--help", "--usage"):
      print(__doc__)
      sys.exit(0)
    elif opt == "--summary":
      print("dmgrep [-d|--delimiter DELIM_REGEXP] [-i|--ignore-case] " \
            "REGEXP < INPUT")
      print("")
      print("(Run 'dmgrep --usage' for full usage.)")
      sys.exit(0)
    elif opt in ("-i", "--ignore-case"):
      search_flags |= re.IGNORECASE
    elif opt in ("-v", "--inverse"):
      inverse = True
    else:
      print("Unrecognized option '%s'" % opt)
      sys.exit(1)

  if len(args) < 1:
    sys.stderr.write("ERROR: missing REGEXP argument.\n")
    sys.stderr.write("       (Run with '--usage' flag to see documentation.)\n")
    sys.exit(1)
  elif len(args) > 1:
    sys.stderr.write("ERROR: too many arguments -- " \
                     "expected exactly one REGEXP argument.\n")
    sys.stderr.write("       (Run with '--usage' flag to see documentation.)\n")
    sys.exit(1)
    
  if delim_re is None:
    delim_re = re.compile(
        # Revision separator in a series of Subversion log messages.
        "^(-{72}"
        "|"
        # Git commits.
        "commit [a-z0-9]{40})$"
        "|"
        # Output of 'find-dups' (also in 'ots-tools', like 'dmgrep').
        "[a-z0-9]{32} \\([0-9]+ identical files, each [0-9]+ bytes\\):")
    
  search_re = re.compile(args[0], flags=search_flags)

  accum = ''
  latest_leading_delim = ''
  line = sys.stdin.readline()
  if delim_re.match(line): # if file starts with a delimiter, eat it now
    latest_leading_delim = line
    line = sys.stdin.readline()
  while line != '':
    if delim_re.match(line):
      match_maybe_output(accum, search_re, latest_leading_delim, 
                         inverse, sys.stdout)
      accum = ''
      latest_leading_delim = line
    else:
      accum += line
    line = sys.stdin.readline()
  match_maybe_output(accum, search_re, latest_leading_delim, 
                     inverse, sys.stdout)


if __name__ == '__main__':
  main()