cowboy me, 2.0: jose nazario beauty and the street

bugspot in python

i saw this discussion on hackernews, which links to a google blog post about how they try to predict bug hotspots based on past bug-fix commits. pretty simple approach, not all that ineffective, either.
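as i understand the approach, every bug-fixing commit adds a weight to each file it touched, and recent fixes count far more than old ones. the script below uses the weight 1/(1+e^(-12t+12)), where t is the commit time scaled to 0..1 over the history in the log. here is a small python 3 sketch of just that weighting (the name `fix_weight` is mine, purely for illustration):

```python
import math

def fix_weight(t):
    """weight of one bugfix commit; t is its time scaled to [0, 1]
    (0 = oldest commit in the log, 1 = newest)."""
    return 1.0 / (1.0 + math.exp(-12.0 * t + 12.0))

# old fixes barely register, the newest possible fix is worth exactly 0.5
for t in (0.0, 0.5, 0.9, 1.0):
    print('t=%.1f weight=%.6f' % (t, fix_weight(t)))
```

a file's score is then the sum of these weights over all of its fixes, so a file only climbs the list if it keeps getting fixed recently.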

someone on HN had a working ruby tool for git in a few hours, but i don't use either of those (git or ruby). so i wrote a version for subversion in python, which is below. my output format is based on theirs since it made a lot of sense. one thing i think i may do is add support for arbitrary strings for the match conditions (e.g. whatever terms your team uses) and also a max "age" or timeline to start at.

#!/usr/bin/env python

"""
copyright (c) 2011 jose nazario, all rights reserved
license: 2 clause BSD
"""

# see #

def svnlog_parser(input):
    # yield one log entry at a time; each entry starts with its separator line
    assert type(input) is file
    data = []
    while True:
        line = input.readline()
        if line.startswith('-'*30):
            if data:
                yield ''.join(data)
            data = []
        if line == '':
            # EOF
            return
        else:
            data.append(line)

if __name__ == '__main__':
    import math
    import os
    import re
    import sys
    import time

    try:
        print 'Scanning %s' % sys.argv[1]
    except IndexError:
        print >> sys.stderr, 'Usage: %s /path/to/repo' % sys.argv[0]
        sys.exit(1)
    s = svnlog_parser(os.popen('cd %s && svn log -v' % sys.argv[1]))
    message_matchers = [ re.compile(x, re.I) for x in
                         ('fixes', 'fixed', 'closes', r'bug\w?#\d+', ) ]

    hotspots = {}
    messages = []
    times = []
    for m in s:
        paths = []
        lines = m.split('\n')
        i = 0
        for line in lines:
            if line.startswith('-'*20):
                # separator
                i += 1
                continue
            if i == 1:
                # revision | who | timestamp | N lines
                i += 1
                timestamp = ' '.join(line.split(' | ')[2].split()[:2])
                # mktime is the portable way to get epoch seconds
                timestamp = int(time.mktime(time.strptime(timestamp, '%Y-%m-%d %H:%M:%S')))
                times.append(timestamp)
                continue
            if line == 'Changed paths:':
                # blah
                i += 1
                continue
            try:
                # actual files changed, e.g. '   M /path/to/file'
                if line[3] in ('D', 'M', 'A'):
                    i += 1
                    # split on the whitespace run so the action letter is dropped
                    paths.append(line.split(None, 1)[1])
                    continue
            except IndexError:
                pass
        # and everything else is the changelog
        msg = ' '.join(lines[i:])
        for matcher in message_matchers:
            if matcher.findall(msg):
                messages.append(msg)
                for path in paths:
                    path = path.strip()
                    l = hotspots.get(path, [])
                    l.append(timestamp)
                    hotspots[path] = l
                break
    start = min(times)
    end = max(times)

    def score(ts):
        # sum of sigmoid weights; t is the fix time scaled to 0..1
        s = 0
        for t in ts:
            t = (float(t)-start)/(end-start)
            s += 1/(1+(math.e**(-12*t+12)))
        return s

    hotspots = [ (score(y),x) for x,y in hotspots.iteritems() ]
    hotspots.sort()
    hotspots.reverse()
    hotspots = [ (y,x) for x,y in hotspots ]
    hotspots = filter(lambda x: x[1] > 0.001, hotspots)
    print 'Found %d bugfix commits, with %d hotspots' % (len(messages), len(hotspots))
    print
    print 'Fixes:'
    for msg in messages:
        print ' - %s' % msg
    print
    print 'Hotspots:'
    for path, n in hotspots:
        print ' %.3f - %s' % (n, path)

output on the phoneyc trunk looks like this:
Scanning /Users/jose/code/phoneyc/trunk
Found 6 bugfix commits, with 2 hotspots

Fixes:
 - fix quoting issues
 - fix arg length
 - [phoneyc] support for RTSP MPEG4 SP Control ActiveX Control "MP4Prefix" Property Buffer Overflow vuln module, exploit demo
 - [phoneyc] found an exploit for QvodCtrl at SecFocus, add. fix: - add CLSID for QvodCtrl - look for URL and url - XXX case independent handling of methods etc? - proper length check - object instantiation can be done with name, not just id
 - [phoneyc] - handle the redirect stuff as an href - fix up URLs that lack a needed trailing '/'
 - import order fixup fix sgmllib exception namespace

Hotspots:
 0.463 - /phoneyc/trunk/
 0.099 - /phoneyc/trunk/modules/jscript/NCTAudioFile2.js
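for anyone porting this, the per-entry parsing is easier to follow on a single entry. below is a rough python 3 sketch of that step, not the script itself: `parse_entry`, `PATH_LINE` and the sample entry (revision, paths, message) are all made up for illustration, and it assumes the usual `svn log -v` layout of a header line, a "Changed paths:" block, a blank line and then the message.

```python
import re
from datetime import datetime

# made-up entry in the shape `svn log -v` prints, separator lines removed
SAMPLE = """r42 | jose | 2011-03-01 10:15:00 -0500 (Tue, 01 Mar 2011) | 1 line
Changed paths:
   M /phoneyc/trunk/example.py
   A /phoneyc/trunk/modules/example.js

fix quoting issues
"""

# three spaces, an action letter, one space, then the path
PATH_LINE = re.compile(r'^   ([ADMR]) (.+)$')

def parse_entry(text):
    """split one log entry into (timestamp, paths, message)."""
    lines = text.splitlines()
    # header: revision | who | timestamp | N lines
    stamp = ' '.join(lines[0].split(' | ')[2].split()[:2])
    when = datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S')
    paths = []
    i = 1
    if i < len(lines) and lines[i] == 'Changed paths:':
        i += 1
        while i < len(lines) and PATH_LINE.match(lines[i]):
            paths.append(PATH_LINE.match(lines[i]).group(2))
            i += 1
    # whatever is left is the commit message, flattened to one line
    message = ' '.join(l for l in lines[i:] if l.strip())
    return when, paths, message

when, paths, message = parse_entry(SAMPLE)
print(when, paths, message)
```

anchoring the path pattern to the "Changed paths:" block means a message line whose fourth character happens to be D, M or A is not mistaken for a changed file.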

i'll be testing it on larger codebases soon; so far i've only developed it against the svn repo of phoneyc.
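the two additions i mentioned earlier (team-specific match terms and a max age) might look something like this. it's a python 3 sketch only: `build_matchers`, `is_bugfix`, `recent_enough` and the example terms are hypothetical, none of them are in the script yet.

```python
import re
import time

def build_matchers(terms):
    """compile case-insensitive matchers from plain strings;
    re.escape treats each term literally rather than as a regex."""
    return [re.compile(re.escape(t), re.I) for t in terms]

def is_bugfix(msg, matchers):
    """True if any of the matchers hits the commit message."""
    return any(m.search(msg) for m in matchers)

def recent_enough(timestamp, max_age_days, now=None):
    """True if a commit (unix timestamp) is within max_age_days of now."""
    now = time.time() if now is None else now
    return (now - timestamp) <= max_age_days * 86400

matchers = build_matchers(['fix', 'hotfix', 'regression'])
print(is_bugfix('Fix quoting issues', matchers))
print(is_bugfix('add new module', matchers))
print(recent_enough(1000, 1, now=1000 + 3600))
```

commits that fail `recent_enough` would simply be skipped before scoring, which also moves the start of the timeline forward.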

if you use it, please let me know how you find it. i'll happily accept patches, too.




Last modified: Saturday, Dec 17, 2011 @ 08:42am

copyright © 2002-2005 jose nazario, all rights reserved.