Handle input exceptions.
Yesterday I ran mailaprop.py --debug on every individual piece of email in my corpus (around 1.07 million messages; it took about 15 hours). 130 of the messages cause mailaprop to throw an exception, and those exceptions seem to fall into these categories:
- AttributeError: 'str' object has no attribute 'token_type' (from absorb_headers()). This was a lot of them.
- UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf8' in position 64: surrogates not allowed (also from absorb_headers()). This was even more of them.
- email.errors.HeaderParseError: expected atom at a start of dot-atom-text but found '@@MESSAGE_ID>' leading to IndexError: list index out of range (also in absorb_headers()). This was only a few of them.
- email.errors.HeaderParseError: expected ':' at end of group display name but found '@GMAIL.COM, ... (followed by a long list of email addresses) ... leading to TypeError: 'ValueTerminal' object does not support item assignment. This one appears to be because a spam message with a lot of recipients has these non-addresses in the list of recipient addresses: "aol.com@hotmail..., 20@aol.., .TXT@GMAIL.COM,". This exception occurred exactly once. Go figure.
- ValueError: hour must be in 0..23 (also in absorb_headers()). This was a fair number of them, maybe around 20.
So all I have to do is handle all of these exception cases and my mailaprop runs can work again :-).
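One possible shape for that handling, as a minimal sketch rather than mailaprop's actual code: wrap the per-message header step in a guard that catches exactly the exception types listed above, records the offending message, and moves on. The names `KNOWN_INPUT_ERRORS` and `absorb_safely` are hypothetical, and `absorb` stands in for whatever callable does the header parsing (absorb_headers() or a wrapper around it).

```python
import email.errors

# Exception types seen in the corpus run described above.
# (UnicodeEncodeError is already a ValueError subclass; it is listed
# separately only to mirror the categories in the report.)
KNOWN_INPUT_ERRORS = (
    AttributeError,
    UnicodeEncodeError,
    email.errors.HeaderParseError,
    IndexError,
    TypeError,
    ValueError,
)


def absorb_safely(absorb, message, skipped):
    """Call absorb(message); on a known input error, record it and move on.

    `absorb` is a hypothetical stand-in for the header-parsing step.
    `skipped` is a list that collects (message, exception) pairs so the
    bad inputs can be inspected later instead of aborting the whole run.
    Returns True if the message was absorbed, False if it was skipped.
    """
    try:
        absorb(message)
    except KNOWN_INPUT_ERRORS as err:
        skipped.append((message, err))
        return False
    return True
```

Catching this broad a set around a large function can also hide genuine bugs, so it may be better to narrow each `except` to the specific header access that fails once the individual causes are understood.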
To save time when reproducing, here are the messages that generated exceptions:
creative-commons/1.gz
cvsbook/60.gz
emacs/devel/118034.gz
emacs/diffs/17114.gz
emacs/diffs/17115.gz
emacs/diffs/17116.gz
emacs/diffs/17117.gz
emacs/diffs/17276.gz
emacs/diffs/17277.gz
emacs/diffs/17317.gz
emacs/diffs/17320.gz
emacs/diffs/17321.gz
emacs/diffs/17322.gz
emacs/diffs/17323.gz
emacs/diffs/17592.gz
emacs/diffs/17634.gz
emacs/diffs/17767.gz
emacs/diffs/17849.gz
emacs/diffs/17892.gz
emacs/diffs/18195.gz
emacs/diffs/18287.gz
emacs/diffs/18290.gz
emacs/diffs/18297.gz
emacs/diffs/18298.gz
emacs/diffs/18299.gz
emacs/diffs/18300.gz
emacs/diffs/18301.gz
emacs/diffs/18321.gz
emacs/diffs/18322.gz
emacs/diffs/18323.gz
emacs/diffs/18325.gz
emacs/diffs/18348.gz
emacs/diffs/18349.gz
emacs/diffs/18350.gz
emacs/diffs/18365.gz
emacs/diffs/18408.gz
emacs/diffs/18576.gz
emacs/diffs/18635.gz
emacs/diffs/19610.gz
emacs/diffs/19637.gz
emacs/diffs/19638.gz
emacs/diffs/19639.gz
emacs/diffs/19640.gz
emacs/diffs/19836.gz
emacs/diffs/20039.gz
emacs/diffs/20345.gz
emacs/diffs/20500.gz
emacs/diffs/20517.gz
emacs/diffs/20520.gz
emacs/diffs/20712.gz
emacs/diffs/20767.gz
emacs/diffs/20774.gz
emacs/diffs/20846.gz
emacs/diffs/20850.gz
emacs/diffs/20894.gz
emacs/diffs/20895.gz
emacs/diffs/20923.gz
emacs/diffs/20924.gz
emacs/diffs/20925.gz
emacs/diffs/20926.gz
emacs/diffs/20927.gz
emacs/diffs/20930.gz
emacs/diffs/20935.gz
emacs/diffs/20973.gz
emacs/diffs/20998.gz
emacs/diffs/21027.gz
emacs/diffs/21031.gz
emacs/diffs/21046.gz
emacs/diffs/21047.gz
emacs/diffs/21048.gz
emacs/diffs/21083.gz
emacs/diffs/21090.gz
emacs/diffs/21096.gz
emacs/diffs/21114.gz
emacs/diffs/21115.gz
emacs/diffs/21125.gz
emacs/diffs/21137.gz
emacs/diffs/21366.gz
emacs/diffs/21377.gz
emacs/diffs/21391.gz
emacs/diffs/21392.gz
emacs/diffs/21405.gz
emacs/diffs/21410.gz
emacs/diffs/21411.gz
emacs/diffs/21412.gz
emacs/diffs/21414.gz
golosa/3030.gz
golosa/3033.gz
golosa/3034.gz
misc.archive/136907.gz
misc.archive/141329.gz
misc.archive/141588.gz
misc.archive/151614.gz
misc.archive/151643.gz
misc.archive/151657.gz
misc.archive/151700.gz
misc.archive/151885.gz
misc.archive/151947.gz
misc.archive/151985.gz
misc.archive/152001.gz
misc.archive/152046.gz
misc.archive/152120.gz
misc.archive/152144.gz
misc.archive/152195.gz
misc.archive/152241.gz
misc.archive/152319.gz
misc.archive/152387.gz
misc.archive/186962.gz
misc.archive/187959.gz
misc.archive/191634.gz
misc.archive/194040.gz
misc.archive/200424.gz
misc.archive/200775.gz
misc.archive/201167.gz
misc.archive/202794.gz
misc.archive/204157.gz
misc.archive/204436.gz
misc.archive/206019.gz
misc.archive/211058.gz
misc.archive/230068.gz
osi/press/970.gz
qco/15568.gz
qco/15869.gz
qco/15944.gz
qco/16355.gz
qco/16753.gz
spam/46941.gz
subversion/dev/25186.gz
subversion/dev/27388.gz
subversion/private/6511.gz
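
To re-check just these messages without another 15-hour run, something like the following sketch could tally exceptions by type. It assumes each listed path is a single gzip-compressed message relative to some corpus root; `tally_exceptions`, `corpus_root`, and `process` are hypothetical names, with `process` standing in for whatever mailaprop entry point takes the raw message bytes.

```python
import gzip
from collections import Counter
from pathlib import Path


def tally_exceptions(corpus_root, relative_paths, process):
    """Run process(raw_bytes) on each listed message; count failures by type.

    Returns (counts, failures): a Counter keyed by exception class name,
    and a list of (relative_path, exception) pairs in input order.
    """
    counts = Counter()
    failures = []
    for rel in relative_paths:
        # Assumption: each file holds one gzip-compressed message.
        with gzip.open(Path(corpus_root) / rel, "rb") as fh:
            raw = fh.read()
        try:
            process(raw)
        except Exception as err:  # deliberately broad: this is a survey
            counts[type(err).__name__] += 1
            failures.append((rel, err))
    return counts, failures
```

Once the handling is in place, running this over the list above should come back with an empty Counter.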
Edited by Karl Fogel