Stylometry


New Scientist posted a writeup about a new paper on stylometry by researchers Michael Brennan and Rachel Greenstadt at Drexel University. It was an interesting read. They missed a recent publication from last year’s DFRWS conference dealing with authorship identification from anonymous emails and similar sources. The former refers to the concept as “Stylometry” whereas the latter uses the term “Write Print.” The basic idea is that an individual’s writing will have a consistent style, sufficiently distinct from writing at large to link back to the author. In general, various facets of the writing (word choice, tense, grammatical constructs, etc.) are parsed and reduced to statistical information that is fed into various AI classification techniques; thereafter, a classifier trained on an individual’s known writing can be used to classify unknown writing samples as belonging to a given individual with some probability.
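To make the "reduce to statistics, then classify" idea concrete, here is a minimal sketch in Python. It is not the method of either paper: the function-word list, the scaled mean word length, and the cosine-similarity ranking are all assumptions I picked for illustration, and real studies use far richer feature sets and proper classifiers.

```python
import math
import re
from collections import Counter

# Illustrative function-word list (an assumption, not taken from either paper).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in",
                  "that", "is", "it", "but", "however", "which"]

def style_vector(text):
    """Reduce a text to a small vector of style statistics."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1
    counts = Counter(words)
    # Relative frequency of each function word.
    vec = [counts[w] / total for w in FUNCTION_WORDS]
    # Mean word length, scaled down so it does not swamp the frequencies.
    vec.append(sum(len(w) for w in words) / total / 10.0)
    return vec

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def attribute(unknown, known_samples):
    """Rank candidate authors by similarity to the unknown text, best first."""
    target = style_vector(unknown)
    scores = {author: cosine(target, style_vector(text))
              for author, text in known_samples.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `attribute(unknown_text, {"alice": alice_text, "bob": bob_text})` returns the candidates ordered by similarity. The score is only a similarity, not a calibrated probability.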

I attended DFRWS 2008 and was interested in the “Write Print” paper insofar as it took a similar approach to some work I did in my M.S. thesis. My work focused on reconstructing fragmented files based on content analysis using Support Vector Machines (SVMs), but used similar statistical metrics for comparison. I even used WordNet for part of the research, hoping synonyms would provide significant classification power in identifying threads of meaning in a document. Part of my interest in these types of “Stylometry” studies stems from the weakness in classification I observed in my research. In fairness, there is a significant difference between author identification and what I was doing with document reconstruction, but what I observed was that textual artifacts and anomalies were far more powerful for classification than normal word usage: specifically things like misspellings, proper names, acronyms, and specific technical terms, which would not be present in a general dictionary or in general usage. I find research in the authorship identification sphere interesting because techniques with good results could potentially be back-ported for use in document reconstruction. (Document reconstruction from fragments is a hideously difficult problem in current digital forensics, and is mostly a hands-on process where it is possible at all.)
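The point about artifacts can be sketched roughly as follows. This is not the SVM pipeline from my thesis, just a toy illustration: `GENERAL_DICTIONARY` is a hypothetical stand-in for a real general-usage word list, and the Jaccard overlap is an assumed scoring choice. The idea is to keep only the tokens a general dictionary does not know and compare fragments on those.

```python
import re

# Toy stand-in for a general-usage dictionary (hypothetical; a real list
# would contain tens of thousands of words).
GENERAL_DICTIONARY = {"the", "report", "on", "errors", "in", "path"}

def artifacts(fragment, dictionary=GENERAL_DICTIONARY):
    """Return tokens not found in the dictionary: misspellings, names, acronyms, jargon."""
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9]*", fragment)
    # Original casing is kept so acronyms such as "DFRWS" stay distinctive.
    return {t for t in tokens if t.lower() not in dictionary}

def artifact_overlap(fragment_a, fragment_b, dictionary=GENERAL_DICTIONARY):
    """Jaccard overlap of the out-of-dictionary tokens in two fragments."""
    a = artifacts(fragment_a, dictionary)
    b = artifacts(fragment_b, dictionary)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Two fragments that share a misspelling and an acronym score high even when their ordinary vocabulary differs, while a fragment made only of dictionary words has nothing to match on and scores zero.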

The Drexel study confirms something I have long felt was a nagging problem in this area: most authorship identification concepts rely on the individual being either unaware of their consistent style or unable to alter it. The DFRWS paper drew a parallel to fingerprints, so continuing that metaphor, imagine if we could alter our fingerprints at will through some sci-fi style ability. If a criminal did not know his fingerprints could later be used to identify him, he would have no reason to alter them; but if he did know, clearly he would, and better still, he would match them to someone else’s in order to frame that person. In more practical terms, this obfuscation already happens to a lesser extent, only with gloves instead of sci-fi abilities.

I have another qualm, which their paper does not address. They specifically sanitize their sample set to eliminate what I will call informal writing. To quote them:

“First, each author had to submit approximately 5000 words of pre-existing sample writing. Each writing sample had to be from some sort of formal source, such as essays for school, reports for work, and other professional and academic correspondence. This was intended to eliminate slang and abbreviations, instead concentrating on consistent, formal writing style of everyone involved. This also helped to limit possible errors that are not a result of the malicious attack attempts but nonetheless could have an effect on the accuracy of the authorship attribution.”

I brought a similar concern up during the DFRWS Q&A for the “Write Print” paper. Specifically, I wanted to know if they differentiated between email correspondence written from a normal email client and email written from a mobile phone. The idea behind the question: people’s writing style changes depending on the ease with which they can write. Emails from a desk are often much longer than those pecked out on a phone, purely as a practical matter, and phone emails tend to use more abbreviations and shorthand. Separate classifiers for each type of writing may therefore be more effective at identification, and it is almost certain that not separating the different sources introduces noise into the learning process.
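What I had in mind with separate classifiers looks roughly like the sketch below. It is an illustration under assumptions, not anything from the “Write Print” paper: the source labels (`"desktop"`, `"mobile"`), the abbreviation list, the two toy features, and the nearest-profile rule are all hypothetical choices of mine.

```python
from collections import defaultdict

# Illustrative abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"u", "thx", "pls", "btw", "r", "msg"}

def features(message):
    """Two toy features: scaled message length and abbreviation rate."""
    words = [w.strip(".,!?") for w in message.lower().split()]
    n = len(words) or 1
    abbrev_rate = sum(w in ABBREVIATIONS for w in words) / n
    return (len(words) / 50.0, abbrev_rate)

def build_profiles(samples):
    """samples: iterable of (author, source, message) tuples.

    Returns {source: {author: mean feature vector}}, so each source
    (e.g. desktop client vs. phone) gets its own set of author profiles.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for author, source, message in samples:
        grouped[source][author].append(features(message))
    return {
        source: {
            author: tuple(sum(col) / len(col) for col in zip(*vectors))
            for author, vectors in by_author.items()
        }
        for source, by_author in grouped.items()
    }

def classify(message, source, profiles):
    """Pick the nearest author profile among those trained on the same source."""
    x = features(message)
    candidates = profiles[source]
    return min(
        candidates,
        key=lambda author: sum((p - q) ** 2 for p, q in zip(candidates[author], x)),
    )
```

The design choice is simply that a phone message is only ever compared against profiles built from phone messages, so an author's terse mobile habits never dilute their desktop profile, and vice versa.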

The paper does an excellent job of framing this area in a “security through obscurity” light, and is an interesting read.

Posted in AI, Computer Science