computational story lab team members

computational story lab alumni

former junior group members have gone on to some of the best PhD programs in the country, including MIT, Cornell, Dartmouth, U of Washington, Northeastern, and CU Boulder.

our students have held internships at LinkedIn, Sandia, Woods Hole, and MITRE, and include current members of the Facebook, Apple, Adobe, and MassMutual data science teams. our PhD alums have held postdocs and lectureships at UC Berkeley and UVM, and faculty positions at Drexel.

kameron harris, uvm ms*

fulbright scholar

u of washington postdoc

isabel kloumann*

uvm undergrad goldwater scholar,

cornell phd

facebook data scientist

ross lieb-lappen, uvm masters

shadowing in the atmosphere

dartmouth phd

cold regions research laboratory

dan brown, masters

self-organization in traffic

actuarial director,

prudential financial

el hassan ridouane, uvm postdoc&

chaotic convection in a toy climate

senior research scientist,

united technologies research center

kate morrow, undergrad*

dynamic structure of networks

engineering manager, new relic

phd advisors

eugenia kalnay

atmospheric and oceanic sciences

university of maryland, college park

james a yorke

institute of physical science and technology

university of maryland, college park


eric clark, phd in complex systems & data science, 2018

title: applications in sentiment analysis & machine learning for identifying public health variables across social media

abstract: Twitter, a popular social media outlet, has evolved into a vast source of linguistic data, rich with opinion, sentiment, and discussion. We mined data from several public Twitter endpoints to identify content relevant to healthcare providers and public health regulatory professionals. We began by compiling content related to electronic nicotine delivery systems (or e-cigarettes), as these had become popular alternatives to tobacco products. There was an apparent need to remove high-frequency tweeting entities, called bots, that would spam messages, advertisements, and fabricated testimonials. Algorithms were constructed using natural language processing and machine learning to sift human responses from automated accounts with high degrees of accuracy. We found the average hyperlinks per tweet, the average character dissimilarity between each individual's tweets, and the rate of introduction of unique words to be valuable attributes in identifying automated accounts. We performed 10-fold cross-validation and measured the performance of each set of tweet features at various bin sizes, the best of which performed with 97% accuracy. These methods were used to isolate automated content related to the advertising of electronic cigarettes. We categorized a rich taxonomy of automated entities, including robots, cyborgs, and spammers, each with distinct measurable linguistic features.
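As a rough illustration of the tweet features described above, here is a minimal sketch (accounts and thresholds invented for illustration; the dissertation used machine learning with 10-fold cross-validation, not a fixed threshold rule) showing how hyperlinks per tweet and the unique-word introduction rate can separate a link-heavy, repetitive account from an organic one:

```python
# Toy bot-detection features: average hyperlinks per tweet and the rate at
# which an account introduces previously unseen words. All data invented.

def account_features(tweets):
    """Return (avg hyperlinks per tweet, unique-word introduction rate)."""
    seen, new_words, total_words, links = set(), 0, 0, 0
    for tweet in tweets:
        for token in tweet.lower().split():
            if token.startswith("http"):
                links += 1
                continue
            total_words += 1
            if token not in seen:
                seen.add(token)
                new_words += 1
    return links / len(tweets), new_words / max(total_words, 1)

def looks_automated(tweets, link_thresh=0.8, novelty_thresh=0.4):
    """Flag accounts with a high link rate and a repetitive vocabulary."""
    link_rate, novelty = account_features(tweets)
    return link_rate >= link_thresh and novelty <= novelty_thresh

spam_bot = ["buy ejuice now http://x.co", "buy ejuice now http://y.co",
            "buy ejuice now http://z.co"]
human = ["just saw a great movie", "thinking about dinner plans",
         "my cat knocked over the plant again"]
print(looks_automated(spam_bot), looks_automated(human))
```

A real classifier would learn the decision boundary from labeled accounts rather than hard-code it, but the features play the same role.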

Electronic cigarette related posts were classified as automated or organic, and their content was investigated with a hedonometric sentiment analysis. The overwhelming majority (≈ 80%) were automated, many of which were commercial in nature. Others used false testimonials sent directly to individuals as a personalized form of targeted marketing. Many tweets advertised nicotine vaporizer fluid (or e-liquid) in various "kid-friendly" flavors, including 'Fudge Brownie', 'Hot Chocolate', and 'Circus Cotton Candy', along with every imaginable flavor of fruit, flavors long ago banned for traditional tobacco products. Others offered free trials, as well as incentives to retweet and spread the post through one's own network. Free prize giveaways were also hosted, with raffle entries issued for sharing the tweet. Given the large youth presence on the platform, this was evidence that the marketing of electronic cigarettes needed considerable regulation. Twitter has since officially banned all electronic cigarette advertising on its platform.

Social media has the capacity to provide the healthcare industry with valuable feedback from patients who reveal and express their medical decision-making process, as well as self-reported quality of life indicators both during and after treatment. We have studied several active cancer patient populations discussing their experiences with the disease as well as survivorship. We experimented with a Convolutional Neural Network (CNN) as well as logistic regression to classify tweets as patient-related. This led to a sample of 845 breast cancer survivor accounts to study over 16 months. We found positive sentiments regarding patient treatment, raising support, and spreading awareness. A large portion of negative sentiments concerned political legislation that could result in loss of healthcare coverage. We refer to these online public testimonies as "Invisible Patient Reported Outcomes" (iPROs), because they carry relevant indicators yet are difficult to capture by conventional means of self-reporting. Our methods can be readily applied across disciplines to obtain insights into a particular group's public opinions. Capturing iPROs and public sentiments from online communication can help inform healthcare professionals and regulators, leading to more connected and personalized treatment regimens. Social listening can provide valuable insights into public health surveillance strategies.

andy reagan, phd in mathematical sciences, 2017

title: towards a science of human stories: using sentiment analysis and emotional arcs to understand the building blocks of complex social systems

abstract: We can leverage data and complex systems science to better understand society and human nature on a population scale through language — utilizing tools that include sentiment analysis, machine learning, and data visualization. Data-driven science and the sociotechnical systems that we use every day are enabling a transformation from hypothesis-driven, reductionist methodology to complex systems sciences. Namely, the emergence and global adoption of social media has rendered possible the real-time estimation of population-scale sentiment, with profound implications for our understanding of human behavior. Advances in computing power, natural language processing, and digitization of text now make it possible to study a culture’s evolution through its texts using a “big data” lens.

Given the growing assortment of sentiment measuring instruments, it is imperative to understand which aspects of sentiment dictionaries contribute to both their classification accuracy and their ability to provide richer understanding of texts. Here, we perform detailed, quantitative tests and qualitative assessments of 6 dictionary-based methods applied to 4 different corpora, and briefly examine a further 20 methods. We show that while inappropriate for sentences, dictionary-based methods are generally robust in their classification accuracy for longer texts. Most importantly, they can aid understanding of texts with reliable and meaningful word shift graphs if (1) the dictionary covers a sufficiently large portion of a given text's lexicon when weighted by word usage frequency; and (2) words are scored on a continuous scale.
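The coverage and continuous-scale criteria can be sketched with a toy dictionary-based scorer. The word scores below are invented stand-ins on a 1-9 happiness scale, not the actual labMT values used by the hedonometer:

```python
from collections import Counter

# Tiny stand-in for a sentiment dictionary on a 1-9 happiness scale
# (scores invented; the real labMT dictionary has ~10,000 words).
happiness = {"love": 8.4, "happy": 8.3, "good": 7.9,
             "bad": 3.2, "hate": 2.3, "war": 1.8}

def score_text(text, dictionary):
    """Frequency-weighted average score, plus dictionary coverage of the text."""
    counts = Counter(text.lower().split())
    matched = {w: c for w, c in counts.items() if w in dictionary}
    total = sum(counts.values())
    covered = sum(matched.values())
    if covered == 0:
        return None, 0.0
    avg = sum(dictionary[w] * c for w, c in matched.items()) / covered
    return avg, covered / total

avg, coverage = score_text("i love love my happy dog but hate the rain", happiness)
print(round(avg, 2), round(coverage, 2))
```

A low coverage value is the warning sign the criterion above describes: the score is then built from too small a slice of the text's lexicon to be trusted.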

Our ability to communicate relies in part upon a shared emotional experience, with stories often following distinct emotional trajectories, forming patterns that are meaningful to us. By classifying the emotional arcs for a filtered subset of 4,803 stories from Project Gutenberg's fiction collection, we find a set of six core trajectories which form the building blocks of complex narratives. We strengthen our findings by separately applying optimization, linear decomposition, supervised learning, and unsupervised learning. For each of these six core emotional arcs, we examine the closest characteristic stories in publication today and find that particular emotional arcs enjoy greater success, as measured by downloads. Within stories lie the core values of social behavior, rich with both strategies and proper protocol, which we can begin to study more broadly and systematically as a true reflection of culture. Of profound scientific interest will be the degree to which we can eventually understand the full landscape of human stories, and data-driven approaches will play a crucial role.

emily cody, phd in mathematics, 2016

title: mathematical modeling of public opinion using traditional and social media

abstract: With the growth of the internet, data from text sources has become increasingly available to researchers in the form of online newspapers, journals, and blogs. This data presents a unique opportunity to analyze human opinions and behaviors without soliciting the public explicitly. In this research, I utilize newspaper articles and the social media service Twitter to infer self-reported public opinions and awareness of climate change. Climate change is one of the most important and heavily debated issues of our time, and analyzing large-scale text surrounding this issue reveals insights into self-reported public opinion. First, I inquire about public discourse on both climate change and energy system vulnerability following two large hurricanes. I apply topic modeling techniques to a corpus of articles about each hurricane in order to determine how these topics were reported on in the post-event news media. Next, I perform sentiment analysis on a large collection of data from Twitter using a previously developed tool called the "hedonometer". I use this sentiment scoring technique to investigate how the Twitter community reports feeling about climate change. Finally, I generalize the sentiment analysis technique to many other topics of global importance, and compare to more traditional public opinion polling methods. I determine that since traditional public opinion polls have limited reach and high associated costs, text data from Twitter may be the future of public opinion polling.

jake williams, phd in mathematics, 2015

title: lexical mechanics: partitions, mixtures, and context

abstract: Highly structured for efficient communication, natural languages are complex systems. Unlike in their computational cousins, functions and meanings in natural languages are relative, frequently prescribed to symbols through unexpected social processes. Despite grammar and definition, the presence of metaphor can leave unwitting language users "in the dark," so to speak. This is not problematic, but rather an important operational feature of languages, since the lifting of meaning onto higher-order structures allows individuals to compress descriptions of regularly-conveyed information. This compressed terminology, often only appropriate when taken locally (in context), is beneficial in an enormous world of novel experience. However, what is natural for a human to process can be tremendously difficult for a computer.

When a sequence of words (a phrase) is to be taken as a unit, suppose the choice of words in the phrase is subordinate to the choice of the phrase, i.e., there exists an inter-word dependence owed to membership within a common phrase. This word selection process is not one of independent selection, and so is capable of generating word-frequency distributions that are not accessible via independent selection processes. We have shown in Ch. 2 through analysis of thousands of English texts that empirical word-frequency distributions possess these word-dependence anomalies, while phrase-frequency distributions do not. In doing so, this study has also led to the development of a novel, general, and mathematical framework for the generation of frequency data for phrases, opening up the field of mass-preserving mesoscopic lexical analyses.

A common oversight in many studies of the generation and interpretation of language is the assumption that separate discourses are independent. However, even when separate texts are each produced by means of independent word selection, it is possible for their composite distribution of words to exhibit dependence. Succinctly, different texts may use a common word or phrase for different meanings, and so exhibit disproportionate usages when juxtaposed. To support this theory, we have shown in Ch. 3 that the act of combining distinct texts to form large 'corpora' results in word-dependence irregularities. This not only settles a 15-year discussion, challenging the current major theory, but also highlights an important practice necessary for successful computational analysis: the retention of meaningful separations in language.

We must also consider how language speakers and listeners navigate such a combinatorially vast space for meaning. Dictionaries (or, the collective editorial communities behind them) are smart. They know all about the lexical objects they define, but we ask about the latent information they hold, or should hold, about related, undefined objects. Based solely on the text as data, in Ch. 4 we build on our result in Ch. 2 and develop a model of context defined by the structural similarities of phrases. We then apply this model to define measures of meaning in a corpus-guided experiment, computationally detecting entries missing from a massive, collaborative online dictionary known as Wiktionary.

eitan pechenick, phd in mathematics, 2015

title: exploring the google books corpus: an information theoretic approach to linguistic evolution

abstract: The Google Books corpus contains millions of books in a variety of languages. Due to this incredible volume and its free availability, it is a treasure trove that has inspired a wealth of linguistic research.

It is tempting to treat frequency trends from Google Books data sets as indicators for the true popularity of various words and phrases. Doing so allows us to draw novel conclusions about the evolution of public perception of a given topic. However, sampling published works by availability and ease of digitization leads to several important effects, which have typically been overlooked in previous studies. One of these is the ability of a single prolific author to noticeably insert new phrases into a language. A greater effect arises from scientific texts, which have become increasingly prevalent in the last several decades and are heavily sampled in the corpus. The result is a surge of phrases typical to academic articles but less common in general, such as references to time in the form of citations. We highlight these dynamics by examining and comparing major contributions to the statistical divergence of English data sets between decades in the period 1800–2000. We find that only the English Fiction data set from the second version of the corpus is not heavily affected by professional texts, in clear contrast to the first version of the fiction data set and both unfiltered English data sets.

We critique a method used by authors of an earlier work to determine the birth and death rates of words in a given linguistic data set. While intriguing, the method in question appears to produce an artificial surge in the death rate at the end of the observed period of time. In order to avoid boundary effects in our own analysis of asymmetries in language dynamics, we observe the volume of word flux across various relative frequency thresholds (in both directions) for the second English Fiction data set. We then use the contributions of the words crossing these thresholds to the Jensen-Shannon divergence between consecutive decades to resolve major factors driving the flux.
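The divergence machinery here can be sketched in miniature: compute the Jensen-Shannon divergence between two normalized word-frequency distributions and decompose it into per-word contributions to see which words drive the flux. The two tiny "decade" distributions below are invented for illustration:

```python
from math import log2

def jsd_contributions(p, q):
    """Per-word contributions to the Jensen-Shannon divergence between two
    normalized word-frequency distributions (base-2 logs, so JSD is in bits)."""
    words = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in words}
    contrib = {}
    for w in words:
        c = 0.0
        if p.get(w, 0.0) > 0:
            c += 0.5 * p[w] * log2(p[w] / m[w])
        if q.get(w, 0.0) > 0:
            c += 0.5 * q[w] * log2(q[w] / m[w])
        contrib[w] = c
    return contrib

decade_a = {"thou": 0.5, "hast": 0.3, "love": 0.2}
decade_b = {"data": 0.4, "model": 0.4, "love": 0.2}
contrib = jsd_contributions(decade_a, decade_b)
jsd = sum(contrib.values())
top = max(contrib, key=contrib.get)
print(round(jsd, 3), top)
```

Words used equally in both distributions contribute nothing, so ranking by contribution surfaces exactly the vocabulary whose flux separates the decades.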

Having established careful information-theoretic techniques to resolve important features in the evolution of the data set, we validate and refine our methods by analyzing the effects of major exogenous factors, specifically wars. This approach leads to a uniquely comprehensive set of methods for harnessing the Google Books corpus and exploring socio-cultural and linguistic evolution.

nicholas allgaier, phd in mathematics, 2014

title: reverse engineering the human brain: an evolutionary computation approach to the analysis of fMRI

abstract: The field of neuroimaging has truly become data rich, and as such, novel analytical methods capable of gleaning meaningful information from large stores of imaging data are in high demand. Those methods that might also be applicable on the level of individual subjects, and thus potentially useful clinically, are of special interest. In this dissertation we introduce just such a method, called nonlinear functional mapping (NFM), and demonstrate its application in the analysis of resting state fMRI (functional Magnetic Resonance Imaging) from a 242-subject subset of the IMAGEN project, a European study of risk-taking behavior in adolescents that includes longitudinal phenotypic, behavioral, genetic, and neuroimaging data. Functional mapping employs a computational technique inspired by biological evolution to discover and mathematically characterize interactions among ROI (regions of interest), without making linear or univariate assumptions. Statistics of the resulting interaction relationships comport with recent independent work, constituting a preliminary cross-validation. Furthermore, nonlinear terms are ubiquitous in the models generated by NFM, suggesting that some of the interactions characterized here are not discoverable by standard linear methods of analysis. One such nonlinear interaction is discussed in the context of a direct comparison with a procedure involving pairwise correlation, designed to be an analogous linear version of functional mapping. Another such interaction suggests a novel distinction in brain function between drinking and non-drinking adolescents: a tighter coupling, among drinkers, of ROI associated with emotion, reward, and interoceptive processes such as thirst. Finally, we outline many improvements and extensions of the methodology to reduce computational expense, complement other analytical tools like graph-theoretic analysis, and possibly allow for voxel-level functional mapping to eliminate the necessity of ROI selection.

catherine bliss, phd in mathematics, 2014

title: description, prediction, and evolution of a large, dynamic network from incomplete data

abstract: Complex networks underlie a variety of social, biological, physical, and virtual systems. Understanding the topology of networks, the manner in which agents interact, and the evolutionary dynamics of the system can be challenging, both computationally and theoretically. In many settings, network data is incomplete; it is impossible to observe all nodes and all network interactions due to sampling constraints in large datasets or covert interactions between agents. We begin by defining Twitter reciprocal reply networks and examine the revealed social network structure and dynamics over the time scales of days, weeks, and months. At the level of user behavior, we employ our hedonometric analysis methods to investigate patterns of sentiment expression. We find users' average happiness scores to be positively and significantly correlated with those of users one, two, and three links away. We strengthen our analysis by proposing and using a null model to test the effect of network topology on the assortativity of happiness. We also find evidence that more well-connected users write happier status updates, with a transition occurring around Dunbar's number. Second, we use an evolutionary algorithm to optimize weights in a linear combination of sixteen neighborhood and node similarity indices to predict future links. Our method exhibits fast convergence and high levels of precision for the top twenty predicted links. Based on our findings, we suggest possible factors which may be driving the evolution of Twitter reciprocal reply networks. Lastly, we acknowledge that our dataset is incomplete and explore how global network statistics scale with missing data in a variety of sampling regimes. We propose scaling methods to predict true network parameters from only partial knowledge of nodes, links, or weighted interactions. We validate our analytical results with four classes of simulated networks (Erdős-Rényi, scale-free, small-world, and range-dependent) and six empirical data sets. To overcome limitations due to sampling tweets, we apply our developed methods to Twitter reply networks and suggest a characterization of the Twitter interactome for this time period.
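The link-prediction approach can be sketched in miniature. The dissertation optimizes weights over sixteen similarity indices with an evolutionary algorithm; this hypothetical example instead fixes equal weights over just two classic indices (common neighbors and Adamic-Adar) on an invented toy network:

```python
# Sketch of link prediction via a weighted combination of similarity indices.
# Weights, indices, and the toy network are illustrative only.
from math import log

graph = {  # small undirected network as adjacency sets
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"a"},
}

def common_neighbors(u, v):
    return len(graph[u] & graph[v])

def adamic_adar(u, v):
    # down-weight common neighbors that are themselves highly connected
    return sum(1.0 / log(len(graph[z])) for z in graph[u] & graph[v]
               if len(graph[z]) > 1)

def link_score(u, v, w_cn=0.5, w_aa=0.5):
    return w_cn * common_neighbors(u, v) + w_aa * adamic_adar(u, v)

# rank all non-edges by predicted score
nodes = sorted(graph)
candidates = [(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]
              if v not in graph[u]]
ranked = sorted(candidates, key=lambda e: link_score(*e), reverse=True)
print(ranked[0])
```

In the evolutionary version, the weights `w_cn` and `w_aa` (and fourteen more) would be evolved to maximize precision on held-out links rather than set by hand.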

andrew reagan, masters in mathematics, 2013

title: predicting flow reversals in a computational fluid dynamics simulated thermosyphon using data assimilation

abstract: a thermal convection loop is a circular chamber filled with water, heated on the bottom half and cooled on the top half. with sufficiently large heat forcing, the direction of fluid flow in the loop oscillates chaotically, forming an analog to the earth's weather. as is the case for state-of-the-art weather models, we only observe statistics over a small region of state space, making prediction difficult. to overcome this challenge, data assimilation (da) methods, and specifically ensemble methods, use the computational model itself to estimate its own uncertainty and optimally combine observations into an initial condition for predicting the future state. first, we build and verify four distinct da methods. then, a computational fluid dynamics simulation of the loop and a reduced-order model are both used by these da methods to predict flow reversals. the results contribute to a testbed for algorithm development.
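A minimal sketch of the ensemble idea, using the Lorenz '63 system as the standard low-order analog of a convection loop (parameters, noise levels, and ensemble size are illustrative, not those of the thesis experiments):

```python
import random

# Ensemble forecasting sketch: perturb a noisy observation to build an
# ensemble, forecast each member, and use the spread as an uncertainty
# estimate. Lorenz '63 stands in for the convection loop.

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return (x + dt * sigma * (y - x),
            y + dt * (x * (rho - z) - y),
            z + dt * (x * y - beta * z))

def forecast(state, steps):
    for _ in range(steps):
        state = lorenz_step(state)
    return state

random.seed(0)
truth = forecast((1.0, 1.0, 1.0), 500)           # spin up onto the attractor
observation = tuple(v + random.gauss(0, 0.1) for v in truth)

# ensemble of perturbed initial conditions around the observation
ensemble = [tuple(v + random.gauss(0, 0.1) for v in observation)
            for _ in range(20)]
forecasts = [forecast(member, 200) for member in ensemble]

# ensemble mean is the forecast; the spread estimates its uncertainty
mean = tuple(sum(f[i] for f in forecasts) / len(forecasts) for i in range(3))
spread = max(abs(f[0] - mean[0]) for f in forecasts)
print(mean, spread)
```

A full da cycle would additionally weight each member against incoming observations (e.g., with an ensemble Kalman filter) before launching the next forecast; this sketch shows only the forecast half of that loop.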

kameron d. harris, masters in mathematics, 2012

title: on-off threshold models of social behavior

abstract: we study binary state dynamics on a social network, where nodes act according to individual response functions of the average state of their neighborhood. these response functions model the competing tendencies of imitation and non-conformity by incorporating an “off-threshold” into standard threshold models of behavior. in this way, we attempt to capture important aspects of fashions and general societal trends.

allowing varying amounts of stochasticity in both the network and response functions, we find different outcomes in the random and deterministic versions of the model. in the limit of a large, dense network, however, these dynamics coincide. the dynamical behavior of the system ranges from steady state to chaotic depending on network connectivity and update synchronicity. a mean field theory is laid out in detail for general random networks. in the undirected case, the mean field theory predicts that the dynamics on the network are a smoothed version of the response function dynamics. the theory is compared to simulations on Poisson random graphs with response functions that average to the chaotic tent map.
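The on-off response idea can be sketched as a simulation: each node activates only when the fraction of active neighbors lies between a lower "on" threshold (imitation) and an upper "off" threshold (non-conformity). Thresholds, network size, and degree below are invented for illustration:

```python
import random

# Synchronous on-off threshold dynamics on a random directed k-neighbor
# network. Heterogeneous thresholds are drawn at random for illustration.

random.seed(1)
n, k = 200, 20
on = [random.uniform(0.0, 0.3) for _ in range(n)]    # "on" thresholds
off = [random.uniform(0.5, 1.0) for _ in range(n)]   # "off" thresholds
neighbors = [random.sample([j for j in range(n) if j != i], k)
             for i in range(n)]
states = [random.randint(0, 1) for _ in range(n)]

def step(states):
    new = []
    for i, nbrs in enumerate(neighbors):
        frac = sum(states[j] for j in nbrs) / len(nbrs)
        # active iff the local trend is popular enough, but not too popular
        new.append(1 if on[i] <= frac < off[i] else 0)
    return new

trajectory = []
for _ in range(30):
    states = step(states)
    trajectory.append(sum(states) / n)   # fraction of active nodes
print(trajectory[-5:])
```

In the dense-network limit, the trajectory of the active fraction approaches iteration of the averaged response function, which is the mean field picture described above.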

ashley b. mckhann, undergraduate honors in mathematics & physics, 2011

title: odes modeling chaotic convection: data vs physics

abstract: this research investigates the difficulties associated with climate and weather prediction by forecasting the future state of a toy climate, analogous to the Lorenz model of natural convection. temperature measurements, collected from a computational fluid dynamics (cfd) simulated thermal convection loop, are fit with a set of three differential equations using eureqa, a recently developed software tool for inferring natural laws from freeform data. several forecasts, made using the optimal differential equations, are compared against a verifying set of temperature data from the cfd system to evaluate the accuracy of the forecast model. the results are interpreted with respect to a physical model derived from first principles. the goal is to discover reasonably accurate, low-dimensional prediction equations which model the physical laws governing this system.

ross lieb-lappen, masters in mathematics & statistics, 2010

title: aggressive shadowing of a low-dimensional model of atmospheric dynamics

abstract: modeling earth's atmospheric conditions is difficult due to the size of the system, and predictions of its future state suffer from the consequences of chaos. As a result, current weather forecast models quickly diverge from observations as uncertainty in the initial state is amplified by nonlinearity. One measure of the strength of a forecast is its shadowing time, the period for which the forecast is a reasonable description of reality. The present work uses the Lorenz '96 coupled system, a simplified nonlinear model of atmospheric conditions, to extend a recently developed technique for lengthening the shadowing time of a dynamical system. An ensemble of initial states, systematically perturbed using knowledge of the local dynamics, is used to make a forecast. The experiment is then repeated using inflation, whereby the ensemble is regularly expanded along dimensions whose uncertainty is contracting. The first goal of this work is to compare the two forecasts to reality, chosen to be an imperfect version of the same model, and determine whether variance inflation succeeds. The second goal is to establish whether inflation can increase the maximum shadowing time for a single member of the ensemble. In the second experiment the trajectory of reality is known a priori, and only the closest ensemble members are considered at each time step. When inflation is introduced to this technique, it is called stalking. Variance inflation was shown to have the potential to be successful, with the extent dependent upon algorithm parameters (e.g., size of state space, inflation amount). Under idealized conditions, the technique was shown to improve forecasts over 50% of the time. Under these same conditions, stalking also exhibited the potential to be useful. When only the best ensemble members were considered at each time step, the known trajectory could be shadowed for an entire 50-day forecast 50-75% of the time. However, if inflation occurs in directions incommensurate with the true trajectory, inflation can actually reduce stalking times. Thus, utilized appropriately, inflation has the potential to improve predictions of the future state of atmospheric conditions, and possibly other physical systems.
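The Lorenz '96 testbed itself is compact enough to sketch. The inflation and stalking machinery is beyond a snippet, but the following (with standard parameter choices, not necessarily those of the thesis) shows the sensitive dependence that makes shadowing hard: a tiny initial-condition error grows rapidly:

```python
# Lorenz '96: N coupled variables on a ring,
#   dX_k/dt = (X_{k+1} - X_{k-2}) X_{k-1} - X_k + F,
# integrated with classical RK4. F = 8 gives chaotic dynamics.

def l96_tendency(x, forcing=8.0):
    n = len(x)
    return [(x[(k + 1) % n] - x[k - 2]) * x[k - 1] - x[k] + forcing
            for k in range(n)]

def rk4_step(x, dt=0.05):
    n = len(x)
    k1 = l96_tendency(x)
    k2 = l96_tendency([x[i] + 0.5 * dt * k1[i] for i in range(n)])
    k3 = l96_tendency([x[i] + 0.5 * dt * k2[i] for i in range(n)])
    k4 = l96_tendency([x[i] + dt * k3[i] for i in range(n)])
    return [x[i] + dt / 6.0 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i])
            for i in range(n)]

n = 40
truth = [8.0] * n
truth[0] += 0.01                      # kick off the unstable fixed point
for _ in range(500):                  # spin up onto the attractor
    truth = rk4_step(truth)

forecast = truth[:]
forecast[0] += 1e-4                   # tiny initial-condition error
errors = []
for _ in range(100):
    truth = rk4_step(truth)
    forecast = rk4_step(forecast)
    errors.append(max(abs(truth[i] - forecast[i]) for i in range(n)))
print(errors[0], errors[-1])
```

Shadowing and stalking aim to delay exactly this divergence by steering ensemble members along the directions in which uncertainty is growing or contracting.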

kameron d. harris, undergraduate honors in mathematics & physics, 2009

title: predicting climate regime change in chaotic convection

abstract: a toy climate analogous to the famous lorenz system is derived and compared to computational fluid dynamics simulations in order to test new modeling techniques. in particular, methods of data assimilation and ensemble forecasting are used to predict regime changes and residencies in this toy climate. a climate “truth” is created using a finite-element simulation of a thermal convection loop, a physical apparatus designed to be the simplest model of convection in the earth’s atmosphere. forecasts of the climate are made using low-dimensional lorenz-like models and synchronized to noisy observations of the truth using various kalman filters. forecasting of regime changes has been successfully demonstrated when the same model is used to create both the observations and the forecast, but never for realistic chaotic convection.

academic genealogy (pdf)

the mathematics genealogy project at north dakota state university keeps track of academic lineage in an attempt to generate the entire history of the advisor/advisee network of mathematicians. my heritage dates back to the 14th century and includes copernicus, huygens, leibniz, kant, wiener, helmholtz, hertz, bjerknes, charney, kalnay, and yorke, among others.

aaron schwartz, phd*=

nature & well-being

tyler gray, phd*

computational linguistics

michael arnold, ms*

urban scaling

sophie hodson, ug*

computational social science


sarah howerter, ms*£

computational social science



david dewhurst, phd*#



colin van oort, phd*#¡

machine learning

josh minot, ms*

computational linguistics

laura jennings, ug*ƒ

food systems

lindsay ross, ug*º€Ð

palliative care nlp

apple software



suma desu, uvm undergrad*

computational linguistics

data scientist, apple

paul lessard, uvm masters*

computational linguistics

uc boulder phd student

also advised by:

*   peter dodds

+  josh bongard

!   james bagrow

#  brian tivnan

=  taylor ricketts

% paul hines

@  yves dubief

&  darren hitt

$  chris jones

^  hugh garavan

  meredith niles

º   maggie eppstein

¡   safwan wshah

£  lisa aultman-hall

¢  lesley-ann


ƒ  lizzie pope

§  matt mahoney

  bob gramling

Ð  donna rizzo

þ  dave hammond

µ  chris skalka

mike foley, uvm masters*#

computational finance

northeastern phd student


kayla horak, uvm masters*

statistical hedonometrics

research scientist

univ of wisconsin


catherine bliss, uvm phd*

inference on social networks

lecturer at uvm


lewis mitchell, uvm postdoc*

data assimilation

geography & happiness

instructor, univ of adelaide


morgan frank, uvm masters*

human mobility

mit phd student


jake williams, phd*

computational linguistics

quantifying stories

faculty at drexel university

nicholas allgaier, uvm phd*+^

reverse engineering the brain

postdoc, neuroscience, uvm


eitan pechenick, phd*

computational linguistics

maine school of science

& mathematics


sharon alajajian, ms*

food systems & eating

behavior via social media


dilan kiley, ms*

quantification of sports stories

chobanian group

thomas mcandrew, phd*!

nsf igert, smart grid &

human behavior

cardiovascular research


emily cody, phd*!

nsf igert, smart grid &

human behavior

data scientist @ adobe

mark ibrahim, ms*

structure of knowledge

data scientist @ insight

fletcher hazlehurst*

uvm undergrad



ben emery*

climate change

john ring, phd*#µ

computational finance

todd deluca, phd*

machine learning

public health

henry mitchell, ug*§

nonlinear dynamics

anne-marie stupinski, ug*

computational social science


darcy glenn, *þ

uvm undergrad

climate modeling

uc london ms student

abby ross, ms*

gender & media

northfield mount hermon


chris fusting, ms*#

data scientist @ diveplane


andy reagan, phd*@#

the science of stories

data scientist at massmutual


ryan gallagher, ms*

protest dynamics

northeastern phd student


eric clark, phd*$

computational linguistics

sandhya gopchandani, ms*

word embeddings &

mental health

jade adams, ms*

massmutual data artist in residence

brendan whitney, ms*¢

climate statistics

sven mccall, ug*¢

lagrangian coherent structures

apple maps team

peter larsen, ug*

tweeting while eating