Understanding Communities with Topic Models
This post is still under construction
This post is a user-friendly introduction to my two recent papers, ‘Learning seasonal phytoplankton communities with topic models’ and ‘Phytoplankton hotspot prediction with an unsupervised spatial community model’.
In our recent work, we’ve been developing methods that simplify exploratory data analysis – expanding scientists’ toolbox for taking complex, high-dimensional data and breaking it down into intuitive representations that suggest hypotheses worth testing.
Specifically, we’ve been working with topic-model-based representations of datasets. A topic model takes a set of observations (‘words’), grouped into unordered collections (‘documents’), and learns groups of observations that tend to occur together (‘topics’), along with how strongly each possible observation is associated with each group. We’ve found that topic models, adapted to include weak temporal or spatial information, are extremely versatile tools for breaking down scientific data into interpretable representations [my co-author, Yogesh Girdhar, has demonstrated this in many more contexts than I have].
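To make the ‘words’ and ‘documents’ framing concrete, here is a minimal sketch of the input a topic model typically consumes: a table of per-document word counts, with word order thrown away. The toy corpus, the word labels and the helper names `build_vocabulary` and `count_matrix` are all made up for illustration; this is not the code from our papers.

```python
from collections import Counter

# Hypothetical toy corpus: each inner list is one "document", i.e. an
# unordered collection of observed "words". The labels are placeholders.
documents = [
    ["w1", "w2", "w1", "w3"],
    ["w2", "w2", "w4"],
    ["w1", "w3", "w3", "w3"],
]


def build_vocabulary(docs):
    """Return the sorted list of distinct words seen in any document."""
    return sorted({word for doc in docs for word in doc})


def count_matrix(docs, vocabulary):
    """Turn each document into a row of per-word counts (order is discarded)."""
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[word] for word in vocabulary])
    return rows


vocabulary = build_vocabulary(documents)
matrix = count_matrix(documents, vocabulary)

print(vocabulary)
for row in matrix:
    print(row)
```

A topic model would then look for a small number of word groupings that explain which columns of this table tend to be large together.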
Most recently, we’ve published two results that break down datasets of population data for 47 types of plankton into a few groupings of taxa that are likely to be found together. These groupings, which we call communities, make it possible to understand the overall populations in terms of simple measurements of what is happening in the environment. They also make it easier to predict where certain rare types of phytoplankton are likely to be found, by using the more common types, which are easier to find.
Phytoplankton … ?
You probably know that phytoplankton are some kind of microscopic marine organism, and not a lot else. That’s partly because phytoplankton are an extremely diverse category of organism. Their defining feature, and the one that makes them so important, is that they are autotrophs, meaning they produce their own sustenance, e.g. through photosynthesis, rather than consuming other organisms for it.
Essentially, phytoplankton just means marine plant. Similar to plants, phytoplankton are the primary producers in many marine ecosystems, sustaining all heterotrophs directly or indirectly. When phytoplankton populations are impacted, wide-ranging ecological consequences follow immediately. Also similar to plants, phytoplankton play a role in global-scale carbon cycles. When they photosynthesize, they absorb carbon from the atmosphere or the water around them, and when they die they may sink to the ocean floor, where under the right conditions the carbon is sequestered in the sea floor rather than returned to the atmosphere. Phytoplankton are extremely diverse, with thousands of species identified, each with its own environmental niche and ecological role to play.
In contrast to plants, phytoplankton are microscopic, they live in the ocean, and they drift rather than staying conveniently in one spot for us to study. For terrestrial plants, even a non-ecologist like me has some easy intuitions: e.g., I know that spruce trees, blueberry bushes, goldenrod, and poison ivy all thrive in Quebec’s forests in the middle of summer. Phytoplankton ecologists face a similar diversity of species but don’t have the same kind of intuitive picture. Up until recently, studying phytoplankton meant either bringing a water sample back to a microscope in a lab or getting a rough biomass estimate from the amount of chlorophyll visible in a satellite image. Imagine trying to understand the health of a forest if you had to go in with a blindfold, pick a few plants at random, and only figure out what you picked a day later. If you’re lucky, you might also get data showing that in your particular region there was either a lot or a little plant activity. And if you’re extra lucky, by the time you refine your hypotheses based on this data, the organisms have not drifted away. So clearly we need better instruments to begin to understand phytoplankton ecology.
To summarize, phytoplankton are vital to marine food webs, as well as for predicting and possibly mitigating climate change. But because they exist in a completely different world from our normal experience, we don’t yet understand them in detail proportional to their importance or diversity.
Imaging FlowCytobot
Heidi Sosik, Robert Olson and a team of researchers at WHOI are developing a new set of tools to get exactly the kind of data that phytoplankton ecologists are missing. One novel instrument, Imaging FlowCytobot or IFCB, samples ocean water and automatically photographs individual plankton through a microscope lens. On its own, IFCB is an amazing technical achievement. IFCB uses in situ flow cytometry, a method which involves shining laser light through a stream of water droplets and measuring the chlorophyll fluorescence to trigger the camera’s shutter at exactly the right moment to capture an image of a microscopic organism. It does this at regular intervals, and is robust enough to be deployed at sea or under a ship for years at a time, communicating its plankton photos back to shore remotely.
Sosik, Olson and others have built an image classification system on top of this plankton detector. Using classic computer vision techniques, they can identify 47 taxa (roughly, groups of species) with 88% accuracy. This means that on an IFCB deployment, scientists can get measurements of which taxa, and how many of each, were in a 5 mL sample of ocean water, taken up to 3 times per hour.
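As a rough sketch of what one such measurement looks like as data, the snippet below tallies classifier labels for a single sample and converts a count into a concentration. The 5 mL volume and the 3-samples-per-hour ceiling come from the figures above; the taxon labels and the function names `taxon_counts` and `concentration_per_ml` are hypothetical placeholders, not part of the IFCB software.

```python
from collections import Counter

SAMPLE_VOLUME_ML = 5.0     # sample volume quoted in the text
MAX_SAMPLES_PER_HOUR = 3   # upper bound on sampling rate quoted in the text


def taxon_counts(predicted_labels):
    """Count how many imaged organisms were assigned to each taxon."""
    return Counter(predicted_labels)


def concentration_per_ml(count, volume_ml=SAMPLE_VOLUME_ML):
    """Convert a per-sample count into organisms per millilitre."""
    return count / volume_ml


# Hypothetical classifier output for one sample (placeholder taxon names).
labels = ["taxon_a", "taxon_b", "taxon_a", "taxon_c", "taxon_a"]
counts = taxon_counts(labels)

print(counts["taxon_a"], concentration_per_ml(counts["taxon_a"]))
# At most this many samples can be collected in one day.
max_samples_per_day = MAX_SAMPLES_PER_HOUR * 24
print(max_samples_per_day)
```

Because the classifier is right about 88% of the time, counts like these are best read as noisy estimates rather than exact tallies.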
An IFCB has been deployed and continuously sampling off the coast of Martha’s Vineyard, MA since 2008. Other IFCBs have been deployed on research cruises, for instance on a two-week-long journey from Cape Cod, MA, up and around the Gulf of Maine, and back down the coast all the way to North Carolina. These datasets also include environmental conditions such as water temperature, current direction and speed, dissolved oxygen content, and meteorological data. This represents a dataset of unprecedented resolution, rich with exactly the kind of information ecologists are missing – enough to begin to understand the environmental niches of different phytoplankton taxa in detail.
A simple explanation for complex data
The dataset of daily taxon counts at Martha’s Vineyard from 2009-2016 is exactly the kind of data ecologists need, but it presents very few obvious conclusions.
You can see that there are some taxa which are much more common than others, and that at least some of the taxa show cyclic patterns in their abundance. If we tried to predict the distribution of taxa from a set of environmental factors, though, we’d have trouble: this data is noisy and the target is high-dimensional, meaning we’d need a lot of training data for a sufficiently complex regression model to generalize well on this problem.
Our Oceans 2017 work proposes finding a set of communities of taxa which co-occur frequently throughout the dataset, and doing so in a way that lets the communities be predicted accurately from environmental data. This last part of the proposal means that we balance predictive accuracy against keeping the communities few enough, and intuitive enough, that an ecologist can look at trends in the output and say things like ‘The communities which die off every winter seem to be surviving longer every year’, and then choose to go collect more specific data to test pieces of that hypothesis.
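One way to see why communities make the prediction problem easier: if each community is a fixed distribution over taxa, then predicting a handful of community weights is enough to recover an expected distribution over all the taxa. The sketch below shows that mixing step in the generic topic-model form. The numbers, the two-community and three-taxon sizes, and the function name `expected_taxon_distribution` are illustrative assumptions, and the papers’ actual model may differ in its details.

```python
def expected_taxon_distribution(community_weights, community_profiles):
    """Mix community profiles by their weights into one taxon distribution.

    community_weights: one non-negative weight per community, summing to 1.
    community_profiles: one list per community giving P(taxon | community).
    """
    n_taxa = len(community_profiles[0])
    mixed = [0.0] * n_taxa
    for weight, profile in zip(community_weights, community_profiles):
        for i, prob in enumerate(profile):
            mixed[i] += weight * prob
    return mixed


# Hypothetical example: 2 communities described over 3 taxa.
community_profiles = [
    [0.7, 0.2, 0.1],  # community 0 is dominated by the first taxon
    [0.1, 0.3, 0.6],  # community 1 is dominated by the third taxon
]
# Hypothetical weights, e.g. as predicted from environmental data on one day.
community_weights = [0.25, 0.75]

mixed = expected_taxon_distribution(community_weights, community_profiles)
print(mixed)
```

With 47 taxa and only a few communities, a regression model in this setup has to output a few weights rather than 47 separate abundances, which is the dimensionality reduction the approach relies on.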