Saturday 5 September 2015

The good, the bad and the ugly

Dealing with contamination in supernova datasets


Picture this: you have a room full of people, and the one burning question you want answered is "what is the average height of the women in the room?" But the problem is, you don't know which people are women. The only information you do have is the length of each person's hair and their height. What do you do? Well, you can make an educated guess as to whether someone is a man or a woman based on the length of their hair. You could even assign a probability: people with longer hair are more likely to be women. Then you could cut your dataset based on that probability and assume everyone with a probability of, say, 90% or higher is a woman.


There's a problem with this approach though. You're going to lose a lot of your data, since many women with short hair will end up with a probability below 90%. Much worse, you'll also get some contamination. Men with long hair will sneak into your dataset and bias your estimate of the height, since they are, on average, taller. You can get around this by using a higher probability cut, resulting in less contamination and less bias, but then you'll get a much noisier estimate of the average height, since you have less data in your sample.
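To make that trade-off concrete, here's a minimal toy simulation (every number in it is invented purely for illustration, including the heights and the shapes of the probability distributions):

    import numpy as np

    rng = np.random.default_rng(42)

    # Made-up populations (all numbers purely illustrative)
    women = rng.normal(165, 7, 500)   # the answer we want: 165 cm
    men = rng.normal(178, 7, 500)
    heights = np.concatenate([women, men])

    # A crude "hair length" probability: skewed towards 1 for women,
    # towards 0 for men, but with plenty of overlap
    p_woman = np.concatenate([rng.beta(8, 2, 500), rng.beta(2, 8, 500)])

    for cut in [0.5, 0.9, 0.99]:
        sample = heights[p_woman > cut]
        print(f"cut={cut:.2f}  n={sample.size:4d}  mean={sample.mean():.1f} cm")

    # A stricter cut admits fewer tall contaminants (less bias) but
    # throws away genuine women too, so the estimate gets noisier.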

This example might seem ridiculous, but it has very real applications in cosmology. Not with hair, obviously...

In the first half of my PhD, I worked on supernova cosmology and how to deal with contaminated datasets. A supernova is an exploding star, among the brightest events in the Universe, which means we can see them from very far away. Supernovae are incredibly useful for cosmology because we think we know how bright they actually are, so by measuring how bright they appear, we can figure out how far away they are (much like judging the distance of an oncoming car on a highway at night from the brightness of its headlights).
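As a rough sketch of the standard-candle idea (ignoring redshift effects, dust and the light-curve corrections real analyses use, and treating the often-quoted figure of about -19.3 for a type Ia's absolute magnitude as an assumption):

    def luminosity_distance_pc(m, M=-19.3):
        """Distance from the distance modulus: m - M = 5*log10(d / 10 pc)."""
        return 10 ** ((m - M + 5) / 5)

    # A standard candle that appears at magnitude 21 is around a gigaparsec away
    print(f"{luminosity_distance_pc(21.0) / 1e6:.0f} Mpc")  # ~1148 Mpc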

But there's a catch. Only one type of supernova is useful, because it's the only one that behaves predictably enough for us to have a good idea of its intrinsic brightness: the type Ia supernova (pronounced "type-one-A"). Type Ia's can give us tight constraints on the fundamental parameters of the Universe, and are the main reason we know it is dominated by dark energy.

The problem is that in order to know for sure what type a particular supernova is, you have to take a spectrum. It's a bit like doing a DNA test: it's a very accurate way of determining the type, but it's also very expensive. Taking an image is easy; taking a spectrum needs a lot of light, a long observation and hence a lot of telescope time. There's also time pressure, since a supernova fades within 20-90 days of its initial detection, so it has to be followed up spectroscopically quickly. In the past, it was possible to follow up almost every object detected that was thought to be a supernova. New surveys such as the Dark Energy Survey (DES) and the upcoming Large Synoptic Survey Telescope (LSST) will make this impossible. LSST is expected to detect anywhere between 10 000 and 500 000 supernovae in its lifetime, and it will simply be impossible to get enough telescope time to confirm them all spectroscopically.

So where does that leave us? Well, back to the hair problem. Our type Ia supernovae are the women in the first example, and their average height corresponds to the cosmological parameters we can get from them. Non-Ia's are contaminants, causing biases in the sample. And just like in the example above, we have incomplete information. You can use the light curve of a supernova (its brightness as it changes with time), which is much easier to obtain than a spectrum, to get a probability that the supernova is a Ia, just like with the hair. But you can never know for sure without a spectrum. So what we're left with is a contaminated supernova dataset, with only a probability of being a Ia for each object.

And this is where BEAMS comes in. BEAMS stands for Bayesian Estimation Applied to Multiple Species and is a great way to solve this problem. Instead of applying cuts to the dataset and allowing the contamination to bias your results, we use all the data but in a statistically consistent way that allows us to get good constraints without biases from the non-Ia's.

The difference with BEAMS is that at no point do we ever decide whether an object is a Ia or not; we allow it to be both. Going back to our example (note: the following plots are basically entirely made up), if we plot the distribution of heights (i.e. we bin the heights of all the people in the room) and look at the average of all the heights, it clearly does not represent the average height of the women in the room, even if we were to cut out a lot of the probable men.


With a mixture model, we instead model two populations rather than one, and each population is allowed to have its own average (the parameter we're interested in). You can see that at the extreme ends of the distribution it's quite clear which population a particular height most likely belongs to, but somewhere in the middle it's hard to tell from the data alone. These middle points contribute a fair bit to both averages (the men's and the women's).
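Here's a toy version of that fit, continuing with the made-up heights from the earlier snippet and assuming, for simplicity, Gaussian populations with a shared spread:

    import numpy as np
    from scipy import stats
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    heights = np.concatenate([rng.normal(165, 7, 500),    # "women"
                              rng.normal(178, 7, 500)])   # "men"

    def neg_log_likelihood(params):
        mu_women, mu_men, sigma, frac = params
        # Every height is allowed to belong to both populations:
        # the two densities add, weighted by the fraction of women
        like = (frac * stats.norm.pdf(heights, mu_women, sigma)
                + (1 - frac) * stats.norm.pdf(heights, mu_men, sigma))
        return -np.sum(np.log(like))

    fit = minimize(neg_log_likelihood, x0=[160.0, 185.0, 5.0, 0.5],
                   bounds=[(140, 200), (140, 200), (1, 30), (0.01, 0.99)])
    print(fit.x)  # recovers roughly [165, 178, 7, 0.5]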



Remember how I jumped up and down about probabilities earlier? They're really important in making mixture models precise. If a person could potentially belong to either population but has really long hair, they have a much higher probability of being a woman, and so they contribute more to the women's average than to the men's. This allows us to get unbiased estimates of both parameters, whichever one we're interested in. Of course, the more reliable the probabilities, the better the estimates will be.
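In code, this amounts to replacing the single global fraction in the sketch above with each object's own probability, which is essentially the BEAMS likelihood in its simplest, uncorrelated form (variables as in the earlier snippets):

    def neg_log_likelihood_beams(params, heights, p_woman):
        mu_women, mu_men, sigma = params
        # Each person is weighted by their own probability of being a woman,
        # so an ambiguous height with long hair pulls mostly on mu_women
        like = (p_woman * stats.norm.pdf(heights, mu_women, sigma)
                + (1 - p_woman) * stats.norm.pdf(heights, mu_men, sigma))
        return -np.sum(np.log(like))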

So the idea behind BEAMS is to give up on trying to perfectly classify every object with spectroscopy and instead account for the contamination we know we're going to get by fitting for it with a mixture model.

The piece of the puzzle I actually worked on in my PhD was: what do you do if the data are correlated? Correlated just means each point is not independent of the others. For example, if the room were full of families, each person's height would be correlated with their relatives' heights, since heights tend to run in families.

It turns out it's computationally intractable to solve the original form of the BEAMS equation if the data are correlated, because it involves a sum over every possible way of classifying the objects. This was a serious problem, so I worked on showing that you can still do BEAMS with correlated data; you just have to change part of the calculation from analytic to numerical. Ultimately, instead of going through every possible combination of classifications (of which there are a very large number), we did some guided random sampling to get to the right solution quickly. We repeated our analysis with over 10 000 simulated datasets to ensure we could accurately recover the cosmological parameters we were interested in, every time.
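To give a flavour of what that swap looks like (this is a cartoon of the idea, not the exact algorithm from the paper): instead of summing the likelihood over every possible classification vector, you can average it over random classifications drawn from the per-object probabilities:

    import numpy as np
    from scipy import stats

    def mc_log_likelihood(mu_ia, mu_non, data, cov, p_ia, n_draws=2000, seed=1):
        """Monte Carlo estimate of the correlated mixture likelihood.

        The exact answer sums over all 2**N classification vectors,
        which explodes for large N; here we sample classifications instead.
        """
        rng = np.random.default_rng(seed)
        total = 0.0
        for _ in range(n_draws):
            is_ia = rng.random(len(data)) < p_ia       # one possible "truth"
            mean = np.where(is_ia, mu_ia, mu_non)
            # Correlated Gaussian likelihood for this classification
            total += stats.multivariate_normal.pdf(data, mean=mean, cov=cov)
        return np.log(total / n_draws)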

If this last bit is not a complete enough explanation for you, dig into the details in our paper. You can also read the original BEAMS papers at these links:

http://arxiv.org/abs/astro-ph/0611004
http://arxiv.org/abs/1111.5328
http://arxiv.org/abs/1110.6178

And more recently, a paper proposing to use BEAMS as part of a much larger meta-analysis of supernovae:
http://arxiv.org/abs/1507.01602

So supernova cosmology in the era of LSST is going to be new and exciting territory, but if we can measure the average height of women in a room, it seems we can do cosmology with exploding stars.









