Thursday 14 February 2013

Making Data Public (and a small matrix-related rant)

In a world where there is constant debate over the merits and disadvantages of Open Access journals and science, we are often bombarded with blogs and posts about it. I am generally a silent proponent of Open Access journals: I agree that it is important, but I am not particularly versed in all of the politics, so I tend to keep quiet. That being said, I have recently stumbled upon a related issue that has affected me in the last few weeks: the importance of making your data public.

Although my primary research interest is in pterosaurs, I am currently writing up a manuscript from my undergraduate thesis, which was on the ceratopsian dinosaur Centrosaurus. Much to my surprise, the most recent discussion with my former supervisor (and the senior/co-author) went in a different direction than I was expecting: he wanted me to develop a character matrix and do a phylogenetic analysis. Now, I've never done this before. I've taken several courses and have a good basic understanding of the concept, but I've never actually developed a matrix and done my own analysis. After discussing it with him, we decided that I would use several published matrices and merge them together, taking several characters from each matrix.

Looking through recently published matrices, I came across the Farke et al. (2011) paper in which Spinops sternbergorum was described. I sent Andy Farke an email, and he was very happy to share the matrix and character descriptions (although those are available from the supplementary information of the paper), and he sent along the .nex file. Super helpful, because then I already had it as a matrix that I could open, copy, paste, edit, etc. Thanks so much for that, Andy! Then I had to add some taxa that were published more recently, like Xenoceratops foremostensis (Ryan et al. 2012) and Pachyrhinosaurus perotorum (Fiorillo and Tykoski 2012). The Xenoceratops matrix was published directly in the paper as a table (available, but it is not as easy to keep track of the correct character number), while the P. perotorum matrix was in the supplementary material (in a more easily viewable format). The best, however, came when I looked up a paper on Anchiceratops (Mallon et al. 2011). On the downside, the paper is published in a non-open access journal, which means not everyone can access it. On the BIG upside, the supplementary material includes the actual .nex matrix file, which lets you see all the characters, states, and taxa, right in the format you want. It makes things soooo much easier and quicker when these files are available at your fingertips, without having to send lots of emails asking for them. There are several other (mainly older, to be fair) phylogenetic papers that don't post the matrix or the characters used, which makes it really difficult to figure out how they've done things.
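For anyone attempting a similar merge, the bookkeeping can be sketched in a few lines of Python. This is only a toy illustration, not the workflow used for the manuscript: the taxon names, character names, and states below are invented placeholders, `merge_matrices` is a hypothetical helper, and it assumes the matrices have already been read out of their .nex files into plain dictionaries. Cells that no source matrix scores are filled with "?", the usual symbol for missing data.

```python
# Hypothetical toy data: taxon -> {character name -> state}.
# These are invented placeholders, not real codings from any paper.
matrix_a = {
    "Taxon A": {"char_1": "0", "char_2": "1"},
    "Taxon B": {"char_1": "1", "char_2": "0"},
}
matrix_b = {
    "Taxon B": {"char_2": "0", "char_3": "1"},
    "Taxon C": {"char_2": "1", "char_3": "0"},
}


def merge_matrices(*matrices):
    """Combine taxon-by-character tables, scoring unknown cells as '?'."""
    # Collect every character name once, in order of first appearance.
    characters = []
    for matrix in matrices:
        for states in matrix.values():
            for name in states:
                if name not in characters:
                    characters.append(name)

    merged = {}
    for matrix in matrices:
        for taxon, states in matrix.items():
            # Start each new taxon with every character unscored.
            row = merged.setdefault(taxon, {c: "?" for c in characters})
            for name, state in states.items():
                # Two sources disagreeing on a cell needs a human decision.
                if row[name] not in ("?", state):
                    raise ValueError(f"conflicting states for {taxon}, {name}")
                row[name] = state
    return merged, characters


merged, characters = merge_matrices(matrix_a, matrix_b)
for taxon, row in merged.items():
    print(taxon.ljust(10), "".join(row[c] for c in characters))
```

The sketch assumes that a shared character name means the same character with the same state definitions in both sources, which is rarely true of real published matrices without checking each one by hand.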

Unrelated to my story, and covered at length elsewhere so I won't go into detail here, is a wonderful story of a recent publication that used previously published data in a huge analysis. Larson and Currie (2013) were able to study over 1000 small theropod teeth from southern Alberta, using a combination of previously published data and new data. A study of this scale would clearly have taken a lot longer if they had had to sit down and take all the measurements on 1183 small teeth themselves. Fortunately for them (and us), they were able to spend their time analysing the data already available, rather than painstakingly measuring every tooth. They determined that the number of small theropods present in this area has been greatly underestimated, and that many species are known only from teeth. Cool! For more information, you can check out this blog by Jon Tennant.

Take home message: make your data open to everyone! For the most part, I have dealt with people who are extremely open and willing to email me stuff if it isn't posted. But wouldn't it be better if you didn't have to email every time? If you could just go online and access it? It shouldn't be some top-secret information. Post it!

And finally, a small rant on matrices. I know that there are disagreements about characters, so not every published matrix is going to use exactly the same characters, but WHY do people insist on changing character states around in a way that just makes things difficult?? For example, there are several characters in Fiorillo and Tykoski (2012) that are just different enough from all the other matrices I've looked at that you can't directly copy the states. Why is it necessary to switch from comparing the postorbital horncore height to the basal skull length (which every other paper does) to comparing it to the length of the face? Or to change the numbers slightly, so that in one paper a character is considered long if it's 0.8 or more, while in another the cutoff is 0.75? Pretty sure that is unnecessary! Make it easy, people!
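To see why a small shift in cutoff matters, here is a minimal Python sketch. The specimen names and ratios are invented for illustration, and `code_state` is a hypothetical helper. Only the two cutoffs, 0.8 and 0.75, come from the example above.

```python
def code_state(ratio, cutoff):
    """Score a ratio as '1' (long) if it meets the cutoff, else '0' (short)."""
    return "1" if ratio >= cutoff else "0"


# Invented measurements, purely for illustration.
ratios = {"Specimen X": 0.72, "Specimen Y": 0.78, "Specimen Z": 0.85}

for name, ratio in ratios.items():
    state_a = code_state(ratio, 0.80)  # cutoff used in one paper
    state_b = code_state(ratio, 0.75)  # cutoff used in another
    note = "  <- scored differently" if state_a != state_b else ""
    print(f"{name}: ratio {ratio:.2f}, states {state_a} vs {state_b}{note}")
```

Any specimen whose ratio falls between the two cutoffs gets a different state depending on which paper's definition is used, so its published score cannot simply be copied into a merged matrix.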

References:
Farke, A.A., et al. 2011. A new centrosaurine from the Late Cretaceous of Alberta, Canada, and the evolution of parietal ornamentation in horned dinosaurs. Acta Palaeontologica Polonica 56: 691-702. Freely accessible here.
Fiorillo, A.R., and Tykoski, R.S. 2012. A new Maastrichtian species of the centrosaurine ceratopsid Pachyrhinosaurus from the North Slope of Alaska. Acta Palaeontologica Polonica 57: 561-573. Freely accessible here.
Larson, D.W., and Currie, P.J. 2013. Multivariate analyses of small theropod dinosaur teeth and implications of paleoecological turnover through time. PLoS ONE 8: e54329. Freely accessible here.
Mallon, J.C., et al. 2011. Variation in the skull of Anchiceratops (Dinosauria, Ceratopsidae) from the Horseshoe Canyon Formation (Upper Cretaceous) of Alberta. Journal of Vertebrate Paleontology 31: 1047-1071.
Ryan, M.J., et al. 2012. A new ceratopsid from the Foremost Formation (middle Campanian) of Alberta. Canadian Journal of Earth Sciences 49: 1251-1262.