Showing posts with label coding. Show all posts

Monday, 20 August 2018

Matlab vs open source: Costs and benefits to scientists and society

An interesting twitter thread came along yesterday, started by this query from Jan Wessel (@wessel_lab):

Quick thread of (honest) questions for the numerous people on here that subscribe to the position that sharing code in MATLAB ($) is bad open-science practice compared to open source languages (e.g., Python). What should I do as a PI that runs a lab whose entire coding structure is based (publicly shared) MATLAB code? Some say I should learn an open-source language and change my lab’s procedures over to it. But how would that work in practice? 

When I resort to blogging, it’s often because someone has raised a question that has captured my interest because it does not have a simple answer. I have made a Twitter moment to store the rest of Jan’s thread and some of the responses to it, as they raise important points which have broad application.

In part, this is an argument about costs and benefits to the individual scientist and the community. Sometimes these can be aligned, but in this case, there is some conflict, because those who can’t afford Matlab would not be able to run Jan’s code. If he were to move to Python, then anyone would be able to do so.

His argument is that he has invested a lot of time in learning Matlab, has a good understanding of how Matlab code works, and feels competent to advise his trainees in it. Furthermore, he works in the field of EEG, where there are whole packages developed to do the complex analysis involved, and Matlab is the default in this field. So moving to another programming language would not only be a big time sink, but would also make him out of step with the rest of the field.

There was a fair bit of division of opinion in the replies. On the one hand, there were those who thought this was a non-issue. It was far more important to share code than to worry about whether it was written in a proprietary language. And indeed, if you are well-enough supported to be doing EEG research, then it’s likely your lab can afford the licensing costs.

I agree with the first premise: just having the code available can be helpful in understanding how an analysis was done, even if you can’t run it. And certainly, most of those in EEG research are using Matlab. However, I’m also aware that for those in resource-limited countries, EEG is a relatively cheap technology for doing cognitive neuroscience, so I guess there will be those who would be able to get EEG equipment, but for whom the Matlab licensing costs are prohibitive.

But the replies emphasised another point: the landscape is continually changing. People have been encouraging me to learn Python, and I’m resisting only because I’m starting to feel too old to learn yet another programming language. But over the years, I’ve had to learn Basic, Matlab and R, as well as some arcane stuff for generating auditory stimuli whose name I can’t even remember. But I’ve looked at Jan’s photo on the web, and he looks pretty young, so he doesn’t have my excuse. So on that basis, I’d agree with those advising to consider making a switch. Not just to be a good open scientist, but in his own interests, which involves keeping up to date. As some on the thread noted, many undergrads are now getting training in Python or R, and sooner or later open source will become the default.

In the replies there were some helpful suggestions from people who were encouraging Jan to move to open source but in the least painful way possible. And there was reassurance that there are huge savings in learning a new language: it’s really not like going back to square one. That’s my experience: in fact, my knowledge of Basic was surprisingly useful when learning Matlab.

So the bottom line seems to be, don’t beat yourself up about it. Posting Matlab code is far better than not posting any code. But be aware that things are changing, and sooner rather than later, you’ll need to adapt. The time costs of learning a new language may prove trivial in the long term, against the costs of being out of date. But I can state with total confidence that learning Python will not be the end of it: give it a few years and something else will come along.

When I was first embarking on an academic career, I remember looking at the people who were teaching me, who, at the age of around 40, looked very old indeed. And I thought it must be nice for them, because they have worked hard, learned stuff, and now they know it all and can just do research and teach. When I got to 40, I had the awful realisation that the field was changing so fast, that unless I kept learning new stuff, I would get left behind. And it hasn't stopped over the past 25 years!

Sunday, 6 December 2015

Open code: not just data and publications


I had a nice exchange on Twitter this week.

Nick Fay had found a tweet I had posted over a year ago, asking for advice on an arcane aspect of statistical analysis:


I'd had some replies, but they hadn’t really helped. In the end, I’d worked out that there was an error in the stats bible written by Rand Wilcox, which was leading me astray. Once I’d overcome that, I managed to get the analysis to work.

It was clear Nick was now having the same problem and going round in exactly the same circles I had experienced.


My initial thought was that I could probably dig out the analysis and reconstruct what I’d done, but my heart sank at the prospect. However, then I had a cheerful thought. I had deposited the analysis scripts for my project on the Open Science Framework, here. I checked, and the script was pretty well annotated, and as a bonus you got a script showing you how to make a nice swarm plot.

This experience comes hard on the heels of another interaction, this time around a paper I’m writing with Paul Thompson on p-curve analysis (latest preprint is here). Here there’s no raw data, just simulations, and it’s been refreshing to interact with reviewers who not only look at the code you have deposited, but also make their own code available. There’ve been disagreements with the reviewers about aspects of our paper, and it helped enormously that we could examine one another’s code. The nice thing is that if code is available, you get to really understand what someone has done and also learn a great deal about coding.

These two examples illustrate the importance of making code open. It probably didn’t matter much when everyone was doing very simple and straightforward analyses. A t-test or correlation can easily be re-run from any package given a well-annotated dataset. But the trend in science is for analyses to get more and more complicated. I struggle to understand the methods of many current papers in neuroscience and genetics – fields where replication is sorely needed but impossible to achieve if everyone does things differently and describes them only incompletely. Even in less data-intensive areas such as psycholinguistics, there has been a culture change away from reliance on ANOVAs to much more fancy multilevel modelling approaches.

My experience leads me to recommend sharing of analysis code as well as data: it will help establish reproducibility of your findings, provide a training tool for others, and ensure your work is in a safe haven if you need to revisit it.

Finally, this is a further endorsement of Twitter as an academic tool. Without Twitter I wouldn't have discovered Open Science Framework, or PeerJ, both of which are great for those who want to embrace open science. And my interchange with Nick was not the end of the story. Others chipped in with helpful comments, as you can see below:


P.S. And here is another story of victory for Open Data, just yesterday, from the excellent Ed Yong.