Software and Data Carpentry Workshop in Stockholm
After I got the green light from my current employer, it did not take long to add Oxana Sachenkova to the team. Together we started planning the logistics, the lessons, and official PhD-level university credits, and raising some money to support the event.
Since we had recently received our instructor training and had some experience with a past scientific Python workshop and a previous edition of Software Carpentry, we decided to go for a two-day workshop:
- One day of Python fundamentals with a best practices twist.
- The second day with a more biological data analysis focus.
That is how the idea to do Carpentry with Software and Data came to fruition by the end of November 2015. After innumerable emails, talks, and commits, the event was on the forge. The national Swedish bioinformatics communities BILS and WABI also supported us, and we would like to thank both organisations for their financial support.
Day 1: Software Carpentry
For day one, Olav ran some interactive Python console sessions showing basic Python data structures and control mechanisms. Following up, Radovan prepared some excellent test-driven development (TDD) lessons inspired by three sources:
- The Python Koans.
- Some ideas borrowed from BioPython's comprehensive test suite.
- A late addition from an upcoming SWC TDD lesson, released just a few days before our workshop.
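For readers who have not seen TDD before, here is a minimal sketch of the test-first idea. It is not taken from Radovan's lessons: the function name `gc_content` and the test cases are hypothetical, chosen only for illustration. The tests are written first and describe the expected behaviour, and the function is then written to make them pass.

```python
# Step 1: write the tests first. They fail until gc_content is implemented.
# (gc_content and these cases are hypothetical, not from the workshop material.)
def test_all_gc():
    assert gc_content("GGCC") == 1.0


def test_mixed_case():
    assert gc_content("atGC") == 0.5


def test_empty_sequence_is_rejected():
    try:
        gc_content("")
    except ValueError:
        return
    raise AssertionError("expected ValueError for an empty sequence")


# Step 2: write just enough code to make the tests pass.
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    sequence = sequence.upper()
    gc_count = sum(1 for base in sequence if base in "GC")
    return gc_count / len(sequence)
```

Saved as a file such as `test_gc.py` (a hypothetical name), these functions can be collected by a test runner like pytest, or simply called by hand.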
Teaching basic Git, GitHub, TravisCI and Coveralls in such a short time was challenging for the instructors, but it was very well received by the students.
While the infamous installation problem is still an issue, students managed to follow through the lessons; their typical Python installation issues were mostly solved by a proper installation of Miniconda.
The SWC installation tests mostly distracted students, since packages not being used in the workshop were flagged as uninstalled/failed (e.g., EasyMercurial). In general, I perceived that students were getting overwhelmed by too much information from the SWC default guides and stopped following and reading the instructions early. We need more TL;DRs in Software and Data Carpentry, perhaps starting with the workshop template.
Day 2: BioData, Jupyter Notebooks, Pandas and Machine Learning
The morning was dedicated to introducing students to the Pandas dataframe with Ethan White's python-ecology dataset. Due to time constraints, the merging and concatenation of dataframes was not covered but was pointed out in the lesson. After that, the students had enough knowledge to follow up on Oxana's Gene Expression dataset, for which there are exercises for those students willing to earn Swedish university credits. After some glitches with Python 2 vs Python 3 Jupyter notebooks, students got to know how to analyze data from the FANTOM5 consortium.
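Since merging and concatenation did not make it into the session, here is a minimal sketch of both operations. The tables below are hypothetical toy data, not the actual workshop dataset, and the column names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy tables; not the actual workshop data.
surveys = pd.DataFrame({
    "record_id": [1, 2, 3],
    "species_id": ["NL", "DM", "NL"],
    "weight": [40.0, 35.0, 42.0],
})
species = pd.DataFrame({
    "species_id": ["NL", "DM"],
    "genus": ["Neotoma", "Dipodomys"],
})

# Merge: add the genus column by matching rows on the shared key.
merged = pd.merge(surveys, species, on="species_id", how="left")

# Concatenate: stack two tables with the same columns on top of each other.
more_surveys = pd.DataFrame({
    "record_id": [4],
    "species_id": ["DM"],
    "weight": [38.0],
})
all_surveys = pd.concat([surveys, more_surveys], ignore_index=True)

print(merged)
print(all_surveys)
```

A left merge keeps every row of `surveys` even if a species were missing from the lookup table, and `ignore_index=True` gives the stacked table a fresh 0..n-1 index.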
After we got some expression heatmaps and good insight from Oxana, Ahmed KachKach, currently interning at Spotify AB's machine learning division, delighted the audience with a detailed analysis of a toy dataset on breast cancer, using an extremely well-documented introductory machine learning notebook. To explain PCA graphically to students, Ahmed used an excellent web visualization to illustrate how variable decomposition/projection works in PCA.
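To make the decomposition/projection idea concrete, here is a minimal NumPy sketch of PCA on synthetic data. It is not Ahmed's notebook or the breast cancer dataset: the two correlated features, the sample size and the random seed are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples with two strongly correlated features.
x = rng.normal(size=200)
data = np.column_stack([x, 0.5 * x + rng.normal(scale=0.1, size=200)])

# PCA works on centered data (each feature has zero mean).
centered = data - data.mean(axis=0)

# Decomposition: the rows of vt are the principal directions,
# ordered by how much variance they capture.
u, s, vt = np.linalg.svd(centered, full_matrices=False)
explained_variance_ratio = s**2 / np.sum(s**2)

# Projection: coordinates of every sample along the first component.
projected = centered @ vt[0]

print(explained_variance_ratio)
```

Because the second feature is almost a scaled copy of the first, nearly all of the variance ends up in the first component, which is why a single projected coordinate summarizes this toy dataset well.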
Right after that machine learning introduction, I showed how one can run reproducible (and interactive!) notebooks via the mybinder.org service by exploring a small scikit-allel dataset. More visualization techniques were shown via my current explorations of HivePlots as an alternative way of visualizing structural genomic variations in cancer samples. I also had a talk prepared about structural variations processed with bcbio, but in the interest of time, I saved it for another event :)
Last but not least, Mikael Huss went through a fantastic notebook showing some gene expression prediction techniques and clever feature engineering from his current efforts at WABI.
Thoughts and Comments
A surprising early realization from this workshop was how high the demand for these courses can be: only a few minutes after announcing the event, we had around 40 individuals interested and signed up. The retention changed over time due to cancellations, but we managed to run the workshop with 35+ participants. Regarding attendance, thanks to link shorteners in our announcement emails and on Twitter, we could track the “funnel” of students from those who showed interest all the way down to those who were committed to actually showing up and completing the courses.
In our post-assessment polls we got an average rating of 8 out of 10 on “General satisfaction with the workshop”. Here are some selected comments:
I learned a lot of things these two days and the workshop really made me more motivated to use pandas next day in the office. :D
Overall it was a very nicely arranged and well prepared workshop. My only suggestion would be to simplify a bit the exercises for day 2 (perhaps by introducing some intermediary steps between two problems). Thanks for arranging such a nice workshop.
great event, I'll recommend it further, should be on regular (annual?) basis.
And also some things to look out for in the future:
I enjoyed particularly the first day, particularly the list of challenges/exercises that looked quite overwhelming at first but turned out to be manageable. Also very much appreciated: collection of ideas and questions on etherpad, post-its to request help. It would have been even better with little stricter time management.
My Python knowledge was not high enough to follow. exercise.py was nice to learn Python, but didn't help to learn testing process (I was stuck with the exercises). Other exercise had too difficult instructions. The python introduction was at a very basic level but the tasks were at intermediate or above level. This needs adjustment.
And again, not putting too much material in one day, no matter how exciting it sounds at first while preparing the lessons:
The first day was great (11/10). Intro to Python was too basic for me, but I understand it was necessary for some participants of the workshop. Intro to Git, test-driven development etc. was very well performed and I learned a lot. Second day was pretty good (7/10), but too hurried. I feel that there were too many things squeezed into the schedule. The visualization lab had a good premise, but also suffered from too little time.