Software Carpentry: the University Course
Commentary
In today's web-enabled world of big data, we are collecting, processing and storing more data than ever before. The answers to many fascinating and previously intractable questions lie within these data; however, extracting them is a non-trivial task. Due to their vast size and complexity, modern research datasets have largely outgrown simple data analysis packages like Microsoft Excel. Since their work typically requires significant domain-specific knowledge, researchers are increasingly left to write their own data analysis software (or code), even though most have little formal training in software development. For example, recent studies suggest that modern scientists spend 30% or more of their time developing software, but 90% are primarily self-taught (Hannay et al., 2009; Prabhu et al., 2011). They therefore lack exposure to basic software development practices such as version control, code reviews, unit testing, and task automation.
This lack of computational competency has no doubt contributed to the current irreproducibility crisis in published research (e.g. Ince et al., 2012), not to mention a number of high-profile retractions of published results (e.g. Miller, 2006). In an attempt to address this problem, the Information Technology Services (ITS) Research Services department at The University of Melbourne ran a Software Carpentry bootcamp for staff and students in November 2013 (see summary). This four-afternoon course introduced researchers to current best practices in computational research (Wilson et al., 2014). It proved to be very popular (over 30 people attended), so there are plans to hold a number of similar bootcamps in 2014.
The ultimate goal for ITS Research Services is to get all university staff and students using computational best practices in their data-intensive research. Given the global success of the Software Carpentry project, it seems likely that future bootcamps will continue to be well attended. However, so long as attendance is voluntary, it is unlikely that this goal of reaching all researchers will be achieved. The proposed new subject, Software Carpentry for Data Intensive Research, takes the bootcamp concept (and materials) and recasts them as a formal university subject. It will be compulsory for all Research Higher Degree students conducting data-intensive research, thus eliminating the current situation where bootcamp attendance is limited to highly motivated students who managed to hear about the event through colleagues or social media.
An even more compelling reason for running a formal university subject is the opportunity for assessment. As noted in a recent review of the Software Carpentry project (Wilson, 2016), two major problems have been low participation in follow-up initiatives (e.g. online office hours) and the inability of project coordinators to know or ensure that researchers are actually applying the teachings in their own research. Since assessment is central to the student learning experience and tends to define what students see as important (e.g. Price et al., 2011), it provides a means of remedying these problems.
Subject Curriculum: Software Carpentry for Data Intensive Research
Overview
Modern researchers are spending more and more time building and using software, yet most are self-taught programmers. As a result, they spend hours doing things that should take minutes, reinvent a lot of wheels, and still can't be sure that their results are reliable. This subject introduces students to a set of best practices for software development which has solid foundations in research and experience and improves both productivity and software reliability. It is not expected that students will be professional software 'engineers' by the end of the subject, but they will be using basic software 'carpentry' skills in their day-to-day research.
Expectations and prerequisites
- Students must be in the first 12 months of a data-intensive Research Higher Degree. Their supervisor is responsible for classifying their research project as data intensive and thus recommending them for the subject.
- The semester-long subject consists of approximately 20 contact hours delivered during one intensive week at the beginning of semester. The formal (i.e. graded) assessment then takes place at the very end of semester (i.e. 3-4 months later).
- Students are encouraged to bring their own laptops with a number of free and open source software packages installed (instructions and support will be provided prior to the first class). By teaching on personal laptops (as opposed to in a computer lab), there is no impediment to students immediately applying what they learn in their own research. All exercises during the intensive week are completed in pairs, so students who do not own a laptop will be paired with someone who does.
- It is expected that in the time between the intensive week and the graded assessment, students will begin to apply what they have learned in their own research, drawing on support services such as The Hacker Within student group as required.
Learning objectives
Upon completion of the subject students will be applying computational best practices in their own research. In particular, they will be:
- Using the shell to perform basic tasks such as searching their file system and automating repetitive tasks.
- Writing data analysis scripts that apply programming best practices such as the use of functions, defensive programming (e.g. using assertions) and command-line argument parsing.
- Testing their code, when appropriate, using a formal testing library.
- Using a version control system (e.g. Git, Subversion or Mercurial) to back up and capture a complete revision history of their code.
More generally, the students will also:
- Demonstrate an ability to exercise a high level of independence, initiative and self-guidance in their data analysis.
- Be able to articulate simple explanations of the key barriers to reproducible computational research and the options available for overcoming them.
- Be able to participate in code review, both as a submitter of code and a reviewer.
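As a rough illustration of the scripting practices listed in the learning objectives above (functions, assertions and command-line parsing), here is a minimal sketch that assumes Python as the teaching language. The function `mean_anomaly`, its arguments and the `--baseline` option are hypothetical names chosen for this example; they are not taken from the subject materials.

```python
import argparse


def mean_anomaly(values, baseline):
    """Return the mean difference between the values and a baseline."""
    # Defensive programming: fail early, with a clear message, on bad input.
    assert len(values) > 0, "values must not be empty"
    return sum(value - baseline for value in values) / len(values)


def parse_args(argv=None):
    """Parse command-line arguments (argv=None means read sys.argv)."""
    parser = argparse.ArgumentParser(
        description="Print the mean anomaly of some values (illustrative only)."
    )
    parser.add_argument("values", type=float, nargs="+",
                        help="one or more numbers to analyse")
    parser.add_argument("--baseline", type=float, default=0.0,
                        help="value subtracted from each number (default: 0.0)")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    print(mean_anomaly(args.values, args.baseline))


if __name__ == "__main__":
    # Demonstration call with explicit arguments. A real script would call
    # main() with no arguments so that argparse reads the command line.
    main(["2.0", "4.0", "--baseline", "1.0"])
```

Keeping the calculation in its own function, separate from the argument parsing, means the same code can be imported and checked by a testing library without going through the command line.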
Schedule and teaching and learning approaches
The daily schedule for the intensive week will be:
| Time | Activity |
| --- | --- |
| 09:30 - 10:30 | Tutors available for general assistance |
| 10:30 - 10:45 | Morning tea |
| 10:45 - 12:30 | Coursework |
| 12:30 - 13:30 | Lunch |
| 13:30 - 15:00 | Coursework |
The coursework involves a series of 10-15 minute periods of demonstration (live coding) from the lead instructor, interspersed with various challenges, exercises and quizzes that are completed in pairs. During the coursework sessions, tutors (at a ratio of about one tutor to every six students) will circulate around the room answering questions and providing feedback on student answers to the challenges. The teaching content can be broken down into four major sections:
- The Shell (Monday)
- Programming like a programmer (Tuesday and Wednesday)
- Version control (Thursday)
- Context and assessment (Friday)
The final section on context and assessment places the content of the previous four days in the context of broader topics like open access, reproducibility in computational research, and ongoing professional development. In fact, since the assessment tasks require students to actively participate in an open and reproducible workflow, these tasks are introduced along with a lesson on the tools required to complete them. Since the ongoing learning/development process for researchers who write code is somewhat independent and self-guided, the support services available to assist this process are also introduced (e.g. The Hacker Within postgraduate programming society, web-based help forums like Stack Overflow and key mailing lists).
Assessment
During the one-week intensive course, tutors will provide feedback and assistance on the challenges, but no formal grades will be allocated. On the final day of the course, the three graded assessment tasks will be introduced (see below). It should be noted that the grading of these tasks focuses on ensuring that students have made a genuine and ongoing attempt at using computational best practices in their research. It is not expected that students will be writing "perfect" code by the end of the semester.
Task 1: Code peer review
Students are required to upload a script of their choice to Codebrag so that it can be reviewed by their partner. It must be a script that they are using in their own research. In return, they must also review their partner's script.
- Submission process: Students must email the relevant Codebrag URL to the subject coordinator.
- Due date: Three months after the one-week intensive coursework.
- Assessment criteria: This is a hurdle requirement and a pass contributes a full 20% to the overall grade for the subject. The subject coordinator will simply check that the student has made a genuine attempt at both submitting a script worthy of review and providing considered feedback for their partner. The coordinator will also provide some commentary on the script, which will assist the student with the remaining tasks.
Task 2: Code repository
Students are required to maintain a version-controlled repository of their research code (e.g. using Git, Subversion or Mercurial) that is linked to an external hosting service (e.g. GitHub, Bitbucket).
- Submission process: Students must email the (publicly available) URL for the homepage of their externally hosted code repository to the subject coordinator.
- Due date: Four months after the one-week intensive coursework.
- Assessment criteria: This is a hurdle requirement and a pass contributes a full 20% to the overall grade for the subject. The subject coordinator will view the publicly available statistics on the site (e.g. commit activity) to check that the student has been using version control consistently. They will also provide suggestions on how the student's usage could be improved.
Task 3: Formal code review
Students are required to upload a data analysis script of their choice, including an accompanying testing library and example data, to RunMyCode.org. They must be using the script in their own research.
- Submission process: Students must email the (publicly available) URL for their RunMyCode.org homepage to the subject coordinator.
- Due date: Four months after the one-week intensive coursework.
- Assessment criteria: This exercise is graded and is worth 60% of the final grade. Students must achieve a passing grade on this task in order to pass the subject. It will be marked according to the degree to which the script/s apply the best practices introduced in the subject and also the appropriateness and coverage of the testing library.
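To give a sense of what tests accompanying a script might look like, here is a minimal sketch. It assumes Python and its standard `unittest` module as the formal testing library; the subject description does not prescribe one. The function `celsius_to_kelvin` and the test names are hypothetical and stand in for a student's own analysis code.

```python
import unittest


def celsius_to_kelvin(temp_c):
    """Convert a temperature from degrees Celsius to kelvin."""
    # Defensive check: no physical temperature lies below absolute zero.
    assert temp_c >= -273.15, "temperature is below absolute zero"
    return temp_c + 273.15


class TestCelsiusToKelvin(unittest.TestCase):
    """Checks a typical value, a boundary value and an invalid value."""

    def test_freezing_point(self):
        self.assertAlmostEqual(celsius_to_kelvin(0.0), 273.15)

    def test_absolute_zero(self):
        self.assertAlmostEqual(celsius_to_kelvin(-273.15), 0.0)

    def test_rejects_impossible_temperature(self):
        with self.assertRaises(AssertionError):
            celsius_to_kelvin(-300.0)
```

If this were saved as a file (say, the hypothetical `test_conversion.py`), the tests could be run with `python -m unittest test_conversion.py`. Coverage in the sense of the assessment criteria would then come from adding similar cases for each function in the script.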
To assist students in understanding the requirements of these assessment tasks, examples of good code review, repository usage and RunMyCode.org submissions will be provided. Students can apply for an extension (applications are due two months after the intensive week) if the early stages of their research project will not involve much programming (e.g. they might be doing fieldwork instead). It is also possible to make alternative assessment arrangements if their code and/or data cannot be open access.
This article was originally posted at Dr Climate, where I write on topics relating to research best practice in the weather and climate sciences.