New group leaders: what to know about computational research

As a new group leader, how can you make the most of your future group’s computational work, so that it becomes an investment rather than a liability? This guide currently focuses on software.

If you are actively writing research software yourself, you may prefer to go directly to The Zen of Scientific computing, which covers the more practical side.

About you

  • Are you planning a research group which partly uses computing?

  • Is computing not your main thing (not what you want to focus on/not what you studied)?

  • Do you want your new hires to use best practices, even if you can’t mentor them yourself?

  • Do you want your research to be reproducible and open?

Why plan in advance?

  • Your group’s work is valuable.

  • Over time, your work’s value can grow…

  • … or it can be lost every 5 years as your group changes.

What usually goes wrong?

At a group level, these often happen to semi-computational groups:

  • Every researcher starts a project over from scratch

  • Researchers leave, previous work becomes unusable (your group completely changes every ~5 years!)

  • If you don’t work at it, your group’s software and data get more and more disorganized until they become unusable. This limits what you can do in the future.

At an individual level:

  • Time wasted with bugs

  • Time wasted when one can’t repeat analysis for reviews

  • Desire to hide or not share code because it’s “messy”, which keeps the above cycle going and means less Open Science.

Step 1: Define how you work together

This is kind of meta, but: do you want to be a group of people connected by a supervisor, or a team that works together?

  • Is co-working limited to coffee chats and presentations at group meetings?

    • Do these presentations comment only on the final results?

    • Or do you discuss and praise good practices for getting those results?

    • Are some meetings spent on skill development?

  • Or on the other end, are you co-developing the same project?

  • Are you a team, or a bunch of independent contractors?

Suggestions

  • Don’t be only results-oriented in your group activities. Make sure you value the process with both your time and mental energy.

Planning vs writing a plan

Plans are useless, planning is indispensable - Dwight Eisenhower

  • Various grants ask you to make a data management plan, and I’ve seen proposals for software management plans in the future.

  • If you are making a plan just for a grant, I think that’s the wrong idea. You want everything you do to go beyond single projects.

Suggestions

  • Make a “practical plan” for important aspects, in your group’s documentation area: “here is where you find our data”, “here is where we share code”, etc. Keep it lightweight but useful.

  • Designate it as part of onboarding.

  • Update it as needed.

Group documentation, “group wiki”

A single place for reference on your group’s practices helps with onboarding and keeps things consistent and usable over time.

  • A group wiki is a good place to start.

  • Minimum documentation about how you want things done - or how they are actually being done.

  • But not so strict that you can’t make progress in the future.

  • Index of important software, data, and other resources

    • But the description of the software/data should be kept with them, not in the group docs.

  • Can you make everything open? For example, your group website could contain this reference information, so that it also serves as an advertisement.

Suggestions

  • If in doubt, make a group wiki

  • Use it to keep your group’s internal operating information organized - however makes sense for you.

  • When you hear of someone doing something new, ask: “did you update this in our wiki?”

Skill development

Many people learn basic programming. Far fewer people learn best practices beyond programming.

But there is also informal learning, mentoring:

  • You learn more from co-working than courses.

  • You need good, active mentoring (not weekly status checks, but real co-working)

  • Desks next to each other where you can see each other’s screens

  • Pair programming

  • But, as an academic supervisor, you probably don’t have time to mentor. How do you get mentoring?

    • Set up group to work together

    • Time and motivation for self-learning

    • Encourage an internal specialist who can mentor for you (“Research software engineer”).

Suggestions

  • Everyone in your group attends a CodeRefinery workshop

  • At least one group member is developed into a computational specialist and supports others.

Why talk so much about teaching and mentoring, rather than practices?

  • Unlike many topics, we can’t rely on academic courses to prepare your group members.

  • In all my experience, good software and data practices come from sharing good internal practices.

  • I know supervisors can’t do everything, but hopefully they can promote what they need internally.

Software in research

  • Software allows you to do far more than you could alone, and it can transform research.

  • … but can also be one of the most complex tasks you do.

  • What kind do you use?

    • You can and will use software developed by others

    • Many groups develop their own internally.

    • If you make something good, you may want to release it so that others can use it - and cite you.

Software: tools

We give a lightning overview. Come to CodeRefinery for the full story.

Version control

  • Tracks changes

    • solves: Everything just broke but I don’t know what I changed.

    • solves: I’m getting different results than when we submitted the paper.

  • Allows collaboration

    • solves: “can you send me the latest version of the code”

    • solves: “we’re using two different versions, too bad”

  • Creates a single source of truth for the code

    • Not different versions scattered around on everyone’s computers

  • Most common these days: git
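To make the “different results than when we submitted the paper” problem concrete, here is a minimal sketch of storing the code version next to each result. It assumes the analysis lives in a git repository and that git is installed; the function names and the example parameters are hypothetical, not part of any existing tool.

```python
import json
import subprocess
from datetime import datetime, timezone


def current_commit(repo_dir="."):
    """Return the current git commit hash, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not inside a git repository
        return None
    return result.stdout.strip()


def provenance_record(parameters, repo_dir="."):
    """Build a small dictionary describing how a result was produced."""
    return {
        "commit": current_commit(repo_dir),
        "created": datetime.now(timezone.utc).isoformat(),
        "parameters": parameters,
    }


if __name__ == "__main__":
    # Hypothetical analysis parameters, for illustration only
    record = provenance_record({"threshold": 0.5, "seed": 42})
    print(json.dumps(record, indent=2))
```

Saving a record like this alongside every figure or table means you can later check out exactly the commit that produced it.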

Suggestions

  • Everyone must learn the basics of a version control system (CodeRefinery week 1 does this).

  • Find a source of advanced support (your specialist group member or some other university service)

GitHub, GitLab, etc.

  • Version control platforms

  • Online hosting platforms for git (others available)

  • Very useful to keep stuff organized

  • Makes a lot of stuff below possible.

  • Individual projects and organizations with members - for your group.

Suggestions

  • Make one public GitHub/GitLab organization for your group.

  • Make one internal GitLab organization hosted at your university.

  • Strongly discourage personal repositories for common code.

Issue tracking

  • Version control platforms provide issue trackers

  • Important bugs, improvements, etc. can be closely tracked.

Suggestions

  • Use issues for your most important common projects

Change proposals (aka “pull requests”)

  • Feature of version control platforms like GitHub or GitLab

  • People should work together, but maybe not everyone should be able to modify everything, right?

  • Contributors (your group or outside) can contribute without risk of messing things up.

  • For this to work you need to actually review, improve, and accept them

Suggestions

  • Decide which projects are important enough for a more formal change process.

  • Use pull requests for these projects which should not be broken.

Testing

  • How do you know your code is correct? Try running it, right?

  • But what happens if you change it later?

  • Software testing means writing tests that automatically verify functionality.

  • You write tests, and then anytime you make a change later, the tests verify it still works.
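As a rough illustration of the idea above, here is a small sketch of a function together with tests for it. The function `normalize` is a made-up example, not code from any real project. The tests use plain `assert` statements; a test runner such as pytest would collect the `test_` functions automatically if they were placed in a file named like `test_normalize.py`.

```python
def normalize(values):
    """Scale a list of numbers so that they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [v / total for v in values]


def test_normalize_sums_to_one():
    result = normalize([1, 2, 3, 4])
    assert abs(sum(result) - 1.0) < 1e-12


def test_normalize_keeps_proportions():
    assert normalize([1, 1, 2]) == [0.25, 0.25, 0.5]


def test_normalize_rejects_zero_total():
    # An all-zero input cannot be normalized, so we expect an error
    try:
        normalize([0, 0])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")


if __name__ == "__main__":
    test_normalize_sums_to_one()
    test_normalize_keeps_proportions()
    test_normalize_rejects_zero_total()
    print("all tests passed")
```

If someone later rewrites `normalize` to make it faster, rerunning these tests shows right away whether the behavior changed.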

Suggestions

  • Each moderately important project has some test data and can automatically run something

  • More important projects: add in as many tests as practical

Documentation

  • Documentation makes reuse possible.

  • The minimum is a README file in each repository.

  • Big projects can have dedicated documentation.

Suggestions

  • Every project gets a README file. As supervisor, read these README files and confirm what they contain.

  • Dedicated, in-repository documentation for large projects (for example Sphinx)
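One way a supervisor or group specialist might check the README suggestion is sketched below. It assumes all projects are checked out as subdirectories of one folder; the function name and the folder layout are hypothetical.

```python
from pathlib import Path


def projects_missing_readme(root):
    """Return names of subdirectories of root that contain no README file."""
    missing = []
    for project in sorted(Path(root).iterdir()):
        if not project.is_dir():
            continue
        # Accept README, README.md, readme.rst, and similar names
        has_readme = any(
            entry.is_file() and entry.name.lower().startswith("readme")
            for entry in project.iterdir()
        )
        if not has_readme:
            missing.append(project.name)
    return missing
```

Calling `projects_missing_readme` on the folder that holds your group’s repositories gives a list of projects to follow up on. It only checks that a README exists, so reading the files is still up to you.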

Licensing

  • Reuse gets you citations

  • Reuse requires a license. Without one, reuse will be minimal.

  • You will often need to check your local policies on making something open source.

Suggestions

  • Decide (with stakeholders) on a license as early as possible - use only open-source licenses unless there is a special reason. You don’t have to actually open the code right away.

  • Try to focus on using similarly licensed things.

Publication and release

  • If you invest in your software, you probably want to share it

    • “If we release a paper on some method, and we don’t include easy to use software to run it, our impact will be tiny compared to what it could be.” - CS Professor

  • Good starting point: make the repository open on GitHub/GitLab

  • Can also be archived on Zenodo (or other places) to make it citeable.

  • Do all work expecting that it might be made open someday. Separate public and secret information into different repositories.

Suggestions

  • Public on GitHub/GitLab as soon as possible

  • Next level is releases on package indexes

  • You can make software papers later (when relevant)

Working together on code

Group discussion: What can go wrong when people work together?

Other computational topics

… not exactly software, but still relevant to this discussion.

Data storage

  • Discourage single-user storage spaces (laptop, home directories)

  • Use common shared spaces instead

  • Network drives

    • Usually used via a remote system

    • Some can be locally mounted on your own laptop for ease of use

    • Not the best for people who want to work on their own computer, but works. Data can be synced.

Aalto Scientific Computing strategy:

  • All mass storage provided in shared group directories.

  • Request as many as you want - each one has its own access control.

  • Access and data can be passed on as the group evolves.

Suggestions

  • Have a plan. Everyone should know where central storage is, and at least one copy of the data must be there.

  • Request central network drive storage if possible.

  • Ask your group members: “Where is your data? Is the data documented?”

Data storage locations at Aalto University

  • Own devices

    • Danger, no backups! Personal devices are considered insecure.

  • Aalto home directories

  • Aalto network drives

    • Large, secure, backed-up. Request from your department or from Aalto IT Services.

    • 10-100 GB range is easy.

  • Triton HPC Cluster

    • Very large, fast, direct cluster access, but not backed up.

    • 10s-100s of TB.

  • CSC data storage resources

  • Public data repositories

    • For open data

Computing

There is a range of computing options: (easy to use, small) ⋄ (harder to use, large)

  • Own devices

  • Remote servers

  • Remote computer clusters

    • Aalto

    • CSC

Support

It’s dangerous to go alone. Take us!

  • There were many things above.

  • Hopefully you got some ideas, but I don’t think that anyone can do this alone (I learned everything by working with others)

  • Rely on support and mentoring.

Some possibilities, if you are at Aalto:

Suggestions

  • Ensure your group members come to garage if they have questions you can’t answer.

  • Come to an RSE consultation and chat at least once when getting your group started.

Summary: dos and don’ts

You are not allowed to

  • Not use version control

  • Not push to an online repository

  • Have critical data or material only on your own computer.

  • Make something so chaotic that you can’t organize it later

  • Go alone

… but you don’t have to

  • Start every piece of code perfectly

  • Do everything perfectly

  • … as long as you can improve it later, if needed.

  • Know everything yourself.

Checklist

  • Set up group reference information (for example, wiki).

  • Work with your supporters to create a basic outline of a plan.

  • Set up a GitHub organization for group code

  • Set up a GitLab organization for internal work (university GitLab)

  • Create your internal data/software management plan.

  • (Think what code/data will be most reused, put it in one place, and make it reusable.)

  • Send group members to CodeRefinery as they join.

See also

  • The Zen of Scientific computing - different levels of different aspects you can slowly improve. Emphasizes that you don’t have to be perfect when you first start.