Responsible Computational Research

Doing research comes with responsibilities. How do we make sure that the principles of research integrity are translated into scientific computing practices? This page presents a short, structured view of ethical and legal considerations in computational research in Europe, with an additional overview of the responsible use of generative AI for writing code. This page is a draft: help us improve it!

The Normative Cascade

The normative cascade (Floridi, 2018) summarises how general ethical principles become regulations and laws, which in turn become research policies that guide our work as researchers. Ethics is not law, and (ideally) ethics drives changes in the law. Finally, society and the scientific community redefine what (research) ethics should be.

Figure: The normative cascade. Redrawn from Floridi, 2018, and adapted to the context of ethics, laws, and academic research.

Level 1: Foundational ethical principles in research

In the context of research, the ALLEA - European Code of Conduct for Research Integrity defines the core principles of ethics in research:

  • Reliability: Ensuring the quality of research, reflected in the design, methodology, analysis, and use of resources.

  • Honesty: Developing, undertaking, reviewing, reporting, and communicating research in a transparent, fair, full, and unbiased way.

  • Respect: Towards colleagues, research participants, research subjects, society, ecosystems, cultural heritage, and the environment.

  • Accountability: Accountability for the research from idea to publication, for its management and organisation, for training, supervision, and mentoring, and for its wider societal impacts.

The ALLEA code of conduct is a short booklet that everyone should read before doing any research work. After the principles, it covers good research practices and violations of research integrity.

Exercise: How do the ALLEA principles, good practices, and violations map to computational research?

In small groups, browse the ALLEA code of conduct, pick a few items and write the corresponding good/bad practice in computational research.

Level 2: EU legislation and regulations that can affect research

Multiple regulations govern how we should handle research data and the tools we use in research. This section does not aim to cover all legislation that applies to research, but it is important to note that legislation that applies to an organisation can also apply to its research (for example, an organisation is responsible for respecting the Cybersecurity Directive, and that might affect how researchers use the organisation’s tools).

A few examples worth mentioning:

  • The General Data Protection Regulation (GDPR). See the Aalto guidelines for handling personal data in research.

  • The Artificial Intelligence Act. While research on prohibited or high-risk AI systems is legal, the use of certain systems for research can pose risks to the researchers or research subjects. Aalto guidelines on the AI Act. Article 4 is fundamental for all of us: gaining a sufficient level of AI literacy.

  • European legislation on open data: it is worth mentioning that the Commission adopted a list of high-value datasets: geospatial, earth observation and environment, meteorological, statistics, companies and mobility.

  • Export control, dual-use regulations, sanctions compliance: various laws govern how technological advances can be exported outside Europe, considering the risks associated with the technology and the current geopolitical landscape.

Level 3: National and University-level guidelines (in Finland and Aalto University)

National-level guidelines follow from the general ethical principles and legislation. While breaking these guidelines might not always have criminal implications, it can have a clear impact on the reputation and career of the researchers who do not follow them.

Level 4: Researchers

Finally, it is the researchers who need to understand how all the guidelines and laws apply to their work, and how new ways of working can provide better outcomes or might call for new definitions of the core principles.

  • Researchers agree on good enough (rather than “best”) practices (Wilson 2017) and start implementing them

  • Researchers learn and adapt (e.g. how generative AI is changing the way we work)

  • Researchers drive future change

Responsible research in practice

So what is the real-world application of ethical and legal standards for computational researchers?

Best/good enough practices in computational research

There are many different types of “best” practices to adopt. We all strive to be on the right side of this figure, but being in the middle is good enough. What you want to avoid is being on the red side. Figure from “Getting started with reproducibility in research!”.

Cybersecurity: Classification of Information

Cybersecurity is the practice of protecting data, systems, networks, and software from unauthorized access, attacks, damage, or disruption. It involves implementing a broad range of strategies and technologies to secure the digital environment, from individual software components to large interconnected infrastructures.

Effective cybersecurity ensures that sensitive data remains protected, systems function reliably, and unauthorized parties are blocked. Its three core goals are:

  • C – Confidentiality: sensitive information is only accessible to those authorized to see it

  • I – Integrity: ensures the accuracy and consistency of information

  • A – Availability: ensures that information, systems, and models are accessible when needed
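
As a small illustration of the integrity goal, the sketch below records a checksum for a data file and later checks whether the file still matches it. It is a minimal example rather than a recommended procedure: the function names `sha256_of_file` and `is_unchanged` are hypothetical and were chosen only for this page.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_digest):
    """Return True if the file still matches the digest recorded earlier."""
    return sha256_of_file(path) == recorded_digest
```

A possible workflow is to store the digest next to the dataset when it is first archived, and to call `is_unchanged` before each analysis. A checksum only detects changes; it does not prevent them or say who made them.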

Data (and code) can be classified at different levels of information sensitivity.

  • Public:

    • Publications

    • Open data (CC*)

    • Open source software

    • Other content made public (social media posts, lectures on YouTube)

  • Internal:

    • Drafts

    • Team notes, meetings data

    • Pages that require authentication

    • Project code, software, scripts

  • Confidential:

    • Research data with personal data

    • Trade secrets

    • Research proposals

    • Any other file that only a specific group of individuals should have access to

  • Secret:

    • Data that is required to be secret (e.g. secondary health data, high risk research projects)

    • Sensitive research data that requires strong protection (e.g. as a result of a DPIA).

Which storage/computing system is good for what?

Rules of thumb for secure storage and computing with the systems provided by your university/organisation (always check your university’s guidelines):

  • If it does not require authentication, then it is only good for public data.

  • If it has single-factor authentication (e.g. login and password), then it might be good for internal information.

  • If it has multi-factor authentication (MFA), then it might be good for confidential data (examples at Aalto/CSC: Triton Cluster, Teamwork, Allas)

  • If it has MFA and it is not accessible from the internet, then it might be good for secret data (examples at Aalto/CSC: SECDATA, CSC Sensitive Data Services)
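
The rules of thumb above can be written down as a small decision function. This is only a sketch under the assumptions of this page: the names `Classification`, `max_classification`, and `might_be_suitable` are hypothetical, and the result is a first orientation that does not replace your organisation’s guidelines.

```python
from enum import IntEnum


class Classification(IntEnum):
    """Classification levels from this page, ordered by sensitivity."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    SECRET = 3


def max_classification(requires_auth, has_mfa, internet_accessible):
    """Highest level a system might be good for, following the rules of thumb."""
    if not requires_auth:
        return Classification.PUBLIC
    if not has_mfa:
        # Single-factor authentication, e.g. login and password.
        return Classification.INTERNAL
    if internet_accessible:
        return Classification.CONFIDENTIAL
    # MFA and not reachable from the internet.
    return Classification.SECRET


def might_be_suitable(data_level, requires_auth, has_mfa, internet_accessible):
    """True if the rules of thumb do not rule the system out for this data."""
    limit = max_classification(requires_auth, has_mfa, internet_accessible)
    return data_level <= limit
```

For example, `might_be_suitable(Classification.CONFIDENTIAL, True, False, True)` evaluates to `False`, because a system with only login and password is at most a candidate for internal information.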

Unsure? Get in touch with your local admins/experts/support team.

Generative AI and Emerging Challenges

Generative AI tools (usually based on large language models, such as ChatGPT, Claude, Gemini) are powerful tools that can be used for writing software, giving everyone the possibility to convert ideas (prompts) into actual code. When automation comes into play, more risks can be introduced into your workflow unless what is generated is carefully reviewed.

New risk dimensions for responsible computational research (and data security):

  • Bias and fairness in generated content and training data: most models are trained on “anything we can scrape”. Your niche case might not be present in the training dataset: will you get a wrong answer? Old (wrong) methods might still be very frequent in the training data, but that does not mean that they are right.

  • Attribution and originality: how can I find the right references for what I am generating?

  • Intellectual property and licensing issues: how can I know that the code I generate with ChatGPT is not copied verbatim from a library that requires citation or the adoption of a certain software licence?

  • Data leakage and confidentiality risks: there is no “cloud”, it is just someone else’s computer. Be careful with the data you input into that computer (“we promise you your data is safe, trust us” -> a data breach happens) (“30% of popular AI chatbots share data with third parties”)

  • Risks of automation and lack of explainability: the more you automate, the more you need to test that the automation works. If your results are just the output of an AI system, how do you ensure reproducibility and explainability?

  • Regulatory uncertainty and ethical concerns: how was the training data obtained? Is it legal that they used all the possible data from the whole internet? What is the impact on the environment? What is the impact on the exploited workforce that annotates and improves these AI models/systems? Are open-source models actually open source? (spoiler: not open at all!)

  • Dependence, anxiety, burnout: these tools are intentionally “humanised” to act like assistants and empower you, up to the point that you cannot do your work anymore without them (or maybe it is just a tool? We need more studies: see ref)

  • Cybersecurity risks: Generated code with “hallucinations” can introduce cybersecurity risks (reference)

  • There is surely more… let’s expand this list.
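
One practical response to the automation risk listed above is to treat generated code as untrusted until it passes checks that you worked out independently. The sketch below assumes a hypothetical function `moving_average` standing in for code produced by a generative AI tool, and a hypothetical helper `check_against_known_cases` that compares it with small cases computed by hand.

```python
def moving_average(values, window):
    """Stand-in for a function produced by a generative AI tool (hypothetical)."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def check_against_known_cases(func):
    """Compare func with small hand-computed cases and return the failures."""
    cases = [
        (([1, 2, 3, 4], 2), [1.5, 2.5, 3.5]),
        (([5, 5, 5], 3), [5.0]),
        (([1, 2], 3), []),  # window longer than the data: nothing to average
    ]
    failures = []
    for args, expected in cases:
        result = func(*args)
        if result != expected:
            failures.append((args, expected, result))
    return failures
```

An empty list from `check_against_known_cases(moving_average)` means the function agrees with the hand-computed cases. It does not prove the code is correct in general, so the cases should cover the situations that matter for your analysis.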

Conclusion

Ethical and responsible research is not a constraint: it is an enabler of robust, reproducible, impactful, and trustworthy science. The path toward “best practices” is complex and non-linear; only through open discussion and reflection on our practices can we collectively adopt the most ethical and effective approaches for doing research. To learn about many of these responsible practices in computational research, see the CodeRefinery workshop, which covers version control, reproducibility, testing, documentation, and more.

Further references