Kaur Alasoo: Scientific discoveries are born through collaboration and data sharing

Kaur Alasoo, Associate Professor of Bioinformatics at the Institute of Computer Science, University of Tartu, has combined computer science and molecular biology in his research to find new ways of predicting the effects of genetic variants and to better understand rare diseases. He was interviewed by Professor Jaak Vilo, Head of the Chair of Data Science.

What first sparked your interest in genetics and molecular biology?

My initial interest arose in secondary school. I’ve always been more into technology and computing; traditional biology never really fascinated me – I still don’t know much about plant and animal species. However, I remember how, in the final years of secondary school, I was fascinated by molecular biology and genetics: the clear and concrete rules of how DNA is transcribed into RNA and then translated into protein, how inheritance works, and so on. At the end of secondary school, I met Hedi Peterson at an event – she was then a PhD student in bioinformatics at the Institute of Computer Science and is now a professor of bioinformatics. When I started studying computer science at the University of Tartu that autumn, she invited me to learn about the bioinformatics research group and suggested my first project. At that time, I didn’t even know how to program. One thing led to another, and that’s how I stayed in bioinformatics. The strong background in computer science and machine learning from my undergraduate studies at the University of Tartu has been extremely useful later on.

You’ve worked abroad in several renowned research institutions: the Wellcome Sanger Institute, the European Molecular Biology Laboratory in Heidelberg, and Aalto University. How have these different research environments shaped your scientific thinking and approach to research?

There’s no single correct formula for conducting research. Every university, institute, and lab does things a little differently. The biggest differences are probably at the level of individual research groups and labs. In biology and the natural sciences, there are fairly clear rules about what constitutes trustworthy science and how it should be written up, but the much bigger differences lie in the invisible side of science – what is sometimes referred to as “night science”. Where do research questions even come from? Which questions are worth spending time on, and which aren’t? When should you change direction and try something new? At what point do data stop being just data and turn into real understanding? How do you know you’re not fooling yourself out of sheer desire to discover something? There are no definitive answers. To find a research style that suits you, it’s very helpful to interact and collaborate with different people. In the end, every scientist has to find their own way, but it’s much easier when you have role models to learn from.

What does it mean to you to have been re-elected to the board of the Estonian Young Academy of Sciences, and what goals do you hope to achieve this time?

I’d like to draw attention to the challenges of combining research, attending conferences, and being a parent. Attending international conferences is a crucial part of the work of young researchers (including PhD students) – it allows them to present their research, get feedback, and build new contacts. But for young parents, attending conferences can be very costly, especially if they need to bring an infant along with a carer or partner. We’d like to develop a grant to support young researchers in such situations, though we’ll see if we can secure funding. In the UK, such measures are quite common – both the University of Bristol and the University of Cambridge have programmes to support parents returning to research.

How do you assess the potential of combining molecular datasets and machine learning in medicine?

Enormous. I’m interested in figuring out what genetic differences between people actually do. But the human genome is three billion letters long, and each position can have up to four different values. We will never be able to measure all possible genetic variants and their combinations. And even if we could, it would just be one huge “stamp collection”. The best proof that we understand something in biology is if we can predict it. This is exactly where there have been major advances in recent years – applying machine learning to basic biology.

Although large language models like ChatGPT have received a lot of attention lately, I think the biggest breakthroughs will come from models focused on solving much more specific biological problems. A good example is Google’s AlphaFold, which was a huge step forward in predicting protein structures. Google has since developed the AlphaMissense model, which can predict the impact of genetic variants on protein function quite well. And Google is far from the only player in this space – there are many others, including academic research groups. These new models have enormous practical potential, for example in helping with rare disease diagnosis.

Have you observed progress in the past five years in understanding how and why a particular genetic variant is linked to a specific disease?

Yes, there are many examples, including in our own work. I think the biggest gains right now come from relatively ‘boring’ solutions. There are plenty of questions we don’t yet have answers to, but we know quite well what needs to be done to find them. We need to collect larger and higher-resolution datasets and integrate them better, as well as create tools that allow quick and convenient querying of these data.

At the Institute of Computer Science, we don’t do biological measurements ourselves, but that leaves us all the more room to contribute to linking existing datasets. For example, we recently published a paper in Nature Communications exploring what a particular genetic variant that increases the risk of developing lupus does in human immune cells. All the data used in that study had already been published, but no one had put them together before because it simply required so much work.

By analogy, what we do is like building Google Maps. Different data layers – where the streets, cycle paths, buildings, cafés, and grocery shops are – all exist somewhere, but to make them convenient to use, they need to be brought together in one place. Imagine wanting to cycle in Tartu from Toome Café to the Delta Centre, but first having to look up the café’s address on its website, then the Delta address on the university’s website, and then consult a paper map to figure out how to get from A to B. That’s often how it is with genetic data – the data layers are scattered across different datasets, making it cumbersome to ask the questions scientists are interested in. Our goal is to reduce that friction as much as possible.

Stephen Burgess (Cambridge University), Kaur Alasoo and Hillary Martin (Wellcome Sanger Institute). Session “Disease mechanisms” at the Gene Forum 2025.

What international collaboration opportunities do you see for Estonian researchers?

In human genetics, research is highly international, and projects often involve many collaborators. For example, our latest paper had 20 co-authors from seven countries, all of whom made a substantial contribution to data analysis or interpretation of results. The key is to find the right partners, and for that, international conferences – as I mentioned earlier – are crucial.

What have been your most valuable research collaboration experiences?

I think our recent Nature Communications paper is one of the best examples. I clearly remember the Zoom meeting where we were reviewing the latest results after another round of data analysis updates. We had one candidate association that I had already checked several times without finding anything interesting. I was almost ready to give up, but others on the call suggested we take another look at the genetic variants we had identified – and that’s when we saw the link between the USP18 variant and lupus. Writing the paper and doing additional analyses took another six months, but from that moment, the main direction was clear.

Which research project has been the most meaningful to you so far?

The eQTL Catalogue database that we have created. We can see that it’s widely used by researchers. It already has over 400 citations in the scientific literature, and it is also used in pharmaceutical companies, where it influences drug development decisions even if that use doesn’t necessarily lead to new scientific papers.

How do you support your mentees at the start of their careers, and what skills do you consider most important to pass on to young researchers?

The most important skills are the hardest to teach. How do you ask good questions? How do you avoid fooling yourself? How do you turn fragments of discoveries into coherent stories and write them down? I hope to pass on these skills and values through meetings with my mentees, discussing their research, and writing papers together – but how well that works is probably something they should answer.

If I had to name one most important thing, it would be writing. It requires a lot of practice, but without writing, you can’t achieve clarity of thought. As an institute, we could probably do much more to support our PhD students here, for example by organising writing workshops.

What advice would you give to students who want to start research in bioinformatics or genomics?

Biology is a complex system that requires deep engagement. It’s very hard to do something interesting quickly, but if you have persistence and curiosity, there are many fascinating unanswered questions. One of the charms of biology is that almost everyone is an amateur. Even scientists who have worked for decades are often experts only in a narrow area, and outside that, they have to start almost from scratch like everyone else. Experience always gives an advantage, but not as big as one might think. There are 20,000 genes in humans, and even an experienced molecular biologist might know only a few dozen really well.

What has been the most instructive moment in your scientific career?

Being as open as possible about your results and sharing data. In many projects, I’ve received crucial feedback because we shared our results openly and publicly, often before we had published a paper or even a preprint. In my experience, you shouldn’t fear that someone will steal your results. A much bigger risk is that there’s an important error in your analysis or experiment that you’ve missed, but which could be spotted quickly in a short conversation.

I respect those who make their results publicly available as early as possible. I don’t understand when people post a preprint without sharing the raw data or key machine-readable results. What’s the point of such advertisements?

If you weren’t a researcher, what job would you like to try for a day, and why?

Teacher. There are many other interesting jobs, but most of them aren’t really sensible to try for just one day – for example, an airline pilot or a doctor.
