On October 13, Ardi Tampuu defended his PhD thesis “Neural networks for analyzing biological data”. The thesis demonstrates the benefits of artificial neural networks in analyzing two biological datasets. The first study investigated whether the information contained within a DNA snippet is enough, on its own, to predict whether the snippet originates from a viral genome. The second dataset comes from neuroscience.
Artificial neural networks (ANNs) are a class of machine learning algorithms that has gained popularity in recent years. Different subtypes of ANNs are used in various fields of computer science. For example, convolutional networks are useful in object and face recognition systems, whereas recurrent neural networks are effective in speech recognition and natural language processing. These are not the only possible applications of neural networks, however: in this thesis, we demonstrated the benefits of using this technology to tackle two biological questions.
First, we applied neural networks in the field of genomics. Viral species that are still unknown may have important effects on human health. To help virologists identify new viral species, we created a machine-learning-based recommendation system that identifies DNA sequences likely to originate from a virus. These sequences can then be studied further in the lab to fully sequence and characterize the organism they belong to.
In machine learning terms, we investigated whether the viral origin of a DNA sequence can be predicted from the information contained within that sequence alone, i.e. without comparing the sequence to a database of already known sequences, as alignment-based tools do. Through two publications (Article 1 and Article 2), we demonstrated that machine learning algorithms can make this prediction with good accuracy. Our convolutional neural network architecture, named ViraMiner, outperformed the baseline methods by a wide margin. The sequences that our model labels as most likely to be viral are indeed all viral: the recommendation system gets 20 out of 20 top predictions correct on a challenging dataset with low prevalence of viral sequences.
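To illustrate what predicting from the sequence alone can look like, the sketch below one-hot encodes a DNA string and slides a single convolutional filter over it, keeping the strongest response. This is not the ViraMiner implementation: the function names, the hand-set "TATA" filter, and the example sequences are all assumptions chosen for illustration. In a trained network, many such filters would be learned from labelled data.

```python
# Illustrative sketch only; not the ViraMiner architecture.
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T).
    Characters outside ACGT (such as 'N') become all-zero rows."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d_max(encoded, kernel):
    """Slide one filter (width x 4 weights) along the sequence and
    return its maximum activation (global max pooling)."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        score = sum(
            w * x
            for k_row, x_row in zip(kernel, window)
            for w, x in zip(k_row, x_row)
        )
        scores.append(score)
    return max(scores) if scores else 0.0

# Hypothetical hand-set filter that responds to the motif "TATA".
# A trained network would learn its filter weights instead.
tata_filter = one_hot("TATA")

print(conv1d_max(one_hot("GGCTATAAGC"), tata_filter))  # 4.0 (motif present)
print(conv1d_max(one_hot("GGCCGGCCGG"), tata_filter))  # 0.0 (no A or T at all)
```

The filter output depends only on the input sequence itself, which is the sense in which such a model needs no lookup against a reference database at prediction time.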
Secondly, we investigated the usefulness of neural networks in analyzing neuroscientific data. The sense of location in our brains, and in the brains of all mammals, relies on place cells, a type of neuron that activates only when the animal is in a specific location in space. These neurons are an interesting object of study, and a particularly challenging task is to predict the location of the animal from their activity. Unfortunately, the activity of only a handful of neurons can be recorded simultaneously. In our work, we showed that recurrent neural networks (RNNs) are particularly useful for this location prediction task. Based on the activity of only a few dozen neurons, our model could predict the animal’s location within a 1 × 1 m area with 10 cm precision. This result clearly outperforms the Bayesian methods commonly used in the field of neural decoding. We hypothesize that recurrent neural networks would also be very useful for other neural decoding tasks, because they can take into consideration not only the brain activity during the stimulus but also the past brain activity that provides important context.
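The role of past activity can be shown with a minimal recurrent update. The sketch below is an assumption-laden toy, not the model from the thesis: the weights are made up, the network has only two hidden units, and there is no trained readout to an (x, y) position. It only demonstrates that two recordings ending in the same time bin leave a recurrent network in different internal states, whereas a decoder that looks at the last bin alone could not tell them apart.

```python
import math

def rnn_step(h, x, w_in, w_rec):
    """One Elman-style update: h_new = tanh(w_in @ x + w_rec @ h)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(w_in[j], x))
            + sum(w * hi for w, hi in zip(w_rec[j], h))
        )
        for j in range(len(h))
    ]

def run_rnn(spike_counts, w_in, w_rec):
    """Feed a sequence of per-bin spike counts through the network
    and return the final hidden state."""
    h = [0.0] * len(w_rec)
    for x in spike_counts:
        h = rnn_step(h, x, w_in, w_rec)
    return h

# Hypothetical weights: 3 recorded neurons feeding 2 hidden units.
w_in = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.2]]
w_rec = [[0.6, 0.1], [-0.1, 0.6]]

# Two made-up recordings with the same final time bin [0, 0, 3]
# but different activity in the preceding bins.
recording_a = [[2, 0, 0], [1, 1, 0], [0, 0, 3]]
recording_b = [[0, 2, 0], [0, 1, 1], [0, 0, 3]]

h_a = run_rnn(recording_a, w_in, w_rec)
h_b = run_rnn(recording_b, w_in, w_rec)

print(recording_a[-1] == recording_b[-1])  # True: identical last bin
print(h_a != h_b)                          # True: history changes the state
```

A practical decoder would add a learned output layer that maps the hidden state to a position estimate and would train all weights on recorded trajectories.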