Research from the Lab's Little AI Scholars: Learn the life language using artificial intelligence

1. Background

1.1 Project/research background

This project explores the application of artificial intelligence (AI) techniques to classifying deoxyribonucleic acid (DNA) sequences. It was carried out by three high school students under teaching and supervision in the Lab's Little AI Scholars program.
To explain AI concepts understandably, several analogies were introduced during the research and illustrated with images to help the high school students understand them. We achieved our goal of automatic recognition of DNA sequences. We first transformed the DNA sequences into a human-like language. Then we employed Natural Language Processing (NLP) and a multi-layer perceptron (MLP) to classify the sequences into 7 gene families across 3 organisms (humans, dogs, and chimpanzees). During the research, the high school students deepened their understanding of the biology and mathematics they had learned in class and applied it to the project: DNA-related knowledge for analyzing the experiments, and sets, functions, vectors, and matrices for the classification model. Python and TensorFlow were used for the implementation. The experiments showed that the classifier achieves high accuracy. In addition, we developed a demo that gives users easy access to the classifier.

1.2 DNA background

DNA stands for deoxyribonucleic acid. It is a macromolecule made up of nucleotides, each consisting of a sugar-phosphate backbone unit and one of the nitrogenous bases A, T, C, and G, arranged in varying orders. DNA carries the genetic instructions that guide the development of an organism and the functions of its individual cells, ensuring the organism's survival and growth. It stores the information that each cell and its molecules need to function, and is often described as the “blueprint” of the body. The order of the bases is what distinguishes one DNA sequence from another. Segments of specific DNA sequences form genes, and these genes form gene families. Genes are responsible for gene expression, which is why our body functions the way it does. By classifying and identifying gene families from DNA sequences, scientists can support the early diagnosis and prediction of diseases. DNA sequences are therefore an important subject in biology: the ability to understand DNA and classify it into families can lead to crucial breakthroughs in science and medicine. Figure 1 displays some examples. Through DNA sequences, scientists can read, understand, and compare genetic information [1-5].

Figure 1 DNA sequences research fields and applications



2. Solution

2.1 Solution schematic

Figure 2 The solution schematic


2.2 More details about the solution

1) Transforming the DNA sequences into texts similar to human language

Figure 3 Transformation of DNA sequences into text
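The transformation in Figure 3 can be sketched in a few lines of Python. This is only an illustration under an assumption: the text above does not say how the "words" are formed, so the sketch uses overlapping fixed-length pieces (often called k-mers) with a hypothetical length of k = 6. The function name `dna_to_sentence` is made up for this example.

```python
def dna_to_sentence(sequence, k=6):
    """Split a DNA sequence into overlapping k-letter 'words' joined by spaces.

    Assumption: overlapping k-mers with k=6; the project's actual word
    length is not stated in the text.
    """
    words = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return " ".join(words)


# A short made-up sequence, just to show the shape of the result.
sentence = dna_to_sentence("ATGCATGCA", k=6)
print(sentence)  # ATGCAT TGCATG GCATGC CATGCA
```

Each "sentence" produced this way can then be handled with the same tools used for ordinary text.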

2) Vectorization for the “texts”

                                                 

Figure 4 Vectorization for the text
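To make the vectorization step concrete, here is a minimal plain-Python sketch of the two methods named in the summary, count vectors and TF-IDF. It is not the project's code. The helper names are hypothetical, and the TF-IDF weight uses the plain textbook form count × log(N / document frequency); library implementations usually add smoothing and normalization on top of this.

```python
import math
from collections import Counter


def build_vocabulary(sentences):
    """Map every distinct word to a column index, in sorted order."""
    words = sorted({word for s in sentences for word in s.split()})
    return {word: index for index, word in enumerate(words)}


def count_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in one sentence."""
    counts = Counter(sentence.split())
    vector = [0] * len(vocabulary)
    for word, n in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = n
    return vector


def tfidf_vectors(sentences, vocabulary):
    """Weight each count by log(N / document frequency) of its word."""
    rows = [count_vector(s, vocabulary) for s in sentences]
    n_docs = len(sentences)
    n_words = len(vocabulary)
    doc_freq = [sum(1 for row in rows if row[j] > 0) for j in range(n_words)]
    idf = [math.log(n_docs / df) if df else 0.0 for df in doc_freq]
    return [[row[j] * idf[j] for j in range(n_words)] for row in rows]


# Two tiny made-up "sentences" of 3-letter words.
sentences = ["ATG TGC GCA", "ATG TGA"]
vocabulary = build_vocabulary(sentences)
counts = [count_vector(s, vocabulary) for s in sentences]
weights = tfidf_vectors(sentences, vocabulary)
print(vocabulary)  # {'ATG': 0, 'GCA': 1, 'TGA': 2, 'TGC': 3}
print(counts)      # [[1, 1, 0, 1], [1, 0, 1, 0]]
```

In this toy example the word ATG appears in both sentences, so its TF-IDF weight is zero: a word that occurs everywhere does not help tell the classes apart.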

3) Multi-layer Perceptron (MLP) as Automatic Classifier

    

Figure 5 Biological neurons to a “node” in a neural network

Figure 6 MLP network structure
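The idea behind Figures 5 and 6 can be sketched as a forward pass written in plain Python: every node computes a weighted sum of its inputs plus a bias, and the last layer turns 7 scores into class probabilities. The project itself used TensorFlow; this sketch only illustrates the computation. The layer sizes, the ReLU and softmax activations, and the random untrained weights are all assumptions for illustration, since the text does not describe the actual network.

```python
import math
import random


def relu(values):
    """Keep positive values, replace negative ones with zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """One layer: each node is a weighted sum of all inputs plus its bias."""
    return [sum(w * x for w, x in zip(node_weights, inputs)) + b
            for node_weights, b in zip(weights, biases)]


def mlp_forward(inputs, layers):
    """Hidden layers use ReLU; the final layer uses softmax."""
    activations = inputs
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(activations, weights, biases))


def random_layer(n_in, n_out, rng):
    """Untrained placeholder weights; a real model learns these values."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
# Hypothetical sizes: 4 input features, 8 hidden nodes, 7 output classes.
layers = [random_layer(4, 8, rng), random_layer(8, 7, rng)]
probabilities = mlp_forward([1.0, 0.0, 2.0, 1.0], layers)
predicted_class = probabilities.index(max(probabilities))
```

The predicted class is simply the position of the largest probability. Training, which adjusts the weights so that this position matches the true label, is left out of the sketch.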


3. Experiments and findings

3.1 Data

We downloaded the DNA sequence dataset from Kaggle, an online community where users can find and publish datasets and explore data science. The dataset includes more than 6500 DNA sequences from three organisms: 4380 from humans, 820 from dogs, and 1682 from chimpanzees. They are annotated into 7 classes, as listed in Table 1. Figure 8 shows the class distributions for humans, dogs, and chimpanzees.

                                           

Table 1. DNA sequence types and the labels in the dataset

Gene family                    Class label
G protein-coupled receptors    0
Tyrosine kinase                1
Tyrosine phosphatase           2
Synthetase                     3
Synthase                       4
Ion channel                    5
Transcription factor           6

                            

Figure 8 Class distributions in human, dog, and chimpanzee data

3.2 Example experiments (visualization of the results using confusion matrix)

Figure 9 Confusion matrix of testing human, dog, and chimpanzee data
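For readers who have not met a confusion matrix before, the following sketch shows how one can be built for the 7 class labels. The labels in the example are made up and are not the project's results; the function names are hypothetical. Row i, column j counts the sequences whose true class is i and whose predicted class is j, so correct predictions sit on the diagonal.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes=7):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Share of all predictions that fall on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Made-up labels: one class-6 sequence is wrongly predicted as class 4.
true_labels = [0, 1, 2, 6, 6, 3]
predicted_labels = [0, 1, 2, 6, 4, 3]
matrix = confusion_matrix(true_labels, predicted_labels)
print(matrix[6])                     # [0, 0, 0, 0, 1, 0, 1]
print(accuracy_from_matrix(matrix))  # 5 correct out of 6
```

Off-diagonal entries show which gene families the classifier confuses with each other, which a single accuracy number cannot reveal.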

3.3 Interesting findings from generalization evaluation

Furthermore, the generalization of AI models is crucial. Generalization is the ability to learn patterns from the training data that also apply to new, unseen data.

                                                        

Figure 10 An example of model generalization  

An interesting finding is that the model trained on human data also scores highly when tested on dog and chimpanzee data:

Trained on human    Tested on dog    Tested on chimpanzee
0.9578              0.9085           0.9899
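A generalization check of this kind can be sketched as follows, assuming the scores are accuracies (the text does not name the metric). The classifier is treated as a function that maps a "sentence" to a class label; `toy_predict` is a placeholder rule standing in for a trained model, and the two tiny datasets are invented purely for illustration.

```python
def accuracy(predict, sentences, labels):
    """Fraction of sentences whose predicted label matches the true label."""
    predictions = [predict(s) for s in sentences]
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def evaluate_generalization(predict, datasets):
    """Score one already-trained classifier on several unseen datasets."""
    return {name: accuracy(predict, sentences, labels)
            for name, (sentences, labels) in datasets.items()}


def toy_predict(sentence):
    # Placeholder rule standing in for the trained model.
    return 0 if sentence.startswith("ATG") else 1


# Invented data: each entry is (sentences, true labels).
datasets = {
    "dog": (["ATGCAT TGCATG", "CCGTAA CGTAAC"], [0, 1]),
    "chimpanzee": (["ATGAAA TGAAAC", "ATGCCC TGCCCA"], [0, 1]),
}
scores = evaluate_generalization(toy_predict, datasets)
print(scores)  # {'dog': 1.0, 'chimpanzee': 0.5}
```

The key point is that the classifier is never retrained on the dog or chimpanzee data; it is only asked to predict.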


4. Demo

To make the classification model convenient to try with an input DNA sequence, we used Anvil.works to develop a demo based on our Python code. The link is https://dna-sequence-recognition.anvil.app. Figure 11 shows the instructions for accessing the demo, and Figure 12 displays the recognized results.

Figure 11 Instructions for the demo

Figure 12 Demo’s recognized results


5. Summary

In this research, we used Natural Language Processing (NLP) and neural networks to classify DNA sequences automatically. We transformed DNA sequences into a human-like language and explored the count vectorizer and TF-IDF as vectorization methods. Finally, we employed a classic neural network, the multi-layer perceptron, as the classification model. We also developed a demo for users to try.

During the project, the three high school students learned a lot. Firstly, they came to understand the pipeline of an AI project and applied it to solve a practical problem with high accuracy. To aid understanding, they were taught several analogies for AI concepts such as supervised learning and the multi-layer perceptron. Secondly, they gained a deeper understanding of several concepts from their maths class, such as sets, vectors, matrices, and functions, and applied them to the research rather than leaving them on exam papers. They also worked to understand new and challenging knowledge in science. For example, they read the references for more background on the different DNA sequence types in this research.

Their first contact with AI was enjoyable, and they will explore more. As a next step, they will study different vectorization methods and AI models for classification. Exploring the generalization performance from the viewpoint of transfer learning could also be interesting.

Reference

Liu, J. H., Liu, P. Y., Liu, J. Y., Ding, X. E. & Hou, R. J. DNA Sequence Automatic Classification—Learn the life language using artificial intelligence. 12th International Conference on Soft Computing, Artificial Intelligence and Applications (SCAI), 2023.
