Research from the Lab's Little AI Scholars: Learn the life language using artificial intelligence
1. Background
1.1 Project/research background
This project explores the application of artificial intelligence (AI) techniques to classifying deoxyribonucleic acid (DNA) sequences. The work was carried out by three high school students, under teaching and supervision, in the Lab's Little AI Scholars program.
To explain AI concepts in an understandable way, a couple of analogies were introduced during the research and illustrated with images so that the high school students could grasp them more easily. With this support, we achieved our goal of automatic recognition of DNA sequences. We first transformed the DNA sequences into a human-like language. We then used Natural Language Processing (NLP) and a multi-layer perceptron (MLP) to classify the sequences into 7 gene families from 3 organisms (humans, dogs, and chimpanzees). During the research, the students deepened their understanding of the biology and mathematics they had learned in class and applied it to the project: DNA-related knowledge for analyzing the experiments, and sets, functions, vectors, and matrices for building the classification model. The implementation uses Python and TensorFlow. The experiments showed that the classifier achieves high accuracy. In addition, we developed a demo that gives users easy access to the classifier.
1.2 DNA background
Figure 1 DNA sequences research fields and applications
2. Solution
2.1 Solution schematic
Figure 2 The solution schematic
2.2 More details about the solution
Figure 3 Transformation of DNA sequences into text
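The transformation in Figure 3 can be sketched in a few lines of Python. The post does not state how the sequences were cut into words, so this sketch assumes overlapping k-letter words (often called k-mers) with k = 6, a common choice for this kind of task. The function names and the example sequence are hypothetical.

```python
def sequence_to_words(sequence, k=6):
    """Split a DNA string into overlapping k-letter 'words' (k-mers).

    Assumption: k = 6 and lowercase words; the post does not give these details.
    """
    sequence = sequence.lower()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def sequence_to_sentence(sequence, k=6):
    """Join the words with spaces so the result reads like a sentence."""
    return " ".join(sequence_to_words(sequence, k))


example = "ATGCATGCA"  # hypothetical sequence, not taken from the dataset
print(sequence_to_sentence(example))  # prints: atgcat tgcatg gcatgc catgca
```

Once every sequence is a "sentence" of words, ordinary text-processing tools can be applied to it.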
Figure 4 Vectorization for the text
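As a rough illustration of the vectorization in Figure 4, the sketch below turns word "sentences" into count vectors and then into TF-IDF vectors using plain Python. It assumes the textbook weighting count × log(N / document frequency). Library implementations usually add smoothing and normalization, so their numbers would differ. All names and the two example sentences are hypothetical.

```python
import math
from collections import Counter


def build_vocabulary(sentences):
    """Give every distinct word a fixed column index (alphabetical order)."""
    words = sorted({word for sentence in sentences for word in sentence.split()})
    return {word: index for index, word in enumerate(words)}


def count_vector(sentence, vocabulary):
    """Count how often each vocabulary word appears in one sentence."""
    counts = Counter(sentence.split())
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index] = counts[word]
    return vector


def tfidf_vectors(sentences, vocabulary):
    """Weight each count by log(N / number of sentences containing the word).

    Assumes the vocabulary was built from these same sentences, so every
    word occurs in at least one of them.
    """
    rows = [count_vector(sentence, vocabulary) for sentence in sentences]
    n_docs = len(sentences)
    n_words = len(vocabulary)
    doc_freq = [sum(1 for row in rows if row[j] > 0) for j in range(n_words)]
    return [
        [row[j] * math.log(n_docs / doc_freq[j]) for j in range(n_words)]
        for row in rows
    ]


sentences = ["atg tgc gca", "atg tgc gct"]  # hypothetical 3-letter words
vocabulary = build_vocabulary(sentences)
print(count_vector(sentences[0], vocabulary))  # prints: [1, 1, 0, 1]
print(tfidf_vectors(sentences, vocabulary)[0])
```

Words that appear in every sentence (here "atg" and "tgc") get a TF-IDF weight of zero, while words that distinguish one sentence from another keep a positive weight.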
Figure 5 Biological neurons to a “node” in a neural network.
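The analogy in Figure 5 can be made concrete with a single artificial node: it multiplies each input by a weight, adds the results and a bias, and passes the sum through an activation function. The sigmoid activation, the function names, and the numbers below are illustrative assumptions, not details from the project.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def node_output(inputs, weights, bias):
    """One node: weighted sum of the inputs plus a bias, then an activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)


# Hypothetical inputs and weights: 1.0 * 0.5 + 2.0 * (-0.25) + 0.0 = 0.0
print(node_output([1.0, 2.0], [0.5, -0.25], 0.0))  # prints: 0.5
```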
Figure 6 MLP network structure
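To show how nodes combine into the kind of network in Figure 6, here is a minimal forward pass through an MLP with one hidden layer. The project itself was implemented with TensorFlow. This sketch uses plain Python only to stay self-contained, and the layer sizes, ReLU and softmax activations, and weight values are assumptions chosen for illustration. Only the 7 output classes come from the post.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer: each row of weights feeds one node."""
    return [
        sum(x * w for x, w in zip(inputs, row)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    """Keep positive values, replace negative ones with zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input -> hidden layer (ReLU) -> output layer (softmax)."""
    hidden = relu(dense(inputs, hidden_w, hidden_b))
    return softmax(dense(hidden, output_w, output_b))


# Hypothetical, untrained weights: 3 input features, 2 hidden nodes, 7 classes.
features = [1.0, 0.0, 2.0]
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
output_w = [[0.1 * c, -0.1 * c] for c in range(7)]
output_b = [0.0] * 7

probabilities = mlp_forward(features, hidden_w, hidden_b, output_w, output_b)
predicted_class = probabilities.index(max(probabilities))
print(predicted_class, probabilities)
```

In a trained model the weights are learned from labeled examples, and the class with the highest probability is reported as the prediction.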
3. Experiments and findings
3.1 Data
We downloaded the DNA sequence dataset from Kaggle, an online community where users can find and publish datasets and explore data science. The dataset includes more than 6,500 DNA sequences from three organisms: 4,380 from humans, 820 from dogs, and 1,682 from chimpanzees. The sequences are annotated into 7 classes, listed in Table 1. Figure 8 shows the class distributions for humans, dogs, and chimpanzees.
Table 1. DNA sequence types and the labels in the dataset
Figure 8 Class distribution in human, dog, and chimpanzee data
3.2 Example experiments (visualization of the results using a confusion matrix)
Figure 9 Confusion matrices for the human, dog, and chimpanzee test data
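For readers unfamiliar with confusion matrices, the following sketch shows how one is built from true and predicted labels. It assumes the classes are encoded as integers starting at 0. The labels below are made up for illustration with 3 classes and are not results from the experiments.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Row = true class, column = predicted class, cell = number of samples."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal, i.e. predicted correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up labels for illustration only.
true_labels = [0, 0, 1, 2, 2, 2]
predicted_labels = [0, 1, 1, 2, 2, 0]

matrix = confusion_matrix(true_labels, predicted_labels, n_classes=3)
for row in matrix:
    print(row)
print(accuracy(matrix))
```

Large numbers on the diagonal mean the classifier is mostly right, and each off-diagonal cell shows which pair of classes is being confused.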
3.3 Interesting findings from generalization evaluation
An interesting finding is that
4. Demo
Figure 11 Instructions for the demo
Figure 12 Demo’s recognized results
5. Summary
In this research, we used Natural Language Processing (NLP) and neural networks to classify DNA sequences automatically. We transformed the DNA sequences into a human-like language and explored the count vectorizer and TF-IDF as vectorization methods. Finally, we employed a classic neural network, the multi-layer perceptron, as the classification model. We also developed a demo for users to try.
During the project, the three high school students learned a lot. First, they came to understand the pipeline of an AI project and adapted it to solve a practical problem with high accuracy. To help them, they were taught several analogies for AI concepts such as supervised learning and the multi-layer perceptron. Second, they gained a deeper understanding of several concepts from their maths class, such as sets, vectors, matrices, and functions, and applied them to the research rather than leaving them on exam papers. They also worked to understand new and challenging scientific material. For example, they read the references for more background on the different DNA sequence types in this research.
The students enjoyed their first contact with AI and will explore it further. As a next step, they will study different vectorization methods and AI models for classification. Exploring generalization performance from the perspective of transfer learning could also be interesting.