I am Chen Siyan, a Computer Science student in the Class of 2024 at Hainan Bielefeld University of Applied Sciences. During my internship in a laboratory at the School of Life Sciences, Fudan University, I applied computing techniques to life science research, an experience that strengthened my technical skills and broadened my understanding of interdisciplinary work.

The core objective of this internship project was to predict crop phenotypes accurately from genetic data. Our research logic was as follows: differences in SNP genotypes among samples are a key factor driving phenotypic differentiation in crops. A machine learning model can therefore learn a mapping from genotype to phenotype and predict phenotypes directly from genetic data, providing technical support for precise screening at the seed stage.

During the project, my main contributions were as follows:

Data Processing Phase

I conducted systematic cleaning and standardization of phenotypic files, laying a high-quality data foundation for subsequent model training.
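
The cleaning and standardization step can be illustrated with a minimal sketch. It assumes that a phenotype column arrives as raw text, that missing values are marked with tokens such as "NA", and that standardization means z-scoring. The function name and the token list are hypothetical and are not taken from the project code.

```python
import math
import statistics


def clean_and_standardize(values, missing_tokens=("", "NA", "NaN", "-")):
    """Parse raw phenotype strings and z-score the usable values.

    Entries that are missing or not numeric become None so that sample
    order stays aligned with the genotype data. (Hypothetical helper.)
    """
    parsed = []
    for raw in values:
        token = raw.strip()
        if token in missing_tokens:
            parsed.append(None)
            continue
        try:
            number = float(token)
        except ValueError:
            parsed.append(None)
            continue
        # Reject inf/nan that float() may still accept.
        parsed.append(number if math.isfinite(number) else None)

    observed = [v for v in parsed if v is not None]
    if len(observed) < 2:
        raise ValueError("need at least two observed values to standardize")

    mean = statistics.fmean(observed)
    std = statistics.pstdev(observed)
    if std == 0:
        raise ValueError("phenotype column is constant; cannot z-score")

    scaled = [None if v is None else (v - mean) / std for v in parsed]
    return scaled, mean, std


# Example with made-up measurements for a single trait.
raw_column = ["10.0", "NA", " 14.0 ", "12.0", "abc"]
scaled, mean, std = clean_and_standardize(raw_column)
```

Returning the mean and standard deviation alongside the scaled values makes it possible to map model predictions back to the original measurement units later.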

Model Construction Phase

Following the principle of “reproduce and verify first, then innovate”, I first reproduced an MNIST digit-recognition model on the server of the cfff platform to verify the technical setup. I then built a network for predicting crop phenotypes from genetic data, with a core design that links genetic feature reconstruction to phenotype prediction:

  1. Taking the SNP state matrix as input, an encoder maps the high-dimensional genetic heterozygosity data to a low-dimensional latent space and learns latent features of the genotype data;
  2. A decoder reconstructs the genetic heterozygosity features from the latent space, which verifies that the extracted features retain the input information;
  3. The latent-layer features are passed to a dedicated prediction decoder whose structure is tuned for the standardized phenotypic data. A fully connected layer maps them to the phenotypic feature dimension, giving an end-to-end pipeline from genetic input through feature extraction to phenotypic output.

After multiple rounds of parameter tuning and iteration, I obtained a prediction model that trained stably and ran efficiently.
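
The three-part design described above (encoder, reconstruction decoder, phenotype head) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the project's actual network: the 0/1/2 genotype coding, the layer sizes, the ReLU activation, the mean squared error terms, and the weighting factor alpha are all hypothetical, and the sketch covers the forward pass and loss only, without a training loop.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_layer(n_in, n_out):
    """Return He-initialized weights and zero biases for one dense layer."""
    scale = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, scale, size=(n_in, n_out)), np.zeros(n_out)


def relu(x):
    return np.maximum(x, 0.0)


class SnpPhenotypeNet:
    """Encoder plus two heads: SNP reconstruction and phenotype prediction."""

    def __init__(self, n_snps, n_latent, n_traits):
        self.enc_w, self.enc_b = init_layer(n_snps, n_latent)
        self.dec_w, self.dec_b = init_layer(n_latent, n_snps)
        self.head_w, self.head_b = init_layer(n_latent, n_traits)

    def forward(self, snps):
        # Step 1: compress the SNP state matrix into a latent representation.
        latent = relu(snps @ self.enc_w + self.enc_b)
        # Step 2: reconstruct the SNP features from the latent representation.
        recon = latent @ self.dec_w + self.dec_b
        # Step 3: map the same latent features to phenotype values
        # through a fully connected layer.
        pheno = latent @ self.head_w + self.head_b
        return latent, recon, pheno


def joint_loss(snps, recon, pheno_true, pheno_pred, alpha=0.5):
    """Weighted sum of reconstruction MSE and prediction MSE (assumed form)."""
    recon_loss = np.mean((recon - snps) ** 2)
    pred_loss = np.mean((pheno_pred - pheno_true) ** 2)
    return alpha * recon_loss + (1.0 - alpha) * pred_loss


# Synthetic example: 8 samples, 100 SNPs coded 0/1/2, 3 standardized traits.
snps = rng.integers(0, 3, size=(8, 100)).astype(float)
traits = rng.normal(size=(8, 3))

model = SnpPhenotypeNet(n_snps=100, n_latent=16, n_traits=3)
latent, recon, pheno = model.forward(snps)
loss = joint_loss(snps, recon, traits, pheno)
```

Sharing one latent representation between both heads is what ties feature reconstruction to phenotype prediction: the reconstruction term pushes the encoder to preserve genotype information, while the prediction term pushes it toward features that matter for the traits.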

Conclusion and Reflections

This experience made me realize that phenotype prediction from genetic data is not simply a matter of stacking algorithms, but a rigorous scientific process that runs from data preprocessing to model iteration. I have also grown from a practitioner who focused only on code implementation into a problem solver who can interpret data in light of the underlying biology.

During the internship, I not only put my professional knowledge into practice and consolidated my technical foundation, but also came to appreciate the value of learning as a team through communication and collaboration with senior lab members. I will carry these insights into my further studies and continue to explore this interdisciplinary field.