The University of Sheffield
Department of Computer Science

Sarah Deighton Undergraduate Dissertation 2000/01

"Pattern Recognition in DNA Microarray Data"

Supervised by M. Niranjan

Abstract

DNA microarrays are a recent advance in biotechnology that could revolutionise many areas of biology, from the treatment of disease to the cataloguing of the function of every gene in the human body. The first arrays were created around 25 years ago, and since then the technology has improved to allow ever-increasing array sizes and ever more accurate observations from them. Array technology now makes it possible to obtain expression levels for thousands of genes in a single experiment. Microarrays have thus transformed existing biological techniques and made the development of new and improved ones possible.

The challenge now is to develop methods and algorithms that can analyse the huge quantity of data currently being generated by microarray research. Most importantly, methods are required that can facilitate the discovery of patterns in this data. From these patterns, predictions can be made about the genes' function and, by comparison with the patterns of genes with known function, these predictions can become accepted fact. This is where my project comes in. The aim of the project was to implement and evaluate some of the methods that have been created. A vast number of such methods are available, so I have focused on two techniques that are at the forefront of this work: self-organising maps and support vector machines. I have based my investigation into the feasibility of using these techniques for this purpose on previous research. The two pieces of research I followed are the work on the molecular classification of cancer by Golub et al. (1999) and the work on the functional classification of yeast genes by Brown et al. (2000). A further aim of the project was to critically evaluate their work and their claims of success.

I successfully implemented the self-organising map and support vector machine algorithms, plus a simple class prediction algorithm. I found that all gave very accurate results when used on the specified data sets but did not generalise well. It therefore appears that pattern recognition techniques can be used to analyse microarray data, but the particular method used must be carefully chosen. This is a major limitation, as it requires an in-depth knowledge of the vast range of techniques available and is likely to involve unnecessary and time-consuming preliminary investigations. Further work must also be done if these techniques are to be considered accurate enough to become the only test necessary. Currently they must be used in conjunction with other tests.