The University of Sheffield
Department of Computer Science

Simon Payne Undergraduate Dissertation 2000/01

"Classification of Protein Sequences into Homologous Families"

Supervised by M. Niranjan

Abstract

Automatic classification of proteins into homologous superfamilies from their amino acid sequences has long been a goal for scientists studying proteins. The current method is very time-consuming and complicated. Several statistical models have been created to aid classification, but so far they have had only partial success: they can give a clue to a protein's superfamily but not a definite answer. Recently, Jaakkola and Haussler proposed a method that combines a hidden Markov model with the discriminative technique of a support vector machine. This method, known as the SVM-Fisher method, builds on previous work using hidden Markov models but takes a different approach to decoding the model, and it is claimed to give better results than a hidden Markov model alone. A partial implementation of the system is described in detail, including the problems encountered and the differences in technique between this implementation and that of Jaakkola and Haussler. These differences arose because several of the mathematical functions could not be implemented, as they did not seem to be effective. The results achieved suggest that the method can produce better classification results than a generative hidden Markov model alone. The experiment presented in the Jaakkola and Haussler paper was also re-run and compared, but the results achieved were not as good.