Protein family classification with NLP

This project develops an interpretable model that classifies protein sequences into the most common protein families in the UniProt Knowledgebase (UniProtKB). It applies common NLP techniques to the sequences and compares several machine learning models, including k-nearest neighbors, decision trees, and random forests.
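
To illustrate how NLP techniques can apply to protein sequences, here is a minimal sketch that treats overlapping k-mers as "words", builds bag-of-words counts, and classifies with a k-nearest-neighbors vote on cosine similarity. It uses only the Python standard library. The function names, the k-mer featurization, the similarity measure, and the toy sequences and labels are all assumptions for illustration, not the project's actual pipeline or data.

```python
from collections import Counter
from math import sqrt


def kmer_counts(sequence, k=3):
    """Treat overlapping k-mers as 'words' and count them (bag of words)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse k-mer count vectors."""
    dot = sum(count * b[kmer] for kmer, count in a.items() if kmer in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(train, query, k=3, n_neighbors=1):
    """Predict a label by majority vote among the most similar training sequences."""
    query_vec = kmer_counts(query, k)
    scored = sorted(
        ((cosine_similarity(kmer_counts(seq, k), query_vec), label)
         for seq, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:n_neighbors])
    return votes.most_common(1)[0][0]


# Toy, made-up sequences and family labels (hypothetical, for illustration only).
toy_train = [
    ("MKTAYIAKQRQISFVKSHFSRQ", "family_A"),
    ("MKTAYIAKQRQISFVKAHFSRQ", "family_A"),
    ("GSSGSSGLVPRGSHMGSSGSSG", "family_B"),
    ("GSSGSSGLVPRGSHMASSGSSG", "family_B"),
]

prediction = knn_predict(toy_train, "MKTAYIAKQRQISFVKSHFARQ", k=3, n_neighbors=1)
print(prediction)
```

A nearest-neighbor prediction of this kind is easy to inspect, since the k-mers shared between the query and its closest training sequences can be listed directly. Decision trees and random forests would consume the same k-mer count features.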

The full analysis and implementation are available at the following link: Protein Family Classification