Comparison of Boosted Models on Parkinson’s Prediction with Web App
In this project, I created a web application that uses three boosted classification models and the Synthetic Minority Oversampling Technique (SMOTE) to predict the maximum severity of Parkinson’s disease in patients.
Table of contents
- 00. Project Overview
- 01. Data Overview
- 02. Data Preparation
- 03. Feature Engineering
- 04. Apply Boosted Models
- 05. Summary of Best Performing Models
- 06. Model Explainability
- 07. Growth & Next Steps
- Appendix: Footnotes
Project Overview
Context
Parkinson’s disease (PD) is a disabling brain disorder that affects movement, cognition, sleep, and other normal functions. Unfortunately, there is no current cure, and the disease worsens over time. It’s estimated that by 2037, 1.6 million people in the U.S. will have Parkinson’s disease, at an economic cost approaching $80 billion. Research indicates that protein or peptide abnormalities play a key role in the onset and worsening of this disease [1].
Three tree-based ensemble models are compared for predicting the categorical UPDRS rating of Parkinson’s symptoms: XGBoost, LightGBM, and CatBoost. The data is filtered to the first 12 months of visits for improved performance, and a few engineered features are added. Because the target shows a significant class imbalance, SMOTE is used to balance the target for training. Model hyperparameter tuning is performed with the hyperopt package, which uses Bayesian optimization to explore the hyperparameter search space. Lastly, whether the patient was on medication during the clinical visit is compared for model performance; the medication information has many missing values but improves predictions for UPDRS 2 and UPDRS 3. AUC-ROC is used as the main comparison metric between models. The probability threshold for classification is fine-tuned to optimize in favor of recall while also tracking the highest F1 score. Recall is favored over precision to minimize false negatives, which could cause patients not to seek treatment sooner. While false positives have a negative impact on a patient, who will likely have more frequent doctor’s visits, that is not as harmful as a likely Parkinson’s patient misdiagnosed as being “not at risk.”
The default model parameters with no SMOTE and no medication data gave an AUC-ROC around 0.59. The best performance for UPDRS 1 is an AUC-ROC of 0.796 from a CatBoost classifier using the hyperopt hyperparameters and SMOTE. The best performance for UPDRS 2 is an AUC-ROC of 0.881 from a CatBoost classifier using the hyperopt hyperparameters, the medication data, and SMOTE. The best performance for UPDRS 3 is an AUC-ROC of 0.729 from a LightGBM classifier using the hyperopt hyperparameters, the medication data, and SMOTE.
Actions
- Perform EDA on the data.
- Drop the target UPDRS 4 due to a high percentage of missing values.
- Add feature engineering to create aggregate features.
- Evaluate default model prediction performance.
- Tune hyperparameters.
- Narrow the data to the first 12 months of visits.
- Tune hyperparameters with narrow data.
- Apply SMOTE for class imbalance.
- Add “On Medication” data.
- Tune the classification threshold.
- Save best model for each UPDRS.
Results
UPDRS 1 Prediction on Test Data
- CatBoost Hyperopt SMOTE model:
- AUC: 0.796
- Recall: 0.765
- Precision: 0.864
UPDRS 2 Prediction on Test Data
- CatBoost Hyperopt SMOTE Medication model:
- AUC: 0.881
- Recall: 0.720
- Precision: 0.692
UPDRS 3 Prediction on Test Data
- LightGBM Hyperopt SMOTE Medication model:
- AUC: 0.729
- Recall: 0.781
- Precision: 0.586
Growth/Next Steps
Areas for Scientific Research
Based on the feature importances from the different models, some areas for scientific research are immunoglobulin response, transferrins, RNA cleavage, and immune response, since proteins involved in these biological processes were the most prevalent among the top features.
Specific proteins of more interest are:
- P06396 Gelsolin
- P01621 Immunoglobulin kappa variable 3-20
- Q9UKV8 Argonaute2
- P05060 Secretogranin-1
- P02787 Serotransferrin
- P02790 Hemopexin
Improve Model Performance
While these models perform quite well, a next step would be to use more compute power to generate and train models on combinations of protein and peptide values. Unfortunately, on a local machine the number of feature combinations becomes massive and training times are unreasonable.
Another idea is to use domain knowledge to find proteins that are known to interact and generate new features from that subset. This would require an expert in Parkinson’s biochemistry to know which proteins and peptides are of interest; with 227 unique proteins and 968 unique peptides, researching each one individually is not feasible for a hobbyist.
Web Application
Parkinson’s Web Application Link

GitHub Repo
View the GitHub repo here: Parkinson’s GitHub Repo
Data Overview:
All of the data are from a Kaggle competition that began on February 16, 2023 and ended on May 18, 2023. The core of the dataset consists of protein abundance values derived from mass spectrometry readings of cerebrospinal fluid (CSF) samples gathered from several hundred patients [2]. There is also a column indicating whether the patient was taking any medication during the UPDRS assessment, which can affect motor function scores; it is represented in the column “upd23b_clinical_state_on_medication.”
Dimensional Columns:
- patient_id: a unique identifier for each patient.
- visit_month: the number of months since the patient’s first visit.
- visit_id: a combination of the patient_id and the visit_month. Join key for protein and patient data.
Protein-level mass spectrometry columns are labeled with the protein’s UniProt ID alone, and their values are the protein abundances. Peptide-level columns are labeled with the UniProt ID concatenated to the peptide sequence, with an underscore as the separator. (A sketch of assembling this wide layout follows the column list below.)
Metric Columns:
- protein UniProt label: contains the mass spectrometry protein abundance.
- UniProt_peptide label: contains the mass spectrometry peptide abundance.
- upd23b_clinical_state_on_medication: whether the patient was on medication during the visit.
- Values can be “On”, “Off”, or NaN if unknown.
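As a rough illustration of how this wide layout can be assembled, here is a minimal pandas sketch. The file and column names (train_proteins.csv, train_peptides.csv, UniProt, NPX, PeptideAbundance) follow the Kaggle competition layout but should be treated as assumptions rather than the project’s exact code.

```python
import pandas as pd

# Assumed Kaggle file/column names.
proteins = pd.read_csv("train_proteins.csv")   # visit_id, UniProt, NPX
peptides = pd.read_csv("train_peptides.csv")   # visit_id, UniProt, Peptide, PeptideAbundance

# One row per visit, one column per protein (UniProt label).
protein_wide = proteins.pivot_table(index="visit_id", columns="UniProt", values="NPX")

# Peptide columns are "<UniProt>_<peptide sequence>".
peptides["UniProt_Peptide"] = peptides["UniProt"] + "_" + peptides["Peptide"]
peptide_wide = peptides.pivot_table(
    index="visit_id", columns="UniProt_Peptide", values="PeptideAbundance"
)

# Join protein and peptide features on the visit_id key.
features = protein_wide.join(peptide_wide, how="outer")
```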
Data Statistics:
- 248 Patients
- 968 Unique Peptides
- 227 Unique Proteins
Distribution of Values


Data Preparation:
Remove UPDRS 4 from the Experiment
Looking at the missing values for each UPDRS, roughly 45% of the values are missing for UPDRS 4, which leads me to remove this target from the prediction. It is primarily 0s and has almost half of its values missing, so it will not be a good target to learn from and predict.
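A minimal sketch of this missing-value check, assuming the Kaggle clinical file and updrs_1 … updrs_4 column names:

```python
import pandas as pd

# Assumed Kaggle file/column names for the clinical data.
clinical = pd.read_csv("train_clinical_data.csv")

# Fraction of missing values per UPDRS target; updrs_4 is ~45% missing.
print(clinical[["updrs_1", "updrs_2", "updrs_3", "updrs_4"]].isna().mean())

# Keep only the three usable targets.
targets = ["updrs_1", "updrs_2", "updrs_3"]
```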
Convert the Target from a Continuous Value to a Categorical Value
Rather than use a continuous value as the target, binning the values into different categories of severity provides a better signal of the eventual outcome. The severity ranges are taken from Wikipedia [3]:
| UPDRS 1 | Min | Max |
|---|---|---|
| Mild | 0 | 10 |
| Moderate | 11 | 21 |
| Severe | 22 | 52 |
| UPDRS 2 | Min | Max |
|---|---|---|
| Mild | 0 | 12 |
| Moderate | 13 | 29 |
| Severe | 30 | 52 |
| UPDRS 3 | Min | Max |
|---|---|---|
| Mild | 0 | 32 |
| Moderate | 33 | 58 |
| Severe | 59 | 132 |
Combine the Moderate and Severe Categories
After converting the targets into categories, there is significant class imbalance. The “severe” category is barely represented and would not train well in an ML model. To adjust for this, moderate and severe are combined into a single category, so the model now predicts whether a patient will have mild vs. non-mild (moderate/severe) symptoms.
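A minimal binning sketch, assuming the updrs_* column names above. pd.cut uses right-inclusive bins, so the edges below reproduce the mild vs. non-mild split from the tables:

```python
import pandas as pd

# Mild upper bounds and max scores from the Wikipedia ranges above;
# moderate and severe are merged into one non-mild class (label 1).
MILD_MAX = {"updrs_1": 10, "updrs_2": 12, "updrs_3": 32}
MAX_SCORE = {"updrs_1": 52, "updrs_2": 52, "updrs_3": 132}

def to_binary_severity(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col, mild_max in MILD_MAX.items():
        # (-1, mild_max] -> 0 (mild); (mild_max, max_score] -> 1 (moderate/severe)
        out[col + "_cat"] = pd.cut(
            df[col], bins=[-1, mild_max, MAX_SCORE[col]], labels=[0, 1]
        ).astype(float)
    return out
```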

Feature Engineering
Number of Visits:
- Since a patient who is experiencing more symptoms may be visiting the doctor more frequently, the number of visits was added as a feature.
Number of proteins with value > 0:
- Perhaps the number of proteins expressed has an effect on the Parkinson’s symptoms.
Number of peptides with value > 0:
- Perhaps the number of peptides expressed has an effect on the Parkinson’s symptoms.
Number of proteins and peptides with value > 0:
- Perhaps the combined number of proteins and peptides expressed has an effect on the Parkinson’s symptoms. (A sketch of these count features follows below.)
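A sketch of these aggregate features, assuming the wide features frame from the Data Overview sketch and a clinical frame carrying patient_id per visit_id. The underscore heuristic for telling peptide columns from protein columns mirrors the naming convention described earlier:

```python
# `features` is indexed by visit_id; `clinical` has visit_id and patient_id.
def add_aggregate_features(features, clinical):
    df = features.join(clinical.set_index("visit_id")[["patient_id"]])

    # Number of visits per patient.
    df["num_visits"] = df.groupby("patient_id")["patient_id"].transform("count")

    protein_cols = [c for c in features.columns if "_" not in c]
    peptide_cols = [c for c in features.columns if "_" in c]

    # Counts of proteins / peptides expressed (> 0) at each visit.
    df["num_proteins"] = (features[protein_cols] > 0).sum(axis=1)
    df["num_peptides"] = (features[peptide_cols] > 0).sum(axis=1)
    df["num_prot_pep"] = df["num_proteins"] + df["num_peptides"]
    return df
```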
Protein X Protein
- Combinations of protein values multiplied by each other were also implemented. Because this grows the number of features combinatorially (51,302 protein x protein combinations), the training time for the model was far too long for any benefit, and the correlations of the protein x protein columns with the UPDRS values did not show a benefit to using the engineered feature.
Apply Boosted Models
With the high number of features (proteins and peptides) and the low correlation of the features to the target, a decision-tree-based model seemed like a good option, since tree-based models do not require feature-selection preprocessing. Gradient-boosted ensembles of Decision Trees typically provide the best performance on this kind of tabular data: like a Random Forest, they combine many trees, which generalize well when the trees are shallow, and the boosting lets each new tree learn from the samples the previous trees got wrong.
LightGBM
A gradient boosting ensemble model that grows leaf-wise, meaning only a single leaf is split when considering the gain. Leaf-wise growth can lead to overfitting and is best controlled by tuning the max tree depth. LightGBM uses histogram binning of the data to determine splits, controlled by the parameters max_bin and min_data_in_bin. One key feature of LightGBM is its training speed [4], which comes mainly from exclusive feature bundling. The gradient boosting is expressed in GOSS (Gradient-based One-Side Sampling), which gives higher weights to the data points with larger gradients [5].
XGBoost
XGBoost, or extreme gradient boosting, uses depth-wise growth for its trees. It uses CART (classification and regression trees) to score each leaf [6]. Sampling is simple bootstrap sampling, with no weights used in splitting.
CatBoost
A gradient boosting ensemble model with specific benefits for categorical features: it does not require the user to preprocess them. The CatBoost model grows its Decision Trees symmetrically, meaning that at every depth level all of the nodes use the same split condition [7]. For splitting, the model uses Minimal Variance Sampling, which is performed at the tree level and optimizes the split scoring accuracy [8].
Default Model Performance
Using 5-fold cross-validation with default hyperparameters for each model, taking the mean results over the 5 holdout sets. A minimal evaluation sketch follows.
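A sketch of this comparison, assuming X and y hold the visit-level features and the binary mild/non-mild target for one UPDRS (prepared as in the earlier sketches):

```python
from sklearn.model_selection import cross_validate
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

models = {
    "XGBoost": XGBClassifier(),
    "LightGBM": LGBMClassifier(),
    "CatBoost": CatBoostClassifier(verbose=0),
}
scoring = ["roc_auc", "accuracy", "precision", "recall", "f1"]

# X, y assumed prepared as in the earlier sketches.
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring=scoring)
    print(name, {m: round(cv[f"test_{m}"].mean(), 3) for m in scoring})
```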
UPDRS 1
| Model | AUC | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| XGBoost | 0.598 | 0.825 | 0.705 | 0.220 | 0.335 |
| LightGBM | 0.582 | 0.822 | 0.736 | 0.181 | 0.290 |
| CatBoost | 0.542 | 0.813 | 0.793 | 0.090 | 0.159 |
UPDRS 2
| Model | AUC | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| XGBoost | 0.576 | 0.869 | 0.825 | 0.159 | 0.266 |
| LightGBM | 0.580 | 0.872 | 0.847 | 0.165 | 0.276 |
| CatBoost | 0.543 | 0.863 | 0.850 | 0.088 | 0.160 |
UPDRS 3
| Model | AUC | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| XGBoost | 0.590 | 0.854 | 0.750 | 0.193 | 0.307 |
| LightGBM | 0.593 | 0.858 | 0.839 | 0.193 | 0.314 |
| CatBoost | 0.557 | 0.847 | 0.834 | 0.119 | 0.208 |
Hyperparameter Tuning
Hyperparameter Optimization Using Hyperopt
Using 5-fold cross-validation with the best hyperparameters from the hyperopt tuning package for each model, taking the mean results over the 5 holdout sets.
UPDRS 1
| Model | AUC | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| XGBoost | 0.623 | 0.803 | 0.516 | 0.323 | 0.397 |
| LightGBM | 0.566 | 0.797 | 0.485 | 0.180 | 0.263 |
| CatBoost | 0.566 | 0.815 | 0.685 | 0.150 | 0.246 |
UPDRS 2
| Model | AUC | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| XGBoost | 0.617 | 0.864 | 0.599 | 0.266 | 0.368 |
| LightGBM | 0.569 | 0.851 | 0.518 | 0.169 | 0.255 |
| CatBoost | 0.583 | 0.861 | 0.610 | 0.186 | 0.285 |
UPDRS 3
| Model | AUC | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| XGBoost | 0.619 | 0.859 | 0.728 | 0.257 | 0.380 |
| LightGBM | 0.617 | 0.854 | 0.666 | 0.260 | 0.374 |
| CatBoost | 0.586 | 0.836 | 0.524 | 0.208 | 0.298 |
Hyperopt Conclusion:
The hyperparameter optimization showed an improvement mainly for XGBoost across each UPDRS.
12 Months of Protein and Peptide Data
The distribution of visit months is highest in the 0 to 24 month range. Perhaps the later visit months only add noise, because the protein changes have already happened and may have a delayed effect on the symptoms.
UPDRS Categorical Max Values in the First 12 Months
- UPDRS 1: 58.9% of the patients reach their maximum severity category within the first 12 months
- UPDRS 2: 50% of the patients reach their maximum severity category within the first 12 months
- UPDRS 3: 47% of the patients reach their maximum severity category within the first 12 months
Approximately half of the patients reach their maximum severity category within the first 12 months of visits. This suggests that the first 12 months give the most valuable data for the categorical target.
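A sketch of this filtering, assuming the clinical frame and a binned updrs_1_cat column from the earlier sketches:

```python
# Keep only visits from the first 12 months.
first_year = clinical[clinical["visit_month"] <= 12]

# Target: each patient's maximum (worst) severity category over all visits.
y_max = (
    clinical.groupby("patient_id")["updrs_1_cat"].max()
    .rename("updrs_1_max").reset_index()
)

# Attach the max-category target to the first-year visits.
train = first_year.merge(y_max, on="patient_id")
```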
Prediction of Max Categorical Value using 12 Month Data
Validation curves for the hyperparameters max_depth and subsample were used to find good values for these two hyperparameters. They tend to have a large impact on performance, so testing them manually gives more intuition about their effect. The validation curves were measured using AUC-ROC; a sketch is below.
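A minimal validation-curve sketch for max_depth using scikit-learn, with X and y as before:

```python
from sklearn.model_selection import validation_curve
from xgboost import XGBClassifier

# AUC-ROC across candidate max_depth values with 5-fold CV.
depths = [2, 3, 4, 5, 6, 7, 8]
train_scores, valid_scores = validation_curve(
    XGBClassifier(), X, y,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="roc_auc",
)
for d, tr, va in zip(depths, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"max_depth={d}: train AUC={tr:.3f}, valid AUC={va:.3f}")
```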
UPDRS 1 – Max Depth Validation Curve
The UPDRS 1 AUC score on the max_depth validation curve peaked at max_depth = 6 and 8. Since 6 is shallower and should generalize better, a max depth of 6 was used.
UPDRS 2 – Max Depth Validation Curve
The UPDRS 2 AUC score on the max_depth validation curve showed a high value at max_depth = 5.
UPDRS 3 – Max Depth Validation Curve
For UPDRS 3, max_depth = 4 will be used even though the best AUC was at max_depth = 8. Since the AUC values are close, the shallower depth of 4 should generalize better.
UPDRS 1 - Subsample Validation Curve
UPDRS 2 – Subsample Validation Curve
UPDRS 3 – Subsample Validation Curve
Fine Tuned Hyperparameter Comparison for XGBoost
UPDRS 1
| Hyperparam Opt | AUC | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Default | 0.696 | 0.727 | 0.707 | 0.532 | 0.607 |
| Max Depth | 0.696 | 0.727 | 0.707 | 0.530 | 0.605 |
| Scale Pos Wt | 0.693 | 0.725 | 0.706 | 0.529 | 0.604 |
| Subsample | 0.670 | 0.702 | 0.671 | 0.503 | 0.575 |
UPDRS 2
| Hyperparam Opt | AUC | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Default | 0.671 | 0.764 | 0.694 | 0.424 | 0.526 |
| Max Depth | 0.663 | 0.765 | 0.718 | 0.396 | 0.510 |
| Scale Pos Wt | 0.716 | 0.791 | 0.730 | 0.518 | 0.606 |
| Subsample | 0.716 | 0.791 | 0.730 | 0.518 | 0.606 |
UPDRS 3
| Hyperparam Opt | AUC | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Default | 0.670 | 0.710 | 0.698 | 0.485 | 0.572 |
| Max Depth | 0.656 | 0.859 | 0.728 | 0.257 | 0.380 |
| Scale Pos Wt | 0.669 | 0.702 | 0.657 | 0.515 | 0.577 |
| Subsample | 0.677 | 0.705 | 0.649 | 0.544 | 0.592 |
By using only the first 12 months of data to forecast the max UPDRS value, the default hyperparameters for XGBoost show a significant improvement compared to using all of the visit months.
The default hyperparameters perform quite well for XGBoost. Manually optimizing some of the hyperparameters brings little improvement except for UPDRS 2, where the gain comes mainly from using scale_pos_weight to balance the target classes.
Hyperopt Hyperparameter Tuning
Hyperopt is a Python package for hyperparameter tuning that uses the Tree-structured Parzen Estimator (TPE) algorithm [9] to explore a search space of hyperparameter values and minimize an objective, such as the negative AUC score [10]. This package provides a faster way to cover combinations of hyperparameters that should improve performance; a sketch is below.
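A minimal hyperopt sketch over a few of the XGBoost hyperparameters listed in the next section. The search space here is illustrative, not the project’s exact one; note that hyperopt minimizes its objective, so the AUC is negated.

```python
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Illustrative search space (not the project's exact ranges).
space = {
    "max_depth": hp.choice("max_depth", [3, 4, 5, 6, 7, 8]),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
    "reg_alpha": hp.loguniform("reg_alpha", -5, 2),
}

def objective(params):
    model = XGBClassifier(**params)
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    # hyperopt minimizes the loss, so negate the AUC.
    return {"loss": -auc, "status": STATUS_OK}

best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=Trials())
print(best)
```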
XGBoost Hyperparameters to Tune [11]
- max_depth: the maximum depth of a tree. Deep trees can lead to overfitting, while shallow trees may not predict as well. Tuning aims for the minimum depth (low variance) that still performs well on the evaluation metric.
- min_child_weight: the minimum sum of instance weight in a child. A larger weight makes the model generalize better.
- subsample: the ratio of data sampled for each decision tree; 0.5 means half the training data is sampled before building each tree.
- colsample_bytree: the ratio of columns sampled for each decision tree. The default is 1, meaning all columns are used. Lowering this ratio creates weaker learner trees by randomly excluding some of the columns on each tree.
- reg_alpha: L1 (Lasso) regularization on the terms. A higher alpha means a more generalized model.
- reg_lambda: L2 (Ridge) regularization on the terms. A higher lambda means a more generalized model.
- gamma: the minimum loss reduction needed to make a partition on a leaf node. The larger gamma is, the shallower the trees and the more generalized the model.
- learning_rate: step-wise shrinkage of the feature weights at each boosting step. This prevents overfitting by decreasing the variance of the model. The default value is 0.3.
- max_delta_step: the maximum delta step each leaf is allowed. This can help when there is class imbalance in the data.
LightGBM Hyperparameters to Tune [12]
- max_depth: default is -1, which means no limit.
- min_data_in_leaf: default is 20. The minimum number of data points in a leaf; raising it can reduce overfitting on the training data.
- bagging_freq: default is 0, which means no bagging. Values greater than 0 set the number of iterations before a new sample of the bagging fraction of the total data is taken.
- bagging_fraction: default is 1.0. Similar to feature_fraction, but randomly selects part of the data without resampling. Can be used to speed up training and reduce overfitting.
- tree_learner: default is serial, a single-machine tree learner. The options are serial, feature, data, and voting.
- is_unbalance: set to true if the training data is unbalanced. Default is false.
- boosting: gbdt, rf, or dart. Default is gbdt.
- lambda_l1: L1 (Lasso) regularization.
- lambda_l2: L2 (Ridge) regularization.
- feature_fraction: the ratio of features to use on each tree. Default is 1.0.
- learning_rate: default is 0.1. This is the shrinkage rate.
- max_delta_step: default is 0.0. The final max output of leaves is learning_rate * max_delta_step; used to limit the output of leaves.
CatBoost Hyperparameters to Tune [13]
- depth: the max depth of a tree. The default is 6.
- learning_rate: shrinkage used in updates. The default is selected automatically for binary classification with other parameters at their defaults; in all other cases the default is 0.03.
- bagging_temperature: the higher the value, the more aggressive the bagging. Default is None.
- min_data_in_leaf: the minimum data required in a leaf node for a split. Default is None.
- l2_leaf_reg: L2 (Ridge) regularization. Default is None.
Hyperopt Hyperparameter Tuning Results
UPDRS 1
| Model | AUC | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| XGBoost | 0.701 | 0.728 | 0.691 | 0.563 | 0.620 |
| LightGBM | 0.673 | 0.696 | 0.628 | 0.561 | 0.593 |
| CatBoost | 0.566 | 0.815 | 0.685 | 0.150 | 0.246 |
UPDRS 2
| Model | AUC | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| XGBoost | 0.723 | 0.752 | 0.712 | 0.549 | 0.620 |
| LightGBM | 0.666 | 0.700 | 0.528 | 0.577 | 0.551 |
| CatBoost | 0.583 | 0.861 | 0.610 | 0.186 | 0.295 |
UPDRS 3
| Model | AUC | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| XGBoost | 0.702 | 0.746 | 0.702 | 0.560 | 0.623 |
| LightGBM | 0.697 | 0.724 | 0.674 | 0.576 | 0.621 |
| CatBoost | 0.586 | 0.836 | 0.524 | 0.208 | 0.298 |
SMOTE
By using Synthetic Minority Oversampling on the UPDRS targets in the training data, the predictive ability improved significantly. Additionally, by fine-tuning the threshold cutoff for the classification, the recall and precision can be optimized. A minimal resampling sketch follows the class counts below.
UPDRS 1 Training data:
- Class 0: 225 samples
- Class 1: 225 samples
UPDRS 2 Training data:
- Class 0: 255 samples
- Class 1: 255 samples
UPDRS 3 Training data:
- Class 0: 227 samples
- Class 1: 227 samples
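A minimal resampling sketch using imbalanced-learn, applying SMOTE only to the training split so the test set keeps the true class distribution:

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split first, then oversample only the training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(y_res.value_counts())  # balanced classes, e.g. 225 / 225 for UPDRS 1
```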
Overall SMOTE Results (Default Threshold):
UPDRS 1
| Model | AUC | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| XGBoost | 0.679 | 0.675 | 0.650 | 0.764 | 0.702 |
| LightGBM | 0.699 | 0.696 | 0.714 | 0.646 | 0.678 |
| CatBoost | 0.722 | 0.717 | 0.699 | 0.771 | 0.733 |
UPDRS 2
| Model | AUC | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| XGBoost | 0.823 | 0.821 | 0.805 | 0.847 | 0.825 |
| LightGBM | 0.767 | 0.765 | 0.783 | 0.734 | 0.758 |
| CatBoost | 0.825 | 0.821 | 0.837 | 0.801 | 0.819 |
UPDRS 3
| Model | AUC | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| XGBoost | 0.708 | 0.706 | 0.662 | 0.838 | 0.740 |
| LightGBM | 0.725 | 0.727 | 0.770 | 0.648 | 0.704 |
| CatBoost | 0.724 | 0.722 | 0.750 | 0.673 | 0.709 |
Fine-Tuning the Threshold
By adjusting the classification threshold cutoff for each of the models, the recall and precision can be optimized for the performance needs; a threshold-sweep sketch is below.
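A minimal threshold-sweep sketch, assuming a fitted model and a held-out X_test / y_test:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Probability of the positive (non-mild) class from a fitted model.
proba = model.predict_proba(X_test)[:, 1]

# Report precision/recall/F1 at each candidate cutoff.
for t in np.arange(0.05, 0.95, 0.05):
    pred = (proba >= t).astype(int)
    print(
        f"t={t:.2f}  "
        f"precision={precision_score(y_test, pred, zero_division=0):.3f}  "
        f"recall={recall_score(y_test, pred):.3f}  "
        f"f1={f1_score(y_test, pred):.3f}"
    )
```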
Threshold for SMOTE Trained Models (No Medication Data)
XGBoost UPDRS 1
| Threshold | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0.4 | 0.750 | 0.655 | 0.537 | 0.853 |
XGBoost UPDRS 2
| Threshold | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0.17 | 0.860 | 0.636 | 0.436 | 0.960 |
XGBoost UPDRS 3
| Threshold | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0.15 | 0.643 | 0.388 | 0.378 | 0.969 |
LightGBM UPDRS 1
| Threshold | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0.4 | 0.759 | 0.690 | 0.571 | 0.824 |
LightGBM UPDRS 2
| Threshold | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0.38 | 0.785 | 0.716 | 0.500 | 0.720 |
LightGBM UPDRS 3
| Threshold | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0.30 | 0.703 | 0.694 | 0.568 | 0.781 |
CatBoost UPDRS 1
| Threshold | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0.48 | 0.796 | 0.770 | 0.684 | 0.765 |
CatBoost UPDRS 2
| Threshold | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0.21 | 0.865 | 0.795 | 0.621 | 0.720 |
CatBoost UPDRS 3
| Threshold | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0.25 | 0.718 | 0.659 | 0.533 | 0.750 |
Use Patient Medication Data along with Hyperopt Params and SMOTE
There is data on whether the patient was on medication during the clinical visit, which can affect the UPDRS score. An encoding sketch is below.
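One plausible encoding, treating missing medication status as its own “Unknown” level; this matches the “Medication Unknown” feature that appears later in the UPDRS 2 importances. The frame and column names are assumptions:

```python
import pandas as pd

# NaN medication status becomes its own "Unknown" level rather than
# being imputed to On or Off.
med = clinical["upd23b_clinical_state_on_medication"].fillna("Unknown")
med_dummies = pd.get_dummies(med, prefix="medication")  # On / Off / Unknown
med_dummies.index = clinical["visit_id"]

# Join onto the visit-indexed feature frame.
features = features.join(med_dummies)
```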
Overall Results with medication data included (Default Threshold):
UPDRS 1
| Model | AUC | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| XGBoost | 0.727 | 0.721 | 0.692 | 0.804 | 0.744 |
| LightGBM | 0.737 | 0.734 | 0.761 | 0.680 | 0.718 |
| CatBoost | 0.767 | 0.764 | 0.771 | 0.756 | 0.763 |
UPDRS 2
| Model | AUC | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| XGBoost | 0.823 | 0.819 | 0.805 | 0.845 | 0.825 |
| LightGBM | 0.808 | 0.807 | 0.828 | 0.768 | 0.797 |
| CatBoost | 0.848 | 0.845 | 0.854 | 0.832 | 0.843 |
UPDRS 3
| Model | AUC | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| XGBoost | 0.712 | 0.711 | 0.667 | 0.834 | 0.741 |
| LightGBM | 0.725 | 0.726 | 0.776 | 0.648 | 0.706 |
| CatBoost | 0.717 | 0.715 | 0.739 | 0.678 | 0.707 |
Threshold for SMOTE Trained Models (With Medication Data)
XGBoost UPDRS 1
| Threshold | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0.5 | 0.750 | 0.724 | 0.639 | 0.676 |
XGBoost UPDRS 2
| Threshold | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0.33 | 0.853 | 0.716 | 0.5 | 0.8 |
XGBoost UPDRS 3
| Threshold | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0.52 | 0.676 | 0.624 | 0.5 | 0.72 |
LightGBM UPDRS 1
| Threshold | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0.35 | 0.768 | 0.644 | 0.528 | 0.824 |
LightGBM UPDRS 2
| Threshold | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0.42 | 0.856 | 0.830 | 0.667 | 0.8 |
LightGBM UPDRS 3
| Threshold | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0.28 | 0.729 | 0.682 | 0.556 | 0.781 |
CatBoost UPDRS 1
| Threshold | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0.36 | 0.769 | 0.655 | 0.540 | 0.794 |
CatBoost UPDRS 2
| Threshold | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0.22 | 0.881 | 0.830 | 0.692 | 0.720 |
CatBoost UPDRS 3
| Threshold | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0.13 | 0.587 | 0.494 | 0.410 | 0.781 |
Summary of Best Performing Models
UPDRS 1 Prediction on Test Data
- CatBoost Hyperopt SMOTE model:
- AUC: 0.796
- Recall: 0.765
- Precision: 0.864
UPDRS 2 Prediction on Test Data
- CatBoost Hyperopt SMOTE Medication model:
- AUC: 0.881
- Recall: 0.720
- Precision: 0.692
UPDRS 3 Prediction on Test Data
- LightGBM Hyperopt SMOTE Medication model:
- AUC: 0.729
- Recall: 0.781
- Precision: 0.586
Model Explainability
Looking at the Feature Importances for the top ten features reveals some interesting areas for further research. A small extraction sketch is below.
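A minimal sketch for extracting the top ten importances from a fitted model; the XGBoost, LightGBM, and CatBoost classifiers all expose feature_importances_ after fitting:

```python
import pandas as pd

# Pair importances with the training column names and take the top ten.
importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```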
UPDRS 1 Top Ten Features
- P01621 - Immunoglobulin kappa variable 3-20
- The variable domain of immunoglobulin light chains that participates in antigen recognition.
- P06396 - Gelsolin
- Calcium regulated, actin modulating protein that binds to the plus ends of actin monomers or filaments.
- P01621 - Immunoglobulin kappa variable 3-20
- The variable domain of immunoglobulin light chains that participates in antigen recognition.
- Q9UKV8 - Argonaute2
- Required for RNA-mediated gene silencing.
- P06727 - Apolipoprotein A-IV
- Required for efficient activation of lipoprotein lipase by ApoC-II and potent activator of LCAT.
- Q9UKV8 - Argonaute2
- Required for RNA-mediated gene silencing.
- P02765 - Fetuin-A
- Counteracts the host antiviral immune response when activated and phosphorylated.
- P07998 - Ribonuclease Pancreatic
- Endonuclease that catalyzes the cleavage of RNA.
- P05060 - Secretogranin-1
- Neuroendocrine secretory granule protein.
- P08603 - Complement Factor H
- Glycoprotein that plays an essential role in maintaining a well-balanced immune response.
UPDRS 1 Summary
Three proteins/peptides are involved in RNA silencing or cleavage, two in immunoglobulin structure, two in immune response, one in actin formation, one in lipase activation, and one in neuroendocrine activity.
UPDRS 2 Top Ten Features
- Medication Unknown
- When the patient’s medication status is not known, it helps predict severity of UPDRS 2.
- P06396 - Gelsolin
- Calcium regulated, actin modulating protein that binds to the plus ends of actin monomers or filaments.
- P06727 - Apolipoprotein A-IV
- Required for efficient activation of lipoprotein lipase by ApoC-II and potent activator of LCAT.
- P01042 - Kininogen-1
- Inhibitor of thiol proteases.
- Q13740 - CD166 antigen
- Promotes neurite extension, axon growth and axon guidance.
- P02787 - Serotransferrin
- Iron binding transport protein.
- P02787 - Serotransferrin
- Iron binding transport protein.
- P43652 - Afamin
- May be involved in the transport of Vitamin E across the blood brain barrier.
- P02790 - Hemopexin
- Binds heme and transports it to the liver for breakdown and iron recovery.
- P02787 - Serotransferrin
- Iron binding transport protein.
UPDRS 2 Summary
Four proteins/peptides are involved in iron binding, one in actin modulation, one in lipase activation, one is a protease inhibitor, one in neurite and axon growth, and one in vitamin E transport.
UPDRS 3 Top Ten Features
- P00734 - Prothrombin
- Converts fibrinogen to fibrin by cleaving bonds after Arg and Lys. Functions in blood homeostasis, inflammation and wound healing.
- Q13283 - Ras GTPase-activating protein-binding protein
- Involved in various processes, such as stress granule formation and innate immunity.
- P01859 - Immunoglobulin heavy constant gamma 2
- Constant region of immunoglobulin heavy chains.
- P25311 - Zinc-alpha-2-glycoprotein
- Stimulates lipid degradation in adipocytes.
- P08133 - Annexin A6
- May associate with CD21 and may regulate the release of calcium from intracellular stores.
- P04433 - Immunoglobulin kappa variable 3-11
- V region of the variable domain of immunoglobulin light chains.
- P05060 - Secretogranin-1
- Neuroendocrine secretory granule protein.
- P51884 - Lumican
- Structural molecule activity.
- P01876 - Immunoglobulin heavy constant alpha 1
- Constant region of immunoglobulin heavy chains.
- P02790 - Hemopexin
- Binds heme and transports it to the liver for breakdown and iron recovery.
UPDRS 3 Summary
Three proteins/peptides are involved in immunoglobulins, one in immune response, one in iron recovery, one in lipid degradation, one in protein cleavage, one in neuroendocrine activity, one in calcium transport, and one in structural activity.
Final Summary
There is a common trend of protein groups, where immunoglobulins, iron transport or recovery, RNA silencing, and immune response are highly represented. These types of activities can be explored to see how they relate to Parkinson’s disease.
Furthermore, research could focus on the proteins that show up in the top ten more than once, which are:
- P06396 Gelsolin
- P01621 Immunoglobulin kappa variable 3-20
- Q9UKV8 Argonaute2
- P05060 Secretogranin-1
- P02787 Serotransferrin
- P02790 Hemopexin
Growth/Next Steps
Areas for Scientific Research
Based on the feature importances from the different models, some areas for scientific research are immunoglobulin response, transferrins, RNA cleavage, and immune response, since proteins involved in these biological processes were the most prevalent among the top features.
Specific proteins of more interest are:
- P06396 Gelsolin
- P01621 Immunoglobulin kappa variable 3-20
- Q9UKV8 Argonaute2
- P05060 Secretogranin-1
- P02787 Serotransferrin
- P02790 Hemopexin
Improve Model Performance
While these models perform quite well, a next step would be to use more compute power to generate and train models on combinations of protein and peptide values. Unfortunately, on a local machine the number of feature combinations becomes massive and training times are unreasonable.
Another idea is to use domain knowledge to find proteins that are known to interact and generate new features from that subset. This would require an expert in Parkinson’s biochemistry to know which proteins and peptides are of interest; with 227 unique proteins and 968 unique peptides, researching each one individually is not feasible for a hobbyist.
Appendix: Footnotes
1. https://www.kaggle.com/competitions/amp-parkinsons-disease-progression-prediction
2. https://www.kaggle.com/competitions/amp-parkinsons-disease-progression-prediction/data
3. https://en.wikipedia.org/wiki/Unified_Parkinson%27s_disease_rating_scale
4. https://www.microsoft.com/en-us/research/wp-content/uploads/2017/11/lightgbm.pdf
5. https://pro.arcgis.com/en/pro-app/latest/tool-reference/geoai/how-lightgbm-works.htm
6. https://xgboost.readthedocs.io/en/stable/tutorials/model.html
7. https://pro.arcgis.com/en/pro-app/latest/tool-reference/geoai/how-catboost-works.htm
8. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0262895
9. https://towardsdatascience.com/hyperopt-demystified-3e14006eb6fa
10. https://hyperopt.github.io/hyperopt/
11. https://xgboost.readthedocs.io/en/latest/parameter.html
12. https://lightgbm.readthedocs.io/en/latest/Parameters.html
13. https://catboost.ai/en/docs/concepts/python-reference_catboostclassifier