Description
This study analyzes the PERMY data set, taken from Pharmaceutical Statistics Using SAS: A Practical Guide. The data describe the permeability of cell membranes, that is, the ability of a molecule to cross a membrane. Biological membranes are complex layers of molecules and proteins. A substance needs a particular structure to pass through the target membrane, and drugs that fail to demonstrate sufficient permeability should be excluded from further testing. For this reason, permeability is important in the early stages of drug development.
The aim of this study is to compare several methods for binary classification, using 71 molecular properties whose meanings are not explicitly known. Because of possible collinearity and near-singular data matrices, the models were complemented with multicollinearity diagnostics and principal component analysis for dimension reduction. The following methods were compared: logistic regression (with stepwise, decision-tree-based and cluster-based variable selection), decision trees, neural networks, random forests, gradient boosted trees and bagged trees. The data were split into training and validation subsets. To rank model performance, the average squared error was computed, while the confusion matrix and misclassification rate were used to assess the classification accuracy of each algorithm. These statistics were compared on the validation data. Additionally, to assess the fit and complexity of the candidate models, metrics such as the Gini index and the area under the ROC curve were evaluated.
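As an illustration of the validation statistics named above, the following minimal sketch computes them for a binary target in plain Python. It is not the tooling used in the study; the function names, the 0.5 cutoff and the toy labels and probabilities are all assumptions made for the example. The Gini index is taken here as 2 * AUC - 1, the usual convention for ROC-based Gini coefficients.

```python
# Hypothetical helper functions; names, cutoff and data are illustrative only.

def confusion_matrix(y_true, y_prob, threshold=0.5):
    """Count true/false positives/negatives at a given probability cutoff."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            counts["tp"] += 1
        elif pred == 1 and y == 0:
            counts["fp"] += 1
        elif pred == 0 and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts

def misclassification_rate(y_true, y_prob, threshold=0.5):
    """Share of observations assigned to the wrong class."""
    cm = confusion_matrix(y_true, y_prob, threshold)
    return (cm["fp"] + cm["fn"]) / len(y_true)

def average_squared_error(y_true, y_prob):
    """Mean squared difference between the 0/1 target and the predicted probability."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def roc_auc(y_true, y_prob):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(
        1.0 if pp > pn else 0.5 if pp == pn else 0.0
        for pp in pos
        for pn in neg
    )
    return wins / (len(pos) * len(neg))

def gini_index(y_true, y_prob):
    """Gini coefficient derived from the ROC curve: 2 * AUC - 1."""
    return 2 * roc_auc(y_true, y_prob) - 1

# Toy validation data (made up for the example, not from the PERMY data set).
y_valid = [1, 0, 1, 1, 0, 0, 1, 0]
p_valid = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3]

cm = confusion_matrix(y_valid, p_valid)
mis = misclassification_rate(y_valid, p_valid)
ase = average_squared_error(y_valid, p_valid)
auc = roc_auc(y_valid, p_valid)
gini = gini_index(y_valid, p_valid)
print(cm, mis, ase, auc, gini)
```

On this toy data the misclassification rate is 0.25, the average squared error is about 0.15, the AUC is 0.875 and the Gini index is 0.75. Computing every statistic on the same held-out subset is what makes the models comparable.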
Special attention was given to the interpretability of black-box algorithms, which are of interest primarily because of their robustness to near-singular data matrices. These models were further explained using surrogate decision trees, which provide insight into variable importance and internal model structure.
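The surrogate idea can be sketched as follows: a shallow decision tree is trained to reproduce the predictions of the black-box model rather than the original target, and its agreement with the black box (fidelity) indicates how far its splits can be trusted as an explanation. This is a minimal sketch using scikit-learn, not the software or settings of the study; the synthetic data, the random forest standing in for the black box, the tree depth and all variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption): 20 anonymous numeric predictors, binary target.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Black-box model: a random forest is used here purely as an example.
black_box = RandomForestClassifier(n_estimators=200, random_state=0)
black_box.fit(X_train, y_train)

# The surrogate is fitted to the black box's predictions, not to y_train.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X_train, black_box.predict(X_train))

# Fidelity: share of validation cases where surrogate and black box agree.
fidelity = (surrogate.predict(X_valid) == black_box.predict(X_valid)).mean()
print(f"Surrogate fidelity on validation data: {fidelity:.3f}")

# Readable split rules and the surrogate's variable importances.
print(export_text(surrogate))
importances = surrogate.feature_importances_
```

A low fidelity would mean the tree is too simple to mimic the black box, in which case its variable importances say little about the model being explained.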