Ageing is a major risk factor for many conditions, including cancer, and cardiovascular and neurodegenerative diseases. Pharmaceutical interventions that slow down ageing and delay the onset of age-related diseases are a growing research area. The aim of this study was to build a machine learning model, based on data from the DrugAge database, to predict whether a chemical compound will extend the lifespan of Caenorhabditis elegans. Five predictive models were built using the random forest algorithm with molecular fingerprints and/or molecular descriptors as features. The best-performing classifier, built using molecular descriptors, achieved an area under the curve (AUC) score of 0.815 for classifying the compounds in the test set. The features of the model were ranked using the Gini importance measure of the random forest algorithm; the top 30 features included descriptors related to atom and bond counts, and topological and partial charge properties. The model was then applied to predict the class of compounds in an external database consisting of 1738 small molecules.

If you are not getting good results from the RF algo, test some others. The following snippet gives a nice comparison of a few different algos (the dataset URL and column names were elided in the original, so placeholders are kept here):

import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

# load the data (URL and column names elided in the original)
names = [...]
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:, :-1]   # assumes the label is the last column
Y = array[:, -1]

# prepare configuration for cross validation test harness
seed = 7
scoring = 'accuracy'

# prepare the models to compare
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))

# evaluate each model in turn with 10-fold cross validation
results = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
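For a concrete sense of the AUC evaluation mentioned in the abstract, here is a minimal sketch of training a random forest classifier and scoring it with AUC. The data is synthetic (make_classification stands in for a compound/descriptor matrix), and the hyperparameters are placeholder choices, not the study's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: rows = compounds, columns = molecular descriptors,
# label = whether the compound extends lifespan (1) or not (0).
X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_train, y_train)

# AUC is computed from predicted probabilities, not hard class labels.
probs = clf.predict_proba(X_test)[:, 1]
print("AUC: %.3f" % roc_auc_score(y_test, probs))
```

Note that predict_proba, not predict, feeds the AUC computation: AUC measures how well the model ranks positives above negatives across all classification thresholds.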
Usually, I start with RF, and if I see decent performance, I am done. Sometimes other algos beat the RF algo, but I have found that often the RF is quite good! I believe that in around at least 80% of the time, I'm done. Ultimately, what algo you choose to work with is up to you, but you definitely want the predictive capabilities of the algo to be pretty high (over 90%). Please consider the following:

1) The random forest algorithm can be used for both classification and regression tasks.
2) It typically provides very high accuracy.
3) A random forest classifier will handle missing values and maintain accuracy for a large proportion of the data.
4) Adding more trees usually does not cause the model to overfit.
5) It has the power to handle a large data set with high dimensionality.

Adding some extra general points to the previous answer:

- As a decision tree algorithm, Random Forests are less influenced by outliers than many other algorithms. They also do not make any assumptions about the underlying distribution of your data, and can implicitly handle collinearity in features, because if you have two highly similar features, the information gain from splitting on one of the features will also use up the predictive power of the other.

- Random Forests can be used for feature selection, because if you fit the algorithm with features that are not useful, the algorithm simply won't use them to split on. It's possible to extract the 'best' features (which could be the total number of times a feature was used to split on the data, or the mean decrease in impurity). However, as with my point above, you cannot read too much into the relative importance of the features, especially if you have ones that are highly correlated.

- I think the biggest deciding factor of whether to use a RF or another algorithm is probably whether you want to understand more about the relationship that features have with the target and the degree of influence they have. If this is important for your use case (for example, you want to know whether a feature has a positive or negative relationship with the target, and the degree to which it affects the outcome), then others like Logistic Regression and Lasso are better choices.

- Also, if you want your model to extrapolate to predictions for data that is outside of the bounds of your original training data, a Random Forest will not be able to do so, since its predictions are averages over training-set values.
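The feature-ranking point above can be sketched with scikit-learn's Gini-based feature_importances_ attribute, the same measure the study used to rank its descriptors. The data here is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 informative features hidden among 20.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=1)
clf = RandomForestClassifier(n_estimators=200, random_state=1)
clf.fit(X, y)

# feature_importances_ is the mean decrease in Gini impurity per feature,
# averaged over all trees; the values sum to 1.
ranking = np.argsort(clf.feature_importances_)[::-1]
for idx in ranking[:5]:
    print("feature %2d: importance %.3f" % (idx, clf.feature_importances_[idx]))
```

As the answer warns, highly correlated features split this importance between them, so the resulting ranking should not be over-interpreted.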