Galaxy Community Hub

The Galaxy Machine Learning workbench is a comprehensive set of data preprocessing, machine learning, deep learning and visualisation tools, consolidated workflows for end-to-end machine learning analysis, and training materials that showcase the usage of these tools. The workbench is available on the Galaxy framework, which guarantees simple access, easy extension, flexible adaptation to personal and security needs, and sophisticated machine learning analyses independent of command-line knowledge.

The workbench provides you with a Swiss Army knife of scikit-learn, Keras (a deep learning library based on TensorFlow) and various other tools to transform your data, learn from it, make predictions and plot the results.

The workbench is currently developed by the Goecks Lab and the European Galaxy project. The German Network for Bioinformatics Infrastructure (de.NBI), which runs the German ELIXIR Node, provides the necessary compute clusters with CPU and GPU resources.

The project is a community effort. Please jump in, ask questions, and contribute to the development of new tools, workflows or training materials!

Training

We are passionate about training, so we work in close collaboration with the Galaxy Training Network (GTN) to develop training materials for data analyses based on Galaxy. These materials, hosted on the GTN GitHub repository, are available online at https://training.galaxyproject.org.

Want to learn more about machine learning? Take one of our guided tours or check out the following hands-on tutorials, developed together with the GTN community.

Lesson | Slides | Hands-on | Input dataset | Workflows | Galaxy tour | Galaxy History
Basics of machine learning
Classification
Regression
Age prediction using machine learning
Clustering
Introduction to deep learning

Available tools

This section lists the most important tools that have been integrated into the Machine Learning workbench, divided into categories for better readability. Many more tools are available, so please take a closer look at the tool panel at https://ml.usegalaxy.eu. All tools follow the IUC best practice guidelines for Galaxy tool development and are available at https://github.com/bgruening/galaxytools and https://github.com/goeckslab/Galaxy-ML.

Classification

Identifying which category an object belongs to.

| Tool | Description | Reference |
|---|---|---|
| SVM Classifier | Support vector machines (SVMs) for classification | Pedregosa et al. 2011 |
| NN Classifier | Nearest Neighbors Classification | Pedregosa et al. 2011 |
| Ensemble classification | Ensemble methods for classification and regression | Pedregosa et al. 2011 |
| Discriminant Classifier | Linear and Quadratic Discriminant Analysis | Pedregosa et al. 2011 |
| Generalized linear | Generalized linear models for classification and regression | Pedregosa et al. 2011 |
| CLF Metrics | Calculate metrics for classification performance | Pedregosa et al. 2011 |
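
As a rough illustration of what a classification step followed by a metrics step involves, here is a minimal sketch in plain scikit-learn, run outside Galaxy. The synthetic dataset, the split ratio and the SVM parameters are assumptions chosen for the example, and the code is not the implementation of the Galaxy tools listed above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class dataset (illustrative only, not a Galaxy dataset).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hold out 25% of the samples for testing, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a support vector classifier with an RBF kernel (assumed settings).
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Classification metrics on the held-out samples.
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(f"accuracy: {accuracy:.2f}")
print(cm)
```

In Galaxy, the corresponding steps are configured through tool forms rather than code, so no command-line or Python knowledge is needed.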

Regression

Predicting a continuous-valued attribute associated with an object.

| Tool | Description | Reference |
|---|---|---|
| Ensemble regression | Ensemble methods for classification and regression | Pedregosa et al. 2011 |
| Generalized linear | Generalized linear models for classification and regression | Pedregosa et al. 2011 |
| Regression metrics | Calculate metrics for regression performance | Pedregosa et al. 2011 |

Clustering

Automatic grouping of similar objects into sets.

| Tool | Description | Reference |
|---|---|---|
| Numeric clustering | Different numerical clustering algorithms | Pedregosa et al. 2011 |
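
To make "automatic grouping of similar objects" concrete, the sketch below clusters synthetic points with k-means in plain scikit-learn. K-means is only one of many numerical clustering algorithms, and the choice of algorithm, the data and the number of clusters are assumptions made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated synthetic groups of 2-D points (illustrative only).
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# Group the points into three clusters; no labels are used for fitting.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(sorted(set(labels.tolist())))   # cluster ids assigned to the points
print(kmeans.cluster_centers_.shape)  # one centre per cluster
```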

Model building

Building general machine learning models.

| Tool | Description | Reference |
|---|---|---|
| Estimator Attributes | Estimator attributes to get all attributes from an estimator or scikit object | Pedregosa et al. 2011 |
| Stacking Ensemble Models | Stacking Ensembles to build stacking, voting ensemble models with numerous base options | Pedregosa et al. 2011 |
| Search CV | Hyperparameter Search performs hyperparameter optimization using various SearchCVs | Pedregosa et al. 2011 |
| Build Pipeline | Pipeline Builder as an all-in-one platform to build pipeline, single estimator, preprocessor and custom wrappers | Pedregosa et al. 2011 |
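
The idea of building a pipeline and then searching its hyperparameters can be sketched in plain scikit-learn as follows. The pipeline steps, the parameter grid and the scoring metric are assumptions for illustration; the Galaxy tools expose comparable choices through their forms and may differ in detail.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a tabular Galaxy dataset (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# A two-step pipeline: a preprocessor followed by a final estimator.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Search over the regularisation strength of the final estimator.
# Parameters of a pipeline step are addressed as "<step>__<parameter>".
param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the scaler is part of the pipeline, it is refitted on the training portion of every cross-validation fold, which avoids leaking information from the held-out fold into preprocessing.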

Model evaluation

Evaluating, validating and choosing parameters and models.

| Tool | Description | Reference |
|---|---|---|
| Model validation | Model Validation includes cross_validate, cross_val_predict, learning_curve, and more | Pedregosa et al. 2011 |
| Pairwise Metrics | Evaluate pairwise distances or compute affinity or kernel for sets of samples | Pedregosa et al. 2011 |
| Train/Test evaluation | Train, Test and Evaluation to fit a model using part of dataset and evaluate using the rest | Pedregosa et al. 2011 |
| Model Prediction | Model Prediction predicts on new data using a prefitted model | Chollet et al. 2011 |
| Fitted model evaluation | Evaluate a Fitted Model using a new batch of labeled data | Pedregosa et al. 2011 |
| Model fitting | Fit a Pipeline, Ensemble or other models using a labeled dataset | Pedregosa et al. 2011 |
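
The `cross_validate` function named in the table above can be tried directly in scikit-learn. The sketch below is a minimal example under assumed settings (synthetic regression data, a random forest model, five folds, two scoring metrics) and is not taken from the Galaxy tool itself.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic regression problem (illustrative only).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# Five-fold cross-validation with two metrics evaluated on each fold.
results = cross_validate(
    model, X, y, cv=5, scoring=("r2", "neg_mean_absolute_error")
)

# One score per fold and per metric, stored under "test_<metric>".
print(results["test_r2"])
print(results["test_neg_mean_absolute_error"])
```

scikit-learn reports error metrics as negated values (hence `neg_mean_absolute_error`) so that a higher score is always better.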

Preprocessing and feature selection

Feature selection and preprocessing.

| Tool | Description | Reference |
|---|---|---|
| Data preprocessing | Preprocess raw feature vectors into standardized datasets | Pedregosa et al. 2011 |
| Feature selection | Feature Selection module, including univariate filter selection methods and recursive feature elimination algorithm | Pedregosa et al. 2011 |
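
As a small sketch of standardization followed by univariate filter selection, the following plain scikit-learn example scales features to zero mean and unit variance and then keeps the five highest-scoring ones. The dataset, the ANOVA F-test scoring function and `k=5` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

# 20 features, of which 5 are informative (synthetic, illustrative only).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Univariate filter: keep the 5 features with the highest ANOVA F-score.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X_scaled, y)

print(X_selected.shape)
print(np.flatnonzero(selector.get_support()))  # indices of kept features
```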

Deep learning

Build and use deep neural networks.

| Tool | Description | Reference |
|---|---|---|
| Batch Models | Build Deep learning Batch Training Models with online data generator for Genomic/Protein sequences and images | Chollet et al. 2011 |
| Model Builder | Create deep learning model with an optimizer, loss function and fit parameters | Chollet et al. 2011 |
| Model Config | Create a deep learning model architecture using Keras | Chollet et al. 2011 |
| Train and evaluation | Deep learning training and evaluation either implicitly or explicitly | Chollet et al. 2011 |

Visualization

Plotting and visualization.

| Tool | Description | Reference |
|---|---|---|
| Regression performance plots | Plot actual vs predicted curves and residual plots of tabular data | |
| ML performance plots | Plot confusion matrix, precision, recall and ROC and AUC curves of tabular data | |
| Visualization | Machine Learning Visualization Extension includes several types of plotting for machine learning | Chollet et al. 2011 |

Utilities

General data and table manipulation tools.

| Tool | Description | Reference |
|---|---|---|
| Table compute | The power of the pandas data library for manipulating and computing expressions upon tabular data and matrices | |
| Datamash operations | Datamash operations on tabular data | |
| Datamash transpose | Transpose rows/columns in a tabular file | |
| Sample Generator | Generate random samples with controlled size and complexity | Pedregosa et al. 2011 |
| Train/Test splitting | Split Dataset into training and test subsets | Pedregosa et al. 2011 |

Interactive Environments

Have you done the heavy lifting and now want to use your coding skills inside Jupyter or RStudio? Work on your data with the following interactive environments:

| Tool | Description | Reference |
|---|---|---|
| Jupyter | JupyterLab | |
| RStudio | RStudio | |

Contributors