# algorithms for mass univariate regression
Mass univariate regression is the process of independently regressing multiple response variables against a single set of explanatory features. It is common in any domain in which a large number of response variables are measured, and fitting large collections of such models can benefit significantly from parallelization.
This package provides a simple API for fitting these kinds of models. It provides a collection of algorithms for performing different types of mass regression, all following the scikit-learn style, and it also supports custom algorithms provided directly from scikit-learn. The algorithms are fit to data, returning a fitted model that contains regression coefficients and allows for prediction and scoring on new data. Compatible with Python 2.7+ and 3.4+. Works well alongside thunder and supports parallelization via spark, but can also be used as a standalone module on local numpy arrays.
Install via pip:

```
pip install thunder-regression
```

In this example we'll create data and fit a collection of models:
```python
# generate data
from sklearn.datasets import make_regression
X, Y = make_regression(n_samples=100, n_features=3, n_informative=3, n_targets=10, noise=1.0)

# create and fit the model
from regression import LinearRegression
algorithm = LinearRegression(fit_intercept=False)
model = algorithm.fit(X, Y.T)
```

After fitting, `model.betas` is an array with the 3 coefficients for each of the 10 response variables.
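To make the result concrete, here is a sketch of what mass fitting means conceptually, using scikit-learn directly rather than this package: one independent regression per response variable, with the coefficients stacked into a single `targets x features` array.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# generate the same kind of data as above
X, Y = make_regression(n_samples=100, n_features=3, n_informative=3,
                       n_targets=10, noise=1.0)

# mass fitting is conceptually one independent regression per target:
# fit scikit-learn's LinearRegression to each column of Y and stack coefficients
betas = np.vstack([
    LinearRegression(fit_intercept=False).fit(X, Y[:, i]).coef_
    for i in range(Y.shape[1])
])

print(betas.shape)  # one row of 3 coefficients per target -> (10, 3)
```

The package wraps this pattern behind a single `fit` call so the per-target loop can be parallelized.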
Import and construct an algorithm:

```python
from regression import LinearRegression
algorithm = LinearRegression(fit_intercept=False)
```

Fit the algorithm to data in the form of a `samples x features` design matrix `X` and a `targets x samples` response matrix `Y`:

```python
model = algorithm.fit(X, Y)
```

The results of the fit are accessible on the fitted model, and the model can be used to score new data:

```python
betas = model.betas
rsq = model.score(X, Y)
```

For all methods, `X` should be a local numpy array, and `Y` can be either a local numpy array, a bolt array, or a thunder Series object.
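The array orientations above are easy to get wrong, so here is a small numpy-only sketch that makes the expected shapes concrete. Note that the package's `Y` is `targets x samples`, so it must be transposed before a standard least-squares solve (the variable names here are ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))       # samples x features design matrix
B_true = rng.normal(size=(10, 3))   # targets x features coefficients
Y = B_true @ X.T                    # targets x samples responses, noiseless

# without an intercept, mass OLS is a single least-squares solve:
# np.linalg.lstsq handles all targets at once if responses are samples x targets
betas = np.linalg.lstsq(X, Y.T, rcond=None)[0].T  # back to targets x features

print(betas.shape)                 # (10, 3)
print(np.allclose(betas, B_true))  # True: noiseless data recovers coefficients
```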
All algorithms have the following methods:
`fit(X, Y)`

Fit the algorithm to data.

- `X`: design matrix, dimensions `samples x features`
- `Y`: collection of responses, dimensions `targets x samples`
- returns a fitted `MassRegressionModel`
The result of fitting an algorithm is a model with the following properties and methods:

- `betas`: array of regression coefficients, dimensions `targets x features`. If an intercept was fit, it will be the first feature.
- array of regression coefficients, followed by prediction scores on the fitted data, dimensions `targets x (features + 1)`. If an intercept was fit, it will be the first feature.
- array of individual fitted models, dimensions `1 x targets`.
- `coef_`: array of coefficients, not including a possible intercept term, for consistency with scikit-learn.
- `intercept_`: array of intercepts, for consistency with scikit-learn. If no intercepts were fit, all will have values 0.0.
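The relationship between `betas` and the scikit-learn-style `coef_`/`intercept_` attributes follows from the layout described above: when an intercept is fit it occupies the first feature column. A small sketch with made-up numbers (the array values here are hypothetical, not from the package):

```python
import numpy as np

# hypothetical fitted coefficients for 4 targets and 2 features,
# with an intercept fit, laid out as described above: intercept first
betas = np.array([[0.5, 1.0, 2.0],
                  [0.1, 3.0, 4.0],
                  [0.0, 5.0, 6.0],
                  [0.2, 7.0, 8.0]])

# splitting into scikit-learn-style attributes
intercept_ = betas[:, 0]  # one intercept per target
coef_ = betas[:, 1:]      # targets x features, no intercept column

print(intercept_.shape, coef_.shape)  # (4,) (4, 2)
```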
`predict(X)`

Predicts the response to new inputs.

- `X`: design matrix, dimensions `new samples x features`
- returns an array of responses, dimensions `targets x new samples`
`score(X, Y)`

Computes the goodness of fit (r-squared, unless otherwise stated) of the model for the given data.

- `X`: design matrix, dimensions `samples x features`
- `Y`: collection of responses, dimensions `targets x samples`
- returns an array of scores
Simultaneously computes the results of `predict(X)` and `score(X, Y)`.

- `X`: design matrix, dimensions `samples x features`
- `Y`: collection of responses, dimensions `targets x samples`
- returns an array of predictions and an array of scores
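Computed per target, the default r-squared score reduces to an elementwise formula over the `targets x samples` arrays of observed and predicted responses. A sketch of that computation (the helper name is ours, not part of the package's API):

```python
import numpy as np

def mass_r_squared(Y, Y_hat):
    """R-squared for each target, given targets x samples arrays of
    observed and predicted responses."""
    ss_res = np.sum((Y - Y_hat) ** 2, axis=1)
    ss_tot = np.sum((Y - Y.mean(axis=1, keepdims=True)) ** 2, axis=1)
    return 1.0 - ss_res / ss_tot

# perfect predictions score 1.0 for every target
Y = np.array([[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]])
print(mass_r_squared(Y, Y))  # [1. 1.]
```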
Here are all the algorithms currently available.
Linear regression through ordinary least squares as implemented in scikit-learn's LinearRegression algorithm.
- `fit_intercept`: whether or not to fit intercept terms
- `normalize`: whether or not to normalize the data before fitting the models
Use a custom regression algorithm in a mass regression analysis. The provided algorithm should operate on single response variables, and must conform to the scikit-learn API as follows:

- Must implement a `.fit(X, y)` method that takes a design matrix (`samples x features`) and a response vector and returns an object representing the fitted model.
- The returned fitted model must have attributes `.coef_` and `.intercept_` that hold the results of the fit (`.coef_` having dimensions `1 x features` and `.intercept_` being a scalar).
- The returned fitted model must also have methods `.predict(X)` and `.score(X, y)` (`X` having dimensions `new samples x features` and `y` having dimensions `1 x new samples`). The former should return a vector of predictions (dimensions `1 x new samples`) and the latter should return a scalar score (likely r-squared).
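To make the contract above concrete, here is a minimal toy estimator that satisfies it. The class and its internals are our own illustration, not part of the package or of scikit-learn; it ignores `X` entirely and just predicts the mean of `y`:

```python
import numpy as np

class MeanOnlyRegression:
    """Toy estimator that ignores X and predicts the mean of y.
    It exposes the attributes and methods the contract above requires."""

    def fit(self, X, y):
        self.coef_ = np.zeros((1, X.shape[1]))  # 1 x features
        self.intercept_ = float(np.mean(y))     # scalar
        return self

    def predict(self, X):
        return np.full(X.shape[0], self.intercept_)

    def score(self, X, y):
        ss_res = np.sum((y - self.predict(X)) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1.0 - ss_res / ss_tot

X = np.ones((5, 2))
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
model = MeanOnlyRegression().fit(X, y)
print(model.intercept_)   # 3.0
print(model.score(X, y))  # 0.0, since a mean-only model explains no variance
```

An object like this could then be handed to `CustomRegression` for mass fitting, exactly as with the scikit-learn estimators shown below.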
This allows you to define an algorithm in scikit-learn and then wrap it for mass fitting, for example:

```python
from regression import CustomRegression
from sklearn.linear_model import LassoCV

algorithm = CustomRegression(LassoCV(normalize=True, fit_intercept=False))
model = algorithm.fit(X, Y)
```

Run tests with

```
py.test
```

Tests run locally with numpy by default, but the same tests can be run against a local spark installation using

```
py.test --engine=spark
```