Multiple linear regression in Python

I can't seem to find any Python libraries that do multiple regression. The only things I find do simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).

For example, with this data:

print('y        x1      x2       x3       x4      x5     x6       x7')
for t in texts:
    print("{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}"
          .format(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7))

(output for the above:)

      y        x1       x2       x3        x4     x5     x6       x7
   -6.0     -4.95    -5.87    -0.76     14.73   4.02   0.20     0.45
   -5.0     -4.55    -4.52    -0.71     13.74   4.47   0.16     0.50
  -10.0    -10.96   -11.64    -0.98     15.49   4.18   0.19     0.53
   -5.0     -1.08    -3.36     0.75     24.72   4.96   0.16     0.60
   -8.0     -6.52    -7.45    -0.86     16.59   4.29   0.10     0.48
   -3.0     -0.81    -2.36    -0.50     22.44   4.81   0.15     0.53
   -6.0     -7.01    -7.33    -0.33     13.93   4.32   0.21     0.50
   -8.0     -4.46    -7.65    -0.94     11.40   4.43   0.16     0.49
   -8.0    -11.54   -10.03    -1.03     18.18   4.28   0.21     0.55

How would I regress these in Python, to get the linear regression formula:

Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + a7x7 + c

Answers (11)

You can use scipy.optimize.curve_fit, and not only for a linear fit.

from scipy.optimize import curve_fit
import numpy as np

def fn(x, a, b, c):
    return a + b*x[0] + c*x[1]

# y(x0,x1) data:
#    x0=0 1 2
# ___________
# x1=0 |0 1 2
# x1=1 |1 2 3
# x1=2 |2 3 4

x = np.array([[0, 1, 2, 0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1, 2, 2, 2]])
y = np.array([0, 1, 2, 1, 2, 3, 2, 3, 4])
popt, pcov = curve_fit(fn, x, y)
print(popt)
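
As a sketch, the same call scales to the question's seven predictors; the packing of the table into arrays below (and the fn7 name) is an assumption for illustration, with the numbers taken straight from the question:

import numpy as np
from scipy.optimize import curve_fit

# rows of the question's table; transposed so that inside fn7,
# x[i] is predictor i+1 across all nine observations
X = np.array([
    [ -4.95,  -5.87, -0.76, 14.73, 4.02, 0.20, 0.45],
    [ -4.55,  -4.52, -0.71, 13.74, 4.47, 0.16, 0.50],
    [-10.96, -11.64, -0.98, 15.49, 4.18, 0.19, 0.53],
    [ -1.08,  -3.36,  0.75, 24.72, 4.96, 0.16, 0.60],
    [ -6.52,  -7.45, -0.86, 16.59, 4.29, 0.10, 0.48],
    [ -0.81,  -2.36, -0.50, 22.44, 4.81, 0.15, 0.53],
    [ -7.01,  -7.33, -0.33, 13.93, 4.32, 0.21, 0.50],
    [ -4.46,  -7.65, -0.94, 11.40, 4.43, 0.16, 0.49],
    [-11.54, -10.03, -1.03, 18.18, 4.28, 0.21, 0.55],
]).T
y = np.array([-6.0, -5.0, -10.0, -5.0, -8.0, -3.0, -6.0, -8.0, -8.0])

def fn7(x, a1, a2, a3, a4, a5, a6, a7, c):
    # linear model in seven predictors plus an intercept
    return (a1*x[0] + a2*x[1] + a3*x[2] + a4*x[3]
            + a5*x[4] + a6*x[5] + a7*x[6] + c)

popt, pcov = curve_fit(fn7, X, y)
print(popt)  # a1..a7 followed by the intercept c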

Use sklearn, as mentioned above. I am using an Anaconda install of Python 3.6.

Create your model as follows:

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)

# display coefficients
print(regressor.coef_)
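
The snippet assumes X is array-like of shape (n_samples, n_features) and y of shape (n_samples,); for the question's table that is nine rows by seven columns. Note the intercept is not part of coef_; it lives in its own attribute:

# after regressor.fit(X, y):
print(regressor.intercept_)  # the constant term c from the question's formula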

You can use numpy.linalg.lstsq:

import numpy as np
y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array([
    [-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],  # x1
    [-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],  # x2
    [-0.76, -0.71,  -0.98,  0.75, -0.86, -0.50, -0.33, -0.94,  -1.03],  # x3
    [14.73, 13.74,  15.49, 24.72, 16.59, 22.44, 13.93, 11.40,  18.18],  # x4
    [ 4.02,  4.47,   4.18,  4.96,  4.29,  4.81,  4.32,  4.43,   4.28],  # x5
    [ 0.20,  0.16,   0.19,  0.16,  0.10,  0.15,  0.21,  0.16,   0.21],  # x6
    [ 0.45,  0.50,   0.53,  0.60,  0.48,  0.53,  0.50,  0.49,   0.55],  # x7
])
X = X.T  # transpose so input vectors are along the rows
X = np.c_[X, np.ones(X.shape[0])]  # add bias (intercept) term
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)

Result:

[ -0.49104607   0.83271938   0.0860167    0.1326091    6.85681762  22.98163883 -41.08437805 -19.08085066]

You can see the estimated output with:

print(np.dot(X, beta_hat))

Result:

[ -5.97751163,  -5.06465759, -10.16873217,  -4.96959788,  -7.96356915,  -3.06176313,  -6.01818435,  -7.90878145,  -7.86720264]

Here is a small workaround function that I created. I checked it against R and it works correctly.

import numpy as np
import statsmodels.api as sm

y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]

x = [
     [4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
     [4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
     [4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
     ]

def reg_m(y, x):
    # builds [x[-1], ..., x[1], x[0], ones]: each new predictor is stacked
    # in front, so the summary's x1 is the *last* list in x
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        # add_constant is a no-op here since the ones column already exists
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results

Then:

print(reg_m(y, x).summary())

Output:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.535
Model:                            OLS   Adj. R-squared:                  0.461
Method:                 Least Squares   F-statistic:                     7.281
Date:                Tue, 19 Feb 2013   Prob (F-statistic):            0.00191
Time:                        21:51:28   Log-Likelihood:                -26.025
No. Observations:                  23   AIC:                             60.05
Df Residuals:                      19   BIC:                             64.59
Df Model:                           3                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.2424      0.139      1.739      0.098        -0.049     0.534
x2             0.2360      0.149      1.587      0.129        -0.075     0.547
x3            -0.0618      0.145     -0.427      0.674        -0.365     0.241
const          1.5704      0.633      2.481      0.023         0.245     2.895
==============================================================================
Omnibus:                        6.904   Durbin-Watson:                   1.905
Prob(Omnibus):                  0.032   Jarque-Bera (JB):                4.708
Skew:                          -0.849   Prob(JB):                       0.0950
Kurtosis:                       4.426   Cond. No.                         38.6

pandas provides a convenient way to run OLS, as noted in this answer:

Run an OLS regression with a Pandas DataFrame

Assuming your data is in a DataFrame (df), just run:

import statsmodels.formula.api as smf
lm = smf.ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7', data=df).fit()
print(lm.params)

The intercept term is included by default.

See this notebook for more examples.
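
The fitted results object exposes the usual statsmodels API, so the full table and the fitted values are one call away (a usage sketch, with df the same assumed DataFrame):

print(lm.summary())    # full regression table: coefficients, t-stats, R-squared
print(lm.predict(df))  # fitted values for the rows of df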

sklearn.linear_model.LinearRegression will do it:

from sklearn import linear_model

clf = linear_model.LinearRegression()
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
        [t.y for t in texts])

Then clf.coef_ will have the regression coefficients.

sklearn.linear_model also has similar interfaces to do various kinds of regularized regression, as sketched below.
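
For example, ridge and lasso fits only swap the estimator class. A minimal sketch reusing the same inputs as above; the alpha values are arbitrary placeholders, not recommendations:

from sklearn.linear_model import Ridge, Lasso

X = [[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts]
y = [t.y for t in texts]

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty can zero some out entirely
print(ridge.coef_, lasso.coef_)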

from patsy import dmatrices
import statsmodels.api as sm

# y_data is the name of the dependent variable in your data
y, x = dmatrices("y_data ~ x_1 + x_2", data=my_data)
model_fit = sm.OLS(y, x)
results = model_fit.fit()
print(results.summary())

Instead of sm.OLS you can also use sm.Logit or sm.Probit and so on.
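
A sketch of the same pattern with sm.Logit, continuing the imports above and assuming my_data also had a binary 0/1 outcome column (the y_bin name is invented for illustration):

# y_bin is a hypothetical 0/1 column in my_data
y_bin, x = dmatrices("y_bin ~ x_1 + x_2", data=my_data)
logit_results = sm.Logit(y_bin, x).fit()
print(logit_results.summary())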

def linear(x, y=None, show=True):
    """
    @param x: pd.DataFrame
    @param y: pd.DataFrame or pd.Series or None
              if None, then use last column of x as y
    @param show: if show regression summary
    """
    import pandas as pd
    import statsmodels.api as sm

    # append y (if given) and a constant column, then regress
    # the last column on all the others
    xy = sm.add_constant(x if y is None else pd.concat([x, y], axis=1))
    res = sm.OLS(xy.iloc[:, -1], xy.iloc[:, :-1], missing='drop').fit()

    if show:
        print(res.summary())
    return res
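
Hypothetical usage, letting the last column serve as the response (column names and values invented for the example):

import pandas as pd

df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 5.0],
    "y":  [1.2, 1.9, 3.1, 3.9, 5.0],
})
res = linear(df)   # "y" (the last column) becomes the response
print(res.params)  # const, x1, x2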

Note: this is multiple linear regression, not multivariate linear regression; see the difference:

The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term multivariate linear regression refers to cases where y is a vector, i.e., the same as general linear regression. The difference between multivariate linear regression and multivariable linear regression should be emphasized as it causes much confusion and misunderstanding in the literature.

In short:

multiple linear regression: the response y is a scalar.
multivariate linear regression: the response y is a vector.

(Another source.)

from random import random
from pandas import DataFrame
from statsmodels.api import OLS
lr = lambda: [random() for i in range(100)]
x = DataFrame({'x1': lr(), 'x2':lr(), 'x3':lr()})
x['b'] = 1
y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4

print(x.head())

         x1        x2        x3  b
0  0.433681  0.946723  0.103422  1
1  0.400423  0.527179  0.131674  1
2  0.992441  0.900678  0.360140  1
3  0.413757  0.099319  0.825181  1
4  0.796491  0.862593  0.193554  1

print(y.head())

0    6.637392
1    5.849802
2    7.874218
3    7.087938
4    7.102337
dtype: float64

model = OLS(y, x)
result = model.fit()
print(result.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 5.859e+30
Date:                Wed, 09 Dec 2015   Prob (F-statistic):               0.00
Time:                        15:17:32   Log-Likelihood:                 3224.9
No. Observations:                 100   AIC:                            -6442.
Df Residuals:                      96   BIC:                            -6431.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             1.0000   8.98e-16   1.11e+15      0.000         1.000     1.000
x2             2.0000   8.28e-16   2.41e+15      0.000         2.000     2.000
x3             3.0000   8.34e-16    3.6e+15      0.000         3.000     3.000
b              4.0000   8.51e-16    4.7e+15      0.000         4.000     4.000
==============================================================================
Omnibus:                        7.675   Durbin-Watson:                   1.614
Prob(Omnibus):                  0.022   Jarque-Bera (JB):                3.118
Skew:                           0.045   Prob(JB):                        0.210
Kurtosis:                       2.140   Cond. No.                         6.89
==============================================================================
