Implementing R's scale function in pandas in Python?

Posted 2020-08-26 14:48

What is the efficient equivalent of R's scale function in pandas? E.g.

newdf <- scale(df)

written in pandas? Is there an elegant way using transform?

2 answers
可以哭但决不认输i
#2 · 2020-08-26 15:03

I don't know R, but from reading the documentation it looks like the following would do the trick (albeit in a slightly less general way):

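import numpy as np


def scale(y, c=True, sc=True):
    """Rough pandas analogue of R's scale(): c centers, sc scales."""
    x = y.copy()

    if c:
        # Subtract each column's mean.
        x -= x.mean()
    if sc and c:
        # Centered data: divide by the sample standard deviation (n - 1).
        x /= x.std()
    elif sc:
        # Uncentered data: divide by the root mean square, as R does.
        x /= np.sqrt(x.pow(2).sum().div(x.count() - 1))
    return x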

For the more general version you'd probably need to do some type/length checking.
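As a quick check of the function above, here is a minimal usage sketch. The sample DataFrame and its column names are made up for illustration, and the function is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def scale(y, c=True, sc=True):
    x = y.copy()
    if c:
        x -= x.mean()
    if sc and c:
        x /= x.std()
    elif sc:
        x /= np.sqrt(x.pow(2).sum().div(x.count() - 1))
    return x


# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})

centered_scaled = scale(df)          # intended to mirror R's scale(df)
centered_only = scale(df, sc=False)  # intended to mirror scale(df, scale = FALSE)
scaled_only = scale(df, c=False)     # intended to mirror scale(df, center = FALSE)

# After centering and scaling, each column should have mean ~0 and std ~1.
print(centered_scaled.mean())
print(centered_scaled.std())
```

Note that the input df is left untouched because the function works on a copy.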

EDIT: Added explanation of the denominator in elif sc: clause

From the R docs:

 ... If ‘scale’ is
 ‘TRUE’ then scaling is done by dividing the (centered) columns of
 ‘x’ by their standard deviations if ‘center’ is ‘TRUE’, and the
 root mean square otherwise.  If ‘scale’ is ‘FALSE’, no scaling is
 done.

 The root-mean-square for a (possibly centered) column is defined
 as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing
 values and n is the number of non-missing values.  In the case
 ‘center = TRUE’, this is the same as the standard deviation, but
 in general it is not.

The line np.sqrt(x.pow(2).sum().div(x.count() - 1)) computes the root mean square using that definition: it squares x (the pow method), sums each column over its rows (sum skips NaN by default), divides by the number of non-NaN values in each column minus one (the count method), and finally takes the square root.
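To make the RMS computation concrete, here is a small worked sketch that splits the expression into its steps. The one-column DataFrame with a missing value is a made-up example, not data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value, for illustration only.
x = pd.DataFrame({"a": [1.0, 2.0, np.nan, 3.0]})

sum_sq = x.pow(2).sum()           # NaN is skipped: 1 + 4 + 9 = 14
n = x.count()                     # non-NaN values per column: 3
rms = np.sqrt(sum_sq.div(n - 1))  # sqrt(14 / 2) = sqrt(7)

print(rms)
```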

As a side note, the reason I didn't simply compute the RMS after centering is that the std method can call bottleneck (when it is installed) for faster computation in the special case where you want the standard deviation rather than the more general RMS.

You could instead compute the RMS after centering. It might be worth a benchmark, since now that I'm writing this I'm not actually sure which is faster.
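A rough sketch of such a comparison follows. It first checks that the two routes agree on centered data and then times them with timeit. The random data, its shape, and the repetition count are arbitrary assumptions, and the timings will vary with the machine and with whether bottleneck is installed.

```python
import timeit

import numpy as np
import pandas as pd

# Arbitrary random data, for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 5)), columns=list("abcde"))

centered = df - df.mean()


def rms(frame):
    # R's definition: sqrt(sum(x^2) / (n - 1)) over non-missing values.
    return np.sqrt(frame.pow(2).sum().div(frame.count() - 1))


# On centered data the RMS should match the sample standard deviation.
same = np.allclose(centered.std(), rms(centered))

t_std = timeit.timeit(lambda: centered.std(), number=50)
t_rms = timeit.timeit(lambda: rms(centered), number=50)
print(f"same result: {same}, std: {t_std:.4f}s, rms: {t_rms:.4f}s")
```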

仙女界的扛把子
#3 · 2020-08-26 15:05

Scaling is very common in machine learning tasks, so it is implemented in scikit-learn's preprocessing module. You can pass a pandas DataFrame to its scale function.

The only "problem" is that the returned object is no longer a DataFrame but a NumPy array, which is usually not a real issue if you want to pass it to a machine learning model anyway (e.g. SVM or logistic regression). If you want to keep the DataFrame, you need a small workaround:

from sklearn.preprocessing import scale
from pandas import DataFrame

newdf = DataFrame(scale(df), index=df.index, columns=df.columns)
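One caveat if the goal is to match R exactly: scikit-learn documents its scaling as using the biased standard deviation (equivalent to ddof=0), whereas R's scale and pandas' std default to the sample standard deviation (n - 1). The pandas-only sketch below illustrates the size of that difference on made-up data. It does not call scikit-learn, and the sk_like name is just a stand-in for the ddof=0 convention.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0]})

centered = df - df.mean()
r_like = centered / df.std()         # ddof=1, the R / pandas default
sk_like = centered / df.std(ddof=0)  # ddof=0, the convention assumed for scikit-learn

# The two results differ by a constant factor of sqrt(n / (n - 1)).
n = len(df)
factor = np.sqrt(n / (n - 1))
print(np.allclose(sk_like, r_like * factor))
```

For large n the factor is close to 1, so the difference mostly matters on small datasets.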

See also here.
