## Implementing R's scale function in pandas?

Posted 2020-08-26 14:48

What is an efficient pandas equivalent of R's `scale` function? For example, how would

``````
newdf <- scale(df)
``````

be written in pandas? Is there an elegant way using `transform`?

2 answers

I don't know R, but from reading the documentation it looks like the following would do the trick (albeit in a slightly less general way):

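``````
import numpy as np

def scale(y, c=True, sc=True):
    """Mimic R's scale() for a pandas DataFrame."""
    x = y.copy()

    if c:
        # Center each column on its mean.
        x -= x.mean()
    if sc and c:
        # Centered data: divide by the sample standard deviation (n - 1).
        x /= x.std()
    elif sc:
        # Uncentered data: divide by the root mean square, sqrt(sum(x^2) / (n - 1)).
        x /= np.sqrt(x.pow(2).sum().div(x.count() - 1))
    return x
``````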

For the more general version you'd probably need to do some type/length checking.
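For the default case (centering and scaling both on), the function above reduces to a one-liner. The following is a minimal sketch on a made-up DataFrame (the data and column names are assumptions for illustration); it relies on pandas' `std` using the `n - 1` denominator by default, and also shows the same computation written with `transform`, as the question asks.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the column names are made up for illustration.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 80.0]})

# Default behaviour of R's scale(df): subtract each column's mean,
# then divide by its sample standard deviation (n - 1 denominator).
newdf = (df - df.mean()) / df.std()

# The same computation expressed column by column with transform.
newdf_t = df.transform(lambda col: (col - col.mean()) / col.std())

# Each column should now have mean ~0 and sample standard deviation ~1.
print(newdf.mean())
print(newdf.std())
```

The broadcasting form is usually the simpler choice; the `transform` version is only there to show that both routes give the same result.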

EDIT: Added explanation of the denominator in `elif sc:` clause

From the R docs:

``````
... If ‘scale’ is
‘TRUE’ then scaling is done by dividing the (centered) columns of
‘x’ by their standard deviations if ‘center’ is ‘TRUE’, and the
root mean square otherwise.  If ‘scale’ is ‘FALSE’, no scaling is
done.

The root-mean-square for a (possibly centered) column is defined
as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing
values and n is the number of non-missing values.  In the case
‘center = TRUE’, this is the same as the standard deviation, but
in general it is not.
``````

The line `np.sqrt(x.pow(2).sum().div(x.count() - 1))` computes the root mean square from that definition: it squares `x` (the `pow` method), sums the squares down each column, divides by one less than the number of non-`NaN` values in each column (the `count` method), and takes the square root.
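To make the definition concrete, here is a small sketch on a made-up single-column DataFrame with one missing value (the data is an assumption for illustration). It checks that the RMS expression differs from the standard deviation on uncentered data and agrees with it once the column is centered.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value.
x = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0]})

# RMS as defined in the R docs: sqrt(sum(x^2) / (n - 1)) over non-missing values.
# sum() skips NaN by default and count() returns the number of non-NaN entries.
rms = np.sqrt(x.pow(2).sum().div(x.count() - 1))

# Without centering, the RMS is not the standard deviation.
print(rms["a"], x.std()["a"])

# After centering, the same expression matches the standard deviation.
centered = x - x.mean()
rms_centered = np.sqrt(centered.pow(2).sum().div(centered.count() - 1))
print(rms_centered["a"], x.std()["a"])
```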

As a side note, I didn't simply compute the RMS after centering because the `std` method calls `bottleneck` for faster computation in the special case where you want the standard deviation rather than the more general RMS.

You could instead compute the RMS after centering. It might be worth a benchmark: now that I'm writing this, I'm not actually sure which is faster, and I haven't measured it.
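A rough way to run that comparison is sketched below with `timeit`. The frame size, the repeat count, and the helper names `via_std` and `via_rms` are arbitrary choices made for illustration, and the timings depend on the machine and on whether `bottleneck` is installed, so no result is claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Arbitrary random test data, centered up front so both routes are comparable.
rng = np.random.default_rng(0)
x = pd.DataFrame(rng.normal(size=(10_000, 20)))
x = x - x.mean()

def via_std(frame):
    # Scale with the built-in sample standard deviation.
    return frame / frame.std()

def via_rms(frame):
    # Scale with the explicit RMS expression from the answer.
    return frame / np.sqrt(frame.pow(2).sum().div(frame.count() - 1))

t_std = timeit.timeit(lambda: via_std(x), number=20)
t_rms = timeit.timeit(lambda: via_rms(x), number=20)
print(f"std: {t_std:.4f}s  rms: {t_rms:.4f}s")
```

On centered data both helpers return the same values, so any difference is purely one of speed.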

Scaling is very common in machine learning tasks, so it is implemented in scikit-learn's `preprocessing` module. You can pass a pandas DataFrame to its `scale` function.

The only "problem" is that the returned object is no longer a DataFrame but a NumPy array, which is usually not a real issue if you want to pass it to a machine learning model anyway (e.g. SVM or logistic regression). If you want to keep the DataFrame, you need a small workaround:

``````
from sklearn.preprocessing import scale
from pandas import DataFrame

newdf = DataFrame(scale(df), index=df.index, columns=df.columns)
``````
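One caveat worth checking if the goal is to match R exactly: scikit-learn's `scale` divides by the population standard deviation (`ddof=0`), whereas R's `scale` and pandas' `std` use the `n - 1` denominator. The sketch below reproduces both conventions with pandas only (it does not call scikit-learn), on made-up data without missing values, and shows the constant factor that converts one into the other.

```python
import numpy as np
import pandas as pd

# Hypothetical example data without missing values.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 80.0]})
n = len(df)

# R-style scaling: sample standard deviation (ddof=1, the pandas default).
r_style = (df - df.mean()) / df.std()

# Population-style scaling (ddof=0), the convention scikit-learn uses.
pop_style = (df - df.mean()) / df.std(ddof=0)

# The two differ only by the constant factor sqrt((n - 1) / n).
converted = pop_style * np.sqrt((n - 1) / n)
print((converted - r_style).abs().max())
```

For large `n` the factor is close to 1 and rarely matters for model fitting, but it is visible on small data sets.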