Imbalanced-Learn Random Over Sampler Removing Columns
I'm training a multi-label classifier to predict 'codes' for specific comments. My training set has one column with text and another with a list of codes (1 to 3 per row), which I am trying to predict.
When I run:
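    from sklearn.preprocessing import MultiLabelBinarizer
    from imblearn.over_sampling import RandomOverSampler

    # Turn each row's list of codes into a 0/1 indicator row (one column per code).
    multilabel_binarizer = MultiLabelBinarizer()
    multilabel_binarizer.fit(df.Code)
    Y = multilabel_binarizer.transform(df.Code)

    # X (the feature matrix) is defined elsewhere.
    # Note: fit_sample was renamed fit_resample in later imbalanced-learn releases.
    ros = RandomOverSampler(random_state=42)
    X_resampled, Y_resampled = ros.fit_sample(X, Y)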
`Y` has a shape of (12000, 168), but `Y_resampled` has a shape of (150000, 166). I've looked through the source code, but I can't figure out why columns are disappearing. Any suggestions would be helpful.
Thank you!
python scikit-learn classification text-classification oversampling
Can you add some data to reproduce the problem?
– seralouk
Nov 8 at 21:05
You say you have a list of codes in y, from 1 to 3, so why does `Y` have a shape of `(12000, 168)`? Why are there 168 columns in it? Why are you using `MultiLabelBinarizer` on it? According to the `RandomOverSampler` documentation, it supports only a 1-d `y`, so why are you supplying a 2-d `Y` to it?
– Vivek Kumar
Nov 9 at 12:24
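The point about a 1-d `y` suggests one possible explanation for the missing columns. The sketch below is only an illustration under an assumption that is not confirmed in this thread: that a 2-d indicator target gets collapsed to one class label per row and later re-encoded. It uses plain NumPy on toy data and does not reproduce imbalanced-learn's actual code path.

```python
import numpy as np

# Toy multi-label indicator matrix: 5 rows, 4 possible codes.
Y = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
])

# Assumed collapse step: keep one label per row.
# argmax returns the index of the FIRST active column in each row.
labels_1d = Y.argmax(axis=1)

# Columns 2 and 3 are never the first active column, so they never
# appear as a label.
surviving = np.unique(labels_1d)

# Re-encoding the 1-d labels as one-hot yields fewer columns than Y had.
Y_roundtrip = (labels_1d[:, None] == surviving[None, :]).astype(int)

print(Y.shape, "->", Y_roundtrip.shape)  # (5, 4) -> (5, 2)
```

If something like this happens, any code that only ever appears alongside a lower-indexed code in the same row would lose its column, which would be consistent with 168 columns shrinking to 166.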
Sorry for the confusion, I wasn't very clear. There are 168 unique codes, but up to 3 exist for each row. MultiLabelBinarizer was used to create a 'dummy-variable-esque' matrix of which codes were present in each row. So Y is really an n-by-168 matrix, with each row containing up to three 1's and at least 165 0's.
– gthom
Nov 14 at 18:47
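For an indicator matrix like the one described, one possible workaround is to resample row indices rather than passing the 2-d `Y` to the sampler, so every column is kept. The following is a minimal sketch, not something proposed in this thread: the helper name `oversample_label_sets` is hypothetical, it is not part of imbalanced-learn, and it balances whole label combinations rather than individual codes.

```python
from collections import defaultdict

import numpy as np


def oversample_label_sets(Y, random_state=42):
    """Return row indices that balance distinct label combinations."""
    rng = np.random.default_rng(random_state)

    # Group row positions by their full label combination.
    groups = defaultdict(list)
    for i, row in enumerate(Y):
        groups[tuple(row)].append(i)

    # Every combination is topped up to the size of the largest group.
    target = max(len(rows) for rows in groups.values())

    picked = []
    for rows in groups.values():
        picked.extend(rows)  # keep every original row
        extra = target - len(rows)
        if extra:
            # Duplicate randomly chosen rows from the same group.
            picked.extend(rng.choice(rows, size=extra, replace=True))
    return np.array(picked, dtype=int)


# Toy data: 5 rows, 4 possible codes, 3 distinct label combinations.
Y = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
])
X = np.arange(len(Y)).reshape(-1, 1)  # stand-in for real features

idx = oversample_label_sets(Y)
X_resampled, Y_resampled = X[idx], Y[idx]
print(Y_resampled.shape)  # (9, 4): more rows, same number of columns
```

Because `X` and `Y` are indexed with the same `idx`, rows stay aligned and the column count of `Y` cannot change. With 168 codes and up to three per row there may be many distinct combinations, so the resampled set could grow large; whether balancing combinations is appropriate depends on the data.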
asked Nov 8 at 16:57
gthom