Apply TfidfVectorizer in every row of dataframe that is a list of lists











up vote
0
down vote

favorite












I have a pandas dataframe containing 2 columns and I want to use sklearn TfidfVectorizer for text-classification in one of them. However this column is a list of lists and TFIDF wants raw input as text. In this question they provide a solution in case we have just one list of lists, but I would like to ask how it would be possible to apply this function in every single row of my dataframe, which row contains a list of lists. Thank you in advance.






Input:

0 [[this, is, the], [first, row], [of, dataframe]]
1 [[that, is, the], [second], [row, of, dataframe]]
2 [[etc], [etc, etc]]





Wanted Output:



0    ['this is the', 'first row', 'of dataframe']
1 ['that is the', 'second', 'row of dataframe']
2 ['etc', 'etc etc']











share|improve this question
























  • Could you add some sample input?
    – Daniel Mesejo
    Nov 8 at 10:06










  • I updated the question Daniel
    – joasa
    Nov 8 at 10:15















up vote
0
down vote

favorite












I have a pandas dataframe containing 2 columns and I want to use sklearn TfidfVectorizer for text-classification in one of them. However this column is a list of lists and TFIDF wants raw input as text. In this question they provide a solution in case we have just one list of lists, but I would like to ask how it would be possible to apply this function in every single row of my dataframe, which row contains a list of lists. Thank you in advance.






Input:

0 [[this, is, the], [first, row], [of, dataframe]]
1 [[that, is, the], [second], [row, of, dataframe]]
2 [[etc], [etc, etc]]





Wanted Output:



0    ['this is the', 'first row', 'of dataframe']
1 ['that is the', 'second', 'row of dataframe']
2 ['etc', 'etc etc']











share|improve this question
























  • Could you add some sample input?
    – Daniel Mesejo
    Nov 8 at 10:06










  • I updated the question Daniel
    – joasa
    Nov 8 at 10:15













up vote
0
down vote

favorite









up vote
0
down vote

favorite











I have a pandas dataframe containing 2 columns and I want to use sklearn TfidfVectorizer for text-classification in one of them. However this column is a list of lists and TFIDF wants raw input as text. In this question they provide a solution in case we have just one list of lists, but I would like to ask how it would be possible to apply this function in every single row of my dataframe, which row contains a list of lists. Thank you in advance.






Input:

0 [[this, is, the], [first, row], [of, dataframe]]
1 [[that, is, the], [second], [row, of, dataframe]]
2 [[etc], [etc, etc]]





Wanted Output:



0    ['this is the', 'first row', 'of dataframe']
1 ['that is the', 'second', 'row of dataframe']
2 ['etc', 'etc etc']











share|improve this question















I have a pandas dataframe containing 2 columns and I want to use sklearn TfidfVectorizer for text-classification in one of them. However this column is a list of lists and TFIDF wants raw input as text. In this question they provide a solution in case we have just one list of lists, but I would like to ask how it would be possible to apply this function in every single row of my dataframe, which row contains a list of lists. Thank you in advance.






Input:

0 [[this, is, the], [first, row], [of, dataframe]]
1 [[that, is, the], [second], [row, of, dataframe]]
2 [[etc], [etc, etc]]





Wanted Output:



0    ['this is the', 'first row', 'of dataframe']
1 ['that is the', 'second', 'row of dataframe']
2 ['etc', 'etc etc']







Input:

0 [[this, is, the], [first, row], [of, dataframe]]
1 [[that, is, the], [second], [row, of, dataframe]]
2 [[etc], [etc, etc]]





Input:

0 [[this, is, the], [first, row], [of, dataframe]]
1 [[that, is, the], [second], [row, of, dataframe]]
2 [[etc], [etc, etc]]






python list dataframe tfidfvectorizer






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 8 at 10:14

























asked Nov 8 at 10:04









joasa

168116




168116












  • Could you add some sample input?
    – Daniel Mesejo
    Nov 8 at 10:06










  • I updated the question Daniel
    – joasa
    Nov 8 at 10:15


















  • Could you add some sample input?
    – Daniel Mesejo
    Nov 8 at 10:06










  • I updated the question Daniel
    – joasa
    Nov 8 at 10:15
















Could you add some sample input?
– Daniel Mesejo
Nov 8 at 10:06




Could you add some sample input?
– Daniel Mesejo
Nov 8 at 10:06












I updated the question Daniel
– joasa
Nov 8 at 10:15




I updated the question Daniel
– joasa
Nov 8 at 10:15












1 Answer
1






active

oldest

votes

















up vote
1
down vote



accepted










You could use apply:



import pandas as pd

df = pd.DataFrame(data=[[[['this', 'is', 'the'], ['first', 'row'], ['of', 'dataframe']]],
[[['that', 'is', 'the'], ['second'], ['row', 'of', 'dataframe']]]],
columns=['paragraphs'])


df['result'] = df['paragraphs'].apply(lambda xs: [' '.join(x) for x in xs])
print(df['result'])


Output



0     [this is the, first row, of dataframe]
1 [that is the, second, row of dataframe]
Name: result, dtype: object


Further, if you want to apply the vectorizer in conjunction with the above function you could do something like this:



def vectorize(xs, vectorizer=TfidfVectorizer(min_df=1, stop_words="english")):
text = [' '.join(x) for x in xs]
return vectorizer.fit_transform(text)


df['vectors'] = df['paragraphs'].apply(vectorize)
print(df['vectors'].values)





share|improve this answer





















  • Is this result normal? [<10x17 sparse matrix of type '<class 'numpy.float64'>' ` with 19 stored elements in Compressed Sparse Row format>` <644x855 sparse matrix of type '<class 'numpy.float64'>' with 3092 stored elements in Compressed Sparse Row format>
    – joasa
    Nov 8 at 10:40








  • 1




    @joasa That is because vectorizer.fit_transform returns a sparse matrix, by applying it to each cell you get a column of sparse matrices.
    – Daniel Mesejo
    Nov 8 at 10:43











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














 

draft saved


draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53205421%2fapply-tfidfvectorizer-in-every-row-of-dataframe-that-is-a-list-of-lists%23new-answer', 'question_page');
}
);

Post as a guest
































1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
1
down vote



accepted










You could use apply:



import pandas as pd

df = pd.DataFrame(data=[[[['this', 'is', 'the'], ['first', 'row'], ['of', 'dataframe']]],
[[['that', 'is', 'the'], ['second'], ['row', 'of', 'dataframe']]]],
columns=['paragraphs'])


df['result'] = df['paragraphs'].apply(lambda xs: [' '.join(x) for x in xs])
print(df['result'])


Output



0     [this is the, first row, of dataframe]
1 [that is the, second, row of dataframe]
Name: result, dtype: object


Further, if you want to apply the vectorizer in conjunction with the above function you could do something like this:



def vectorize(xs, vectorizer=TfidfVectorizer(min_df=1, stop_words="english")):
text = [' '.join(x) for x in xs]
return vectorizer.fit_transform(text)


df['vectors'] = df['paragraphs'].apply(vectorize)
print(df['vectors'].values)





share|improve this answer





















  • Is this result normal? [<10x17 sparse matrix of type '<class 'numpy.float64'>' ` with 19 stored elements in Compressed Sparse Row format>` <644x855 sparse matrix of type '<class 'numpy.float64'>' with 3092 stored elements in Compressed Sparse Row format>
    – joasa
    Nov 8 at 10:40








  • 1




    @joasa That is because vectorizer.fit_transform returns a sparse matrix, by applying it to each cell you get a column of sparse matrices.
    – Daniel Mesejo
    Nov 8 at 10:43















up vote
1
down vote



accepted










You could use apply:



import pandas as pd

df = pd.DataFrame(data=[[[['this', 'is', 'the'], ['first', 'row'], ['of', 'dataframe']]],
[[['that', 'is', 'the'], ['second'], ['row', 'of', 'dataframe']]]],
columns=['paragraphs'])


df['result'] = df['paragraphs'].apply(lambda xs: [' '.join(x) for x in xs])
print(df['result'])


Output



0     [this is the, first row, of dataframe]
1 [that is the, second, row of dataframe]
Name: result, dtype: object


Further, if you want to apply the vectorizer in conjunction with the above function you could do something like this:



def vectorize(xs, vectorizer=TfidfVectorizer(min_df=1, stop_words="english")):
text = [' '.join(x) for x in xs]
return vectorizer.fit_transform(text)


df['vectors'] = df['paragraphs'].apply(vectorize)
print(df['vectors'].values)





share|improve this answer





















  • Is this result normal? [<10x17 sparse matrix of type '<class 'numpy.float64'>' ` with 19 stored elements in Compressed Sparse Row format>` <644x855 sparse matrix of type '<class 'numpy.float64'>' with 3092 stored elements in Compressed Sparse Row format>
    – joasa
    Nov 8 at 10:40








  • 1




    @joasa That is because vectorizer.fit_transform returns a sparse matrix, by applying it to each cell you get a column of sparse matrices.
    – Daniel Mesejo
    Nov 8 at 10:43













up vote
1
down vote



accepted







up vote
1
down vote



accepted






You could use apply:



import pandas as pd

df = pd.DataFrame(data=[[[['this', 'is', 'the'], ['first', 'row'], ['of', 'dataframe']]],
[[['that', 'is', 'the'], ['second'], ['row', 'of', 'dataframe']]]],
columns=['paragraphs'])


df['result'] = df['paragraphs'].apply(lambda xs: [' '.join(x) for x in xs])
print(df['result'])


Output



0     [this is the, first row, of dataframe]
1 [that is the, second, row of dataframe]
Name: result, dtype: object


Further, if you want to apply the vectorizer in conjunction with the above function you could do something like this:



def vectorize(xs, vectorizer=TfidfVectorizer(min_df=1, stop_words="english")):
text = [' '.join(x) for x in xs]
return vectorizer.fit_transform(text)


df['vectors'] = df['paragraphs'].apply(vectorize)
print(df['vectors'].values)





share|improve this answer












You could use apply:



import pandas as pd

df = pd.DataFrame(data=[[[['this', 'is', 'the'], ['first', 'row'], ['of', 'dataframe']]],
[[['that', 'is', 'the'], ['second'], ['row', 'of', 'dataframe']]]],
columns=['paragraphs'])


df['result'] = df['paragraphs'].apply(lambda xs: [' '.join(x) for x in xs])
print(df['result'])


Output



0     [this is the, first row, of dataframe]
1 [that is the, second, row of dataframe]
Name: result, dtype: object


Further, if you want to apply the vectorizer in conjunction with the above function you could do something like this:



def vectorize(xs, vectorizer=TfidfVectorizer(min_df=1, stop_words="english")):
text = [' '.join(x) for x in xs]
return vectorizer.fit_transform(text)


df['vectors'] = df['paragraphs'].apply(vectorize)
print(df['vectors'].values)






share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 8 at 10:19









Daniel Mesejo

7,7191821




7,7191821












  • Is this result normal? [<10x17 sparse matrix of type '<class 'numpy.float64'>' ` with 19 stored elements in Compressed Sparse Row format>` <644x855 sparse matrix of type '<class 'numpy.float64'>' with 3092 stored elements in Compressed Sparse Row format>
    – joasa
    Nov 8 at 10:40








  • 1




    @joasa That is because vectorizer.fit_transform returns a sparse matrix, by applying it to each cell you get a column of sparse matrices.
    – Daniel Mesejo
    Nov 8 at 10:43


















  • Is this result normal? [<10x17 sparse matrix of type '<class 'numpy.float64'>' ` with 19 stored elements in Compressed Sparse Row format>` <644x855 sparse matrix of type '<class 'numpy.float64'>' with 3092 stored elements in Compressed Sparse Row format>
    – joasa
    Nov 8 at 10:40








  • 1




    @joasa That is because vectorizer.fit_transform returns a sparse matrix, by applying it to each cell you get a column of sparse matrices.
    – Daniel Mesejo
    Nov 8 at 10:43
















Is this result normal? [<10x17 sparse matrix of type '<class 'numpy.float64'>' ` with 19 stored elements in Compressed Sparse Row format>` <644x855 sparse matrix of type '<class 'numpy.float64'>' with 3092 stored elements in Compressed Sparse Row format>
– joasa
Nov 8 at 10:40






Is this result normal? [<10x17 sparse matrix of type '<class 'numpy.float64'>' ` with 19 stored elements in Compressed Sparse Row format>` <644x855 sparse matrix of type '<class 'numpy.float64'>' with 3092 stored elements in Compressed Sparse Row format>
– joasa
Nov 8 at 10:40






1




1




@joasa That is because vectorizer.fit_transform returns a sparse matrix, by applying it to each cell you get a column of sparse matrices.
– Daniel Mesejo
Nov 8 at 10:43




@joasa That is because vectorizer.fit_transform returns a sparse matrix, by applying it to each cell you get a column of sparse matrices.
– Daniel Mesejo
Nov 8 at 10:43


















 

draft saved


draft discarded



















































 


draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53205421%2fapply-tfidfvectorizer-in-every-row-of-dataframe-that-is-a-list-of-lists%23new-answer', 'question_page');
}
);

Post as a guest




















































































Popular posts from this blog

Schultheiß

Verwaltungsgliederung Dänemarks

Liste der Kulturdenkmale in Wilsdruff