How to find max using pyspark fold operation in following example?











I am new to PySpark and Python, so please help me with this problem: I need to find the max value using fold, either with operator.gt or with a lambda function of my own.



The code below, written by me, throws an error saying the RDD cannot be indexed. I understand the error, but how do I compare each value (1, 2, 0, 3) with the accumulator and find the max? Here 0 is my accumulator value and 1, 2, 0, 3 are the current values at each step. I am trying to port to Python a Scala program that explained fold.
Expected answer: ('d', 3)



from pyspark import SparkContext
from operator import gt

def main():
    sc = SparkContext("local", "test")

    data = sc.parallelize([('a', 1), ('b', 2), ('c', 0), ('d', 3)])

    # dummy = ('dummy', 0)

    maxVal = data.fold(0, lambda acc, a: gt(acc, a[1])).collect()

    print(maxVal)


if __name__ == '__main__':
    main()
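For intuition, the same fold logic can be sketched locally with functools.reduce (used here as a simplified stand-in for a single-partition RDD.fold); it shows why the zero value has to have the same shape as the elements being folded:

```python
from functools import reduce

pairs = [('a', 1), ('b', 2), ('c', 0), ('d', 3)]

# fold needs an op of type (T, T) -> T, so the zero value must look like
# an element. A (key, value) zero keeps the types consistent, unlike the
# bare 0 used above.
zero = (None, float("-inf"))

# Keep whichever pair has the larger value.
result = reduce(lambda acc, kv: acc if acc[1] > kv[1] else kv, pairs, zero)
print(result)  # ('d', 3)
```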









  • Do you understand what lambdas are and how fold works? Another hint: What you need is not actually a simple maximum (although it will involve calculating a maximum), because you don't need only the maximal value but the whole row containing the value.
    – Bernhard Stadler
    Nov 9 at 11:03

















python scala apache-spark pyspark

edited Nov 9 at 10:45
Ali AzG

asked Nov 9 at 10:00
Kumkum Sharma

1 Answer
































  • Use a neutral value (one that can be merged an arbitrary number of times without changing the final result) that is suitable for the particular operation and matches the type of the data (the function should be (T, T) => T, or with Python conventions Callable[[T, T], T]). For max by value it makes sense to use float("-Inf") and a dummy key:



    zero = (None, float("-Inf"))



  • To reduce, use max with a key:



    from functools import partial
    from operator import itemgetter

    op = partial(max, key=itemgetter(1))



Combined:



data.fold(zero, op)


('d', 3)


Of course, in practice you can just use RDD.max:



data.max(key=itemgetter(1))


('d', 3)
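As a local sketch (no Spark required, with two hypothetical partitions), the steps above can be combined to show why the zero value must be neutral: fold applies it once per partition and once more when merging the per-partition results.

```python
from functools import partial, reduce
from operator import itemgetter

op = partial(max, key=itemgetter(1))
zero = (None, float("-inf"))

# Emulate RDD.fold over two partitions: fold each partition with the
# zero value, then fold the per-partition results with the zero again.
partitions = [[('a', 1), ('b', 2)], [('c', 0), ('d', 3)]]
per_partition = [reduce(op, part, zero) for part in partitions]
result = reduce(op, per_partition, zero)
print(result)  # ('d', 3)
```

Because the zero is merged multiple times, anything non-neutral (such as a real key with value 0) could distort the result.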





  • It's working fine. Thanks.
    – Kumkum Sharma
    Nov 9 at 11:39










  • You're welcome. Could you accept the answer?
    – user10465355
    Nov 12 at 23:27











edited Nov 9 at 11:16
answered Nov 9 at 11:01
user10465355