How to find max using pyspark fold operation in following example?











I am new to PySpark and Python, so please help me with this problem: I need to find the max value using fold, either with operator.gt or with a lambda function of my own.



The code below, written by me, throws an error saying the RDD cannot be indexed. I understand the error, but how do I compare each value (1, 2, 0, 3) with the accumulator and find the max? Here 0 is my accumulator value and 1, 2, 0, 3 are the current values at each step. I am trying to port to Python a Scala program that explained fold.
Expected answer: ('d', 3)



from pyspark import SparkContext
from operator import gt

def main():
    sc = SparkContext("local", "test")

    data = sc.parallelize([('a', 1), ('b', 2), ('c', 0), ('d', 3)])

    # dummy = ('dummy', 0)

    maxVal = data.fold(0, lambda acc, a: gt(acc, a[1])).collect()

    print(maxVal)


if __name__ == '__main__':
    main()
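For intuition, the same fold logic can be sketched locally with functools.reduce (used here as a simplified stand-in for a single-partition RDD.fold); it shows why the zero value has to have the same shape as the elements being folded:

```python
from functools import reduce

pairs = [('a', 1), ('b', 2), ('c', 0), ('d', 3)]

# fold needs an op of type (T, T) -> T, so the zero value must look like
# an element. A (key, value) zero keeps the types consistent, unlike the
# bare 0 used above.
zero = (None, float("-inf"))

# Keep whichever pair has the larger value.
result = reduce(lambda acc, kv: acc if acc[1] > kv[1] else kv, pairs, zero)
print(result)  # ('d', 3)
```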









  • Do you understand what lambdas are and how fold works? Another hint: What you need is not actually a simple maximum (although it will involve calculating a maximum), because you don't need only the maximal value but the whole row containing the value.
    – Bernhard Stadler
    Nov 9 at 11:03

















python scala apache-spark pyspark

edited Nov 9 at 10:45
Ali AzG

asked Nov 9 at 10:00
Kumkum Sharma

1 Answer
































  • Use a neutral value (one that can be merged an arbitrary number of times without changing the final result) that is suitable for the particular operation and matches the type of the data (the function should be (T, T) => T, or with Python conventions Callable[[T, T], T]). For max by value it makes sense to use float("-Inf") and a dummy key:



    zero = (None, float("-Inf"))



  • To reduce, use max with a key:



    from functools import partial
    from operator import itemgetter

    op = partial(max, key=itemgetter(1))



Combined:



data.fold(zero, op)


('d', 3)


Of course, in practice you can just use RDD.max:



data.max(key=itemgetter(1))


('d', 3)
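As a local sketch (no Spark required, with two hypothetical partitions), the steps above can be combined to show why the zero value must be neutral: fold applies it once per partition and once more when merging the per-partition results.

```python
from functools import partial, reduce
from operator import itemgetter

op = partial(max, key=itemgetter(1))
zero = (None, float("-inf"))

# Emulate RDD.fold over two partitions: fold each partition with the
# zero value, then fold the per-partition results with the zero again.
partitions = [[('a', 1), ('b', 2)], [('c', 0), ('d', 3)]]
per_partition = [reduce(op, part, zero) for part in partitions]
result = reduce(op, per_partition, zero)
print(result)  # ('d', 3)
```

Because the zero is merged multiple times, anything non-neutral (such as a real key with value 0) could distort the result.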





  • It's working fine. Thanks.
    – Kumkum Sharma
    Nov 9 at 11:39










  • You're welcome. Could you accept the answer?
    – user10465355
    Nov 12 at 23:27











edited Nov 9 at 11:16
answered Nov 9 at 11:01
user10465355