How to find max using pyspark fold operation in following example?
I am new to PySpark and Python. Please help me with this problem: I need to find the max value using fold, either with operator.gt or by writing my own lambda function.
The following code, which I wrote, throws an error saying that the RDD cannot be indexed. I understand the error, but how do I pass each of the values 1, 2, 0, 3, compare it with 0, and find the max?
Here 0 is my accumulator value, and 1, 2, 0, 3 are the current values each time.
I am trying to convert a Scala program that explained fold to Python.
Expected answer: ('d', 3)
from pyspark import SparkContext
from operator import gt

def main():
    sc = SparkContext("local", "test")
    data = sc.parallelize([('a', 1), ('b', 2), ('c', 0), ('d', 3)])
    #dummy = ('dummy', 0)
    maxVal = data.fold(0, lambda acc, a : gt(acc, a[1])).collect()
    print(maxVal)

if __name__ == '__main__':
    main()
python scala apache-spark pyspark
Do you understand what lambdas are and how fold works? Another hint: What you need is not actually a simple maximum (although it will involve calculating a maximum), because you don't need only the maximal value but the whole row containing the value.
– Bernhard Stadler
Nov 9 at 11:03
edited Nov 9 at 10:45
Ali AzG
asked Nov 9 at 10:00
Kumkum Sharma
1 Answer
Use a neutral value (one that can be merged an arbitrary number of times without changing the final result) that suits the particular operation and matches the type of the data. The function should be (T, T) => T or, in Python conventions, Callable[[T, T], T]. For max by value it makes sense to use float("-Inf") and a dummy key:
zero = (None, float("-Inf"))
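To see why the zero value has to be neutral, here is a rough pure-Python sketch that mimics how fold applies the zero value once per partition and once more when merging the partial results. It does not use Spark at all: simulated_fold and the three partition layouts are made up for illustration, and the real partitioning depends on your cluster and data.

```python
from functools import partial, reduce
from operator import itemgetter

data = [('a', 1), ('b', 2), ('c', 0), ('d', 3)]
zero = (None, float("-inf"))
op = partial(max, key=itemgetter(1))


def simulated_fold(partitions, zero, op):
    """Hypothetical stand-in for RDD.fold: fold every partition starting
    from zero, then fold the per-partition results starting from zero again."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)


# Assumed partition layouts, including empty partitions.
layouts = [
    [data],                          # everything in one partition
    [data[:2], data[2:]],            # two partitions
    [[], data[:1], data[1:], []],    # some partitions are empty
]

results = [simulated_fold(layout, zero, op) for layout in layouts]
print(results)
```

Because (None, float("-inf")) never wins a comparison against a real row, the result is the same for every layout, no matter how many times the zero value is merged in.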
To reduce, use max with key:
from functools import partial
from operator import itemgetter
op = partial(max, key=itemgetter(1))
Combined:
data.fold(zero, op)
('d', 3)
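Since the question asked about operator.gt or a hand-written lambda, here is a minimal sketch of an equivalent combining function. The name keep_larger is my own, and functools.reduce stands in for fold so the snippet runs without a SparkContext. The key point is that the function returns a whole row rather than the boolean produced by gt, so the accumulator keeps the same type as the data.

```python
from functools import reduce
from operator import gt

data = [('a', 1), ('b', 2), ('c', 0), ('d', 3)]
zero = (None, float("-inf"))


def keep_larger(acc, row):
    """Return whichever (key, value) pair has the larger value.

    Both arguments and the return value are pairs, so the function has
    the (T, T) => T shape that fold expects. On a tie the accumulator wins.
    """
    return row if gt(row[1], acc[1]) else acc


# Local stand-in for fold; reduce starts from the zero value.
result = reduce(keep_larger, data, zero)
print(result)
```

With a live RDD the same function should work as data.fold(zero, keep_larger). Note that fold is an action that returns a plain Python value, so there is nothing to call .collect() on afterwards.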
Of course, in practice you can just use RDD.max:
data.max(key=itemgetter(1))
('d', 3)
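One practical difference between the two approaches shows up on empty input. The sketch below uses only plain Python (fold_max and plain_max are hypothetical helper names, and reduce and the built-in max stand in for the RDD methods): a fold with a zero value hands back the zero value, while a plain max has nothing to return. I would expect RDD.max to fail on an empty RDD in a similar way, but that is an assumption worth checking on your Spark version.

```python
from functools import partial, reduce
from operator import itemgetter

zero = (None, float("-inf"))
op = partial(max, key=itemgetter(1))


def fold_max(rows):
    """Fold-style maximum: starts from zero, so empty input yields zero."""
    return reduce(op, rows, zero)


def plain_max(rows):
    """Built-in maximum by value: raises ValueError on empty input."""
    return max(rows, key=itemgetter(1))


rows = [('a', 1), ('b', 2), ('c', 0), ('d', 3)]
print(fold_max(rows), plain_max(rows))

print(fold_max([]))
try:
    plain_max([])
except ValueError as exc:
    print("plain max failed:", exc)
```

If the dummy key None could leak into later processing, it is worth checking for it explicitly after the fold.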
It's working fine. Thanks.
– Kumkum Sharma
Nov 9 at 11:39
You're welcome. Could you accept the answer?
– user10465355
Nov 12 at 23:27
edited Nov 9 at 11:16
answered Nov 9 at 11:01
user10465355