Efficiently take min of a column with other column as the key
up vote
1
down vote
favorite
I recently transitioned over from using Presto to Hive. I have the following scenario. Column A, B, C. I want to aggregate on A and find the value of B for which the value of C is minimized. In presto you can something like this as
SELECT A, min_by(B, C) from <TABLE> GROUP BY A
Now I want to do the same thing in Hive. But unfortunately I couldn't find a UDF similar to this anywhere in the documentation. Now I know I can do the following
SELECT A, COALESCE(B, 0)
from <TABLE> as primary
JOIN (
SELECT A, MIN(C) as C FROM <TABLE> GROUP BY A
) secondary
ON primary.A = secondary.A AND primary.C = secondary.C
GROUP BY A
I have 2 problems with this solution
- It's not concise at all.
- It's not efficient either. I am doing an extra subquery resulting and an extra aggregation and an extra JOIN. It would be nice to have a first class aggregation support for such a function.
Is there a way to achieve what I am trying to do without writing your custom UDF ?
hive hiveql
add a comment |
up vote
1
down vote
favorite
I recently transitioned over from using Presto to Hive. I have the following scenario. Column A, B, C. I want to aggregate on A and find the value of B for which the value of C is minimized. In presto you can something like this as
SELECT A, min_by(B, C) from <TABLE> GROUP BY A
Now I want to do the same thing in Hive. But unfortunately I couldn't find a UDF similar to this anywhere in the documentation. Now I know I can do the following
SELECT A, COALESCE(B, 0)
from <TABLE> as primary
JOIN (
SELECT A, MIN(C) as C FROM <TABLE> GROUP BY A
) secondary
ON primary.A = secondary.A AND primary.C = secondary.C
GROUP BY A
I have 2 problems with this solution
- It's not concise at all.
- It's not efficient either. I am doing an extra subquery resulting and an extra aggregation and an extra JOIN. It would be nice to have a first class aggregation support for such a function.
Is there a way to achieve what I am trying to do without writing your custom UDF ?
hive hiveql
why not just sort your data and take the head instead of selecting for the min? stackoverflow.com/questions/13715044/…
– kpie
Nov 9 at 0:08
@kpie Can you give me an example query of how would I do that ? The way I can think of is little worse than the current one.
– gibraltar
Nov 9 at 0:13
can you post sample input so that we can work on a solution?
– stack0114106
Nov 9 at 4:54
Hive supports standard SQL "analytic functions" (with some specificities) > manual is not usable as tutorial cwiki.apache.org/confluence/display/Hive/… so look for tutorials aboutMIN(x) OVER (PARTITION BY a,b) AS xx
or sthg like that
– Samson Scharfrichter
Nov 9 at 8:00
add a comment |
up vote
1
down vote
favorite
up vote
1
down vote
favorite
I recently transitioned over from using Presto to Hive. I have the following scenario. Column A, B, C. I want to aggregate on A and find the value of B for which the value of C is minimized. In presto you can something like this as
SELECT A, min_by(B, C) from <TABLE> GROUP BY A
Now I want to do the same thing in Hive. But unfortunately I couldn't find a UDF similar to this anywhere in the documentation. Now I know I can do the following
SELECT A, COALESCE(B, 0)
from <TABLE> as primary
JOIN (
SELECT A, MIN(C) as C FROM <TABLE> GROUP BY A
) secondary
ON primary.A = secondary.A AND primary.C = secondary.C
GROUP BY A
I have 2 problems with this solution
- It's not concise at all.
- It's not efficient either. I am doing an extra subquery resulting and an extra aggregation and an extra JOIN. It would be nice to have a first class aggregation support for such a function.
Is there a way to achieve what I am trying to do without writing your custom UDF ?
hive hiveql
I recently transitioned over from using Presto to Hive. I have the following scenario. Column A, B, C. I want to aggregate on A and find the value of B for which the value of C is minimized. In presto you can something like this as
SELECT A, min_by(B, C) from <TABLE> GROUP BY A
Now I want to do the same thing in Hive. But unfortunately I couldn't find a UDF similar to this anywhere in the documentation. Now I know I can do the following
SELECT A, COALESCE(B, 0)
from <TABLE> as primary
JOIN (
SELECT A, MIN(C) as C FROM <TABLE> GROUP BY A
) secondary
ON primary.A = secondary.A AND primary.C = secondary.C
GROUP BY A
I have 2 problems with this solution
- It's not concise at all.
- It's not efficient either. I am doing an extra subquery resulting and an extra aggregation and an extra JOIN. It would be nice to have a first class aggregation support for such a function.
Is there a way to achieve what I am trying to do without writing your custom UDF ?
hive hiveql
hive hiveql
edited Nov 9 at 0:34
Gaurang Shah
2,67311030
2,67311030
asked Nov 9 at 0:04
gibraltar
84341429
84341429
why not just sort your data and take the head instead of selecting for the min? stackoverflow.com/questions/13715044/…
– kpie
Nov 9 at 0:08
@kpie Can you give me an example query of how would I do that ? The way I can think of is little worse than the current one.
– gibraltar
Nov 9 at 0:13
can you post sample input so that we can work on a solution?
– stack0114106
Nov 9 at 4:54
Hive supports standard SQL "analytic functions" (with some specificities) > manual is not usable as tutorial cwiki.apache.org/confluence/display/Hive/… so look for tutorials aboutMIN(x) OVER (PARTITION BY a,b) AS xx
or sthg like that
– Samson Scharfrichter
Nov 9 at 8:00
add a comment |
why not just sort your data and take the head instead of selecting for the min? stackoverflow.com/questions/13715044/…
– kpie
Nov 9 at 0:08
@kpie Can you give me an example query of how would I do that ? The way I can think of is little worse than the current one.
– gibraltar
Nov 9 at 0:13
can you post sample input so that we can work on a solution?
– stack0114106
Nov 9 at 4:54
Hive supports standard SQL "analytic functions" (with some specificities) > manual is not usable as tutorial cwiki.apache.org/confluence/display/Hive/… so look for tutorials aboutMIN(x) OVER (PARTITION BY a,b) AS xx
or sthg like that
– Samson Scharfrichter
Nov 9 at 8:00
why not just sort your data and take the head instead of selecting for the min? stackoverflow.com/questions/13715044/…
– kpie
Nov 9 at 0:08
why not just sort your data and take the head instead of selecting for the min? stackoverflow.com/questions/13715044/…
– kpie
Nov 9 at 0:08
@kpie Can you give me an example query of how would I do that ? The way I can think of is little worse than the current one.
– gibraltar
Nov 9 at 0:13
@kpie Can you give me an example query of how would I do that ? The way I can think of is little worse than the current one.
– gibraltar
Nov 9 at 0:13
can you post sample input so that we can work on a solution?
– stack0114106
Nov 9 at 4:54
can you post sample input so that we can work on a solution?
– stack0114106
Nov 9 at 4:54
Hive supports standard SQL "analytic functions" (with some specificities) > manual is not usable as tutorial cwiki.apache.org/confluence/display/Hive/… so look for tutorials about
MIN(x) OVER (PARTITION BY a,b) AS xx
or sthg like that– Samson Scharfrichter
Nov 9 at 8:00
Hive supports standard SQL "analytic functions" (with some specificities) > manual is not usable as tutorial cwiki.apache.org/confluence/display/Hive/… so look for tutorials about
MIN(x) OVER (PARTITION BY a,b) AS xx
or sthg like that– Samson Scharfrichter
Nov 9 at 8:00
add a comment |
1 Answer
1
active
oldest
votes
up vote
2
down vote
accepted
Join works slower than analytic functions, try this approach without join, and table will be scanned only once:
select s.*
from
(
SELECT A, COALESCE(B, 0) as B, C
min(C) over (partition by A) as min_C
from <TABLE> as primary
)s
where s.C=s.min_C;
If you need min(C) to be calculated by more group columns, add them to the partition BY
clause.
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
2
down vote
accepted
Join works slower than analytic functions, try this approach without join, and table will be scanned only once:
select s.*
from
(
SELECT A, COALESCE(B, 0) as B, C
min(C) over (partition by A) as min_C
from <TABLE> as primary
)s
where s.C=s.min_C;
If you need min(C) to be calculated by more group columns, add them to the partition BY
clause.
add a comment |
up vote
2
down vote
accepted
Join works slower than analytic functions, try this approach without join, and table will be scanned only once:
select s.*
from
(
SELECT A, COALESCE(B, 0) as B, C
min(C) over (partition by A) as min_C
from <TABLE> as primary
)s
where s.C=s.min_C;
If you need min(C) to be calculated by more group columns, add them to the partition BY
clause.
add a comment |
up vote
2
down vote
accepted
up vote
2
down vote
accepted
Join works slower than analytic functions, try this approach without join, and table will be scanned only once:
select s.*
from
(
SELECT A, COALESCE(B, 0) as B, C
min(C) over (partition by A) as min_C
from <TABLE> as primary
)s
where s.C=s.min_C;
If you need min(C) to be calculated by more group columns, add them to the partition BY
clause.
Join works slower than analytic functions, try this approach without join, and table will be scanned only once:
select s.*
from
(
SELECT A, COALESCE(B, 0) as B, C
min(C) over (partition by A) as min_C
from <TABLE> as primary
)s
where s.C=s.min_C;
If you need min(C) to be calculated by more group columns, add them to the partition BY
clause.
edited Nov 9 at 13:21
answered Nov 9 at 13:16
leftjoin
7,61421950
7,61421950
add a comment |
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53217972%2fefficiently-take-min-of-a-column-with-other-column-as-the-key%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
why not just sort your data and take the head instead of selecting for the min? stackoverflow.com/questions/13715044/…
– kpie
Nov 9 at 0:08
@kpie Can you give me an example query of how would I do that ? The way I can think of is little worse than the current one.
– gibraltar
Nov 9 at 0:13
can you post sample input so that we can work on a solution?
– stack0114106
Nov 9 at 4:54
Hive supports standard SQL "analytic functions" (with some specificities) > manual is not usable as tutorial cwiki.apache.org/confluence/display/Hive/… so look for tutorials about
MIN(x) OVER (PARTITION BY a,b) AS xx
or sthg like that– Samson Scharfrichter
Nov 9 at 8:00