Efficiently take min of a column with other column as the key

up vote
1
down vote

favorite

I recently transitioned over from using Presto to Hive. I have the following scenario. Column A, B, C. I want to aggregate on A and find the value of B for which the value of C is minimized. In presto you can something like this as
SELECT A, min_by(B, C) from <TABLE> GROUP BY A
Now I want to do the same thing in Hive. But unfortunately I couldn't find a UDF similar to this anywhere in the documentation. Now I know I can do the following

SELECT A, COALESCE(B, 0)

from <TABLE> as primary

JOIN (

SELECT A, MIN(C) as C FROM <TABLE> GROUP BY A 

) secondary

ON primary.A = secondary.A AND primary.C = secondary.C

GROUP BY A

I have 2 problems with this solution

It's not concise at all.

It's not efficient either. I am doing an extra subquery resulting and an extra aggregation and an extra JOIN. It would be nice to have a first class aggregation support for such a function.

Is there a way to achieve what I am trying to do without writing your custom UDF ?

edited Nov 9 at 0:34

Gaurang Shah

2,67311030

asked Nov 9 at 0:04

gibraltar

84341429

why not just sort your data and take the head instead of selecting for the min? stackoverflow.com/questions/13715044/…
– kpie
Nov 9 at 0:08

@kpie Can you give me an example query of how would I do that ? The way I can think of is little worse than the current one.
– gibraltar
Nov 9 at 0:13

can you post sample input so that we can work on a solution?
– stack0114106
Nov 9 at 4:54

Hive supports standard SQL "analytic functions" (with some specificities) > manual is not usable as tutorial cwiki.apache.org/confluence/display/Hive/… so look for tutorials about MIN(x) OVER (PARTITION BY a,b) AS xx or sthg like that
– Samson Scharfrichter
Nov 9 at 8:00

add a comment |

up vote
1
down vote

favorite

SELECT A, COALESCE(B, 0)

from <TABLE> as primary

JOIN (

SELECT A, MIN(C) as C FROM <TABLE> GROUP BY A 

) secondary

ON primary.A = secondary.A AND primary.C = secondary.C

GROUP BY A

I have 2 problems with this solution

It's not concise at all.

It's not efficient either. I am doing an extra subquery resulting and an extra aggregation and an extra JOIN. It would be nice to have a first class aggregation support for such a function.

Is there a way to achieve what I am trying to do without writing your custom UDF ?

edited Nov 9 at 0:34

Gaurang Shah

2,67311030

asked Nov 9 at 0:04

gibraltar

84341429

why not just sort your data and take the head instead of selecting for the min? stackoverflow.com/questions/13715044/…
– kpie
Nov 9 at 0:08

@kpie Can you give me an example query of how would I do that ? The way I can think of is little worse than the current one.
– gibraltar
Nov 9 at 0:13

can you post sample input so that we can work on a solution?
– stack0114106
Nov 9 at 4:54

Hive supports standard SQL "analytic functions" (with some specificities) > manual is not usable as tutorial cwiki.apache.org/confluence/display/Hive/… so look for tutorials about MIN(x) OVER (PARTITION BY a,b) AS xx or sthg like that
– Samson Scharfrichter
Nov 9 at 8:00

add a comment |

up vote
1
down vote

favorite

SELECT A, COALESCE(B, 0)

from <TABLE> as primary

JOIN (

SELECT A, MIN(C) as C FROM <TABLE> GROUP BY A 

) secondary

ON primary.A = secondary.A AND primary.C = secondary.C

GROUP BY A

I have 2 problems with this solution

It's not concise at all.

It's not efficient either. I am doing an extra subquery resulting and an extra aggregation and an extra JOIN. It would be nice to have a first class aggregation support for such a function.

Is there a way to achieve what I am trying to do without writing your custom UDF ?

edited Nov 9 at 0:34

Gaurang Shah

2,67311030

asked Nov 9 at 0:04

gibraltar

84341429

SELECT A, COALESCE(B, 0)

from <TABLE> as primary

JOIN (

SELECT A, MIN(C) as C FROM <TABLE> GROUP BY A 

) secondary

ON primary.A = secondary.A AND primary.C = secondary.C

GROUP BY A

I have 2 problems with this solution

It's not concise at all.

It's not efficient either. I am doing an extra subquery resulting and an extra aggregation and an extra JOIN. It would be nice to have a first class aggregation support for such a function.

Is there a way to achieve what I am trying to do without writing your custom UDF ?

hive hiveql

edited Nov 9 at 0:34

Gaurang Shah

2,67311030

asked Nov 9 at 0:04

gibraltar

84341429

edited Nov 9 at 0:34

Gaurang Shah

2,67311030

asked Nov 9 at 0:04

gibraltar

84341429

edited Nov 9 at 0:34

Gaurang Shah

2,67311030

edited Nov 9 at 0:34

Gaurang Shah

2,67311030

edited Nov 9 at 0:34

Gaurang Shah

2,67311030

asked Nov 9 at 0:04

gibraltar

84341429

asked Nov 9 at 0:04

gibraltar

84341429

asked Nov 9 at 0:04

gibraltar

84341429

why not just sort your data and take the head instead of selecting for the min? stackoverflow.com/questions/13715044/…
– kpie
Nov 9 at 0:08

@kpie Can you give me an example query of how would I do that ? The way I can think of is little worse than the current one.
– gibraltar
Nov 9 at 0:13

can you post sample input so that we can work on a solution?
– stack0114106
Nov 9 at 4:54

Hive supports standard SQL "analytic functions" (with some specificities) > manual is not usable as tutorial cwiki.apache.org/confluence/display/Hive/… so look for tutorials about MIN(x) OVER (PARTITION BY a,b) AS xx or sthg like that
– Samson Scharfrichter
Nov 9 at 8:00

add a comment |

why not just sort your data and take the head instead of selecting for the min? stackoverflow.com/questions/13715044/…
– kpie
Nov 9 at 0:08

@kpie Can you give me an example query of how would I do that ? The way I can think of is little worse than the current one.
– gibraltar
Nov 9 at 0:13

can you post sample input so that we can work on a solution?
– stack0114106
Nov 9 at 4:54

Hive supports standard SQL "analytic functions" (with some specificities) > manual is not usable as tutorial cwiki.apache.org/confluence/display/Hive/… so look for tutorials about MIN(x) OVER (PARTITION BY a,b) AS xx or sthg like that
– Samson Scharfrichter
Nov 9 at 8:00

why not just sort your data and take the head instead of selecting for the min? stackoverflow.com/questions/13715044/…
– kpie
Nov 9 at 0:08

@kpie Can you give me an example query of how would I do that ? The way I can think of is little worse than the current one.
– gibraltar
Nov 9 at 0:13

can you post sample input so that we can work on a solution?
– stack0114106
Nov 9 at 4:54

Hive supports standard SQL "analytic functions" (with some specificities) > manual is not usable as tutorial cwiki.apache.org/confluence/display/Hive/… so look for tutorials about MIN(x) OVER (PARTITION BY a,b) AS xx or sthg like that
– Samson Scharfrichter
Nov 9 at 8:00

add a comment |

1 Answer
1

active

oldest

votes

up vote
2
down vote

accepted

Join works slower than analytic functions, try this approach without join, and table will be scanned only once:

select s.*

from

(

SELECT A, COALESCE(B, 0) as B, C

      min(C) over (partition by A) as min_C

from <TABLE> as primary

)s

where s.C=s.min_C;

If you need min(C) to be calculated by more group columns, add them to the partition BY clause.

edited Nov 9 at 13:21

answered Nov 9 at 13:16

leftjoin

7,61421950

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53217972%2fefficiently-take-min-of-a-column-with-other-column-as-the-key%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
2
down vote

accepted

Join works slower than analytic functions, try this approach without join, and table will be scanned only once:

select s.*

from

(

SELECT A, COALESCE(B, 0) as B, C

      min(C) over (partition by A) as min_C

from <TABLE> as primary

)s

where s.C=s.min_C;

If you need min(C) to be calculated by more group columns, add them to the partition BY clause.

edited Nov 9 at 13:21

answered Nov 9 at 13:16

leftjoin

7,61421950

add a comment |

up vote
2
down vote

accepted

Join works slower than analytic functions, try this approach without join, and table will be scanned only once:

select s.*

from

(

SELECT A, COALESCE(B, 0) as B, C

      min(C) over (partition by A) as min_C

from <TABLE> as primary

)s

where s.C=s.min_C;

If you need min(C) to be calculated by more group columns, add them to the partition BY clause.

edited Nov 9 at 13:21

answered Nov 9 at 13:16

leftjoin

7,61421950

add a comment |

up vote
2
down vote

accepted

Join works slower than analytic functions, try this approach without join, and table will be scanned only once:

select s.*

from

(

SELECT A, COALESCE(B, 0) as B, C

      min(C) over (partition by A) as min_C

from <TABLE> as primary

)s

where s.C=s.min_C;

If you need min(C) to be calculated by more group columns, add them to the partition BY clause.

edited Nov 9 at 13:21

answered Nov 9 at 13:16

leftjoin

7,61421950

Join works slower than analytic functions, try this approach without join, and table will be scanned only once:

select s.*

from

(

SELECT A, COALESCE(B, 0) as B, C

      min(C) over (partition by A) as min_C

from <TABLE> as primary

)s

where s.C=s.min_C;

If you need min(C) to be calculated by more group columns, add them to the partition BY clause.

edited Nov 9 at 13:21

answered Nov 9 at 13:16

leftjoin

7,61421950

edited Nov 9 at 13:21

answered Nov 9 at 13:16

leftjoin

7,61421950

answered Nov 9 at 13:16

leftjoin

7,61421950

answered Nov 9 at 13:16

leftjoin

7,61421950

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Xtykutl