Efficiently take min of a column with other column as the key











up vote
1
down vote

favorite












I recently transitioned over from using Presto to Hive. I have the following scenario. Column A, B, C. I want to aggregate on A and find the value of B for which the value of C is minimized. In presto you can something like this as

SELECT A, min_by(B, C) from <TABLE> GROUP BY A

Now I want to do the same thing in Hive. But unfortunately I couldn't find a UDF similar to this anywhere in the documentation. Now I know I can do the following



SELECT A, COALESCE(B, 0)
from <TABLE> as primary
JOIN (
SELECT A, MIN(C) as C FROM <TABLE> GROUP BY A
) secondary
ON primary.A = secondary.A AND primary.C = secondary.C
GROUP BY A


I have 2 problems with this solution




  1. It's not concise at all.

  2. It's not efficient either. I am doing an extra subquery resulting and an extra aggregation and an extra JOIN. It would be nice to have a first class aggregation support for such a function.


Is there a way to achieve what I am trying to do without writing your custom UDF ?










share|improve this question
























  • why not just sort your data and take the head instead of selecting for the min? stackoverflow.com/questions/13715044/…
    – kpie
    Nov 9 at 0:08










  • @kpie Can you give me an example query of how would I do that ? The way I can think of is little worse than the current one.
    – gibraltar
    Nov 9 at 0:13










  • can you post sample input so that we can work on a solution?
    – stack0114106
    Nov 9 at 4:54










  • Hive supports standard SQL "analytic functions" (with some specificities) > manual is not usable as tutorial cwiki.apache.org/confluence/display/Hive/… so look for tutorials about MIN(x) OVER (PARTITION BY a,b) AS xx or sthg like that
    – Samson Scharfrichter
    Nov 9 at 8:00

















up vote
1
down vote

favorite












I recently transitioned over from using Presto to Hive. I have the following scenario. Column A, B, C. I want to aggregate on A and find the value of B for which the value of C is minimized. In presto you can something like this as

SELECT A, min_by(B, C) from <TABLE> GROUP BY A

Now I want to do the same thing in Hive. But unfortunately I couldn't find a UDF similar to this anywhere in the documentation. Now I know I can do the following



SELECT A, COALESCE(B, 0)
from <TABLE> as primary
JOIN (
SELECT A, MIN(C) as C FROM <TABLE> GROUP BY A
) secondary
ON primary.A = secondary.A AND primary.C = secondary.C
GROUP BY A


I have 2 problems with this solution




  1. It's not concise at all.

  2. It's not efficient either. I am doing an extra subquery resulting and an extra aggregation and an extra JOIN. It would be nice to have a first class aggregation support for such a function.


Is there a way to achieve what I am trying to do without writing your custom UDF ?










share|improve this question
























  • why not just sort your data and take the head instead of selecting for the min? stackoverflow.com/questions/13715044/…
    – kpie
    Nov 9 at 0:08










  • @kpie Can you give me an example query of how would I do that ? The way I can think of is little worse than the current one.
    – gibraltar
    Nov 9 at 0:13










  • can you post sample input so that we can work on a solution?
    – stack0114106
    Nov 9 at 4:54










  • Hive supports standard SQL "analytic functions" (with some specificities) > manual is not usable as tutorial cwiki.apache.org/confluence/display/Hive/… so look for tutorials about MIN(x) OVER (PARTITION BY a,b) AS xx or sthg like that
    – Samson Scharfrichter
    Nov 9 at 8:00















up vote
1
down vote

favorite









up vote
1
down vote

favorite











I recently transitioned over from using Presto to Hive. I have the following scenario. Column A, B, C. I want to aggregate on A and find the value of B for which the value of C is minimized. In presto you can something like this as

SELECT A, min_by(B, C) from <TABLE> GROUP BY A

Now I want to do the same thing in Hive. But unfortunately I couldn't find a UDF similar to this anywhere in the documentation. Now I know I can do the following



SELECT A, COALESCE(B, 0)
from <TABLE> as primary
JOIN (
SELECT A, MIN(C) as C FROM <TABLE> GROUP BY A
) secondary
ON primary.A = secondary.A AND primary.C = secondary.C
GROUP BY A


I have 2 problems with this solution




  1. It's not concise at all.

  2. It's not efficient either. I am doing an extra subquery resulting and an extra aggregation and an extra JOIN. It would be nice to have a first class aggregation support for such a function.


Is there a way to achieve what I am trying to do without writing your custom UDF ?










share|improve this question















I recently transitioned over from using Presto to Hive. I have the following scenario. Column A, B, C. I want to aggregate on A and find the value of B for which the value of C is minimized. In presto you can something like this as

SELECT A, min_by(B, C) from <TABLE> GROUP BY A

Now I want to do the same thing in Hive. But unfortunately I couldn't find a UDF similar to this anywhere in the documentation. Now I know I can do the following



SELECT A, COALESCE(B, 0)
from <TABLE> as primary
JOIN (
SELECT A, MIN(C) as C FROM <TABLE> GROUP BY A
) secondary
ON primary.A = secondary.A AND primary.C = secondary.C
GROUP BY A


I have 2 problems with this solution




  1. It's not concise at all.

  2. It's not efficient either. I am doing an extra subquery resulting and an extra aggregation and an extra JOIN. It would be nice to have a first class aggregation support for such a function.


Is there a way to achieve what I am trying to do without writing your custom UDF ?







hive hiveql






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 9 at 0:34









Gaurang Shah

2,67311030




2,67311030










asked Nov 9 at 0:04









gibraltar

84341429




84341429












  • why not just sort your data and take the head instead of selecting for the min? stackoverflow.com/questions/13715044/…
    – kpie
    Nov 9 at 0:08










  • @kpie Can you give me an example query of how would I do that ? The way I can think of is little worse than the current one.
    – gibraltar
    Nov 9 at 0:13










  • can you post sample input so that we can work on a solution?
    – stack0114106
    Nov 9 at 4:54










  • Hive supports standard SQL "analytic functions" (with some specificities) > manual is not usable as tutorial cwiki.apache.org/confluence/display/Hive/… so look for tutorials about MIN(x) OVER (PARTITION BY a,b) AS xx or sthg like that
    – Samson Scharfrichter
    Nov 9 at 8:00




















  • why not just sort your data and take the head instead of selecting for the min? stackoverflow.com/questions/13715044/…
    – kpie
    Nov 9 at 0:08










  • @kpie Can you give me an example query of how would I do that ? The way I can think of is little worse than the current one.
    – gibraltar
    Nov 9 at 0:13










  • can you post sample input so that we can work on a solution?
    – stack0114106
    Nov 9 at 4:54










  • Hive supports standard SQL "analytic functions" (with some specificities) > manual is not usable as tutorial cwiki.apache.org/confluence/display/Hive/… so look for tutorials about MIN(x) OVER (PARTITION BY a,b) AS xx or sthg like that
    – Samson Scharfrichter
    Nov 9 at 8:00


















why not just sort your data and take the head instead of selecting for the min? stackoverflow.com/questions/13715044/…
– kpie
Nov 9 at 0:08




why not just sort your data and take the head instead of selecting for the min? stackoverflow.com/questions/13715044/…
– kpie
Nov 9 at 0:08












@kpie Can you give me an example query of how would I do that ? The way I can think of is little worse than the current one.
– gibraltar
Nov 9 at 0:13




@kpie Can you give me an example query of how would I do that ? The way I can think of is little worse than the current one.
– gibraltar
Nov 9 at 0:13












can you post sample input so that we can work on a solution?
– stack0114106
Nov 9 at 4:54




can you post sample input so that we can work on a solution?
– stack0114106
Nov 9 at 4:54












Hive supports standard SQL "analytic functions" (with some specificities) > manual is not usable as tutorial cwiki.apache.org/confluence/display/Hive/… so look for tutorials about MIN(x) OVER (PARTITION BY a,b) AS xx or sthg like that
– Samson Scharfrichter
Nov 9 at 8:00






Hive supports standard SQL "analytic functions" (with some specificities) > manual is not usable as tutorial cwiki.apache.org/confluence/display/Hive/… so look for tutorials about MIN(x) OVER (PARTITION BY a,b) AS xx or sthg like that
– Samson Scharfrichter
Nov 9 at 8:00














1 Answer
1






active

oldest

votes

















up vote
2
down vote



accepted










Join works slower than analytic functions, try this approach without join, and table will be scanned only once:



select s.*
from
(
SELECT A, COALESCE(B, 0) as B, C
min(C) over (partition by A) as min_C
from <TABLE> as primary
)s
where s.C=s.min_C;


If you need min(C) to be calculated by more group columns, add them to the partition BY clause.






share|improve this answer























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














     

    draft saved


    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53217972%2fefficiently-take-min-of-a-column-with-other-column-as-the-key%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    2
    down vote



    accepted










    Join works slower than analytic functions, try this approach without join, and table will be scanned only once:



    select s.*
    from
    (
    SELECT A, COALESCE(B, 0) as B, C
    min(C) over (partition by A) as min_C
    from <TABLE> as primary
    )s
    where s.C=s.min_C;


    If you need min(C) to be calculated by more group columns, add them to the partition BY clause.






    share|improve this answer



























      up vote
      2
      down vote



      accepted










      Join works slower than analytic functions, try this approach without join, and table will be scanned only once:



      select s.*
      from
      (
      SELECT A, COALESCE(B, 0) as B, C
      min(C) over (partition by A) as min_C
      from <TABLE> as primary
      )s
      where s.C=s.min_C;


      If you need min(C) to be calculated by more group columns, add them to the partition BY clause.






      share|improve this answer

























        up vote
        2
        down vote



        accepted







        up vote
        2
        down vote



        accepted






        Join works slower than analytic functions, try this approach without join, and table will be scanned only once:



        select s.*
        from
        (
        SELECT A, COALESCE(B, 0) as B, C
        min(C) over (partition by A) as min_C
        from <TABLE> as primary
        )s
        where s.C=s.min_C;


        If you need min(C) to be calculated by more group columns, add them to the partition BY clause.






        share|improve this answer














        Join works slower than analytic functions, try this approach without join, and table will be scanned only once:



        select s.*
        from
        (
        SELECT A, COALESCE(B, 0) as B, C
        min(C) over (partition by A) as min_C
        from <TABLE> as primary
        )s
        where s.C=s.min_C;


        If you need min(C) to be calculated by more group columns, add them to the partition BY clause.







        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited Nov 9 at 13:21

























        answered Nov 9 at 13:16









        leftjoin

        7,61421950




        7,61421950






























             

            draft saved


            draft discarded



















































             


            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53217972%2fefficiently-take-min-of-a-column-with-other-column-as-the-key%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Schultheiß

            Verwaltungsgliederung Dänemarks

            Liste der Kulturdenkmale in Wilsdruff