Postgres time update on large table

I am working with a very large (5.9 billion rows) table on AWS (RDS m4.large), and I am having some trouble getting the data to a point where I can work with and query it effectively.



The data has a time variable in seconds from the beginning of the record (type: bigint), but I would like to be able to aggregate the data at the daily level (i.e. GROUP BY day).



I began this process by simply creating a day column and updating the table to populate it by converting seconds to days:



ALTER TABLE tbl
ADD COLUMN day INTEGER;

UPDATE tbl
SET day = tbl.eventtime / 86400 + 1;


which worked fine in a testing environment (the first 10 days of the 90-day record), but, perhaps unexpectedly, it is not working so well with the full 5.9-billion-row dataset.



After a few tries of this process I realized that my previously created indexes were slowing things down considerably (the update queries never finished). I have since dropped all indexes and am trying the update again. Now, 20 hours later, the query is still running, and I wonder whether this approach is woefully inappropriate for the amount of data I'm working with.



I realize another approach would be to convert the time in seconds to some sort of Postgres timestamp type, but I am unfamiliar with timestamps and am not sure where I would begin.
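For what it's worth, below is a minimal sketch of what I imagine that conversion could look like. Since the time is an offset from the start of the record rather than a Unix epoch, this assumes a known start timestamp for the recording period; the '2018-01-01 00:00:00' value is only a placeholder:

-- Hypothetical start of the recording period; the real start timestamp would go here.
SELECT TIMESTAMP '2018-01-01 00:00:00' + eventtime * INTERVAL '1 second' AS event_ts
FROM tbl
LIMIT 10;

Grouping by day would then presumably use date_trunc('day', ...) on that timestamp, or a cast to date.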



Is performing an update on a table this large feasible? Is there a different approach that would be more efficient for getting the second-resolution data to a point where I can aggregate it by day?



Please let me know if any additional information is needed.

postgresql amazon-web-services amazon-rds large-data






asked Nov 7 at 15:31 by JordanL
edited Nov 8 at 9:37 by Laurenz Albe

1 Answer

          Indexes slow down data modifications a lot, but you cannot search a large table efficiently without them.



          So the art here is to have as few indexes as possible.



          Your idea with an additional column is a bad one. Not only will it bloat the already large table with redundant data, it will also not speed up searches unless you create an index.



          If you need to search all rows for a certain day, you could create an index on the corresponding expression:



          CREATE INDEX ON tbl ((eventtime / 86400 + 1));


          Then any search that uses this expression can use the index, and you don't have to store additional data in the table.
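For illustration, a query of this shape should be able to use that index, because its WHERE clause matches the indexed expression exactly (42 stands in for whichever day you are interested in):

-- Filter on the same expression that was indexed so the planner can use the index.
SELECT *
FROM tbl
WHERE eventtime / 86400 + 1 = 42;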



          This is only useful for querying the data. If you want to aggregate efficiently, an index won't help you much. In that case you should use a materialized view, perhaps one that you keep up to date with a trigger.
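A minimal sketch of such a materialized view, assuming a simple per-day row count is the aggregate you need; the names tbl_daily and event_count are just placeholders:

-- Precompute the daily aggregate once instead of scanning 5.9 billion rows per query.
CREATE MATERIALIZED VIEW tbl_daily AS
SELECT eventtime / 86400 + 1 AS day,
       count(*) AS event_count
FROM tbl
GROUP BY eventtime / 86400 + 1;

-- Re-run after the underlying data changes, or keep the view current another way.
REFRESH MATERIALIZED VIEW tbl_daily;

Daily queries would then read from tbl_daily instead of scanning the base table.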






answered Nov 8 at 9:34 by Laurenz Albe