Postgres time update on large table
I am working with a very large (5.9 billion rows) table on AWS (RDS m4.large), and I am having some trouble getting the data to a point where I can work with and query it effectively.
The data has a time variable in seconds from the beginning of the record (type: bigint), but I would like to be able to aggregate the data at the daily level (i.e. GROUP BY day).
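Concretely, the kind of query I eventually want to run is a daily rollup along these lines, once a day column (or an equivalent expression) exists; count(*) just stands in for whatever aggregates I actually need:

-- intended daily aggregation (sketch only)
SELECT day, count(*)
FROM tbl
GROUP BY day
ORDER BY day;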
I began this process by simply creating a day column, and updating the table to populate the day field by converting seconds to days:
ALTER TABLE tbl
ADD COLUMN day INTEGER;

UPDATE tbl
SET day = tbl.eventtime / 86400 + 1;
which worked fine in a testing environment (the first 10 days of the 90-day record) but, perhaps not unexpectedly, is not working so well on the full 5.9-billion-row dataset.
After a few attempts at this I realized that my previously created indexes were slowing the process down considerably (the update queries never finished). I have since dropped all indexes and am trying the update again. Twenty hours later the query is still running, and I wonder whether this approach is simply inappropriate for the amount of data I'm working with.
I realize another approach would be to convert the time in seconds to some sort of Postgres timestamp type, but I am unfamiliar with timestamps and am not sure where to begin.
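For example (and this is only a sketch, since I haven't tried it), if the start of the record corresponded to a known timestamp, I could presumably compute a timestamp or a calendar day on the fly, something like:

-- '2018-01-01 00:00:00' is purely a placeholder for the record's actual start time
SELECT TIMESTAMP '2018-01-01 00:00:00' + eventtime * INTERVAL '1 second' AS event_ts,
       (TIMESTAMP '2018-01-01 00:00:00' + eventtime * INTERVAL '1 second')::date AS event_day
FROM tbl
LIMIT 10;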
Is performing an update on a table this large feasible? Is there a different approach that would be more efficient to get the second-resolution data to a point where I could aggregate it by day?
Please let me know if any additional information is needed.
postgresql amazon-web-services amazon-rds large-data
edited Nov 8 at 9:37 by Laurenz Albe
asked Nov 7 at 15:31 by JordanL (new contributor)
1 Answer
Indexes slow down data modifications a lot, but you cannot search a large table efficiently without them.
So the art here is to have as few indexes as possible.
Your idea with an additional column is a bad one. Not only will it bloat the already large table with redundant data, it will also not speed up searches unless you create an index.
If you need to search all rows for a certain day, you could create an index on the corresponding expression:
CREATE INDEX ON tbl ((eventtime / 86400 + 1));
Then any search that uses this expression can use the index, and you don't have to store additional data in the table.
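For example, a day-level lookup written against the same expression should be able to use that index (the day number 42 below is just a placeholder):

-- the WHERE clause matches the indexed expression, so the planner can use the index
SELECT count(*)
FROM tbl
WHERE eventtime / 86400 + 1 = 42;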
This is only useful for querying the data. If you want to aggregate efficiently, an index won't help you much. In that case you should use a materialized view, perhaps one that you keep up to date with a trigger.
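A minimal sketch of such a materialized view might look like this (the name tbl_daily and the count(*) aggregate are placeholders for whatever you actually need):

-- precomputed daily rollup
CREATE MATERIALIZED VIEW tbl_daily AS
SELECT eventtime / 86400 + 1 AS day,
       count(*) AS row_count
FROM tbl
GROUP BY eventtime / 86400 + 1;

-- re-run whenever the base table changes (manually, on a schedule, or from a trigger)
REFRESH MATERIALIZED VIEW tbl_daily;

Querying tbl_daily then avoids scanning the 5.9 billion rows for every daily aggregate.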
answered Nov 8 at 9:34 by Laurenz Albe