Spark SQL with different data sources

Is it possible to create DataFrames from two different sources and perform operations on them?

For example,

df1 = <create from a file or folder from S3>

df2 = <create from a hive table>



df1.join(df2).where("df1Key" === "df2Key")

If this is possible, what are the implications of doing so?

asked Nov 9 at 6:19

learninghuman

apache-spark amazon-s3 hive apache-spark-sql

2 Answers

Yes, it is possible to read from different data sources and perform operations on the results.
In fact, many applications have this kind of requirement.

df1.join(df2).where("df1Key" === "df2Key")

As written, this is a Cartesian join followed by a filter on the result.

df1.join(df2,$"df1Key" === $"df2Key")

This should produce the same output.
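To compare the two forms on your own Spark version, here is a minimal sketch. It assumes a local SparkSession and two small in-memory DataFrames standing in for the S3-backed and Hive-backed ones; the object name and the extra columns are made up for illustration. It writes the condition with `$"..."` in both forms, since comparing two plain string literals does not build a column expression.

```scala
import org.apache.spark.sql.SparkSession

object JoinForms {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-forms-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables toDF and the $"col" syntax

    // Hypothetical stand-ins for df1 (S3) and df2 (Hive).
    val df1 = Seq((1, "a"), (2, "b")).toDF("df1Key", "leftValue")
    val df2 = Seq((1, "x"), (3, "y")).toDF("df2Key", "rightValue")

    // Form 1: join without a condition, then filter.
    val filtered = df1.join(df2).where($"df1Key" === $"df2Key")

    // Form 2: pass the condition to the join directly.
    val joined = df1.join(df2, $"df1Key" === $"df2Key")

    // Print the physical plans so the two forms can be compared.
    filtered.explain()
    joined.explain()

    filtered.show()
    joined.show()

    spark.stop()
  }
}
```

Comparing the two `explain()` outputs shows whether the optimizer has folded the filter into the join or left a Cartesian product in the plan.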

answered Nov 9 at 7:53

undefined_variable


A DataFrame is a source-independent abstraction. I would encourage you to read the original paper on RDDs and the wiki.

The abstraction is source independent and keeps track of the location of the data and the underlying DAG of operations. The DataFrame API adds a schema on top of an RDD.

You can create a DataFrame from any source, but they are all homogenized to expose the same API. The DataFrame API provides a DataFrameReader interface that any underlying source can implement to create a DataFrame on top of it. Here is another example: the Cassandra connector for DataFrames.

One caveat is that the speed of data retrieval may vary between sources. For example, if some of your data is in S3 and some is in HDFS, operations on the DataFrame backed by HDFS will probably be faster. Nonetheless, you will be able to perform any join on DataFrames created from different sources.
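As a rough illustration of that source independence, the sketch below reads a CSV dataset and a Parquet dataset through the same DataFrameReader and joins them. It is only a local stand-in for the scenario in the question: the temp directory, column names, and object name are assumptions made up for the example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object MixedSources {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mixed-sources-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical local directory standing in for two separate storage systems.
    val dir = Files.createTempDirectory("mixed-sources").toString

    // Write two small datasets in different formats.
    Seq((1, "alice"), (2, "bob")).toDF("id", "name")
      .write.option("header", "true").csv(s"$dir/users_csv")
    Seq((1, 9.5), (2, 7.0)).toDF("userId", "score")
      .write.parquet(s"$dir/scores_parquet")

    // Read them back through the same reader interface.
    // For the question's case, df1 would instead be read from an s3a:// path,
    // and df2 from spark.table(...) on a session built with enableHiveSupport().
    val users = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(s"$dir/users_csv")
    val scores = spark.read.parquet(s"$dir/scores_parquet")

    // Once loaded, both are ordinary DataFrames and join like any others.
    users.join(scores, $"id" === $"userId").show()

    spark.stop()
  }
}
```

Only the read calls change with the source; the join and everything after it are written the same way.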

answered Nov 9 at 9:20

Avishek Bhattacharya
