Spark SQL with different data sources
Is it possible to create DataFrames from two different sources and perform operations across them?
For example:
df1 = <create from a file or folder in S3>
df2 = <create from a Hive table>
df1.join(df2).where($"df1Key" === $"df2Key")
If this is possible, what are the implications of doing so?
apache-spark amazon-s3 hive apache-spark-sql
asked Nov 9 at 6:19
learninghuman
2 Answers
Yes, it is possible to read from different data sources and perform operations across the resulting DataFrames.
In fact, many applications have exactly this kind of requirement.
df1.join(df2).where($"df1Key" === $"df2Key")
Logically, this is a Cartesian join followed by a filter on the result (the optimizer may still rewrite it into an ordinary join).
df1.join(df2, $"df1Key" === $"df2Key")
This passes the join condition directly and should produce the same output.
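As a rough check of the claim that both forms return the same rows, the sketch below builds two small in-memory DataFrames as stand-ins for df1 and df2 and compares the results. The sample data, the column names `left` and `right`, the `JoinForms` object and the local session settings are all assumptions made for illustration, not part of the original answer. Calling `explain()` on each form prints the physical plan, which is one way to see whether Spark really executes a Cartesian product for the first form.

```scala
import org.apache.spark.sql.SparkSession

object JoinForms {
  // Returns (number of matching rows, whether both forms gave the same rows).
  def demo(): (Long, Boolean) = {
    // Local session for illustration only; a real job would use the cluster's session.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-forms-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the two DataFrames in the question.
    val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("df1Key", "left")
    val df2 = Seq((2, "x"), (3, "y"), (4, "z")).toDF("df2Key", "right")

    // Form 1: join without a condition, then filter.
    val filtered = df1.join(df2).where($"df1Key" === $"df2Key")
    // Form 2: join condition supplied directly.
    val joined = df1.join(df2, $"df1Key" === $"df2Key")

    // Print the physical plans so the two forms can be compared by eye.
    filtered.explain()
    joined.explain()

    val a = filtered.collect().toSet
    val b = joined.collect().toSet
    spark.stop()
    (a.size.toLong, a == b)
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

With this sample data only the keys 2 and 3 appear on both sides, so each form should return two rows.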
answered Nov 9 at 7:53
undefined_variable
A DataFrame is a source-independent abstraction. I would encourage you to read the original paper on RDDs and the wiki.
The abstraction keeps track of the location of the data and the underlying DAG of operations. The DataFrame API adds a schema on top of an RDD.
You can create a DataFrame from any source, and all of them are homogenized to expose the same API. The DataFrame API provides a DataFrameReader interface, backed by a data source API that any underlying source can implement in order to create a DataFrame on top of it. Here is another example, the Cassandra connector for DataFrames.
One caveat is that the speed of data retrieval can vary between sources. For example, operations on a DataFrame backed by HDFS will probably be faster than the same operations on a DataFrame backed by S3. Nonetheless, you will be able to perform any join on DataFrames created from different sources.
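To make the source independence concrete, here is a minimal sketch of joining a file-backed DataFrame with a catalog table. So that it runs locally, a temporary directory stands in for the S3 location and a temporary view stands in for the Hive table. The `MixedSources` object, the `joinSources` helper, the view name `df2_table` and the sample data are assumptions for illustration. Against real sources you would pass something like an `s3a://` path (with the S3 connector configured) and build the session with `enableHiveSupport()`; the join itself stays the same.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object MixedSources {
  // Joins a file-backed DataFrame with a catalog table on the key columns from the question.
  def joinSources(spark: SparkSession, path: String, table: String): DataFrame = {
    import spark.implicits._
    val fromFiles = spark.read.parquet(path) // could be an S3 path in a real job
    val fromTable = spark.table(table)       // a Hive table when Hive support is enabled
    fromFiles.join(fromTable, $"df1Key" === $"df2Key")
  }

  // Builds local stand-ins for both sources and returns the number of joined rows.
  def demo(): Long = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("mixed-sources-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for S3: Parquet files in a fresh temporary directory.
    val dir = Files.createTempDirectory("df1-").resolve("data").toString
    Seq((1, "a"), (2, "b")).toDF("df1Key", "left").write.parquet(dir)

    // Stand-in for a Hive table: a temporary view registered in the session catalog.
    Seq((2, "x"), (3, "y")).toDF("df2Key", "right").createOrReplaceTempView("df2_table")

    val result = joinSources(spark, dir, "df2_table")
    result.show()
    val n = result.count()
    spark.stop()
    n
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

Once both DataFrames exist, nothing in `joinSources` depends on where the data came from; only the two read calls would change for other sources.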
answered Nov 9 at 9:20
Avishek Bhattacharya