Spark SQL with different data sources

Is it possible to create DataFrames from two different sources and perform operations across them?

For example:

    df1 = <create from a file or folder from S3>
    df2 = <create from a Hive table>

    df1.join(df2).where("df1Key" === "df2Key")

If this is possible, what are the implications of doing so?

Tags: apache-spark amazon-s3 hive apache-spark-sql

asked Nov 9 at 6:19 by learninghuman

2 Answers

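Yes, it is possible to read from different data sources and operate on the results together. In fact, many applications need to do exactly that.

    df1.join(df2).where($"df1Key" === $"df2Key")

As written, this is logically a Cartesian join followed by a filter, although Spark's optimizer can usually push an equality predicate like this one into the join. Note that the column names need the $ prefix (or col(...)); comparing the plain strings "df1Key" and "df2Key" does not build a column expression.

    df1.join(df2, $"df1Key" === $"df2Key")

This passes the condition to the join directly and should produce the same output.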
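As a rough check of the claim above, the sketch below builds two small in-memory DataFrames as stand-ins for the S3 and Hive inputs and compares the two join forms. The sample rows, the column names leftVal and rightVal, the object name and the local[*] master are assumptions made for illustration; it also assumes Spark 2.x or later with spark-sql on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object JoinFormsSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; a real job would get its
    // master from spark-submit.
    val spark = SparkSession.builder()
      .appName("join-forms-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the S3-backed and Hive-backed DataFrames.
    val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("df1Key", "leftVal")
    val df2 = Seq((2, "x"), (3, "y"), (4, "z")).toDF("df2Key", "rightVal")

    // Form 1: join without a condition, then filter.
    val filtered = df1.join(df2).where($"df1Key" === $"df2Key")
    // Form 2: condition passed to the join itself.
    val joined = df1.join(df2, $"df1Key" === $"df2Key")

    // Both forms should return the same set of rows.
    assert(filtered.collect().toSet == joined.collect().toSet)

    // Print the physical plans so they can be compared by eye.
    filtered.explain()
    joined.explain()

    spark.stop()
  }
}
```

Comparing the two explain() outputs on your own Spark version is the most reliable way to see whether the filter was folded into the join.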







answered Nov 9 at 7:53 by undefined_variable

A DataFrame is a source-independent abstraction. I would encourage you to read the original paper on RDDs and the wiki.

The abstraction keeps track of the location of the data and the underlying DAG of operations. The DataFrame API adds a schema on top of an RDD.

You can create a DataFrame from any source, and all of them expose the same API. The API provides the DataFrameReader entry point, and any underlying source can plug into it by implementing Spark's data source interface, which lets a DataFrame be created on top of that source. The Cassandra connector for DataFrames is another example.

One caveat is that the speed of data retrieval may vary between sources. For example, if one dataset is in S3 and another is in HDFS, operations on the DataFrame backed by HDFS will probably be faster. Nonetheless, you can perform joins on DataFrames created from different sources.
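To make the scenario from the question concrete, here is a minimal sketch of reading one DataFrame from S3 and one from a Hive table, then joining them. The bucket path, the Parquet format, the table name example_db.customers and the object name are all hypothetical; it also assumes the job runs with Hive support available and with the hadoop-aws (s3a) connector and credentials configured.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object TwoSourcesSketch {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() needs Hive classes on the classpath and a
    // reachable metastore; the master is expected to come from spark-submit.
    val spark = SparkSession.builder()
      .appName("two-sources-sketch")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Hypothetical S3 location; Parquet is an assumed format.
    val df1: DataFrame = spark.read.parquet("s3a://example-bucket/events/")

    // Hypothetical Hive table.
    val df2: DataFrame = spark.table("example_db.customers")

    // Once loaded, both are ordinary DataFrames and join like any others.
    // df1Key and df2Key are the column names used in the question.
    val joined = df1.join(df2, $"df1Key" === $"df2Key")
    joined.show(10)

    spark.stop()
  }
}
```

Nothing in the join itself depends on where each side came from; the source only affects how quickly each side can be read.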







answered Nov 9 at 9:20 by Avishek Bhattacharya
