Matching and Joining Two Inconsistent DataFrames











I have two DataFrames that are queried from two separate databases. They share some columns, but the values in those columns are not always filled in consistently, and I need a reliable way to join the two together.



As an example:



import pandas as pd

inp = [{'Name':'Jose', 'Age':12,'Location':'Frankfurt','Occupation':'Student','Mothers Name':'Rosy'},
       {'Name':'Katherine','Age':23,'Location':'Maui','Occupation':'Lawyer','Mothers Name':'Amy'},
       {'Name':'Larry','Age':22,'Location':'Dallas','Occupation':'Nurse','Mothers Name':'Monica'}]
df = pd.DataFrame(inp)
print(df)

   Age   Location  Mothers Name       Name  Occupation
0   12  Frankfurt          Rosy       Jose     Student
1   23       Maui           Amy  Katherine      Lawyer
2   22     Dallas        Monica      Larry       Nurse



inp2 = [{'Name': '','Occupation':'Nurse','Favorite Hobby':'Basketball','Mothers Name':'Monica'},
        {'Name':'Jose','Occupation':'','Favorite Hobby':'Sewing','Mothers Name':'Rosy'},
        {'Name':'Katherine','Occupation':'Lawyer','Favorite Hobby':'Reading','Mothers Name':''}]
df2 = pd.DataFrame(inp2)
print(df2)

   Favorite Hobby  Mothers Name       Name  Occupation
0      Basketball        Monica                  Nurse
1          Sewing          Rosy       Jose
2         Reading                Katherine      Lawyer


I need a way to reliably join these two DataFrames even though the data is not always consistent. To complicate things further, the two databases do not always have the same number of rows. Any ideas?










  • so if you can join on any 2 of (mother's name, name, occupation), that's ok?
    – richflow
    Nov 10 at 1:02










  • you'll need to provide more info - show what is the expected end result (an example given your input)
    – adhg
    Nov 10 at 1:23















python pandas






asked Nov 10 at 0:51

jojo.t.c

1 Answer

You can perform the merge on each possible pair of shared columns, concat the resulting DataFrames, and then merge that new DataFrame back onto the first (complete) DataFrame:



# Merge on each pair of the shared columns 'Mothers Name', 'Name', and 'Occupation',
# then concat the results.
new_df = pd.concat([df.merge(df2, on=['Mothers Name', 'Name']),
                    df.merge(df2, on=['Name', 'Occupation']),
                    df.merge(df2, on=['Mothers Name', 'Occupation'])], sort=False)

# Take the first DataFrame, which is complete, merge it with new_df, and drop duplicates.
df.merge(new_df[['Age', 'Location', 'Favorite Hobby']], on=['Age', 'Location']).drop_duplicates()

   Age   Location  Mothers Name       Name  Occupation  Favorite Hobby
0   12  Frankfurt          Rosy       Jose     Student          Sewing
2   23       Maui           Amy  Katherine      Lawyer         Reading
4   22     Dallas        Monica      Larry       Nurse      Basketball


This assumes that each row's combination of Age and Location is unique.
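If Age and Location might not be unique, one possible variation is to tag each row of df with its own id and attach the matches back through that id instead. The sketch below is only an illustration under a few assumptions: missing values in df2 are empty strings (as in the question), `row_id`, `key_cols`, `matches`, and `hobbies` are names made up here, and a match on any two of the three shared columns is good enough. It also skips blank keys explicitly and uses a left merge so that rows of df without a match in df2 are kept.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame([
    {'Name': 'Jose', 'Age': 12, 'Location': 'Frankfurt', 'Occupation': 'Student', 'Mothers Name': 'Rosy'},
    {'Name': 'Katherine', 'Age': 23, 'Location': 'Maui', 'Occupation': 'Lawyer', 'Mothers Name': 'Amy'},
    {'Name': 'Larry', 'Age': 22, 'Location': 'Dallas', 'Occupation': 'Nurse', 'Mothers Name': 'Monica'},
])
df2 = pd.DataFrame([
    {'Name': '', 'Occupation': 'Nurse', 'Favorite Hobby': 'Basketball', 'Mothers Name': 'Monica'},
    {'Name': 'Jose', 'Occupation': '', 'Favorite Hobby': 'Sewing', 'Mothers Name': 'Rosy'},
    {'Name': 'Katherine', 'Occupation': 'Lawyer', 'Favorite Hobby': 'Reading', 'Mothers Name': ''},
])

# Columns present in both frames (assumed to be the only usable join keys).
key_cols = ['Mothers Name', 'Name', 'Occupation']

# Hypothetical helper column: tag each row of df with its position so that
# matches can be attached back without relying on Age/Location being unique.
left = df.reset_index().rename(columns={'index': 'row_id'})

matches = []
for pair in combinations(key_cols, 2):
    pair = list(pair)
    # Ignore df2 rows where either key of this pair is blank.
    right = df2[(df2[pair] != '').all(axis=1)]
    merged = left[['row_id'] + pair].merge(right[pair + ['Favorite Hobby']], on=pair)
    matches.append(merged[['row_id', 'Favorite Hobby']])

# A row matched by more than one pair shows up several times; keep one copy.
hobbies = pd.concat(matches, ignore_index=True).drop_duplicates()

# The left merge keeps every row of df, even those with no match in df2.
result = left.merge(hobbies, on='row_id', how='left').drop(columns='row_id')
print(result)
```

Note that if one row of df matches two different rows of df2 with different hobbies, it will appear twice in the result, so it is worth checking for repeated `row_id` values in `hobbies` before trusting the output.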






edited Nov 10 at 3:36

answered Nov 10 at 3:06

Chris
