Scraping name list with varying numbers of last names












I'm trying to scrape the Swedish members of parliament with Beautiful Soup. When I run the scraper I get "ValueError: too many values to unpack (expected 3)".

The script outputs a CSV, but with only five names. The sixth person on the list is named Alm Ericson, Janine (MP). I suppose the problem is that she has two last names, Alm Ericson, and the code only expects three values: last name, first name and party.

How should I code the field split so that it also works for double last names?



The names on the page are written as



Last_name, first_name (party)
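
As a quick illustration of the suspected cause (a sketch, not part of the original script), splitting on whitespace gives four tokens for a double last name instead of three. The first sample string is a made-up placeholder; the second is the entry from the page:

```python
# The first entry is a hypothetical placeholder with a single last name;
# the second is the entry that makes the script fail.
samples = ["Lastname, Firstname (X)", "Alm Ericson, Janine (MP)"]

for item in samples:
    tokens = item.split()  # splits on every run of whitespace
    print(len(tokens), tokens)

# Number of whitespace-separated tokens per entry.
token_counts = [len(item.split()) for item in samples]
```

Unpacking four tokens into three variables is what raises the ValueError.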


Code:



import urllib.request
import bs4 as bs
import csv

source = urllib.request.urlopen("https://www.riksdagen.se/sv/ledamoter-partier/").read()
soup = bs.BeautifulSoup(source, "lxml")

data = []

for span in soup.find_all("span", {"class": "fellow-name"}):
    cleanednames = span.text.strip()
    data.append(cleanednames)  # fields are appended to a list rather than printed

with open("riksdagsledamoter.csv", "w") as stream:
    fieldnames = ["Last_Name", "First_Name", "Party"]
    var = csv.DictWriter(stream, fieldnames=fieldnames)
    var.writeheader()
    for item in data:
        last_name, First_name, party = item.split()  # splitting data into 3 fields (this line raises the ValueError)
        last_name = last_name.replace(",", "")  # removing ',' from last name
        party = party.replace("(", "").replace(")", "")  # removing "()" from party
        var.writerow({"Last_Name": last_name, "First_Name": First_name, "Party": party})  # writing one csv row









  • Since it's not Beautiful Soup related, you should change the tag and the question. There is a lot of noise in this code.
    – BlueSheepToken
    Nov 8 at 11:16















python






edited Nov 8 at 11:33

























asked Nov 8 at 10:59









Harald

3 Answers
Score: 2 (accepted)










Here is a simple regex that should do the trick:

import re
# The parentheses around the party are escaped so that they match literally.
print(re.match(r"(.*), (.*) \((.*)\)", 'Alm Ericson, Janine (MP)').groups())

Inspired by Corentin's answer.
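
As for where this could go in the question's script, here is a minimal sketch of the CSV-writing loop with the regex in place of `item.split()`. The names `NAME_PATTERN`, `write_members` and `skipped` are hypothetical, the pattern is written with the literal parentheses escaped, and the `data` list is a stand-in for the scraped strings so the example runs without network access:

```python
import csv
import io
import re

# Last name(s), first name, then the party inside literal parentheses.
NAME_PATTERN = re.compile(r"(.*), (.*) \((.*)\)")


def write_members(names, stream):
    """Write one CSV row per name string; return the strings that did not match."""
    writer = csv.DictWriter(stream, fieldnames=["Last_Name", "First_Name", "Party"])
    writer.writeheader()
    skipped = []
    for item in names:
        match = NAME_PATTERN.match(item)
        if match is None:
            skipped.append(item)  # unexpected format: keep it for inspection
            continue
        last_name, first_name, party = match.groups()
        writer.writerow(
            {"Last_Name": last_name, "First_Name": first_name, "Party": party}
        )
    return skipped


# Stand-in for the scraped `data` list; the second entry is deliberately malformed.
data = ["Alm Ericson, Janine (MP)", "not a name"]
buffer = io.StringIO()
skipped = write_members(data, buffer)
print(buffer.getvalue())
print("skipped:", skipped)
```

In the original script, `names` would be the list built from the `fellow-name` spans and `stream` the file opened for "riksdagsledamoter.csv". Collecting unmatched strings instead of raising is a design choice for this sketch; raising an error would work just as well if every entry is expected to follow the format.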






  • Aha, okay. I'm very new to Python and coding in general; could you show me where in the code I should put that?
    – Harald
    Nov 8 at 11:33

  • Thanks, edited!
    – Corentin Limier
    Nov 8 at 11:52


















Score: 4













Well, obviously splitting on whitespace is not a good solution here (or you should split on the comma and the parenthesis instead of on spaces).

Using a regex:

import re
re.match(r'([^,]*), ([^(]*) \((.*)\)', 'Alm Ericson, Janine (MP)').groups()

Returns

('Alm Ericson', 'Janine', 'MP')
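
One possible refinement, sketched here under the assumption that every entry follows the "Last_name, first_name (party)" format: named groups make the result self-describing, and `fullmatch` rejects strings with trailing text. The names `MEMBER_PATTERN` and `parse_member` are hypothetical and not part of the answer:

```python
import re

# The same character classes as above, with named groups added and the
# literal parentheses escaped.
MEMBER_PATTERN = re.compile(r"(?P<last>[^,]*), (?P<first>[^(]*) \((?P<party>.*)\)")


def parse_member(text):
    """Return a dict with last name, first name and party, or None if no match."""
    match = MEMBER_PATTERN.fullmatch(text.strip())
    return match.groupdict() if match else None


print(parse_member("Alm Ericson, Janine (MP)"))
print(parse_member("no comma here"))
```

Returning None for unexpected input lets the calling loop decide whether to skip the entry or stop.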





Score: 0













I guess you could also use a function to return the parts in a list (not as clean as the answers already given), e.g.

def getParts(inputString):
    # Split off the last name at the comma, then the party at the opening parenthesis.
    list1 = inputString.split(",")
    list2 = list1[1].split("(")
    finalList = [list1[0], list2[0].strip(), list2[1].replace(")", "")]
    return finalList

inputString = 'Alm Ericson, Janine (MP)'

print(getParts(inputString))





  • I think Corentin's answer is more appropriate for this case. It's more readable for other developers and easier to maintain.
    – BlueSheepToken
    Nov 8 at 11:24

  • I would agree, @BlueSheepToken. I was just having a stab at writing some Python. I'll happily take on board improvement notes. I already upvoted theirs.
    – QHarr
    Nov 8 at 11:25

  • It's just that I prefer the regex approach since it is clearer, but your code is totally readable :) and more user-friendly for new developers.
    – BlueSheepToken
    Nov 8 at 13:27













    edited Nov 8 at 13:26

























    answered Nov 8 at 11:24









    BlueSheepToken

        edited Nov 8 at 11:51

























        answered Nov 8 at 11:07









        Corentin Limier

            edited Nov 8 at 11:26

























            answered Nov 8 at 11:21









            QHarr
