Scraping name list with varying numbers of last names

I'm trying to scrape the Swedish members of parliament with Beautiful Soup. When I run the scraper I get "ValueError: too many values to unpack (expected 3)".

The script outputs a CSV, but only with five names. The sixth person on the list is named Alm Ericson, Janine (MP). I suppose the problem is that she has two last names, Alm Ericson, while the code expects exactly three values: last name, first name and party.
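A quick check is consistent with that guess. The sketch below only uses `str.split()` on two sample strings; the first one, "Andersson, Anna (S)", is a made-up entry for illustration and is not taken from the page.

```python
# Whitespace splitting yields one token per word, so a double last name
# produces four tokens instead of the three that the unpacking expects.
single = "Andersson, Anna (S)"       # hypothetical sample entry
double = "Alm Ericson, Janine (MP)"  # the entry quoted above

print(single.split())  # ['Andersson,', 'Anna', '(S)']
print(double.split())  # ['Alm', 'Ericson,', 'Janine', '(MP)']

try:
    last_name, first_name, party = double.split()
except ValueError as err:
    # Four tokens cannot be unpacked into three names.
    print("ValueError:", err)
```

So any parsing approach has to anchor on the comma and the parentheses rather than on spaces.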

How should I write the field split so that it also works for double last names?

The names on the page are written as

Last_name, first_name (party)

Code:

import urllib.request
import bs4 as bs
import csv

source = urllib.request.urlopen("https://www.riksdagen.se/sv/ledamoter-partier/").read()
soup = bs.BeautifulSoup(source, "lxml")

data = []

for span in soup.find_all("span", {"class": "fellow-name"}):
    cleanednames = span.text.strip()
    data.append(cleanednames)  # fields are appended to a list rather than printed

with open("riksdagsledamoter.csv", "w") as stream:
    fieldnames = ["Last_Name", "First_Name", "Party"]
    var = csv.DictWriter(stream, fieldnames=fieldnames)
    var.writeheader()
    for item in data:
        last_name, First_name, party = item.split()  # splitting data into 3 fields (this is the line that raises the ValueError)
        last_name = last_name.replace(",", "")  # removing ',' from last name
        party = party.replace("(", "").replace(")", "")  # removing "()" from party
        var.writerow({"Last_Name": last_name, "First_Name": First_name, "Party": party})  # writing a row to the csv

edited Nov 8 at 11:33

asked Nov 8 at 10:59

Harald

163

Since it's not Beautiful Soup related, you should change the tag and the question. There is a lot of noise in this code.
– BlueSheepToken
Nov 8 at 11:16

python


3 Answers

Accepted answer (2 votes)

Here is a simple regex that should do the trick:

import re

print(re.match(r"(.*), (.*) \((.*)\)", 'Alm Ericson, Janine (MP)').groups())

Inspired by Corentin's answer.
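As for where such a pattern could go in the original script, one possibility is to replace the `item.split()` line inside the CSV-writing loop. The following is only a sketch under assumptions: `write_members`, `NAME_PATTERN` and the `skipped` list are hypothetical names introduced here, and an in-memory buffer stands in for the real file so the example runs without network access.

```python
import csv
import io
import re

# Last name(s) before the comma, first name(s) after it, party in parentheses.
NAME_PATTERN = re.compile(r"(.*), (.*) \((.*)\)")


def write_members(names, stream):
    """Write one CSV row per 'Last, First (Party)' string.

    Returns the items that did not match the pattern, so that unexpected
    entries are reported instead of raising an exception.
    """
    writer = csv.DictWriter(stream, fieldnames=["Last_Name", "First_Name", "Party"])
    writer.writeheader()
    skipped = []
    for item in names:
        match = NAME_PATTERN.match(item)
        if match is None:
            skipped.append(item)
            continue
        last_name, first_name, party = match.groups()
        writer.writerow(
            {"Last_Name": last_name, "First_Name": first_name, "Party": party}
        )
    return skipped


# Stand-in for the scraped list and the opened file.
buffer = io.StringIO()
skipped = write_members(["Alm Ericson, Janine (MP)", "not a name"], buffer)
print(buffer.getvalue())
print("skipped:", skipped)
```

In the script from the question, the scraped `data` list and the opened `stream` would take the place of the sample list and the buffer. Opening the file with `newline=""`, as the `csv` module documentation recommends, avoids blank lines between rows on Windows.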

edited Nov 8 at 13:26

answered Nov 8 at 11:24

BlueSheepToken

500110

Aha, okay. I'm very new to Python and coding in general; could you show me where in the code I should put that?
– Harald
Nov 8 at 11:33

Thanks, edited!
– Corentin Limier
Nov 8 at 11:52

Answer (4 votes)

Splitting on whitespace is not a good solution here (alternatively, you could split on the comma and the parentheses instead of on spaces).

Using a regular expression:

import re

re.match(r'([^,]*), ([^(]*) \((.*)\)', 'Alm Ericson, Janine (MP)').groups()

Returns

('Alm Ericson', 'Janine', 'MP')
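One thing to keep in mind is that `re.match` returns `None` when the text does not fit the pattern, so calling `.groups()` directly would raise an `AttributeError` on an unexpected entry. A possible variant, sketched here with named groups, guards against that; the names `MEMBER`, `parse_member`, `last`, `first` and `party` are illustrative choices, not part of the answer.

```python
import re

# Same structure as the pattern above, with named groups for readability.
MEMBER = re.compile(r"(?P<last>[^,]*), (?P<first>[^(]*) \((?P<party>.*)\)")


def parse_member(text):
    """Return a dict with 'last', 'first' and 'party', or None if there is no match."""
    match = MEMBER.match(text)
    return match.groupdict() if match else None


print(parse_member("Alm Ericson, Janine (MP)"))
# {'last': 'Alm Ericson', 'first': 'Janine', 'party': 'MP'}
print(parse_member("no comma here"))
# None
```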

edited Nov 8 at 11:51

answered Nov 8 at 11:07

Corentin Limier

1,37249


Answer (0 votes)

I guess you could also use a function that returns the parts in a list (not as clean as the answers already given), e.g.

def getParts(inputString):
    list1 = inputString.split(",")  # ["Alm Ericson", " Janine (MP)"]
    list2 = list1[1].split("(")  # [" Janine ", "MP)"]
    finalList = [list1[0], list2[0].strip(), list2[1].replace(")", "")]
    return finalList

inputString = 'Alm Ericson, Janine (MP)'

print(getParts(inputString))
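The same split-based idea can also be written with `str.partition` and `str.rpartition`, which always return three parts and therefore avoid indexing into lists. This is just an alternative sketch; `split_member` is a hypothetical name, and the function does not validate malformed input (an entry without a comma or parenthesis simply yields empty or odd fields).

```python
def split_member(text):
    """Split 'Last, First (Party)' into a (last, first, party) tuple without regex."""
    # Everything before the first comma is the last name, however many words it has.
    last_name, _, rest = text.partition(",")
    # Everything after the last "(" is the party, still followed by ")".
    first_name, _, party = rest.rpartition("(")
    return last_name.strip(), first_name.strip(), party.strip().rstrip(")")


print(split_member("Alm Ericson, Janine (MP)"))
# ('Alm Ericson', 'Janine', 'MP')
```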

edited Nov 8 at 11:26

answered Nov 8 at 11:21

QHarr

25.5k81839

I think Corentin's answer is more appropriate for this case. It's more readable for other developers and easier to maintain.
– BlueSheepToken
Nov 8 at 11:24

I would agree @BlueSheepToken. I was just having a stab at writing some Python. I'll happily take on board improvement notes. I already upvoted theirs.
– QHarr
Nov 8 at 11:25

It's just that I prefer the regex approach since it is clearer, but your code is totally readable :) and more user-friendly for new developers.
– BlueSheepToken
Nov 8 at 13:27
