Scraping name list with varying numbers of last names
I'm trying to scrape the Swedish members of parliament with Beautiful Soup. When I run the scraper I get "ValueError: too many values to unpack (expected 3)".
The script outputs a CSV, but only with five names. The sixth person on the list is named Alm Ericson, Janine (MP). I suppose the problem is that she has two last names, Alm Ericson, and the code only expects three values: last name, first name and party.
How should I code the field split so that it also works for double last names?
The names on the page are written as
Last_name, first_name (party)
Code:
import urllib.request
import bs4 as bs
import csv
source = urllib.request.urlopen("https://www.riksdagen.se/sv/ledamoter-partier/").read()
soup = bs.BeautifulSoup(source, "lxml")
data = []
for span in soup.find_all("span", {"class": "fellow-name"}):
    cleanednames = span.text.strip()
    data.append(cleanednames)  # fields are appended to a list rather than printed
with open("riksdagsledamoter.csv", "w") as stream:
    fieldnames = ["Last_Name", "First_Name", "Party"]
    var = csv.DictWriter(stream, fieldnames=fieldnames)
    var.writeheader()
    for item in data:
        last_name, First_name, party = item.split()  # splitting data into 3 fields
        last_name = last_name.replace(",", "")  # removing ',' from last name
        party = party.replace("(", "").replace(")", "")  # removing "()" from party
        var.writerow({"Last_Name": last_name, "First_Name": First_name, "Party": party})  # writing to csv row
python
edited Nov 8 at 11:33
asked Nov 8 at 10:59
Harald
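To see why the unpacking fails, it can help to look at what a whitespace split returns for the two kinds of names. The snippet below is only an illustration: the first sample string is a hypothetical single-surname entry made up for the example, the second is the entry quoted in the question.

```python
# Two sample entries in the "Last_name, first_name (party)" format.
# The first is a hypothetical single-surname entry; the second is from the question.
single = "Andersson, Anna (S)"
double = "Alm Ericson, Janine (MP)"

print(single.split())  # 3 tokens, so unpacking into 3 names works
print(double.split())  # 4 tokens, so unpacking into 3 names fails

try:
    last_name, first_name, party = double.split()
except ValueError as error:
    message = str(error)
    print(message)
```

Because `split()` with no arguments breaks on every run of whitespace, any extra word in a name adds an extra token, which is why the answers split on the comma and parentheses instead.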
Since this isn't Beautiful Soup related, you should change the tag and the question. There is a lot of noise in this code.
– BlueSheepToken
Nov 8 at 11:16
3 Answers
accepted
Here is a simple regex that should do the trick:
import re
print(re.match(r"(.*), (.*) \((.*)\)", 'Alm Ericson, Janine (MP)').groups())
Inspired by Corentin's answer.
edited Nov 8 at 13:26
answered Nov 8 at 11:24
BlueSheepToken
Aha, okay. I'm very new to Python and coding in general. Could you show me where in the code I should put that?
– Harald
Nov 8 at 11:33
Thanks, edited !
– Corentin Limier
Nov 8 at 11:52
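One way to fit the regex into the original script is to replace the `item.split()` line inside the CSV loop. The sketch below assumes the scraped strings are already in a list (here a short hand-written sample instead of the live page). The helper name `write_members` and the constant `NAME_PATTERN` are made up for this example, and entries that don't match the expected format are simply skipped.

```python
import csv
import io
import re

# Pattern for "Last_name, first_name (party)"; the parentheses around
# the party are escaped so they match literally.
NAME_PATTERN = re.compile(r"(.*), (.*) \((.*)\)")


def write_members(data, stream):
    """Write one CSV row per entry in data (hypothetical helper)."""
    fieldnames = ["Last_Name", "First_Name", "Party"]
    writer = csv.DictWriter(stream, fieldnames=fieldnames)
    writer.writeheader()
    for item in data:
        match = NAME_PATTERN.match(item)
        if match is None:
            continue  # skip entries that are not in the expected format
        last_name, first_name, party = match.groups()
        writer.writerow(
            {"Last_Name": last_name, "First_Name": first_name, "Party": party}
        )


# Stand-in for the list built from soup.find_all(...) in the question.
sample = ["Alm Ericson, Janine (MP)"]
buffer = io.StringIO()
write_members(sample, buffer)
print(buffer.getvalue())
```

In the original script the call would be `write_members(data, stream)` inside the `with open(...)` block. The `csv` module documentation also recommends opening the file with `newline=""`.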
Well, obviously splitting on spaces is not a good solution here (or you should split on the comma and parentheses instead of spaces).
Using a regexp:
import re
re.match(r'([^,]*), ([^(]*) \((.*)\)', 'Alm Ericson, Janine (MP)').groups()
Returns
('Alm Ericson', 'Janine', 'MP')
edited Nov 8 at 11:51
answered Nov 8 at 11:07
Corentin Limier
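When a pattern like this has to match the literal parentheses around the party, they need a backslash in front; an unescaped parenthesis opens another capture group instead. A quick comparison, as an illustration only:

```python
import re

text = "Alm Ericson, Janine (MP)"

# Escaped parentheses match the literal "(" and ")" around the party.
escaped = re.match(r"([^,]*), ([^(]*) \((.*)\)", text).groups()

# Unescaped parentheses are treated as an extra (nested) capture group,
# so the literal "(" and ")" stay inside the captured text.
unescaped = re.match(r"([^,]*), ([^(]*) ((.*))", text).groups()

print(len(escaped), escaped)
print(len(unescaped), unescaped)
```

The unescaped variant yields four groups, so unpacking it into three variables would raise the same "too many values to unpack" error as the original split.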
I guess you could also use a function to return the parts in a list (not as clean as the answer already given), e.g.
def getParts(inputString):
    list1 = inputString.split(",")
    list2 = list1[1].split("(")
    finalList = [list1[0], list2[0].strip(), list2[1].replace(")", "")]
    return finalList
inputString = 'Alm Ericson, Janine (MP)'
print(getParts(inputString))
edited Nov 8 at 11:26
answered Nov 8 at 11:21
QHarr
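The same split-on-punctuation idea can also be sketched with `str.partition` and `str.rpartition`, which always return three parts and so never raise an `IndexError` (malformed input yields empty fields instead). The function name `split_member` is made up for this example.

```python
def split_member(entry):
    """Split 'Last_name, first_name (party)' into its three parts."""
    # Everything before the first comma is the (possibly multi-word) last name.
    last_name, _, rest = entry.partition(",")
    # Everything after the last "(" is the party, still followed by ")".
    first_name, _, party = rest.rpartition("(")
    return last_name.strip(), first_name.strip(), party.strip().rstrip(")")


print(split_member("Alm Ericson, Janine (MP)"))
```

Whether silently returning empty fields or failing loudly is preferable depends on how clean the scraped list is expected to be.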
I think Corentin's answer is more appropriate for this case. It's more readable for other developers and easier to maintain.
– BlueSheepToken
Nov 8 at 11:24
I would agree @BlueSheepToken. I was just having a stab at writing some python. I'll happily take on board improvement notes. I already upvoted theirs.
– QHarr
Nov 8 at 11:25
It's just that I prefer the regex approach since it is clearer, but your code is totally readable :) and more user friendly for new developers.
– BlueSheepToken
Nov 8 at 13:27