Create JSON with XML file using BeautifulSoup

I am using Jupyter Notebook, running Python 3. My task is to extract data from an XML file and convert it to JSON format (perhaps even save the JSON in an output.dat file). I am using BeautifulSoup to navigate through the nodes. I have the following data:

<?xml version='1.0' encoding='UTF-8'?>
<Terms>
  <Term>
    <Title>.177 (4.5mm) Airgun</Title>
    <Description>The standard airgun calibre for international target
                 shooting.</Description>
    <RelatedTerms>
      <Term>
        <Title>Shooting sport equipment</Title>
        <Relationship>Narrower Term</Relationship>
      </Term>
    </RelatedTerms>
  </Term>
  <Term>
    <Title>1 Kilometre Time Trial</Title>
    <Description>test2</Description>
    <RelatedTerms>
      <Term>
        <Title>1 Kilometre TT</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>1km Time Trial</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>1km Time Trial</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>1km TT</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>One km Time Trial</Title>
        <Relationship>Used For</Relationship>
      </Term>
    </RelatedTerms>
  </Term>
</Terms>

This is the output that I am expecting in JSON:

{
    "thesaurus": [
        {
            "Description": "The standard airgun calibre for international target shooting.",
            "RelatedTerms": [
                {
                    "Relationship": "Narrower Term",
                    "Title": "Shooting sport equipment"
                }
            ],
            "Title": ".177 (4.5mm) Airgun"
        },
        {
            "Description": "test2",
            "RelatedTerms": [
                {
                    "Relationship": "Used For",
                    "Title": "1 Kilometre TT"
                },
                {
                    "Relationship": "Used For",
                    "Title": "1km Time Trial"
                },
                {
                    "Relationship": "Used For",
                    "Title": "1km Time Trial"
                },
                {
                    "Relationship": "Used For",
                    "Title": "1km TT"
                },
                {
                    "Relationship": "Used For",
                    "Title": "One km Time Trial"
                }
            ],
            "Title": "1 Kilometre Time Trial"
        }
    ]
}

I am navigating through the tags so that I can create dictionaries as seen in the output example. Since I am new to text scraping, this is quite frustrating.

I was able to extract the "Description" tag with the following code:

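from bs4 import BeautifulSoup

xml_file = './xml.xml'

# Parse the file with BeautifulSoup's XML parser (requires lxml).
with open(xml_file, encoding="utf8") as fh:
    btree = BeautifulSoup(fh, "xml")

# Collect the text of every <Description> element.
elements = btree.find_all('Description')
descriptionTag = []
for element in elements:
    descriptionTag.append(element.text)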

Although I can collect the Description text as shown above, I am not sure how to create a list of dictionaries from the information stored inside the "RelatedTerms" tag.
Ideally, I would parse all the tags into a DataFrame and then convert that data to JSON format.

So, can someone please help me work out how to extract the information from the "RelatedTerms" tag?

asked Nov 10 at 9:35

Timetraveller


Tags: json, xml, beautifulsoup


1 Answer (accepted)

To extract RelatedTerms, first select the top-level Term elements using btree.select('Terms > Term'). You can then loop over them and extract the Term elements inside each RelatedTerms using term.select('RelatedTerms > Term').

import json
from bs4 import BeautifulSoup

xml_file = './xml.xml'
with open(xml_file, encoding="utf8") as fh:
    btree = BeautifulSoup(fh, "xml")

# Direct children of <Terms> only, so nested related terms are skipped here.
Terms = btree.select('Terms > Term')
jsonObj = {"thesaurus": []}

for term in Terms:
    termDetail = {
        "Description": term.find('Description').text,
        "Title": term.find('Title').text
    }
    # <Term> elements nested directly inside this term's <RelatedTerms>.
    RelatedTerms = term.select('RelatedTerms > Term')
    if RelatedTerms:
        termDetail["RelatedTerms"] = []
        for rterm in RelatedTerms:
            termDetail["RelatedTerms"].append({
                "Title": rterm.find('Title').text,
                "Relationship": rterm.find('Relationship').text
            })
    jsonObj["thesaurus"].append(termDetail)

# print is a function in Python 3, which the question uses.
print(json.dumps(jsonObj, indent=4))
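One detail the loop above leaves open: .text keeps the line break and indentation inside the first Description, while the expected output has a single space there, and the question also mentions saving to output.dat. The following is only a rough sketch of how both could be handled. It deliberately swaps in the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra packages; the same whitespace clean-up can be applied to .text in the BeautifulSoup version. The names SAMPLE_XML, clean, terms_to_dict and save_json are made up for illustration, and the sample is a shortened copy of the data from the question.

```python
import json
import xml.etree.ElementTree as ET

# Shortened inline copy of the question's data, so no ./xml.xml is needed.
SAMPLE_XML = """<Terms>
  <Term>
    <Title>.177 (4.5mm) Airgun</Title>
    <Description>The standard airgun calibre for international target
                 shooting.</Description>
    <RelatedTerms>
      <Term>
        <Title>Shooting sport equipment</Title>
        <Relationship>Narrower Term</Relationship>
      </Term>
    </RelatedTerms>
  </Term>
  <Term>
    <Title>1 Kilometre Time Trial</Title>
    <Description>test2</Description>
    <RelatedTerms>
      <Term>
        <Title>1 Kilometre TT</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>1km TT</Title>
        <Relationship>Used For</Relationship>
      </Term>
    </RelatedTerms>
  </Term>
</Terms>
"""


def clean(text):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return " ".join((text or "").split())


def terms_to_dict(xml_text):
    """Build the {"thesaurus": [...]} structure from the XML text."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    thesaurus = []
    # findall("Term") matches direct children of <Terms> only, so nested
    # related terms are not mistaken for top-level ones.
    for term in root.findall("Term"):
        detail = {
            "Title": clean(term.findtext("Title")),
            "Description": clean(term.findtext("Description")),
        }
        related = term.findall("RelatedTerms/Term")
        if related:
            detail["RelatedTerms"] = [
                {
                    "Title": clean(r.findtext("Title")),
                    "Relationship": clean(r.findtext("Relationship")),
                }
                for r in related
            ]
        thesaurus.append(detail)
    return {"thesaurus": thesaurus}


def save_json(obj, path):
    """Write obj to path as indented JSON with alphabetically sorted keys."""
    with open(path, "w", encoding="utf8") as fh:
        json.dump(obj, fh, indent=4, sort_keys=True)


result = terms_to_dict(SAMPLE_XML)
print(json.dumps(result, indent=4, sort_keys=True))
# save_json(result, "output.dat")  # file name taken from the question
```

sort_keys=True is used only because the expected output in the question lists keys alphabetically (Description, RelatedTerms, Title); drop it if key order does not matter.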

edited Nov 10 at 12:29

answered Nov 10 at 12:23

ewwink
