Create JSON with XML file using BeautifulSoup











I am using a Jupyter notebook running Python 3. My task is to extract data from an XML file and convert it to JSON format (perhaps even save the JSON in an output.dat file). I am using BeautifulSoup to navigate through the nodes. I have the following data:



<?xml version='1.0' encoding='UTF-8'?>
<Terms>
  <Term>
    <Title>.177 (4.5mm) Airgun</Title>
    <Description>The standard airgun calibre for international target
shooting.</Description>
    <RelatedTerms>
      <Term>
        <Title>Shooting sport equipment</Title>
        <Relationship>Narrower Term</Relationship>
      </Term>
    </RelatedTerms>
  </Term>
  <Term>
    <Title>1 Kilometre Time Trial</Title>
    <Description>test2</Description>
    <RelatedTerms>
      <Term>
        <Title>1 Kilometre TT</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>1km Time Trial</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>1km Time Trial</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>1km TT</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>One km Time Trial</Title>
        <Relationship>Used For</Relationship>
      </Term>
    </RelatedTerms>
  </Term>


This is the output I am expecting in JSON:



{
    "thesaurus": [
        {
            "Description": "The standard airgun calibre for international target shooting.",
            "RelatedTerms": [
                {
                    "Relationship": "Narrower Term",
                    "Title": "Shooting sport equipment"
                }
            ],
            "Title": ".177 (4.5mm) Airgun"
        },
        {
            "Description": "test2",
            "RelatedTerms": [
                {
                    "Relationship": "Used For",
                    "Title": "1 Kilometre TT"
                },
                {
                    "Relationship": "Used For",
                    "Title": "1km Time Trial"
                },
                {
                    "Relationship": "Used For",
                    "Title": "1km Time Trial"
                },
                {
                    "Relationship": "Used For",
                    "Title": "1km TT"
                },
                {
                    "Relationship": "Used For",
                    "Title": "One km Time Trial"
                }
            ],
            "Title": "1 Kilometre Time Trial"
        },


I am navigating through the tags so that I can create dictionaries as seen in the output example. Since I am new to text scraping, this is quite frustrating.



I was able to extract the "Description" tag with the following code:



from bs4 import BeautifulSoup

xml_file = './xml.xml'
btree = BeautifulSoup(open(xml_file, encoding="utf8"), "xml")
elements = btree.find_all('Description')
descriptionTag = []
for element in elements:
    descriptionTag.append(element.text)


Unlike the Description tag above, I am not sure how to create a list of dictionaries from the information stored inside the "RelatedTerms" tag.
Ideally, I would parse all the tags into a dataframe and then convert the data to JSON format.



Can someone please help me work out how to extract the information from the "RelatedTerms" tag?










      json xml beautifulsoup






      asked Nov 10 at 9:35









      Timetraveller

          1 Answer
          accepted










To extract RelatedTerms, first select the top-level Term elements with btree.select('Terms > Term'). Then loop over them and, for each one, select the Term elements nested inside its RelatedTerms with term.select('RelatedTerms > Term').



import json
from bs4 import BeautifulSoup

xml_file = './xml.xml'
# The "xml" parser requires lxml to be installed.
with open(xml_file, encoding="utf8") as fh:
    btree = BeautifulSoup(fh, "xml")

# Top-level <Term> elements only (direct children of <Terms>).
Terms = btree.select('Terms > Term')
jsonObj = {"thesaurus": []}

for term in Terms:
    termDetail = {
        "Description": term.find('Description').text,
        "Title": term.find('Title').text
    }
    # <Term> elements nested inside this term's <RelatedTerms>.
    RelatedTerms = term.select('RelatedTerms > Term')
    if RelatedTerms:
        termDetail["RelatedTerms"] = []
        for rterm in RelatedTerms:
            termDetail["RelatedTerms"].append({
                "Title": rterm.find('Title').text,
                "Relationship": rterm.find('Relationship').text
            })
    jsonObj["thesaurus"].append(termDetail)

print(json.dumps(jsonObj, indent=4))
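
For comparison, the same two-level traversal (top-level Term elements, then the Term elements inside each RelatedTerms) can be sketched with the standard library's xml.etree.ElementTree instead of BeautifulSoup. This is only a rough sketch under some assumptions: SAMPLE_XML is a made-up, shortened copy of the data in the question, and the helper names clean, terms_to_dict and save_json are hypothetical. The clean helper is there because the first Description in the question spans two lines, so its raw text contains a newline, while the expected JSON shows a single space.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample that mirrors the structure in the question.
SAMPLE_XML = """<Terms>
  <Term>
    <Title>.177 (4.5mm) Airgun</Title>
    <Description>The standard airgun calibre for international target
    shooting.</Description>
    <RelatedTerms>
      <Term>
        <Title>Shooting sport equipment</Title>
        <Relationship>Narrower Term</Relationship>
      </Term>
    </RelatedTerms>
  </Term>
  <Term>
    <Title>1 Kilometre Time Trial</Title>
    <Description>test2</Description>
    <RelatedTerms>
      <Term>
        <Title>1 Kilometre TT</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>1km TT</Title>
        <Relationship>Used For</Relationship>
      </Term>
    </RelatedTerms>
  </Term>
</Terms>
"""


def clean(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join((text or "").split())


def terms_to_dict(root):
    """Build the {"thesaurus": [...]} structure from a <Terms> root element."""
    thesaurus = []
    # findall("Term") matches direct children only, so nested related
    # terms are not picked up at this level.
    for term in root.findall("Term"):
        entry = {
            "Title": clean(term.findtext("Title")),
            "Description": clean(term.findtext("Description")),
        }
        related = term.findall("RelatedTerms/Term")
        if related:
            entry["RelatedTerms"] = [
                {
                    "Title": clean(r.findtext("Title")),
                    "Relationship": clean(r.findtext("Relationship")),
                }
                for r in related
            ]
        thesaurus.append(entry)
    return {"thesaurus": thesaurus}


def save_json(data, path):
    """Write the structure to a file, with keys sorted alphabetically."""
    with open(path, "w", encoding="utf8") as fh:
        json.dump(data, fh, indent=4, sort_keys=True)


root = ET.fromstring(SAMPLE_XML)
result = terms_to_dict(root)
print(json.dumps(result, indent=4, sort_keys=True))
```

Calling save_json(result, "output.dat") would write the same JSON to a file, which is what the question mentions as an optional step. Passing sort_keys=True gives the alphabetical key order shown in the expected output.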





                edited Nov 10 at 12:29

























                answered Nov 10 at 12:23









                ewwink
