Loading XML file to Hive tables











up vote
0
down vote

favorite












We are working on loading a semi-structured XML file in Hive tables. These are retail purchase data.

I have attached a sample XML file to understand how the data looks and also the Hive table definition which i am using to read that XML file.



The file has multiple Payment and Amount(Amt) sections under "Tender" for each Basket in the XML. For example a customer can pay by Cash,EFTPOS,Credit card,Loyalty card or sometimes use a combination of these.



While reading the data from the attached XML , the results show the Payment column merged with entries in the XML and for 'Amt' it shows NULL.



I understood that the table which I am using to read the data is NOT 100% correct.

Please let me know how i can create multiple records for same element names "Payment" and "Amt"using the xpath functions



I tried to find some documentation in internet for the same and couldn't find anything which is similar to my scenario.



Sample data:



<Bskt>
<TillNo>4</TillNo>
<BsktNo>1747</BsktNo>
<DateTime>2017-10-31T10:51:25.000+11:00</DateTime>
<OpID>10115</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>46.75</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>2679911927</CardNo>
<Program>SmartRewards</Program>
<Earn>46.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1748</BsktNo>
<DateTime>2017-10-31T10:53:11.000+11:00</DateTime>
<OpID>10115</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>46.75</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>2619183833</CardNo>
<Program>SmartRewards</Program>
<Earn>46.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1753</BsktNo>
<DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
<OpID>50056</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>28.10</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>8263734549</CardNo>
<Program>SmartRewards</Program>
<Earn>28.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>


Hive table:



CREATE EXTERNAL TABLE BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` string,
`Amt` float,
`CardNo` string,
`Program` string,
`Earn` float,
`Burn` float
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/Paytype/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<hdfs file location>'
TBLPROPERTIES (
"xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);


Hive query output:



select * from BASKET_TENDER


Query result










share|improve this question
























  • Typo on /Paytype/text()...
    – cricket_007
    Nov 9 at 1:00










  • So, you have two possible /Bskt/Tender/ paths for every <Bskt>. How are you letting Hive know which one it needs to parse?
    – cricket_007
    Nov 9 at 1:03












  • Typo Corrected. Even my confusion is same.How can i let Hive know which one it needs to parse in my case where each 'Bskt' has two 'Tenders' with same names for elements "Payment" and "Amt".
    – Nandakumar_bigdata
    Nov 9 at 4:09












  • I need to achieve either of these: 1) Create another row for 2nd tender with same Basket details. OR 2) Create another column called "Payment_Loyalty" and "Amt_Loyalty" and load the second tender details there. I'm wondering how i can achieve either of these solutions.
    – Nandakumar_bigdata
    Nov 9 at 4:13

















up vote
0
down vote

favorite












We are working on loading a semi-structured XML file in Hive tables. These are retail purchase data.

I have attached a sample XML file to understand how the data looks and also the Hive table definition which i am using to read that XML file.



The file has multiple Payment and Amount(Amt) sections under "Tender" for each Basket in the XML. For example a customer can pay by Cash,EFTPOS,Credit card,Loyalty card or sometimes use a combination of these.



While reading the data from the attached XML , the results show the Payment column merged with entries in the XML and for 'Amt' it shows NULL.



I understood that the table which I am using to read the data is NOT 100% correct.

Please let me know how i can create multiple records for same element names "Payment" and "Amt"using the xpath functions



I tried to find some documentation in internet for the same and couldn't find anything which is similar to my scenario.



Sample data:



<Bskt>
<TillNo>4</TillNo>
<BsktNo>1747</BsktNo>
<DateTime>2017-10-31T10:51:25.000+11:00</DateTime>
<OpID>10115</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>46.75</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>2679911927</CardNo>
<Program>SmartRewards</Program>
<Earn>46.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1748</BsktNo>
<DateTime>2017-10-31T10:53:11.000+11:00</DateTime>
<OpID>10115</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>46.75</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>2619183833</CardNo>
<Program>SmartRewards</Program>
<Earn>46.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1753</BsktNo>
<DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
<OpID>50056</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>28.10</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>8263734549</CardNo>
<Program>SmartRewards</Program>
<Earn>28.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>


Hive table:



CREATE EXTERNAL TABLE BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` string,
`Amt` float,
`CardNo` string,
`Program` string,
`Earn` float,
`Burn` float
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/Paytype/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<hdfs file location>'
TBLPROPERTIES (
"xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);


Hive query output:



select * from BASKET_TENDER


Query result










share|improve this question
























  • Typo on /Paytype/text()...
    – cricket_007
    Nov 9 at 1:00










  • So, you have two possible /Bskt/Tender/ paths for every <Bskt>. How are you letting Hive know which one it needs to parse?
    – cricket_007
    Nov 9 at 1:03












  • Typo Corrected. Even my confusion is same.How can i let Hive know which one it needs to parse in my case where each 'Bskt' has two 'Tenders' with same names for elements "Payment" and "Amt".
    – Nandakumar_bigdata
    Nov 9 at 4:09












  • I need to achieve either of these: 1) Create another row for 2nd tender with same Basket details. OR 2) Create another column called "Payment_Loyalty" and "Amt_Loyalty" and load the second tender details there. I'm wondering how i can achieve either of these solutions.
    – Nandakumar_bigdata
    Nov 9 at 4:13















up vote
0
down vote

favorite









up vote
0
down vote

favorite











We are working on loading a semi-structured XML file in Hive tables. These are retail purchase data.

I have attached a sample XML file to understand how the data looks and also the Hive table definition which i am using to read that XML file.



The file has multiple Payment and Amount(Amt) sections under "Tender" for each Basket in the XML. For example a customer can pay by Cash,EFTPOS,Credit card,Loyalty card or sometimes use a combination of these.



While reading the data from the attached XML , the results show the Payment column merged with entries in the XML and for 'Amt' it shows NULL.



I understood that the table which I am using to read the data is NOT 100% correct.

Please let me know how i can create multiple records for same element names "Payment" and "Amt"using the xpath functions



I tried to find some documentation in internet for the same and couldn't find anything which is similar to my scenario.



Sample data:



<Bskt>
<TillNo>4</TillNo>
<BsktNo>1747</BsktNo>
<DateTime>2017-10-31T10:51:25.000+11:00</DateTime>
<OpID>10115</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>46.75</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>2679911927</CardNo>
<Program>SmartRewards</Program>
<Earn>46.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1748</BsktNo>
<DateTime>2017-10-31T10:53:11.000+11:00</DateTime>
<OpID>10115</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>46.75</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>2619183833</CardNo>
<Program>SmartRewards</Program>
<Earn>46.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1753</BsktNo>
<DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
<OpID>50056</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>28.10</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>8263734549</CardNo>
<Program>SmartRewards</Program>
<Earn>28.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>


Hive table:



CREATE EXTERNAL TABLE BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` string,
`Amt` float,
`CardNo` string,
`Program` string,
`Earn` float,
`Burn` float
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/Paytype/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<hdfs file location>'
TBLPROPERTIES (
"xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);


Hive query output:



select * from BASKET_TENDER


Query result










share|improve this question















We are working on loading a semi-structured XML file in Hive tables. These are retail purchase data.

I have attached a sample XML file to understand how the data looks and also the Hive table definition which i am using to read that XML file.



The file has multiple Payment and Amount(Amt) sections under "Tender" for each Basket in the XML. For example a customer can pay by Cash,EFTPOS,Credit card,Loyalty card or sometimes use a combination of these.



While reading the data from the attached XML , the results show the Payment column merged with entries in the XML and for 'Amt' it shows NULL.



I understood that the table which I am using to read the data is NOT 100% correct.

Please let me know how i can create multiple records for same element names "Payment" and "Amt"using the xpath functions



I tried to find some documentation in internet for the same and couldn't find anything which is similar to my scenario.



Sample data:



<Bskt>
<TillNo>4</TillNo>
<BsktNo>1747</BsktNo>
<DateTime>2017-10-31T10:51:25.000+11:00</DateTime>
<OpID>10115</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>46.75</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>2679911927</CardNo>
<Program>SmartRewards</Program>
<Earn>46.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1748</BsktNo>
<DateTime>2017-10-31T10:53:11.000+11:00</DateTime>
<OpID>10115</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>46.75</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>2619183833</CardNo>
<Program>SmartRewards</Program>
<Earn>46.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1753</BsktNo>
<DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
<OpID>50056</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>28.10</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>8263734549</CardNo>
<Program>SmartRewards</Program>
<Earn>28.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>


Hive table:



CREATE EXTERNAL TABLE BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` string,
`Amt` float,
`CardNo` string,
`Program` string,
`Earn` float,
`Burn` float
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/Paytype/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<hdfs file location>'
TBLPROPERTIES (
"xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);


Hive query output:



select * from BASKET_TENDER


Query result







xml parsing hadoop xpath hive






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 9 at 0:59









cricket_007

76.8k1042106




76.8k1042106










asked Nov 8 at 23:31









Nandakumar_bigdata

53




53












  • Typo on /Paytype/text()...
    – cricket_007
    Nov 9 at 1:00










  • So, you have two possible /Bskt/Tender/ paths for every <Bskt>. How are you letting Hive know which one it needs to parse?
    – cricket_007
    Nov 9 at 1:03












  • Typo Corrected. Even my confusion is same.How can i let Hive know which one it needs to parse in my case where each 'Bskt' has two 'Tenders' with same names for elements "Payment" and "Amt".
    – Nandakumar_bigdata
    Nov 9 at 4:09












  • I need to achieve either of these: 1) Create another row for 2nd tender with same Basket details. OR 2) Create another column called "Payment_Loyalty" and "Amt_Loyalty" and load the second tender details there. I'm wondering how i can achieve either of these solutions.
    – Nandakumar_bigdata
    Nov 9 at 4:13




















  • Typo on /Paytype/text()...
    – cricket_007
    Nov 9 at 1:00










  • So, you have two possible /Bskt/Tender/ paths for every <Bskt>. How are you letting Hive know which one it needs to parse?
    – cricket_007
    Nov 9 at 1:03












  • Typo Corrected. Even my confusion is same.How can i let Hive know which one it needs to parse in my case where each 'Bskt' has two 'Tenders' with same names for elements "Payment" and "Amt".
    – Nandakumar_bigdata
    Nov 9 at 4:09












  • I need to achieve either of these: 1) Create another row for 2nd tender with same Basket details. OR 2) Create another column called "Payment_Loyalty" and "Amt_Loyalty" and load the second tender details there. I'm wondering how i can achieve either of these solutions.
    – Nandakumar_bigdata
    Nov 9 at 4:13


















Typo on /Paytype/text()...
– cricket_007
Nov 9 at 1:00




Typo on /Paytype/text()...
– cricket_007
Nov 9 at 1:00












So, you have two possible /Bskt/Tender/ paths for every <Bskt>. How are you letting Hive know which one it needs to parse?
– cricket_007
Nov 9 at 1:03






So, you have two possible /Bskt/Tender/ paths for every <Bskt>. How are you letting Hive know which one it needs to parse?
– cricket_007
Nov 9 at 1:03














Typo Corrected. Even my confusion is same.How can i let Hive know which one it needs to parse in my case where each 'Bskt' has two 'Tenders' with same names for elements "Payment" and "Amt".
– Nandakumar_bigdata
Nov 9 at 4:09






Typo Corrected. Even my confusion is same.How can i let Hive know which one it needs to parse in my case where each 'Bskt' has two 'Tenders' with same names for elements "Payment" and "Amt".
– Nandakumar_bigdata
Nov 9 at 4:09














I need to achieve either of these: 1) Create another row for 2nd tender with same Basket details. OR 2) Create another column called "Payment_Loyalty" and "Amt_Loyalty" and load the second tender details there. I'm wondering how i can achieve either of these solutions.
– Nandakumar_bigdata
Nov 9 at 4:13






I need to achieve either of these: 1) Create another row for 2nd tender with same Basket details. OR 2) Create another column called "Payment_Loyalty" and "Amt_Loyalty" and load the second tender details there. I'm wondering how i can achieve either of these solutions.
– Nandakumar_bigdata
Nov 9 at 4:13














1 Answer
1






active

oldest

votes

















up vote
1
down vote



accepted










GET only one Field



You can use subscript operator () to parse xml. However, please make sure index starts with 1 and not 0.



In your case, I assume you want second instance of tender. Then just use following xml path.



"column.xpath.PayType"="/Bskt/Tender[2]/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender[2]/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender[2]/Amt/text()",


This will give you following values.



enter image description here



Get Both the fields



if you want to get both the fields as part of array, you need to define those field as array and do not provide any subscript operator when selecting xpath, as mentioned below



drop table temp.BASKET_TENDER;
CREATE EXTERNAL TABLE temp.BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` array<String>,
`Amt` array<float>,
`CardNo` array<string>,
`Program` string,
`Earn` float,
`Burn` float
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/tmp/BASKET_TENDER'
TBLPROPERTIES (
"xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);

select * from temp.BASKET_TENDER;


The output would be as mentioned blow



enter image description here






share|improve this answer























  • Would it be possible to get XPATH to fill a Hive array field?
    – cricket_007
    Nov 9 at 16:09










  • yes, it's possible in that case you need to have fields as array
    – Gaurang Shah
    Nov 9 at 17:42










  • Thank you so much Gaurang.. It solved my problem. But the options provided above works fine :)
    – Nandakumar_bigdata
    Nov 11 at 22:18










  • Typo.. Both the options provided works fine :)
    – Nandakumar_bigdata
    Nov 11 at 23:19










  • HI Gaurang.. Could you please help me with a solution for my problem here:stackoverflow.com/questions/53384728/…
    – Nandakumar_bigdata
    Nov 20 at 3:05













Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














 

draft saved


draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53217712%2floading-xml-file-to-hive-tables%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
1
down vote



accepted










GET only one Field



You can use subscript operator () to parse xml. However, please make sure index starts with 1 and not 0.



In your case, I assume you want second instance of tender. Then just use following xml path.



"column.xpath.PayType"="/Bskt/Tender[2]/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender[2]/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender[2]/Amt/text()",


This will give you following values.



enter image description here



Get Both the fields



if you want to get both the fields as part of array, you need to define those field as array and do not provide any subscript operator when selecting xpath, as mentioned below



drop table temp.BASKET_TENDER;
CREATE EXTERNAL TABLE temp.BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` array<String>,
`Amt` array<float>,
`CardNo` array<string>,
`Program` string,
`Earn` float,
`Burn` float
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/tmp/BASKET_TENDER'
TBLPROPERTIES (
"xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);

select * from temp.BASKET_TENDER;


The output would be as mentioned blow



enter image description here






share|improve this answer























  • Would it be possible to get XPATH to fill a Hive array field?
    – cricket_007
    Nov 9 at 16:09










  • yes, it's possible in that case you need to have fields as array
    – Gaurang Shah
    Nov 9 at 17:42










  • Thank you so much Gaurang.. It solved my problem. But the options provided above works fine :)
    – Nandakumar_bigdata
    Nov 11 at 22:18










  • Typo.. Both the options provided works fine :)
    – Nandakumar_bigdata
    Nov 11 at 23:19










  • HI Gaurang.. Could you please help me with a solution for my problem here:stackoverflow.com/questions/53384728/…
    – Nandakumar_bigdata
    Nov 20 at 3:05

















up vote
1
down vote



accepted










GET only one Field



You can use subscript operator () to parse xml. However, please make sure index starts with 1 and not 0.



In your case, I assume you want second instance of tender. Then just use following xml path.



"column.xpath.PayType"="/Bskt/Tender[2]/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender[2]/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender[2]/Amt/text()",


This will give you following values.



enter image description here



Get Both the fields



if you want to get both the fields as part of array, you need to define those field as array and do not provide any subscript operator when selecting xpath, as mentioned below



drop table temp.BASKET_TENDER;
CREATE EXTERNAL TABLE temp.BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` array<String>,
`Amt` array<float>,
`CardNo` array<string>,
`Program` string,
`Earn` float,
`Burn` float
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/tmp/BASKET_TENDER'
TBLPROPERTIES (
"xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);

select * from temp.BASKET_TENDER;


The output would be as mentioned blow



enter image description here






share|improve this answer























  • Would it be possible to get XPATH to fill a Hive array field?
    – cricket_007
    Nov 9 at 16:09










  • yes, it's possible in that case you need to have fields as array
    – Gaurang Shah
    Nov 9 at 17:42










  • Thank you so much Gaurang.. It solved my problem. But the options provided above works fine :)
    – Nandakumar_bigdata
    Nov 11 at 22:18










  • Typo.. Both the options provided works fine :)
    – Nandakumar_bigdata
    Nov 11 at 23:19










  • HI Gaurang.. Could you please help me with a solution for my problem here:stackoverflow.com/questions/53384728/…
    – Nandakumar_bigdata
    Nov 20 at 3:05















up vote
1
down vote



accepted







up vote
1
down vote



accepted






GET only one Field



You can use subscript operator () to parse xml. However, please make sure index starts with 1 and not 0.



In your case, I assume you want second instance of tender. Then just use following xml path.



"column.xpath.PayType"="/Bskt/Tender[2]/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender[2]/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender[2]/Amt/text()",


This will give you following values.



enter image description here



Get Both the fields



if you want to get both the fields as part of array, you need to define those field as array and do not provide any subscript operator when selecting xpath, as mentioned below



drop table temp.BASKET_TENDER;
CREATE EXTERNAL TABLE temp.BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` array<String>,
`Amt` array<float>,
`CardNo` array<string>,
`Program` string,
`Earn` float,
`Burn` float
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/tmp/BASKET_TENDER'
TBLPROPERTIES (
"xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);

select * from temp.BASKET_TENDER;


The output would be as mentioned blow



enter image description here






share|improve this answer














GET only one Field



You can use subscript operator () to parse xml. However, please make sure index starts with 1 and not 0.



In your case, I assume you want second instance of tender. Then just use following xml path.



"column.xpath.PayType"="/Bskt/Tender[2]/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender[2]/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender[2]/Amt/text()",


This will give you following values.



enter image description here



Get Both the fields



if you want to get both the fields as part of array, you need to define those field as array and do not provide any subscript operator when selecting xpath, as mentioned below



drop table temp.BASKET_TENDER;
CREATE EXTERNAL TABLE temp.BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` array<String>,
`Amt` array<float>,
`CardNo` array<string>,
`Program` string,
`Earn` float,
`Burn` float
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/tmp/BASKET_TENDER'
TBLPROPERTIES (
"xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);

select * from temp.BASKET_TENDER;


The output would be as mentioned blow



enter image description here







share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 9 at 17:46

























answered Nov 9 at 14:30









Gaurang Shah

2,67311030




2,67311030












  • Would it be possible to get XPATH to fill a Hive array field?
    – cricket_007
    Nov 9 at 16:09










  • yes, it's possible in that case you need to have fields as array
    – Gaurang Shah
    Nov 9 at 17:42










  • Thank you so much Gaurang.. It solved my problem. But the options provided above works fine :)
    – Nandakumar_bigdata
    Nov 11 at 22:18










  • Typo.. Both the options provided works fine :)
    – Nandakumar_bigdata
    Nov 11 at 23:19










  • HI Gaurang.. Could you please help me with a solution for my problem here:stackoverflow.com/questions/53384728/…
    – Nandakumar_bigdata
    Nov 20 at 3:05




















  • Would it be possible to get XPATH to fill a Hive array field?
    – cricket_007
    Nov 9 at 16:09










  • yes, it's possible in that case you need to have fields as array
    – Gaurang Shah
    Nov 9 at 17:42










  • Thank you so much Gaurang.. It solved my problem. But the options provided above works fine :)
    – Nandakumar_bigdata
    Nov 11 at 22:18










  • Typo.. Both the options provided works fine :)
    – Nandakumar_bigdata
    Nov 11 at 23:19










  • HI Gaurang.. Could you please help me with a solution for my problem here:stackoverflow.com/questions/53384728/…
    – Nandakumar_bigdata
    Nov 20 at 3:05


















Would it be possible to get XPATH to fill a Hive array field?
– cricket_007
Nov 9 at 16:09




Would it be possible to get XPATH to fill a Hive array field?
– cricket_007
Nov 9 at 16:09












yes, it's possible in that case you need to have fields as array
– Gaurang Shah
Nov 9 at 17:42




yes, it's possible in that case you need to have fields as array
– Gaurang Shah
Nov 9 at 17:42












Thank you so much Gaurang.. It solved my problem. But the options provided above works fine :)
– Nandakumar_bigdata
Nov 11 at 22:18




Thank you so much Gaurang.. It solved my problem. But the options provided above works fine :)
– Nandakumar_bigdata
Nov 11 at 22:18












Typo.. Both the options provided works fine :)
– Nandakumar_bigdata
Nov 11 at 23:19




Typo.. Both the options provided works fine :)
– Nandakumar_bigdata
Nov 11 at 23:19












HI Gaurang.. Could you please help me with a solution for my problem here:stackoverflow.com/questions/53384728/…
– Nandakumar_bigdata
Nov 20 at 3:05






HI Gaurang.. Could you please help me with a solution for my problem here:stackoverflow.com/questions/53384728/…
– Nandakumar_bigdata
Nov 20 at 3:05




















 

draft saved


draft discarded



















































 


draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53217712%2floading-xml-file-to-hive-tables%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Schultheiß

Verwaltungsgliederung Dänemarks

Liste der Kulturdenkmale in Wilsdruff