Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python 3.x - Reading CDATA with lxml, problem with end of line

Hello, I am parsing an XML document that contains a bunch of CDATA sections. It was working with no problems until now. I realised that when I read an element and get its text attribute, I get end-of-line characters at the beginning and at the end of the text.

The relevant piece of the code is as follows:

for comments in self.xml.iter("Comments"):
    for comment in comments.iter("Comment"):
        description = comment.get('Description')

        if language == "Arab":
            tag = self.name + description
            text = comment.text

The problem is with the Comment element, which is written like this:

<Comment>
<![CDATA[Usually made it with not reason]]>
</Comment>

When I read the text attribute, I get this:


Usually made it with not reason

I know that I could call strip() and so on, but I would like to fix the root cause. Maybe there is some option I can set before parsing with ElementTree.

I parse the XML file like this:

tree = ET.parse(xml)

Minimal reproducible example

import xml.etree.ElementTree as ET

filename = "test.xml"  # Put the path to your test XML file here

tree = ET.parse(filename)
root = tree.getroot()  # In the minimal file below, the root is the <Description> element itself
text = root.text

print(text)

Minimal xml file

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Description>
<![CDATA[Hello world]]>
</Description>


1 Answer


You're getting newline characters because there are newline characters:

<Comment>
<![CDATA[Usually made it with not reason]]>
</Comment>

Why else would <![CDATA and </Comment start on new lines?

If you don't want newline characters, remove them:

<Comment><![CDATA[Usually made it with not reason]]></Comment>

Everything inside an element counts towards its string value.
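
To see the difference for yourself, here is a small sketch that parses both variants from strings with xml.etree.ElementTree and prints the repr of the text. The variable names are only for illustration; the XML is the Comment element from the question.

```python
import xml.etree.ElementTree as ET

# Same content, with and without line breaks around the CDATA section.
with_newlines = "<Comment>\n<![CDATA[Usually made it with not reason]]>\n</Comment>"
without_newlines = "<Comment><![CDATA[Usually made it with not reason]]></Comment>"

text_a = ET.fromstring(with_newlines).text
text_b = ET.fromstring(without_newlines).text

# repr() makes the newline characters visible.
print(repr(text_a))  # '\nUsually made it with not reason\n'
print(repr(text_b))  # 'Usually made it with not reason'
```

The parser reports exactly what is in the file in both cases; only the second file has no whitespace between the tags and the CDATA section.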

<![CDATA[...]]> is not an element, it's a parser flag. It changes how the XML parser is reading the enclosed characters. You can have multiple CDATA sections in the same element, switching between "regular mode" and "cdata mode" at will:

<Comment>normal text <![CDATA[
    CDATA mode, this may contain <unescaped> Characters!
]]> now normal text again
<![CDATA[more special text]]> now normal text again
</Comment>

Any newlines before and after a CDATA section count towards the "normal text" section. When the parser reads this, it will create one long string consisting of the individual parts:

normal text 
    CDATA mode, this may contain <unescaped> Characters!
 now normal text again
more special text now normal text again
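
You can check this with a short sketch, again assuming xml.etree.ElementTree and parsing the example above from a string. The last line shows the fallback for when you cannot change the XML file: trimming the text when you read it, which is the strip() approach already mentioned in the question.

```python
import xml.etree.ElementTree as ET

# The mixed "normal text" / CDATA example from above, as one string.
xml_source = (
    "<Comment>normal text <![CDATA[\n"
    "    CDATA mode, this may contain <unescaped> Characters!\n"
    "]]> now normal text again\n"
    "<![CDATA[more special text]]> now normal text again\n"
    "</Comment>"
)

comment = ET.fromstring(xml_source)

# The CDATA sections do not become child elements; everything is one string.
print(len(comment))        # 0
print(repr(comment.text))

# Fallback when the file cannot be edited: trim the text at read time.
trimmed = (comment.text or "").strip()
print(repr(trimmed))
```

Note that strip() only removes whitespace at the very start and end; newlines in the middle of the combined string stay where they are.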
