Issue 126398 - Empty text-boxes in exported HTML (from impress) render self-closed DIV, leading to matryoshka-doll-like nested pages
Status: UNCONFIRMED
Alias: None
Product: Impress
Classification: Application
Component: save-export
Version: 4.1.1
Hardware: Mac OS X 10.9
Importance: P5 (lowest) Normal
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords: needmoreinfo
Depends on:
Blocks:
 
Reported: 2015-07-09 19:35 UTC by sergiozambrano
Modified: 2015-08-02 11:30 UTC

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
corrupt closing div (76.70 KB, image/png)
2015-07-27 11:54 UTC, sergiozambrano
Presentation with two slides (520.11 KB, application/vnd.oasis.opendocument.presentation)
2015-07-27 12:07 UTC, sergiozambrano
File exported from impress presentation. (652.77 KB, text/html)
2015-07-27 12:08 UTC, sergiozambrano
Filter used to export the HTML (75.50 KB, image/png)
2015-07-27 12:10 UTC, sergiozambrano
NESTED SLIDE structure, as per Developer tools (Chrome) (84.27 KB, image/png)
2015-07-28 18:20 UTC, sergiozambrano
Structure as seen in XML Nodepad (66.30 KB, image/png)
2015-07-28 20:05 UTC, Regina Henschel

Description sergiozambrano 2015-07-09 19:35:27 UTC
Aside from the fact that the exported code seems intended for reading back into OpenOffice and NOT for being formatted as HTML (the classes and titles do nothing to help when working with it later), the last slide is always nested INSIDE the second-to-last one.

I have a 43-slide document, and when it is scraped with XPath, the last slide is always missing.

I found out it IS there, but inside slide 42 (or inside one of the zillion divs OO creates that are not named #page-n).
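To make the symptom concrete, here is a minimal Python sketch using only the standard library. The markup and the ids page-1 and page-2 are made up for illustration; they mimic the nesting described above and are not taken from the real export.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the second page ended up inside the first.
NESTED = (
    '<body>'
    '<div id="page-1"><p>slide 1</p>'
    '<div id="page-2"><p>slide 2</p></div>'
    '</div>'
    '</body>'
)


def page_ids(markup, path):
    """Return the id of every div matched by an ElementTree path."""
    body = ET.fromstring(markup)
    return [div.get("id") for div in body.findall(path)]


# Only direct children of <body>: the nested page is not found.
direct = page_ids(NESTED, "./div[@id]")
# Any depth below <body>: the nested page shows up again.
anywhere = page_ids(NESTED, ".//div[@id]")
print(direct, anywhere)
```

A query that expects every page to be a direct child of body misses the nested one, while a descendant query still finds it, which matches the "it IS there, but inside slide 42" observation.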
Comment 1 oooforum (fr) 2015-07-15 09:35:14 UTC
Could you attach a sample document to see the problem?
Comment 2 sergiozambrano 2015-07-26 15:20:08 UTC
No need for an example.

I already found the error and tracked it down to a malformed closing div tag (the slash is at the end, making it a self-closed tag, instead of at the beginning).

The tag also has an empty style attribute, which doesn't happen in the opening tags for a text-box, and that made me realize that the comment before it has nothing to do with the tag.

It seems that the lack of content to feed the parameters of the DIV tag made it render with its defaults (empty style attribute, self-closed by default).

WATCH OUT!

DON'T just remove the self-closing div because that tag is now part of the previous div's closing parts. (some deeper bug removed two tags, just not the right ones, so the one that remains is the one that should have been removed, and the good closing tag for its parent was removed instead).

If you do, all the pages will be inside the previous page, like a Russian matryoshka doll.

It was hard to track down the error because, even though there is no text-box, the heading comment STILL prints, not to mention that all the comments say "Next WAS…" instead of "IS"… so reading them confuses more than it helps.

I tried to fix it, but I only got as far as the file body.xsl, and I don't understand how the self-closing is defined (nor do I understand the language, but sometimes I find the way). Not this time though :(
Comment 3 Regina Henschel 2015-07-26 19:13:02 UTC
Please add an example presentation and the export result.

Your description is not clear. What export do you use? What is "empty", a presentation object or a text-box? How do you notice something is "missing"?

I have exported a presentation to xhtml and do not see any structural problem. I get missing drawings, but that is different from your report.
Comment 4 sergiozambrano 2015-07-27 11:54:56 UTC
Created attachment 84838 [details]
corrupt closing div

comment announcing text-box which never comes, AND the next closing tag is wrong (self-closed)
Comment 5 Regina Henschel 2015-07-27 11:59:56 UTC
I do not need a picture, but the files themselves. Both.
Comment 6 sergiozambrano 2015-07-27 12:07:44 UTC
Created attachment 84839 [details]
Presentation with two slides

Presentation with two slides, as opened from a .ppt from Office and saved for the first time in .odp format. (I don't know if the error could self-fix after being saved, but I can't send the original .ppt presentation (1) because it contains my client's information, and (2) because it's 34 MB.)

The end of the first slide/page already produces the error (self-closing div)
Comment 7 sergiozambrano 2015-07-27 12:08:29 UTC
Created attachment 84840 [details]
File exported from impress presentation.
Comment 8 sergiozambrano 2015-07-27 12:10:01 UTC
Created attachment 84841 [details]
Filter used to export the HTML
Comment 9 sergiozambrano 2015-07-27 12:32:03 UTC
What export do I use?
The one that exports the slides in the same document (the other saves individual documents, which could not produce nested pages as I described).

I don't know of any export other than the two listed in the attached image.


How do I know something is missing?
If you read the whole thing you'll know it's irrelevant, since I already explained what the case was (xPath not finding the page at the same level as the others: inside body.)

If you are referring to the missing closing tag:
You know they should come out in pairs, right? 
If one opens and never closes, I have a tag missing.


About "empty text-boxes" (the only place where I said "empty" other than "empty attribute", which is self-explanatory).
I called it "empty text-boxes" because the comment IS there announcing the draw:text-box, and no text-box follows; obviously it's not being rendered because the box (frame, master element, or whatever you call it) is EMPTY in the document.
Comment 10 sergiozambrano 2015-07-27 12:40:38 UTC
I just noticed the only time I said "empty" (and not followed by "style attribute") was in the title, and there it's followed by the word "text-boxes".

So for your answer to "…a presentation object or a text-box?", the answer is…

wait for it…

wait for it…

"what is a text-box" Cha-chin!

:)
Comment 11 Regina Henschel 2015-07-27 15:48:33 UTC
There is no structural error in the transformation result. Test the output on http://validator.w3.org/file-upload.html . (Change the file name extension to xhtml before upload, so that it is uploaded with the correct mime type.)

<div style=""/> is useless, but nevertheless a valid empty div-element in xhtml.

There is no slide inside another. The part
<div style="clear:both; line-height:0; width:0; height:0; margin:0; padding:0;"> </div>
marks the end of a slide. You can better identify the parts of your slides, when you name the parts. The name will occur in the xhtml output as id.


I know, that the xhtml output of a presentation is poor, but it is valid.
Comment 12 hanya 2015-07-27 16:53:10 UTC
The empty style="" is always shown because 
<xsl:if test="$dimension"> is always true when the variable is defined in <xsl:template match="draw:text-box">
<xsl:if test="$dimension != ''"> is better to suppress the empty style attribute.
Comment 13 sergiozambrano 2015-07-28 18:16:13 UTC
With all due respect to advanced programmers, this exported file is INTENDED to be used as a WEB PAGE, not an XML DATA FILE.

The file I uploaded, OPENED IN CHROME, IS VALID, of course, but THE SECOND SLIDE APPEARS INSIDE THE FIRST ONE, probably because the browser decided to do so when finding a self-closing DIV, and THAT MAKES IT IMPOSSIBLE TO PROPERLY FORMAT THEM with CSS.

I BELIEVE that the context for your assumption of this being "correct" is WRONG.
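The disagreement in this thread comes down to which parsing rules are applied to the same bytes. The sketch below is a rough Python model, not a real browser: it assumes the text/html rule that a trailing slash on a non-void element such as div is ignored, and compares the result with a strict XML parse. The sample markup, the ids page1 and page2, and the class and function names are invented for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Made-up miniature of the export: two page divs, the first one
# containing an empty <div style=""/>.
SAMPLE = (
    '<body>'
    '<div id="page1"><div style=""/></div>'
    '<div id="page2"></div>'
    '</body>'
)

# Elements that HTML allows to be written without an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class HtmlRules(HTMLParser):
    """Record each element's parent, ignoring '/>' on non-void elements."""

    def __init__(self):
        super().__init__()
        self.open_elements = []
        self.parent_of = {}

    def handle_starttag(self, tag, attrs):
        label = dict(attrs).get("id") or tag
        parent = self.open_elements[-1] if self.open_elements else None
        self.parent_of[label] = parent
        if tag not in VOID:
            self.open_elements.append(label)

    def handle_startendtag(self, tag, attrs):
        # Assumed text/html behaviour: the trailing slash does not close a div.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Simplification: an end tag closes the innermost open element.
        if self.open_elements:
            self.open_elements.pop()


def parents_as_html(markup):
    """Map each element label to its parent under the HTML-style rules."""
    parser = HtmlRules()
    parser.feed(markup)
    parser.close()
    return parser.parent_of


def parents_as_xml(markup):
    """Map each element label to its parent under strict XML rules."""
    def label(el):
        return el.get("id") or el.tag
    root = ET.fromstring(markup)
    return {label(child): label(parent)
            for parent in root.iter() for child in parent}


print(parents_as_html(SAMPLE)["page2"])  # parent under HTML-style rules
print(parents_as_xml(SAMPLE)["page2"])   # parent under XML rules
```

Under the HTML-style rules the second page ends up inside the first, while the XML parse keeps both pages as children of body. That would explain how the file can validate as XHTML and still show nested slides when a browser reads it as HTML.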
Comment 14 sergiozambrano 2015-07-28 18:17:56 UTC
Can you just MOVE THE FORWARD SLASH to the beginning of the div tag PLEASE?
It won't hurt anyone and the file will WORK AS EXPECTED (opening and closing div tags balanced) in a BROWSER. 

Thanks.
Comment 15 sergiozambrano 2015-07-28 18:20:34 UTC
Created attachment 84843 [details]
NESTED SLIDE structure, as per Developer tools (Chrome)
Comment 16 sergiozambrano 2015-07-28 18:30:36 UTC
(In reply to hanya from comment #12)
> The empty style="" is always shown because 
> <xsl:if test="$dimension"> is always true when the variable is defined in
> <xsl:template match="draw:text-box">
> <xsl:if test="$dimension != ''"> is better to suppress the empty style
> attribute.

The tag you are talking about is the CLOSING tag for the one before the comment. YOU CAN'T PUT STYLING in a closing tag, much less replace it with a self-closing tag.

Please look at it from the HTML point of view, not XML. Replacing it with a closing tag WILL ALSO BE VALID, but useful for the public who is exporting this for the HTML (not programmers; I'm guessing 90% are not programmers). I'm pretty sure programmers would choose a different method to export a presentation as data, probably the original OO XML.
Comment 17 sergiozambrano 2015-07-28 18:34:11 UTC
(In reply to Regina Henschel from comment #11)
> There is no structural error in the transformation result. Test the output
> on http://validator.w3.org/file-upload.html . (Change the file name
> extension to xhtml before upload, so that it is uploaded with the correct
> mime type.)
> 
> <div style=""/> is useless, but nevertheless a valid empty div-element in
> xhtml.
> 
> There is no slide inside another. The part
> <div style="clear:both; line-height:0; width:0; height:0; margin:0;
> padding:0;"> </div>
> marks the end of a slide. You can better identify the parts of your slides,
> when you name the parts. The name will occur in the xhtml output as id.
> 
> 
> I know, that the xhtml output of a presentation is poor, but it is valid.

NO OTHER CLOSING TAG has styling inside. Have you even noticed the tag in question is the closing tag for the tag before the comment? (I assumed it was for a text-box div that was not printed, but I just noticed it's THERE, before the comment. I was confused because the PAGE tag was opened with a comment saying "WAS" instead of "IS", so I assumed all the comments said "WAS", but it seems not all of them are (definitely the one before draw:page div IS).)
Comment 18 Regina Henschel 2015-07-28 20:05:50 UTC
Created attachment 84844 [details]
Structure as seen in XML Nodepad

<div style=""/> is not a closing tag, but it is an empty element. It is the short form of <div style=""></div>.
See http://www.w3.org/TR/xhtml1/#h-4.6

You should consider using a different tool to examine the output. The attached file is a screenshot of the structure as shown in "XML Nodepad". I have examined the structure manually too. It is exactly as "XML Nodepad" shows it. There is no error.

The body contains two times the comment "Next 'div' was a 'draw:page'.", which indicates the start of a slide. After the comment you see two div-elements. The first div-element contains the content of the slide, the second div-element contains the style rule "clear:both", which is needed to force the next div-element to start after any floating content.
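The statement that the short form and the long form are the same element can be checked with any XML parser. Here is a minimal sketch using Python's xml.etree.ElementTree; the helper name describe is made up for this example.

```python
import xml.etree.ElementTree as ET


def describe(markup):
    """Parse one element and return what an XML consumer can see of it."""
    el = ET.fromstring(markup)
    return (el.tag, dict(el.attrib), el.text or "", len(el))


short_form = describe('<div style=""/>')
long_form = describe('<div style=""></div>')
print(short_form == long_form)
```

Both spellings produce the same tag, the same attributes, no text and no children, so an XML tool cannot tell them apart. The difference only appears when the file is read by something that does not apply XML rules.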
Comment 19 hanya 2015-07-29 02:10:27 UTC
(In reply to sergiozambrano from comment #16)
> 
> Please look at it from the HTML point of view, not XML. Replacing it with a
> closing tag WILL ALSO BE VALID, but useful for the public who is exporting
> this for the HTML (not programmers; I'm guessing 90% are not programmers).
> I'm pretty sure programmers would choose a different method to export a
> presentation as data, probably the original OO XML.

You are using the wrong export filter. It's for XHTML, not for HTML.
Comment 20 Regina Henschel 2015-07-29 11:30:30 UTC
(In reply to sergiozambrano from comment #15)
> Created attachment 84843 [details]
> NESTED SLIDE structure, as per Developer tools (Chrome)

I do not say, that your original document exports correctly, but the attached files have no error, so I cannot reproduce your problem. To investigate the problems in your file, you need to make the file smaller. You have embedded the pictures. That results in huge base64 encoded img-elements. If you link the pictures, you get the usual link in the src-attribute in the img-element. Then you can examine the structure of the file much easier.

I suggest, you discuss the problem in a forum or mailing list. Feel free to reopen the bug, if you can attach an .odp file, which we can use to reproduce the problem.
Comment 21 sergiozambrano 2015-07-30 17:35:15 UTC
Ok, let me approach it from a different point of view:

In the document, there are tags for XML purposes, and tags for HTML representation (div is HTML, whether you like it or not).

So, CONSIDERING the div tag is for REPRESENTATION purposes, THAT TAG MUST be a closing tag, because it must pair with the previous one. That's the ONLY way in which this HTML document would create contiguous sibling pages in a VALID HTML BROWSER.

Now, if you want to see it as XML, the comment before that div tag says that next WAS (which means it should be the closing tag of the ending text-block). I call that WRONG.

If that tag is so crucial for xml MAKE IT SOMETHING ELSE, NOT A DIV. Because a DIV IS FROM THE HTML SET, AND NEEDS TO BE CLOSED.

So, instead of you fixing one forward slash that 

a) breaks html structure (not balanced)
b) should be closing as per the comment itself
c) would NOT hurt the xml at all
d) has no value as xml data (unless an empty style value triggers a paradox)

You expect me to 

a) consider another export tool (which you don't mention, which probably means there's none other for this)
b) develop my own exporter
c) search-replace the offending tag

Does that make sense for you?
If it does, PLEASE ignore my messages and LET OTHERS answer. You don't own the project and yours is just your opinion. Your answers here are making others believe I'm being taken care of. Thanks.
Comment 22 hanya 2015-07-31 00:40:51 UTC
Since we use libxslt and XSLT to convert from an ODF file to an XHTML file,
we cannot do much customization of the result.

libxslt has an option to change its output type to html. When you choose the "html" method in the xsl:output element,
open tags and end tags are written separately.
But you might get other problems that are not yet known.

INSTALLED_PATH/openoffice4/share/xslt/export/opendoc2xhtml.xsl
@@ -69,7 +69,7 @@
 	<xsl:include href="body.xsl" />
 
 
-	<xsl:output method               = "xml"
+	<xsl:output method               = "html"
 				encoding             = "UTF-8"
 				media-type           = "application/xhtml+xml"
 				indent               = "no"
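The effect of switching the output method can be illustrated without libxslt. The sketch below uses Python's xml.etree.ElementTree, whose serializer also has an "xml" and an "html" method; it is only an analogy for what xsl:output does, not the OpenOffice export code, and the helper name serialize is made up.

```python
import xml.etree.ElementTree as ET


def serialize(markup, method):
    """Parse a fragment and write it back with the given output method."""
    element = ET.fromstring(markup)
    return ET.tostring(element, encoding="unicode", method=method)


as_xml = serialize('<div style=""/>', "xml")
as_html = serialize('<div style=""/>', "html")
print(as_xml)
print(as_html)
```

With the "xml" method the empty div is written in the short, self-closed form; with the "html" method it gets a separate start and end tag, which is the form browsers reading the file as HTML expect.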
Comment 23 sergiozambrano 2015-07-31 20:54:12 UTC
(In reply to hanya from comment #22)
> Since we use libxslt and XSLT to convert from an ODF file to an XHTML file,
> we cannot do much customization of the result.
>
> libxslt has an option to change its output type to html. When you choose the
> "html" method in the xsl:output element,
> open tags and end tags are written separately.
> But you might get other problems that are not yet known.
> 
> INSTALLED_PATH/openoffice4/share/xslt/export/opendoc2xhtml.xsl
> @@ -69,7 +69,7 @@
>  	<xsl:include href="body.xsl" />
>  
>  
> -	<xsl:output method               = "xml"
> +	<xsl:output method               = "html"
>  				encoding             = "UTF-8"
>  				media-type           = "application/xhtml+xml"
>  				indent               = "no"

Yes, I guessed the engine which does that is an already established, robust engine. That's why I'm asking those who know how tags are "requested" to look into it, because NONE of the other tags in the document closes like that. There must be some empty or invalid parameter that prevents the tag from populating its properties, so it must be outputting its default attributes until initialized (self-closed, empty style).

If you know where I could start looking, I'd do it myself. I don't mind looking through code to see if something sparks :)
Comment 24 hanya 2015-08-01 12:56:46 UTC
The file path was given above.
See Comment 12 for the empty style attribute. You can find the "body.xsl" file in the
same directory as that path. Search in the file.
Comment 25 sergiozambrano 2015-08-02 11:22:15 UTC
Thanks for your advice. I found the output setting and when changing it to HTML it fixed the self-closing divs.

The export format drop down menu in the export dialog should have separate options for xml and html though. 

Right now it reads ".xhtml; .html" and that is affecting the biggest user group: the home user who doesn't know how to deal with the difference.

(The programmer would know what to do to get strict XML, or would figure it out faster… and they are the smaller user group among people who need to export to HTML, because the programmer would rather use another format.)

Should I open a new feature request for that? (separate xhtml and html options in the exporter window)



It seems the error is already known in bugzilla tracker for other xslt projects.
It happens in empty elements, as I noticed. 

The solution so far is to put a comment inside the empty element, or a non-breaking space.
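As a rough illustration of that workaround, the Python sketch below puts a non-breaking space into every empty, non-void element before serializing, so an XML serializer has to write a separate end tag. It assumes namespace-less tag names for brevity (a real XHTML export would carry the XHTML namespace), and the function name pad_empty_elements is made up.

```python
import xml.etree.ElementTree as ET

# Elements that are allowed to stay empty and self-closed.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


def pad_empty_elements(root, filler="\u00a0"):
    """Give every empty non-void element some text so it is not self-closed."""
    for el in root.iter():
        if not isinstance(el.tag, str):
            continue  # skip comments and processing instructions
        if el.tag not in VOID and len(el) == 0 and not el.text:
            el.text = filler
    return root


doc = ET.fromstring('<body><div style=""/><br/></body>')
padded = ET.tostring(pad_empty_elements(doc), encoding="unicode")
print(padded)
```

The empty div now has content and is written with an explicit closing tag, while br stays in the short form. The trade-off is that the filler becomes part of the document, which may affect layout.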

I don't know XML well enough to quickly figure it out, and I still hope the exported document could be made useful in modern browsers without hacking the exporter output setting in the application files, so I thought I could save you the search and copied some solutions I found on the web.



lxml.sax.ElementTreeContentHandler checks closing elements and raises SaxError on mismatch
lxml.sax.ElementTreeContentHandler supports namespace-less SAX events (startElement, endElement) and defaults to empty attributes (keyword argument)

Fixed in 2006 "Removing Elements from a tree could make them loose their namespace declarations" 

That sounds like this bug (the tag being output in its default form: not opening, not closing, empty style by default).

The recommendation for cases where the element has no content is at http://www.w3.org/TR/xhtml-media-types/#C_2



If you have declared the xhtml namespace on xsl:stylesheet then elements created in the stylesheet will be xhtml (unless you work hard to stop that) so perhaps these div elements have been copied from an input document?
If the input is in no-namespace and you want to generate elements in the xhtml namespace, you don't want to copy with xsl:copy or xsl:copy-of; you want to generate an element in the (new) default namespace but with the same local name as before, so don't use
<xsl:copy>
use
<xsl:element name="{local-name()}">

http://www.thecodingforums.com/threads/self-closing-tags.596847/
Comment 26 sergiozambrano 2015-08-02 11:30:16 UTC
Oh, another fix I found here http://stackoverflow.com/questions/5032347/xslt-stylesheet-replaces-self-closing-tags-with-empty-paired-tags

…You could try to fool the processor by adding empty content in elements. In this case it can be done by modifying the identity template.

<!-- Define a dummy variable with empty content -->
<xsl:variable name="empty" select="''"/>

<!-- Copy input to output, most of the time -->
<xsl:template match="@* | node()">
    <xsl:copy>
        <xsl:apply-templates select="@* | node()" />
        <!-- Insert empty content into copied element -->
        <xsl:value-of select="$empty"/>
    </xsl:copy>
</xsl:template>

<!-- Identity template for empty elements -->
<xsl:template match="*[not(node())]">
    <xsl:copy>
        <xsl:apply-templates select="@* | node()" />
        <xsl:value-of select="$empty"/>
    </xsl:copy>
</xsl:template>