Issue 119219 - Saved RTF has issue in encoding Latin1 characters
Summary: Saved RTF has issue in encoding Latin1 characters
Status: CONFIRMED
Alias: None
Product: Impress
Classification: Application
Component: save-export
Version: OOo 3.2
Hardware: PC All
Importance: P3 Normal
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords: needmoreinfo
Depends on:
Blocks:
 
Reported: 2012-04-12 10:11 UTC by Chaitanya
Modified: 2013-01-29 21:47 UTC

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
This RTF document demonstrates the defect reported in #119219 (6.43 KB, text/plain)
2012-04-12 23:15 UTC, orcmid
ANSI Font that is detected as fcharset128 while saving in RTF (30.86 KB, application/octet-stream)
2012-04-14 19:24 UTC, Chaitanya
Sample Text in APS-DV-STARDUST-NORMAL (ODF format) (8.19 KB, application/vnd.oasis.opendocument.text)
2012-04-15 15:40 UTC, Chaitanya
Screenshot of Sample Text in APS-DV-STARDUST-NORMAL (ODF format) (31.07 KB, image/png)
2012-04-15 15:42 UTC, Chaitanya
The Devanagari-rendered sample.odt RTF from OpenOffice.org 3.3.0 (2.66 KB, text/plain)
2012-04-15 21:25 UTC, orcmid

Description Chaitanya 2012-04-12 10:11:05 UTC
Hi,

In the Windows version of OpenOffice 3.2, I observed an issue in saved RTF files.
To reproduce, open OpenOffice Writer and copy-paste this string: šbkeâ}sKeve
(Note: this is not garbage data but a meaningful Asian word when displayed with the monolingual font APS-DV-PRAKASH.)
Then save the file as RTF and inspect the RTF code in any plain-text editor such as Notepad.

It saves each non-English character as a Unicode value followed by a hex code, in the following fashion:
For character š, the RTF code shows \u353\'3f
For character â, the RTF code shows \u226\'3f
For every such special character the Unicode value is correct, but the hex code is always \'3f, which is incorrect.

The same string saved from AbiWord shows the proper hex values:
For character š, the RTF code shows \u353\'9a
For character â, the RTF code shows \u226\'e2
These are correct according to the Windows-1252 table on Wikipedia:
http://en.wikipedia.org/wiki/Windows-1252

This does not cause any problem when rendering the string in OpenOffice Writer, but if the file is used in other programs that rely strictly on the hex values, we hit an issue. (In any case, incorrect hex values are definitely a defect!)
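
The mismatch described above can be checked mechanically. The following is a minimal Python sketch, not part of the original report: it scans RTF text for \uN\'hh pairs and compares each hex fallback with the byte that cp1252 would give for that character. The function name check_fallbacks and the choice of '?' (3f) as the expected fallback for unmappable characters are assumptions made for illustration.

```python
import re

# Matches a Unicode escape followed by its single-byte fallback, e.g. \u353\'9a
PAIR = re.compile(r"\\u(-?\d+)\\'([0-9a-fA-F]{2})")

def check_fallbacks(rtf_text, codepage="cp1252"):
    """Return (char, found_hex, expected_hex) for every mismatching pair."""
    mismatches = []
    for m in PAIR.finditer(rtf_text):
        code = int(m.group(1))
        if code < 0:
            # RTF writes \uN as a signed 16-bit number
            code += 65536
        ch = chr(code)
        found = m.group(2).lower()
        try:
            expected = ch.encode(codepage).hex()
        except UnicodeEncodeError:
            # Assumption: '?' is the fallback when no single-byte code exists
            expected = "3f"
        if found != expected:
            mismatches.append((ch, found, expected))
    return mismatches
```

Run against the two forms quoted above, the OpenOffice output would be reported as a mismatch (3f found, 9a expected for š) and the AbiWord output would not.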

Thanks,
Chaitanya
Comment 1 Marcus 2012-04-12 20:44:53 UTC
Have you tried to reproduce this issue with the most recent version, 3.3? If not, please do so, as we do not support 3.2 anymore:
http://www.openoffice.org/download/other.html
Comment 2 orcmid 2012-04-12 22:41:05 UTC
I just confirmed this bug with Apache OpenOffice 3.4 developer preview r1309668 in Windows.  We can safely conclude that it has been present in all releases since at least 3.2.0.

@Chaitanya,

Thank you for the completeness of your description.  You made it easy to reproduce the problem.

I will include the .rtf that I was able to produce following your instructions.
Comment 3 orcmid 2012-04-12 23:15:23 UTC
Created attachment 77439 [details]
This RTF document demonstrates the defect reported in #119219

Although this file will download as plaintext, it is an RTF document (a data format that uses ASCII).  It can be opened correctly in OpenOffice Writer and in Microsoft Word.  The defect is only visible when viewing the plaintext.  Just search for "\u353" and you'll see the places where the incorrect single-byte code appears.  (3f is the ASCII code for '?'.)

It was produced with Apache OpenOffice 3.4 r1309668 on Windows.

Now, it is not clear to me that this is a bug.  The expected codes are greater than \'7f, the highest ASCII value.  The RTF prolog from AOO specifies that the RTF is ansi coded.  It does not specify a code page to be used for single-byte codes.  I also don't believe there is an option to select a code page as part of exporting to RTF format.

Since OpenOffice operates in Unicode I can see why there is a disconnect with Windows-1252.  OpenOffice export to RTF does not identify a code page in which its non-ASCII characters will be expressed in a single-byte code.  That's an interesting problem, since OpenOffice is a multi-platform product.
Comment 4 Chaitanya 2012-04-13 06:28:15 UTC
Thanks orcmid for taking this forward.
Looking forward to get this issue addressed.
Comment 5 orcmid 2012-04-13 15:13:26 UTC
(In reply to comment #0)
> Hi,
> In Open Office 3.2 windows version, I observed issue in RTF file.
[ ... ]
> For character š - RTF file code shows - \u353\'3f
> For character â - RTF code shows - \u226\'3f
> Like this for every special character the unicode value is correct but for each
> such character HEX code is incorrect and that is - \'3f for every special
> character.
[ ... ]
> This does not give any issue in rendering the string in Open Office writer but
> if this file is used in other programs which strictly uses HEX values, we hit
> with issue. (Anyways incorrect HEX values is definitely an issue!)
> Thanks,
> Chaitanya

Is this a regression?  That is, has any release of OpenOffice.org ever provided cp1252 hex values the way you expect?

Note that the RTF specification does not require that cp1252 be assumed as single-byte mappings for printable Unicode characters that are not representable in the RTF ASCII stream directly.
Comment 6 orcmid 2012-04-13 16:13:32 UTC
(In reply to comment #0)
> Hi,
> In Open Office 3.2 windows version, I observed issue in RTF file.
> You may open Open Office writer and copy -paste this string -- šbkeâ}sKeve
> (Note - this is not garbage data but meaningful Asian word if used with
> monolingual font - APS-DV-PRAKASH)
[ ... ]
> This does not give any issue in rendering the string in Open Office writer but
> if this file is used in other programs which strictly uses HEX values, we hit
> with issue. (Anyways incorrect HEX values is definitely an issue!)
> Thanks,
> Chaitanya

Here's my understanding of the situation.  I won't get too deep into it because I'd like confirmation first:

The idea is to use OpenOffice-lineage software (i.e., OpenOffice.org 3.x, Apache OpenOffice) as part of an out-of-band trick for making documents in a particular Asian character set encoding.  The use of the Asian character set is accomplished by disguising it as single-byte Windows ANSI codes, specifically Microsoft cp1252 (a variant of ISO 8859-1 in which the C1 controls, codes 0x80 - 0x9F, are mostly replaced by graphic characters).  The correct rendering is obtained in some single-byte applications by using a font that renders the codes as quite different characters than those specified for cp1252.
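
As a small illustration of how cp1252 differs from ISO 8859-1, the sketch below lists which bytes in the C1 range 0x80 - 0x9F decode to a graphic character under Python's cp1252 codec. The helper name cp1252_c1_graphics is hypothetical, and the result reflects Python's codec table rather than anything OpenOffice does internally.

```python
def cp1252_c1_graphics():
    """Map each byte in 0x80-0x9F to its cp1252 character, skipping undefined bytes."""
    mapped = {}
    for byte in range(0x80, 0xA0):
        try:
            mapped[byte] = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves a few positions undefined
            pass
    return mapped

if __name__ == "__main__":
    for byte, ch in sorted(cp1252_c1_graphics().items()):
        print("0x%02X -> U+%04X %s" % (byte, ord(ch), ch))
```

Byte 0x9A, for example, comes back as š (U+0161, decimal 353), which is the correspondence the tunneling relies on.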

To accomplish the tunneling in OpenOffice.org 3.2 and later, the correct Unicode characters for those Windows ANSI characters are used, although their Unicode code points are not the same as the Windows ANSI code points.  I am not sure how those are being entered in practice.  Apparently the correct Unicode characters for the desired Asian characters are not being entered.

The disappointment is that the export from Unicode-centric OpenOffice ODF 1.2-supporting documents to RTF does not convert the Unicode characters used to corresponding cp1252 code points so that they are successful disguises for Asian characters in some non-Unicode applications.

An obvious way to move these characters through OpenOffice is by using the correct Asian characters (if they exist in Unicode) in the first place, with appropriate fonts and font mappings from Unicode.  This will work for interchange among modern Unicode supporting applications, and it will work over RTF.  Unfortunately, that does not do much for (legacy?) non-Unicode applications and for those RTF documents created based on using the cp1252 disguise.

Is that the essence of the situation?
Comment 7 Chaitanya 2012-04-13 16:49:39 UTC
Hi orcmid,

Perfect ! I am amazed to see how can one document an issue so well.
It gave me feeling like you are reading my mind :=)
That was definitely a good learning.

Also to answer earlier question - Is this a regression? I am not sure, as I have not used Open Office older versions for such tasks.

Thanks
Comment 8 Chaitanya 2012-04-13 17:25:35 UTC
Hi orcmid,

One point I was wondering about: in most RTF files I have seen the header control word
'\ansicpg1252'. According to the following article (not an official RTF spec page),
http://latex2rtf.sourceforge.net/rtfspec_6.html
\ansicpgN is described as follows: 'This keyword represents the ANSI code page that is used to perform the Unicode to ANSI conversion when writing RTF text.'

This control word is missing in RTF created by OpenOffice (I checked the attached file as well), whereas it is present in most RTF files created by other editors such as AbiWord and TextMaker (part of SoftMaker Office).
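
Whether a given RTF file declares a code page can be checked with a few lines of Python. This is only a sketch: the function name declared_ansi_codepage is an assumption, and a plain regular-expression search stands in for a real RTF parser.

```python
import re

def declared_ansi_codepage(rtf_text):
    """Return the N from the first \\ansicpgN control word, or None if absent."""
    m = re.search(r"\\ansicpg(\d+)", rtf_text)
    return int(m.group(1)) if m else None
```

For a header such as {\rtf1\ansi\ansicpg1252 this returns 1252; for a header that carries only \ansi it returns None.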

Is this missing from the OpenOffice RTF export?
Comment 9 orcmid 2012-04-13 17:50:37 UTC
Yes, the RTF specification allows, but does not require the use of code-page-specific controls.

I assume it is not being specified beyond the use of the \ansi control because the single-byte codepage does not appear to be used for anything but ASCII (7-bit) codes and anything else in Unicode that does not fit in ASCII is coded as '?' (but the Unicode character offering is correct).

PROPOSED REMEDY

It would seem that your particular problem would be solved were the RTF export in OpenOffice updated to emit the \ansi\ansicpg1252 control sequence and present the correct hex for cp1252 along with the Unicode code point when there is a correspondence.  When there is no corresponding cp1252 code, the hex for '?' would still be provided.  (cp1252 is more than Latin1 and your tunneling usage depends on that).
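
The proposed behavior can be sketched in Python as follows. This is an illustration of the idea only, not the OpenOffice export code: the names RTF_PROLOG and rtf_escape are hypothetical, and the sketch assumes the default of one fallback byte per \uN and refuses characters outside the Basic Multilingual Plane.

```python
# Prolog the remedy would emit (assumed form; the rest of the header is omitted)
RTF_PROLOG = r"{\rtf1\ansi\ansicpg1252"

def rtf_escape(text, codepage="cp1252"):
    """Escape text for an RTF body, pairing each \\uN with a code-page fallback byte."""
    out = []
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError("characters outside the BMP are not handled in this sketch")
        if ch in "\\{}":
            out.append("\\" + ch)          # RTF syntax characters must be escaped
        elif code < 0x80:
            out.append(ch)                 # plain ASCII passes through
        else:
            try:
                fallback = ch.encode(codepage)
            except UnicodeEncodeError:
                fallback = b"?"            # no cp1252 correspondence: keep '?'
            if len(fallback) != 1:
                fallback = b"?"            # only single-byte fallbacks are emitted
            signed = code - 65536 if code > 32767 else code
            out.append("\\u%d\\'%02x" % (signed, fallback[0]))
    return "".join(out)
```

With the sample string from the description, š and â come out as \u353\'9a and \u226\'e2, while a character with no cp1252 equivalent still gets \'3f.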

Since the Unicode code points are also being provided, the RTF will still interchange successfully among all Unicode-supporting RTF producers and consumers (OpenOffice-lineage and Microsoft Office products, for example).

INTEROPERABILITY CONSIDERATIONS

The only possible downside is for RTF consumers that use the code-page-relative encodings instead of the Unicode information and do not recognize the \ansicpg1252 control or simply fail to accept the full range of cp1252 printable codes.  This is probably low risk, especially with regard to the de facto prevalence of cp1252 in the context of RTF (although Apple users might disagree).

I think that would allow your particular usage.  I have a separate question about the use of OpenOffice as intended, though.
Comment 10 orcmid 2012-04-13 18:02:56 UTC
(In reply to comment #7)
OFF TOPIC:

Your immediate problem is resolved by upgrading the RTF export to use all of cp1252 as the alternative code set for non-Unicode consumers, as I discuss in Comment #9.

That is not to say that this will happen, nor when.  It requires someone to work over the RTF export code and implement more complicated support for the cp1252 mapping.  (I suspect that there might be no mapping at all at the moment, with the code point for ASCII '?' used every time a Unicode escape is produced.)

I have a separate question.  When OpenOffice is used, how are the *desired* Asian characters entered by operators?  Do you have keyboards for this?  Do you select a particular font that causes the tunneled Asian characters to appear in displays and when printing?

Finally, are you using a localized version of OpenOffice that supports that Asian language in your user interface?

It would be good if OpenOffice were set up to use the Asian characters as they exist in Unicode, which would provide full interoperability across Unicode uses, including in all of the ways I am asking about.  Unfortunately, that would then conflict with the RTF case since Unicode codes that map to cp1252 would no longer be used.  Is there any plan for working out of this bind and beginning to use Unicode correctly?
Comment 11 Chaitanya 2012-04-14 19:07:15 UTC
Hi orcmid,

1. When Open Office is used, how are the *desired* Asian characters entered by operators?  Do you have keyboards for this? -  
We are using font driver software which uses Low Level Keyboard hooks. This software is widely used. When we type a character say 'k', the software captures the keystroke and modifies the characters to 'šb' and send them to target application that is rendering the keystroke/text.

2. Do you select a particular font that causes the tunneled Asian characters to appear in displays and when printing? - Yes

3. Finally, are you using a localized version of Open Office that supports that
Asian language in your user interface? - No, we use default version with English interface.

4. Is there any plan for working out of this bind and beginning to use Unicode correctly? - I am not sure about it.

5. One interesting point: if we replace fcharset128 with fcharset0 in the font table of the RTF file using Notepad, i.e.
change the line from
{\f4\fnil\fprq2\fcharset128 APS-DV-Stardust;}
to 
{\f4\fnil\fprq2\fcharset0 APS-DV-Stardust;}
everything works well in the target application, i.e. Adobe PageMaker 7.

Now the question is: how is fcharset128 or fcharset0 detected for a font? (Kindly note that other applications such as WordPad and TextMaker detect the same font, APS-DV-STARDUST, as fcharset0.)
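
The manual Notepad edit from point 5 could be scripted along these lines. This is a sketch under assumptions: the function name set_font_charset is hypothetical, and the pattern only covers simple font table entries of the form shown above, not every layout an RTF writer may produce.

```python
import re

def set_font_charset(rtf_text, font_name, new_charset):
    """Rewrite the \\fcharsetN value of the font table entry for font_name."""
    pattern = re.compile(
        r"(\{\\f\d+[^{}]*?\\fcharset)\d+( " + re.escape(font_name) + r";\})"
    )
    # A callable replacement avoids ambiguity between group references and digits
    return pattern.sub(lambda m: m.group(1) + str(new_charset) + m.group(2), rtf_text)
```

Calling set_font_charset(rtf, "APS-DV-Stardust", 0) changes only that font's declaration and leaves other entries and the document text untouched.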

Thanks
Comment 12 Chaitanya 2012-04-14 19:24:48 UTC
Created attachment 77442 [details]
ANSI Font that is detected as fcharset128 while saving in RTF

This is regarding bug id 119219 and comment #11 Point 5
Comment 13 orcmid 2012-04-14 19:32:27 UTC
(In reply to comment #11)
[ ... ]
> 5. One interesting point I found that if we replace fcharset128 to fcharset0 in
> font table of RTF file using notepad. i.e. 
> Change line from 
> {\f4\fnil\fprq2\fcharset128 APS-DV-Stardust;}
> to 
> {\f4\fnil\fprq2\fcharset0 APS-DV-Stardust;}
> everything works well in target application i.e. Adobe Pagemaker7.
> Now question is how does fcharset128/fcharset0 is detected for a font ? (Kindly
> note other applications like wordpad, TextMaker detects same font
> APS-DV-STARDUST as fcharset0)
> Thanks

What happens when you use the OpenOffice.org Menu File | Export as PDF ...
option to make a PDF directly?

Does that provide the correct rendering (assuming you embed the font in the
PDF)?

Will this solve your problem for now?

Your heavy reliance on tunneling is very fragile, especially for pure-Unicode
software products such as OpenOffice.  Is there anything in your preferred
character set that can't be expressed in Unicode?
Comment 14 Chaitanya 2012-04-15 09:08:57 UTC
Hi orcmid,

1. PDF renders quite nicely after exporting but using PDF in DTP application is not supported.
2. Adobe Pagemaker does not support Asian/indic Unicode, so we could not move to Unicode.
3. Currently as I see the proper fcharset setting will be the quickest fix for issue.
4. Secondly, providing cp1252-based hex values will greatly add value to the usage of RTF in purely hex-based applications. (This task is no longer required as far as my problem is concerned.)

Thanks
Comment 15 orcmid 2012-04-15 14:41:18 UTC
(In reply to comment #14)
[ ... ]
> 3. Currently as I see the proper fcharset setting will be the quickest fix for
> issue.
[ ... ]

In the RTF, the \fcharset128 indicates that the font is a shift-JIS font.  The \fcharset0 indicates that the font is for cp1252.
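
To see which charset each font in an RTF file has been given, a font table can be listed as in the sketch below. The names FCHARSET_TO_CODEPAGE and list_fonts are hypothetical; the table covers only a few commonly documented \fcharset values and uses Python codec names for the corresponding Windows code pages.

```python
import re

# Partial lookup of \fcharsetN values (assumed subset, not exhaustive)
FCHARSET_TO_CODEPAGE = {
    0: "cp1252",    # ANSI
    128: "cp932",   # Shift-JIS
    129: "cp949",   # Hangul
    134: "cp936",   # GB2312
    136: "cp950",   # Big5
    204: "cp1251",  # Russian
    238: "cp1250",  # Eastern European
}

FONT_ENTRY = re.compile(r"\{\\f(\d+)[^{}]*?\\fcharset(\d+) ([^;{}]+);\}")

def list_fonts(rtf_text):
    """Return (font number, name, fcharset, code page or None) for each simple entry."""
    return [
        (int(num), name, int(cs), FCHARSET_TO_CODEPAGE.get(int(cs)))
        for num, cs, name in FONT_ENTRY.findall(rtf_text)
    ]
```

Applied to the entry quoted in comment 11, this reports font 4, APS-DV-Stardust, with fcharset 128 and code page cp932.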

The interesting question is how is it that OO.o concludes that the character mapping is for shift-JIS?  Presumably the multi-byte determination is from the TTF file itself.

Can you attach a small .ODT file that uses the aps-dv-stardust-normal.ttf font?  I would like to see how the font is listed in the ODF document itself.  There only needs to be a small line of text that includes some of the Asian characters that are mapped from šbkeâ}sKeve or similar sequences.

Also, if you can attach a screen capture for what is displayed in OpenOffice, that would be very helpful.
Comment 16 Chaitanya 2012-04-15 15:40:49 UTC
Created attachment 77446 [details]
Sample Text in APS-DV-STARDUST-NORMAL (ODF format)

As response to comment#15
Comment 17 Chaitanya 2012-04-15 15:42:12 UTC
Created attachment 77447 [details]
Screenshot of Sample Text in APS-DV-STARDUST-NORMAL (ODF format)

As response to comment#15
Comment 18 orcmid 2012-04-15 21:25:31 UTC
Created attachment 77449 [details]
The Devanagari-rendered sample.odt RTF from OpenOffice.org 3.3.0

(In reply to comment #17)
> Created attachment 77447 [details]
> Screenshot of Sample Text in APS-DV-STARDUST-NORMAL (ODF format)
> As response to comment#15

Thank you for the screen shot and the sample document.  I have confirmed how the Devanagari is rendered successfully with APS-DV-Stardust.  This makes it very clear.

I have also confirmed that using \fcharset128 instead of \fcharset0 for APS-DV-Stardust may have been a bug.

However, when I converted your sample.odt to RTF using any OpenOffice-lineage release starting with OpenOffice.org 3.3.0, I do see \fcharset0 when I have the font installed.  ALSO, when I have the font installed, I see what appears to be the correct hexadecimal for character codes from \'80 to \'fe.

This upload is the RTF that was produced from your sample.odt using OpenOffice.org 3.3.0.  

Please confirm whether this form of the RTF is correct according to your tunneling case.
Comment 19 orcmid 2012-04-16 00:13:05 UTC
(In reply to comment #18)
> I have also confirmed that using \fcharset128 instead of \fcharset0 for
> APS-DV-Stardust may have been a bug.
> However, when I converted your sample.odt to RTF using any OpenOffice-lineage
> release starting with OpenOffice.org 3.3.0, I do see \fcharset0 when I have the
> font installed.  ALSO, when I have the font installed, I see what appears to be
> the correct hexadecimal for character codes from \'80 to \'fe.

I repeated your original exercise with šbkeâ}sKeve and I can still end up with \fcharset128 and failure to treat the font as mapping from cp1252, even when I select APS-DV-Stardust font and I see the Devanagari presented before I Save As ... RTF.  


It is difficult to know what the pattern is.  I must stop investigating for now.
Comment 20 Chaitanya 2012-04-16 19:12:29 UTC
Hi orcmid,

Thanks for your cooperation. Could you please guide us on how to find the charset of a font? How does OpenOffice detect fcharset?

Thanks
Comment 21 orcmid 2012-04-16 19:39:17 UTC
(In reply to comment #20)
> Hi orcmid,
> Thanks for your cooperation. Could you please guide us how to find charset of a
> font ? How does Open Office detects fcharset ?
> Thanks

The character code that a font is based on is determined by TTF information.  That appears to be how the same font can be used with software that works with single-byte code-page characters and other software that has a different "native" character set, such as a Unicode representation.

It is clear that \fcharset0 is determined correctly some of the time during the export as RTF and it is not determined correctly at other times.  The \fcharset128 seems to be inappropriate at all times unless there is something like the Indic extension that can happen with the font you are using.  I doubt that.

Since this is specific to RTF, the first place to look is in the code of the RTF export.  There may also be interactions with the different ways characters are introduced and the APS-DV-Stardust font applied to them.  But somehow, there is a determination that has either \fcharset0 or \fcharset128 used at the front of the file and then the occurrences of characters with that font are converted appropriately (as \' cp1252 codes) or not (as \u Unicodes with "?" single-byte representations).

I am not familiar with the RTF export code and cannot offer any further information.