Discussion:
How does boost.serialization do with BOM in text/xml files
Tan, Tom (Shanghai)
2008-09-04 08:14:11 UTC
I am thinking of using boost.serialization to replace
CMarkup (http://www.firstobject.com/) for handling my XML config files. While
playing with the examples that come with Boost, I found that boost.serialization

- does not handle BOM of UTF-8 files.
- does not ignore <!-- ... --> comments.

Is there any workaround to this?
Thanks.
Robert Ramey
2008-09-04 16:21:47 UTC
What is BOM?
Markus Schöpflin
2008-09-04 15:34:34 UTC
Post by Robert Ramey
what is BOM?
Probably "Byte Order Mark", see http://en.wikipedia.org/wiki/Byte-order_mark

Markus
Tan, Tom (Shanghai)
2008-09-05 01:27:46 UTC
Post by Robert Ramey
what is BOM?
Probably "Byte Order Mark", see
http://en.wikipedia.org/wiki/Byte-order_mark

Yes, that's what I meant.

I was testing demo_xml_load.cpp and demo_xml_save.cpp from the
boost.serialization examples.
After simply opening the demo_save.xml produced by demo_xml_save.exe in XML
Copy Editor (http://xml-copy-editor.sourceforge.net/) and saving it back,
demo_xml_load.exe would crash. I compared the two files with WinMerge,
which reported them as identical.

Studying the hex view, I later found the cause: the 3-byte UTF-8
BOM (EF BB BF) had been inserted at the beginning of the file. It does not
change the data, and in many cases it is ignored by text editors.

I think Boost.serialization should also handle this for all text
files, including XML.

Tom
Robert Ramey
2008-09-05 06:46:03 UTC
This is news to me.

The wide character text/xml archives use UTF-8. They do this
by creating a stream with the utf8_codecvt_facet. I used
this facet, it worked great, and I moved on. So you're way
ahead of me on this.

This would probably be easy to address in the xml_iarchive code
or perhaps the xml_grammar - but, as I said, I don't know
anything about it.

Robert Ramey
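
For the comment issue raised at the start of the thread, one option that
avoids touching xml_iarchive or the grammar is to strip comments from the
text before loading. This is only a rough sketch: strip_xml_comments is a
hypothetical helper, not a Boost function, and it assumes the file contains
no "<!--" sequences inside CDATA sections or attribute values.

```cpp
#include <string>

// Hypothetical helper (not part of Boost.Serialization).
// Returns a copy of `xml` with every <!-- ... --> span removed.
// An unterminated comment is left in place so that malformed input is
// still visible to whatever parser reads the result.
inline std::string strip_xml_comments(const std::string& xml) {
    std::string out;
    out.reserve(xml.size());
    std::string::size_type pos = 0;
    for (;;) {
        const std::string::size_type open = xml.find("<!--", pos);
        if (open == std::string::npos) {
            out.append(xml, pos, std::string::npos);  // no more comments
            break;
        }
        out.append(xml, pos, open - pos);             // text before the comment
        const std::string::size_type close = xml.find("-->", open + 4);
        if (close == std::string::npos) {
            out.append(xml, open, std::string::npos); // unterminated: keep as-is
            break;
        }
        pos = close + 3;                              // resume after "-->"
    }
    return out;
}
```

The stripped text could then be wrapped in a std::istringstream and handed
to the archive. Whether the result loads cleanly still depends on how the
archive treats the whitespace left behind, which I have not verified.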