[thelist] A Question about encoding in .net

VOLKAN ÖZÇELİK volkan.ozcelik at gmail.com
Thu Sep 22 15:51:16 CDT 2005


Hi List,

I had posted a tooltip on the issue as well; here is an expanded version:

Lately I have been struggling a lot with .NET and encoding issues. Here
are some of my findings:

Environment:
Server: uses cp1252 (Latin) encoding (as a result, the ODBC DataReader
and ODBC Command objects use cp1252 encoding as well).
Request and response encodings are cp1254, however.

(Changing the server encoding to cp1254 is out of the question since
the server is not dedicated to me.)

I'll write semi-pseudo logic so that people who do not work with .NET
can share their thoughts as well.

I'll try to explain every step I take. Forgive me if I bore you.

Here is what I do:

1. Insert a CP1254-encoded String directly into a database table using
phpMyAdmin's GUI (so that I am sure it is stored as a CP1254-encoded
String in the DB).

2. Read the column from that table into a DataReader.

Although the data in the DB is CP1254-encoded, the DataReader returns
an (improperly) CP1252-encoded String.

Here comes the fun part:

When I do

(1) ByteArray[] = TheString.GetBytes( Using CP1252 Encoding )

I get a byte array that is correctly CP1254-encoded, without any
errors, garbage characters, question marks etc.

My first question: is this always the case?
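
For readers outside .NET, relation (1) can be sketched in Python. This is only an illustration: the sample letters and variable names are made up, and Python's cp1252/cp1254 tables are assumed to behave like the ones the server and the ODBC objects use.

```python
# Illustration only: sample text and names are hypothetical, and Python's
# code page tables are assumed to match the ones used on the server.
original = "şğıİ"                       # Turkish letters, as typed into the DB

stored = original.encode("cp1254")      # bytes as they sit in the table
misread = stored.decode("cp1252")       # the string the DataReader hands back
recovered = misread.encode("cp1252")    # relation (1): GetBytes using CP1252

print(misread)                          # looks wrong on screen
print(recovered == stored)              # the bytes themselves are untouched
print(recovered.decode("cp1254") == original)
```

The misread string looks like garbage, but each byte was mapped to exactly one character and back, so nothing is lost for these letters.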

That is;

If I create a String S by decoding an E2-encoded byte array B1 with
encoding E1, and then encode S with E1 into a byte array B2,
will B1 and B2 always be equal for all possible string values and all
possible encodings?

(I re-wrote the sentence above several times to make it
understandable, hope this is the clearest mathematical form)

Or am I just lucky because CP1254 and CP1252 are quite similar
encodings and somehow their cross-transformation manages to stay
reversible?
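
One way to probe the question is to test every single byte value under a few encodings. The sketch below does that in Python; `lossy_bytes` is a hypothetical helper name, and the codec tables are Python's, which may not fill the unassigned positions the same way .NET does.

```python
# Hypothetical helper: list the single byte values that do NOT survive
# decode-then-encode under a given codec (Python codec names assumed).
def lossy_bytes(codec):
    bad = []
    for value in range(256):
        raw = bytes([value])
        try:
            if raw.decode(codec).encode(codec) != raw:
                bad.append(value)
        except UnicodeError:
            # The codec has no character for this byte, or rejects it.
            bad.append(value)
    return bad

for codec in ("latin-1", "cp1252", "cp1254", "utf-8"):
    bad = lossy_bytes(codec)
    print(codec, len(bad), [hex(b) for b in bad[:8]])
```

In this sketch latin-1 round-trips every byte, while cp1252 and cp1254 each have a few unassigned positions, and a multi-byte encoding such as UTF-8 rejects any lone byte above 0x7F. So the round trip seems to hold only where the encoding assigns a distinct character to every byte in the data; the Turkish letters happen to fall on assigned positions in both code pages.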

* * *

Let us have a look at the other side of the coin:

I have a CP1254-encoded String S that I read from the page's response.

I convert it into a CP1254-encoded byte array B1 using S.GetBytes(
using CP1254 encoding)

If I assume that relation (1) is reversible, decoding B1 using
CP1252 encoding will create a CP1252-encoded (improper) String which
is identical to the one our DataReader generated above.

And this happens to be the case, because when I insert the improper
String (which displays incorrectly in the response output) using the ODBC
Command object, the correct values are magically inserted into the DB with
proper encoding, with no mistyped characters whatsoever.
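
The two directions can be written as a pair of helpers. Again this is a Python sketch under assumptions: the function names and the test string are hypothetical, and it only models the string conversions, not what the ODBC objects actually do internally.

```python
# Hypothetical helpers mirroring the two directions described above.
def to_improper(proper):
    """Proper text -> the CP1252-looking string passed to the command object."""
    return proper.encode("cp1254").decode("cp1252")

def to_proper(improper):
    """String returned by the DataReader -> proper text."""
    return improper.encode("cp1252").decode("cp1254")

turkish = "çÇğĞıİöÖşŞüÜ"
improper = to_improper(turkish)
print(improper)                         # displays incorrectly
print(to_proper(improper) == turkish)   # but converts back cleanly
```

For this set of Turkish letters the two helpers are inverses of each other, which matches what I observe with the insert.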

This makes me deduce that the DataReader and the Command object operate
at the byte level.

So my second question is: does the DataReader read data from the DB, and
does the OdbcCommand write data to the DB, after converting the String to
an array of bytes? Do they operate at the byte level?

However, I am mostly interested in a reasonable answer to the first question.
To state it again:

If a string S is created out of an arbitrary byte array B using
encoding E, can we always retrieve B when we use S.GetBytes( Encoding E
), no matter what B or E is?

It's sort of a confusing issue. I hope I have made myself clear.

Thanks a bunch in advance,
Volkan.

