Last update July 12, 2009

Non Unicode Text In D



“Working with non-Unicode text files (e.g. in latin1-encoding)”

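D strings are Unicode (char, wchar and dchar hold UTF-8, UTF-16 and UTF-32 code units). Most other common encodings are single-byte, e.g. Latin-1, Latin-9 (ISO 8859-15) or windows-1252, depending on which operating system you are using.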

If you still have text files in such encodings, there are several possibilities:

  1. Convert all text files to UTF-8.
  2. Keep the text files in their current encoding, but convert the characters to Unicode when reading a file and back to the original encoding when writing it (Java does this, for example); see below.
  3. Work with the current encoding without converting anything. You can store such strings as char[] or ubyte[] and process them as usual. Take care not to pass these strings to standard library functions that were written for UTF-8 data; treat them as binary data when reading from or writing to files or the console (see next section).

“How to print non-utf8 strings (e.g. in latin1-encoding)”

You cannot use writefln for this, because it expects UTF-8 and fails with an "invalid utf8-sequence" error. You have to use a lower-level function such as printf.

See HowTo/printf.
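As a rough illustration of the lower-level route, the following sketch hands the raw bytes to the C runtime's printf. The helper name print_latin1 is hypothetical, and the import path assumes current D (D2), where the C bindings live in core.stdc.stdio; older D1 Phobos provides them in std.c.stdio instead.

```d
import core.stdc.stdio : printf;

// Hypothetical helper: print raw Latin-1 bytes without UTF-8 validation.
// Returns the number of bytes printf reports as written.
// Note: "%.*s" prints at most s.length bytes and stops early at a zero byte.
int print_latin1(ubyte[] s) {
  return printf("%.*s", cast(int) s.length, cast(char*) s.ptr);
}
```

For example, print_latin1([0x47, 0x72, 0xFC, 0xDF, 0x65]) sends the five Latin-1 bytes of "Grüße" to standard output unchanged. Whether they are displayed correctly depends on the code page the console is set to.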

Latin-1 to Utf32 conversion (for reading from file)

This function converts Latin-1 to Unicode. The conversion is easy because the 256 Latin-1 byte values map directly to the first 256 Unicode code points.

dchar[] latin1_to_unicode(ubyte[] latin1) {
  dchar[] s = new dchar[](latin1.length);
  for(size_t i=0; i<latin1.length; i++) {
    s[i] = latin1[i];
  }
  return s;
}

import std.file; // needed for std.file.read

auto s = latin1_to_unicode(cast(ubyte[]) std.file.read("some_latin-1_file.txt"));

The resulting dchars allow for easy character manipulation in the program (one dchar = one character).

(adapted from http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=54530)

Utf32 to Latin-1 conversion (writing to a file)

similar, to be done
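A minimal sketch of the reverse direction might look like the following. The function name unicode_to_latin1 and the choice to replace characters that have no Latin-1 equivalent with '?' are assumptions made for illustration, not part of the original page; a real program might prefer to report an error instead.

```d
// Sketch: convert UTF-32 text back to Latin-1 bytes.
// Code points up to 0xFF are identical in Latin-1 and Unicode.
// Anything above 0xFF cannot be represented and is replaced by '?'
// (an assumed policy for this example).
ubyte[] unicode_to_latin1(dchar[] s) {
  ubyte[] latin1 = new ubyte[](s.length);
  for (size_t i = 0; i < s.length; i++) {
    latin1[i] = (s[i] <= 0xFF) ? cast(ubyte) s[i] : cast(ubyte) '?';
  }
  return latin1;
}
```

The resulting ubyte[] can be passed directly to std.file.write, which writes the bytes without interpreting them.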

WinAnsi to/from Utf8 conversion (reading/writing textfiles on Windows)

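import std.string;          // toStringz
import std.c.string;        // strlen
import std.windows.charset; // fromMBSz, toMBSz

// Code page 0 selects the current Windows ANSI code page.
string winansi_to_utf8(ubyte[] s) {
  return std.windows.charset.fromMBSz(toStringz(cast(char[])s), 0);
}

ubyte[] utf8_to_winansi(string s) {
  char* p = std.windows.charset.toMBSz(s, 0);
  size_t len = std.c.string.strlen(p); // length in bytes, without the terminating zero
  return cast(ubyte[]) p[0..len].dup;  // copy the bytes and return them as ubyte[]
}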

Example:

bool IS_WINANSI = true; // set to false if the file is already UTF-8
auto full_file = cast(ubyte[]) std.file.read(infile);
string s = (IS_WINANSI ? winansi_to_utf8(full_file) : cast(char[]) full_file);

string[] lines = std.string.splitlines(s);
foreach(inout line; lines) { // utf 8 here
  line=line~"!"; // do something
}

s = std.string.join(lines, "\r\n"); //add windows line endings (CRLF)
ubyte[] bin_out = (IS_WINANSI ? utf8_to_winansi(s) : cast(ubyte[])s);         
std.file.write(outfile, cast(void[])bin_out);

Latin1, WinAnsi ...

Latin-1 (ISO 8859-1) is similar to several other encodings. Latin-9 (ISO 8859-15), for example, is almost identical and mainly adds the euro sign (€). Windows uses its own code pages such as windows-1252, which differ from Latin-1 in the range 0x80 to 0x9F. Use Windows API calls to convert these code pages to Unicode (see above).
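To illustrate how small the difference is, here is a sketch of a Latin-9 variant of the Latin-1 reader. The function name latin9_to_unicode is hypothetical; the eight remapped positions are the ones in which ISO 8859-15 differs from ISO 8859-1, and every other byte is copied unchanged.

```d
// Sketch: convert Latin-9 (ISO 8859-15) bytes to UTF-32.
// Only eight byte values differ from Latin-1; all others map to the
// code point with the same numeric value.
dchar[] latin9_to_unicode(ubyte[] latin9) {
  dchar[] s = new dchar[](latin9.length);
  for (size_t i = 0; i < latin9.length; i++) {
    switch (latin9[i]) {
      case 0xA4: s[i] = '\u20AC'; break; // euro sign
      case 0xA6: s[i] = '\u0160'; break; // S with caron
      case 0xA8: s[i] = '\u0161'; break; // s with caron
      case 0xB4: s[i] = '\u017D'; break; // Z with caron
      case 0xB8: s[i] = '\u017E'; break; // z with caron
      case 0xBC: s[i] = '\u0152'; break; // ligature OE
      case 0xBD: s[i] = '\u0153'; break; // ligature oe
      case 0xBE: s[i] = '\u0178'; break; // Y with diaeresis
      default:   s[i] = latin9[i]; break; // same as Latin-1
    }
  }
  return s;
}
```

windows-1252 could be handled with a similar table for the bytes 0x80 to 0x9F, although on Windows the API-based conversion shown earlier is the simpler route.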
