| 1 | <?xml version="1.0" encoding="iso-8859-1"?> | 
|---|
| 2 | <!DOCTYPE chapter PUBLIC "-//Samba-Team//DTD DocBook V4.2-Based Variant V1.0//EN" "http://www.samba.org/samba/DTD/samba-doc"> | 
|---|
| 3 | <chapter id="unicode"> | 
|---|
| 4 | <chapterinfo> | 
|---|
| 5 | &author.jelmer; | 
|---|
| 6 | &author.jht; | 
|---|
| 7 | <author> | 
|---|
| 8 | <firstname>TAKAHASHI</firstname><surname>Motonobu</surname> | 
|---|
| 9 | <affiliation> | 
|---|
| 10 | <address><email>monyo@home.monyo.com</email></address> | 
|---|
| 11 | </affiliation> | 
|---|
| 12 | <contrib>Japanese character support</contrib> | 
|---|
| 13 | </author> | 
|---|
| 14 | <pubdate>25 March 2003</pubdate> | 
|---|
| 15 | </chapterinfo> | 
|---|
| 16 |  | 
|---|
| 17 | <title>Unicode/Charsets</title> | 
|---|
| 18 |  | 
|---|
| 19 | <sect1> | 
|---|
| 20 | <title>Features and Benefits</title> | 
|---|
| 21 |  | 
|---|
| 22 | <para> | 
|---|
| 23 | <indexterm><primary>use computer anywhere</primary></indexterm> | 
|---|
| 24 | Every industry eventually matures. One sign of that maturation is | 
|---|
| 25 | the effort made over the past decade to make it possible for anyone | 
|---|
| 26 | anywhere to use a computer. It has not always been that way. In fact, not so long | 
|---|
| 27 | ago, it was common for software to be written for exclusive use in the country of | 
|---|
| 28 | origin. | 
|---|
| 29 | </para> | 
|---|
| 30 |  | 
|---|
| 31 | <para> | 
|---|
| 32 | Of all the effort that has been brought to bear on providing native | 
|---|
| 33 | language support for all computer users, the work of the | 
|---|
| 34 | <ulink url="http://www.openi18n.org/">Openi18n organization</ulink> | 
|---|
| 35 | deserves special mention. | 
|---|
| 36 | </para> | 
|---|
| 37 |  | 
|---|
| 38 | <para> | 
|---|
| 39 | <indexterm><primary>codepages</primary></indexterm> | 
|---|
| 40 | Samba-2.x supported a single locale through a mechanism called | 
|---|
| 41 | <emphasis>codepages</emphasis>. Samba-3 is destined to become a truly transglobal | 
|---|
| 42 | file- and printer-sharing platform. | 
|---|
| 43 | </para> | 
|---|
| 44 |  | 
|---|
| 45 | </sect1> | 
|---|
| 46 |  | 
|---|
| 47 | <sect1> | 
|---|
| 48 | <title>What Are Charsets and Unicode?</title> | 
|---|
| 49 |  | 
|---|
| 50 | <para> | 
|---|
| 51 | <indexterm><primary>character set</primary></indexterm> | 
|---|
| 52 | Computers communicate in numbers. In texts, each number is | 
|---|
| 53 | translated to a corresponding letter. The meaning that will be assigned | 
|---|
| 54 | to a certain number depends on the <emphasis>character set (charset) | 
|---|
| 55 | </emphasis> that is used. | 
|---|
| 56 | </para> | 
|---|
| 57 |  | 
|---|
| 58 | <para> | 
|---|
| 59 | <indexterm><primary>charset</primary></indexterm> | 
|---|
| 60 | <indexterm><primary>ASCII</primary></indexterm> | 
|---|
| 61 | A charset can be seen as a table that is used to translate numbers to | 
|---|
| 62 | letters. Not all computers use the same charset (there are charsets | 
|---|
| 63 | with German umlauts, Japanese characters, and so on). The American Standard Code | 
|---|
| 64 | for Information Interchange (ASCII) has long been the baseline character encoding | 
|---|
| 65 | used by computers. ASCII itself defines 128 characters; the single-byte charsets that | 
|---|
| 66 | extend it hold at most 256 characters, and each character takes exactly one byte. | 
|---|
| 67 | </para> | 
|---|
| 68 |  | 
|---|
| 69 | <para> | 
|---|
| 70 | <indexterm><primary>multibyte charsets</primary></indexterm> | 
|---|
| 71 | <indexterm><primary>extended characters</primary></indexterm> | 
|---|
| 72 | There are also charsets that support extended characters, but those need at least | 
|---|
| 73 | twice as much storage space as does ASCII encoding. Such charsets can contain | 
|---|
| 74 | <command>256 * 256 = 65536</command> characters, far more than any single-byte | 
|---|
| 75 | charset can hold. They are called multibyte charsets because they use | 
|---|
| 76 | more than one byte to store one character. | 
|---|
| 77 | </para> | 
|---|
| 78 |  | 
|---|
| 79 | <para> | 
|---|
| 80 | <indexterm><primary>unicode</primary></indexterm> | 
|---|
| 81 | One standardized multibyte charset encoding scheme is known as | 
|---|
| 82 | <ulink url="http://www.unicode.org/">Unicode</ulink>.  A big advantage of using a | 
|---|
| 83 | multibyte charset is that you only need one. There is no need to make sure two | 
|---|
| 84 | computers use the same charset when they are communicating. | 
|---|
| 85 | </para> | 
|---|
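<para>
As a rough illustration of the storage cost discussed above, the following Python sketch counts the bytes
needed for three sample characters in several charsets. The helper name <literal>encoded_length</literal>
and the sample characters are chosen here for illustration only, and the codec names are those of the
Python standard library, which are not necessarily the names that Samba or iconv use.
</para>
<programlisting>
```python
# Sample characters: ASCII "A", a-umlaut (U+00E4), and a kanji (U+5171).
SAMPLES = ["A", "\xe4", "\u5171"]
# One 7-bit charset, one single-byte codepage, and two Unicode encodings.
CHARSETS = ["ascii", "cp850", "utf-16-le", "utf-8"]


def encoded_length(text, charset):
    """Return the byte length of text in charset, or None if it cannot be encoded."""
    try:
        return len(text.encode(charset))
    except UnicodeEncodeError:
        return None


for text in SAMPLES:
    for charset in CHARSETS:
        # ascii() keeps the printed output plain ASCII on any terminal.
        print(ascii(text), charset, encoded_length(text, charset))
```
</programlisting>
<para>
A result of <literal>None</literal> means that the charset has no representation for the character at all,
which is the situation a single-byte codepage is in for most of the world's scripts.
</para>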
| 86 |  | 
|---|
| 87 | <para> | 
|---|
| 88 | <indexterm><primary>single-byte charsets</primary></indexterm> | 
|---|
| 89 | <indexterm><primary>SMB/CIFS</primary></indexterm> | 
|---|
| 90 | <indexterm><primary>negotiating the charset</primary></indexterm> | 
|---|
| 91 | Old Windows clients use single-byte charsets, which Microsoft calls | 
|---|
| 92 | <parameter>codepages</parameter>. However, there is no support for | 
|---|
| 93 | negotiating the charset to be used in the SMB/CIFS protocol. Thus, you | 
|---|
| 94 | have to make sure you are using the same charset when talking to an older client. | 
|---|
| 95 | Newer clients (Windows NT, 200x, XP) talk Unicode over the wire. | 
|---|
| 96 | </para> | 
|---|
| 97 | </sect1> | 
|---|
| 98 |  | 
|---|
| 99 | <sect1> | 
|---|
| 100 | <title>Samba and Charsets</title> | 
|---|
| 101 |  | 
|---|
| 102 | <para> | 
|---|
| 103 | <indexterm><primary>Unicode</primary></indexterm> | 
|---|
| 104 | <indexterm><primary>character sets</primary></indexterm> | 
|---|
| 105 | As of Samba-3, Samba can (and will) talk Unicode over the wire. Internally, | 
|---|
| 106 | Samba knows of three kinds of character sets: | 
|---|
| 107 | </para> | 
|---|
| 108 |  | 
|---|
| 109 | <variablelist> | 
|---|
| 110 | <varlistentry> | 
|---|
| 111 | <term><smbconfoption name="unix charset"/></term> | 
|---|
| 112 | <listitem><para> | 
|---|
| 113 | <indexterm><primary>UTF-8</primary></indexterm> | 
|---|
| 114 | <indexterm><primary>CP850</primary></indexterm> | 
|---|
| 115 | This is the charset used internally by your operating system. | 
|---|
| 116 | The default is <constant>UTF-8</constant>, which is fine for most | 
|---|
| 117 | systems and covers all characters in all languages. The default | 
|---|
| 118 | in previous Samba releases was to save filenames in the encoding of the | 
|---|
| 119 | clients &smbmdash; for example, CP850 for Western European countries. | 
|---|
| 120 | </para></listitem> | 
|---|
| 121 | </varlistentry> | 
|---|
| 122 |  | 
|---|
| 123 | <varlistentry> | 
|---|
| 124 | <term><smbconfoption name="display charset"/></term> | 
|---|
| 125 | <listitem><para>This is the charset Samba uses to print messages | 
|---|
| 126 | on your screen. It should generally be the same as the <parameter>unix charset</parameter>. | 
|---|
| 127 | </para></listitem> | 
|---|
| 128 | </varlistentry> | 
|---|
| 129 |  | 
|---|
| 130 | <varlistentry> | 
|---|
| 131 | <term><smbconfoption name="dos charset"/></term> | 
|---|
| 132 | <listitem><para>This is the charset Samba uses when communicating with | 
|---|
| 133 | DOS and Windows 9x/Me clients. It will talk Unicode to all newer clients. | 
|---|
| 134 | The default depends on the charsets you have installed on your system. | 
|---|
| 135 | Run <command>testparm -v | grep "dos charset"</command> to see | 
|---|
| 136 | what the default is on your system. | 
|---|
| 137 | </para></listitem> | 
|---|
| 138 | </varlistentry> | 
|---|
| 139 | </variablelist> | 
|---|
| 140 |  | 
|---|
| 141 | </sect1> | 
|---|
| 142 |  | 
|---|
| 143 | <sect1> | 
|---|
| 144 | <title>Conversion from Old Names</title> | 
|---|
| 145 |  | 
|---|
| 146 | <para> | 
|---|
| 147 | <indexterm><primary>charset conversion</primary></indexterm> | 
|---|
| 148 | Because previous Samba versions did not do any charset conversion, | 
|---|
| 149 | characters in existing filenames are usually encoded in the local charset used by | 
|---|
| 150 | the DOS/Windows clients rather than in the UNIX charset. | 
|---|
| 151 | </para> | 
|---|
| 152 |  | 
|---|
| 153 | <para>Bjoern Jacke has written a utility named <ulink url="http://j3e.de/linux/convmv/">convmv</ulink> | 
|---|
| 154 | that can convert whole directory structures to different charsets with one single command. | 
|---|
| 155 | </para> | 
|---|
| 156 |  | 
|---|
| 157 | </sect1> | 
|---|
| 158 |  | 
|---|
| 159 | <sect1> | 
|---|
| 160 | <title>Japanese Charsets</title> | 
|---|
| 161 |  | 
|---|
| 162 | <para> | 
|---|
| 163 | Setting up Japanese charsets is quite difficult. This is mainly because: | 
|---|
| 164 | </para> | 
|---|
| 165 |  | 
|---|
| 166 | <itemizedlist> | 
|---|
| 167 | <listitem><para> | 
|---|
| 168 | <indexterm><primary>JIS X 0208</primary></indexterm> | 
|---|
| 169 | The Windows character set is extended from the original legacy Japanese | 
|---|
| 170 | standard (JIS X 0208) and is not standardized. This means that the strictly | 
|---|
| 171 | standardized implementation cannot support the full Windows character set. | 
|---|
| 172 | </para></listitem> | 
|---|
| 173 |  | 
|---|
| 174 | <listitem><para> | 
|---|
| 175 | <indexterm><primary>Shift_JIS</primary></indexterm> | 
|---|
| 176 | <indexterm><primary>EUC-JP</primary></indexterm> | 
|---|
| 177 | <indexterm><primary>CAP</primary></indexterm> | 
|---|
| 178 | <indexterm><primary>HEX</primary></indexterm> | 
|---|
| 179 | <indexterm><primary>Japanese</primary></indexterm> | 
|---|
| 180 | Mainly for historical reasons, there are several encoding methods in | 
|---|
| 181 | Japanese, which are not fully compatible with each other. There are | 
|---|
| 182 | two major encoding methods. One is the Shift_JIS series used in Windows | 
|---|
| 183 | and some UNIXes. The other is the EUC-JP series used in most UNIXes | 
|---|
| 184 | and Linux. Moreover, Samba previously also offered several unique encoding | 
|---|
| 185 | methods, named CAP and HEX, to keep interoperability with CAP/NetAtalk and | 
|---|
| 186 | UNIXes that can't use Japanese filenames.  Some implementations of the | 
|---|
| 187 | EUC-JP series can't support the full Windows character set. | 
|---|
| 188 | </para></listitem> | 
|---|
| 189 |  | 
|---|
| 190 | <listitem><para>There are some code conversion tables between Unicode and legacy | 
|---|
| 191 | Japanese character sets. One is compatible with Windows, another one | 
|---|
| 192 | is based on the reference of the Unicode consortium, and others are | 
|---|
| 193 | a mixed implementation. The Unicode consortium does not officially | 
|---|
| 194 | define any conversion tables between Unicode and legacy character | 
|---|
| 195 | sets, so there cannot be a single standard one. | 
|---|
| 196 | </para></listitem> | 
|---|
| 197 |  | 
|---|
| 198 | <listitem><para>The character set and conversion tables available in iconv() depend | 
|---|
| 199 | on the iconv library that is available. In addition, the Japanese locale | 
|---|
| 200 | names may be different on different systems.  This means that the value of | 
|---|
| 201 | the charset parameters depends on the implementation of iconv() you are using. | 
|---|
| 202 | </para> | 
|---|
| 203 |  | 
|---|
| 204 | <para> | 
|---|
| 205 | <indexterm><primary>UCS-2</primary></indexterm> | 
|---|
| 206 | <indexterm><primary>Shift_JIS</primary></indexterm> | 
|---|
| 207 | <indexterm><primary>ASCII</primary></indexterm> | 
|---|
| 208 | <indexterm><primary>English</primary></indexterm> | 
|---|
| 209 | Although Windows uses fixed 2-byte UCS-2 encoding internally, | 
|---|
| 210 | Shift_JIS series encoding is commonly used in Japanese environments, | 
|---|
| 211 | much as ASCII encoding is in English environments. | 
|---|
| 212 | </para></listitem> | 
|---|
| 213 | </itemizedlist> | 
|---|
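<para>
To make the incompatibility between the Shift_JIS and EUC-JP series concrete, the following Python sketch
encodes the same filename both ways and then misreads one encoding as the other. It is only a sketch: the
codec names <literal>cp932</literal> and <literal>euc_jp</literal> are Python's own and may differ in detail
from the conversion tables of the iconv library that Samba is built against.
</para>
<programlisting>
```python
# U+5171 U+6709 is the two-character Japanese word for "share".
name = "\u5171\u6709.txt"

sjis_bytes = name.encode("cp932")    # Shift_JIS series, as used on Windows
eucjp_bytes = name.encode("euc_jp")  # EUC-JP series, as used on many UNIX systems

print("cp932 :", sjis_bytes.hex())
print("euc_jp:", eucjp_bytes.hex())

# Interpreting EUC-JP bytes as Shift_JIS does not raise an error here;
# it silently produces different characters instead of the original name.
misread = eucjp_bytes.decode("cp932", errors="replace")
print("round trip preserved the name:", misread == name)
```
</programlisting>
<para>
Only the ASCII part of the name survives the misreading, which is why the charset parameters on both sides
have to agree.
</para>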
| 214 |  | 
|---|
| 215 | <sect2><title>Basic Parameter Setting</title> | 
|---|
| 216 |  | 
|---|
| 217 | <para> | 
|---|
| 218 | <indexterm><primary>CP932</primary></indexterm> | 
|---|
| 219 | The <smbconfoption name="dos charset"/> and | 
|---|
| 220 | <smbconfoption name="display charset"/> | 
|---|
| 221 | should be set to the locale compatible with the character set | 
|---|
| 222 | and encoding method used on Windows. This is usually CP932 | 
|---|
| 223 | but sometimes has a different name. | 
|---|
| 224 | </para> | 
|---|
| 225 |  | 
|---|
| 226 | <para> | 
|---|
| 227 | <indexterm><primary>Shift_JIS</primary></indexterm> | 
|---|
| 228 | <indexterm><primary>UTF-8</primary></indexterm> | 
|---|
| 229 | <indexterm><primary>EUC-JP</primary></indexterm> | 
|---|
| 230 | The <smbconfoption name="unix charset"/> can be either Shift_JIS series, | 
|---|
| 231 | EUC-JP series, or UTF-8. UTF-8 is always available, but the availability of other locales | 
|---|
| 232 | and their names depend on the system. | 
|---|
| 233 | </para> | 
|---|
| 234 |  | 
|---|
| 235 | <para> | 
|---|
| 236 | Additionally, you can consider using the Shift_JIS series as the | 
|---|
| 237 | value of the <smbconfoption name="unix charset"/> | 
|---|
| 238 | parameter by using the vfs_cap module, which does the same thing as | 
|---|
| 239 | setting <quote>coding system = CAP</quote> in the Samba 2.2 series. | 
|---|
| 240 | </para> | 
|---|
| 241 |  | 
|---|
| 242 | <para> | 
|---|
| 243 | Which value to choose for <smbconfoption name="unix charset"/> | 
|---|
| 244 | is a difficult question. The following list gives the details, advantages, and | 
|---|
| 245 | disadvantages of each value. | 
|---|
| 246 | </para> | 
|---|
| 247 |  | 
|---|
| 248 | <variablelist> | 
|---|
| 249 | <varlistentry><term>Shift_JIS series</term> | 
|---|
| 250 | <listitem><para> | 
|---|
| 251 | Shift_JIS series means a locale that is equivalent to <constant>Shift_JIS</constant>, | 
|---|
| 252 | used as a standard on Japanese Windows. In the case of <constant>Shift_JIS</constant>, | 
|---|
| 253 | for example, if a Japanese filename consisting of 0x8ba4 and 0x974c | 
|---|
| 254 | (a 4-byte Japanese character string meaning <quote>share</quote>) followed by <quote>.txt</quote> | 
|---|
| 255 | is written from Windows to Samba, the filename on UNIX becomes | 
|---|
| 256 | 0x8ba4, 0x974c, <quote>.txt</quote> (an 8-byte binary string), the same as on Windows. | 
|---|
| 257 | </para> | 
|---|
| 258 |  | 
|---|
| 259 | <para>The Shift_JIS series is commonly used as the Japanese locale on some commercial | 
|---|
| 260 | UNIXes, such as HP-UX and AIX (although it is also possible to use the EUC-JP | 
|---|
| 261 | locale series there). When the Shift_JIS series is used on these platforms, | 
|---|
| 262 | Japanese filenames created from Windows can also be referred to on | 
|---|
| 263 | UNIX.</para> | 
|---|
| 264 |  | 
|---|
| 265 | <para> | 
|---|
| 266 | If your UNIX is already working with Shift_JIS and there is a user | 
|---|
| 267 | who needs to use Japanese filenames written from Windows, the | 
|---|
| 268 | Shift_JIS series is the best choice.  However, broken filenames | 
|---|
| 269 | may be displayed, and some commands that cannot handle non-ASCII | 
|---|
| 270 | filenames may abort while parsing filenames. In particular, filenames | 
|---|
| 271 | may contain the byte <quote>\ (0x5c)</quote>, which needs to be handled carefully. | 
|---|
| 272 | It is best not to touch filenames written from Windows on UNIX. | 
|---|
| 273 | </para> | 
|---|
| 274 |  | 
|---|
| 275 | <para> | 
|---|
| 276 | Note that most Japanized free software actually works with EUC-JP | 
|---|
| 277 | only. It is good practice to verify that the Japanized free software can work | 
|---|
| 278 | with Shift_JIS. | 
|---|
| 279 | </para> | 
|---|
| 280 | </listitem> | 
|---|
| 281 | </varlistentry> | 
|---|
| 282 |  | 
|---|
| 283 | <varlistentry><term>EUC-JP series</term> | 
|---|
| 284 | <listitem><para> | 
|---|
| 285 | <indexterm><primary>EUC-JP</primary></indexterm> | 
|---|
| 286 | <indexterm><primary>Japanese UNIX</primary></indexterm> | 
|---|
| 287 | EUC-JP series means a locale that is equivalent to the industry | 
|---|
| 288 | standard called EUC-JP, widely used in Japanese UNIX (although EUC | 
|---|
| 289 | contains specifications for languages other than Japanese, such as | 
|---|
| 290 | EUC-KR). In the case of EUC-JP series, for example, if a Japanese | 
|---|
| 291 | filename consists of 0x8ba4 and 0x974c and <quote>.txt</quote> is written from | 
|---|
| 292 | Windows on Samba, the filename on UNIX becomes 0xb6a6, 0xcdad, | 
|---|
| 293 | <quote>.txt</quote> (an 8-byte BINARY string). | 
|---|
| 294 | </para> | 
|---|
| 295 |  | 
|---|
| 296 | <para> | 
|---|
| 297 | <indexterm><primary>EUC-JP</primary></indexterm> | 
|---|
| 298 | <indexterm><primary>UNIX</primary></indexterm> | 
|---|
| 299 | <indexterm><primary>Linux</primary></indexterm> | 
|---|
| 300 | <indexterm><primary>FreeBSD</primary></indexterm> | 
|---|
| 301 | <indexterm><primary>Solaris</primary></indexterm> | 
|---|
| 302 | <indexterm><primary>IRIX</primary></indexterm> | 
|---|
| 303 | <indexterm><primary>Tru64 UNIX</primary></indexterm> | 
|---|
| 304 | <indexterm><primary>Japanese locale</primary></indexterm> | 
|---|
| 305 | <indexterm><primary>Shift_JIS</primary></indexterm> | 
|---|
| 306 | <indexterm><primary>UTF-8</primary></indexterm> | 
|---|
| 307 | EUC-JP is commonly used as the Japanese locale on open source UNIX systems such as Linux and FreeBSD, and on commercial | 
|---|
| 308 | UNIX systems such as Solaris, IRIX, and Tru64 UNIX (although on Solaris it is also possible to use Shift_JIS and UTF-8, | 
|---|
| 309 | and on Tru64 UNIX it is possible to use Shift_JIS). When the EUC-JP series is used, most Japanese filenames created from | 
|---|
| 310 | Windows can also be referred to on UNIX. Also, most Japanized free software works mainly with EUC-JP only. | 
|---|
| 311 | </para> | 
|---|
| 312 |  | 
|---|
| 313 | <para> | 
|---|
| 314 | It is recommended to choose EUC-JP series when using Japanese filenames on UNIX. | 
|---|
| 315 | </para> | 
|---|
| 316 |  | 
|---|
| 317 | <para> | 
|---|
| 318 | Although there is no character that needs to be carefully treated | 
|---|
| 319 | like <quote>\ (0x5c)</quote>, broken filenames may be displayed and some | 
|---|
| 320 | commands that cannot handle non-ASCII filenames may abort | 
|---|
| 321 | while parsing filenames. | 
|---|
| 322 | </para> | 
|---|
| 323 |  | 
|---|
| 324 | <para> | 
|---|
| 325 | <indexterm><primary>eucJP-ms locale</primary></indexterm> | 
|---|
| 326 | Moreover, if you built Samba against a separately installed libiconv, | 
|---|
| 327 | the eucJP-ms locale included in libiconv and EUC-JP series locale | 
|---|
| 328 | included in the operating system may not be compatible. In this case, you may need to | 
|---|
| 329 | avoid using incompatible characters for filenames. | 
|---|
| 330 | </para> | 
|---|
| 331 | </listitem> | 
|---|
| 332 | </varlistentry> | 
|---|
| 333 |  | 
|---|
| 334 | <varlistentry><term>UTF-8</term> | 
|---|
| 335 | <listitem><para> | 
|---|
| 336 | UTF-8 means a locale equivalent to UTF-8, the international standard defined by the Unicode consortium. In | 
|---|
| 337 | UTF-8, a <parameter>character</parameter> is expressed using 1 to 4 bytes. In the case of the Japanese language, | 
|---|
| 338 | most characters are expressed using 3 bytes. Because Windows expresses Japanese in Shift_JIS, where a character takes 1 | 
|---|
| 339 | or 2 bytes, the byte length of a UTF-8 string is typically about 1.5 times that of the | 
|---|
| 340 | original Shift_JIS string. In the case of UTF-8, for example, if a Japanese | 
|---|
| 341 | filename consists of 0x8ba4 and 0x974c, and <quote>.txt</quote> is written from Windows on Samba, the filename | 
|---|
| 342 | on UNIX becomes 0xe585, 0xb1e6, 0x9c89, <quote>.txt</quote> (a 10-byte BINARY string). | 
|---|
| 343 | </para> | 
|---|
| 344 |  | 
|---|
| 345 | <para> | 
|---|
| 346 | For systems where iconv() is not available or where iconv()'s locales | 
|---|
| 347 | are not compatible with Windows, UTF-8 is the only locale available. | 
|---|
| 348 | </para> | 
|---|
| 349 |  | 
|---|
| 350 | <para> | 
|---|
| 351 | There are no systems that use UTF-8 as the default locale for Japanese. | 
|---|
| 352 | </para> | 
|---|
| 353 |  | 
|---|
| 354 | <para> | 
|---|
| 355 | Some broken filenames may be displayed, and some commands that | 
|---|
| 356 | cannot handle non-ASCII filenames may abort while parsing | 
|---|
| 357 | filenames. Unlike Shift_JIS, UTF-8 never uses the byte <quote>\ (0x5c)</quote> as part of | 
|---|
| 358 | a multibyte character, but it is still best not to touch filenames | 
|---|
| 359 | written from Windows on UNIX. | 
|---|
| 360 | </para> | 
|---|
| 361 |  | 
|---|
| 362 | <para> | 
|---|
| 363 | <indexterm><primary>Windows</primary></indexterm> | 
|---|
| 364 | <indexterm><primary>Java</primary></indexterm> | 
|---|
| 365 | <indexterm><primary>Unicode UTF-8</primary></indexterm> | 
|---|
| 366 | In addition, although it is not directly related to Samba, there are | 
|---|
| 367 | subtle differences between the iconv() function, which is | 
|---|
| 368 | generally used on UNIX, and the functions used on other platforms, | 
|---|
| 369 | such as Windows and Java. Conversion between | 
|---|
| 370 | Shift_JIS and Unicode UTF-8 must therefore be done with care and recognition | 
|---|
| 371 | of the limitations involved in the process. | 
|---|
| 372 | </para> | 
|---|
| 373 |  | 
|---|
| 374 | <para> | 
|---|
| 375 | <indexterm><primary>Mac OS X </primary></indexterm> | 
|---|
| 376 | Although Mac OS X uses UTF-8 as its encoding method for filenames, | 
|---|
| 377 | it uses an extended UTF-8 specification that Samba cannot handle, so | 
|---|
| 378 | UTF-8 locale is not available for Mac OS X. | 
|---|
| 379 | </para> | 
|---|
| 380 | </listitem> | 
|---|
| 381 | </varlistentry> | 
|---|
| 382 |  | 
|---|
| 383 | <varlistentry><term>Shift_JIS series + vfs_cap (CAP encoding)</term> | 
|---|
| 384 | <listitem><para> | 
|---|
| 385 | <indexterm><primary>CAP</primary></indexterm> | 
|---|
| 386 | <indexterm><primary>NetAtalk</primary></indexterm> | 
|---|
| 387 | <indexterm><primary>Macintosh</primary></indexterm> | 
|---|
| 388 | CAP encoding means a specification used in CAP and NetAtalk, file | 
|---|
| 389 | server software for Macintosh. In the case of CAP encoding, for | 
|---|
| 390 | example, if a Japanese filename consists of 0x8ba4 and 0x974c, and | 
|---|
| 391 | <quote>.txt</quote> is written from Windows on Samba, the filename on UNIX | 
|---|
| 392 | becomes <quote>:8b:a4:97L.txt</quote> (a 14-byte ASCII string). | 
|---|
| 393 | </para> | 
|---|
| 394 |  | 
|---|
| 395 | <para> | 
|---|
| 396 | For CAP encoding, a byte that cannot be expressed as an ASCII | 
|---|
| 397 | character (0x80 or above) is encoded in an <quote>:xx</quote> form. You still need to take | 
|---|
| 398 | care when a filename contains <quote>\ (0x5c)</quote>, but filenames are not | 
|---|
| 399 | broken on a system that cannot handle non-ASCII filenames. | 
|---|
| 400 | </para> | 
|---|
| 401 |  | 
|---|
| 402 | <para> | 
|---|
| 403 | The greatest merit of CAP encoding is the compatibility of encoding | 
|---|
| 404 | filenames with CAP or NetAtalk. These are, respectively, the Columbia AppleTalk | 
|---|
| 405 | Package and the NetAtalk open source software project. | 
|---|
| 406 | Because these applications write filenames on UNIX with CAP encoding, if a | 
|---|
| 407 | directory is shared with both Samba and NetAtalk, you need to use | 
|---|
| 408 | CAP encoding to prevent non-ASCII filenames from being broken. | 
|---|
| 409 | </para> | 
|---|
| 410 |  | 
|---|
| 411 | <para> | 
|---|
| 412 | However, NetAtalk has recently been | 
|---|
| 413 | patched on some systems to write filenames with EUC-JP (for example, on the Japanese distribution Vine Linux). | 
|---|
| 414 | In this case, you need to choose EUC-JP series instead of CAP encoding. | 
|---|
| 415 | </para> | 
|---|
| 416 |  | 
|---|
| 417 | <para> | 
|---|
| 418 | vfs_cap can also be used with non-Shift_JIS series locales, on | 
|---|
| 419 | systems that cannot handle non-ASCII characters or systems that | 
|---|
| 420 | share files with NetAtalk. | 
|---|
| 421 | </para> | 
|---|
| 422 |  | 
|---|
| 423 | <para> | 
|---|
| 424 | To use CAP encoding on Samba-3, you should use the unix charset parameter and VFS | 
|---|
| 425 | as in <link linkend="vfscap-intl">the VFS CAP smb.conf file</link>. | 
|---|
| 426 | </para> | 
|---|
| 427 |  | 
|---|
| 428 | <example id="vfscap-intl"> | 
|---|
| 429 | <title>VFS CAP</title> | 
|---|
| 430 | <smbconfblock> | 
|---|
| 431 | <smbconfsection name="[global]"/> | 
|---|
| 432 | <smbconfcomment>the locale name "CP932" may be different</smbconfcomment> | 
|---|
| 433 | <smbconfoption name="dos charset">CP932</smbconfoption> | 
|---|
| 434 | <smbconfoption name="unix charset">CP932</smbconfoption> | 
|---|
| 435 |  | 
|---|
| 436 | <smbconfsection name="[cap-share]"/> | 
|---|
| 437 | <smbconfoption name="vfs objects">cap</smbconfoption> | 
|---|
| 438 | </smbconfblock> | 
|---|
| 439 | </example> | 
|---|
| 440 |  | 
|---|
| 441 | <para> | 
|---|
| 442 | <indexterm><primary>CP932</primary></indexterm> | 
|---|
| 443 | <indexterm><primary>libiconv</primary></indexterm> | 
|---|
| 444 | <indexterm><primary>unix charset</primary></indexterm> | 
|---|
| 445 | <indexterm><primary>cap-share</primary></indexterm> | 
|---|
| 446 | If you are using GNU libiconv, set unix charset to CP932. With this setting, | 
|---|
| 447 | filenames in the <quote>cap-share</quote> share are written with CAP encoding. | 
|---|
| 448 | </para> | 
|---|
| 449 | </listitem> | 
|---|
| 450 | </varlistentry> | 
|---|
| 451 | </variablelist> | 
|---|
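<para>
The byte sequences quoted in the list above can be reproduced with a short Python sketch. It makes several
assumptions: the codec names are Python's rather than iconv's, <literal>cap_encode</literal> is a
hypothetical helper that implements only the <quote>:xx</quote> rule described above and is not the actual
vfs_cap code, and U+8868 is simply one well-known kanji whose Shift_JIS form ends in 0x5c.
</para>
<programlisting>
```python
def cap_encode(raw):
    """Simplified CAP-style encoding: bytes 0x80 and above become ':xx'."""
    return "".join(":%02x" % b if b >= 0x80 else chr(b) for b in raw)


# U+5171 U+6709 is the Japanese word for "share" used in the examples above.
name = "\u5171\u6709.txt"

# Byte length and hex dump of the filename for each candidate unix charset.
for charset in ("cp932", "euc_jp", "utf-8"):
    raw = name.encode(charset)
    print(charset, len(raw), raw.hex())

# The same Shift_JIS bytes after CAP encoding are plain ASCII.
cap_name = cap_encode(name.encode("cp932"))
print(cap_name, len(cap_name))

# Check whether the second byte of a kanji collides with the ASCII backslash.
tricky = "\u8868"
print("0x5c in cp932 :", 0x5C in tricky.encode("cp932"))
print("0x5c in euc_jp:", 0x5C in tricky.encode("euc_jp"))
```
</programlisting>
<para>
The backslash check shows why filenames in the Shift_JIS series need extra care with tools that treat 0x5c
as an escape or path character, while the EUC-JP form of the same character does not contain that byte.
</para>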
| 452 |  | 
|---|
| 453 | </sect2> | 
|---|
| 454 |  | 
|---|
| 455 | <sect2><title>Individual Implementations</title> | 
|---|
| 456 |  | 
|---|
| 457 | <para> | 
|---|
| 458 | Here is some additional information regarding individual implementations: | 
|---|
| 459 | </para> | 
|---|
| 460 |  | 
|---|
| 461 | <variablelist> | 
|---|
| 462 | <varlistentry><term>GNU libiconv</term> | 
|---|
| 463 | <listitem><para> | 
|---|
| 464 | To handle Japanese correctly, you should apply the patch | 
|---|
| 465 | <ulink url="http://www2d.biglobe.ne.jp/~msyk/software/libiconv-patch.html">libiconv-1.8-cp932-patch.diff.gz</ulink> | 
|---|
| 466 | to libiconv-1.8. | 
|---|
| 467 | </para> | 
|---|
| 468 |  | 
|---|
| 469 | <para> | 
|---|
| 470 | Using the patched libiconv-1.8, these settings are available: | 
|---|
| 471 | </para> | 
|---|
| 472 |  | 
|---|
| 473 | <programlisting> | 
|---|
| 474 | dos charset = CP932 | 
|---|
| 475 | unix charset = CP932 / eucJP-ms / UTF-8 | 
|---|
| 476 |                 |       | | 
|---|
| 477 |                 |       +-- EUC-JP series | 
|---|
| 478 |                 +-- Shift_JIS series | 
|---|
| 479 | display charset = CP932 | 
|---|
| 480 | </programlisting> | 
|---|
| 481 |  | 
|---|
| 482 | <para> | 
|---|
| 483 | Other Japanese locales (for example, Shift_JIS and EUC-JP) should not | 
|---|
| 484 | be used because they lack compatibility with Windows. | 
|---|
| 485 | </para> | 
|---|
| 486 | </listitem> | 
|---|
| 487 | </varlistentry> | 
|---|
| 488 |  | 
|---|
| 489 | <varlistentry><term>GNU glibc</term> | 
|---|
| 490 | <listitem><para> | 
|---|
| 491 | To handle Japanese correctly, you should apply a <ulink url="http://www2d.biglobe.ne.jp/~msyk/software/glibc/">patch</ulink> | 
|---|
| 492 | to glibc-2.2.5/2.3.1/2.3.2 or should use the patch-merged versions, glibc-2.3.3 or later. | 
|---|
| 493 | </para> | 
|---|
| 494 |  | 
|---|
| 495 | <para> | 
|---|
| 496 | Using the above glibc, these settings are available: | 
|---|
| 497 | <smbconfblock> | 
|---|
| 498 | <smbconfoption name="dos charset">CP932</smbconfoption> | 
|---|
| 499 | <smbconfoption name="unix charset">CP932 / eucJP-ms / UTF-8</smbconfoption> | 
|---|
| 500 | <smbconfoption name="display charset">CP932</smbconfoption> | 
|---|
| 501 | </smbconfblock> | 
|---|
| 502 | </para> | 
|---|
| 503 |  | 
|---|
| 504 | <para> | 
|---|
| 505 | Other Japanese locales (for example, Shift_JIS and EUC-JP) should not | 
|---|
| 506 | be used because they lack compatibility with Windows. | 
|---|
| 507 | </para> | 
|---|
| 508 | </listitem> | 
|---|
| 509 | </varlistentry> | 
|---|
| 510 | </variablelist> | 
|---|
| 511 |  | 
|---|
| 512 | </sect2> | 
|---|
| 513 |  | 
|---|
| 514 | <sect2> | 
|---|
| 515 | <title>Migration from Samba-2.2 Series</title> | 
|---|
| 516 |  | 
|---|
| 517 | <para> | 
|---|
| 518 | Up to and including the Samba-2.2 series, the <quote>coding system</quote> parameter was used. The default codepage in Samba | 
|---|
| 519 | 2.x was code page 850. In the Samba-3 series this has been replaced with the | 
|---|
| 520 | <smbconfoption name="unix charset"/> parameter.  <link linkend="japancharsets">Japanese Character Sets in Samba-2.2 and Samba-3</link> | 
|---|
| 521 | shows the mapping table when migrating from the Samba-2.2 series to Samba-3. | 
|---|
| 522 | </para> | 
|---|
| 523 |  | 
|---|
| 524 | <table frame="all" id="japancharsets"> | 
|---|
| 525 | <title>Japanese Character Sets in Samba-2.2 and Samba-3</title> | 
|---|
| 526 |  | 
|---|
| 527 | <tgroup cols="2" align="center"> | 
|---|
| 528 | <colspec align="center"/> | 
|---|
| 529 | <colspec align="center"/> | 
|---|
| 530 | <thead> | 
|---|
| 531 | <row><entry>Samba-2.2 Coding System</entry><entry>Samba-3 unix charset</entry></row> | 
|---|
| 532 | </thead> | 
|---|
| 533 | <tbody> | 
|---|
| 534 | <row><entry>SJIS</entry><entry>Shift_JIS series</entry></row> | 
|---|
| 535 | <row><entry>EUC</entry><entry>EUC-JP series</entry></row> | 
|---|
| 536 | <row><entry>EUC3<footnote><para>Only exists in Japanese Samba version</para></footnote></entry><entry>EUC-JP series</entry></row> | 
|---|
| 537 | <row><entry>CAP</entry><entry>Shift_JIS series + VFS</entry></row> | 
|---|
| 538 | <row><entry>HEX</entry><entry>currently none</entry></row> | 
|---|
| 539 | <row><entry>UTF8</entry><entry>UTF-8</entry></row> | 
|---|
| 540 | <row><entry>UTF8-Mac<footnote><para>Only exists in Japanese Samba version</para></footnote></entry><entry>currently none</entry></row> | 
|---|
| 541 | <row><entry>others</entry><entry>none</entry></row> | 
|---|
| 542 | </tbody> | 
|---|
| 543 | </tgroup> | 
|---|
| 544 | </table> | 
|---|
| 545 |  | 
|---|
| 546 | </sect2> | 
|---|
| 547 |  | 
|---|
| 548 | </sect1> | 
|---|
| 549 |  | 
|---|
| 550 | <sect1> | 
|---|
| 551 | <title>Common Errors</title> | 
|---|
| 552 |  | 
|---|
| 553 | <sect2> | 
|---|
| 554 | <title>CP850.so Can't Be Found</title> | 
|---|
| 555 |  | 
|---|
| 556 | <para><quote>Samba is complaining about a missing <filename>CP850.so</filename> file.</quote></para> | 
|---|
| 557 |  | 
|---|
| 558 | <para> | 
|---|
| 559 | CP850 is the default <smbconfoption name="dos charset"/>. | 
|---|
| 560 | The <smbconfoption name="dos charset"/> is used to convert data to the codepage used by your DOS clients. | 
|---|
| 561 | If you do not have any DOS clients, you can safely ignore this message. </para> | 
|---|
| 562 |  | 
|---|
| 563 | <para> | 
|---|
| 564 | CP850 should be supported by your local iconv implementation. Make sure you have all the required packages installed. | 
|---|
| 565 | If you compiled Samba from source, make sure that the configure process found iconv. This can be | 
|---|
| 566 | confirmed by checking the <filename>config.log</filename> file that is generated when | 
|---|
| 567 | <command>configure</command> is executed.</para> | 
|---|
| 568 | </sect2> | 
|---|
| 569 | </sect1> | 
|---|
| 570 |  | 
|---|
| 571 | </chapter> | 
|---|