mirror of
https://github.com/git/git.git
synced 2024-10-31 14:27:54 +01:00
3a59e5954e
As a "distributed" VCS, git should better define the encodings of its core textual data structures, in particular those that are part of the network protocol. That git is encoding agnostic is only really true for blob objects. E.g. the 'non-NUL bytes' requirement of tree and commit objects excludes UTF-16/32, and the special meaning of '/' in the index file as well as space and linefeed in commit objects eliminates EBCDIC and other non-ASCII encodings. Git expects bytes < 0x80 to be pure ASCII, thus CJK encodings that partly overlap with the ASCII range are problematic as well. E.g. fmt_ident() removes trailing 0x5C from user names on the assumption that it is ASCII '\'. However, there are over 200 GBK double byte codes that end in 0x5C. UTF-8 as default encoding on Linux and respective path translations in the Mac and Windows versions have established UTF-8 NFC as de-facto standard for path names. Update the documentation in i18n.txt to reflect the current status-quo. Signed-off-by: Karsten Blees <blees@dcon.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
70 lines
2.9 KiB
Text
70 lines
2.9 KiB
Text
Git is to some extent character encoding agnostic.
|
|
|
|
- The contents of the blob objects are uninterpreted sequences
|
|
of bytes. There is no encoding translation at the core
|
|
level.
|
|
|
|
- Path names are encoded in UTF-8 normalization form C. This
|
|
applies to tree objects, the index file, ref names, as well as
|
|
path names in command line arguments, environment variables
|
|
and config files (`.git/config` (see linkgit:git-config[1]),
|
|
linkgit:gitignore[5], linkgit:gitattributes[5] and
|
|
linkgit:gitmodules[5]).
|
|
+
|
|
Note that Git at the core level treats path names simply as
|
|
sequences of non-NUL bytes, there are no path name encoding
|
|
conversions (except on Mac and Windows). Therefore, using
|
|
non-ASCII path names will mostly work even on platforms and file
|
|
systems that use legacy extended ASCII encodings. However,
|
|
repositories created on such systems will not work properly on
|
|
UTF-8-based systems (e.g. Linux, Mac, Windows) and vice versa.
|
|
Additionally, many Git-based tools simply assume path names to
|
|
be UTF-8 and will fail to display other encodings correctly.
|
|
|
|
- Commit log messages are typically encoded in UTF-8, but other
|
|
extended ASCII encodings are also supported. This includes
|
|
ISO-8859-x, CP125x and many others, but _not_ UTF-16/32,
|
|
EBCDIC and CJK multi-byte encodings (GBK, Shift-JIS, Big5,
|
|
EUC-x, CP9xx etc.).
|
|
|
|
Although we encourage that the commit log messages are encoded
|
|
in UTF-8, both the core and Git Porcelain are designed not to
|
|
force UTF-8 on projects. If all participants of a particular
|
|
project find it more convenient to use legacy encodings, Git
|
|
does not forbid it. However, there are a few things to keep in
|
|
mind.
|
|
|
|
. 'git commit' and 'git commit-tree' issues
|
|
a warning if the commit log message given to it does not look
|
|
like a valid UTF-8 string, unless you explicitly say your
|
|
project uses a legacy encoding. The way to say this is to
|
|
have i18n.commitencoding in `.git/config` file, like this:
|
|
+
|
|
------------
|
|
[i18n]
|
|
commitencoding = ISO-8859-1
|
|
------------
|
|
+
|
|
Commit objects created with the above setting record the value
|
|
of `i18n.commitencoding` in its `encoding` header. This is to
|
|
help other people who look at them later. Lack of this header
|
|
implies that the commit log message is encoded in UTF-8.
|
|
|
|
. 'git log', 'git show', 'git blame' and friends look at the
|
|
`encoding` header of a commit object, and try to re-code the
|
|
log message into UTF-8 unless otherwise specified. You can
|
|
specify the desired output encoding with
|
|
`i18n.logoutputencoding` in `.git/config` file, like this:
|
|
+
|
|
------------
|
|
[i18n]
|
|
logoutputencoding = ISO-8859-1
|
|
------------
|
|
+
|
|
If you do not have this configuration variable, the value of
|
|
`i18n.commitencoding` is used instead.
|
|
|
|
Note that we deliberately chose not to re-code the commit log
|
|
message when a commit is made to force UTF-8 at the commit
|
|
object level, because re-coding to UTF-8 is not necessarily a
|
|
reversible operation.
|