Character sets and encodings

Character sets or encodings determine how letters and symbols are represented numerically on your computer. Many different representations are in use in different countries and for different languages. Probably the most common is 7-bit ASCII, which is sufficient to represent the characters and symbols in common use in English-speaking countries, but which cannot represent accented characters, or the characters of other alphabets and scripts, such as those used for Russian or Arabic.

The Gedcom 5.5 specification supports only three character sets and encodings: ANSEL, UNICODE and ASCII. Although the specification preferred ANSEL at the time of its publication in 1996, it anticipated a future preference for UNICODE, and the subsequent draft 5.5.1 specification of 1999 also supports UTF-8, an encoding of Unicode.

UTF-8 has the advantage that a 7-bit ASCII file needs no conversion, because its representation in UTF-8 is identical, and yet UTF-8 is a complete Unicode encoding, capable of representing every character defined in the Unicode standard, which covers the languages and scripts in use throughout the world. It is now widely supported by genealogy programs.
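The compatibility between 7-bit ASCII and UTF-8 can be illustrated with a short sketch. It is written in Python purely for illustration and says nothing about how GWintree itself is implemented; the sample names and the helper fits_ascii are hypothetical.

```python
# Illustrative sketch only: sample names and fits_ascii are hypothetical.

plain = "John Smith"

# Text made up only of 7-bit ASCII characters has the same bytes in both encodings.
same_bytes = plain.encode("ascii") == plain.encode("utf-8")

# Accented characters are still representable in UTF-8, using more than one byte each.
accented = "Jos\u00e9 M\u00fcller"          # "José Müller"
utf8_accented = accented.encode("utf-8")


def fits_ascii(text):
    """Return True if every character of text is in the 7-bit ASCII set."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False
```

Here same_bytes is true, while fits_ascii(accented) is false: the accented name can be stored as UTF-8 but not as 7-bit ASCII.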

GWintree supports ANSEL, UNICODE, UTF-8 and ASCII. It also supports ANSI, which does not conform to the Gedcom specification because it does not have a unique interpretation: an ANSI file will not be transferable between systems operating in different countries and languages.
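The ambiguity of ANSI can be seen in a minimal sketch, assuming that 'ANSI' means the default Windows codepage of the system concerned. The example is in Python for illustration only, and the two codepages chosen (Western European cp1252 and Cyrillic cp1251) are simply representative.

```python
# Illustrative sketch only: one byte, two Windows codepages, two different letters.

raw = bytes([0xE9])               # a single byte above the 7-bit ASCII range

western = raw.decode("cp1252")    # Western European codepage
cyrillic = raw.decode("cp1251")   # Cyrillic codepage

# The same stored byte is read as a different character on each system.
differs = western != cyrillic
```

The byte is read as a Latin e with acute accent under cp1252 and as the Cyrillic letter short i under cp1251, which is why a file marked only as ANSI cannot be interpreted reliably without knowing where it was created.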

In the case of a Gedcom file which specifies an encoding other than one of these, or which specifies ANSI but was created in another country, you may need to specify the encoding in the Identify encoding dialog.

The character set or encoding used in a Gedcom file is displayed in the Header page when the file is viewed in GWintree. If the setting is changed, conversion to the selected encoding will be carried out when the file is saved. 'UNICODE' may be selected: this is interpreted as UTF-16, but it is not in common use for Gedcom data. 'ANSI', if selected, will be interpreted according to the default Windows codepage for the particular country or 'locale'. It is recommended that 'ASCII' be selected only when the file contains nothing but characters from this limited (7-bit) character set. Any non-ASCII characters would be lost on conversion to 7-bit ASCII, so the program does not actually perform such a conversion.
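Before labelling a file as ASCII, it can help to check whether it really contains only 7-bit characters. The following is a minimal sketch in Python, for illustration only; the helper non_ascii_chars and the sample name are hypothetical, and the lossy conversion is shown only to make clear what would be lost (the text above states that GWintree does not perform it).

```python
# Illustrative sketch only: non_ascii_chars and the sample name are hypothetical.

def non_ascii_chars(text):
    """Return the distinct characters outside 7-bit ASCII, in order of first appearance."""
    seen = []
    for ch in text:
        if ord(ch) > 0x7F and ch not in seen:
            seen.append(ch)
    return seen


name = "Fran\u00e7ois Lef\u00e8vre"     # "François Lefèvre"

offenders = non_ascii_chars(name)      # the characters that rule out 'ASCII'

# What a forced conversion to 7-bit ASCII would do: each offender becomes '?'.
lossy = name.encode("ascii", errors="replace").decode("ascii")

# The same text as UTF-16: two bytes per character here, plus a two-byte byte-order mark.
utf16 = name.encode("utf-16")
```

An empty result from non_ascii_chars suggests that 'ASCII' is a safe label; otherwise an encoding such as UTF-8 is the better choice.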

'ASCII' is most commonly used to refer to the 7-bit character set in widespread use. The Gedcom 5.5 specification, however, interprets 'ASCII' according to ANSI standards X3.134.1 and X3.134.2, which define an extended 8-bit US ASCII character set. These standards are neither readily available nor widely used, but the similar ISO-8859-1 character set is very widely used. Accordingly, conversion to the ISO-8859-1 character set will be carried out if 'ASCII' is selected in GWintree. Characters in the 7-bit ASCII character set will be unaffected.
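The effect of a conversion to ISO-8859-1 can be sketched as follows. This is an illustration in Python using its standard "iso-8859-1" codec, not a description of GWintree's own code; the sample names and the helper fits_latin1 are hypothetical.

```python
# Illustrative sketch only: sample names and fits_latin1 are hypothetical.

text = "Ren\u00e9e O'Brien"               # "Renée O'Brien"
latin1 = text.encode("iso-8859-1")        # one byte per character

# The 7-bit ASCII part is byte-for-byte the same in ASCII and ISO-8859-1.
ascii_part = "O'Brien"
unchanged = ascii_part.encode("iso-8859-1") == ascii_part.encode("ascii")


def fits_latin1(value):
    """Return True if every character of value exists in ISO-8859-1."""
    try:
        value.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


cyrillic_name = "\u0418\u0432\u0430\u043d"   # "Иван", outside ISO-8859-1
```

Western European accented letters fit in ISO-8859-1, but fits_latin1(cyrillic_name) is false: names in scripts such as Cyrillic need UTF-8 or UNICODE instead.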