Gedcom Publisher: Character Encoding

Overview

A GEDCOM file includes a CHAR tag in the HEAD section that indicates the character encoding used in the file. The valid options as of GEDCOM v5.5.1 are ANSEL, ASCII, UNICODE, and UTF-8. Unfortunately, some programs specify the wrong character encoding, some use an invalid option such as ANSI, and some files are edited by text editors that change the character encoding of the file without changing the CHAR value. For those reasons, Gedcom Publisher detects the character encoding using both the CHAR value and character encoding detection techniques.

Gedcom Publisher first opens the GEDCOM file to detect whether it begins with a Unicode "byte order mark" (BOM). If it finds a BOM, it ignores the CHAR value and reads the file using the encoding associated with that BOM.
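The BOM check described above can be sketched in a few lines of Python. This is only an illustration, not Gedcom Publisher's actual code: the function name `detect_bom` is hypothetical, and the set of BOMs recognized here (UTF-8, UTF-16, UTF-32) is an assumption, since the list Gedcom Publisher checks is not stated.

```python
import codecs

# BOM byte signatures paired with Python codec names. The UTF-32 entries come
# first because the UTF-32 LE BOM starts with the same two bytes as UTF-16 LE.
# Which BOMs Gedcom Publisher recognizes is an assumption of this sketch.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (codec_name, bom_length), or (None, 0) when no BOM is present."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            return codec, len(bom)
    return None, 0
```

Only the first few bytes of the file are needed, so a caller could pass something like `open(path, "rb").read(4)`.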

Gedcom Publisher then re-opens the GEDCOM file to read the CHAR tag and several other GEDCOM tags from the HEAD section. If it did not find a Unicode BOM, it reads the file as ASCII for this tag preview.

After reading the CHAR tag, if no Unicode BOM was found, Gedcom Publisher chooses an encoding based on the CHAR tag value.
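As a rough illustration of the tag preview, the following Python sketch reads the CHAR value from the HEAD record of a file that has no BOM. The function name `read_char_value` is hypothetical, and the handling of non-ASCII bytes (replacing them) is an assumption, not a description of what Gedcom Publisher does internally.

```python
def read_char_value(data: bytes):
    """Return the value of the '1 CHAR' line in the HEAD record, or None."""
    # Read as ASCII for the preview; bytes outside ASCII are replaced
    # (an assumption of this sketch) because tags themselves are ASCII.
    text = data.decode("ascii", errors="replace")
    in_head = False
    for line in text.splitlines():
        parts = line.strip().split(None, 2)  # level, tag, optional value
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        if level == "0":
            if in_head:
                break  # the HEAD record has ended
            in_head = tag == "HEAD"
        elif in_head and level == "1" and tag == "CHAR":
            return parts[2].strip() if len(parts) > 2 else None
    return None
```

For a header containing the line `1 CHAR ANSEL`, the function returns `"ANSEL"`. A CHAR line that appears outside the HEAD record is ignored.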

If your genealogy program supports writing a GEDCOM file using the UTF-8 encoding, choose that option for the best results with Gedcom Publisher.

When the character encoding in the GEDCOM file is set to "ASCII", Gedcom Publisher will accept characters in the Windows-1252 encoding. Windows-1252 is a superset of ASCII.
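The selection rules above (a BOM takes precedence, otherwise the CHAR value decides, and "ASCII" is read as Windows-1252) might be sketched as follows. The function name `choose_codec` and the codec names are assumptions for illustration. Python has no built-in ANSEL codec, so `"ansel"` is only a placeholder, and the sketch returns `None` for unrecognized values such as ANSI because the document does not say how Gedcom Publisher treats them.

```python
def choose_codec(bom_codec, char_value):
    """Pick a codec name from an optional BOM result and the CHAR value."""
    if bom_codec is not None:
        return bom_codec  # a BOM overrides the CHAR value
    value = (char_value or "").strip().upper()
    table = {
        "UTF-8": "utf-8",
        "UNICODE": "utf-16",  # byte order without a BOM is a guess
        "ASCII": "cp1252",    # Windows-1252, a superset of ASCII
        "ANSEL": "ansel",     # placeholder: needs a custom decoder
    }
    return table.get(value)  # None for values this sketch does not know
```

Because Windows-1252 agrees with ASCII for bytes 0 through 127, a pure ASCII file decodes identically either way, while a byte such as 0xE9 decodes to "é" instead of raising an error.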

Challenges

Unfortunately, there are character encoding issues that make it difficult or impossible to detect the encoding automatically.

GEDCOM file specifies UTF-8, but file is not UTF-8

If a GEDCOM file self-identifies as UTF-8 by including a 1 CHAR UTF-8 record, Gedcom Publisher may or may not be able to detect whether the file is actually in some other encoding. For example, if the file is actually a Windows text file encoded as Windows-1252, a common text file encoding on PCs running MS Windows, then Gedcom Publisher cannot tell that the file is not UTF-8. Unaccented characters will display correctly in the resulting book, but many accented characters will not. The solution is to change the Database.CHAR Value property to "ASCII", which makes Gedcom Publisher accept Windows-1252 characters.
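The effect on accented characters can be reproduced with a short Python experiment. The sample name is invented, and the use of `errors="replace"` is an assumption made for the demonstration; how Gedcom Publisher itself handles undecodable bytes is not described here.

```python
# A name as a Windows-1252 text file would store it: one byte per character.
raw = "Renée Müller".encode("cp1252")

# Reading those bytes as UTF-8 damages only the accented characters.
as_utf8 = raw.decode("utf-8", errors="replace")

# Reading them as Windows-1252 recovers the original text.
as_cp1252 = raw.decode("cp1252")

print(as_utf8)
print(as_cp1252)
```

The unaccented letters survive both decodings because Windows-1252 and UTF-8 agree on the ASCII range; only the bytes for "é" and "ü" are misread.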

ANSEL

ANSEL was an ANSI standard used to encode text, but as of 14 February 2013, the standard has been withdrawn. The Family History Department of the Church of Jesus Christ of Latter-day Saints recommended an extended version of ANSEL for use in GEDCOM files.

Fortunately, modern genealogy software programs are not limited to writing ANSEL-encoded GEDCOM files, and you should configure your software to write in another encoding, preferably UTF-8. Still, to process files that are encoded in ANSEL, Gedcom Publisher includes support for that encoding.

The tables below describe how certain ANSEL code points map to Unicode. They are based on information from these sources:

The GEDCOM Standard, Draft Release 5.5.1, "Appendix C, ANSEL Character Set"
The GEDCOM Standard, Release 5.5, "Appendix D, ANSEL Character Set"
The GEDCOM Standard, Release 4.0, "Chapter 6, Specification for GEDCOM Character Sets"
The Character Name Index from the Unicode Consortium
GEDCOM ANSEL Table by Tamura Jones, especially for corrections and examples

The GEDCOM standards listed above were prepared by the Family History Department of the Church of Jesus Christ of Latter-day Saints.

Spacing Characters

Hex	Decimal	Unicode	Graphic	Name	Example
A1	161	U+0141	Ł	capital L with stroke	Łódź
A2	162	U+00D8	Ø	capital O with stroke	Øst
A3	163	U+0110	Đ	capital D with stroke	Đuro
A4	164	U+00DE	Þ	capital thorn	Þann
A5	165	U+00C6	Æ	capital AE	Ægir
A6	166	U+0152	Œ	capital ligature OE	Œuvre
A7	167	U+02B9	ʹ	modifier letter prime	fakulʹtet
A8	168	U+00B7	·	middle dot	novel·la
A9	169	U+266D	♭	music flat sign	B♭
AA	170	U+00AE	®	registered sign	Kleenex ®
AB	171	U+00B1	±	plus-minus sign	1910±2
AC	172	U+01A0	Ơ	hook O, uppercase	BƠ
AD	173	U+01AF	Ư	hook U, uppercase	XƯA
AE	174	U+02BE	◌ʾ	right half ring (alif)	Unʾyusho
B0	176	U+02BF	◌ʿ	left half ring (ayn)	faʿil
B1	177	U+0142	ł	small l with stroke	rozbił
B2	178	U+00F8	ø	small o with stroke	høj
B3	179	U+0111	đ	small d with stroke	đavola
B4	180	U+00FE	þ	small thorn	þann
B5	181	U+00E6	æ	small ae	skæg
B6	182	U+0153	œ	small ligature oe	œuvre
B7	183	U+02BA	ʺ	modifier letter double prime	obʺi︠a︡vlenie
B8	184	U+0131	ı	small dotless i	masalı
B9	185	U+00A3	£	pound sign	£5.00
BA	186	U+00F0	ð	small eth	verður
BC	188	U+01A1	ơ	hook o, lowercase	Sơ
BD	189	U+01B0	ư	hook u, lowercase	Tư
BE	190	U+25A1	□	white square (LDS Extension)	□
BF	191	U+25A0	■	black square (LDS Extension)	■
C0	192	U+00B0	°	degree sign	98.6°
C1	193	U+2113	ℓ	script small L	2.0ℓ
C2	194	U+2117	℗	sound recording copyright	Parlophone℗
C3	195	U+00A9	©	copyright sign	©1993
C4	196	U+266F	♯	music sharp sign	D♯
C5	197	U+00BF	¿	inverted question mark	¿Qué?
C6	198	U+00A1	¡	inverted exclamation mark	¡Esta!
CD	205		e	e in middle of line (LDS Extension)	e
CE	206		o	o in middle of line (LDS Extension)	o
CF	207	U+00DF	ß	small sharp s	Preußen

Gedcom Publisher does not support the LDS extensions "e in middle of line" and "o in middle of line". They are converted to "e" and "o", respectively.

Combining (non-spacing) Characters

ANSEL includes combining characters that modify the following¹ character. In the table below, the Graphic column shows the combining character modifying a dotted circle ◌ (U+25CC).

Hex	Decimal	Unicode	Graphic	Name	Example
E0	224	U+0309	◌̉	hook above	củi
E1	225	U+0300	◌̀	grave accent	règle
E2	226	U+0301	◌́	acute accent	está
E3	227	U+0302	◌̂	circumflex accent	même
E4	228	U+0303	◌̃	tilde	niño
E5	229	U+0304	◌̄	macron	gājājs
E6	230	U+0306	◌̆	breve	altă
E7	231	U+0307	◌̇	dot above	żaba
E8	232	U+0308	◌̈	diaeresis (umlaut)	öppna
E9	233	U+030C	◌̌	caron (hacek)	vždy
EA	234	U+030A	◌̊	ring above (angstrom)	hår
EB	235	U+FE20	◌︠	ligature, left-half	akademii︠a︡
EC	236	U+FE21	◌︡	ligature, right-half	akademii︠a︡
ED	237	U+0315	◌̕	comma above right	rozdel̕ ovac
EE	238	U+030B	◌̋	double acute accent	időszaki
EF	239	U+0310	◌̐	candrabindu	Alii̐ev
F0	240	U+0327	◌̧	cedilla	ça
F1	241	U+0328	◌̨	ogonek (nasal hook)	vietą
F2	242	U+0323	◌̣	dot below	teḍa
F3	243	U+0324	◌̤	double dot below	k̲h̲ut̤bah
F4	244	U+0325	◌̥	circle below	Samskr̥ta
F5	245	U+0333	◌̳	double underscore	G̳hulam
F6	246	U+0332	◌̲	underscore	s̲amar
F7	247	U+0326	◌̦	left hook	dārzin̦a
F8	248	U+031C	◌̜	right cedilla	kho̜ng
F9	249	U+032E	◌̮	breve below	ḫumantus̆
FA	250	U+FE22	◌︢	double tilde, left half	n︢g︣alan
FB	251	U+FE23	◌︣	double tilde, right half	n︢g︣alan
FC	252	U+0338	◌̸	long solidus (slash) overlay (LDS Extension)	0̸
FE	254	U+0313	◌̓	comma above	ge̓otermika

Please note that per the Unicode Standard, Version 7.0, Chapter 3, D52, "The graphic positioning of a combining character depends on the last preceding base character, unless they are separated by a character that is neither a combining character nor either zero width joiner or zero width non-joiner."

In the table above, the position of the cedilla may be different when it applies to the dotted circle ◌̧ compared to where it appears under a "c" in "ça". In this document, the behavior will vary based on your browser's font choices for "serif" and "sans-serif", and also on your browser's text layout software.

The slash overlay character does not seem to be positioned properly when applied to the dotted circle or the digit zero. I tried multiple fonts and multiple base characters and all combinations produced similar results.

Notes

In the ANSEL encoding, combining characters precede the character they modify. In Unicode, combining characters follow the character they modify.
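The reordering this implies can be sketched in Python. This is a minimal illustration under several assumptions, not Gedcom Publisher's implementation: the function name `ansel_to_unicode` is hypothetical, only a handful of entries from the tables above are included, unmapped bytes become U+FFFD, and the final NFC normalization is an optional extra step chosen for this sketch.

```python
import unicodedata

# Small subsets of the tables above; a real decoder would list every row.
SPACING = {0xA1: "\u0141", 0xB2: "\u00F8", 0xB5: "\u00E6", 0xCF: "\u00DF"}
COMBINING = {
    0xE1: "\u0300",  # grave accent
    0xE2: "\u0301",  # acute accent
    0xE3: "\u0302",  # circumflex accent
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # diaeresis (umlaut)
    0xF0: "\u0327",  # cedilla
}


def ansel_to_unicode(data: bytes) -> str:
    """Decode ANSEL bytes, moving each combining mark after its base."""
    out = []
    pending = []  # combining marks waiting for the base character
    for byte in data:
        if byte in COMBINING:
            pending.append(COMBINING[byte])
            continue
        if byte < 0x80:
            out.append(chr(byte))  # ASCII range maps directly
        else:
            out.append(SPACING.get(byte, "\ufffd"))  # unmapped: U+FFFD
        out.extend(pending)  # marks now follow their base character
        pending.clear()
    out.extend(pending)  # marks with no base are kept at the end
    # Compose base + mark into single code points where Unicode has them.
    return unicodedata.normalize("NFC", "".join(out))
```

For example, the ANSEL bytes `6E 69 E4 6E 6F` place the tilde (E4) before the second "n"; the sketch emits "n" followed by U+0303, which NFC normalization composes into "ñ", giving "niño". When several marks precede one base character, the sketch keeps them in their original order, which is a simplification.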