Overview

A GEDCOM file includes a CHAR tag in the HEAD section that indicates the character encoding used in the file. The valid options as of GEDCOM v5.5.1 are ANSEL, ASCII, UNICODE, and UTF-8. Unfortunately, some programs specify the wrong character encoding, some use an invalid option such as ANSI, and some files are edited by text editors that change the character encoding of the file without changing the CHAR value. For those reasons, Gedcom Publisher detects the character encoding using both the CHAR value and character encoding detection techniques.

Gedcom Publisher first opens the GEDCOM file to check whether it begins with a Unicode "byte order mark" (BOM). If it finds a BOM, it ignores the CHAR value and reads the file using the encoding associated with that BOM.
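
As an illustration only, a BOM check of this kind can be sketched in Python. The function name detect_bom and the choice of returning Python codec names are assumptions made for this example; they do not describe Gedcom Publisher's actual implementation.

```python
import codecs
from typing import Optional

# Longer BOMs are listed first because the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return a Python codec name if data starts with a Unicode BOM, else None.

    Hypothetical helper: the returned codecs ("utf-8-sig", "utf-16",
    "utf-32") consume the BOM themselves when decoding.
    """
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec
    return None
```

Only the first few bytes of the file are needed, so a caller could pass the result of reading four bytes in binary mode.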

Gedcom Publisher then re-opens the GEDCOM file to read the CHAR tag and several other GEDCOM tags from the HEAD section. If it did not find a Unicode BOM, it reads the file as ASCII for this tag preview.
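
A tag preview along these lines might look like the following Python sketch. The helper name read_char_value is hypothetical, and decoding with errors="replace" is an assumption chosen so that non-ASCII bytes elsewhere in the HEAD section do not stop the scan.

```python
from typing import Optional


def read_char_value(data: bytes, codec: str = "ascii") -> Optional[str]:
    """Return the value of the level-1 CHAR tag in the HEAD record, or None.

    Hypothetical helper. Pass the BOM codec if one was detected;
    otherwise the bytes are read as ASCII.
    """
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    text = data.decode(codec, errors="replace")
    in_head = False
    for line in text.splitlines():
        parts = line.strip().split(None, 2)  # level, tag, optional value
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1].upper()
        if level == "0":
            if in_head:
                break  # the next level-0 record ends the HEAD section
            in_head = tag == "HEAD"
        elif in_head and level == "1" and tag == "CHAR":
            return parts[2].strip() if len(parts) > 2 else None
    return None
```

The scan stops at the first level-0 record after HEAD, so a CHAR line that appears later in the file is ignored.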

After reading the CHAR tag, if Gedcom Publisher did not find a Unicode BOM, Gedcom Publisher chooses an encoding based on the CHAR tag value.

If your genealogy program supports writing a GEDCOM file using the UTF-8 encoding, choose that option for the best results with Gedcom Publisher.

When the character encoding in the GEDCOM file is set to "ASCII", Gedcom Publisher will accept characters in the Windows-1252 encoding. Windows-1252 is a superset of ASCII.
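
The selection rules above can be summarized in a short Python sketch. The names CHAR_TO_CODEC and choose_codec are hypothetical. Treating the invalid value ANSI and any unrecognized value as Windows-1252 is an assumption made for this example, and "ansel" is only a placeholder, because Python does not ship an ANSEL codec.

```python
from typing import Optional

# CHAR value -> Python codec name.
CHAR_TO_CODEC = {
    "UTF-8": "utf-8",
    "UNICODE": "utf-16",
    "ASCII": "cp1252",  # per the text: "ASCII" files may hold Windows-1252
    "ANSI": "cp1252",   # assumption: invalid value treated as Windows-1252
    "ANSEL": "ansel",   # placeholder: needs a custom decoder
}


def choose_codec(bom_codec: Optional[str], char_value: Optional[str]) -> str:
    """Pick a codec name: a detected BOM wins, otherwise use the CHAR value."""
    if bom_codec is not None:
        return bom_codec
    key = (char_value or "").strip().upper()
    # Falling back to Windows-1252 for unknown values is an assumption.
    return CHAR_TO_CODEC.get(key, "cp1252")
```

Because Windows-1252 is a superset of ASCII, a pure ASCII file decodes identically under either codec, which is why mapping "ASCII" to cp1252 loses nothing.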

Challenges

Unfortunately, there are character encoding issues that make it difficult or impossible to detect the encoding automatically.

GEDCOM file specifies UTF-8, but file is not UTF-8

If a GEDCOM file self-identifies as UTF-8 by including a 1 CHAR UTF-8 record, Gedcom Publisher may or may not be able to detect that the file is actually in some other encoding. For example, if the file is actually a Windows text file encoded as Windows-1252, a common text file encoding on PCs running MS Windows, then Gedcom Publisher cannot tell that the file is not UTF-8. Non-accented characters will display correctly in the resulting book, but many accented characters will not. The solution is to change the Database.CHAR Value property to "ASCII", which makes Gedcom Publisher accept Windows-1252 characters.
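
The effect on accented characters can be reproduced with a few lines of Python. This only demonstrates how Python's own decoders treat such bytes; the sample name is invented for the example, and nothing here describes Gedcom Publisher's internals.

```python
# A name saved by a Windows text editor using Windows-1252.
raw = "José Núñez".encode("cp1252")

# Reading those bytes as UTF-8: the ASCII letters survive, but the accented
# letters are not valid UTF-8 and are replaced with U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

# Reading the same bytes as Windows-1252 recovers the original text.
as_cp1252 = raw.decode("cp1252")

print(as_utf8)
print(as_cp1252)
```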

ANSEL

ANSEL was an ANSI standard used to encode text, but as of 14 February 2013, the standard has been withdrawn. The Family History Department of the Church of Jesus Christ of Latter-day Saints recommended an extended version of ANSEL for use in GEDCOM files.

Fortunately, modern genealogy software programs are not limited to writing ANSEL-encoded GEDCOM files, and you should configure your software to write in another format, preferably UTF-8. Still, Gedcom Publisher supports ANSEL so that it can process files that use that encoding.

The tables below describe how certain ANSEL code points map to Unicode. These tables were constructed based on information from these sources:

  1. The GEDCOM Standard, Draft Release 5.5.1, "Appendix C, ANSEL Character Set"
  2. The GEDCOM Standard, Release 5.5, "Appendix D, ANSEL Character Set"
  3. The GEDCOM Standard, Release 4.0, "Chapter 6, Specification for GEDCOM Character Sets"
  4. The Character Name Index on the Unicode Consortium website
  5. GEDCOM ANSEL Table by Tamura Jones, especially for corrections and examples

The GEDCOM standards listed above were prepared by the Family History Department of the Church of Jesus Christ of Latter-day Saints.

Spacing Characters

| Hex | Decimal | Unicode | Graphic | Name | Example |
|---|---|---|---|---|---|
| A1 | 161 | U+0141 | Ł | capital L with stroke | Łódź |
| A2 | 162 | U+00D8 | Ø | capital O with stroke | Øst |
| A3 | 163 | U+0110 | Đ | capital D with stroke | Đuro |
| A4 | 164 | U+00DE | Þ | capital thorn | Þann |
| A5 | 165 | U+00C6 | Æ | capital AE | Ægir |
| A6 | 166 | U+0152 | Œ | capital ligature OE | Œuvre |
| A7 | 167 | U+02B9 | ʹ | modifier letter prime | fakulʹtet |
| A8 | 168 | U+00B7 | · | middle dot | novel·la |
| A9 | 169 | U+266D | ♭ | music flat sign | B♭ |
| AA | 170 | U+00AE | ® | registered sign | Kleenex ® |
| AB | 171 | U+00B1 | ± | plus-minus sign | 1910±2 |
| AC | 172 | U+01A0 | Ơ | hook O, uppercase | |
| AD | 173 | U+01AF | Ư | hook U, uppercase | XƯA |
| AE | 174 | U+02BE | ◌ʾ | right half ring (alif) | Unʾyusho |
| B0 | 176 | U+02BF | ◌ʿ | left half ring (ayn) | faʿil |
| B1 | 177 | U+0142 | ł | small l with stroke | rozbił |
| B2 | 178 | U+00F8 | ø | small o with stroke | høj |
| B3 | 179 | U+0111 | đ | small d with stroke | đavola |
| B4 | 180 | U+00FE | þ | small thorn | þann |
| B5 | 181 | U+00E6 | æ | small ae | skæg |
| B6 | 182 | U+0153 | œ | small ligature oe | œuvre |
| B7 | 183 | U+02BA | ʺ | modifier letter double prime | obʺi︠a︡vlenie |
| B8 | 184 | U+0131 | ı | small dotless i | masalı |
| B9 | 185 | U+00A3 | £ | pound sign | £5.00 |
| BA | 186 | U+00F0 | ð | small eth | verður |
| BC | 188 | U+01A1 | ơ | hook o, lowercase | |
| BD | 189 | U+01B0 | ư | hook u, lowercase | |
| BE | 190 | U+25A1 | □ | white square (LDS Extension) | |
| BF | 191 | U+25A0 | ■ | black square (LDS Extension) | |
| C0 | 192 | U+00B0 | ° | degree sign | 98.6° |
| C1 | 193 | U+2113 | ℓ | script small L | 2.0ℓ |
| C2 | 194 | U+2117 | ℗ | sound recording copyright | Parlophone℗ |
| C3 | 195 | U+00A9 | © | copyright sign | ©1993 |
| C4 | 196 | U+266F | ♯ | music sharp sign | D♯ |
| C5 | 197 | U+00BF | ¿ | inverted question mark | ¿Qué? |
| C6 | 198 | U+00A1 | ¡ | inverted exclamation mark | ¡Esta! |
| CD | 205 | | e | e in middle of line (LDS Extension) | e |
| CE | 206 | | o | o in middle of line (LDS Extension) | o |
| CF | 207 | U+00DF | ß | small sharp s | Preußen |

Gedcom Publisher does not support LDS extensions "e in middle of line" or "o in middle of line". They are converted to "e" and "o", respectively.
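
To show how such a table is applied, here is a minimal Python sketch that decodes ANSEL spacing characters. It covers only a handful of the rows above, and the names ANSEL_SPACING and decode_spacing are hypothetical. Replacing unmapped bytes with U+FFFD is an assumption made for the example.

```python
# A small excerpt of the spacing-character table: ANSEL byte -> Unicode.
ANSEL_SPACING = {
    0xA1: "\u0141",  # capital L with stroke
    0xA2: "\u00D8",  # capital O with stroke
    0xB1: "\u0142",  # small l with stroke
    0xB2: "\u00F8",  # small o with stroke
    0xB5: "\u00E6",  # small ae
    0xCF: "\u00DF",  # small sharp s
    0xCD: "e",       # LDS extension, converted to a plain "e"
    0xCE: "o",       # LDS extension, converted to a plain "o"
}


def decode_spacing(data: bytes) -> str:
    """Decode ASCII plus the spacing characters listed in ANSEL_SPACING."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))  # ASCII bytes map to themselves
        else:
            # Bytes missing from this excerpt become U+FFFD (assumption).
            out.append(ANSEL_SPACING.get(b, "\ufffd"))
    return "".join(out)
```

For example, decode_spacing(b"Preu\xcfen") yields "Preußen". Combining characters need separate handling because they change position during conversion.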

Combining (non-spacing) Characters

ANSEL includes combining characters that modify the following character (see Note 1). In the table below, the Graphic column shows the combining character modifying a dotted circle ◌ (U+25CC).

| Hex | Decimal | Unicode | Graphic | Name | Example |
|---|---|---|---|---|---|
| E0 | 224 | U+0309 | ◌̉ | hook above | củi |
| E1 | 225 | U+0300 | ◌̀ | grave accent | règle |
| E2 | 226 | U+0301 | ◌́ | acute accent | está |
| E3 | 227 | U+0302 | ◌̂ | circumflex accent | même |
| E4 | 228 | U+0303 | ◌̃ | tilde | niño |
| E5 | 229 | U+0304 | ◌̄ | macron | gājājs |
| E6 | 230 | U+0306 | ◌̆ | breve | altă |
| E7 | 231 | U+0307 | ◌̇ | dot above | żaba |
| E8 | 232 | U+0308 | ◌̈ | diaeresis (umlaut) | öppna |
| E9 | 233 | U+030C | ◌̌ | caron (hacek) | vždy |
| EA | 234 | U+030A | ◌̊ | ring above (angstrom) | hår |
| EB | 235 | U+FE20 | ◌︠ | ligature, left-half | akademii︠a︡ |
| EC | 236 | U+FE21 | ◌︡ | ligature, right-half | akademii︠a︡ |
| ED | 237 | U+0315 | ◌̕ | comma above right | rozdel̕ ovac |
| EE | 238 | U+030B | ◌̋ | double acute accent | időszaki |
| EF | 239 | U+0310 | ◌̐ | candrabindu | Alii̐ev |
| F0 | 240 | U+0327 | ◌̧ | cedilla | ça |
| F1 | 241 | U+0328 | ◌̨ | ogonek (nasal hook) | vietą |
| F2 | 242 | U+0323 | ◌̣ | dot below | teḍa |
| F3 | 243 | U+0324 | ◌̤ | double dot below | k̲h̲ut̤bah |
| F4 | 244 | U+0325 | ◌̥ | circle below | Samskr̥ta |
| F5 | 245 | U+0333 | ◌̳ | double underscore | G̳hulam |
| F6 | 246 | U+0332 | ◌̲ | underscore | s̲amar |
| F7 | 247 | U+0326 | ◌̦ | left hook | dārzin̦a |
| F8 | 248 | U+031C | ◌̜ | right cedilla | kho̜ng |
| F9 | 249 | U+032E | ◌̮ | breve below | ḫumantus̆ |
| FA | 250 | U+FE22 | ◌︢ | double tilde, left half | n︢g︣alan |
| FB | 251 | U+FE23 | ◌︣ | double tilde, right half | n︢g︣alan |
| FC | 252 | U+0338 | ◌̸ | long solidus (slash) overlay (LDS Extension) | |
| FE | 254 | U+0313 | ◌̓ | comma above | ge̓otermika |
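
Because ANSEL places a combining character before its base character and Unicode places it after, a decoder has to reorder them. The Python sketch below illustrates one way to do that using a few rows from the table above. The names ANSEL_COMBINING and decode_with_combining are hypothetical, and the final NFC normalization step is an optional addition for this example rather than something the sources require.

```python
import unicodedata

# A small excerpt of the combining-character table: ANSEL byte -> Unicode.
ANSEL_COMBINING = {
    0xE1: "\u0300",  # grave accent
    0xE2: "\u0301",  # acute accent
    0xE3: "\u0302",  # circumflex accent
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # diaeresis (umlaut)
    0xF0: "\u0327",  # cedilla
}


def decode_with_combining(data: bytes) -> str:
    """Decode ASCII plus the marks above, moving each mark after its base."""
    out = []
    pending = []  # combining marks waiting for their base character
    for b in data:
        if b in ANSEL_COMBINING:
            pending.append(ANSEL_COMBINING[b])
        else:
            # Only ASCII bases are handled in this sketch.
            out.append(chr(b) if b < 0x80 else "\ufffd")
            out.extend(pending)  # Unicode order: base first, then marks
            pending.clear()
    out.extend(pending)  # stray marks at the end of input are kept as-is
    # Optional: compose base + mark into a single code point where one exists.
    return unicodedata.normalize("NFC", "".join(out))
```

For example, the ANSEL bytes b"ni\xe4no" (tilde before the second "n") decode to "niño".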

Please note that per the Unicode Standard, Version 7.0, Chapter 3, D52, "The graphic positioning of a combining character depends on the last preceding base character, unless they are separated by a character that is neither a combining character nor either zero width joiner or zero width non-joiner."

In the table above, the position of the cedilla may be different when it applies to the dotted circle ◌̧ compared to where it appears under a "c" in "ça". In this document, the behavior will vary based on your browser's font choices for "serif" and "sans-serif", and also on your browser's text layout software.

The slash overlay character does not seem to be positioned properly when applied to the dotted circle or the digit zero. I tried multiple fonts and multiple base characters and all combinations produced similar results.

Notes

  1. In the ANSEL encoding, combining characters precede the character they modify. In Unicode, combining characters follow the character they modify.