Title: 6th Annual Hong Kong Innovative Users Group Meeting, 8-9 December 2005, Hong Kong
1. HKIUG's Unicode Projects: Untangling the Chaotic Codes
- Philip Wong, City University of Hong Kong Library
- K.T. Lam, Hong Kong University of Science and Technology Library
2. Content
- Chaos in 2003
- Collaborative effort at HKIUG
- HKIUG CJK Code Table
- TSVCC linking
- Towards native Unicode catalog
3. Chaos in 2003
- Local libraries were using the Big5 Chinese character encoding system
- INNOPAC was in transition towards Unicode support, with the development of the Millennium software
- Dual Web OPAC interfaces existed: Big5 and UTF-8 (Unicode)
- Some libraries (HKUST and CUHK) began releasing the UTF-8 Web OPAC to their users
4. Chaos in 2003 (cont.)
- INNOPAC's EACC to Unicode mapping was problematic
  - multiple mappings
  - incorrect mappings
  - missing codes
  - duplicated EACC and CCCII
  - mapping to different EACCs in the Big5 and UTF-8 interfaces
5. Chaos in 2003 (cont.)
- CJK support in the Millennium software was buggy
  - Millennium Editor involuntarily replaced characters with the preferred EACC
- Individual libraries communicated with the vendor
  - Not fruitful: fixes came in piecemeal fashion
- Some libraries conducted their own CJK / Unicode studies and attempted to propose to the vendor how to tackle these problems, again without much progress
  - HKUST (April 2003)
  - City University of Hong Kong (July 2003)
6. Collaboration Effort at HKIUG
- June 2003: HKIUG Standing Committee agreed that a joint proposal was essential for gaining acceptance from the vendor
- July 2003: seminar organized by CUHK to solicit ideas and comments
- July 2003: III-UTF-8 Working Group established; members consisted of catalogers and systems librarians from CityU, CUHK, HKUST and HKU
7. Collaboration Effort at HKIUG (cont.)
- Sep 2003: Working Group completed the study and submitted the proposal to the vendor, together with an HKIUG version of the EACC to Unicode Mapping Table
- Oct 2003: vendor accepted the proposal
- Dec 2003: presentation of the work at the 4th Annual HKIUG Meeting
- Jan 2004: HKUST representative was invited to the vendor's headquarters to help resolve outstanding CJK issues
8. Collaboration Effort at HKIUG (cont.)
- Results of the HKIUG effort, by February 2004
  - Millennium Editor problem fixed
  - HKIUG Code Table for CJK Characters adopted
  - Development of TSVCC Linking began
- 25 February 2005: HKIUG Unicode Task Force established to maintain the Unicode and TSVCC code tables and to assist the vendor on Unicode migration; members from CUHK, CityU, HKUST and HKU
9. Millennium Editor Problem
- The EACC<->Unicode Mapping Table failed in round-trip crosswalk.
10. Millennium Editor Problem (cont.)
- Problem: EACC character 274349 in the INNOPAC catalog would be incorrectly replaced by 27462A when it was saved in Millennium Editor
- Fixed by suppressing Millennium Editor from converting 274349 (i.e. the non-preferred code in a multi-mapping) to U+5386 when it was retrieved from the catalog for editing
  - Done by using a one-to-one mapping table
11. Millennium Editor Problem (cont.)
- Side effect
  - The affected character is displayed as a braced code, not as a character, in the Editor
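The round-trip failure and the one-to-one fix described on slides 9 to 11 can be sketched as follows. This is only an illustration, not the vendor's implementation: the two EACC codes and U+5386 come from the slides, while the table layout, the function names and the `{274349}` braced-code rendering are assumptions made for this example.

```python
# Hypothetical excerpt of an EACC -> Unicode table with one multi-mapping:
# both EACC codes map to U+5386, and 27462A is the preferred code.
EACC_TO_UNICODE = {
    "27462A": 0x5386,  # preferred code
    "274349": 0x5386,  # non-preferred code
}
PREFERRED = {0x5386: "27462A"}


def round_trip(eacc: str) -> str:
    """EACC -> Unicode -> EACC through the many-to-one table."""
    return PREFERRED[EACC_TO_UNICODE[eacc]]


# One-to-one table: non-preferred codes are simply left out.
ONE_TO_ONE = {
    eacc: cp for eacc, cp in EACC_TO_UNICODE.items() if PREFERRED[cp] == eacc
}


def to_editor(eacc: str) -> str:
    """Return the character, or a braced code when no one-to-one mapping exists."""
    cp = ONE_TO_ONE.get(eacc)
    return chr(cp) if cp is not None else "{" + eacc + "}"


def from_editor(text: str) -> str:
    """Convert editor text back to EACC; braced codes pass through unchanged."""
    if text.startswith("{") and text.endswith("}"):
        return text[1:-1]
    return PREFERRED[ord(text)]
```

With the many-to-one table, `round_trip("274349")` comes back as the preferred code, which is the silent replacement reported on slide 10. With the one-to-one table the non-preferred code survives editing, at the cost of being shown as a braced code rather than a character.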
13. HKIUG CJK Code Table
- First released in September 2003; last revised in August 2005
- Contains
  - 15672 EACC characters
  - 7043 pure CCCII characters
  - 160 multi-mapping linked cases
  - 49 multi-mapping unlinked cases
14. HKIUG CJK Code Table (cont.)
- Mapping for EACC characters follows LC as much as possible
- Does not contain CCCII characters that have an EACC equivalent; sites adopting the HKIUG CJK code table must convert these CCCII characters in their catalog to the EACC equivalents
- Contains 7043 pure CCCII characters that have no EACC equivalent; they are included to avoid too many missing characters
15. HKIUG CJK Code Table (cont.)
- Multiple mappings
  - Linked case: ling
  - Unlinked case: li
- HKIUG decides on the preferences
16. HKIUG CJK Code Table (cont.)
- Also available in XML format, conforming to LC's code tables schema
- Implementation
  - November 2003: pilot testing at HKUST
  - February 2004: CUHK
  - July 2004: PolyU
  - October 2004: CityU, HKU
  - November 2004: LU, HKBU
  - March 2005: HKIED
  - December 2005: HKAPA (scheduled)
17. TSVCC Linking
- TSVCC stands for Traditional, Simplified and Variant Chinese Characters.
- Example: guo
  - 國 (U+570B): traditional form of "country"
  - 国 (U+56FD): simplified form of "country"
  - 囯 (U+56EF): variant form of "country" (used in Japanese)
- Example: xi
  - 係 (U+4FC2): traditional form of "relationship"
  - 繫 (U+7E6B): traditional form of "linking"
  - 系 (U+7CFB): traditional form of "system", simplified form of "relationship", and simplified form of "linking"
- Why TSVCC?
18. TSVCC Linking (cont.)
- In EACC, traditional, simplified and variant characters can be linked by internal codes
  - gan: ? (21304C) linked to ? (27304C)
  - feng: ? (213B78) linked to ? (2D3B78) and ? (393B78)
- However, some multi-mapping cases remain unlinked
  - gan: ? (27304C) not linked to ? (273C67)
  - li: ? (274349) not linked to ? (27462A)
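The examples on this slide suggest that linked EACC codes share their last four hex digits and differ only in the first two. Taking that reading as an assumption, a minimal check for whether two codes are linked could look like this; the function name is made up for the illustration.

```python
def eacc_linked(a: str, b: str) -> bool:
    """Treat two 6-hex-digit EACC codes as linked when their last four digits match.

    Assumption for this sketch: the first two digits select the form
    (traditional, simplified, variant) and the last four identify the character.
    """
    a, b = a.strip().upper(), b.strip().upper()
    return a != b and a[2:] == b[2:]


# Linked cases from the slide
gan_linked = eacc_linked("21304C", "27304C")
feng_linked = eacc_linked("213B78", "393B78")

# Unlinked multi-mapping cases from the slide
gan_unlinked = eacc_linked("27304C", "273C67")
li_unlinked = eacc_linked("274349", "27462A")
```

Under this reading the two "li" codes fail the check even though both map to the same Unicode character, which is why such multi-mapping cases need a separate linking table.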
19. TSVCC Linking (cont.)
Consider the following multi-mapping case:
EACC     Character                   Unicode
27462A   历 (simplified form of ?)   U+5386 历
274349   历 (simplified form of ?)   U+5386 历
Searching ?? (27462A)(21472A) will not retrieve ?? (2D4349)(21472A)
20. TSVCC Linking (cont.)
- Native Unicode catalog: all internal linkings will be gone
  - 乾 (U+4E7E), 干 (U+5E72)
  - 峰 (U+5CF0), 峯 (U+5CEF), ? (U+5CC4)
  - 历 (U+5386), 曆 (U+66C6), 歷 (U+6B77)
- How to maintain the linkings?
21. TSVCC Linking (cont.)
- In October 2004, HKIUG constructed the TSVCC Linking Tables and proposed them to the vendor
- Table M: linking relationship is not purely from EACC
  - 214349 ?, 274349 ?, 2D4349 ?, 21462A ?, 27462A ?, 4B462A ? (U+5386 is multi-mapped to 27462A and 274349)
- Table V: linking relationship is purely from EACC
  - 21306C ?, 2D306C ?, 33306C ?, 4B306C ?
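In a native Unicode catalog, one way to keep such linkings is to fold every member of a link group to a single representative before indexing and before searching. The sketch below illustrates that idea only; it is not the format of the HKIUG tables or the vendor's implementation. The code points are the ones listed on slide 20, and the choice of the first member as representative, like the names `LINK_GROUPS`, `tsvcc_key` and `search`, is an assumption.

```python
# Link groups built from the code points listed on slide 20. Folding each
# group to its first member is an arbitrary choice made for this sketch.
LINK_GROUPS = [
    [0x4E7E, 0x5E72],
    [0x5CF0, 0x5CEF, 0x5CC4],
    [0x5386, 0x66C6, 0x6B77],
]

# Translation table usable with str.translate: code point -> representative.
FOLD = {cp: group[0] for group in LINK_GROUPS for cp in group}


def tsvcc_key(text: str) -> str:
    """Replace every linked character with its group representative."""
    return text.translate(FOLD)


def search(query: str, headings: list) -> list:
    """Return headings whose folded form contains the folded query."""
    key = tsvcc_key(query)
    return [h for h in headings if key in tsvcc_key(h)]
```

A query typed with U+66C6 then retrieves headings stored with U+5386 or U+6B77 as well, because all three fold to the same key.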
25. TSVCC Linking (cont.)
- Implementation
  - October 2004: TSVCC Tables created and installed on HKUST's testing database
  - November 2004: endorsed by HKIUG, first release
  - November 2004: TSVCC linking capability enabled at CityU and HKU (using the vendor's original tables, i.e. not HKIUG's version)
  - Lingnan uninstalled it after a short trial period due to the high recall rate
  - August 2005: HKIUG second release
  - November 2005: CityU installed the second release
26. TSVCC Linking (cont.)
- HKALL has also enabled the TSVCC Linking feature, but using hybrid EACC/Unicode tables (using normalized EACC values to maintain default ordering for CJK)
- Drawback: Unicode is a much bigger set than EACC, and the legacy EACC mappings still need to be maintained
- The vendor should put in programming effort to support a Unicode version of the TSVCC tables.
27. TSVCC Linking (cont.)
- Results of implementation
  - Improvement in searching
  - Trade-off: higher recall, lower precision
28. TSVCC Linking (cont.)
- Results: improvement in searching
Search: ?? (li fa)
29. TSVCC Linking (cont.)
- Results: higher recall, lower precision
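The trade-off can be made concrete with the usual definitions: precision is the share of retrieved records that are relevant, and recall is the share of relevant records that are retrieved. The record identifiers and counts below are invented purely for illustration and are not results from any of the libraries mentioned.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for one search result set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four records are truly relevant to the query.
relevant = {"r1", "r2", "r3", "r4"}

# Without linking, only records using the exact character form are found.
exact_match = {"r1", "r2"}

# With linking, all forms are found, plus records that merely share a
# linked character and are not about the same thing.
linked_match = {"r1", "r2", "r3", "r4", "x1", "x2", "x3", "x4"}

exact_scores = precision_recall(exact_match, relevant)
linked_scores = precision_recall(linked_match, relevant)
```

In this made-up case the exact search has perfect precision but finds only half of the relevant records, while the linked search finds all of them with half of its results being noise.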
30. TSVCC Linking (cont.)
- Problems found during testing and implementation
  - They are not problems of TSVCC itself, but software problems that require software enhancement from the vendor
31. TSVCC Linking (cont.)
- Problem 1
  - Incorrect "duplicate headings" error in authority heading verification
      Duplicate authority RECORDS 02-11-04
      33 > FIELD 100 1 a???
      INDEXED AS AUTHOR ???
      MESSAGE --------------- DUPLICATE AUTHORITY ----------
      FROM a1525012x
  - ??? and ??? are actually two different authors
  - ? 21303A and ? 33303A are linked EACC codes, but this problem does not happen in non-TSVCC indexing
32. TSVCC Linking (cont.)
- Problem 2
  - Interfiling of indexed characters becomes worse under TSVCC as recall gets higher. Ideally, indexing and sorting should be separated.
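Separating indexing from sorting means matching on a folded key while ordering the results by a key derived from the original characters. The sketch below shows the difference under simplifying assumptions: the fold table covers a single link group from the earlier slides, and plain code-point order stands in for whatever CJK filing order a real catalog would use.

```python
# Fold table for one link group (U+5386, U+66C6, U+6B77), for illustration only.
FOLD = {0x66C6: 0x5386, 0x6B77: 0x5386}


def index_key(heading: str) -> str:
    """Key used for matching: linked characters are folded together."""
    return heading.translate(FOLD)


def sort_key(heading: str) -> str:
    """Key used for display order: the original characters are kept.

    Code-point order is a stand-in here, not a real CJK collation.
    """
    return heading


headings = [chr(0x6B77) + "b", chr(0x5386) + "c", chr(0x66C6) + "a"]

# All three headings match a search on U+5386 through the folded index key.
matches = [h for h in headings if index_key(h).startswith(chr(0x5386))]

# Sorting on the index key interfiles the three forms...
interfiled = sorted(matches, key=index_key)

# ...while sorting on a separate key keeps each form grouped on its own.
separated = sorted(matches, key=sort_key)
```

Sorting on the index key mixes the three forms according to what follows them, whereas the separate sort key keeps headings that begin with the same original character together.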
33. Towards Native Unicode Catalog
- How far are we?
  - LC has issued MARC-8 to Unicode mapping tables
  - OCLC Connexion client 1.5 begins to support MARC record import and export in UTF-8 encoding
  - Intensive discussion of Unicode implementation in MARC on the UNICODE-MARC Discussion List (UNICODE-MARC_at_loc.gov)
  - Most ILS vendors claim to support Unicode
34. Towards Native Unicode Catalog (cont.)
- INNOPAC is almost there, but not fully ready yet.
  - There is an option for sites to convert their catalogs to Unicode (e.g. HKALL did so in Oct 2004)
  - It was noted from the HKALL catalog that the implementation of Unicode is only partially completed: there are still EACC dependencies in the data store and indexes
  - INNOPAC/Millennium does not yet support exporting and importing records in UTF-8
  - CJK searching and sorting require more work
35. Towards Native Unicode Catalog (cont.)
- Bibliographic data interchange involves multiple partners.
36. Towards Native Unicode Catalog (cont.)
- The failure of round-trip crosswalk between systems will continue to be a problem until all systems are capable of importing and exporting data in Unicode and no one is interchanging MARC records in non-Unicode encodings
37. Thank You!
- Contact Information
  - Philip Wong: lbphilip_at_cityu.edu.hk
  - K.T. Lam: lblkt_at_ust.hk