6th Annual Hong Kong Innovative Users Group Meeting, 8-9 December 2005, Hong Kong (HKIUG)

1
6th Annual Hong Kong Innovative Users Group Meeting
8-9 December 2005, Hong Kong
HKIUG's Unicode Projects: Untangling the Chaotic Codes
  • Philip Wong, City University of Hong Kong Library
  • K.T. Lam, Hong Kong University of Science and
    Technology Library

2
Content
  • Chaos in 2003
  • Collaborative effort at HKIUG
  • HKIUG CJK Code Table
  • TSVCC linking
  • Towards native Unicode catalog

3
Chaos in 2003
  • Local libraries were using the Big5 Chinese character
    encoding system
  • INNOPAC was in transition towards Unicode
    support, with the development of the Millennium
    software
  • Dual Web OPAC interfaces existed: Big5 and UTF-8
    (Unicode)
  • Some libraries (HKUST and CUHK) began releasing
    UTF-8 Web OPACs to their users

4
Chaos in 2003 cont.
  • INNOPAC's EACC-to-Unicode mapping was problematic:
  • multiple mappings
  • incorrect mappings
  • missing codes
  • duplicated EACC and CCCII
  • mapping to different EACCs in the Big5 and UTF-8
    interfaces

5
Chaos in 2003 cont.
  • CJK support in the Millennium software was buggy
  • Millennium Editor involuntarily replaced
    characters with the preferred EACC
  • Individual libraries communicated with the vendor
  • Not fruitful: fixes were made in piecemeal fashion
  • Some libraries conducted their own CJK / Unicode
    studies and proposed to the vendor how
    to tackle these problems, again without much
    progress:
  • HKUST (April 2003)
  • City University of Hong Kong (July 2003)

6
Collaboration Effort at HKIUG
  • June 2003: HKIUG Standing Committee agreed that
    a joint proposal was essential for gaining
    acceptance from the vendor
  • July 2003: seminar organized by CUHK to solicit
    ideas and comments
  • July 2003: III-UTF-8 Working Group established;
    members consisted of catalogers and systems
    librarians from CITYU, CUHK, HKUST and HKU

7
Collaboration Effort at HKIUG cont.
  • Sep 2003: Working Group completed the study and
    submitted the proposal to the vendor together
    with an HKIUG version of the EACC-to-Unicode
    Mapping Table
  • Oct 2003: vendor accepted the proposal
  • Dec 2003: presentation of the work at the 4th Annual
    HKIUG Meeting
  • Jan 2004: HKUST representative was invited to the
    vendor's headquarters to help resolve outstanding
    CJK issues

8
Collaboration Effort at HKIUG cont.
  • Results of the HKIUG effort, by February 2004:
  • Millennium Editor problem fixed
  • HKIUG Code Table for CJK Characters adopted
  • Began development of TSVCC Linking
  • 25 February 2005: established the HKIUG Unicode Task
    Force to maintain the Unicode and TSVCC code
    tables and to assist the vendor on Unicode
    migration; members from CUHK, CITYU, HKUST and
    HKU.

9
Millennium Editor Problem
  • EACC <-> Unicode Mapping Table failed in round-trip
    crosswalk.

10
Millennium Editor Problem cont.
  • Problem: EACC character 274349 in the INNOPAC catalog
    would be incorrectly replaced by 27462A when it
    was saved in Millennium Editor
  • Fixed by suppressing Millennium Editor from
    converting 274349 (i.e. the non-preferred code
    in a multi-mapping) to U+5386 when it was retrieved
    from the catalog for editing
  • This is done by using a one-to-one mapping table

11
Millennium Editor Problem cont.
  • Side effect:
  • The affected character is displayed as a
    braced code, not as a character, in the Editor
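
The round-trip failure and the one-to-one fix described on slides 9-11 can be sketched as follows. This is only an illustration, not the vendor's implementation: the two EACC codes and U+5386 come from the slides, while the table layout, the function names and the exact braced-code format `{274349}` are assumptions.

```python
# Many-to-one table: two EACC codes map to the same Unicode code point.
EACC_TO_UNICODE = {"27462A": 0x5386, "274349": 0x5386}
# The reverse table can keep only one EACC per code point: the preferred one.
UNICODE_TO_EACC = {0x5386: "27462A"}


def round_trip(eacc, forward):
    """Simulate loading an EACC code into the editor and saving it back."""
    code_point = forward.get(eacc)
    if code_point is None:
        return eacc  # never converted, so the stored code is left untouched
    return UNICODE_TO_EACC[code_point]


def display(eacc, forward):
    """Show the character, or a braced code when no conversion is available."""
    code_point = forward.get(eacc)
    return chr(code_point) if code_point is not None else "{" + eacc + "}"


# One-to-one table: keep only the codes that survive the reverse mapping.
ONE_TO_ONE = {
    eacc: cp for eacc, cp in EACC_TO_UNICODE.items() if UNICODE_TO_EACC[cp] == eacc
}

print(round_trip("274349", EACC_TO_UNICODE))  # 27462A (code silently replaced)
print(round_trip("274349", ONE_TO_ONE))       # 274349 (code preserved)
print(display("274349", ONE_TO_ONE))          # {274349} (the side effect)
```

With the one-to-one table the non-preferred code is never converted, so it cannot be overwritten on save, at the cost of being shown as a code rather than a character.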

13
HKIUG CJK Code Table
  • First released in September 2003; last revised in
    August 2005
  • Contains:
  • 15672 EACC characters
  • 7043 pure CCCII characters
  • 160 multi-mapping linked cases
  • 49 multi-mapping unlinked cases

14
HKIUG CJK Code Table cont.
  • Mapping for EACC characters follows LC as much
    as possible
  • Does not contain CCCII characters that have an EACC
    equivalent - sites adopting the HKIUG CJK code table
    must convert these CCCII characters in their catalog
    to the EACC equivalents
  • Contains 7043 pure CCCII characters that have no EACC
    equivalent - they are included to avoid too many
    missing characters

15
HKIUG CJK Code Table cont.
  • Multiple mappings:
  • Linked case: ling
  • Unlinked case: li
  • HKIUG decides on the preferences

16
HKIUG CJK Code Table cont.
  • Also available in XML format, conforming to LC's
    code tables schema
  • Implementation:
  • November 2003: pilot testing at HKUST
  • February 2004: CUHK
  • July 2004: PolyU
  • October 2004: CityU, HKU
  • November 2004: LU, HKBU
  • March 2005: HKIED
  • December 2005: HKAPA (scheduled)

17
TSVCC Linking
  • TSVCC stands for Traditional, Simplified and
    Variant Chinese Characters.
  • Example: guo
  • 國 (U+570B): traditional form of "country"
  • 国 (U+56FD): simplified form of "country"
  • 囯 (U+56EF): variant form of "country" (used in
    Japanese)
  • Example: xi
  • 係 (U+4FC2): traditional form of "relationship"
  • 繫 (U+7E6B): traditional form of "linking"
  • 系 (U+7CFB): traditional form of "system",
    simplified form of "relationship", and simplified
    form of "linking"
  • Why TSVCC?

18
TSVCC Linking cont.
  • In EACC, traditional, simplified and variant
    characters can be linked by internal codes:
  • gan: ? (21304C) linked to ? (27304C)
  • feng: ? (213B78) linked to ? (2D3B78) and ?
    (393B78)
  • However, some multi-mapping cases remain unlinked:
  • gan: ? (27304C) not linked to ? (273C67)
  • li: ? (274349) not linked to ? (27462A)
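
In the linked examples above, the codes differ only in their first two hexadecimal digits and share the last four (21304C / 27304C; 213B78 / 2D3B78 / 393B78). The sketch below tests for that pattern. Treating the last four digits as the link key is an assumption drawn from these examples rather than a statement of the full EACC rules, and the function names are hypothetical.

```python
def link_key(eacc: str) -> str:
    """Return the part of a 6-digit EACC code shared by linked forms."""
    code = eacc.strip().upper()
    if len(code) != 6:
        raise ValueError(f"expected a 6-digit EACC code, got {eacc!r}")
    return code[2:]  # drop the leading byte, keep the last four hex digits


def is_linked(a: str, b: str) -> bool:
    """True when two EACC codes share the same link key."""
    return link_key(a) == link_key(b)


# Linked cases from the slide
print(is_linked("21304C", "27304C"))  # True
print(is_linked("213B78", "393B78"))  # True
# Unlinked multi-mapping cases from the slide
print(is_linked("27304C", "273C67"))  # False
print(is_linked("274349", "27462A"))  # False
```

The two unlinked pairs are the cases where different traditional characters share one simplified form, so a key derived from the code alone cannot bring them together.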

19
TSVCC Linking cont.
Consider the following multi-mapping case:

EACC     Character                     Unicode
27462A   历 (simplified form of ?)     U+5386 历
274349   历 (simplified form of ?)     U+5386 历

Searching ?? (27462A)(21472A) will not retrieve
?? (2D4349)(21472A)

20
TSVCC Linking cont.
  • In a native Unicode catalog, all internal linkings
    will be gone:
  • 乾 (U+4E7E), 干 (U+5E72)
  • 峰 (U+5CF0), 峯 (U+5CEF), ? (U+5CC4)
  • 历 (U+5386), 曆 (U+66C6), 歷 (U+6B77)
  • How to maintain the linkings?

21
TSVCC Linking cont.
  • In October 2004, HKIUG constructed the TSVCC
    Linking Tables and proposed them to the vendor
  • Table M: linking relationship is not purely from
    EACC
  • 214349 ? 274349 ? 2D4349 ? 21462A ?
    27462A ? 4B462A ? (U+5386 is multi-mapped to
    27462A and 274349)
  • Table V: linking relationship is purely from
    EACC
  • 21306C ? 2D306C ? 33306C ? 4B306C ?
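
One way to picture how such linking tables could be applied at indexing time is to map every code in a group to a single representative before comparing headings. The groups below are the Table M and Table V rows shown on this slide, and 21472A is the second character of the search example on slide 19. The data layout, the choice of the first code as representative and the function names are assumptions, not the vendor's or HKIUG's actual table format.

```python
# Each inner list is one group of linked EACC codes (from the slide).
TABLE_M = [["214349", "274349", "2D4349", "21462A", "27462A", "4B462A"]]
TABLE_V = [["21306C", "2D306C", "33306C", "4B306C"]]


def build_lookup(*tables):
    """Map every code in a group to the group's first code."""
    lookup = {}
    for table in tables:
        for group in table:
            representative = group[0]
            for code in group:
                lookup[code] = representative
    return lookup


LOOKUP = build_lookup(TABLE_M, TABLE_V)


def index_key(codes):
    """Normalize a heading (a sequence of EACC codes) for matching."""
    return tuple(LOOKUP.get(code, code) for code in codes)


query = ["27462A", "21472A"]
heading = ["2D4349", "21472A"]
print(query == heading)                        # False: no match without linking
print(index_key(query) == index_key(heading))  # True: both normalize alike
```

Codes that appear in no group pass through unchanged, so headings without linked characters are indexed exactly as before.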

25
TSVCC Linking cont.
  • Implementation:
  • October 2004: created the TSVCC Tables;
    installed on HKUST's testing database
  • November 2004: endorsed by HKIUG, first release
  • November 2004: TSVCC linking capability was
    enabled at CityU and HKU (using the vendor's original
    tables, i.e. not HKIUG's version)
  • Lingnan uninstalled it after a short trial period
    due to the high recall rate
  • August 2005: HKIUG second release
  • November 2005: CityU installed the second release

26
TSVCC Linking cont.
  • HKALL has also enabled the TSVCC Linking feature,
    but using hybrid EACC/Unicode tables (using
    normalized EACC values to maintain default
    ordering for CJK)
  • Drawback: Unicode is a much bigger set than EACC
    and, again, the legacy EACC mappings need to be
    maintained
  • The vendor should put in programming effort to
    support a Unicode version of the TSVCC tables.

27
TSVCC Linking cont.
  • Results of implementation:
  • Improvement in searching
  • Trade-off: higher recall, lower precision
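
The trade-off can be made concrete with the standard definitions of precision and recall. The record identifiers and result sets below are entirely hypothetical and chosen only to show the direction of the effect; they are not measurements from any HKIUG catalog.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved record IDs."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical records: four are relevant to the user's query.
relevant = {"b1", "b2", "b3", "b4"}
# Exact-code matching finds only headings typed with the same character form.
exact_match = {"b1", "b2"}
# Linked matching also finds other forms, plus unrelated words that share them.
tsvcc_match = {"b1", "b2", "b3", "b4", "b5", "b6", "b7", "b8"}

print(precision_recall(exact_match, relevant))  # (1.0, 0.5)
print(precision_recall(tsvcc_match, relevant))  # (0.5, 1.0)
```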

28
TSVCC Linking cont.
  • Results: improvement in searching

Search: ?? (Li fa)
29
TSVCC Linking cont.
  • Results: higher recall, lower precision

30
TSVCC Linking cont.
  • Problems found during testing and implementation:
  • These are not problems with TSVCC itself, but
    software problems that require enhancement
    by the vendor

31
TSVCC Linking cont.
  • Problem 1:
  • Incorrect "duplicate headings" error in authority
    heading verification
  • Duplicate authority RECORDS 02-11-04
  • 33 > FIELD 100 1 a???
  • INDEXED AS AUTHOR ???
  • MESSAGE --------------- DUPLICATE AUTHORITY
    ----------
  • FROM a1525012x
  • ??? and ??? are actually two different authors
  • ? 21303A and ? 33303A are linked in EACC, but
    this problem does not happen in non-TSVCC indexing

32
TSVCC Linking cont.
  • Problem 2
  • Interfiling of indexed characters becomes worse
    under TSVCC as recall increases. Ideally, indexing
    and sorting should be separated.
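
A minimal sketch of what separating indexing from sorting could look like: matching uses a normalized key, while display order uses the codes as stored, so different character forms are retrieved together but not interfiled. The linked pair 21303A / 33303A is taken from Problem 1. The sample headings are arbitrary combinations of codes that appear elsewhere in these slides, and all names are hypothetical.

```python
# Linked pair from Problem 1: 33303A is normalized to 21303A for matching.
LOOKUP = {"33303A": "21303A"}


def search_key(codes):
    """Normalized key used only to decide whether a heading matches."""
    return tuple(LOOKUP.get(code, code) for code in codes)


def sort_key(codes):
    """Key used only for display order: the codes exactly as stored."""
    return tuple(codes)


headings = [
    ["33303A", "213B78"],
    ["21303A", "213B78"],
    ["33303A", "21304C"],
]
query = ["21303A", "213B78"]

# Retrieve with the normalized key, then order with the original codes.
hits = [h for h in headings if search_key(h) == search_key(query)]
for heading in sorted(hits, key=sort_key):
    print(heading)
# ['21303A', '213B78']
# ['33303A', '213B78']
```

Both forms are retrieved by one query, but each keeps its own position in the sorted list instead of being merged under the normalized code.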

33
Towards Native Unicode Catalog
  • How far are we?
  • LC has issued MARC-8 to Unicode mapping tables
  • OCLC Connexion client 1.5 begins to support MARC
    record import and export in UTF-8 encoding
  • Intensive discussion of Unicode implementation in
    MARC on the UNICODE-MARC Discussion List
    (UNICODE-MARC@loc.gov)
  • Most ILS vendors claim to support Unicode

34
Towards Native Unicode Catalog cont.
  • INNOPAC is almost there, but not fully ready yet.
  • There is an option for sites to convert their
    catalogs to Unicode (e.g. HKALL did so in
    Oct 2004)
  • It was noted from the HKALL catalog that the
    implementation of Unicode is only partially
    completed - there are still EACC dependencies in
    the data store and indexes
  • INNOPAC/Millennium does not yet support
    exporting and importing of records in UTF-8
  • CJK searching and sorting require more work

35
Towards Native Unicode Catalog cont.
  • Bibliographic data interchange involves multiple
    partners.

36
Towards Native Unicode Catalog cont.
  • The failure of round-trip crosswalk between
    systems will continue to be a problem until all
    systems are capable of importing and exporting
    data in Unicode and no one is interchanging MARC
    records in non-Unicode encodings

37
  • Thank You!
  • Contact Information

Philip Wong: lbphilip@cityu.edu.hk
K.T. Lam: lblkt@ust.hk