Ian Little - PowerPoint PPT Presentation


1
Pluggable Charset Support in the Java Platform
Ian Little
Software Engineer, Java Software Core Tools and Libraries
Sun Microsystems Ireland
2
Outline of Presentation
  • Support for charsets within the Java platform:
    old limitations, what's new and exciting!
  • Primer on buffer classes.
  • A developer's sightseeing tour of the new charset
    API
  • Overview of how to write a custom installable
    charset implementation.
  • Deploy and enjoy!

3
Pluggable Charsets in Java
  • Prior to J2SE Release 1.4: no public API.
  • 1.4 introduces the java.nio.charset package:
    - Small but important part of New I/O (JSR-51)
    - SPI allows support for new charsets to be
      plugged in.

4
JSR-51 The Bigger Picture
  • java.nio Buffers
  • java.nio.channels
  • Non-blocking network I/O
  • Fast file I/O (memory-mapped files, etc.)
  • java.util.regex -- Regular expressions
  • java.nio.charset -- Character sets

5
Java Community Process
  • java.nio and charset API/SPI conceived and
    nurtured through the Java Community
    Process (JCP).
  • New major features and API additions in J2SE 1.4
    introduced and reviewed via JCP.
  • JSR expert groups are composed of a wide
    diversity of domain experts.
  • For JSR-51 (finalised May 2002) expert group
    included IBM, Oracle, BEA, OKI, NTT, Sun.

6
Terminology
  • Charset: defined in RFC 2278
  • The combination of a coded character set and an
    encoding scheme.
  • Coders
  • Either a decoder (java.nio.charset.CharsetDecoder)
    or an encoder (java.nio.charset.CharsetEncoder).

7
Core NIO Charset Coders
  • java.nio Charset API introduced in J2SE 1.4
  • You can start writing to the API and deploying
    custom developed charsets now!
  • All charsets supported in prior J2SE releases
    are still supported.
  • Only some are accessible via the java.nio.charset
    API
  • More will be added in future releases.

US-ASCII, ISO-8859-1, ISO-8859-15, UTF-8, UTF-16,
UTF-16BE, UTF-16LE, windows-1252, Big5,
Big5-HKSCS, EUC-CN, EUC-JP, EUC-KR, EUC-TW, GBK,
GB18030, ISO-2022-KR, Johab, windows-936/949/950,
ISO-8859-X, JIS-X-0201, JIS-X-0212,
JIS-X-0208, TIS-620, ISCII91
8
Introducing the API/SPI

9
java.nio.charset package
  • java.nio
  • Core buffer classes ByteBuffer, CharBuffer
  • Used as input/output for coders
  • java.nio.charset
  • Core charset classes
  • Charset
  • CharsetDecoder, CharsetEncoder
  • CoderResult, CodingErrorAction
  • java.nio.charset.spi
  • Service-Provider Interface for building
    user-installable charset coders

10
API Overview
  • java.nio.charset.Charset
  • A named mapping between sequences of sixteen-bit
    Unicode characters and sequences of bytes.
  • Encapsulates the immutable properties of charsets
  • concrete instances obtained via the static
    forName() method
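
As a minimal sketch of the forName() lookup described above (the class and helper names are illustrative, not part of the API), the following resolves either a canonical name or an alias to the same Charset instance:

```java
import java.nio.charset.Charset;

class CharsetLookupSketch {
    /** Resolves a charset name or alias to its canonical name. */
    static String canonicalName(String nameOrAlias) {
        // forName() throws UnsupportedCharsetException if no such charset is available.
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        Charset cs = Charset.forName("latin1");       // lookup by alias
        System.out.println(cs.name());                // canonical name
        System.out.println(cs.aliases());             // all known aliases
        System.out.println(cs.equals(Charset.forName("ISO-8859-1")));
    }
}
```

Lookups are case-insensitive, so "utf8", "UTF8" and "UTF-8" all reach the same charset.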

11
API Overview (continued)
  • java.nio.charset.CharsetDecoder
  • Engine which takes a sequence of bytes
    encoded using a native character encoding scheme
    or mapping and produces the equivalent
    decoded Java 16-bit Unicode based character
    representation.
  • java.nio.charset.CharsetEncoder
  • Engine which takes a sequence of Java
    characters stored internally as 16-bit Unicode
    values and produces the equivalent natively
    encoded byte sequence.
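
A small round trip may make the two engines concrete. This is only a sketch using the one-shot convenience methods; the class and method names are made up for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

class RoundTripSketch {
    /** chars -> bytes via a CharsetEncoder. */
    static byte[] encode(String text, String charsetName) {
        try {
            ByteBuffer bb = Charset.forName(charsetName).newEncoder()
                    .encode(CharBuffer.wrap(text));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** bytes -> chars via a CharsetDecoder. */
    static String decode(byte[] bytes, String charsetName) {
        try {
            return Charset.forName(charsetName).newDecoder()
                    .decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

Note that the one-shot encode()/decode() methods report malformed or unmappable input by throwing CharacterCodingException.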

12
API Overview (continued)
  • java.nio.charset.spi.CharsetProvider
  • An object which facilitates the installation of
    the Charset implementation into the running JVM.

13
java.nio core classes
  • Buffer: a core java.nio abstraction
  • java.nio.Buffer is the abstract parent class.
  • Buffer subclasses encapsulate a linear sequence
    of values of primitive data types
  • The key Buffer properties are
    position, limit and capacity.
  • Specifically relevant when dealing with Charsets
    are the ByteBuffer and CharBuffer subclasses
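
To illustrate how position, limit and capacity move, here is a short sketch (class and method names are illustrative) that records the three properties after allocation, after writing, and after flip():

```java
import java.nio.Buffer;
import java.nio.CharBuffer;

class BufferStateSketch {
    static String describe(Buffer b) {
        return "pos=" + b.position() + " lim=" + b.limit() + " cap=" + b.capacity();
    }

    static String[] demo() {
        CharBuffer cb = CharBuffer.allocate(8);
        String fresh = describe(cb);      // nothing written yet
        cb.put('a').put('b').put('c');    // relative puts advance the position
        String written = describe(cb);
        cb.flip();                        // limit = old position, position = 0
        String flipped = describe(cb);    // ready to be read back
        return new String[] {fresh, written, flipped};
    }

    public static void main(String[] args) {
        for (String s : demo()) System.out.println(s);
    }
}
```

The capacity never changes; flip() is the usual step between filling a buffer and handing it to a coder.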



14
java.nio Buffer classes overview
  • Buffer read operations achieved via get() methods
  • Buffer write operations achieved via put()
    methods
  • Overloaded put(..) and get(..) method
    definitions depending on whether you require
  • per-byte or per-char reads from an input buffer
  • Bulk byte or char reads or writes
  • Buffer reads and writes addressed absolutely or
    relatively
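
The difference between relative, absolute and bulk access can be sketched as follows (names are illustrative). Relative calls move the position; absolute calls take an index and leave it alone:

```java
import java.nio.ByteBuffer;

class BufferAccessSketch {
    /** Returns {position after absolute put, first, absolute, bulk[0], bulk[1], final position}. */
    static int[] demo() {
        ByteBuffer bb = ByteBuffer.allocate(4);
        bb.put((byte) 10);                  // relative put: position 0 -> 1
        bb.put(3, (byte) 40);               // absolute put: position stays at 1
        int posAfterAbsolutePut = bb.position();
        bb.put(new byte[] {20, 30});        // bulk relative put: position 1 -> 3
        bb.rewind();                        // position = 0, limit unchanged

        byte first = bb.get();              // relative get: position 0 -> 1
        byte absolute = bb.get(3);          // absolute get: position stays at 1
        byte[] bulk = new byte[2];
        bb.get(bulk);                       // bulk relative get: position 1 -> 3
        return new int[] {posAfterAbsolutePut, first, absolute, bulk[0], bulk[1], bb.position()};
    }
}
```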

15
NIO Buffer classes overview
  • To create a CharBuffer or ByteBuffer
    instance
  • allocate(int capacity) or
  • allocateDirect(int capacity) (ByteBuffer only)
  • Interoperability with pre java.nio code
  • CharBuffer provides
  • CharBuffer wrap(char[])
  • ByteBuffer provides
  • ByteBuffer wrap(byte[])
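
A brief sketch of the creation options (class and method names are illustrative): wrap() gives a buffer view over an existing array, so older array-based code and buffer-based code see the same data.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

class BufferCreationSketch {
    /** wrap() shares the backing array: writes through the buffer show up in the array. */
    static boolean wrapSharesArray() {
        char[] legacy = {'a', 'b', 'c'};
        CharBuffer cb = CharBuffer.wrap(legacy);
        cb.put(0, 'z');                       // absolute put through the buffer
        return legacy[0] == 'z' && cb.hasArray() && cb.capacity() == 3;
    }

    /** allocateDirect() exists on ByteBuffer; allocate() gives a heap buffer. */
    static boolean directVersusHeap() {
        ByteBuffer direct = ByteBuffer.allocateDirect(16);
        ByteBuffer heap = ByteBuffer.allocate(16);
        return direct.isDirect() && !heap.isDirect();
    }
}
```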

16
Charset Class
  • Anchor class within the java.nio.charset package
  • boolean isSupported(String charsetName) -
    tests if supplied charset name/alias is
    supported via java.nio API
  • Charset naming consistency is a goal of the API.
  • Canonical names of Charsets mirror those within
    IANA registry.

17
java.nio.charset.Charset Class
  • Use MIME-preferred name where multiple choices
    exist.
  • boolean isRegistered()
  • Prefix canonical name with x- or X- where
    charset is not IANA-registered.
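
A sketch of the two naming queries (helper names are illustrative). One detail worth knowing: isSupported() throws IllegalCharsetNameException for syntactically illegal names instead of returning false, so the helper below catches it.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class CharsetNamingSketch {
    /** True if the name or alias is syntactically legal and available in this JVM. */
    static boolean supported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false;                      // e.g. a name containing spaces
        }
    }

    /** True if the charset is known to be registered with IANA. */
    static boolean registered(String name) {
        return Charset.forName(name).isRegistered();
    }
}
```

"X-fooCS" is the placeholder name used elsewhere in this deck; it is assumed not to be installed.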

18
API Usage Idiom: Decoding
  • Decoding bytes from a file (setup)....
  • // Get Charset instance
    Charset cs = Charset.forName("X-fooCS");
    // Get Decoder instance
    CharsetDecoder decoder = cs.newDecoder();
    // Decode ByteBuffer to CharBuffer (quick)
    CharBuffer cb = decoder.decode(bb);
    // Quick decode of ByteBuffer to String
    String s = decoder.decode(bb).toString();

19
API Usage Idiom: Decoding
  • ByteBuffer bb = ...
  • CharBuffer cb = ...
    boolean eof = false;
    CoderResult result = CoderResult.UNDERFLOW;
    while (!eof) {
      if (result == CoderResult.UNDERFLOW) {
        bb.compact();  // keep any undecoded bytes
        eof = (inChannel.read(bb) == -1);
        bb.flip();
      }
      result = decoder.decode(bb, cb, eof);
      if (result == CoderResult.OVERFLOW)
        drainBuf(cb);  // defined elsewhere
    }
    decoder.flush(cb);  // check overflow here
                        // too!
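
For reference, a self-contained variant of the loop might look like the sketch below. It is not the presenter's code: the class name, the StringBuilder sink and the decodeBytes() helper are assumptions made for illustration, and the buffers are assumed large enough to hold at least one complete character in the chosen charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

class StreamingDecodeSketch {
    static String decodeAll(ReadableByteChannel in, Charset cs, int bufSize) {
        CharsetDecoder decoder = cs.newDecoder();
        ByteBuffer bb = ByteBuffer.allocate(bufSize);
        CharBuffer cb = CharBuffer.allocate(bufSize);
        StringBuilder sink = new StringBuilder();
        try {
            boolean eof = false;
            while (!eof) {
                eof = (in.read(bb) == -1);       // fill the input buffer
                bb.flip();                       // switch it to read mode
                CoderResult result;
                do {
                    result = decoder.decode(bb, cb, eof);
                    if (result.isError()) result.throwException();
                    drain(cb, sink);             // empty the output buffer
                } while (result.isOverflow());   // more input still pending
                bb.compact();                    // keep bytes of a partial sequence
            }
            while (decoder.flush(cb).isOverflow()) drain(cb, sink);
            drain(cb, sink);
        } catch (IOException e) {                // includes CharacterCodingException
            throw new IllegalStateException(e);
        }
        return sink.toString();
    }

    private static void drain(CharBuffer cb, StringBuilder sink) {
        cb.flip();
        sink.append(cb);                         // appends the remaining chars
        cb.clear();
    }

    /** Convenience wrapper so the loop can be tried on an in-memory byte array. */
    static String decodeBytes(byte[] bytes, String charsetName, int bufSize) {
        ReadableByteChannel ch = Channels.newChannel(new ByteArrayInputStream(bytes));
        return decodeAll(ch, Charset.forName(charsetName), bufSize);
    }
}
```

Using compact() rather than clear() matters when a multi-byte sequence straddles two reads: the leading bytes stay in the buffer until the rest arrives.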

20
Autodetecting Charsets
  • Special case of decoder implementation:
    inspects encoded text and determines
    the encoding employed.
  • Typically such autodetecting charsets will be
    asymmetric and will require a
    decoder with no associated
    encoder implementation.
  • java.nio.charset.CharsetDecoder provides
  • boolean isAutoDetecting()
  • boolean isCharsetDetected()
  • Charset detectedCharset()
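
A short sketch of how client code might query these methods (the helper is illustrative). The guard matters: on a decoder that is not auto-detecting, isCharsetDetected() and detectedCharset() throw UnsupportedOperationException.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

class AutoDetectQuerySketch {
    static String detected(CharsetDecoder decoder) {
        if (!decoder.isAutoDetecting()) {
            return "not auto-detecting";          // ordinary decoders land here
        }
        return decoder.isCharsetDetected()
                ? decoder.detectedCharset().name()
                : "undetermined";                 // not enough input seen yet
    }

    public static void main(String[] args) {
        System.out.println(detected(Charset.forName("UTF-8").newDecoder()));
    }
}
```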

21
Error Handling & Exceptions
  • The propagation of exceptions by coders when
    they encounter unexpected input is
    under the programmer's control
  • Achievable via default and overridable actions
    defined within the class
    java.nio.charset.CodingErrorAction
  • Overridable action directives are encapsulated
    within
  • CodingErrorAction.REPLACE
  • CodingErrorAction.IGNORE
  • CodingErrorAction.REPORT
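
The effect of the three actions can be sketched with a US-ASCII decoder fed a byte outside the 7-bit range (class and method names are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

class ErrorActionSketch {
    static String decodeWith(byte[] bytes, CodingErrorAction action) {
        CharsetDecoder decoder = Charset.forName("US-ASCII").newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Only reachable with REPORT: the error surfaces as an exception.
            return "ERROR: " + e.getClass().getName();
        }
    }

    public static void main(String[] args) {
        byte[] input = {'a', (byte) 0xFF, 'b'};
        System.out.println(decodeWith(input, CodingErrorAction.REPLACE));
        System.out.println(decodeWith(input, CodingErrorAction.IGNORE));
        System.out.println(decodeWith(input, CodingErrorAction.REPORT));
    }
}
```

REPLACE substitutes the decoder's replacement string (U+FFFD by default), IGNORE drops the offending input, and REPORT hands the error back to the caller.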

22
Internals of a Decoder
  • Subclass java.nio.charset.CharsetDecoder
    directly if decoder has a decoding algorithm
    or properties perceived to be
    generally unique.
  • Employ OOA guiding best practices to determine
    inheritance / re-use opportunities.
  • Repeated decoding logic (similarly for encoders)
    can often be usefully refactored
    into a common abstract base
    decoder class.
  • Bulk of decoder implementation resides within
    the method
    CoderResult decodeLoop(ByteBuffer src,
                           CharBuffer dest)

23
Internals of a decoder (continued)
  • decodeLoop() method inspects each byte within
    the input ByteBuffer object instance and
    calculates the appropriate output char (or
    chars) and prepares to place them in
    receiving CharBuffer.
  • When the input buffer is depleted (no more bytes)
    the decodeLoop method should
    ordinarily return with
    CoderResult.UNDERFLOW

24
Internals of a decoder (continued)
  • Illegal input bytes or sequences need to be
    flagged by returning the result of
  • CoderResult.malformedForLength(n)
  • Malformed input can be dealt with by
    performing silent substitution of
    replacement chars in the output
    decoded buffer. It is possible to override
    the action via the directives defined in
    java.nio.charset.CodingErrorAction

25
Internals of a decoder (continued)
  • Receiving CharBuffer should be checked prior
    to each put() to determine whether there is
    sufficient space.
  • If the decoder requires a larger output
    buffer, return CoderResult.OVERFLOW

26
Internals of a decoder (continued)
  • decodeLoop() method can flag overflow by
    returning
    java.nio.charset.CoderResult.OVERFLOW
  • Overflow handled by calling code.
  • Drain output CharBuffer and reset its position
    before re-invoking the decoder
  • implFlush(CharBuffer out) A provided API
    hook which permits decoders (especially those
    which maintain state) to flush pending char
    output.
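
Putting the decoder slides together, a decodeLoop() might be sketched as below. The mapping is invented purely for illustration: a hypothetical "X-fooCS" in which bytes 0x00-0x7F are ASCII, 0x80 is the euro sign, and everything else is malformed. The class names are assumptions too, and the charset is deliberately decode-only.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

class FooDecoder extends CharsetDecoder {
    FooDecoder(Charset cs) {
        super(cs, 1.0f, 1.0f);   // average and maximum chars produced per byte
    }

    @Override
    protected CoderResult decodeLoop(ByteBuffer src, CharBuffer dest) {
        while (src.hasRemaining()) {
            if (!dest.hasRemaining()) {
                return CoderResult.OVERFLOW;              // caller must drain dest
            }
            byte b = src.get();
            if (b >= 0) {
                dest.put((char) b);                       // 7-bit range maps directly
            } else if (b == (byte) 0x80) {
                dest.put('\u20AC');                       // hypothetical euro mapping
            } else {
                src.position(src.position() - 1);         // leave the bad byte unconsumed
                return CoderResult.malformedForLength(1);
            }
        }
        return CoderResult.UNDERFLOW;                     // input depleted
    }
}

class FooDecodeOnlyCharset extends Charset {
    FooDecodeOnlyCharset() { super("X-fooCS", null); }

    @Override public boolean contains(Charset cs) { return cs instanceof FooDecodeOnlyCharset; }
    @Override public CharsetDecoder newDecoder() { return new FooDecoder(this); }
    @Override public boolean canEncode() { return false; }
    @Override public CharsetEncoder newEncoder() {
        throw new UnsupportedOperationException("decode-only sketch");
    }

    /** Decodes with REPLACE so malformed bytes become U+FFFD instead of an exception. */
    static String decode(byte[] bytes) {
        try {
            return new FooDecodeOnlyCharset().newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

This decoder keeps no state between calls, so it has no need to override implFlush(); a stateful decoder would emit its pending chars there.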

27
Autodetecting Charsets
  • Special case of decoder implementation:
    inspects encoded text and determines
    the encoding employed.
  • Typically such autodetecting charsets will be
    asymmetric and will require a
    decoder with no associated
    encoder implementation.
  • java.nio.charset.CharsetDecoder provides
  • boolean isAutoDetecting()
  • boolean isCharsetDetected()
  • Charset detectedCharset()

28
CharsetEncoder convenience methods
  • java.nio.charset.CharsetEncoder methods
  • boolean canEncode(char c)
  • boolean canEncode(CharSequence cseq) Tests
    encodability of a char or CharSequence.
  • float maxBytesPerChar() Used
    primarily by users/clients of the API to
    adequately size output buffers
  • float averageBytesPerChar() Can be used
    by coder API library clients to
    perform economical Buffer sizing.
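
A sketch of how a client might use these methods (helper names are illustrative): canEncode() as a pre-flight check, and maxBytesPerChar() to size an output buffer that can never overflow.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class EncoderSizingSketch {
    /** True if every character of the text is representable in the charset. */
    static boolean encodable(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    /** Upper bound on the number of bytes needed to encode the text. */
    static int worstCaseBytes(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        return (int) Math.ceil(text.length() * encoder.maxBytesPerChar());
    }
}
```

averageBytesPerChar() can be swapped in when a smaller initial buffer plus overflow handling is preferred over a guaranteed fit.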

29
Internals of an encoder
  • Principal Encoder entry point is
    CoderResult encodeLoop(CharBuffer src,
                           ByteBuffer dest)
  • Read and inspect each input character using
    relative CharBuffer get() method
    calls.
  • Example: for single char at a time
    reads, char c = src.get()
  • Use src.hasRemaining() or src.remaining()
    to determine the encoder loop termination
    condition, i.e. no more input
    chars within the current encode buffer.

30
Internals of an encoder (continued)
  • A well-behaved coder implementation will always
    terminate its encodeLoop() method, once
    all input has been consumed, by returning
    the class constant
    java.nio.charset.CoderResult.UNDERFLOW
  • Code which invokes the encoder will then arrange
    to drain the existing input buffer and
    supply the encoder with the
    remaining chars to encode from the
    next input buffer's payload.

31
Internals of an encoder
  • Determine if input character is mappable to the
    repertoire of the target encoding
    or not
  • Unmappable chars found in input are flagged by
    returning CoderResult.unmappableForLength(int
    n) (The value n can conceivably exceed
    1)
  • dest.hasRemaining() or dest.remaining()
    should be queried before each put() of output
    bytes to guard against output
    buffer overflow.
  • encodeLoop(src, dest) will need to
    return CoderResult.OVERFLOW when
    the output buffer is undersized to receive
    the output encoded bytes.
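
The encoder slides can be sketched the same way. Again the mapping is invented for illustration (ASCII plus the euro sign at 0x80, in a hypothetical "X-fooCS") and all class names are assumptions. A small decoder is included because the CharsetEncoder constructor validates its default replacement byte ('?') by decoding it with the charset's own decoder.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

class FooEncoder extends CharsetEncoder {
    FooEncoder(Charset cs) { super(cs, 1.0f, 1.0f); }   // replacement defaults to '?'

    @Override
    protected CoderResult encodeLoop(CharBuffer src, ByteBuffer dest) {
        while (src.hasRemaining()) {
            if (!dest.hasRemaining()) return CoderResult.OVERFLOW;
            char c = src.get(src.position());            // peek; consume only on success
            if (c < 0x80) {
                dest.put((byte) c);
            } else if (c == '\u20AC') {
                dest.put((byte) 0x80);                   // hypothetical euro mapping
            } else {
                // A surrogate pair is one unmappable character of length 2.
                // Simplification: a high surrogate at the very end counts as length 1.
                boolean pair = Character.isHighSurrogate(c) && src.remaining() >= 2
                        && Character.isLowSurrogate(src.get(src.position() + 1));
                return CoderResult.unmappableForLength(pair ? 2 : 1);
            }
            src.position(src.position() + 1);
        }
        return CoderResult.UNDERFLOW;
    }
}

class FooMiniDecoder extends CharsetDecoder {
    FooMiniDecoder(Charset cs) { super(cs, 1.0f, 1.0f); }

    @Override
    protected CoderResult decodeLoop(ByteBuffer src, CharBuffer dest) {
        while (src.hasRemaining()) {
            if (!dest.hasRemaining()) return CoderResult.OVERFLOW;
            byte b = src.get(src.position());
            if (b >= 0) dest.put((char) b);
            else if (b == (byte) 0x80) dest.put('\u20AC');
            else return CoderResult.malformedForLength(1);
            src.position(src.position() + 1);
        }
        return CoderResult.UNDERFLOW;
    }
}

class FooSymmetricCharset extends Charset {
    FooSymmetricCharset() { super("X-fooCS", null); }
    @Override public boolean contains(Charset cs) { return cs instanceof FooSymmetricCharset; }
    @Override public CharsetDecoder newDecoder() { return new FooMiniDecoder(this); }
    @Override public CharsetEncoder newEncoder() { return new FooEncoder(this); }

    /** Encodes with REPLACE so unmappable chars become '?' instead of an exception. */
    static byte[] encode(String text) {
        try {
            ByteBuffer bb = new FooSymmetricCharset().newEncoder()
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .encode(CharBuffer.wrap(text));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```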

32
Error Conditions & Exceptions
  • The propagation of exceptions by coders when
    they encounter unexpected input is
    under the programmer's control
    via default and overridable actions
    defined within the class
    java.nio.charset.CodingErrorAction
  • Overridable action directives are encapsulated
    within
  • CodingErrorAction.REPLACE
  • CodingErrorAction.IGNORE
  • CodingErrorAction.REPORT

33
Writing a user installed provider
  • Writing your own provider from scratch
  • Subclass java.nio.charset.spi.CharsetProvider
  • Methods which you will need to
    override:
  • Iterator charsets()
  • Charset charsetForName(String
    charsetName)

34
Writing a user installed provider
  • Provider architecture requires that you place
    a specifically named file within the
    classpath under the META-INF
    directory within the jar file containing
    the provider and compiled charset Java classes.
  • The file needs to reside within the directory
    (relative)
  • META-INF/services/
  • The required filename equals the fully qualified
    name of the provider base class:
    java.nio.charset.spi.CharsetProvider
  • The contents of the file are the fully qualified
    class names of each provider bundled
    within the provider jar file, one per line.
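
A provider with the two required overrides might be sketched as follows. Everything named here is hypothetical: the "X-fooCS" charset simply borrows the US-ASCII coders as a stand-in for a real implementation, and the class names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

// Placeholder charset: delegates to US-ASCII so the sketch stays short.
class PlaceholderFooCharset extends Charset {
    private final Charset delegate = Charset.forName("US-ASCII");

    PlaceholderFooCharset() { super("X-fooCS", new String[] {"fooCS"}); }

    @Override public boolean contains(Charset cs) { return delegate.contains(cs); }
    @Override public CharsetDecoder newDecoder() { return delegate.newDecoder(); }
    @Override public CharsetEncoder newEncoder() { return delegate.newEncoder(); }
}

// For real deployment this class must be public with a public no-argument
// constructor, and the jar must contain the file
//   META-INF/services/java.nio.charset.spi.CharsetProvider
// holding one line with this class's fully qualified name.
class FooCharsetProvider extends CharsetProvider {
    private final Charset foo = new PlaceholderFooCharset();

    @Override
    public Iterator<Charset> charsets() {
        return Collections.singletonList(foo).iterator();
    }

    @Override
    public Charset charsetForName(String charsetName) {
        if (foo.name().equalsIgnoreCase(charsetName)) return foo;
        for (String alias : foo.aliases()) {
            if (alias.equalsIgnoreCase(charsetName)) return foo;
        }
        return null;   // not ours: lets the lookup continue elsewhere
    }
}
```

Once the jar is on the classpath, Charset.forName("X-fooCS") is expected to consult the provider after the built-in charsets have been tried.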

35
Writing a user installed provider
  • Provider lookup occurs via the current thread's
    context classloader
  • You may put the provider jar within the
    applet/application classpath or
    within the J2SE extensions
    directory

36
Finally ....
  • Programmers now have access to a rich and
    well integrated (via java.nio) API with
    which to write and access
    charsets within J2SE.
  • The SPI features provide an extremely useful way
    to extend the J2SE charset support at
    runtime.
  • NB: Read the Specification within the Javadocs
    to understand the full nuances of
    the API !!
    http://java.sun.com/j2se/1.4/docs/guide/nio/index.html
  • Going forward, J2SE charset support will come
    in the form of a core set of
    exclusively New I/O supported
    charsets and optionally downloadable
    New I/O capable charsets from Sun and 3rd parties.

37
Thanks!
  • Many thanks for your kind patience today!
  • A big thank you to my fellow team members within
    the Java Software Core Tools and Libraries
    team !!
    Andrew Bennett, my manager.
    Mark Reinhold, JSR-51 spec lead and
    powerhouse behind New I/O.
    And to my fellow team colleagues
    Josh Bloch, Mike McCloskey, Iris Garcia,
    Neal Gafter, Joe Darcy.