Presentation Francis Cave ACAP

1
Communicating with crawlers: What ACAP has to offer
Francis Cave, EDItEUR ACAP Technical Project Manager, May 2008
WEBCONTENT: Te Mooi Om Weg Te Geven, NUV, Amsterdam
2
Communicating with crawlers: What ACAP has to offer
  • What is ACAP (Version 1.0)?
  • What has been the experience so far?
  • What publishers should do now...

3
The ACAP Technical Framework: ACAP Version 1.0
  • What it is
      • a toolkit to enable communication of content
        access and usage policies
      • adopting and building upon existing standards
      • rooted in the requirements of real use cases
      • a proof of concept
  • What it isn't
      • a formal standard (at this stage)
      • a technical enforcement mechanism

4
The ACAP Technical Framework: ACAP Version 1.0
  • What is it?
  • Protocols for machine-to-machine messaging
  • using a common vocabulary of access and usage
    terminology
  • Guidance on methods of communication and access
    control
  • Software tools to support implementation

5
The ACAP Technical Framework: ACAP Version 1.0
  • What kinds of protocols?
      • business layer protocols
  • machines already know how to talk to one
    another
      • physical layer: PPP, ATM, ...
      • network layer: TCP/IP
      • application layer: HTTP, HTTPS, SMTP, FTP
      • business layer: RSS, ebXML, EDIINT, SOAP, web
        services, ...
  • they just don't know what to say in the business
    of communicating access and usage policies

6
The ACAP Technical Framework: ACAP Version 1.0
  • We need to tell the machines what to say to one
    another
  • we need a common vocabulary
  • so they know what to say
  • and how to interpret it
  • and tell them how to say it
  • using whatever protocols they already use to
    talk to one another

7
The ACAP Technical Framework: ACAP Version 1.0
  • But machines aren't going to do this on their
    own
  • we need to provide guidance on how to implement
    the protocols
  • we need to provide tools to support implementation

8
The ACAP Technical Framework: ACAP Version 1.0
  • How has it been developed?
  • We started with a set of real business use cases
  • Nine publishers looking for ways of communicating
    access and use policies for their online content
  • A national archive looking for ways of finding
    out what they were allowed to do with the content
    that they are preserving for posterity
  • A search engine looking for ways to include more
    high-quality content in their index

9
The ACAP Technical Framework: ACAP Version 1.0
  • What does ACAP Version 1.0 include?
  • Extensions to the Robots Exclusion Protocol (REP)
  • Part 1 specifies extensions to the robots.txt
    format
  • enables policies to be expressed for an entire
    website
  • leverages the established protocol for web
    server-crawler communication
  • the existing format is used on millions of
    websites and understood by hundreds of crawlers
  • Part 2 specifies extensions to the Robots META
    Tags format
  • enables policies to be expressed within
    individual HTML pages
  • existing format understood by major search
    engines
  • Dictionary of access and usage terminology
  • robots.txt conversion tool

10
The ACAP Technical Framework: ACAP Version 1.0
  • Why does REP need to be extended?
  • conventional REP has only a very limited
    vocabulary
  • even if we include non-standard extensions that
    not every search engine has implemented
  • conventional REP is inconsistently interpreted
  • e.g. Disallow means different things to different
    crawlers
  • don't crawl?
  • don't index?
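
The limited vocabulary of conventional REP can be seen in how a typical parser exposes it. The following is a minimal sketch using Python's standard urllib.robotparser module; the crawler name ExampleBot, the example.com URLs and the helper crawl_allowed are assumptions made for illustration. The parser can answer only one yes/no question per URL ("may this crawler fetch it?"), so whether Disallow also means "don't index" is left to each crawler's own interpretation.

```python
from urllib.robotparser import RobotFileParser

# A conventional robots.txt policy (illustrative content, not from the slides).
CONVENTIONAL_RULES = """\
User-agent: *
Disallow: /archive/
"""

parser = RobotFileParser()
parser.parse(CONVENTIONAL_RULES.splitlines())


def crawl_allowed(url: str, crawler: str = "ExampleBot") -> bool:
    """Return the single answer conventional REP can give: fetch or don't fetch."""
    return parser.can_fetch(crawler, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/archive/2008/story.html",
        "http://www.example.com/news/today.html",
    ):
        print(url, crawl_allowed(url))
```

There is no separate call for "may this page be indexed?", which is the gap the ACAP extensions are meant to address.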

11
The ACAP Technical Framework: ACAP Version 1.0
  • ACAP Version 1.0 has been tested by four
    publishers against their priority use cases
  • De Persgroep: major Flemish news publisher
  • Media 24: global news / media publisher based in
    South Africa
  • Macmillan: online book content hosting service
  • Reed Elsevier: scientific and business
    information publisher
  • all the tested use cases concern text resources
  • current technical work includes extension of ACAP
    Version 1.0 to enable communication of policies
    relating specifically to non-text resources such
    as images and video
  • ACAP Version 1.0 has been implemented in a test
    crawler by search engine operator Exalead

12
The ACAP Technical Framework: ACAP Version 1.0
  • Tool for converting existing robots.txt files
  • converts conventional robots.txt files so that
    existing policies are expressed using ACAP
    terminology
      • User-agent → ACAP-crawler
      • Disallow → ACAP-disallow-crawl
      • Allow → ACAP-allow-crawl
  • is implemented in Perl
  • can be used from the ACAP website
      • http://www.the-acap.org/convert-robots-txt-to-acap.php
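
The field mapping on this slide can be sketched in a few lines. The following is a minimal Python sketch, not the ACAP conversion tool itself (which, per the slide, is implemented in Perl). It only renames the three fields listed above and passes every other line through unchanged. The function name convert_robots_txt and the sample paths are assumptions made for illustration, and the real tool may well do more than this.

```python
import re

# Field mapping taken from the slide; everything else here is illustrative.
FIELD_MAP = {
    "user-agent": "ACAP-crawler",
    "disallow": "ACAP-disallow-crawl",
    "allow": "ACAP-allow-crawl",
}

# indent, field name, separator, value
LINE_RE = re.compile(r"^(\s*)([A-Za-z-]+)(\s*:\s*)(.*)$")


def convert_robots_txt(text: str) -> str:
    """Rename conventional REP fields to the ACAP field names shown on the slide."""
    converted = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(2).lower() in FIELD_MAP:
            indent, field, _sep, value = match.groups()
            new_field = FIELD_MAP[field.lower()]  # field names are matched case-insensitively
            converted.append(f"{indent}{new_field}: {value}".rstrip())
        else:
            # Comments, blank lines and unrecognised fields pass through unchanged.
            converted.append(line)
    return "\n".join(converted)


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /private/\nAllow: /private/press/\n# staging rules"
    print(convert_robots_txt(sample))
```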

13
The ACAP Technical Framework: ACAP Version 1.0
  • Guidance on crawler authentication
  • How to identify crawler names and IP addresses by
    analysing web server access log files
  • How to configure a server so that you can deliver
    different robots.txt files to different
    crawlers
  • examples are based upon the Apache web server
  • ACAP Version 1.0 Implementation Guide
  • Step-by-step guide on how to make full use of the
    extensions to REP proposed in ACAP Version 1.0
  • Illustrated with many examples
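
Identifying crawler names and IP addresses from access logs, as mentioned in the guidance above, might look roughly like this. It is a sketch under stated assumptions and is not taken from the ACAP guidance: it assumes the Apache "combined" log format, and it uses a simple keyword heuristic (CRAWLER_HINTS) to spot crawler User-Agent strings. The function name, the keywords and the sample log lines are all hypothetical.

```python
import re
from collections import defaultdict

# Apache "combined" log format: host, ident, user, [time], "request",
# status, bytes, "referer", "user-agent".
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical heuristic: substrings that often appear in crawler names.
CRAWLER_HINTS = ("bot", "crawler", "spider")


def crawler_addresses(log_lines):
    """Map each crawler-like User-Agent string to the sorted IPs it was seen from."""
    seen = defaultdict(set)
    for line in log_lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent")
        if any(hint in agent.lower() for hint in CRAWLER_HINTS):
            seen[agent].add(match.group("ip"))
    return {agent: sorted(ips) for agent, ips in seen.items()}


if __name__ == "__main__":
    sample_log = [
        '192.0.2.10 - - [12/May/2008:10:15:32 +0200] "GET /robots.txt HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
        '198.51.100.7 - - [12/May/2008:10:16:01 +0200] "GET /news/ HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows)"',
    ]
    print(crawler_addresses(sample_log))
```

A User-Agent string can be set to anything by the client, so the name alone does not authenticate a crawler. Collecting the IP addresses alongside the names is what makes a later check possible.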

14
The ACAP Technical Framework: ACAP Version 1.0
  • Review of test results
  • We have tested ACAP Version 1.0 REP extensions in
    a range of use cases
  • for most of the tested use cases there are no
    unresolved issues
  • but protected content use cases
      • have been particularly challenging to implement
      • have highlighted the need for further work on
        some terminology
  • ACAP Version 1.0 is ready to implement
  • for use cases in unprotected online content
    delivery
  • for some use cases in protected online content
    delivery
  • but ACAP needs further development
  • all specifications will continue to be revised
    and extended

15
The ACAP Technical Framework: Future plans
  • To be added in future
  • corrections and clarifications of a few points in
    ACAP Version 1.0
  • additional vocabulary required for expressing
    policies specific to
  • the creation and use of web archives
  • the presentation of images and other media
    content
  • the communication of policies associated with
    page fragments
  • mechanisms for embedding ACAP policies in PDF and
    media resources.
  • an XML format for policy expression
  • based upon ONIX for Licensing Terms developed by
    EDItEUR
  • required for news and web syndication use cases

16
The ACAP Technical Framework: ACAP Version 1.0
  • Experience to date
  • ACAP Version 1.0 works
  • it enables a richer form of expression of
    policies than is possible using conventional REP
    ...
  • it doesn't interfere with current crawler
    activity ...
  • ... but it only goes so far.
  • ACAP Version 1.0 needs to be extended
  • ACAP Version 1.1 (June/July 2008)

17
The ACAP Technical Framework: ACAP Version 1.0
  • What should publishers do now?
  • ACAP Version 1.0 needs to be implemented!
  • use the conversion tool to convert existing
    robots.txt files to use ACAP forms of
    expression
  • use the Implementation Guide to refine policy
    expressions
  • consider creating crawler-specific policies in
    separate robots.txt files
  • give us your feedback, to help us improve future
    versions of ACAP
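
Serving crawler-specific policies from separate robots.txt files comes down to a lookup on the requesting crawler's name and address. The sketch below shows only that selection logic in Python. It is an illustration under assumptions, not the Apache-based configuration the ACAP guidance describes: the policy table, the file names, the crawler name and the address range are all hypothetical.

```python
from ipaddress import ip_address, ip_network

# Hypothetical policy table: (name token, networks the crawler is known to use,
# robots.txt variant to serve). None of these values come from ACAP.
CRAWLER_POLICIES = [
    ("examplebot", [ip_network("192.0.2.0/24")], "robots-examplebot.txt"),
]
DEFAULT_ROBOTS = "robots.txt"


def robots_file_for(user_agent: str, remote_addr: str) -> str:
    """Pick the robots.txt variant for a request to /robots.txt.

    The crawler-specific file is served only when both the declared name and
    the source address match; everything else gets the default file.
    """
    addr = ip_address(remote_addr)
    for token, networks, filename in CRAWLER_POLICIES:
        name_matches = token in user_agent.lower()
        address_matches = any(addr in network for network in networks)
        if name_matches and address_matches:
            return filename
    return DEFAULT_ROBOTS


if __name__ == "__main__":
    print(robots_file_for("ExampleBot/1.0", "192.0.2.10"))
    print(robots_file_for("ExampleBot/1.0", "203.0.113.5"))
```

Requiring both the name and the address to match means a client that merely claims a crawler's name still receives the default policy.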

18
The ACAP Technical Framework
Thank you! Questions? francis_at_franciscave.com