1
Scalla Ins & Outs
  • xrootd /cmsd
  • Andrew Hanushevsky
  • SLAC National Accelerator Laboratory
  • OSG Administrators Workshop
  • Stanford University/SLAC
  • 13-November-08
  • http://xrootd.slac.stanford.edu

2
Goals
  • A good understanding of
  • xrootd structure
  • Clustering & cmsd
  • How configuration directives apply
  • Cluster interconnections
  • How it really works
  • The oss Storage System & the CacheFS
  • SRM & Scalla
  • Position of FUSE, xrootdFS, cnsd
  • The big picture

3
What is Scalla?
  • Structured Cluster Architecture for
  • Low Latency Access
  • Low Latency Access to data via xrootd servers
  • Protocol includes high performance features
  • Structured Clustering provided by cmsd servers
  • Exponentially scalable and self organizing

4
What is xrootd?
  • A specialized file server
  • Provides access to arbitrary files
  • Allows reads/writes with offset/length
  • Think of it as a specialized NFS server
  • Then why not use NFS?
  • Does not scale well
  • Can't map a single namespace on all the servers
  • All xrootd servers can be clustered to look like
    one server

5
The xrootd Server
[Diagram: the xrootd server runs an xrootd process layered as: application, process manager, clustering interface, protocol implementation, logical file system, physical storage system.]
6
How Is xrootd Clustered?
  • By a management service provided by cmsd
    processes
  • Oversees the health and name space on each xrootd
    server
  • Maps file names to the servers that have the file
  • Informs the client, via an xrootd server, about the
    file's location
  • All done in real time without using any databases
  • Each xrootd server process talks to a local cmsd
    process
  • Communicate over a Unix named (i.e., file system)
    socket
  • Local cmsds communicate to a manager cmsd
    elsewhere
  • Communicate over a TCP socket
  • Each process has a specific role in the cluster

7
xrootd & cmsd Relationships
[Diagram: an xrootd server hosts an xrootd process (application, ofs component, oss component) and a cmsd process; the xrootd process's clustering interface talks to the local cmsd, which in turn talks to a manager cmsd elsewhere.]
8
How Are The Relationships Described?
  • Relationships described in a configuration file
  • You normally need only one such file for all
    servers
  • But all servers need such a file
  • The file tells each component its role & what to do
  • Done via component-specific directives
  • One line per directive (see the sketch below)
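As a minimal sketch of such a file (the host name, port, and paths below are placeholders, not taken from this presentation; see the configuration references for exact option syntax):
all.manager mgr.example.org:1213
all.role manager if mgr.example.org
all.role server
all.export /atlas
all.adminpath /var/adm/xrootd/admin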

9
Directives versus Components
[Diagram: each directive prefix configures one part of the system: xrd. directives apply to the process manager layer, xrootd. to the protocol implementation, ofs. to the logical file system, and oss. to the physical storage system; the manager cmsd reads the same file. The example shows an xrootd.fslib directive naming the XrdOfs.so library, placed in the configuration file.]
10
Where Can I Learn More?
  • Start With Scalla Configuration File Syntax
  • http://xrootd.slac.stanford.edu/doc/dev/Syntax_config.htm
  • System related parts have their own manuals
  • Xrd/XRootd Configuration Reference
  • Describes xrd. and xrootd. directives
  • Scalla Open File System & Open Storage System
    Configuration Reference
  • Describes ofs. and oss. directives
  • Cluster Management Service Configuration
    Reference
  • Describes cms. directives
  • Every manual tells you when you must use the all. prefix

11
The Bigger Picture
[Diagram: a manager node (x.slac.stanford.edu) and two data server nodes (a.slac.stanford.edu and b.slac.stanford.edu), each running an xrootd and cmsd pair.]
Which one do clients connect to?
Note: All processes can be started in any order!
12
Then How Do I Get To A Server?
  • Clients always connect to the manager's xrootd
  • Clients think this is the right file server
  • But the manager only pretends to be a file server
  • Clients really don't know the difference
  • The manager finds out which server has the client's file
  • Then magic happens

13
The Magic Is Redirection!
[Diagram: the client asks the manager node (x.slac.stanford.edu) to open(/foo); the manager's cmsd asks the data server nodes "Have /foo?"; node a.slac.stanford.edu replies "I have /foo!"; the manager tells the client to go to a; the client re-issues open(/foo) against a.slac.stanford.edu and reads /foo there.]
14
Request Redirection
  • Most requests redirected to the right server
  • Provides point-to-point I/O
  • Redirection for existing files takes a few
    milliseconds the 1st time
  • Results are cached; subsequent redirections are done
    in microseconds
  • Allows load balancing
  • Many options; see the cms.perf & cms.sched
    directives
  • Cognizant of failing servers
  • Can automatically choose another working server
  • See the cms.delay directive

15
Pause For Some Terminology
  • Manager
  • The processes whose assigned role is manager
  • all.role manager
  • Typically this is a distinguished node
  • Redirector
  • The xrootd process on the manager's node
  • Server
  • The processes whose assigned role is server
  • all.role server
  • This is the end-point node that actually supplies
    the file data

16
How Many Managers Can I Have?
  • Up to eight, but usually you'll want only two
  • Avoids single-point hardware and software
    failures
  • Redirectors automatically cross-connect to all of
    the manager cmsds
  • Servers automatically connect to all of the
    manager cmsds
  • Clients randomly pick one of the working manager
    xrootds
  • Redirectors algorithmically pick one of the
    working cmsds
  • Allows you to load balance manager nodes if you wish
  • See the all.manager directive
  • This also allows you to do serial restarts
  • Eases administrative maintenance
  • The cluster goes into safe mode if all the
    managers die or if too many servers die

17
A Robust Configuration
[Diagram: two central manager nodes (x.slac.stanford.edu and y.slac.stanford.edu) acting as redirectors, plus data server nodes a.slac.stanford.edu and b.slac.stanford.edu, all using the configuration below.]
all.role server
all.role manager if x.slac.stanford.edu
all.manager x.slac.stanford.edu:1213
all.role manager if y.slac.stanford.edu
all.manager y.slac.stanford.edu:1213
18
How Do I Handle Multiple Managers?
  • Ask your network administrator to
  • Assign the manager IP addresses to a common host
    name
  • e.g., xy.domain.edu → x.domain.edu, y.domain.edu
  • Make sure that DNS load balancing does not apply!
  • Use xy.domain.edu everywhere instead of x or y
  • root://x.domain.edu,y.domain.edu// →
    root://xy.domain.edu//
  • The client will choose one of x or y
  • In the configuration file, do one of the following
    (sketched below)
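As a hedged sketch (host names and port are placeholders), the two usual choices are either a single all.manager line using the combined host name, or one all.manager line per manager host, as in the robust configuration shown earlier:
all.manager xy.domain.edu:1213
or
all.manager x.domain.edu:1213
all.manager y.domain.edu:1213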

19
A Quick Recapitulation
  • The system is highly structured
  • Server xrootds provide the data
  • Manager xrootds provide the redirection
  • The cmsds manage the cluster
  • Locate files and monitor the health of all the
    servers
  • Clients initially contact a redirector
  • They are then redirected to a data server
  • The structure is described by the config file
  • Usually the same one is used everywhere

20
Things You May Want To Do
  • Automatically restart failing processes
  • Best done via a crontab entry running a restart
    script (see the sketch below)
  • Most people use root, but you can use the
    xrootd/cmsd's uid
  • Renice server cmsds
  • As root: renice -n -10 -p cmsd_pid
  • Allows cmsd to get CPU even when the system is
    busy
  • Can be automated via the start-up script
  • One reason why most people use root for
    start/restart
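A minimal sketch of the crontab approach (the schedule, script name, and path are placeholders; the restart script itself is site-specific):
# root crontab entry: check and restart xrootd/cmsd every 15 minutes
*/15 * * * * /opt/xrootd/bin/check_restart_xrootd.sh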

21
Things You Really Need To Do
  • Plan for log and core file management
  • e.g., /var/adm/xrootd/core & /var/adm/xrootd/logs
  • Log rotation can be automated via command line
    options
  • Over-ride the default administrative path
  • See the all.adminpath directive
  • Place where Unix named sockets are created
  • /tmp is the (bad) default; consider using
    /var/adm/xrootd/admin (see the sketch below)
  • Plan on configuring your storage space & SRM
  • These are xrootd-specific ofs & oss options
  • SRM requires you run FUSE, cnsd, and BestMan
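As a sketch, the administrative-path override mentioned above is a single directive in the shared configuration file:
all.adminpath /var/adm/xrootd/admin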

22
Server Storage Configuration
  • The questions to ask
  • What paths do I want to export (i.e., make
    available)?
  • Will I have more than one file system on the
    server?
  • Will I be providing SRM access?
  • Will I need to support SRM space tokens?

23
Exporting Paths
  • Use the all.export directive
  • Used by xrootd to allow access to exported paths
  • Used by cmsd to search for files in exported
    paths
  • Many options available
  • r/o and r/w are the two most common (see the sketch
    below)
  • Refer to the manual
  • Scalla Open File System & Open Storage System
    Configuration Reference
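A minimal sketch of the directive (the paths and options are illustrative):
all.export /atlas r/w
all.export /archive r/o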

24
But My Exports Are Mounted Elsewhere!
  • Common issue
  • Say you need to mount your file system on /myfs
  • But you want to export /atlas within /myfs
  • What to do?
  • Use the oss.localroot directive
  • Only the oss component needs to know about this
  • oss.localroot /myfs
  • all.export /atlas
  • Makes /atlas a visible path but internally always
    prefixes it with /myfs
  • So, open(/atlas/foo) actually opens
    /myfs/atlas/foo

25
Multiple File Systems
  • The oss allows you to aggregate partitions
  • Each partition is mounted as a separate file
    system
  • An exported path can refer to all the partitions
  • The oss automatically handles it by creating
    symlinks
  • File name in /atlas is a symlink to an actual
    file in /mnt1 or /mnt2

The oss CacheFS
[Diagram: the exported path /atlas is a file system that holds the exported file paths; each name is a symlink into one of the mounted partitions (/mnt1, /mnt2) that hold the file data.]
oss.cache public /mnt1 xa
oss.cache public /mnt2 xa
all.export /atlas
26
OSS CacheFS Logic Example
  • Client creates a new file /atlas/myfile
  • The oss selects a suitable partition
  • Searches for space in /mnt1 and /mnt2 using LRU
    order
  • Creates a null file in the selected partition
  • Let's call it /mnt1/public/00/file0001
  • Creates two symlinks (roughly as sketched below)
  • /atlas/myfile → /mnt1/public/00/file0001
  • /mnt1/public/00/file0001.pfn → /atlas/myfile
  • Client can then write the data
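In shell terms, the two links from the example are roughly equivalent to the following (illustrative only; the oss creates them internally):
ln -s /mnt1/public/00/file0001 /atlas/myfile
ln -s /atlas/myfile /mnt1/public/00/file0001.pfn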

27
Why Use The oss CacheFS?
  • No need if you can have one file system
  • Use the OS volume manager if you have one and
  • Not worried about large logical partitions or
    fsck time
  • However,
  • We use the CacheFS to support SRM space tokens
  • Done by mapping tokens to virtual or physical
    partitions
  • The oss supports both

28
SRM Static Space Token Refresher
  • Encapsulates fixed space characteristics
  • Type of space
  • E.g., Permanence, performance, etc.
  • Implies a specific quota
  • Using a particular arbitrary name
  • E.g., atlasdatadisk, atlasmcdisk, atlasuserdisk,
    etc.
  • Typically used to create new files
  • Think of it as a space profile

29
Partitions as a Space Token Paradigm
  • Disk partitions map well to SRM space tokens
  • A set of partitions embody a set of space
    attributes
  • Performance, quota, etc.
  • A static space token defines a set of space
    attributes
  • Partitions and static space tokens are
    interchangeable
  • We take the obvious step
  • Use oss CacheFS partitions for SRM space tokens
  • Simply map space tokens on a set of partitions
  • The oss CacheFS supports real and virtual
    partitions
  • So you really don't need physical partitions here

30
Virtual vs. Real Partitions
[In each oss.cache directive below, the first argument is the virtual partition name and the second is the real partition mount point.]
oss.cache atlasdatadisk /store1 xa
oss.cache atlasmcdisk /store1 xa
oss.cache atlasuserdisk /store2 xa
  • Simple two step process
  • Define your real partitions (one or more)
  • These are file system mount-points
  • Map virtual partitions on top of real ones
  • Virtual partitions can share real partitions
  • By convention, virtual partition names equal
    static token names
  • Yields implicit SRM space token support

31
Space Tokens vs. Virtual Partitions
  • Partitions selected by virtual partition name
  • Configuration file
  • New files are CGI-tagged with the space token name
  • root://host:1094//atlas/mcdatafile?cgroup=atlasmcdisk
  • The default is public
  • But space token names equal virtual partition
    names
  • File will be allocated in the desired
    real/virtual partition

oss.cache atlasdatadisk /store1 xa
oss.cache atlasmcdisk /store1 xa
oss.cache atlasuserdisk /store2 xa
32
Virtual vs. Real Partitions
  • Non-overlapping virtual partitions (R = V; see the
    sketch below)
  • A real partition represents a hard quota
  • Implies space token gets fixed amount of space
  • Overlapping virtual partitions (R ≠ V)
  • Hard quota applies to multiple virtual partitions
  • Implies space token gets an undetermined amount
    of space
  • Need usage tracking and external quota management
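A sketch of the two cases using the oss.cache directives from the earlier slides (partition names and mount points are the same illustrative ones):
# Non-overlapping (R = V): each virtual partition has its own real partition
oss.cache atlasdatadisk /store1 xa
oss.cache atlasuserdisk /store2 xa
# Overlapping (R ≠ V): two virtual partitions share one real partition
oss.cache atlasdatadisk /store1 xa
oss.cache atlasmcdisk /store1 xa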

33
Partition Usage Tracking
  • The oss tracks usage by partition
  • Automatic for real partitions
  • Configurable for virtual partitions
  • oss.usage [nolog | log dirpath]
  • Since Virtual Partitions ⇔ SRM Space Tokens
  • Usage is also automatically tracked by space
    token
  • POSIX getxattr() returns usage information
  • See Linux man page

34
Partition Quota Management
  • Quotas applied by partition
  • Automatic for real partitions
  • Must be enabled for virtual partitions
  • oss.usage quotafile filepath (see the sketch below)
  • Currently, quotas are not enforced by the oss
  • POSIX getxattr() returns quota information
  • Used by FUSE/xrootdFS to enforce quotas
  • Required to run a full featured SRM
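As a sketch, enabling quota tracking for virtual partitions points the oss at a quota file (the file path is a placeholder):
oss.usage quotafile /var/adm/xrootd/quotafile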

35
The Quota File
  • Lists quota for each virtual partition
  • Hence, also a quota for each static space token
  • Simple multi-line format (sketched below)
  • vpname nnnn[k|m|g|t]\n
  • vpnames are in 1-to-1 correspondence with space
    token names
  • The oss re-reads it whenever it changes
  • Useful only for FUSE/xrootdFS
  • Quotas need to apply to the whole cluster
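An illustrative quota file (token names from the earlier slides; the limits are made up):
atlasdatadisk 100t
atlasmcdisk 50t
atlasuserdisk 500g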

36
Considerations
  • Files cannot be easily reassigned space tokens
  • Must manually move file across partitions
  • Can always get original space token name
  • Use file-specific getxattr() call
  • Quotas for virtual partitions are soft
  • Time causality prevents a real hard limit
  • Use real partitions if hard limit needed

37
SRM & Scalla: The Big Issue
  • Scalla implements a distributed name space
  • Very scalable and efficient
  • Sufficient for data analysis
  • SRM needs a single view of the complete name
    space
  • This requires deploying additional components
  • Composite Name Space Daemon (cnsd)
  • Provides the complete name space
  • FUSE/xrootdFS
  • Provides the single view via a file system
    interface
  • Compatible with all stand-alone SRMs (e.g.,
    BestMan & StoRM)

38
The Composite Name Space
  • A new xrootd instance is used to maintain the
    complete name space for the cluster
  • Only holds the full paths & file sizes, no more
  • Normally runs on one of the manager nodes
  • The cnsd needs to run on all the server nodes
  • Captures xrootd name space requests (e.g., rm)
  • Re-Issues the request to the new xrootd instance
  • This is the cluster's composite name space
  • Composite because each server node adds to the
    name space
  • There is no pre-registration of names; it all
    happens on-the-fly

39
Composite Name Space Implemented
[Diagram: opendir() refers to the directory structure maintained by the name space xrootd at myhost:2094. The manager runs the redirector xrootd at myhost:1094; each data server runs a cnsd, and name space operations (create/trunc, mkdir, mv, rm, rmdir) are propagated to the name space xrootd.]
ofs.forward 3way myhost:2094 mkdir mv rm rmdir trunc
ofs.notify closew create | /opt/xrootd/bin/cnsd
xrootd.redirect myhost:2094 dirlist
40
Some Caveats
  • Name space is reasonably accurate
  • Usually sufficient for SRM operations
  • cnsds do log events to circumvent transient
    failures
  • The log is replayed when the name space xrootd
    recovers
  • But, the log is not infinite
  • Invariably inconsistencies will arise
  • The composite name space can be audited
  • Means comparing and resolving multiple name
    spaces
  • Time consuming in terms of elapsed time
  • But can happen while the system is running
  • Tools to do this are still under development
  • Consider contributing such software

41
The Single View
  • Now that there is a composite cluster name space
    we need an SRM-compatible view
  • The easiest way is to use a file system view
  • BestMan and StoRM actually expect this
  • The additional component is FUSE

42
What is FUSE?
  • Filesystem in Userspace
  • Implements a file system as a user space program
  • Linux 2.4 and 2.6 only
  • Refer to http://fuse.sourceforge.net/
  • Can use FUSE to provide xrootd access
  • Looks like a mounted file system
  • We call it xrootdFS
  • Two versions currently exist
  • Wei Yang at SLAC (packaged with VDT)
  • Andreas Peters at CERN (packaged with Castor)

43
xrootdFS (Linux/FUSE/xrootd)
[Diagram: on the client host, the SRM uses the POSIX file system interface; the kernel FUSE module passes calls (e.g., opendir) up to the user-space FUSE/Xroot interface, which uses the xrootd POSIX client to talk to the redirector xrootd (port 1094) on the redirector host.]
Note: you should still run cnsd on the servers to capture non-FUSE events.
44
SLAC xrootdFS Performance
Test hardware: Sun V20z (RHEL4, 2x 2.2GHz AMD Opteron, 4GB RAM, 1Gbit/sec Ethernet) and VA Linux 1220 (RHEL3, 2x 866MHz Pentium 3, 1GB RAM, 100Mbit/sec Ethernet)
Unix dd, globus-url-copy & uberftp: 5-7MB/sec with 128KB I/O block size
Unix cp: 0.9MB/sec with 4KB I/O block size
Conclusion: Do not use it for data transfers!
45
More Caveats
  • FUSE must be administratively installed
  • Requires root access
  • Difficult if many machines (e.g., batch workers)
  • Easier if it only involves an SE node (i.e., SRM
    gateway)
  • Performance is limited
  • Kernel-FUSE interactions are not cheap
  • CERN's modified FUSE shows very good transfer
    performance
  • Rapid file creation (e.g., tar) is limited
  • Recommend that it be kept away from general users

46
Putting It All Together
[Diagram: a basic xrootd cluster, consisting of a manager node and data server nodes.]
47
Acknowledgements
  • Software Contributors
  • CERN: Derek Feichtinger, Fabrizio Furano, Andreas
    Peters
  • Fermi: Tony Johnson (Java)
  • Root: Gerri Ganis, Bertrand Bellenot
  • SLAC: Jacek Becla, Tofigh Azemoon, Wilko Kroeger
  • Operational Collaborators
  • BNL, INFN, IN2P3
  • Partial Funding
  • US Department of Energy
  • Contract DE-AC02-76SF00515 with Stanford
    University