Title: Scalla Ins & Outs
1. Scalla Ins & Outs
- xrootd / cmsd
- Andrew Hanushevsky
- SLAC National Accelerator Laboratory
- OSG Administrators Workshop
- Stanford University/SLAC
- 13-November-08
- http://xrootd.slac.stanford.edu
2. Goals
- A good understanding of:
  - xrootd structure
  - Clustering & cmsd
  - How configuration directives apply
  - Cluster interconnections
  - How it really works
  - The oss storage system & the CacheFS
  - SRM & Scalla
  - Position of FUSE, xrootdFS, & cnsd
  - The big picture
3. What is Scalla?
- Structured Cluster Architecture for Low Latency Access
  - Low latency access to data via xrootd servers
    - Protocol includes high performance features
  - Structured clustering provided by cmsd servers
    - Exponentially scalable and self-organizing
4. What is xrootd?
- A specialized file server
  - Provides access to arbitrary files
  - Allows reads/writes with offset/length
  - Think of it as a specialized NFS server
- Then why not use NFS?
  - It does not scale well
  - It can't map a single namespace onto all the servers
- All xrootd servers can be clustered to look like one server
5. The xrootd Server
[Diagram: the xrootd process is layered, top to bottom: Process Manager, Clustering Interface, Protocol Implementation, Logical File System, and Physical Storage System; together these make up the xrootd server that the client application talks to.]
6. How Is xrootd Clustered?
- By a management service provided by cmsd processes
  - Oversees the health and name space on each xrootd server
  - Maps file names to the servers that have the file
  - Informs the client, via an xrootd server, about the file's location
  - All done in real time without using any databases
- Each xrootd server process talks to a local cmsd process
  - They communicate over a Unix named (i.e., file system) socket
- Local cmsds communicate with a manager cmsd elsewhere
  - They communicate over a TCP socket
- Each process has a specific role in the cluster
7. xrootd & cmsd Relationships
[Diagram: each xrootd process (clustering interface over the ofs and oss components) pairs with a local cmsd process on the same node; the local cmsd connects to a manager cmsd elsewhere.]
8. How Are The Relationships Described?
- Relationships are described in a configuration file
  - You normally need only one such file for all servers
  - But all servers need such a file
- The file tells each component its role & what to do
  - Done via component-specific directives
  - One line per directive
9. Directives versus Components
[Diagram: each directive prefix targets a component of the xrootd process: xrd. (network layer), xrootd. (protocol), ofs. (open file system), oss. (open storage system); the manager cmsd reads the same file. Example: an xrootd.fslib directive naming XrdOfs.so is placed in the configuration file.]
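A minimal configuration sketch tying these directive prefixes together in one file (the hostname, port, and library path below are illustrative assumptions, not values from the slides):

```
# all.* directives apply to every component, on every node
all.manager x.slac.stanford.edu:1213
all.role server
all.role manager if x.slac.stanford.edu

# Component-specific directives, one per line
# (path to the ofs plug-in is a placeholder)
xrootd.fslib /opt/xrootd/lib/libXrdOfs.so
all.export /atlas
```

The same file can be distributed to every node; the `if` clauses make each process pick out its own role.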
10. Where Can I Learn More?
- Start with "Scalla Configuration File Syntax"
  - http://xrootd.slac.stanford.edu/doc/dev/Syntax_config.htm
- System-related parts have their own manuals
  - Xrd/XRootd Configuration Reference
    - Describes the xrd. and xrootd. directives
  - Scalla Open File System & Open Storage System Configuration Reference
    - Describes the ofs. and oss. directives
  - Cluster Management Service Configuration Reference
    - Describes the cms. directives
- Every manual tells you when you must use the all. prefix
11. The Bigger Picture
[Diagram: a manager node (x.slac.stanford.edu) and two data server nodes (a.slac.stanford.edu and b.slac.stanford.edu), each running an xrootd/cmsd pair.]
Which one do clients connect to?
Note: All processes can be started in any order!
12. Then How Do I Get To A Server?
- Clients always connect to the manager's xrootd
  - Clients think this is the right file server
  - But the manager only pretends to be a file server
  - Clients really don't know the difference
- The manager finds out which server has the client's file
- Then magic happens
13. The Magic Is Redirection!
[Diagram: the client issues open(/foo) to the manager node (x.slac.stanford.edu); the manager cmsd asks "Have /foo?" of nodes a and b; node a answers "I have /foo!"; the manager tells the client "Go to a", and the client re-issues open(/foo) directly to data server node a.slac.stanford.edu.]
14. Request Redirection
- Most requests are redirected to the right server
  - Provides point-to-point I/O
- Redirection for existing files takes a few milliseconds the 1st time
  - Results are cached; subsequent redirection is done in microseconds
- Allows load balancing
  - Many options; see the cms.perf & cms.sched directives
- Cognizant of failing servers
  - Can automatically choose another working server
  - See the cms.delay directive
15. Pause For Some Terminology
- Manager
  - The processes whose assigned role is manager
  - all.role manager
  - Typically this is a distinguished node
- Redirector
  - The xrootd process on the manager's node
- Server
  - The processes whose assigned role is server
  - all.role server
  - This is the end-point node that actually supplies the file data
16. How Many Managers Can I Have?
- Up to eight, but usually you'll want only two
  - Avoids single-point hardware and software failures
- Redirectors automatically cross-connect to all of the manager cmsds
  - Servers automatically connect to all of the manager cmsds
- Clients randomly pick one of the working manager xrootds
  - Redirectors algorithmically pick one of the working cmsds
- Allows you to load balance manager nodes if you wish
  - See the all.manager directive
- This also allows you to do serial restarts
  - Eases administrative maintenance
- The cluster goes into safe mode if all the managers die or if too many servers die
17. A Robust Configuration
[Diagram: two central manager nodes (x.slac.stanford.edu and y.slac.stanford.edu) acting as redirectors, with data server nodes a.slac.stanford.edu and b.slac.stanford.edu connected to both.]
all.role server
all.role manager if x.slac.stanford.edu
all.manager x.slac.stanford.edu:1213
all.role manager if y.slac.stanford.edu
all.manager y.slac.stanford.edu:1213
18. How Do I Handle Multiple Managers?
- Ask your network administrator to:
  - Assign the manager IP addresses to a common host name
    - xy.domain.edu maps to x.domain.edu, y.domain.edu
  - Make sure that DNS load balancing does not apply!
- Use xy.domain.edu everywhere instead of x or y
  - root://x.domain.edu,y.domain.edu// becomes root://xy.domain.edu//
  - The client will choose one of x or y
- In the configuration file do one of the following
19. A Quick Recapitulation
- The system is highly structured
  - Server xrootds provide the data
  - Manager xrootds provide the redirection
  - The cmsds manage the cluster
    - They locate files and monitor the health of all the servers
- Clients initially contact a redirector
  - They are then redirected to a data server
- The structure is described by the config file
  - Usually the same one is used everywhere
20. Things You May Want To Do
- Automatically restart failing processes
  - Best done via a crontab entry running a restart script
  - Most people use root, but you can use the xrootd/cmsd's uid
- Renice server cmsds
  - As root: renice -n -10 -p cmsd_pid
  - Allows the cmsd to get CPU even when the system is busy
  - Can be automated via the start-up script
  - One reason why most people use root for start/restart
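As a sketch of the restart-and-renice advice above, a root crontab entry such as the following could drive it (the script path and the 5-minute interval are illustrative assumptions, not from the slides):

```
# Hypothetical root crontab fragment: every 5 minutes, run a script
# that restarts any dead xrootd/cmsd and renices the cmsd
# (e.g., renice -n -10 -p <cmsd_pid>)
*/5 * * * * /opt/xrootd/etc/xrootd-restart.sh
```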
21. Things You Really Need To Do
- Plan for log and core file management
  - /var/adm/xrootd/core & /var/adm/xrootd/logs
  - Log rotation can be automated via command line options
- Override the default administrative path
  - See the all.adminpath directive
  - This is where the Unix named sockets are created
  - /tmp is the (bad) default; consider using /var/adm/xrootd/admin
- Plan on configuring your storage space & SRM
  - These are xrootd-specific ofs & oss options
  - SRM requires you to run FUSE, cnsd, and BestMan
22. Server Storage Configuration
- The questions to ask:
  - What paths do I want to export (i.e., make available)?
  - Will I have more than one file system on the server?
  - Will I be providing SRM access?
  - Will I need to support SRM space tokens?
23. Exporting Paths
- Use the all.export directive
  - Used by xrootd to allow access to exported paths
  - Used by cmsd to search for files in exported paths
- Many options available
  - r/o and r/w are the two most common
- Refer to the manual
  - Scalla Open File System & Open Storage System Configuration Reference
24. But My Exports Are Mounted Elsewhere!
- Common issue
  - Say you need to mount your file system on /myfs
  - But you want to export /atlas within /myfs
  - What to do?
- Use the oss.localroot directive
  - Only the oss component needs to know about this
  - oss.localroot /myfs
  - all.export /atlas
- Makes /atlas a visible path but internally always prefixes it with /myfs
  - So, open(/atlas/foo) actually opens /myfs/atlas/foo
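The localroot mapping above can be sketched in a few lines. This is an illustrative model of the prefixing behavior, not xrootd code; the function name is hypothetical:

```python
import os.path

def oss_localroot_map(localroot: str, exported_path: str) -> str:
    """Mimic oss.localroot: internally prefix every exported path.

    Illustrative only; the real mapping happens inside the oss component."""
    return os.path.normpath(localroot + exported_path)

print(oss_localroot_map("/myfs", "/atlas/foo"))  # -> /myfs/atlas/foo
```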
25. Multiple File Systems
- The oss allows you to aggregate partitions
  - Each partition is mounted as a separate file system
  - An exported path can refer to all the partitions
- The oss automatically handles this by creating symlinks
  - A file name in /atlas is a symlink to an actual file in /mnt1 or /mnt2
[Diagram: the oss CacheFS — /atlas, the file system holding exported file paths, symlinks into the mounted partitions /mnt1 and /mnt2 that hold the file data.]
oss.cache public /mnt1 xa
oss.cache public /mnt2 xa
all.export /atlas
26. OSS CacheFS Logic Example
- Client creates a new file /atlas/myfile
- The oss selects a suitable partition
  - Searches for space in /mnt1 and /mnt2 using LRU order
  - Creates a null file in the selected partition
  - Let's call it /mnt1/public/00/file0001
- Creates two symlinks
  - /atlas/myfile -> /mnt1/public/00/file0001
  - /mnt1/public/00/file0001.pfn -> /atlas/myfile
- The client can then write the data
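The allocation steps above can be sketched as follows. This is a toy model run in a temporary directory, assuming a single partition and a fixed file-name scheme; the real oss picks partitions by LRU/free space and manages names itself:

```python
import os
import tempfile

def allocate(cache_fs: str, partitions: list, lfn: str, seq: int) -> str:
    """Sketch of oss CacheFS allocation: create a null data file in a
    partition, link the logical name to it, and add a .pfn back pointer."""
    part = partitions[0]                       # real oss selects by LRU order
    data = os.path.join(part, "public", "00", "file%04d" % seq)
    os.makedirs(os.path.dirname(data), exist_ok=True)
    open(data, "w").close()                    # the null file
    lfn_path = os.path.join(cache_fs, lfn.lstrip("/"))
    os.makedirs(os.path.dirname(lfn_path), exist_ok=True)
    os.symlink(data, lfn_path)                 # /atlas/myfile -> /mnt1/...
    os.symlink(lfn_path, data + ".pfn")        # back pointer
    return data

root = tempfile.mkdtemp()
mnt1 = os.path.join(root, "mnt1")
pfn = allocate(root, [mnt1], "/atlas/myfile", 1)
# The logical name resolves to the physical file:
print(os.path.realpath(os.path.join(root, "atlas", "myfile"))
      == os.path.realpath(pfn))  # -> True
```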
27. Why Use The oss CacheFS?
- No need if you can have one file system
  - Use the OS volume manager if you have one and you are not worried about large logical partitions or fsck time
- However,
  - We use the CacheFS to support SRM space tokens
  - Done by mapping tokens to virtual or physical partitions
  - The oss supports both
28. SRM Static Space Token Refresher
- Encapsulates fixed space characteristics
  - Type of space
    - E.g., permanence, performance, etc.
  - Implies a specific quota
  - Using a particular arbitrary name
    - E.g., atlasdatadisk, atlasmcdisk, atlasuserdisk, etc.
- Typically used to create new files
  - Think of it as a space profile
29. Partitions as a Space Token Paradigm
- Disk partitions map well to SRM space tokens
  - A set of partitions embodies a set of space attributes
    - Performance, quota, etc.
  - A static space token defines a set of space attributes
  - Partitions and static space tokens are interchangeable
- We take the obvious step
  - Use oss CacheFS partitions for SRM space tokens
  - Simply map space tokens onto a set of partitions
- The oss CacheFS supports real and virtual partitions
  - So you really don't need physical partitions here
30. Virtual vs. Real Partitions
[Diagram: each oss.cache line maps a virtual partition name to a real partition mount point.]
oss.cache atlasdatadisk /store1 xa
oss.cache atlasmcdisk /store1 xa
oss.cache atlasuserdisk /store2 xa
- Simple two-step process
  - Define your real partitions (one or more)
    - These are file system mount-points
  - Map virtual partitions on top of real ones
    - Virtual partitions can share real partitions
- By convention, virtual partition names equal static token names
  - Yields implicit SRM space token support
31. Space Tokens vs. Virtual Partitions
- Partitions are selected by virtual partition name
  - In the configuration file
- New files are cgi-tagged with the space token name
  - root://host:1094//atlas/mcdatafile?cgroup=atlasmcdisk
  - The default is public
- But space token names equal virtual partition names
  - The file will be allocated in the desired real/virtual partition
oss.cache atlasdatadisk /store1 xa
oss.cache atlasmcdisk /store1 xa
oss.cache atlasuserdisk /store2 xa
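The cgi tagging can be sketched with a small helper. The function is hypothetical, purely to illustrate how a space-token name rides along on the URL as a cgroup tag:

```python
def tag_space_token(url: str, token: str = "public") -> str:
    """Append an SRM space-token (cgroup) CGI tag to a root:// URL.

    Illustrative helper, not part of the xrootd client API."""
    sep = "&" if "?" in url else "?"
    return url + sep + "cgroup=" + token

print(tag_space_token("root://host:1094//atlas/mcdatafile", "atlasmcdisk"))
# -> root://host:1094//atlas/mcdatafile?cgroup=atlasmcdisk
```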
32. Virtual vs. Real Partitions
- Non-overlapping virtual partitions (R = V)
  - A real partition represents a hard quota
  - Implies each space token gets a fixed amount of space
- Overlapping virtual partitions (R ≠ V)
  - The hard quota applies to multiple virtual partitions
  - Implies a space token gets an undetermined amount of space
  - Needs usage tracking and external quota management
33. Partition Usage Tracking
- The oss tracks usage by partition
  - Automatic for real partitions
  - Configurable for virtual partitions
    - oss.usage {nolog | log dirpath}
- Since virtual partitions correspond to SRM space tokens
  - Usage is also automatically tracked by space token
- POSIX getxattr() returns the usage information
  - See the Linux man page
34. Partition Quota Management
- Quotas are applied by partition
  - Automatic for real partitions
  - Must be enabled for virtual partitions
    - oss.usage quotafile filepath
- Currently, quotas are not enforced by the oss
  - POSIX getxattr() returns the quota information
  - Used by FUSE/xrootdFS to enforce quotas
  - Required to run a full-featured SRM
35. The Quota File
- Lists the quota for each virtual partition
  - Hence, also a quota for each static space token
- Simple multi-line format
  - vpname nnnn{k|m|g|t}\n
  - vpnames are in 1-to-1 correspondence with space token names
- The oss re-reads it whenever it changes
- Useful only for FUSE/xrootdFS
  - Quotas need to apply to the whole cluster
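A reader for that format might look like the sketch below. It assumes the straightforward interpretation of the slide's "vpname nnnn{k|m|g|t}" lines (one name/size pair per line, binary unit suffixes); check the oss.usage documentation for the authoritative syntax:

```python
def parse_quota_file(text: str) -> dict:
    """Parse quota-file lines of the form 'vpname nnnn[k|m|g|t]'.

    Illustrative reading of the format described on the slide."""
    scale = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}
    quotas = {}
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        name, size = line.split()
        unit = size[-1].lower()
        if unit in scale:
            quotas[name] = int(size[:-1]) * scale[unit]
        else:
            quotas[name] = int(size)      # bare byte count
    return quotas

print(parse_quota_file("atlasdatadisk 500g\natlasuserdisk 10t"))
# -> {'atlasdatadisk': 536870912000, 'atlasuserdisk': 10995116277760}
```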
36. Considerations
- Files cannot easily be reassigned to other space tokens
  - You must manually move the file across partitions
  - You can always get the original space token name
    - Use the file-specific getxattr() call
- Quotas for virtual partitions are soft
  - Time causality prevents a real hard limit
  - Use real partitions if a hard limit is needed
37. SRM & Scalla: The Big Issue
- Scalla implements a distributed name space
  - Very scalable and efficient
  - Sufficient for data analysis
- SRM needs a single view of the complete name space
- This requires deploying additional components
  - Composite Name Space Daemon (cnsd)
    - Provides the complete name space
  - FUSE/xrootdFS
    - Provides the single view via a file system interface
    - Compatible with all stand-alone SRMs (e.g., BestMan & StoRM)
38. The Composite Name Space
- A new xrootd instance is used to maintain the complete name space for the cluster
  - Only holds the full paths & file sizes, no more
  - Normally runs on one of the manager nodes
- The cnsd needs to run on all the server nodes
  - Captures xrootd name space requests (e.g., rm)
  - Re-issues the request to the new xrootd instance
    - This is the cluster's composite name space
- Composite because each server node adds to the name space
  - There is no pre-registration of names; it all happens on-the-fly
39. Composite Name Space Implemented
[Diagram: the client's opendir() refers to the directory structure maintained at the name space xrootd@myhost:2094; data requests go through the redirector (manager) xrootd@myhost:1094; each data server's cnsd forwards create/trunc, mkdir, mv, rm, and rmdir events to the name space instance.]
ofs.forward 3way myhost:2094 mkdir mv rm rmdir trunc
ofs.notify closew create | /opt/xrootd/bin/cnsd
xrootd.redirect myhost:2094 dirlist
40. Some Caveats
- The name space is reasonably accurate
  - Usually sufficient for SRM operations
  - cnsds do log events to circumvent transient failures
    - The log is replayed when the name space xrootd recovers
    - But, the log is not infinite
- Invariably, inconsistencies will arise
  - The composite name space can be audited
    - This means comparing and resolving multiple name spaces
  - Time consuming in terms of elapsed time
    - But it can happen while the system is running
  - Tools to do this are still under development
    - Consider contributing such software
41. The Single View
- Now that there is a composite cluster name space, we need an SRM-compatible view of it
- The easiest way is to use a file system view
  - BestMan and StoRM actually expect this
- The additional component is FUSE
42. What is FUSE?
- Filesystem in Userspace
  - Implements a file system as a user space program
  - Linux 2.4 and 2.6 only
  - Refer to http://fuse.sourceforge.net/
- Can use FUSE to provide xrootd access
  - Looks like a mounted file system
  - We call it xrootdFS
- Two versions currently exist
  - Wei Yang at SLAC (packaged with VDT)
  - Andreas Peters at CERN (packaged with Castor)
43. xrootdFS (Linux/FUSE/xrootd)
[Diagram: on the client host, the SRM uses the POSIX file system interface; FUSE in the kernel hands requests (e.g., opendir) up to the user-space FUSE/xroot interface and xrootd POSIX client, which contacts the redirector xrootd:1094 on the redirector host.]
You should run cnsd on the servers to capture non-FUSE events
44. SLAC xrootdFS Performance
- Sun V20z: RHEL4, 2x 2.2GHz AMD Opteron, 4GB RAM, 1Gbit/sec Ethernet
- VA Linux 1220: RHEL3, 2x 866MHz Pentium 3, 1GB RAM, 100Mbit/sec Ethernet
- Unix dd, globus-url-copy & uberftp: 5-7MB/sec with 128KB I/O block size
- Unix cp: 0.9MB/sec with 4KB I/O block size
- Conclusion: Do not use it for data transfers!
45. More Caveats
- FUSE must be administratively installed
  - Requires root access
  - Difficult if many machines are involved (e.g., batch workers)
  - Easier if it only involves an SE node (i.e., the SRM gateway)
- Performance is limited
  - Kernel-FUSE interactions are not cheap
  - CERN-modified FUSE shows very good transfer performance
- Rapid file creation (e.g., tar) is limited
  - Recommend that it be kept away from general users
46. Putting It All Together
[Diagram: a basic xrootd cluster — one manager node redirecting clients to multiple data server nodes.]
47. Acknowledgements
- Software Contributors
  - CERN: Derek Feichtinger, Fabrizio Furano, Andreas Peters
  - Fermi: Tony Johnson (Java)
  - Root: Gerri Ganis, Bertrand Bellenot
  - SLAC: Jacek Becla, Tofigh Azemoon, Wilko Kroeger
- Operational Collaborators
  - BNL, INFN, IN2P3
- Partial Funding
  - US Department of Energy
    - Contract DE-AC02-76SF00515 with Stanford University