Title: SNFS: The design and implementation of a Social Network File System
1SNFS: The design and implementation of a Social
Network File System
- Ch. Kaidos, A. Pasiopoulos, N. Ntarmos,
- P. Triantafillou
- University of Patras
2Shameless plug..
- If interested, please check out
- eXO: Decentralized Autonomous Scalable Social
Networking,
- 5th Conference on Innovative Data Systems
Research (CIDR2011), 2011.
3Social Networks
- Our Take
- Search for
- People (friends, experts, ...)
- Content (books, photos, videos, blogs, websites,
...)
- Form entities (collections)
- Friends-lists, content-libs
- Search for
- entities
- Using previously-formed collections
- SNFS currently provides the foundation for these
Social Networks
4Tagging
- Profiles
- sets of tags describing entities.
- Search for
- based on profiles.
- Ranked retrieval (top-k)
[Figure: a profile shown as a set of tags, Tag 1 through Tag 5]
5Current State
- 5,000,000,000 photos
- 3,000 photos/min (as of September 2010)
- 2,000,000,000 videos served up each day
- (May 2010)
- 600,000,000 monthly active users (January 2011)
- 15,000,000 books (October 2010)
- 130,000,000 by the end of the decade
6Current State
- Need to access published content
- 22,750,000,000 queries in search engines
- 4,000,000,000 queries in YouTube
- 351,000,000 queries in Facebook
- 416,000,000 queries in MySpace
- (U.S. market figures, December 2009)
?
7Current State
How do I provide interesting objects to my users?
How do I find stuff I want?
8Proposal
A content-aware file system for Social
Network Systems
Useful to users...
... And service providers too!
9Previous Work on File Indexing
1991 Semantic File Systems by Gifford
1996 BeFS by Giampaolo and Meurillon, part of
the BeOS
BeOS never had commercial success...
1998 Indexing Service on Windows NT, not needed
at the time
Remnant of the Object File System from the
unmaterialized Cairo project
- Typically
- No ranked retrieval
- No user input (tags)
- No user relationships
10Desktop Searches
2004 Windows Desktop Search, widely popular
2005... Mac OS X's Spotlight, Google Desktop,
Beagle, Strigi, Tracker...
- Typically
- No ranked retrieval?
- No user relationships
- No exploitation of relationships for searching
11Problems
Power tools for power users... But for average
users...
Boolean operators??? SQL-like queries???
12Previous Work on Ranked Retrieval
1968 SMART system by Salton, introduced weights
in retrieval, instead of classical Boolean
retrieval
1975 Vectors and cosine similarity by Salton
1988 Other functions for similarity tested and
evaluated by Salton and Buckley
2003 Fagin proposes and compares several
efficient algorithms for top-k retrieval
13Design
14Design SNFS
- Tags are extracted from each object and stemmed,
and their frequencies are counted
- Each object is associated with a unique id in a
tree
- Weights for each tag and document are calculated
- A tf-idf weighting scheme was chosen
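The weighting step above can be sketched roughly as follows. The slides do not say which tf-idf variant SNFS uses, so this sketch assumes the common form weight = tf * log(N / df); the function name `tf_idf_weights` and the sample tags are hypothetical, and stemming is assumed to have been done already.

```python
import math
from collections import Counter

def tf_idf_weights(objects):
    """objects: dict mapping object id -> list of already-stemmed tags.
    Returns a dict mapping object id -> {tag: tf-idf weight}.
    Assumed variant: weight = tf * log(N / df)."""
    n = len(objects)
    # Document frequency: number of objects in which each tag occurs.
    df = Counter()
    for tags in objects.values():
        df.update(set(tags))
    weights = {}
    for oid, tags in objects.items():
        tf = Counter(tags)  # raw frequency of each tag within this object
        weights[oid] = {t: f * math.log(n / df[t]) for t, f in tf.items()}
    return weights

# Hypothetical sample data: three objects with their stemmed tags.
objects = {
    1: ["beach", "sunset", "beach"],
    2: ["beach", "city"],
    3: ["city", "night"],
}
w = tf_idf_weights(objects)
```

With this variant, a tag that is frequent in one object but rare across the collection gets a high weight, while a tag present in every object gets weight zero.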
15Design SNFS
- Term weight and object ID are stored in an
inverted index
- Each posting list of the index is a B-tree stored
in secondary memory
- The position of the root of each B-tree in the
index is stored in a red-black tree
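As a rough illustration of the index layout, the sketch below builds an in-memory inverted index. It is a simplification under stated assumptions: a plain Python dict stands in for the red-black tree that locates each posting list, and a sorted list stands in for the on-disk B-tree. The name `build_index` and the sample weights are hypothetical.

```python
def build_index(weights):
    """weights: dict mapping object id -> {tag: weight}.
    Returns a dict mapping tag -> posting list of (weight, object id),
    sorted by decreasing weight (ties broken by object id).
    The dict and lists are in-memory stand-ins for the red-black tree
    and the on-disk B-trees described on the slide."""
    index = {}
    for oid, tag_weights in weights.items():
        for tag, wt in tag_weights.items():
            index.setdefault(tag, []).append((wt, oid))
    for postings in index.values():
        postings.sort(key=lambda p: (-p[0], p[1]))
    return index

# Hypothetical per-object tag weights.
weights = {
    1: {"beach": 0.8, "sunset": 1.1},
    2: {"beach": 0.4, "city": 0.4},
    3: {"city": 0.4, "night": 1.1},
}
index = build_index(weights)
```

Keeping each posting list ordered by decreasing weight is what lets the threshold algorithms read the most promising objects first.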
16Design Search and retrieval
- The query is split into terms and stemmed
- The score of each document is calculated using a
threshold algorithm and a tf-idf function
17Threshold Algorithms
Input: posting lists sorted by decreasing weight
NRA (No Random Access) Algorithm
[Figure: three posting lists (t1, t2, t3) of Doc ID /
Score entries, read in parallel at depths 1, 2 and 3;
per-depth thresholds are formed from s1 s2 s3, then
s4 s5 s6, then s7 s8 s9.]
The algorithm halts when no object below the
current top-k can still improve its score enough
to exceed the threshold
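A minimal sketch of NRA, following the textbook formulation by Fagin et al. rather than the SNFS source code, may help make the halting rule concrete. It assumes the aggregate score is a plain sum and that an object missing from a list contributes zero; the function name `nra` and the sample lists are hypothetical.

```python
def nra(lists, k):
    """lists: posting lists, each [(doc_id, score), ...] sorted by
    decreasing score. Uses sorted access only. Returns up to k
    (doc_id, lower-bound score) pairs. Assumes sum aggregation and
    a score of zero for documents absent from a list."""
    m = len(lists)
    seen = {}  # doc_id -> {list index: score seen so far}
    top, worst = [], {}
    max_depth = max((len(pl) for pl in lists), default=0)
    for depth in range(max_depth):
        last = [0.0] * m  # score read from each list at this depth
        for i, pl in enumerate(lists):
            if depth < len(pl):
                doc, s = pl[depth]
                seen.setdefault(doc, {})[i] = s
                last[i] = s
        # Lower bound: known scores only. Upper bound: unknown scores
        # can be at most the last score read from the respective list.
        worst = {d: sum(sc.values()) for d, sc in seen.items()}
        best = {d: worst[d] + sum(last[i] for i in range(m) if i not in sc)
                for d, sc in seen.items()}
        ranked = sorted(seen, key=lambda d: (-worst[d], d))
        top, rest = ranked[:k], ranked[k:]
        if len(top) == k:
            kth = worst[top[-1]]
            unseen_bound = sum(last)  # best possible score of an unseen doc
            if kth >= unseen_bound and all(best[d] <= kth for d in rest):
                break
    return [(d, worst[d]) for d in top]

# Hypothetical posting lists for three query terms.
lists = [
    [("d1", 0.9), ("d2", 0.8), ("d3", 0.3)],
    [("d2", 0.7), ("d3", 0.6), ("d1", 0.1)],
    [("d1", 0.8), ("d2", 0.5), ("d3", 0.2)],
]
result = nra(lists, 1)
```

Note the state NRA has to carry: partial scores and bounds for every object seen so far, recomputed at each depth. This is the bookkeeping cost the comparison with TA is concerned with.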
18Threshold Algorithms
Input: posting lists sorted by decreasing weight
TA (Threshold Algorithm with random accesses)
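[Figure: three posting lists (t1, t2, t3) of Doc ID /
Score entries, read in parallel at depths 1, 2 and 3,
with random accesses to fetch the missing scores of
each newly seen document; per-depth thresholds are
formed from s1 s2 s3, then s4 s5 s6, then s7 s8 s9.]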
The algorithm halts when the threshold drops to or
below the score of the last (k-th) object in the
current top-k
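The TA variant can be sketched in the same style. Again this follows the textbook algorithm, not the SNFS code: it assumes sum aggregation, a zero score for objects missing from a list, and it simulates random access with per-list dictionaries where SNFS would look up the on-disk index. The name `ta` and the sample lists are hypothetical.

```python
def ta(lists, k):
    """lists: posting lists, each [(doc_id, score), ...] sorted by
    decreasing score. Returns up to k (doc_id, full score) pairs.
    Random access is simulated with one dict per list."""
    m = len(lists)
    lookup = [dict(pl) for pl in lists]  # doc_id -> score, per list
    scores = {}  # doc_id -> complete aggregated score
    top = []
    max_depth = max((len(pl) for pl in lists), default=0)
    for depth in range(max_depth):
        last = [0.0] * m  # score read from each list at this depth
        for i, pl in enumerate(lists):
            if depth < len(pl):
                doc, s = pl[depth]
                last[i] = s
                if doc not in scores:
                    # Random access: fetch this doc's score in every list.
                    scores[doc] = sum(t.get(doc, 0.0) for t in lookup)
        threshold = sum(last)  # upper bound for any doc not yet seen
        top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        if len(top) == k and top[-1][1] >= threshold:
            break
    return top

# Hypothetical posting lists for three query terms.
lists = [
    [("d1", 0.9), ("d2", 0.8), ("d3", 0.3)],
    [("d2", 0.7), ("d3", 0.6), ("d1", 0.1)],
    [("d1", 0.8), ("d2", 0.5), ("d3", 0.2)],
]
result = ta(lists, 1)
```

TA keeps only complete scores, so its state is small, but every newly seen object triggers one lookup per list, which is where the extra disk accesses come from.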
19Qualitative Comparison
Criteria compared for NRA and TA: disk accesses,
system calls, state keeping and computation
We expect TA to perform many more slow disk
accesses. Can the cost of NRA's heavier state
keeping and computation outweigh TA's disk accesses?
We implement both, on hard disk and on RAM-disk,
to find out...
20Implementation with FUSE
21Testing
- 4 real-world test sets
- files containing tags from online objects
- index is normally on secondary memory
- RAM-disk used to evaluate the effect of disk
accesses
22Results demanded vs Time
Disk based index
TA
NRA
23Results demanded vs Time
RAM based index
TA
NRA
24Query Terms vs Time
Disk based index
TA
NRA
25Query Terms vs Time
RAM based index
TA
NRA
26Beagle vs NRA
Terms vs time
Results vs time
27Conclusions
- SNFS
- Indexing, storage, and ranked retrieval of
entities in a SN.
- Study of the efficiency of algorithms, using
real-world data and various implementations.
- Competitive performance (e.g. against Beagle).
- Many ways of further expansion
28Future Work
- Expansion for distributed systems and clouds
- Distributed file systems (HDFS)
- Distributed data structures
- Tagging, indexing, and searching for
entity-collections is straightforward, as our
object implementation/abstraction captures
this.
- Establishing entities consisting of relationships
between entities, using advanced tagging, and
searching for these