Title: Toto, We're Not in Kansas Anymore: On Transitioning from Research to the Real World. Mike Carey, Fellow
1 Toto, We're Not in Kansas Anymore: On Transitioning from Research to the Real World
Mike Carey
Fellow - Systems Architecture
Platform Division
carey_at_propel.com
2 Today's Talk
- Background information
- Lessons from the "Road to Propel"
- The UW-Madison years
- The IBM Almaden years
- The Propel (web) years
- Database research in the new millennium
- Maturity brings its own challenges
- Research opportunities in e-commerce
- Some operational recommendations
3 Part One
4 Background Info
- UW-Madison CS Professor (1983-1995)
- Concurrency control algorithms
- Query processing performance
- Main memory databases
- Extensible database systems (Exodus)
- Real-time database systems
- Client-server O-O database systems (Shore)
- Online algorithms, DBMS performance
5 Background Info (cont.)
- IBM Almaden Research Staff Member and Manager (1995-2000)
- Heterogeneous database systems (Garlic)
- Object middleware (Component Broker)
- Object-relational databases (DB2 UDB)
- Propel Platform Engineering Fellow (2000-?)
- Scalable e-commerce infrastructure (Danube)
6 Part Two
- Lessons from the "Road to Propel"
7 UW-Madison Years, Lesson 1: Awareness is key
- Be plugged in to current technologies and issues
- Hardware and OS characteristics
- CPU, memory, disk, and network performance
- Path lengths (e.g., TCP/IP messages)
- DBMS software characteristics
- DBMS internal components
- Layers/calls: SQL, records, pages, ...
- Interactions, e.g., concurrency and recovery
- Application characteristics
- Typical workload characteristics
- What systems can or cannot know (when/how)
8 UW-Madison Years, Lesson 2: Students are the product
- Having industrial impact is a laudable goal, but...
- It's hard (in general) to be fully plugged in
- Details of systems and workloads
- The algorithms may not be the hard part
- More about this shortly
- Students are our biggest accomplishment
- Well-trained students are incredibly valuable
- Systems sense; ability to think, learn, adapt
- I'm extremely proud of my former students!
- That's what I miss the most in industry
9 UW-Madison Years, The wake-up call: A house of cards?
- ACL85: Blindly following colleagues
- Ten years later, some papers still using the same hardware and software parameters
- RTDBS: The blind following the blind?
- We basically stated and then solved these research problems ourselves
- SIGMOD-94: The SIGMOD chair's lunchtime analysis of SIGMOD paper production
- Not clear to me that "most SIGMOD papers in the last ten years" was such a good thing
10 The First Transition: From UW-Madison to IBM Almaden
- Intellectual reasons
- Weary of inventing and then solving problems
- Wanted access to real problems and systems
- Also just needed a change after 12 years
- IBM Almaden reasons
- Terrific environment and colleagues for DB research
- Development from the safety of a research lab
- Personal reasons
- Wanted to have a life again outside work
- Wanted to live in the Bay Area (Silicon Valley)
11 IBM Almaden Years, Context: Extending DB2 UDB
- From 1996-2000, I worked on adding object extensions to SQL and DB2 UDB (V5.2-V7.1)
- Object-relational data model extensions
- Types, OIDs, references, subtables, object views
- Corresponding query language extensions
- Substitutability, path expressions, constraints and triggers, type predicates, sub-table access rules
- System extensions
- Storage and query processing for all of the above
- DB2 UDB work is geographically distributed
- IBM Toronto, Santa Teresa, and Almaden labs
12 IBM Almaden Years, Lesson 1: Products are hard to build
- Products are very different from prototypes
- Someone else wrote the first 1M lines of code
- System has many nooks and crannies
- No one person understands the whole thing
- 100 or so people are working on it with you
- You have to do the other 80-90% of the work
- Testing, code reviews, testing, docs, testing, ...
- System catalogs: no big deal, right?
- The engine is just one aspect of a product
- Import/export, bulk load, control center, visual explain, query tools, design tools, replication, ...
13 IBM Almaden Years, Lesson 1: Products are hard (cont.)
- It's difficult to make some kinds of changes
- Customers already have terabytes of data
- Data migration is a no-no (at least at IBM)
- Catalog migration is a pain and a time sink
- It's not just your own product that's affected
- 3rd-party vendors may also be a factor
- Ex. 1: Physical load utilities (table hierarchies)
- Ex. 2: Logical and physical database design tools
- Market share and standards come into play here
14 IBM Almaden Years, Lesson 2: Adding to a language is hard
- SQL is a 25-year-old language that was never intended to do everything we want it to today
- World was simple: tables, basic retrievals
- Various assumptions made for convenience
- Ex. 1: Sub-queries: scalar- or table-valued?
- Ex. 2: Nulls: inconsistent (e.g., where vs. max)
- SQL changes must be monotonic in nature
- Can't change meaning of existing queries (!)
- Extensions must all peacefully co-exist
- Language is getting full (> 1000 pages)
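The null inconsistency in Ex. 2 can be seen in a few lines. This is a minimal sketch using Python's built-in sqlite3 module. The table `t` and column `x` are hypothetical names, and reading "where vs. max" as "predicates vs. aggregates" is an assumption about what the slide meant, not something the talk spells out.

```python
import sqlite3


def null_demo():
    """Contrast how SQL treats NULL in a WHERE predicate vs. in MAX()."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER)")  # hypothetical table
    conn.executemany("INSERT INTO t VALUES (?)", [(1,), (None,), (3,)])

    # WHERE: comparing with NULL yields "unknown", so the NULL row matches
    # neither the predicate nor its negation.
    gt = conn.execute("SELECT COUNT(*) FROM t WHERE x > 2").fetchone()[0]
    not_gt = conn.execute(
        "SELECT COUNT(*) FROM t WHERE NOT (x > 2)"
    ).fetchone()[0]
    total = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

    # MAX: NULL inputs are silently skipped, yet MAX over zero rows
    # hands a NULL back to the caller.
    max_x = conn.execute("SELECT MAX(x) FROM t").fetchone()[0]
    max_empty = conn.execute(
        "SELECT MAX(x) FROM t WHERE x > 100"
    ).fetchone()[0]

    conn.close()
    return gt, not_gt, total, max_x, max_empty
```

The two WHERE counts do not add up to the table's row count, while MAX quietly ignores the same NULL row. An extension that "fixed" either behavior would change the meaning of existing queries, which is the monotonicity constraint above.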
15 IBM Almaden Years, Lesson 2: Adding is hard (cont.)
- Cool new SQL features are a double-edged sword
- Can add real value for advanced applications
- Consider OLAP, O-R, and temporal extensions
- Different or proprietary = bad?
- To 3rd-party vendors, also to nervous customers
- And, tools may hide them anyway
- Query builders, EJB model, ...
- SQL standardization is an interesting world
- Serious extensions must someday fly with ANSI/ISO
- SQL standard is in some ways a corporate battleground
- Vendors only want the extensions on their radar screen
16 IBM Almaden Years, Lesson 3: Listen to users' needs
- So many features, so little time!
- Potential users help you prioritize your work
- Ex: Sub-table triggers and constraints in DB2
- They also help you make safe initial decisions
- Ex: Internal storage for DB2 table hierarchies
- Potential users can help you see things you might otherwise miss (at least initially)
- Ex 1: Advantages of DB2 user-defined OIDs
- Customers already simulate objects today
- Access to system-generated OID values?
- Object caching and efficient write-back
- Ex 2: DB2 object view functionality
- Virtual table hierarchies, same authorization model
17 The Second Transition: From IBM Almaden to Propel
- Some triggering events
- Working on XML middleware layer for DB2 UDB
- After spending nearly 20 years under the hood
- Almaden manager discussions: connecting to the Valley
- Personal belief that this may be a unique time for CS
- Call (out of the blue) from Steve Kirsch, CEO
- Given a 4-year paid scholarship to e-school
- Chance to learn about
- Using database system technology
- Web and e-commerce applications
- The startup company experience
- Excellent senior team to learn from at Propel
- Unemployment risk low in Silicon Valley today
18 Propel (Web) Years, Context: E-commerce infrastructure
- Propel has three divisions
- E-commerce division
- Amazon-in-a-box product
- OneID division
- Hosted personal information service
- Platform division
- Infrastructure product for the above (and more)
- Platform: scalable 24x7 e-commerce OS
- Online data management, caching, search, message services, deployment, monitoring, ...
19 Propel (Web) Years, Context: E-C infrastructure (cont.)
[Figure: architecture diagram. Labels, top to bottom: Firewall; Load Balancer; a row of Web Servers; a row of App Servers; Propel Platform, comprising Message Service, Admin and Monitoring Service, Caching Service, ERP Service, Order Mgmt Service, Payment Service, and Data Management and Search Service.]
20 Propel (Web) Years, Lesson 1: Standards vs. innovation
- What a marketing person will likely tell you after asking a customer for their input
- Customers want standards-based solutions
- We want DB access via SQL and JDBC
- We want our programmers to use EJBs (J2EE)
- We want to use JSPs for our dynamic pages
- I.e., a typical customer dictionary entry says
- Proprietary: see "bad"
- This poses obvious challenges for innovation!
- Luckily
- XML is also considered standards-based
- Performance, ease of use are still compelling in web-land
21 Propel (Web) Years, Lesson 2: Oracle is a de facto standard
- Talking to dot-coms with Oracle DBAs is an interesting experience for the academic-minded
- Academic point of view
- Whatever, it's just a database system
- Oracle DBA point of view
- Do my Oracle utilities work with your solution?
- Do my Oracle sequences work with your solution?
- You mean it's not Oracle? (said with a whine)
- Again, this poses obvious challenges for innovation (not to mention other DB vendors!)
- Luckily
- Saying "Oracle inside" seems to help
- Oracle is not a cheap or limitless solution
22 Propel (Web) Years, Lesson 3: VCs, dot-coms, and ASPs
- Oracle/Sun/Solaris are to web sites what IBM was to corporate IS departments 15 years ago
- Some VC firms prescribe them to dot-coms
- Some IS departments pre-approve (just) them
- They are a favorite managed stack for ASPs
- Thus, today's technology brakes include
- Corporate and VC comfort zones
- ASP system management expertise
- Developer and DBA skill set availability
23 Part Three
- Database research in the new millennium
24 The DB Field Has Matured: Bringing a new set of challenges
- SQL DB systems are becoming a commodity
- ISVs produce DBMS-independent packages
- Ex: ERP systems (SAP, PeopleSoft, Baan, ...)
- SQL plus ODBC/JDBC is just a given
- New features face a huge uphill battle
- Witness the rate of object-relational adoption
- Hopefully SQL99 will help, but...?
- A SQL DBMS has truly become a component
- Transactional storage for ERP
- On-line data repository for e-commerce
- I.e., just a place to put your data
- So where does that leave our community?
25 The DB Field Has Matured: Bringing new challenges (cont.)
- Interesting questions remain! For example:
- A good component is easy to manage
- DB systems have way too many knobs
- They're virtually impossible to hide as a result
- A good component plugs in well with others
- Better, faster interfaces would be nice
- Cache interaction hooks would be nice
- Workflow hooks would be nice
- (Your application hooks go here)
- XML appears poised for interoperation success
- W3C XML Schema and Query standards coming
- Our community should keep playing a big role
26 The DB Field Has Matured: Bringing new challenges (cont.)
- Interesting questions remain (cont.)
- Major applications are worth studying
- Ex: Kemper, Kossmann, et al. SAP study
- Sources of typical workload info, database characteristics, and feature use (or disuse) info
- Bottom line from a component perspective
- We need to understand how our technologies are being utilized (or not) and respond accordingly
- Ex. 1: Queries with parameter markers
- Ex. 2: SQL's approach to authorization
- Ex. 3: Actual usage-driven interoperation hooks
- And, of course, we must continue to innovate!
- Somehow?!?
27 E-Commerce DB Research: A Propel Perspective
- The Propel Platform: not an app server
- Scalable, 24x7 e-commerce infrastructure
- Array of inexpensive Sun or Intel boxes
- Exploitation of low main memory cost
- High-performance and highly available
- Data management and search capabilities
- Transparent data replication and partitioning
- Caching of page fragments, objects, and data
- Scalable messaging and queuing infrastructure
- Built from best-of-breed components
- XML-enabled (for the future of e-commerce)
- Unified administration and on-line deployment
28 E-Commerce DB Research, Problem 1: Caching
- What to cache and where to cache it?
- Fragments of dynamic HTML pages
- Personalization ruins basic page caching
- Commonly used fragments assured, though
- XML objects used to create HTML fragments
- If applicable, probably less bulky
- Java objects materialized on app servers
- Avoids database re-access cost
- Issues: load balancing, memory duplication
- Database objects accessed on DB server(s)
- Lowers database access cost
- Where: app servers, DB server(s), or both?
29 E-Commerce DB Research, Problem 1: Caching (cont.)
- How to keep caches consistent?
- Multiple web servers and app servers
- DB rows -> Java objects -> XML -> HTML
- How to uniquely identify objects?
- How to keep track of what's where?
- How to keep track of data dependencies?
- How/when to propagate updates?
- How to maintain consistency?
- In fact, how to define consistency?
- And, just to up the ante a bit further
- Want all this to work across continents!
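The dependency-tracking question above (DB rows -> Java objects -> XML -> HTML) can be made concrete with a small sketch. This is an illustrative, single-process toy, not Propel's design: the class `DependencyCache`, its method names, and the key format are all hypothetical. It covers only "what was derived from what" and leaves out propagation across servers or continents.

```python
from collections import defaultdict, deque


class DependencyCache:
    """Toy cache that records which entries were derived from which."""

    def __init__(self):
        self.values = {}                    # key -> cached value
        self.dependents = defaultdict(set)  # key -> keys derived from it

    def put(self, key, value, derived_from=()):
        """Cache a value and remember the keys it was computed from."""
        self.values[key] = value
        for parent in derived_from:
            self.dependents[parent].add(key)

    def get(self, key):
        return self.values.get(key)

    def invalidate(self, key):
        """Evict key and, transitively, everything derived from it.

        Returns the evicted keys in breadth-first order. The starting key
        may be an uncached source (e.g. a database row identifier).
        """
        evicted = []
        seen = {key}
        queue = deque([key])
        while queue:
            current = queue.popleft()
            if current in self.values:
                del self.values[current]
                evicted.append(current)
            for child in self.dependents.pop(current, ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return evicted
```

For example, after caching an object derived from `"row:42"`, an XML document derived from that object, and an HTML fragment derived from the XML, calling `invalidate("row:42")` evicts all three and leaves unrelated fragments alone. Eager eviction is only one answer to "how/when to propagate updates"; a version-stamp or time-to-live scheme would trade freshness for less invalidation traffic.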
30 E-Commerce DB Research, Problem 2: Consistency and transactions
- Not all e-commerce data is equally valuable
- Want to trade off reliability and performance
- Products: hot, may be read-only once deployed
- Shopping carts: read/write, best-effort durability
- Orders: also read/write, require full durability
- Similar considerations arise w.r.t. consistency
- Would like well-defined choices available
- Auctions: okay to bid using slightly outdated info
- Orders: real-time inventory requires transactions
- Need good, architecturally appropriate solutions
- Caching, replication, failover, smart load balancing, ...
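One way to picture "well-defined choices" for durability is a per-data-class policy table. The sketch below is a hedged illustration only: the `Durability` levels, the `POLICY` mapping, and the `TieredStore` class are hypothetical names, and the in-memory `log` list merely stands in for a real durable log.

```python
from enum import Enum


class Durability(Enum):
    NONE = "reload from the source of record"
    BEST_EFFORT = "buffered, made durable later"
    FULL = "logged before the write is acknowledged"


# Hypothetical policy following the three examples on the slide.
POLICY = {
    "product": Durability.NONE,
    "cart": Durability.BEST_EFFORT,
    "order": Durability.FULL,
}


class TieredStore:
    """Toy store that picks a durability path per kind of data."""

    def __init__(self, policy):
        self.policy = policy
        self.memory = {}   # (kind, key) -> value, lost on a crash
        self.pending = []  # best-effort writes not yet durable
        self.log = []      # stand-in for a durable log

    def write(self, kind, key, value):
        self.memory[(kind, key)] = value
        level = self.policy[kind]
        if level is Durability.FULL:
            self.log.append((kind, key, value))      # pay the cost now
        elif level is Durability.BEST_EFFORT:
            self.pending.append((kind, key, value))  # pay it later

    def flush(self):
        """Background step: make buffered best-effort writes durable."""
        self.log.extend(self.pending)
        self.pending.clear()

    def crash_and_recover(self):
        """Simulate a crash: only what reached the log survives."""
        self.pending.clear()
        self.memory = {(kind, key): value for kind, key, value in self.log}
```

With this policy, a crash right after writing a cart and an order keeps the order and loses the cart, which is the trade the slide describes. The same table-driven idea could carry a consistency level per data class as well.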
31 E-Commerce DB Research, Problem 3: Queries and search
- W3C's XML Schema recommendation
- How to store richly typed XML data?
- Sparse/variant data, repeating elements, subtyping, text, ...
- Would like to map it into (object-?) relational databases
- W3C's XML Query recommendation
- How to process XML queries efficiently?
- SQL-appropriate processing model
- Pushdown and other optimizations
- How to handle search-oriented queries?
- Want transaction-consistent text indexing
- Also want relevance ranking and various IR goodies
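As a rough illustration of mapping XML into relational tables, the sketch below "shreds" a document into a generic node table plus an attribute table, then answers a path-style question with a SQL join. It uses only Python's standard `xml.etree.ElementTree` and `sqlite3` modules. The table names `node` and `attr`, the functions `shred` and `child_texts`, and the sample document are all hypothetical. This generic mapping handles sparse data and repeating elements but throws away the rich typing that XML Schema provides, so it is one simple option rather than an answer to the research question.

```python
import sqlite3
import xml.etree.ElementTree as ET


def shred(xml_text):
    """Load an XML document into two generic relational tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent INTEGER, "
        "tag TEXT, text TEXT)"
    )
    conn.execute("CREATE TABLE attr (node INTEGER, name TEXT, value TEXT)")

    def insert(elem, parent_id):
        # Store whitespace-only text as NULL so sparse data stays sparse.
        text = (elem.text or "").strip() or None
        cur = conn.execute(
            "INSERT INTO node (parent, tag, text) VALUES (?, ?, ?)",
            (parent_id, elem.tag, text),
        )
        node_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO attr VALUES (?, ?, ?)",
            [(node_id, name, value) for name, value in elem.attrib.items()],
        )
        for child in elem:  # repeating elements simply become more rows
            insert(child, node_id)

    insert(ET.fromstring(xml_text), None)
    return conn


def child_texts(conn, parent_tag, child_tag):
    """Path-style lookup (parent_tag/child_tag) expressed as a SQL join."""
    rows = conn.execute(
        "SELECT c.text FROM node p JOIN node c ON c.parent = p.id "
        "WHERE p.tag = ? AND c.tag = ? ORDER BY c.id",
        (parent_tag, child_tag),
    )
    return [text for (text,) in rows]
```

Given a catalog in which one product has two `keyword` children and another has none, `child_texts(conn, "product", "keyword")` returns the two keywords in document order. Every extra path step costs another self-join, which hints at why an efficient, SQL-appropriate processing model for XML queries is listed as an open problem.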
32 E-Commerce DB Research, Problem 4: Content management
- E-commerce sites are rich in content
- HTML fragments (e.g., logos and other goodies)
- Images (e.g., pictures of products)
- Text (e.g., descriptions of products)
- Database data (e.g., product attributes, pricing)
- JSP pages (e.g., a product page)
- Personalization rules (i.e., what to show me)
- Business logic (i.e., Java code)
- Data -> object mappings (e.g., Java classes)
- And the list goes on...
33 E-Commerce DB Research, Problem 4: Content mgmt. (cont.)
- This poses a number of problems
- Versioning of file-based artifacts
- Not unlike CAD or document versioning
- Multiple editors working on the content base
- Several companies do this (e.g., Interwoven)
- Versioning of DB-based artifacts
- Not clear how to handle and integrate this part
- No winning solutions out there yet (that I know of)
- Versioning of code-based artifacts
- How to keep all this stuff mutually consistent?
- And, how to deploy online in a 24x7 world?
34 E-Commerce DB Research, Problem 5: The sun never sets anymore
- The web brings a clear need for 24x7 solutions
- Asynchronous replication techniques
- Online schema evolution (w/replication)
- Online data loading and deployment
- Online management of rolling history data
- Design for administration/monitoring is also key
- Online backup/restore
- Failure and performance monitoring
- Would like system to be self-tuning and self-scaling
- Reassign boxes between services as needed
- Even give and take boxes from ASP infrastructure
35 The Propel Platform: We're attacking all of these issues
- Programming model
- Objects with (truly!) universal OIDs
- Java classes, derived from XML Schema objects
- Caching
- Multilevel cache hierarchy (w/partitioning)
- Mini-caches, global cache, MM-DBMS, DB-DBMS
- Consistency and transactions
- Can trade off ACID-ity vs. performance
- Queries and search
- XML-based query language, integrated search
- Transparency for cached, partitioned, replicated data
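The multilevel cache hierarchy can be sketched as a read-through lookup that falls from the fastest tier to the slowest. This is a guess at the general shape only: the slide names the tiers but not how they interact, so the class `MultiLevelCache`, the tier names, and the promote-on-hit rule are all assumptions. Plain dicts stand in for each tier and for the backing database.

```python
class MultiLevelCache:
    """Toy read-through lookup over an ordered list of cache tiers."""

    def __init__(self, tiers, backing_store):
        self.tiers = tiers                  # list of (name, dict), fastest first
        self.backing_store = backing_store  # dict standing in for the database

    def get(self, key):
        """Return (value, name of the tier that answered)."""
        for depth, (name, tier) in enumerate(self.tiers):
            if key in tier:
                value, source = tier[key], name
                break
        else:
            # Missed every tier: fall through to the database.
            value, source = self.backing_store[key], "db"
            depth = len(self.tiers)
        # Promote the value into every tier faster than the one that hit.
        for _, tier in self.tiers[:depth]:
            tier[key] = value
        return value, source
```

With tiers named `"mini"` and `"global"`, the first read of a key is answered by `"db"` and fills both tiers, and the next read is answered by `"mini"`. Writes, partitioning, and invalidation across tiers are deliberately left out of this sketch.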
36 The Propel Platform: We're attacking all of these issues (cont.)
- Platform message system
- Truly scalable IPC for Platform components
- Hides a number of painful details
- Load balancing and failover
- System monitoring
- Also supports persistent message queueing
- Content management
- Currently focused on deployment problems
- Partnering (sigh!) for content management
- System monitoring and administration
- Separate software stack with agents everywhere
- JSP-based console to oversee and integrate activities
37 Conclusion: Lessons from the "Road to Propel"
- UW-Madison lessons: Know what matters!
- Awareness is key
- Students are the product
- IBM Almaden lessons: What's really hard?
- Products are hard to build
- Adding to a language is hard
- Listen to users' needs
- Propel lessons: Commoditization brings roadblocks.
- Standards vs. innovation
- Oracle is a de facto standard
- Dot-coms, VCs, and ASPs
38 Conclusion: DB research in the new millennium
- SQL databases are becoming commodity parts
- ISVs strive for DBMS vendor-independence
- This makes (visible) innovation hard
- Lots of interesting research questions, though
- Component hooks, usage scenarios, XML, ...
- E-commerce problems are ripe for the picking
- Examples that have arisen at Propel include
- Caching, transactions and consistency
- Queries and search
- Content management
- Online everything for a 24x7 world
39 Conclusion: Some operational recommendations
- Understand the real problems out there
- Industrial friends can be very helpful
- Your students will benefit tremendously
- So will the companies who hire them
- Recognize that commoditization is happening
- Consider working within the constraints that it brings
- Many important open problems remain
- E-commerce is a fun/interesting example here
- Also keep in mind what really matters
- It's actually not any of this stuff, in the end!