Apache Arrow PowerPoint PPT Presentation

Title: Apache Arrow


1
What Is Apache Arrow?
  • A development platform for in-memory data
  • It has a columnar memory format
  • It provides efficient analytic operations on
    modern hardware
  • Used for in-memory processing
  • Cross-language support
  • Open source / Apache 2.0 license
  • Supports zero-copy reads for lightning-fast data
    access
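
To make the columnar idea concrete, here is a minimal sketch that uses only the Python standard library rather than the Arrow API. The `array` module stands in for a typed, contiguous column buffer, and the field names (`ids`, `prices`) are made up for illustration.

```python
from array import array

# Row-oriented layout: one tuple per record.
rows = [(1, 10.5), (2, 20.0), (3, 7.25)]

# Column-oriented layout: one contiguous, typed buffer per field.
ids = array("q", (r[0] for r in rows))      # 64-bit signed integers
prices = array("d", (r[1] for r in rows))   # 64-bit floats

# An analytic operation reads only the column it needs.
total_price = sum(prices)

# A memoryview slice refers to the same memory; no bytes are copied.
view = memoryview(prices)
last_two = view[1:]
prices[2] = 8.25   # change the underlying buffer; last_two sees the new value
```

Because `last_two` is a view and not a copy, it reflects the update to `prices[2]`. This is the behaviour that "zero-copy" refers to, shown here on a toy scale.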

2
Languages supported
  • Arrow supports many languages
  • C
  • C++
  • C#
  • Go
  • Java
  • JavaScript
  • MATLAB
  • Python
  • R
  • Ruby
  • Rust

3
Open Source Community Support
  • Many open source projects support Arrow
  • Calcite
  • Cassandra
  • Drill
  • Hadoop
  • HBase
  • Ibis
  • Impala
  • Kudu
  • Pandas
  • Parquet
  • Phoenix
  • Spark
  • Storm

4
The problem Arrow tackles
  • Each system has its own internal memory format
  • 70-80% of computation is wasted on serialization
    and deserialization
  • Similar functionality implemented in multiple
    projects
  • Overheads for cross-system communication
  • All systems utilize different memory formats

5
The problem Arrow tackles
  • No shared in-memory data model

6
Arrow solves this problem
  • All systems utilize the same memory format
  • In memory
  • Columnar format
  • Optimized for modern CPUs and GPUs
  • No overhead for cross-system communication
  • Projects can share functionality
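
The contrast between per-system formats and one shared format can be sketched with the standard library alone. This is an analogy, not Arrow code: `producer`, `consume_via_serialization` and `consume_shared` are hypothetical names, and JSON stands in for whatever wire format two systems might otherwise use.

```python
import json
from array import array


def producer():
    """Hypothetical system A: holds a column of 64-bit floats."""
    return array("d", [1.5, 2.5, 4.0])


def consume_via_serialization(column):
    # Different internal formats: encode, ship, decode (two full copies).
    wire = json.dumps(column.tolist())
    return array("d", json.loads(wire))


def consume_shared(column):
    # Both sides agree on the layout: hand over a view of the same memory.
    return memoryview(column)


col = producer()
copied = consume_via_serialization(col)
shared = consume_shared(col)

col[0] = 99.0
copied_first = copied[0]   # the serialized copy is detached from the source
shared_first = shared[0]   # the shared view reads the same buffer
```

The serialized path pays for an encode and a decode on every hand-off, while the shared path passes a reference to memory that both sides already understand.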

7
Arrow solves this problem
  • Arrow shared data model

8
Arrow works with Parquet
  • Arrow is an in-memory format
  • Parquet is designed for disk storage
  • Arrow and Parquet are intended to be used
    together
  • Parquet is a columnar file format
  • Used for data serialization
  • Parquet is a streaming format
  • Data must be decoded from start to end
  • Files are compressed and encoded
  • Means smaller files on disk
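
The split between a compact storage form and a ready-to-use in-memory form can be illustrated roughly as follows. This sketch assumes nothing about Parquet's actual encodings: `zlib` is only a stand-in for "compressed and encoded", and the column contents are invented.

```python
import zlib
from array import array

# In-memory form: fixed-width values, ready to compute on.
column = array("q", [7] * 10_000)

# Storage form: compressed bytes (zlib is a stand-in, not Parquet's encoding).
stored = zlib.compress(column.tobytes())

# Stored bytes cannot be computed on directly; they must be decoded first.
restored = array("q")
restored.frombytes(zlib.decompress(stored))

in_memory_size = len(column) * column.itemsize
on_disk_size = len(stored)
```

The stored form is much smaller but has to be decoded before any analytic work, which is why an on-disk format and an in-memory format complement each other.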

9
Arrow Memory Buffer
  • Arrow supports data adjacency for sequential
    access
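
As a rough illustration of data adjacency, the sketch below packs fixed-width values into one contiguous buffer with the standard `struct` module. It is not the Arrow buffer implementation, and `value_at` is a hypothetical helper.

```python
import struct

values = [3, 1, 4, 1, 5]
ITEM = struct.calcsize("<i")   # 4 bytes per little-endian 32-bit integer

# All values sit next to each other in a single bytes object.
buf = struct.pack(f"<{len(values)}i", *values)


def value_at(buffer, i):
    """Read element i directly from its byte offset (hypothetical helper)."""
    return struct.unpack_from("<i", buffer, i * ITEM)[0]


# A sequential scan walks the buffer front to back, one item after another.
scan_total = sum(v for (v,) in struct.iter_unpack("<i", buf))
third = value_at(buf, 2)
```

Since every element has the same width, element `i` always starts at byte `i * ITEM`, so a scan reads memory strictly in order.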

10
Available Books
  • See Big Data Made Easy
  • Apress Jan 2015
  • See Mastering Apache Spark
  • Packt Oct 2015
  • See Complete Guide to Open Source Big Data
    Stack
  • Apress Jan 2018
  • Find the author on Amazon
  • www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
  • Connect on LinkedIn
  • www.linkedin.com/in/mike-frampton-38563020

11
Connect
  • Feel free to connect on LinkedIn
  • www.linkedin.com/in/mike-frampton-38563020
  • See my open source blog at
  • open-source-systems.blogspot.com/
  • I am always interested in
  • New technology
  • Opportunities
  • Technology based issues
  • Big data integration