Apache Gobblin

About This Presentation

Title: Apache Gobblin

Description: This presentation gives an overview of the Apache Gobblin project. It explains Apache Gobblin in terms of its architecture, its data sources and sinks, and its work unit processing. It closes with links for further information and for connecting with the author.

Slides: 15
Provided by: semtechs

Transcript and Presenter's Notes

1
What Is Apache Gobblin?
  • A big data integration framework
  • To simplify integration issues like
    • Data ingestion
    • Replication
    • Organization
    • Lifecycle management
  • For streaming and batch
  • An Apache incubator project

2
Gobblin Execution Modes
  • Gobblin has a number of execution modes
  • Standalone
    • Run on a single box / JVM / embedded mode
  • MapReduce
    • Run as a MapReduce application
  • YARN / Mesos (proposed?)
    • Run on a cluster via a scheduler, supports HA
  • Cloud
    • Run on AWS / Azure, supports HA

3
Gobblin Sinks/Writers
  • Gobblin supports the following sinks
  • Avro HDFS
  • Parquet HDFS
  • HDFS byte array
  • Console (StdOut)
  • Couchbase
  • HTTP
  • JDBC
  • Kafka

4
Gobblin Sources
  • Gobblin supports the following sources
  • Avro files
  • File copy
  • Query based
  • REST API
  • Google Analytics
  • Google Drive
  • Google Webmaster
  • Hadoop text input
  • Hive Avro to ORC
  • Hive compliance purging
  • JSON
  • Kafka
  • MySQL
  • Oracle
  • Salesforce
  • FTP / SFTP
  • SQL Server
  • Teradata
  • Wikipedia

5
Gobblin Architecture

6
Gobblin Architecture
  • A Gobblin job is built on a set of pluggable constructs
    • Which are extensible
  • A job is a set of tasks, each created from a work unit
  • The work unit serves as a container at runtime
  • Tasks are executed by the Gobblin runtime
    • On the chosen deployment, e.g. MapReduce
  • The runtime handles scheduling, error handling, etc.
  • Utilities handle metadata, state, metrics, etc.
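
The relationship between work units, tasks and the runtime can be sketched in a few lines of Python. This is only an illustration of the idea, not Gobblin code: `WorkUnit`, `run_task`, `run_job` and the state labels are hypothetical names, and a thread pool stands in for the chosen deployment.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class WorkUnit:
    """Hypothetical container describing one slice of the source data."""
    name: str
    records: list


def run_task(work_unit):
    """One task processes one work unit; here it only counts the records."""
    if not work_unit.records:
        raise ValueError(f"work unit {work_unit.name} is empty")
    return len(work_unit.records)


def run_job(work_units, max_workers=2):
    """Stand-in for the runtime: schedule tasks and record a state per task."""
    states = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {wu.name: pool.submit(run_task, wu) for wu in work_units}
        for name, future in futures.items():
            try:
                states[name] = ("SUCCESSFUL", future.result())
            except Exception as exc:
                # Error handling belongs to the runtime, not to the task.
                states[name] = ("FAILED", str(exc))
    return states


units = [WorkUnit("part-0", [1, 2, 3]), WorkUnit("part-1", [])]
task_states = run_job(units)
```

One failing task does not stop the others; the runtime stand-in simply records its state alongside the successful ones.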

7
Gobblin Job

8
Gobblin Job
  • Acquire the job lock (optional), to stop the next job instance from starting
  • Create source instance
  • Create tasks from the source's work units
  • Launch and run tasks
  • Publish data if OK to do so
  • Persist the job/task states into the state store
  • Clean up temporary work data
  • Release the job lock (optional)
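
The job steps above can be walked through in a small, self-contained Python sketch. Everything here is an assumption made for illustration: the lock is a plain file, the state store is a dictionary, the "source" is a list of lists, and the function name and state labels are hypothetical rather than Gobblin's own.

```python
import shutil
import tempfile
from pathlib import Path


def run_job_once(job_name, work_units, work_dir, state_store, use_lock=True):
    """Run one job instance, following the steps listed on the slide."""
    work_dir = Path(work_dir)
    lock_file = work_dir / f"{job_name}.lock"
    if use_lock:
        try:
            # Exclusive create fails if an earlier instance still holds the lock.
            lock_file.touch(exist_ok=False)
        except FileExistsError:
            return "SKIPPED"
    staging = Path(tempfile.mkdtemp(prefix="staging-", dir=work_dir))
    try:
        # Create one task per work unit and run it; output goes to staging.
        task_states = {}
        for index, unit in enumerate(work_units):
            try:
                text = "\n".join(record.upper() for record in unit)
                (staging / f"task-{index}.txt").write_text(text)
                task_states[index] = "SUCCESSFUL"
            except Exception:
                task_states[index] = "FAILED"
        # Publish only if every task succeeded.
        if all(state == "SUCCESSFUL" for state in task_states.values()):
            published = work_dir / "published"
            published.mkdir(exist_ok=True)
            for path in staging.iterdir():
                shutil.copy(path, published / path.name)
            job_state = "COMMITTED"
        else:
            job_state = "FAILED"
        # Persist the job and task states.
        state_store[job_name] = {"job": job_state, "tasks": task_states}
        return job_state
    finally:
        shutil.rmtree(staging, ignore_errors=True)  # clean up temporary work data
        if use_lock:
            lock_file.unlink()  # release the job lock


work_dir = Path(tempfile.mkdtemp(prefix="job-sketch-"))
store = {}
result = run_job_once("demo", [["a", "b"], ["c"]], work_dir, store)
```

Writing to a staging area first and copying to the published location only when all tasks succeed is what makes the "publish data if OK to do so" step meaningful.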

9
Gobblin Constructs

10
Gobblin Constructs
  • Source partitions data into work units
  • Source creates a data extractor for each work unit
  • Converter converts schema and data records
  • Quality checker checks data at row level and task level
  • Fork operator allows control to flow into multiple streams
  • Writer sends data records to the sink
  • Publisher publishes the job data
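
How the constructs chain together inside a single task can be sketched as plain Python functions. The function names, the "id,name" record format and the 0.5 ratio threshold are all assumptions chosen for this example; Gobblin's real constructs are Java classes with their own interfaces.

```python
def extract(work_unit):
    """Extractor: yield the raw records of one work unit."""
    yield from work_unit


def convert(raw):
    """Converter: turn a raw 'id,name' line into a structured record."""
    record_id, name = raw.split(",")
    return {"id": int(record_id), "name": name.strip()}


def row_level_check(record):
    """Row-level quality check: reject records with an empty name."""
    return bool(record["name"])


def run_task(work_unit, writers):
    """Process one work unit: extract -> convert -> row check -> fork -> write."""
    read = written = 0
    for raw in extract(work_unit):
        read += 1
        record = convert(raw)
        if not row_level_check(record):
            continue
        for write in writers:  # fork: every branch receives the record
            write(record)
        written += 1
    return read, written


def task_level_check(read, written, min_ratio=0.5):
    """Task-level quality check: enough of the rows read were written."""
    return read > 0 and written / read >= min_ratio


sink_a, sink_b = [], []
writers = [sink_a.append, lambda record: sink_b.append(record["name"])]
read, written = run_task(["1,ann", "2, ", "3,bob"], writers)
publish_ok = task_level_check(read, written)
```

Here the second record fails the row-level check, so two of three records reach both sinks, and the task-level check still allows the publisher step to go ahead.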

11
Gobblin Job Configuration
  • Gobblin jobs are configured via configuration files
  • May be named .pull / .job plus .properties
  • Source properties file defines
    • Connection / converter / quality / publisher
  • Job file defines
    • Name / group / description / schedule
    • Extraction properties
    • Source properties
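
As a rough picture of what such a job file holds, the sketch below embeds a small key=value file and groups its keys by prefix. The key names, values, class name and cron expression are illustrative assumptions and should be checked against the Gobblin documentation; the parser is a simplified stand-in that handles only `key=value` lines and comments.

```python
# Illustrative job file; keys and values are assumptions made for this sketch.
JOB_FILE = """\
# Job identity and schedule
job.name=PullFromWikipedia
job.group=Wikipedia
job.description=A getting started example
job.schedule=0 0/30 * * * ?

# Source and extraction properties (hypothetical class and namespace)
source.class=com.example.WikipediaSource
extract.namespace=com.example.wikipedia
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def group_by_prefix(props):
    """Group keys by their first dotted component, e.g. 'job' or 'source'."""
    groups = {}
    for key, value in props.items():
        prefix, _, rest = key.partition(".")
        groups.setdefault(prefix, {})[rest] = value
    return groups


config = group_by_prefix(parse_properties(JOB_FILE))
```

Grouping by prefix mirrors the split on the slide: the `job` group carries name, group, description and schedule, while `source` and `extract` carry the source and extraction properties.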

12
Gobblin Users

13
Available Books
  • See Big Data Made Easy
    • Apress Jan 2015
  • See Mastering Apache Spark
    • Packt Oct 2015
  • See Complete Guide to Open Source Big Data Stack
    • Apress Jan 2018
  • Find the author on Amazon
    • www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
  • Connect on LinkedIn
    • www.linkedin.com/in/mike-frampton-38563020

14
Connect
  • Feel free to connect on LinkedIn
    • www.linkedin.com/in/mike-frampton-38563020
  • See my open source blog at
    • open-source-systems.blogspot.com/
  • I am always interested in
    • New technology
    • Opportunities
    • Technology based issues
    • Big data integration