Learning Apache Apex
Thomas Weise, Munagala V. Ramanath, David Yan, Kenneth Knowles
Updated: 2021-07-02 22:39:10
Cover
Copyright information
Credits
About the Authors
About the Reviewer
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
Introduction to Apex
Unbounded data and continuous processing
Stream processing
Stream processing systems
What is Apex and why is it important?
Use cases and case studies
Real-time insights for Advertising Tech (PubMatic)
Industrial IoT applications (GE)
Real-time threat detection (Capital One)
Silver Spring Networks (SSN)
Application Model and API
Directed Acyclic Graph (DAG)
Apex DAG Java API
High-level Stream Java API
SQL
JSON
Windowing and time
Value proposition of Apex
Low latency and stateful processing
Native streaming versus micro-batch
Performance
Where Apex excels
Where Apex is not suitable
Summary
Getting Started with Application Development
Development process and methodology
Setting up the development environment
Creating a new Maven project
Application specifications
Custom operator development
The Apex operator model
CheckpointListener/CheckpointNotificationListener
ActivationListener
IdleTimeHandler
Application configuration
Testing in the IDE
Writing the integration test
Running the application on YARN
Execution layer components
Installing Apex Docker sandbox
Running the application
Working on the cluster
YARN web UI
Apex CLI
Logging
Dynamically adjusting logging levels
Summary
The Apex Library
An overview of the library
Integrations
Apache Kafka
Kafka input
Kafka output
Other streaming integrations
JMS (ActiveMQ, SQS, and so on)
Kinesis streams
Files
File input
File splitter and block reader
File writer
Databases
JDBC input
JDBC output
Other databases
Transformations
Parser
Filter
Enrichment
Map transform
Custom functions
Windowed transformations
Windowing
Global Window
Time Windows
Sliding Time Windows
Session Windows
Window propagation
State
Accumulation
Accumulation Mode
State storage
Watermarks
Allowed lateness
Triggering
Merging of streams
The windowing example
Dedup
Join
State Management
Summary
Scalability, Low Latency, and Performance
Partitioning and how it works
Elasticity
Partitioning toolkit
Configuring and triggering partitioning
StreamCodec
Unifier
Custom dynamic partitioning
Performance optimizations
Affinity and anti-affinity
Low-latency versus throughput
Sample application for dynamic partitioning
Performance – other aspects for custom operators
Summary
Fault Tolerance and Reliability
Distributed systems need to be resilient
Fault-tolerance components and mechanism in Apex
Checkpointing
When to checkpoint
How to checkpoint
What to checkpoint
Incremental state saving
Incremental recovery
Processing guarantees
Example – exactly-once counting
The exactly-once output to JDBC
Summary
Example Project – Real-Time Aggregation and Visualization
Streaming ETL and beyond
The application pattern in a real-world use case
Analyzing Twitter feed
Top Hashtags
TweetStats
Running the application
Configuring Twitter API access
Enabling WebSocket output
The Pub/Sub server
Grafana visualization
Installing Grafana
Installing Grafana Simple JSON Datasource
The Grafana Pub/Sub adapter server
Setting up the dashboard
Summary
Example Project – Real-Time Ride Service Data Processing
The goal
Datasource
The pipeline
Simulation of a real-time feed using historical data
Parsing the data
Looking up of the zip code and preparing for the windowing operation
Windowed operator configuration
Serving the data with WebSocket
Running the application
Running the application on GCP Dataproc
Summary
Example Project – ETL Using SQL
The application pipeline
Building and running the application
Application configuration
The application code
Partitioning
Application testing
Understanding application logs
Calcite integration
Summary
Introduction to Apache Beam
Introduction to Apache Beam
Beam concepts
Pipelines, PTransforms, and PCollections
ParDo – elementwise computation
GroupByKey/CombinePerKey – aggregation across elements
Windowing, watermarks, and triggering in Beam
Windowing in Beam
Watermarks in Beam
Triggering in Beam
Advanced topic – stateful ParDo
WordCount in Apache Beam
Setting up your pipeline
Reading the works of Shakespeare in parallel
Splitting each line on spaces
Eliminating empty strings
Counting the occurrences of each word
Format your results
Writing to a sharded text file in parallel
Testing the pipeline at small scale with DirectRunner
Running Apache Beam WordCount on Apache Apex
Summary
The Future of Stream Processing
Lower barrier for building streaming pipelines
Visual development tools
Streaming SQL
Better programming API
Bridging the gap between data science and engineering
Machine learning integration
State management
State query and data consistency
Containerized infrastructure
Management tools
Summary