Apache Kafka

Apache Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design that enables Kafka to achieve very high throughput and very low latencies.

First let's review some basic messaging terminology:

  • Kafka maintains feeds of messages in categories called topics.
  • We'll call processes that publish messages to a Kafka topic producers.
  • We'll call processes that subscribe to topics and process the feed of published messages consumers.
  • Kafka runs as a cluster of one or more servers, each of which is called a broker.

So, at a high level, producers send messages over the network to the Kafka cluster, which in turn serves them up to consumers.

Communication between the clients and the servers is done with a simple, high-performance, language-agnostic TCP protocol.
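The terminology above can be made concrete with a small in-memory model. This is only an illustrative sketch: `ToyBroker`, `publish`, and `fetch` are hypothetical names, not the Kafka client API, and the model ignores partitioning, replication, and the network entirely. It shows only the idea of a topic as an append-only log that consumers read by offset.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a single broker; not the Kafka client API."""

    def __init__(self):
        # topic name -> append-only list of messages (the "commit log")
        self.logs = defaultdict(list)

    def publish(self, topic, message):
        """Append a message to the topic's log and return its offset."""
        log = self.logs[topic]
        log.append(message)
        return len(log) - 1

    def fetch(self, topic, offset):
        """Return all messages at or after the given offset."""
        return self.logs[topic][offset:]


broker = ToyBroker()

# A producer publishes two messages to a hypothetical topic.
broker.publish("page-views", "user1 viewed /home")
broker.publish("page-views", "user2 viewed /docs")

# A consumer tracks its own position in the log and reads from there.
position = 0
messages = broker.fetch("page-views", position)
position += len(messages)
```

Because the log is never modified in place, a second consumer could start from offset 0 and see the same messages independently of the first.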

Authorization in Apache Kafka

Starting with the 0.9.0.0 release, Apache Kafka has various security features built in, such as encryption and authentication using SSL, authentication using SASL, Apache Zookeeper authentication, quotas, and authorization. Because there are various ways one might want authorization done, Kafka allows different authorization implementations to be plugged in. By default, Apache Kafka comes with a Zookeeper-based authorization implementation, which uses Zookeeper to store ACLs. Note that Kafka already has a hard dependency on Zookeeper for configuration management and leader election, so Zookeeper is not an additional requirement for using the out-of-the-box Zookeeper-based authorization. Though it is nice that the out-of-the-box authorization implementation in Apache Kafka has no dependency on an additional external system, it has quite a few shortcomings.

  • Only supports the User principal type, so one has to create an ACL for each and every user of a Kafka cluster, for each resource they need access to. This could be a huge operational concern for enterprises or clusters with a large number of users.
  • No way to use user-group mappings from external services such as LDAP or AD. Quite often organizations already have some sort of user-group mapping service, and redefining those mappings just for authorization in Apache Kafka is probably not the best idea.
  • Very Kafka-specific implementation. It is not ideal to have separate authorization entities for each component in a data pipeline. This makes authorization hard to manage, and it gets worse as the number of users or the pipeline's complexity grows.
  • Zookeeper-based Kafka authorization stores ACLs under zNodes in Zookeeper as JSON strings. zNodes have size limitations (the recommended size is only 1MB), and because ACLs need to be created for each and every user, the JSON strings can easily grow beyond a zNode's recommended size. This does not scale.
  • Many concurrency issues have been found and fixed lately, but the implementation is not battle-tested and is definitely not production-ready.
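To get a rough feel for the per-user ACL and zNode size concerns listed above, the following sketch builds a JSON string with one ACL entry per user and operation for a single resource and compares its size with the 1MB figure. The JSON field names, the user names, and the choice of three operations are assumptions made for illustration and may not match what Kafka actually writes to Zookeeper.

```python
import json

ZNODE_LIMIT_BYTES = 1024 * 1024  # the 1MB figure mentioned above


def acl_json_for_resource(users, operations):
    """Build one JSON string holding an ACL entry per (user, operation) pair.

    The field names here are an assumption made for illustration.
    """
    acls = [
        {
            "principal": f"User:{user}",
            "permissionType": "Allow",
            "operation": operation,
            "host": "*",
        }
        for user in users
        for operation in operations
    ]
    return json.dumps({"version": 1, "acls": acls})


# Hypothetical cluster: 5000 users, each needing three operations on one topic.
users = [f"user{i:05d}" for i in range(5000)]
operations = ["Read", "Write", "Describe"]

payload = acl_json_for_resource(users, operations)
size_bytes = len(payload.encode("utf-8"))
entry_count = len(users) * len(operations)

print(f"{entry_count} ACL entries, {size_bytes} bytes, "
      f"over limit: {size_bytes > ZNODE_LIMIT_BYTES}")
```

The exact byte count depends on the assumed layout, but the growth is linear in the number of users, which is the point of the bullet above.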

Apache Sentry

Apache Sentry is a system for enforcing fine-grained, role-based authorization. Role-based access control (RBAC) is a powerful mechanism for managing authorization for a large set of users and data objects in a typical enterprise. Apache Sentry allows various systems to integrate with it to use its generic and powerful authorization. Many systems, such as Hive, Impala, HDFS, and Sqoop, are already capable of using Apache Sentry for authorization. It is also capable of getting user-group mappings from external systems such as LDAP and AD. All the shortcomings of the out-of-the-box Zookeeper-based authorization implementation in Apache Kafka can be addressed by choosing Apache Sentry to provide authorization in Apache Kafka as well.
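The role-based idea can be sketched as a lookup chain from user to groups, groups to roles, and roles to privileges. This is a minimal illustration under stated assumptions, not Sentry's API: the dictionaries, the `is_allowed` function, and all user, group, role, and topic names are hypothetical, and a real deployment would obtain the user-group mapping from an external service rather than a hard-coded table.

```python
# Hypothetical data: in a real deployment the user-to-group mapping would come
# from an external service such as LDAP, and roles would be managed in Sentry.
USER_GROUPS = {
    "alice": {"analysts"},
    "bob": {"analysts", "etl"},
}
GROUP_ROLES = {
    "analysts": {"topic_reader"},
    "etl": {"topic_writer"},
}
ROLE_PRIVILEGES = {
    "topic_reader": {("topic", "page-views", "read")},
    "topic_writer": {("topic", "page-views", "write")},
}


def is_allowed(user, resource_type, resource_name, action):
    """Resolve user -> groups -> roles -> privileges and check for a match."""
    wanted = (resource_type, resource_name, action)
    for group in USER_GROUPS.get(user, set()):
        for role in GROUP_ROLES.get(group, set()):
            if wanted in ROLE_PRIVILEGES.get(role, set()):
                return True
    return False


alice_can_read = is_allowed("alice", "topic", "page-views", "read")
alice_can_write = is_allowed("alice", "topic", "page-views", "write")
bob_can_write = is_allowed("bob", "topic", "page-views", "write")
```

With this shape, granting a new user read access means adding them to a group in the existing directory service; no per-user ACL has to be created.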

Starting with the 1.7.0 release, Apache Sentry has a Kafka binding that can be used to enable authorization in Apache Kafka with Apache Sentry. The following sections describe how to configure Apache Kafka to use Apache Sentry for authorization and provide a quick-start guide.

Configuring Apache Kafka to use Apache Sentry for Authorization

To enable authorization in Apache Kafka and use Apache Sentry for authorization, follow these steps.

  • Add the required Sentry jars to Kafka's classpath.
  • Add the following configs to the Kafka broker's properties file.
    • authorizer.class.name=org.apache.sentry.kafka.authorizer.SentryKafkaAuthorizer
    • sentry.kafka.site.url=file:<path to SENTRY-SITE.XML> // file with information on how to connect to the Sentry server
    • sentry.kafka.principal.hostname=<HOST> // host of the Kafka broker, required to perform Kerberos authentication with the Sentry server
    • sentry.kafka.kerberos.principal=<KAFKA_PRINCIPAL> // Kerberos principal of the user running Kafka, required to perform Kerberos authentication with the Sentry server
    • sentry.kafka.keytab.file=<KAFKA_KEYTAB_FILE> // keytab file of the user running Kafka, required to perform Kerberos authentication with the Sentry server
  • Add super users.
    • super.users=<semicolon-separated list of users in the form User:<USERNAME1>;User:<USERNAME2>> // these users can perform any action on any resource in the Kafka cluster

It is recommended that the user running the Kafka broker processes be a super user. This avoids having every inter-broker communication checked against Sentry, which might have a huge performance impact.
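As a sketch of how the settings above fit together, the following helper renders the Sentry-related lines of a broker properties file. The property keys and the authorizer class name are taken from the list above; the helper names and every value passed in (paths, host name, Kerberos principal, user names) are hypothetical placeholders that would need to be replaced with values from the actual environment.

```python
def super_users_value(usernames):
    """Join usernames into the semicolon-separated User:<name> form."""
    return ";".join(f"User:{name}" for name in usernames)


def render_broker_properties(site_path, host, principal, keytab, super_users):
    """Return the Sentry-related lines for a Kafka broker properties file."""
    settings = {
        "authorizer.class.name":
            "org.apache.sentry.kafka.authorizer.SentryKafkaAuthorizer",
        "sentry.kafka.site.url": f"file:{site_path}",
        "sentry.kafka.principal.hostname": host,
        "sentry.kafka.kerberos.principal": principal,
        "sentry.kafka.keytab.file": keytab,
        "super.users": super_users_value(super_users),
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items())


# All values below are placeholders chosen for illustration only.
properties_text = render_broker_properties(
    site_path="/etc/sentry/conf/sentry-site.xml",
    host="broker1.example.com",
    principal="kafka/broker1.example.com@EXAMPLE.COM",
    keytab="/etc/security/keytabs/kafka.keytab",
    super_users=["kafka", "admin"],
)
print(properties_text)
```

Listing the broker's own user (here the placeholder `kafka`) among the super users follows the recommendation above.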

 

Quick Start

Performance Comparison

Future Work
