Status: Draft

Author: Viquar Khan (Vaquar.khan@gmail.com)

Discussion thread: TBD

Jira:


Motivation

The open-source community is seeing a sharp increase in automated pull requests. While code generation tools can help developers, they frequently produce pull requests that look correct on the surface but contain empty boilerplate, unnecessary comments, and hallucinated logic. This puts a heavy burden on maintainers who spend limited review time on code with no real substance.

While this trend is accelerated by AI, the necessity for structural validation applies to all pull requests. Other major Apache projects are already dealing with this. Apache Airflow changed their contribution policies after automated bots started taking over open issues. Apache Iceberg wrote strict new contribution guidelines to protect their reviewers.

Why AGENTS.md is not enough

Adding an AGENTS.md or CLAUDE.md file is a "soft control." AI models use probabilistic reasoning; if their context window fills up or a user overrides the prompt, the agent will silently ignore markdown instructions and submit bad code anyway. We need a deterministic "hard control" that catches these failures before they reach a human reviewer.

I propose adding two focused validation tasks to Kafka's Gradle build, known collectively as the Automated Integrity Validation (AIV) Gate, that catch the two most damaging categories of low-quality contributions: scaffolding-heavy PRs with no real logic, and code that violates Kafka's specific architectural rules. This addresses reviewer overload while preserving Kafka's domain-specific invariants and running entirely locally, so no code leaves the contributor's machine or the CI runner.

Relationship to Existing Checks

To clarify why these new gates are required alongside existing tooling, the following table details how the proposed AIV Gate differs from Kafka's existing Checkstyle configurations:

| Existing Check | What It Enforces | What the Proposed AIV Gate Does Differently |
| --- | --- | --- |
| ImportControl | Restricts which packages can import from which globally. | The Design Gate enforces conditional architectural patterns within a file (e.g., if X is instantiated, it must be closed). |
| Regexp | Globally blocks specific string matches (e.g., System.exit()). | The Design Gate handles complex AST relationships that regex cannot express safely, such as blocking ExecutorService only if KafkaConsumer is also present. |
| CyclomaticComplexity / MethodLength / NPathComplexity | Measures the complexity and size of individual methods. | The Density Gate (LDR) measures the ratio of executable logic to empty scaffolding across the entire PR diff to catch boilerplate inflation. |

Public Interfaces

This KIP introduces no changes to the Kafka protocol, public APIs, client behaviors, or broker metrics. It modifies only the project's internal build tools and CI workflow.

Proposed Changes

I propose adding an AIV Gate using Gradle's composite build feature (includeBuild). The gate is implemented as native Java classes that live inside the Kafka repository, with no external dependencies beyond JavaParser (build-time only, same pattern as Checkstyle/SpotBugs). Developers run the checks locally with ./gradlew checkContributionQuality, and CI runs them automatically as part of ./gradlew check.

Important: These tasks complement, not duplicate, Kafka's existing checks.

Task 1: Logic Density Validation (LDR)

The problem Checkstyle cannot solve: Checkstyle measures the complexity of individual methods. It cannot answer the question: "Did this PR add real logic, or is it 300 lines of scaffolding wrapping 2 lines of actual work?" High-volume, low-effort tools often generate massive PRs with very little executable logic: empty class hierarchies, verbose documentation for trivial getters, and copy-pasted boilerplate.

How it works: The task uses JavaParser to build an Abstract Syntax Tree (AST) of each changed Java file in the PR diff and counts two categories of nodes:

  • Logic nodes (weighted): if, for, while, switch (weight 5); method calls, binary expressions, variable assignments (weight 2).

  • Structure nodes (weighted): class declarations, method declarations (weight 1).

Calculation: LDR is the total weighted score of logic nodes divided by the combined weighted score of logic and structure nodes. To keep the score accurate and hard to game, the AST parser ignores whitespace, comments, and Javadoc entirely. (Note: If a PR diff contains no countable structural nodes in Java files, the LDR check is skipped to prevent division by zero.)
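The calculation can be sketched as follows. This is a minimal illustration, not the proposed DensityAnalyzer: the class name LdrSketch and the method ldr are hypothetical, and the sketch assumes the node counts have already been collected by an AST visitor rather than invoking JavaParser itself.

```java
import java.util.OptionalDouble;

/** Hypothetical helper; names are illustrative and not part of the proposal. */
final class LdrSketch {
    // Weights taken from the node categories described in this section.
    static final int CONTROL_FLOW_WEIGHT = 5; // if, for, while, switch
    static final int SIMPLE_LOGIC_WEIGHT = 2; // method calls, binary expressions, assignments
    static final int STRUCTURE_WEIGHT = 1;    // class and method declarations

    /**
     * Returns logic / (logic + structure), or empty when the check is skipped
     * because the diff contains no countable structural nodes.
     */
    static OptionalDouble ldr(int controlFlowNodes, int simpleLogicNodes, int structureNodes) {
        if (structureNodes == 0) {
            return OptionalDouble.empty(); // skip rule stated in the note above
        }
        int logic = controlFlowNodes * CONTROL_FLOW_WEIGHT
                + simpleLogicNodes * SIMPLE_LOGIC_WEIGHT;
        int structure = structureNodes * STRUCTURE_WEIGHT;
        return OptionalDouble.of((double) logic / (logic + structure));
    }
}
```

For instance, one control-flow node, three assignments, and one method declaration give ldr(1, 3, 1) = 11 / 12, while two method calls against eighteen declarations give ldr(0, 2, 18) = 4 / 22.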

Weighting Rationale:

Control-flow nodes (if, for, while) dictate cyclomatic complexity and represent actual human or algorithmic problem-solving. Class and method declarations are merely structural scaffolding. Weighting logic nodes at 5x ensures we measure the density of the solution, not the structure.

Worked Examples (LDR in Practice):

To understand the provisional threshold of 0.25, consider these three PR profiles using the exact weights defined above:

  1. Legitimate Core Fix (LDR = 0.92): A PR modifying replication logic inside ReplicaManager.java. It adds no new classes and one new private method (weight 1). It contains one if block (weight 5) and three variable assignments (weight 6). Total logic = 11, total structure = 1. LDR = 11 / 12 ≈ 0.92. It passes easily.

  2. Interface/Config Addition (Bypassed): A PR adding a new ConfigDef or Java interface. Files where ≥80% of top-level type declarations are interface or @interface types are dynamically evaluated under a "declarative profile" exception, allowing them to pass regardless of the LDR score. The 80% threshold will be validated during Shadow Mode and adjusted if necessary.

  3. AI Slop/Boilerplate (LDR = 0.18): A PR generating an expansive new module wrapper. It contains 3 new classes (weight 3), 15 empty methods or simple getters (weight 15), massive Javadoc blocks (ignored), and only 2 actual method calls (weight 4). Total logic = 4, total structure = 18. LDR = 4 / 22 = 0.18. It falls below the 0.25 threshold, and the PR is blocked.

Default Thresholds & Exceptions:

  • LDR Threshold Calibration (Java): The provisional threshold is 0.25. The definitive threshold will be finalized by publishing a sensitivity analysis of the 5th percentile distribution across 100-200 historically merged Kafka PRs during the "Shadow Mode" phase.

  • The "Annotation Attack" Protection: AI models sometimes pad code with complex type annotations to "fake" logic density. The logic parser ignores annotations and counts only the executable logic nodes listed above.

  • Refactoring Exception: If a PR has a net-negative line count (e.g., net -50 lines), the density check is skipped entirely so legitimate cleanups are never blocked. Note: The Refactoring Exception applies only to the Logic Density check (Task 1). Design Compliance rules (Task 2) are always enforced regardless of net line count.

  • Language Scope: This initial KIP focuses strictly on Java files via JavaParser. Support for Scala and Kotlin is deferred to a future KIP to keep this proposal focused and rigorous.
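The order in which these exceptions and the threshold apply can be sketched as below. This is an assumption-laden illustration: DensityGateDecision, Outcome, and evaluate are hypothetical names, the sketch treats the PR as a single file for brevity (the proposal applies the declarative profile per file), and both constants are the provisional values from the text.

```java
/** Hypothetical decision helper; the names and the enum are illustrative assumptions. */
final class DensityGateDecision {
    enum Outcome { SKIPPED_REFACTORING, PASSED_DECLARATIVE, PASSED, FAILED }

    static final double LDR_THRESHOLD = 0.25;         // provisional, per the text
    static final int DECLARATIVE_SHARE_PERCENT = 80;  // provisional, per the text

    static Outcome evaluate(int linesAdded, int linesRemoved,
                            int interfaceTypes, int totalTopLevelTypes, double ldr) {
        // Refactoring Exception: net-negative line count skips the density check.
        if (linesAdded - linesRemoved < 0) {
            return Outcome.SKIPPED_REFACTORING;
        }
        // Declarative profile: at least 80% of top-level types are interfaces
        // or annotation types. Integer arithmetic avoids rounding at the boundary.
        if (totalTopLevelTypes > 0
                && interfaceTypes * 100 >= totalTopLevelTypes * DECLARATIVE_SHARE_PERCENT) {
            return Outcome.PASSED_DECLARATIVE;
        }
        // A score below the threshold fails; a score at or above it passes.
        return ldr >= LDR_THRESHOLD ? Outcome.PASSED : Outcome.FAILED;
    }
}
```

Checking the refactoring exception first means a cleanup PR never reaches the ratio comparison, which matches the intent that legitimate cleanups are never blocked by Task 1.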

Task 2: Design Compliance (Kafka Architecture Linter)

The problem Checkstyle cannot solve: While Checkstyle's Regexp module is excellent for globally blocking anti-patterns like System.exit(), it cannot express conditional rules like "if a file uses KafkaConsumer, then it must NOT use ExecutorService."

Governance Risk & Mitigation:

To prevent "rule creep," rules are defined in .validation/design-rules.yaml. Adding or removing a rule requires a standard pull request and lazy consensus on the dev@kafka.apache.org mailing list with a minimum 72-hour review period. A single PMC member can veto a rule addition to prevent bloat. The YAML schema explicitly requires the added-in-version and rationale fields.

Kafka-Specific Rules to Ship With:

  • no-direct-zk-access:

    • Trigger: ZooKeeper, ZkClient, CuratorFramework

    • Forbidden: Instantiation or use. (Applies to all lines added in the PR diff).

    • Rationale: Kafka removed ZK dependency in KRaft mode, but AI tools still routinely generate ZK-based code from old training data.

  • consumer-thread-safety:

    • Trigger: KafkaConsumer, consumer.poll

    • Forbidden: ExecutorService, ThreadPoolExecutor, newFixedThreadPool

    • Rationale: KafkaConsumer is explicitly not thread-safe. AI tools frequently hallucinate and wrap it in thread pools.

  • producer-close:

    • Trigger: KafkaProducer, new KafkaProducer

    • Required: producer.close

    • Rationale: Unclosed producers leak connections and memory. (Note: The checker performs intra-method dataflow analysis using JavaParser's symbol resolution to handle arbitrary variable names like var p = new KafkaProducer, and successfully detects try-with-resources closures).
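The conditional shape of these rules (a trigger activates the rule, then forbidden and required items are checked) can be illustrated with the sketch below. It is a deliberate simplification: it matches against a flat set of identifiers found in a file, whereas the proposal describes AST traversal and intra-method dataflow analysis. The class DesignRuleSketch, its fields, and the factory method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical rule model; a simplification of the AST-based checker described above. */
final class DesignRuleSketch {
    final String name;
    final Set<String> triggers;  // rule applies only if one of these is present
    final Set<String> forbidden; // must not appear once the rule is triggered
    final Set<String> required;  // must appear once the rule is triggered

    DesignRuleSketch(String name, Set<String> triggers,
                     Set<String> forbidden, Set<String> required) {
        this.name = name;
        this.triggers = triggers;
        this.forbidden = forbidden;
        this.required = required;
    }

    /** Illustrative encoding of the consumer-thread-safety rule listed above. */
    static DesignRuleSketch consumerThreadSafety() {
        return new DesignRuleSketch("consumer-thread-safety",
                Set.of("KafkaConsumer"),
                Set.of("ExecutorService", "ThreadPoolExecutor", "newFixedThreadPool"),
                Set.of());
    }

    /** Returns one message per violation; empty if the rule is not triggered or is satisfied. */
    List<String> check(Set<String> identifiersInFile) {
        List<String> violations = new ArrayList<>();
        if (Collections.disjoint(triggers, identifiersInFile)) {
            return violations; // no trigger present, so the rule does not apply
        }
        for (String item : forbidden) {
            if (identifiersInFile.contains(item)) {
                violations.add(name + ": forbidden use of " + item);
            }
        }
        for (String item : required) {
            if (!identifiersInFile.contains(item)) {
                violations.add(name + ": missing required " + item);
            }
        }
        return violations;
    }
}
```

A file that mentions ExecutorService without KafkaConsumer produces no violation under this model, which is the conditional behavior a global Regexp check cannot express.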

Implementation Details

Where the code lives: To prevent compilation errors from breaking the entire Gradle build (a common risk with buildSrc/), this tool will be implemented as an isolated composite build using includeBuild("build-plugins").

build-plugins/
├── build.gradle
└── src/
    ├── main/java/org/apache/kafka/gradle/integrity/
    │   ├── DensityAnalyzer.java           # LDR calculation
    │   ├── DesignComplianceChecker.java   # YAML rule enforcement
    │   ├── ContributionQualityTask.java   # Gradle task entry point
    │   └── DiffParser.java                # Reads GitHub PR.patch natively
    └── test/java/org/apache/kafka/gradle/integrity/
        ├── DensityAnalyzerTest.java
        └── DesignComplianceCheckerTest.java
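As a rough idea of what reading a patch without shelling out to git could involve, the sketch below collects the added lines of each changed Java file from unified-diff text. It is not the proposed DiffParser: the class DiffParserSketch and the method addedJavaLines are hypothetical, and the parsing is intentionally naive (for example, an added line whose content itself begins with "++ " would be misread as a file header).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not the DiffParser proposed in this KIP. */
final class DiffParserSketch {
    /** Maps each changed .java path to the lines the patch adds to it. */
    static Map<String, List<String>> addedJavaLines(String unifiedDiff) {
        Map<String, List<String>> added = new LinkedHashMap<>();
        String currentFile = null;
        for (String line : unifiedDiff.split("\n")) {
            if (line.startsWith("diff --git ")) {
                currentFile = null; // a new file section starts
            } else if (line.startsWith("+++ ")) {
                // Header of the new version of the file, e.g. "+++ b/path/File.java".
                String path = line.substring(4).trim();
                if (path.startsWith("b/")) {
                    path = path.substring(2);
                }
                // Non-Java files (and /dev/null for deletions) are ignored.
                currentFile = path.endsWith(".java") ? path : null;
            } else if (currentFile != null && line.startsWith("+")) {
                added.computeIfAbsent(currentFile, k -> new ArrayList<>())
                        .add(line.substring(1));
            }
        }
        return added;
    }
}
```

Restricting the result to added lines in .java files mirrors the scope stated elsewhere in this KIP: only changed Java code in the PR diff is evaluated.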

Secure Human-in-the-Loop (HITL) Overrides:

  • Secure Emergency Bypass: Adding /aiv skip in any commit message skips all gates, but only if the GitHub Actions actor is verified against the official Apache Gitbox LDAP COMMITTERS list. Bots and external contributors cannot bypass the gate.

  • Auditability: All bypasses are explicitly logged in the GitHub Actions CI step summary for transparency.
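The bypass condition combines two checks, sketched below. The sketch assumes the committer list has already been loaded into a set; how it is fetched and verified against Apache LDAP is not specified here. BypassSketch and isBypassAllowed are hypothetical names.

```java
import java.util.Set;

/** Hypothetical sketch of the bypass check; committer lookup is assumed to happen elsewhere. */
final class BypassSketch {
    static final String SKIP_DIRECTIVE = "/aiv skip";

    /** True only if the directive is present AND the actor is a known committer. */
    static boolean isBypassAllowed(String commitMessage, String actor, Set<String> committers) {
        if (commitMessage == null || actor == null) {
            return false;
        }
        return commitMessage.contains(SKIP_DIRECTIVE) && committers.contains(actor);
    }
}
```

Because both conditions must hold, a bot or external contributor adding the directive to a commit message has no effect.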

Diagnostic Reports (A Mentorship Framework)

Instead of a cryptic failure once Shadow Mode ends, the tool posts a diagnostic summary in the CI output: "This PR's Logic Density is 0.18 (Threshold: 0.25). It appears to be mostly scaffolding. Try moving logic to the internal Config classes to improve density." This provides a clear, mentorship-oriented path to "ready" without bottlenecking legitimate contributions.

Compatibility, Deprecation, and Migration Plan

This proposal does not change the Kafka protocol or public APIs. It adds only internal build logic.

Migration: None. Existing code passes both checks (validated against trunk). The tasks only evaluate changed files in a PR diff, not the entire codebase.

Test Plan

To guarantee zero disruption to developer velocity, this KIP relies on strict unit testing and a phased CI rollout.

1. Unit Tests (JUnit 5 in build-plugins/src/test/):

  • DensityAnalyzerTest:

    • Case: 150 lines of Javadoc with a single return statement -> LDR = 0.05 -> FAILS.

    • Case: 20 lines of branching if/while logic -> LDR = 0.45 -> PASSES.

    • Case: PR adds a pure interface with 10 method signatures -> Declarative Profile -> PASSES.

    • Case: PR removes 100 lines of dead code -> Net LOC negative -> SKIPS check.

  • DesignComplianceCheckerTest:

    • Case: File initializes KafkaConsumer and passes it to Executors.newFixedThreadPool(5) -> FAILS (consumer-thread-safety violation).

    • Case: File initializes KafkaConsumer in a standard single-threaded poll loop -> PASSES.

    • Case: File contains only interface declarations with no trigger patterns -> PASSES (no rules applicable).

2. Shadow Mode (30-Day Data Collection):

Upon merge, the GitHub Action will run with continue-on-error: true. It will log results (Pass/Fail, LDR score, AST violations) to the GitHub Actions step summary for 30 days without blocking PRs. During this time, the PMC will publish a sensitivity analysis of the historical data.

Exit Criteria: The gate will be promoted to blocking status via lazy consensus on the dev list once the false-positive rate is below 2% over a minimum of 50 evaluated PRs.
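The numeric part of the exit criteria can be expressed as a simple check. The sketch assumes the false-positive rate is computed as false positives divided by evaluated PRs, which the text does not define precisely; ShadowModeExitSketch and readyToPromote are hypothetical names, and the lazy-consensus vote remains a human step outside this check.

```java
/** Hypothetical sketch of the numeric exit criteria for Shadow Mode. */
final class ShadowModeExitSketch {
    static final int MIN_EVALUATED_PRS = 50;
    static final int MAX_FALSE_POSITIVE_PERCENT = 2;

    /** True when at least 50 PRs were evaluated and the false-positive rate is below 2%. */
    static boolean readyToPromote(int evaluatedPrs, int falsePositives) {
        if (evaluatedPrs < MIN_EVALUATED_PRS) {
            return false;
        }
        // Integer form of falsePositives / evaluatedPrs < 0.02, avoiding rounding issues.
        return falsePositives * 100 < evaluatedPrs * MAX_FALSE_POSITIVE_PERCENT;
    }
}
```

Note that with exactly 50 evaluated PRs, a single false positive is a rate of 2%, which does not satisfy a strict "below 2%" condition.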

Future Work

To keep this KIP tightly scoped to achievable build validation tasks, the following enhancements are deferred to future proposals:

  • Kotlin and Scala Support: Extending the density and design parsing to cover Kafka's non-Java modules.

  • BOM-Grounding (Supply Chain): A dependency validation phase that cross-references AI-suggested method calls against the current trunk AST to stop "hallucinated" API calls that look like logic but do not exist in the codebase.

  • Context Retrieval Guard: A prompt-engineering standard requiring AI contributors to utilize "Step-Back Prompting" in the PR description to explain the architectural intent before generating code.

Review & Comparison: Industry Solutions

This solution moves beyond the "descriptive" nature of standard AI tools and introduces a "deterministic" layer of governance.

| Capability | Standard Industry Tools (Copilot/CodeRabbit) | Proposed AIV-Gate Solution |
| --- | --- | --- |
| Logic Filtering | LLM-based summary; flags redundancy probabilistically. | LDR (Logic Density Ratio): Deterministic AST-based mathematical gate. |
| Architectural Rules | General best practices (e.g., DRY, SOLID). | Design Gate: Enforces Kafka-specific constraints (e.g., ZK blocking, Consumer thread safety). |
| Supply Chain | General CVE scanning (SAST). | Planned for Future KIP: Cross-referencing imports with project lockfiles locally. |
| Data Privacy | Cloud-based; requires indexing/API keys. | 100% Local: No data leaves the committer's environment or GitHub Action. |

FAQ

General Concept

  • Q: Why build a custom tool instead of using standard AI code reviewers like GitHub Copilot or CodeRabbit?

    A: Standard AI tools are purely descriptive and lack project-specific context (e.g., they don't natively know that wrapping a KafkaConsumer in an ExecutorService is a threading violation in our core). Furthermore, cloud-based tools require sending Kafka’s internal logic to a third-party cloud for indexing. AIV executes 100% locally via AST parsing, ensuring no data leakage, zero ongoing API costs, and providing a mathematical proxy (LDR) for substance that descriptive AI summaries lack.

  • Q: Why rely on hard controls instead of just adding instructions to an AGENTS.md file?

    A: Markdown instructions like AGENTS.md are a "soft control." Research shows that AI models use probabilistic reasoning and frequently ignore markdown instructions when their context window is saturated or overridden by a user prompt. By embedding deterministic Gradle tasks into our pipeline, we ensure that Kafka's architectural invariants are enforced by hard code, not just suggestions.

  • Q: Will a PR failure silently reject or move a submission to "Draft," discouraging new human contributors?

    A: No. We strictly want to avoid "Silent Rejection," which can be hostile to newcomers. The system acts as a Mentorship Framework. If a PR falls below the threshold, the GitHub Action posts a Diagnostic Scorecard in the CI logs providing a clear, actionable path to getting the PR ready.

Logic Density Ratio (LDR)

  • Q: How does the gate handle refactoring?

    A: AIV includes a "Refactoring Exception": if a PR removes more lines than it adds (net-negative line count), the density check is bypassed automatically.

  • Q: What happens with trivial PRs, like fixing a single typo in a Javadoc? Won't the LDR score be 0?

    A: The Gradle task includes a minimum line-change threshold. Trivial PRs (e.g., modifying fewer than 10 lines of code) bypass the LDR check entirely to ensure minor documentation fixes or typo corrections are never blocked.

  • Q: Does LDR penalize good documentation?

    A: No. The AST analyzer ignores Javadoc and comments entirely, focusing only on executable nodes.

  • Q: What about Kotlin and Scala code?

    A: This initial KIP applies exclusively to Java files. Kotlin, Scala, Shell scripts, and configuration files are completely bypassed until dedicated parsers are proposed in a future KIP.

  • Q: What if a PR modifies both Java and non-Java files?

    A: The AIV Gate evaluates only the Java files in the diff. Non-Java files are ignored entirely in this initial KIP.

Design Gate (Architecture)

  • Q: Checkstyle already handles our code quality. Why do we need a new AST Design Gate?

    A: Checkstyle is fantastic for formatting and simple regex (like globally blocking System.exit()), but it struggles with complex, conditional architectural logic. Checkstyle cannot easily enforce "If KafkaConsumer is used, ensure it is not wrapped in java.util.concurrent." This custom task handles the architectural patterns that Checkstyle structurally cannot.

  • Q: Who maintains the .validation/design-rules.yaml file? Won't it become a dumping ground for arbitrary rules?

    A: The Kafka PMC owns the configuration file. Adding or removing a design rule requires a standard pull request and lazy consensus on the mailing list (72 hours), and the schema dictates providing a rationale to prevent rule creep. A single PMC veto blocks the rule.

CI & Build Integration

  • Q: Will parsing the AST significantly slow down local builds (./gradlew check) or CI?

    A: No. By isolating the tool via includeBuild and utilizing the native GitHub Actions .patch payload (via DiffParser), there is no shelling out to the git binary. The JavaParser cold start and AST traversal take ~2-5 seconds in total per PR, not per file.

  • Q: How do I test my PR locally?

    A: Run ./gradlew checkContributionQuality to see your LDR score and design violations before you push.

Security & Edge Cases

  • Q: Is my code being sent to a third party?

    A: No. 100% local execution ensures Kafka's IP never leaves our controlled environment.

  • Q: Won't failing a CI check generate massive email spam to the dev@kafka mailing list?

    A: No. The validation task will simply fail the CI check exactly like Checkstyle does today. It will output a clear error message in the console pointing to CONTRIBUTING.md. We will not use automated bot comments to avoid webhook noise.

  • Q: What if the script has a bug and blocks legitimate human contributors?

    A: The system prioritizes developer velocity. Any Kafka Committer can bypass the gate immediately by adding /aiv skip to their commit message. The CI runner verifies the actor against the committers list to prevent abuse.

  • Q: Does a "Green" AIV status mean the PR is safe from sophisticated exploits like ShadowRay?

    A: No. AIV is a first line of defense against structural slop and known design anti-patterns. It does not replace human reviewers for semantic security. Complex, logic-dense code generated by AI could still contain subtle unauthenticated execution paths. A passing AIV score simply ensures the code has enough substance to warrant human architectural review.

  • Q: How do we prevent AI models from "faking" logic density by hallucinating redundant method calls?

    A: AI-generated code occasionally attempts to boost its density score by inventing redundant or hallucinated method calls. To combat this, future enhancements (see the Future Work section) outline "BOM-Grounding." This process will cross-reference and verify that the suggested Kafka API methods actually exist in the current trunk before counting them as valid logic nodes.

Rejected Alternatives

  • Using LLM-based code review bots (e.g., CodeRabbit): Rejected due to the severe risks of vendor lock-in, data sovereignty violations, and API costs associated with transmitting proprietary Apache codebases to third-party providers. Generic LLMs also lack the domain-specific context to enforce deep Apache invariants.

  • Relying solely on human reviewers: Continuing to manually review and close low-substance PRs does not scale and exacerbates maintainer burnout.

  • Restricting PR access to Collaborators Only: While GitHub recently introduced this feature to stop bot spam, using it heavily restricts legitimate, first-time open-source contributors from participating in Kafka.

  • Using Third-Party GitHub Actions: Using pre-compiled external binaries for PR validation introduces supply-chain security risks and prevents developers from running the exact same checks locally on their laptops. Building it natively into our Gradle scripts solves both issues.

  • Adding only AGENTS.md: A soft control that AI models probabilistically ignore. This KIP provides deterministic hard control.
