This page is auto-generated! Please do NOT edit it, all changes will be lost on next update
Airavata
Migrate Apache Airavata Deployment from Ansible to OpenTofu
Objective
Replace existing Ansible deployment scripts with OpenTofu configurations to improve deployment efficiency and maintainability for bare-metal environments.
Requirements
- Assessment of Current Ansible Scripts
- Review Existing Playbooks: Analyze the current Ansible playbooks located in the Airavata GitHub repository to understand the deployment processes and dependencies.
- Identify Core Components: Determine the essential services and configurations managed by Ansible, such as Kafka, RabbitMQ, Zookeeper, MariaDB, etc.
- Development of OpenTofu Configurations
- Define Infrastructure as Code (IaC): Utilize OpenTofu's declarative language to codify the infrastructure components identified in the assessment phase.
- Module Creation: Develop reusable modules for each service (e.g., Kafka, RabbitMQ, Zookeeper) to promote consistency and ease of management.
- Testing and Validation
- Simulate Deployments: Use OpenTofu's planning capabilities to simulate deployments, ensuring configurations align with the desired infrastructure state.
- Iterative Refinement: Address any discrepancies or issues identified during testing to refine the OpenTofu configurations.
- Documentation
- Update Deployment Guides: Revise existing documentation to reflect the new OpenTofu-based deployment process, providing clear instructions for users.
Streamline Grouping and Filtering in the Experiment Browser UI
Embed a researcher-focused dashboard to group and preview experiments in the Django portal. The goal is to improve how past experiment runs can be tracked, grouped, and arranged for faster lookup and insights.
Proposed Improvements:
- Group submitted experiments by project, application, allocation, etc.
- Clean, customizable dashboard elements (e.g., charts) to preview past experiments.
Develop an Integrated Feature Test Environment for Apache Airavata
Objective
Enhance the current development workflow by incorporating a simulated High-Performance Computing (HPC) environment into Apache Airavata's existing Integrated Development Environment (IDE) integration. This will enable developers to test and validate features locally without relying on physical HPC resources.
Requirements
- Simulated HPC Environment Integration
- Dockerized Slurm Simulation: Develop a Docker container that emulates an HPC environment using Slurm, facilitating the testing of job scheduling and execution.
- Seamless IDE Integration: Ensure that this simulated environment integrates smoothly with the existing IDE setup, allowing developers to initiate and monitor jobs as they would in a real HPC setting.
- Development of Comprehensive Test Scenarios
- Job Submission Tests: Create scripts to test various job submission scenarios, including successful executions, intentional failures, and long-running processes.
- Feature Validation: Ensure that all features exposed by Apache Airavata can be tested within this simulated environment.
- User-Friendly Setup
- Simplified Configuration: Design the setup process to require minimal configuration, enabling developers to initiate the environment and execute tests with just a few commands.
A Central Admin Dashboard to Inspect Health + Logs of Airavata Services
Develop a devops dashboard to monitor Apache Airavata services, enabling real-time tracking of service health, uptime, and logs independent of the science gateway(s).
This centralized tool will help administrators efficiently monitor service performance and troubleshoot issues. The dashboard will feature a user-friendly monitoring UI that displays real-time status updates and logs for each service.
Proposed Solution:
- A logging subprocess alongside each service, pushing logs to an external service.
- A devops dashboard that aggregates the logs and provides a unified view into the system.
- API calls from devops dashboard to each service, for proactive health-checking.
- Ability to monitor multiple gateways from the same dashboard.
Update Airavata Django Portal to a Supported Python Version
The Airavata Django Portal currently runs on Python 3.6, which reached its end-of-life (EOL) in 2022. Continuing to use an unsupported Python version poses security risks and limits access to new features and package updates. Upgrading to a supported version (Python 3.12 or later) will ensure long-term maintainability, security, and compatibility with modern dependencies.
Impact:
- Improved security and stability
- Access to the latest language features and performance improvements
- Compatibility with actively maintained third-party packages
Proposed Solution: Update the codebase and dependencies for compatibility with Python 3.12+, and ensure everything works as expected post-upgrade.
Modernize and unify the user interface and user experience across all Apache Airavata products
This epic focuses on modernizing the Apache Airavata Catalog Interface deployed in Cybershuttle by improving its user experience and interface design. Additionally, it aims to unify the look, feel, and branding across all Airavata products and web properties (e.g., Django Portal, Custos UI), culminating in a refreshed visual identity. This effort will enhance usability, improve accessibility, increase user trust, and establish a consistent identity for the ecosystem. A design sprint and logo refresh campaign are also proposed in this epic.
Goals:
- Redesign and implement a modern, user-friendly Airavata Catalog interface for Cybershuttle.
- Conduct UX research and usability testing to identify user pain points.
- Create a cohesive UI component library and design system for reuse across Airavata UIs.
- Unify graphics, fonts, color palettes, and layout across all Apache Airavata and Custos interfaces.
- Refresh the Airavata logo and branding for web, documentation, and marketing collateral.
- Provide branding and UI guidelines for developers and contributors.
Facilitating computational experiment generation in AIRAVATA
Computational science involves extensive experimentation, which often means searching over a space of parameters, variables, functions, and workflows. Individual researchers and groups perform a large number of such searches to identify critical functional forms and workflows for any particular study. The goal of this work is to provide a tool that facilitates this search process: it will enable visualization of past experiments, identification or learning of templates, and generation of potential new experiments based on past experiments using LLM-based and neurosymbolic methods.
This work has the following specific goals:
- Provide visualization of past computational experiments: Tracking computational experiments and their many variations is often challenging for individual researchers and groups. Ad hoc approaches (directories, git, etc.) are commonly used to track these changes, but they rarely give an overall view of past experiments. The goal of this work is to develop a visualization approach that allows all past experiments to be examined. This will require dimensionality reduction on embeddings from LLMs that have been validated for code generation (e.g., CodeLlama, Llama 4 Maverick), plus a comparison of their performance with standard code-clone detection and similarity measures.
- Identify templates based on the past experiment database: Several computational experiments commonly share a common structure; identifying the 'template' in such cases makes it possible to recognize common approaches in past experiments and to generate new ones. This work will need both software-engineering and AI-based approaches to identify such templates. The templates will also be integrated with the visualization (in addition to the embedding-based visualization), allowing the collections of experiments that belong to each template to be examined.
- Generate new suggested experiments using templates and visualization-guided search: Generating new experiments is a key component of computational science work. Facilitating this process will require a visual, interactive way to generate experiments both within the regions of previous experiments and in parts of the space that were not previously explored. In addition, it will require generating new experiments based on the templates identified in the previous step. Template-based generation could also provide a verifiable way to generate experiments.
Come up with UX designs for Airavata data and model catalog
Currently, Airavata mainly supports orchestration of access to computing resources, and it handles the data layer internally as the enabling mechanism to launch remote jobs. In the new Airavata release, we want the data layer and computational models to be available as first-class citizens of the platform. As the first step, we need to go through multiple iterations of UX design review to finalize what the product should look like to the end user.
Containerized Deployment of Airavata Services
Currently, all Airavata services are packaged and deployed as Java bundles. The goal is to containerize each service by wrapping it within a Dockerfile, allowing seamless deployment on container-enabled resources while also enabling local execution for development purposes.
This enhancement has the potential to improve deployment consistency, simplify dependency management, and provide greater flexibility in running Airavata services across different environments, for both testing and production use cases.
Apache Dubbo
GSoC 2025 - Add more traffic management rule support for Dubbo Proxyless Mesh
Background and Goal
The concept of Proxyless Mesh was first introduced in this blog post: https://istio.io/v1.15/blog/2021/proxyless-grpc/. Please read it to learn more about the concept.
Development of Dubbo Proxyless Mesh has been underway for a while, so you don't have to start the project from scratch; anyone who gets involved can start with a specific task at hand.
In this specific GSoC project, we need developers to mainly focus on implementing more traffic management features of Istio for Dubbo.
Relevant Skills
- Familiar with Java
- Familiar with Service Mesh, Istio, and microservice architectures
- Familiar with Kubernetes
Potential Mentors
- Jun Liu, Apache Dubbo PMC Chair, junliu@apache.org
GSoC 2025 - Dubbo Admin traffic management feature
Background and Goal
Dubbo is an easy-to-use, high-performance microservice framework that provides both RPC and rich enterprise-level traffic management features.
The community has been working on improving Dubbo's traffic management abilities to support rich features like traffic splitting, canary release, A/B testing, circuit breaking, mocking, etc. The complete traffic management architecture in Dubbo consists of two major parts, the Control Plane and the Data Plane. In Dubbo, the Control Plane refers to Dubbo Admin, with source code in apache/dubbo-kubernetes; the Data Plane is implemented by the Dubbo SDKs (Java, Go, etc.).
The traffic management rules Dubbo uses now are compatible with the rules in Istio, which means the rules generated by Dubbo Admin and sent to the SDKs are Istio-compatible rules. In this project, we need developers to work mainly on Dubbo Admin to make sure it generates and sends those rules correctly.
Relevant Skills
- Familiar with Golang
- Familiar with Service Mesh, Istio, and microservice architectures
- Familiar with Kubernetes
Potential Mentors
- Jun Liu, Apache Dubbo PMC Chair, junliu@apache.org
- dev@dubbo.apache.org
GSoC 2025 - Dubbo Triple protocol for Go language implementation
Background and Goal
Dubbo is an easy-to-use, high-performance microservice framework that provides both RPC and rich enterprise-level traffic management features. This project focuses on the Go-language implementation of Dubbo's Triple protocol, covering:
- keep-alive
- connection management
- programming API
- error codes
Relevant Skills
- Familiar with Golang
- Familiar with RPC
- Familiar with the HTTP/1, HTTP/2, and HTTP/3 protocols
Potential Mentors
- Jun Liu, Apache Dubbo PMC Chair, junliu@apache.org
- dev@dubbo.apache.org
GSoC 2025 - Dubbo Gradle IDL Plugin
Background and Goal
In the API-First design paradigm, IDL (Interface Definition Language) and its corresponding generation tools have become essential. IDL files are the specifications for defining service interfaces, and generation tools can convert IDL files into executable code, thereby simplifying the development process and improving efficiency.
Currently, Apache Dubbo only provides a Maven IDL generation plugin, lacking a Gradle plugin. This brings inconvenience to developers using Gradle to build projects.
Necessity
- Unify Build Tools: Gradle is the preferred build tool for Android projects and many Java projects. Providing a Dubbo Gradle IDL plugin can maintain the consistency of build tools and reduce the cost for developers to switch between different build tools.
- Simplify Configuration: Gradle plugins can simplify the configuration and generation process of IDL files. Developers only need to add plugin dependencies and simple configurations in the `build.gradle` file to complete the generation of IDL files without manually executing complex commands.
- Integrate Development Process: Gradle plugins can be better integrated with IDEs (Integrated Development Environments). Developers can directly execute Gradle tasks in the IDE, thereby realizing the automatic generation of IDL files and improving development efficiency.
Implementation Plan
- Plugin Development: Develop a Gradle plugin that encapsulates the Dubbo IDL generation tool and provides a concise configuration interface.
- Configuration: In the `build.gradle` file, developers can configure parameters such as the path of the IDL file and the directory of the generated code.
- Task: The plugin provides a Gradle task for executing the generation of IDL files. Developers can execute the task through the command line or the IDE.
- Dependency Management: The plugin can automatically manage the dependencies of the Dubbo IDL generation tool, ensuring that developers do not need to manually download and configure it.
Expected Results
- Developers can use Gradle to build Dubbo projects and easily generate the code corresponding to the IDL.
- Simplify the configuration and generation process of IDL files, and improve development efficiency.
- Better integration with IDEs to achieve automatic generation of IDL files.
Potential Mentors
- Albumen Kevin, Apache Dubbo PMC, albumenj@apache.org
- dev@dubbo.apache.org
GSoC 2025 - Enhancing Dubbo Python Serialization
Background and Goal
Currently, Dubbo Python exposes a serialization function interface that requires users to implement their own serialization methods. For commonly used serialization formats such as JSON and Protobuf, users must manually configure them each time. To streamline this process, we aim to provide a built-in serialization layer that supports these common formats by default.
Goal
We recommend using Pydantic to achieve this. Therefore, we expect the implementation to:
1. Provide an internal serialization layer based on Pydantic, with support for at least JSON and Protobuf (a rough sketch follows this list).
2. Leverage Pydantic's additional features, including data validation and other useful functionality.
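A rough sketch of such a layer in Python, assuming Pydantic v2; the class and method names below are illustrative only and are not Dubbo Python's actual serialization interface:
# Illustrative sketch only: a Pydantic-backed JSON codec; names are hypothetical.
from typing import Type, TypeVar
from pydantic import BaseModel

M = TypeVar("M", bound=BaseModel)

class PydanticJsonSerializer:
    """Built-in JSON codec backed by Pydantic models; validation happens on decode."""
    def serialize(self, obj: BaseModel) -> bytes:
        return obj.model_dump_json().encode("utf-8")
    def deserialize(self, data: bytes, model: Type[M]) -> M:
        return model.model_validate_json(data)

class GreetRequest(BaseModel):
    name: str
    times: int = 1

if __name__ == "__main__":
    codec = PydanticJsonSerializer()
    payload = codec.serialize(GreetRequest(name="dubbo"))
    print(codec.deserialize(payload, GreetRequest))  # name='dubbo' times=1
A Protobuf codec could implement the same two methods, so user code never has to configure serialization by hand.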
Relevant Skills
1. Familiar with Python
2. Familiar with RPC
Potential Mentors
- Albumen Kevin, Apache Dubbo PMC, albumenj@apache.org
- dev@dubbo.apache.org
GSoC 2025 - Service Discovery
Background and Goal
Service Discovery
- Well-organized logs
- Actuator endpoints
- Tools
Relevant Skills
- Familiar with Java
- Familiar with Microservice architecture
Potential Mentors
- Jun Liu, Apache Dubbo PMC Chair, junliu@apache.org
- dev@dubbo.apache.org
Apache NuttX
Firmware Upgrades over Silicon-Heaven Protocol for NXboot Demonstrated on pysimCoder
NuttX has recently gained support for the NXboot firmware loader and updater. It allows fail-safe upgrades and recovery from upgrade failures caused by resets during an upgrade, temporary MTD device failures, or an upgrade to a non-functional firmware version.
Silicon-heaven is an open-source communication protocol that allows building distributed applications mounted into a tree-like hierarchical system of brokers. An update service implemented over this protocol has the advantage that even nodes behind NATs and firewalls can be reached, commanded, and updated.
pysimCoder is a block-diagram editor and real-time code generator implemented in Python that can be used for rapid prototyping of control applications for NuttX (see the related documentation). The silicon-heaven protocol allows connecting to a running application, inspecting its internal state, and tuning block parameters at runtime.
The goal of this project is to collect and finalize the silicon-heaven libraries needed to build an updater application that connects into a complete silicon-heaven tree, yet stays small enough to fit into memory-constrained targets. The implemented libraries should be integrated into the mainline NuttX applications configuration system, and a simple application should be available as a new example in the NuttX system. A standalone host-side application (probably Python/pyshv) with both graphical and plain command-line modes should be implemented. The whole solution should then be demonstrated integrated into pysimCoder: when runtime upgrade is enabled in the model editor, the firmware update file node and control properties should be compiled into the control model's SHV properties tree.
NuttX support for IEEE 802.3cg 10BASE−T1S Open Alliance SPI MACPHY (i.e. ONsemi NCV7410)
10 Mbit/s multidrop Ethernet is a promising standard for automotive peripherals and other industrial systems communication where a minimal wire count and no need for a central communication element (hub/switch) are a priority over communication speed. A single pair of wires is enough to connect multiple devices implementing communication according to the IEEE 802.3cg 10BASE-T1S standard (the basic standard guarantees up to 8 nodes with a link length of at least 25 meters).
The goal of this Apache NuttX GSoC project is to implement a driver for combined T1S MACPHY interfaces. The core of the driver should be implemented independently of specific platforms and target architectures, with a minimal amount of bring-up code needed to map it onto a specific board. General usability with the whole range of MACPHY interfaces from multiple vendors compliant with the Open Alliance MACPHY SPI protocol should be kept in mind. Within the scope of the long GSoC project, support for the beacon-synchronized mode should be implemented as well, together with its configuration and a test application.
The ESP32-C6 is proposed as the initial, widely available, and cheap platform for driver development. Portability should then be demonstrated on the ATSAMV71 or other ARM-based targets focused on industrial environments.
Possible mentors: Karel Koci, Pavel Pisa, and Michal Lenc
Lucene.NET
Apache Lucene.NET Replicator and Dependency Injection Enhancements
Background and Goal
Apache Lucene.NET is a .NET port of the Apache Lucene search engine (originally written in Java). This powerful library enables indexing and searching of documents with custom queries, making it a core component in many production environments. With over 100 million NuGet downloads, Lucene.NET is utilized in diverse scenarios, from local search functionality in mobile apps to supporting large-scale cloud infrastructures.
Lucene.NET already provides a foundation for replicating a search index from a primary node to one or more replica nodes, enabling High Availability (HA) and scalability through load balancing. Currently, our Lucene.Net.Replicator.AspNetCore project offers minimal replication support for ASP.NET Core servers, but it remains unpublished on NuGet and lacks the robustness required for most use cases. Your focus for this project will be to enhance and finalize the ASP.NET Core library, ensuring a seamless user experience by adhering to best practices and making replication setup as straightforward as possible – ideally requiring just one line of code.
Additionally, users may need replication support for applications outside ASP.NET Core, such as cloud-based distributed architectures, Windows services, or command-line tools running on Linux. To address this, we propose creating modular intermediate libraries using Microsoft.Extensions.DependencyInjection.Abstractions, enabling flexible and reusable replication configurations. This approach should also ensure that essential components like IndexWriter and IndexReader are configured in a straightforward and user-friendly manner.
Your task will also include creating one or more sample projects that demonstrate how to effectively use the enhanced replication functionality. These projects should serve as practical, real-world examples for the community, showcasing best practices and ease of use. Additionally, you will be responsible for thoroughly testing the code changes to ensure they work as intended in real-world scenarios. This includes writing comprehensive unit tests to guarantee the reliability and quality of the solution.
We plan for this to be a hands-on mentorship, and we will set up any infrastructure for you. As a contributor, your responsibilities will include analyzing the problem, developing a detailed plan, refining it with input from the project team, and collaborating regularly to implement the solution through pull requests and code reviews.
Relevant Skills
- Familiarity with C# and unit testing
- Strong grasp of design patterns and practices, such as dependency injection, the fluent builder pattern, and the abstract factory pattern
- Basic understanding of HTTP(S) and networking
- Not required, but good to have:
- Familiarity with ASP.NET Core 5 or later
- Understanding of distributed architectures
- Familiarity with Lucene(.NET) search indexes
CloudStack
Apache CloudStack DRS improvements
As an operator, I would like the load on my systems to be distributed more evenly and managed centrally. At the moment there is a simple DRS for cluster-wide distribution of load; however, it does not apply zone-wide distribution and is not driven by automated queries/improvements.
In addition, we should take historic data for the VM into account when planning possible migrations.
At the moment, allocated metrics are used. A first improvement would be to use actual metrics.
ref: cloudstack issue: https://github.com/apache/cloudstack/issues/10397
Verification of LDAP connections
When a new LDAP connection is added, there are no diagnostics to verify the validity/usability of the connection, making troubleshooting cumbersome. This issue aims to facilitate LDAP configuration.
ref. cloudstack issue: https://github.com/apache/cloudstack/issues/6934
SSL - LetsEncrypt the Console Proxy
Add a new global option to enable Let's Encrypt on the console proxy, along with a Let's Encrypt domain name option for automatic SSL certificate renewal.
ref. cloudstack issue: https://github.com/apache/cloudstack/issues/3141
Autodetect IPs used inside the VM on L2 networks
With regard to IP info reporting, CloudStack relies entirely on its DHCP databases and so on. When this is not available (L2 networks, etc.), no IP information is shown for a given VM.
I propose we introduce a mechanism for "IP autodetection" and try to discover the IPs used inside the machines by querying the hypervisors. For example, with KVM/libvirt we can simply do something like this:
[root@fedora35 ~]# virsh domifaddr win2k22 --source agent
Name MAC address Protocol Address
-------------------------------------------------------------------------------
Ethernet 52:54:00:7b:23:6a ipv4 192.168.0.68/24
Loopback Pseudo-Interface 1 ipv6 ::1/128
- - ipv4 127.0.0.1/8
The above command queries the qemu-guest-agent inside the Windows VM. The VM needs to have the qemu-guest-agent installed and running, the virtio serial drivers (easily done in this case with virtio-win-guest-tools.exe), and a guest-agent socket channel defined in libvirt.
ref. cloudstack issue: https://github.com/apache/cloudstack/issues/7142
[GSoC] [CloudStack] Improve CloudMonkey user experience by enhancing autocompletion
Summary
Currently, many API parameters do not get auto-completed because CloudMonkey isn't able to deduce probable values for those parameters from its list-API heuristics. Many of these parameters are enums on the CloudStack side, and by finding a way to expose these and consume them on the CloudMonkey side, we could greatly improve the usability of the CLI.
Benefits to CloudStack
- Improved end user experience when using CLI
- Reduce incorrect inputs
Deliverables
- Expose enums and all other relevant information that can be used to enhance auto-completion of parameters on the CloudStack end
- May require framework level changes and changes to APIs
- Consume these exposed details on Cloudmonkey end
Dependent projects
https://github.com/apache/cloudstack-cloudmonkey/
Ref CloudStack Issue: https://github.com/apache/cloudstack/issues/10442
Add Suricata integration to CloudStack
Currently, CloudStack only has ACLs (in advanced networks) as a layer for securing access to the networks (VPCs). However, these operate only at Layers 3 and 4 of the OSI model.
In today's day and age, where cybersecurity threats are becoming more advanced and complex and operate at OSI Layer 7, there needs to be a way for CloudStack to allow its tenants to implement their own form of mature cybersecurity solution.
The problem all this while is that if a user is using a VPC or L2 networks, third-party firewalls such as pfSense, Fortinet VM firewalls, etc., can't be implemented effectively because static routes cannot be set in a way that persists on the VR after it is recreated.
There needs to be a better option for users of CloudStack to implement a deeper form of cybersecurity to protect their workloads.
ref. github issue: https://github.com/apache/cloudstack/issues/10445
eBPF-based Network Observability for CloudStack
CloudStack’s network monitoring is mostly based on logs and external agents, making real-time traffic analysis difficult. This project will integrate eBPF-based network observability to capture per-VM traffic metrics, detect anomalies, and improve tenant isolation.
Benefits to CloudStack
- Enhanced security: Detect suspicious activity at the kernel level.
- Real-time traffic monitoring: Gain deep insights into VM networking.
- Better tenant isolation: Identify cross-tenant traffic issues.
Deliverables
- Develop eBPF probes to capture the following (a minimal sketch follows this list):
- Per-VM network traffic metrics (packets, bytes, latency)
- Connection tracking for detecting unauthorized access patterns
- Packet drops and retransmission rates
- Expose network metrics via CloudStack’s API.
- Provide visualization through Prometheus/Grafana.
- Document setup, usage, and performance benchmarks.
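For orientation, here is a minimal sketch of the kind of probe involved, written with the bcc Python bindings: it counts bytes handed to tcp_sendmsg() per process as a stand-in for per-VM accounting. A real deliverable would attribute traffic to VMs (e.g., per tap interface or per qemu process) and export the counters to CloudStack/Prometheus.
# Illustrative only, not CloudStack code. Needs root and the bcc toolchain installed.
import time
from bcc import BPF

PROG = r"""
#include <uapi/linux/ptrace.h>
BPF_HASH(bytes_by_pid, u32, u64);
int trace_tcp_sendmsg(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 size = PT_REGS_PARM3(ctx);   // third argument of tcp_sendmsg() is the byte count
    bytes_by_pid.increment(pid, size);
    return 0;
}
"""

b = BPF(text=PROG)
b.attach_kprobe(event="tcp_sendmsg", fn_name="trace_tcp_sendmsg")
print("Tracing tcp_sendmsg()... Ctrl-C to stop")
try:
    while True:
        time.sleep(5)
        top = sorted(b["bytes_by_pid"].items(), key=lambda kv: kv[1].value, reverse=True)[:10]
        for pid, nbytes in top:
            print(f"pid={pid.value:<8} bytes_sent={nbytes.value}")
except KeyboardInterrupt:
    pass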
Expected Outcome
An eBPF-based solution that improves network observability in CloudStack, providing security and performance insights with minimal resource usage.
Enhancing CloudStack Monitoring with eBPF
Apache CloudStack currently relies on traditional monitoring tools, which may lack deep visibility into kernel-level events and networking performance. This project aims to integrate eBPF-based monitoring into CloudStack to provide lightweight, real-time performance analysis and security auditing.
Benefits to CloudStack
- Improved observability: Gain fine-grained insights into VM performance metrics.
- Lower overhead: eBPF runs in the kernel and avoids the performance penalties of user-space monitoring tools.
- Enhanced security auditing: Detect and log anomalies in system behavior.
Deliverables
- Implement eBPF programs to track:
- VM CPU usage
- Memory consumption
- Disk I/O metrics
- Network traffic analysis
- Develop a CloudStack-compatible API or CLI for retrieving eBPF-generated insights.
- Provide visualization support using Prometheus/Grafana.
- Write documentation for setup and usage.
Expected Outcome
A robust eBPF-based monitoring solution integrated into CloudStack, offering real-time performance insights with minimal overhead.
ref. cloudstack issue: https://github.com/apache/cloudstack/issues/10415
This project is marked as part-time, but the scope can be extended to full-time. This depends largely on whether the full set of metrics is implemented or only one, as a proof of concept.
StreamPipes
Extend visualization capabilities of Apache StreamPipes
Background
Apache StreamPipes is a self-service Industrial IoT toolbox that helps users connect, analyze, and exploit industrial data streams. StreamPipes offers a variety of tools that help users interact with data from industrial sources such as PLCs. An adapter library allows users to get real-time data from industrial controllers or other systems, a pipeline editor allows them to build stream processing pipelines using either graphical or code-based flow modeling, and a data explorer allows them to quickly create visualizations based on connected adapters.
Current Challenges
The StreamPipes data explorer consists of a chart view, where users can create charts based on live data, and a dashboard view, where users can create live dashboards based on charts.
The data explorer provides a set of charts, which are mainly based on Apache ECharts. The currently available chart library includes time-series line/bar charts, heatmaps, scatter plots, density charts and others. To improve the user experience and add additional capabilities, we plan to extend this chart library with additional charts that are useful for industrial data analytics.
Objectives
The primary objectives of this project are as follows:
- Explore the Apache ECharts library and identify useful additional charts for industrial data analytics
- Improve the StreamPipes data explorer by adding new chart types using Apache ECharts
- Add a more advanced table visualization
- Extend existing charts with additional configurations (e.g., axis configurations, labels, data transformations)
- Add a data preview for all charts, which is shown below the actual chart in the chart view
- Design and implement end-to-end tests using Cypress
Recommended Skills
- Proficiency in TypeScript programming + testing
- Proficiency in Angular
- Excellent logical thinking and problem-solving skills.
- Good eye for visually appealing user interfaces
Mentor
Dominik Riemer, Apache StreamPipes PMC, riemer@apache.org
Difficulty: Major
Project Size: ~350 hours (large)
Kvrocks
[GSOC][Kvrocks] Improve the controller UI
Background
Apache Kvrocks is a distributed key-value NoSQL database that uses RocksDB as its storage engine and is compatible with Redis protocol.
In the past, basic web UI capabilities have been provided for the Apache Kvrocks Controller, including features such as cluster creation and migration. In the future, we aim to offer a better and more modern UI experience, while also enhancing centralized visualization capabilities.
Objectives
The key objectives of the project include the following:
- Refactor the existing UI pages
- Enhance the visualization capabilities for cluster migration
- Provide a cluster Overview dashboard
Recommend Skills
- Familiar with Next.js & Tailwind
- Have a basic understanding of RESTful APIs
- Have experience with Apache Kvrocks
Mentor: Hulk Lin, Apache Kvrocks PMC, hulk@apache.org
Mailing List: dev@kvrocks.apache.org
Please leave comments if you want to be a mentor
[GSOC][Kvrocks] Support database backup to cloud storage
Background:
Kvrocks is a key-value database that provides a Redis-compatible API on top of RocksDB. Currently, Kvrocks lacks a built-in mechanism for database backup to cloud storage, which is crucial for data durability, disaster recovery, and scalability in cloud environments.
This project aims to implement a robust backup system that allows users to store Kvrocks backups directly in cloud storage services such as Amazon S3, Google Cloud Storage, and/or Azure Blob Storage. The solution will integrate with the existing Kvrocks backup and restore mechanisms while ensuring efficient and secure data transfer.
Deliverables:
- Cloud Storage Integration: Implement backup storage support for Amazon S3, Google Cloud Storage, and Azure Blob Storage using SDKs, REST APIs or libraries (e.g. Apache OpenDAL).
- Backup & Restore Commands: Extend Kvrocks’ backup functionality to allow exporting and importing database snapshots from cloud storage.
- Configuration & Authentication: Provide user-configurable options to specify storage credentials and backup parameters.
- Incremental Backup Support (Stretch Goal): Optimize storage usage by implementing differential or incremental backup capabilities.
- Documentation & Tests: Comprehensive documentation and test coverage to ensure reliability and ease of use.
Recommended Skills:
- Good at coding in C++;
- Knowledge about database internals and cloud storage;
- Knowledge about Kvrocks or Redis.
Mentor: Mingyang Liu, Apache Kvrocks PMC member, twice@apache.org
Mailing List: dev@kvrocks.apache.org
Beam
Enhancing Apache Beam JupyterLab Sidepanel for JupyterLab 4.x and Improved UI/UX
The Apache Beam JupyterLab Sidepanel provides a valuable tool for interactive development and visualization of Apache Beam pipelines within the JupyterLab environment. This project aims to significantly enhance the sidepanel by achieving full compatibility with the latest JupyterLab 4.x release and implementing substantial UI/UX improvements. This will ensure seamless integration with modern JupyterLab workflows and provide a more intuitive and user-friendly experience for Apache Beam developers.
Beam ML Vector DB/Feature Store integrations
Apache Beam's Python SDK provides a powerful way to define data processing pipelines. In particular, many users want to use Beam for machine learning use cases like feature generation, embedding generation, and retrieval augmented generation (RAG). Today, however, Beam integrates with a relatively limited set of feature stores and vector DBs for these use cases. This project aims to build out a rich ecosystem of connectors to systems like Pinecone and Tecton to enable these ML use cases.
Simplify management of Beam infrastructure, access control and permissions via Platform features
This project consists of a series of tasks that build a sort of 'infra platform' for Beam. Some tasks include:
- Automated cleaning of infrastructure: [Task]: Build a cleaner for assets in the GCP test environment #33644
- Implement Infra-as-code for Beam infrastructure
- Implement access permissions using IaC: [Task]: Build a cleaner for assets in the GCP test environment #33644
- Implement drift detection for IaC resources for Beam
- Implement 'best-practice' key management for Beam (i.e. force key rotation for service account keys, and store in secret manager secrets)
A quality proposal will include a series of features beyond the ones listed above. Some ideas:
- Detection of policy breakages, and nagging to fix
- Security detections based on cloud logging
- others?
Beam YAML ML, Iceberg, and Kafka User Accessibility
Apache Beam's YAML DSL provides a powerful and declarative way to define data processing pipelines. However, its adoption for complex use cases like Machine Learning (ML) and Managed IO (specifically Apache Iceberg and Kafka) is hindered by a lack of comprehensive documentation and practical examples. This project aims to significantly improve the Beam YAML documentation and create illustrative examples focused on ML workflows and Iceberg/Kafka integration, making these advanced features more accessible to users.
SkyWalking
SkyWalking BanyanDB Extend remote.FS with Object Storage Support for AWS, Google Cloud, and Azure
Overview:
The current implementation of the remote.FS interface only supports a local file system (via the implementation in local.go). This GSoC 2025 project proposes to extend remote.FS with support for popular object storage services, namely AWS S3, Google Cloud Storage, and Azure Blob Storage. This enhancement will allow the project to support robust cloud-based backup and restore operations in addition to local storage.
Proposed Features:
- AWS S3 Implementation:
- Implement methods for Upload, Download, List, and Delete operations using the AWS S3 API.
- Google Cloud Storage Implementation:
- Provide a module that integrates with Google Cloud Storage to perform similar operations.
- Azure Blob Storage Implementation:
- Develop functionality to access and manage Azure Blob Storage via the remote.FS interface.
Implementation Details:
- Interface Compliance:
Each object storage implementation must adhere to the remote.FS interface defined in remote.go.
- Error Handling & Resilience:
Implement robust error handling, logging, and retry mechanisms to ensure reliable operations across different cloud services.
- Testing:
Develop comprehensive unit and integration tests to cover edge cases and guarantee compatibility and stability.
- Documentation:
Update the project documentation to detail configuration, deployment, and usage of each cloud storage option.
Seata
GSoC 2025 - Apache Seata(Incubating) Extend multi-raft cluster mode
Description
Synopsis
The current Apache Seata Server supports the Raft cluster mode, but the performance and throughput of the cluster are significantly limited due to the single leader in a single Raft group. Therefore, the goal is to extend Seata Server to support multi-raft capability.
Benefits to Community
Due to the characteristics of Raft, requests are processed on the leader node and the results are submitted to the followers through the Raft consensus protocol. As a result, a significant amount of computational load is placed on the leader node, while followers only need to receive the final computed result. This causes the CPU, memory, and other metrics of the leader to be much higher than those of the followers. Additionally, the throughput of a single leader is limited by the machine configuration of the highest-spec node in the cluster, making it difficult to balance the traffic effectively. Therefore, supporting multi-raft would make the load distribution more balanced across all nodes in the cluster, improving throughput and performance, while also reducing the waste of machine resources.
Deliverables
The expected delivery goal is to apply the multi-raft capability of the sofa-jraft component to Seata Server through detailed learning and practice.
The expected steps are the following:
- Learning and using the sofa-jraft component
- Understanding and practicing the transaction grouping capability in Seata
- Gaining a certain level of understanding of Seata's communication protocol
- Gaining a certain level of understanding of Seata's storage model, especially the Raft mode
- Ensuring compatibility between different versions
Useful links
Mentor
- Mentor: Jianbin Chen, Apache Seata(Incubating) PPMC Member, jianbin@apache.org
GSoC 2025 - Apache Seata(Incubating) Unlocking the Power of Metadata in Apache Seata: From Load Balancing to Advanced Routing
Synopsis
Currently, Apache Seata relies on a registry (e.g., Nacos, Zookeeper, Eureka, Etcd3, Consul, Seata Naming Server) for service discovery and load balancing. However, the existing registry mechanism lacks support for custom metadata, which limits the flexibility of client-side load balancing strategies. For example, clients cannot dynamically adjust traffic distribution based on server-side metadata such as weight, region, or version. This project aims to enhance the registry module in Apache Seata by adding metadata support and enabling clients to implement advanced load balancing strategies based on this metadata.
Benefits to Community
Improved Load Balancing Flexibility: By allowing Seata Server instances to register custom metadata (e.g., weight, region, version), clients can implement more sophisticated load balancing strategies, such as weighted round-robin, zone-aware routing, or version-based routing. This ensures better resource utilization and improved system performance.
Enhanced Scalability: With metadata-driven load balancing, Seata can better handle large-scale deployments by distributing traffic more intelligently across server instances. For example, high-traffic regions can be assigned more resources, while low-traffic regions can operate with minimal overhead.
Better Resource Utilization: Metadata such as server weight or capacity can help clients avoid overloading specific instances, leading to more balanced resource usage across the cluster.
Extensibility: The addition of metadata support opens the door for future enhancements, such as dynamic traffic shaping, A/B testing, or canary deployments.
Deliverables
The expected deliverables for this project include:
Registry Metadata Support:
- Extend the registry module (e.g., Nacos, Zookeeper, Eureka, Etcd3, Consul, Seata Naming Server) to allow Seata Server instances to register custom metadata (e.g., weight, region, version).
- Ensure backward compatibility with existing registry implementations.
Client-Side Load Balancing Enhancements:
- Implement a metadata-aware load balancing mechanism in the Seata client (TM/RM).
- Provide built-in load balancing strategies (e.g., weighted random, zone-aware) and allow users to plug in custom strategies via SPI.
Documentation and Testing:
- Update the Seata documentation to explain how to configure and use metadata for load balancing.
- Write unit tests and integration tests to validate the new functionality
Steps Expected
1. Understand Seata's Registry Mechanism:
- Study how Seata integrates with various registries (e.g., Nacos, Zookeeper, Eureka, Etcd3, Consul, Seata Naming Server).
- Identify the current limitations in metadata support and load balancing.
2. Extend Registry Module:
- Modify the registry module to allow Seata Server instances to register custom metadata.
- Ensure the metadata is propagated to clients during service discovery.
3. Implement Metadata-Aware Load Balancing (a minimal illustration follows this list):
- Enhance the client-side load balancing logic to consider metadata (e.g., weight, region) when selecting a server instance.
- Provide built-in strategies (e.g., weighted random, zone-aware) and support custom strategies via SPI.
4. Ensure Compatibility and Performance:
- Test the new functionality with different registry implementations (e.g., Nacos, Zookeeper, Eureka, Etcd3, Consul, Seata Naming Server).
- Optimize performance to minimize the overhead of metadata processing.
5. Documentation and Testing:
- Write clear documentation on how to configure and use the new metadata and load balancing features.
- Develop comprehensive unit tests and integration tests.
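For illustration only (the actual feature would live in the Seata Java client's load-balancer SPI), here is a minimal Python sketch of weighted-random selection, assuming each discovered instance carries a weight entry in its registry metadata:
# Minimal illustration of metadata-aware weighted-random selection.
# The "weight" metadata key is an assumed convention, not an existing Seata field.
import random
from dataclasses import dataclass, field

@dataclass
class ServerInstance:
    address: str
    metadata: dict = field(default_factory=dict)
    @property
    def weight(self) -> float:
        return float(self.metadata.get("weight", 1.0))

def pick_weighted_random(instances):
    """Pick an instance with probability proportional to its metadata weight."""
    candidates = [i for i in instances if i.weight > 0]
    if not candidates:
        raise ValueError("no eligible server instances")
    point = random.uniform(0, sum(i.weight for i in candidates))
    for inst in candidates:
        point -= inst.weight
        if point <= 0:
            return inst
    return candidates[-1]

if __name__ == "__main__":
    servers = [
        ServerInstance("10.0.0.1:8091", {"weight": "3", "region": "us-east"}),
        ServerInstance("10.0.0.2:8091", {"weight": "1", "region": "us-west"}),
    ]
    print(pick_weighted_random(servers).address)  # roughly a 3:1 split over many calls
A zone-aware strategy would simply filter candidates by a region entry before applying the same weighting.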
Useful Links
Mentor
Mentor: Jiangke Wu, Apache Seata(Incubating) PPMC Member xingfudeshi@apache.org
GSoC 2025 - Apache Seata(Incubating) Enhancing Connection Pool Management for Apache Seata AT/XA Transaction Modes
Project Overview
Title
Enhancing Connection Pool Management for Apache Seata AT/XA Transaction Modes
Abstract
Apache Seata(incubating) is a popular distributed transaction solution, providing modes such as AT, TCC, and XA for ensuring data consistency in microservice architectures. This project aims to enhance the connection pool management for Seata's AT/XA transaction modes by integrating a comprehensive monitoring and configuration management system within the Seata console. The enhanced functionality will facilitate better resource management and operational efficiency for organizations utilizing Seata.
Detailed Description
Objectives
1. Console Metrics Visualization: Develop functionality to view various metrics related to the connection pool in the Seata console. The metrics should be displayed based on IP/connection pool granularity, helping users easily identify resource allocation and utilization.
2. Metrics Control via Console: Allow users to control various aspects of the connection pools directly from the Seata console. This includes the ability to adjust minimum and maximum connection counts, configure connection acquisition timeout, and manage connection pool keep-alive settings.
Deliverables
1. Connection Pool Metrics Monitoring:
- Visual display of connection pool metrics including current connections, idle connections, active connections, etc.
- Granular view based on IP and connection pool, enabling detailed monitoring and management.
2. Connection Pool Configuration Management: implement functions in the Seata console to change connection pool settings:
- Adjust minimum and maximum connection thresholds.
- Set and modify the timeout for obtaining connections.
- Configure keep-alive settings for maintaining active pool connections.
3. Comprehensive Documentation:
- Provide documentation on how to use the new connection pool features.
- Include developer notes for future contributions and improvements.
Implementation Plan
Phase 1: Requirement Analysis and Design
- Collaborate with mentors to finalize requirements and design a detailed architecture plan.
- Explore existing Seata console features and connection pool management libraries.
Phase 2: Development of Monitoring Features
- Implement backend logic to gather connection pool metrics.
- Develop Seata console UI components for metric visualization.
Phase 3: Development of Control Features
- Integrate functionality to dynamically adjust connection pool configurations via the console.
- Ensure robust validation and error-handling mechanisms are in place.
Phase 4: Testing and Documentation
- Conduct thorough testing to ensure reliability and performance.
- Write user and developer documentation explaining features, usage, and configuration.
Required Skills
- Passion for Open Source: Enthusiasm to contribute consistently to open-source projects, with a curiosity for technology.
- Understanding of Seata Architecture: Basic knowledge of Apache Seata’s architecture and transaction models.
- Java Proficiency: Strong command of Java programming for backend development.
Benefits to Apache Seata
The project will enhance Apache Seata’s usability by providing detailed insights and management capabilities for connection pools. This will lead to more efficient resource utilization, aiding organizations in maintaining system performance and reliability within their distributed transactions.
Conclusion
This project represents an opportunity to significantly improve the operational capabilities of Seata AT/XA transaction modes by enriching the connection pool management features. With rigorous execution, it will provide valuable resources for both users and developers within the Apache Seata community.
Useful Link
[Apache Seata website](https://seata.apache.org/)
Contact Information
- Mentor: Min Ji, Apache Seata(incubating) PPMC member, jimin@apache.org
RocketMQ
Refactoring the RocketMQ Dashboard UI and Enhancing Usability
Background
Apache RocketMQ is renowned as a cloud-native messaging and streaming platform, enabling the creation of event-driven applications with simplicity and flexibility. The RocketMQ Dashboard is a crucial component that provides users with insight into system performance and client interactions through intuitive graphs and statistical data. Despite its fundamental role, the current user interface (UI) of the RocketMQ Dashboard is outdated, affecting user experience and interaction efficiency. Additionally, while the Dashboard offers valuable functionalities, there is a pressing need to enhance its usability and ensure robust security. This project aims to refactor the RocketMQ Dashboard by redesigning its UI with a more contemporary and user-friendly approach, improving overall usability, and introducing effective security measures to safeguard data and user interactions.
Relevant Skills
- Strong Java development skills.
- Experience with modern front-end technologies and frameworks
- Proficiency in Spring Boot development.
- Understanding of UX/UI design principles.
- Knowledge of security best practices in web applications.
- A keen interest in open-source projects and a willingness to learn and adapt.
Tasks
- Launch and experiment with the RocketMQ Dashboard to understand current functionalities.
- Refactor the UI of the RocketMQ Dashboard to align with modern user interface standards, ensuring it is intuitive and visually appealing.
- Improve usability by streamlining workflows, enhancing navigation, and incorporating responsive design.
- Integrate security features to protect user data, prevent unauthorized access, and mitigate potential vulnerabilities.
- Maintain compatibility with existing RocketMQ functionalities while focusing on enhancements.
Learning Material
- RocketMQ Homepage: https://rocketmq.apache.org/
- RocketMQ GitHub Repository: https://github.com/apache/rocketmq
- RocketMQ Dashboard GitHub Repository: https://github.com/apache/rocketmq-dashboard
Mentor
Rongtong Jin, Apache RocketMQ PMC, jinrongtong@apache.org
Potential Mentor
Juntao Ji, 3160102420@zju.edu.cn
Optimizing Apache RocketMQ's POP Orderly Consumption Process
Background
Apache RocketMQ is a distributed messaging and streaming platform that supports various messaging protocols. One of the key features of RocketMQ is its orderly message consumption capability, which guarantees that messages are processed in the order they are sent. However, there are existing issues with the POP Orderly consumption process that need to be addressed to enhance its reliability and performance.
Current Challenges
Currently, the POP Orderly feature faces several shortcomings, particularly in scenarios where network instability leads to the loss of the attemptId carried by the consumer from the previous round. This issue can result in message consumption getting stuck until the acknowledgment response (ack) for the previous message pull times out. Such situations hinder the efficient processing of messages and reduce the overall effectiveness of the messaging system.
Objectives
The primary objectives of this project are as follows:
- Refactor the POP Orderly Code: Analyze and redesign the existing codebase to improve its structure, maintainability, and performance.
- Optimize Performance: Implement performance enhancements that allow the POP Orderly feature to cope with network fluctuations and reduce the likelihood of consumption halting.
- Elegant Process Resolution: Develop a more graceful approach to handling the issue of consumption stalling, ensuring that the system can recover more smoothly from failures.
Recommended Skills
1. Proficiency in Java programming.
2. Strong understanding of concurrent programming.
3. Excellent logical thinking and problem-solving skills.
4. Familiarity with message queue systems, particularly Apache RocketMQ.
Mentor
Rongtong Jin, Apache RocketMQ PMC, jinrongtong@apache.org
Potential Mentor
Juntao Ji, 3160102420@zju.edu.cn
Difficulty: Major
Project Size: ~350 hours (large)
DolphinScheduler
Enhancing Apache DolphinScheduler with Generalized OIDC Authentication
Background
Apache DolphinScheduler is a distributed and extensible workflow scheduler platform designed to orchestrate complex data processing tasks. It provides a user-friendly interface for defining, scheduling, and monitoring workflows, making it easier to manage and automate data pipelines. DolphinScheduler supports various types of tasks, including shell scripts, SQL queries, and custom scripts, and integrates seamlessly with popular big data ecosystems.
Currently, the Apache DolphinScheduler system supports user login via Password, LDAP, Casdoor SSO, and OAuth. However, as a data platform, it frequently needs to integrate with enterprise-internal user accounts to achieve unified identity authentication, which is crucial for ensuring system security and unified user account management. The existing Casdoor implementation depends heavily on the Casdoor project, and the OAuth implementation lacks universality and flexibility.
Our objective is to implement a more generalized OIDC (OpenID Connect) login authentication mechanism. This will enable users to make better use of unified login authentication. Moreover, popular open source login authentication projects like Dexidp, Keycloak, and OAuthProxy all support OIDC. By supporting OIDC, users can integrate with both internal and third-party login authentication methods, such as Feishu Login and WeChat Work Login.
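To make the protocol steps concrete, here is a minimal Python sketch of the OIDC authorization-code flow that such a generalized provider has to drive. The issuer, client values, and redirect URI are placeholders, the real DolphinScheduler implementation would be Java-based (e.g., Spring Security or pac4j, see the learning material), and a production version must verify the id_token signature against the provider's JWKS rather than decoding it blindly.
# Illustrative sketch of the OIDC authorization-code flow; all endpoint and
# client values are placeholders, and id_token signature verification is omitted.
import base64
import json
import secrets
from urllib.parse import urlencode
import requests

ISSUER = "https://keycloak.example.com/realms/demo"    # hypothetical provider
CLIENT_ID = "dolphinscheduler"                         # hypothetical client registration
CLIENT_SECRET = "change-me"
REDIRECT_URI = "https://ds.example.com/oidc/callback"

def discover():
    """Every OIDC-compliant provider serves this well-known discovery document."""
    return requests.get(f"{ISSUER}/.well-known/openid-configuration", timeout=10).json()

def build_login_url(conf):
    params = {
        "response_type": "code",
        "client_id": CLIENT_ID,
        "redirect_uri": REDIRECT_URI,
        "scope": "openid profile email",
        "state": secrets.token_urlsafe(16),   # binds the callback to this login attempt
    }
    return f"{conf['authorization_endpoint']}?{urlencode(params)}"

def exchange_code(conf, code):
    """Exchange the callback code for tokens at the provider's token endpoint."""
    resp = requests.post(
        conf["token_endpoint"],
        data={"grant_type": "authorization_code", "code": code, "redirect_uri": REDIRECT_URI},
        auth=(CLIENT_ID, CLIENT_SECRET),
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()   # contains access_token and id_token

def claims_of(id_token):
    """Peek at the id_token claims (a real implementation verifies the JWT signature)."""
    payload = id_token.split(".")[1]
    payload += "=" * (-len(payload) % 4)   # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))
Because the flow starts from the standard discovery document, providers such as Keycloak, Dex, or an OAuth2 proxy in front of an enterprise IdP can all be plugged in the same way.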
Relevant Skills
Strong proficiency in Java development.
Experience in modern frontend technologies and frameworks.
A high level of expertise in Spring Boot development.
Thorough familiarity with OIDC and OAuth2 protocols.
Keen interest in open-source projects and eagerness to learn and adapt.
Tasks
Initiate and conduct experiments with Apache DolphinScheduler to comprehensively understand its current functionalities.
Implement and support a more generalized OIDC (OpenID Connect) login authentication mechanism.
Compose corresponding E2E test cases.
Create corresponding documentation for third-party login integrations, covering Keycloak, Dexidp, OAuthProxy, as well as Feishu Login and WeChat Work Login.
Optimize the UI of the Apache DolphinScheduler login page.
Ensure compatibility with the existing functionalities of Apache DolphinScheduler during the process of focusing on enhancements.
Learning Material
Apache DolphinScheduler HomePage: https://dolphinscheduler.apache.org
Apache DolphinScheduler GitHub Repository: https://github.com/apache/dolphinscheduler
Spring OAuth 2.0 Client: https://docs.spring.io/spring-security/reference/reactive/oauth2/client/index.html
pac4j OIDC: https://www.pac4j.org/docs/clients/openid-connect.html
OIDC (OpenID Connect): https://openid.net/developers/how-connect-works/
Mentor
Gallardot, Apache DolphinScheduler committer, gallardot@apache.org
SbloodyS, Apache DolphinScheduler PMC, zihaoxiang@apache.org
Difficulty: Medium
Project Size: ~150 hours (medium)
HugeGraph
[GSoC][HugeGraph] Implement Agentic GraphRAG Architecture
Apache HugeGraph(incubating) is a fast and highly scalable graph database/computing/AI ecosystem. Billions of vertices and edges can easily be stored in and queried from HugeGraph thanks to its excellent OLTP/OLAP capabilities.
Website: https://hugegraph.apache.org/
GitHub:
- https://github.com/apache/incubator-hugegraph/
- https://github.com/apache/incubator-hugegraph-ai/
Description
Currently, we have implemented a basic GraphRAG that relies on fixed processing workflows (e.g., knowledge retrieval & graph structure updates using the same execution pipeline), leading to insufficient flexibility and high overhead in complex scenarios. The proposed task introduces an Agentic architecture based on the principles of "dynamic awareness, lightweight scheduling, concurrent execution," focusing on solving the following issues:
- Rigid Intent Recognition: Existing systems cannot effectively distinguish between simple retrievals (e.g., entity queries) and complex operations (e.g., multi-hop reasoning), often defaulting to BFS-based template subgraph searches.
- Coupled Execution Resources: Memory/computational resources are not isolated based on task characteristics, causing long-tail tasks to block high-priority requests.
- Lack of Feedback Mechanisms: Absence of self-correction capabilities for erroneous operations (e.g., automatically switching to similar vertices/entities after path retrieval failures).
The task will include three core parts:
1. Dynamic Awareness Layer
- Implement an LLM-based real-time intent classifier that categorizes tasks (L1 simple retrieval / L2 path reasoning / L3 graph computation / L4+ etc.) based on semantic features (verb types, entity complexity, temporal modifiers); a minimal sketch of this classifier appears after the list of core parts.
- Build a lightweight operation cache to generate feature hashes for high-frequency requests, enabling millisecond-level intent matching.
2. Task Orchestration Layer
- Introduce a suitable workflow/taskflow framework emphasizing low coupling, high performance, and flexibility.
- Adopt a preemptive scheduling mechanism allowing high-priority tasks to pause non-critical phases of low-priority tasks (e.g., suspending subgraph preloading without interrupting core computations).
3. Concurrent Execution
- Decouple traditional RAG pipelines into composable operations (entity recall → path validation → context enhancement → result refinement), with dynamic enable/disable support for each component.
- Implement automatic execution engine degradation, triggering fallback strategies upon sub-operation failures (e.g., switching to alternative methods if Gremlin queries timeout).
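As a minimal sketch of the dynamic awareness idea (not an existing hugegraph-ai API; the llm_complete callable, prompt, labels, and hashing scheme are all placeholders), an LLM-backed intent classifier fronted by a lightweight feature-hash cache could look like this:
# Illustrative sketch: LLM-backed intent classification with a feature-hash cache.
import hashlib
from typing import Callable, Dict

LEVELS = ("L1_SIMPLE_RETRIEVAL", "L2_PATH_REASONING", "L3_GRAPH_COMPUTATION", "L4_OTHER")
PROMPT = (
    "Classify the user request into exactly one of: {levels}.\n"
    "Consider verb type, entity complexity and temporal modifiers.\n"
    "Request: {query}\nAnswer with the label only."
)

class IntentClassifier:
    def __init__(self, llm_complete: Callable[[str], str]):
        self._llm = llm_complete           # any LLM client call works here
        self._cache: Dict[str, str] = {}   # feature hash -> intent label

    @staticmethod
    def _feature_hash(query: str) -> str:
        # Crude normalization so near-identical high-frequency requests hit the cache.
        tokens = sorted(set(query.lower().split()))
        return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

    def classify(self, query: str) -> str:
        key = self._feature_hash(query)
        if key in self._cache:             # millisecond path for hot requests
            return self._cache[key]
        answer = self._llm(PROMPT.format(levels=", ".join(LEVELS), query=query)).strip()
        label = answer if answer in LEVELS else "L4_OTHER"
        self._cache[key] = label
        return label

if __name__ == "__main__":
    clf = IntentClassifier(lambda prompt: "L1_SIMPLE_RETRIEVAL")  # stub LLM for demo
    print(clf.classify("Who is the author of paper X?"))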
Recommended Skills
- Proficiency in Python and familiarity with at least one open/closed-source LLM.
- Experience with an LLM RAG/Agent framework such as LangGraph, RAGFlow, LlamaIndex, or Dify.
- Knowledge of LLM optimization techniques and RAG construction (KG extraction/construction experience is a plus).
- Strong algorithmic engineering skills (problem abstraction, algorithm research, big data processing, model tuning).
- Familiarity with VectorDB/Graph/KG/HugeGraph read-write workflows and principles.
- Understanding of graph algorithms (e.g., community detection, centrality, PageRank) and open-source community experience preferred.
Task List
- Develop a hierarchical triggering mechanism for the intent classifier to categorize L1~LN tasks within milliseconds (accuracy >90%).
- Semi-automatically generate Graph Schema/extraction prompts.
- Support dynamic routing and query decomposition.
- Design an execution trace tracker to log micro-operation resource consumption and generate optimization reports.
- Enhance retrieval with graph algorithms: Apply node importance evaluation, path search, etc., to optimize knowledge recall.
- Implement a dialogue memory management module for context-aware state tracking and information reuse.
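The composable-operation decoupling and automatic degradation described in part 3 above could be structured roughly as in the sketch below. The operation names are illustrative stand-ins; in a real implementation they would wrap Gremlin/HugeGraph queries and vector retrieval.

```python
from typing import Callable, List

class OperationFailed(Exception):
    pass

def with_fallback(primary: Callable, fallback: Callable) -> Callable:
    """Wrap a pipeline operation so a failure triggers a degraded strategy."""
    def run(state: dict) -> dict:
        try:
            return primary(state)
        except OperationFailed:
            # e.g. a Gremlin path query timed out -> fall back to vector similarity
            return fallback(state)
    return run

def run_pipeline(state: dict, operations: List[Callable]) -> dict:
    # Each operation takes and returns a shared state dict, so individual steps
    # (entity recall, path validation, context enhancement, refinement) can be
    # enabled, disabled, or reordered per intent level.
    for op in operations:
        state = op(state)
    return state

# Illustrative stand-ins for real operations.
def gremlin_path_search(state):  raise OperationFailed("timeout")
def vector_similarity(state):    return {**state, "paths": ["~approximate path~"]}
def context_enhance(state):      return {**state, "context": state.get("paths", [])}

pipeline = [with_fallback(gremlin_path_search, vector_similarity), context_enhance]
print(run_pipeline({"query": "multi-hop question"}, pipeline))
```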
Size
- Difficulty: Hard
- Project size: ~350 hours (full-time/large)
Potential Mentors
- Imba Jin: jin@apache.org (Apache HugeGraph PPMC)
- Simon: ming@apache.org (Apache HugeGraph PPMC)
Doris
Apache Doris Evaluating Column Encoding and Optimization
Synopsis
Apache Doris is a real-time data warehouse that utilizes columnar storage. Currently, Doris applies default encoding methods based on column data types. This project aims to evaluate the efficiency of these default encodings (e.g., encoding/decoding time and compression ratios) using benchmark datasets like TPC-DS, HTTP logs, and TPC-H. The findings will guide optimizations to improve performance.
Key Objectives
- A. Develop a tool to evaluate encoding efficiency. The tool will take a column of data and an encoding method as input and output metrics such as compression ratio and processing speed (see the sketch after this list).
- B. Optimize dictionary encoding for string columns. Current implementations apply dictionary encoding by default without evaluating data suitability, leading to inefficiencies for non-dictionary-friendly data.
- C. Assess the effectiveness of BitShuffle encoding for enhancing downstream compression.
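A rough Python sketch of the kind of measurement objective A calls for is shown below. It uses zlib and a naive numpy bit-transpose as stand-ins for Doris's C++ encoders, so the absolute numbers are only illustrative of the metrics (compression ratio, encoding time) the real tool would report from inside the Doris codebase.

```python
import time
import zlib
import numpy as np

def bitshuffle(arr: np.ndarray) -> bytes:
    """Naive bit-transpose: group the i-th bit of every element together,
    which often makes slowly-varying numeric columns more compressible."""
    bits = np.unpackbits(arr.view(np.uint8).reshape(len(arr), arr.dtype.itemsize), axis=1)
    return np.packbits(bits.T).tobytes()

def evaluate(name: str, raw: bytes) -> None:
    start = time.perf_counter()
    compressed = zlib.compress(raw, level=6)   # general-purpose codec as a proxy
    elapsed = time.perf_counter() - start
    ratio = len(raw) / len(compressed)
    print(f"{name:>12}: ratio={ratio:5.2f}  encode={elapsed * 1e3:6.1f} ms")

# A synthetic "column": slowly increasing int32 values, similar to an ID column.
column = np.arange(1_000_000, dtype=np.int32) // 7

evaluate("plain", column.tobytes())
evaluate("bitshuffled", bitshuffle(column))
```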
Benefits to the Community
- Improve data compression efficiency in Apache Doris.
- Enhance query performance through optimized encoding/decoding.
Technical Details
- Languages/Tools: C++ for encoding logic, GitHub for version control.
- Methodology:
Benchmark existing encoding methods (e.g., dictionary, BitShuffle).
Develop an evaluation framework to measure compression ratios and processing overhead.
Implement optimizations for specific data types and use cases.
Timeline (12+ Weeks, Full-Time Commitment - 30 hrs/week)
- Community Bonding (Weeks 1-2)
Engage with mentors and the Doris community.
Set up the development environment and study the codebase.
Document current column encoding strategies for all data types.
- Phase 1: Planning & Initial Development (Weeks 3-6)
Build a tool to evaluate encoding schemes across data types.
Run benchmarks using TPC-DS, HTTP logs, and TPC-H datasets.
- Phase 2: Analysis & Optimization (Weeks 7-10)
- Optimize Dictionary Encoding: Automatically detect and skip non-dictionary-friendly data (e.g., high-cardinality strings).
- BitShuffle Evaluation: Quantify its impact on compression ratios and processing speed.
Address additional optimization opportunities identified during analysis.
- Phase 3: Finalization & Refinement (Weeks 11-12+)
Refine code and documentation based on community feedback.
Submit PRs and ensure their merge into the Doris master branch.
🔹 Total Effort: 350+ hours
Expected Outcomes
A tool to evaluate encoding efficiency for all Doris column types.
Optimized dictionary encoding logic with automated suitability checks.
Improved BitShuffle integration for enhanced compression.
Additional optimizations identified during the project.
This project will strengthen Apache Doris’s performance in real-time analytics scenarios while fostering collaboration within the open-source community.
Contact Information
Mentor Name: [Yongqiang Yang](dataroaring@apache.org), Apache Doris PMC member
Mentor Name: [Chen Zhang](zhangchen@apache.org), Apache Doris Committer
Apache Doris Enhancing Group Commit Functionality
Synopsis
The current Group Commit mechanism in Apache Doris batches data until a predefined size or time threshold is met before committing. This project aims to improve flexibility and control over data visibility by introducing the following enhancements:
Trigger Immediate Flush After a Specified Number of Imports: Allow data to be committed automatically after accumulating a configurable number of import operations.
SYNC TABLE Syntax Support: Enable users to explicitly trigger Group Commit for a table via SQL (e.g., SYNC TABLE table_name), ensuring the command returns only after the commit completes (a usage sketch follows this list).
System Table for Monitoring: Add an information_schema.group_commit system table to track Group Commit status, including columns such as BE host, table ID, and commit metadata (e.g., batch size, latency).
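A minimal usage sketch of the proposed features, assuming they are implemented as described. Since Doris speaks the MySQL protocol, a standard client such as pymysql could exercise them; the connection parameters, table names, and system-table columns below are placeholders, not existing Doris behavior.

```python
import pymysql

# Placeholder connection details; Doris FE exposes a MySQL-compatible endpoint.
conn = pymysql.connect(host="127.0.0.1", port=9030, user="root", password="", database="demo")
try:
    with conn.cursor() as cur:
        # Explicitly flush any pending group-commit batch for this table;
        # the proposed statement returns only after the commit completes.
        cur.execute("SYNC TABLE orders")

        # Inspect group-commit status via the proposed system table.
        cur.execute(
            "SELECT be_host, table_id, batch_size, latency_ms "
            "FROM information_schema.group_commit WHERE table_id = %s",
            (12345,),
        )
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```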
Technical Details
Languages: C++ (core) and Java (SQL syntax integration).
Tools: GitHub for version control and collaborative development.
Timeline (12+ Weeks, Full-Time Commitment - 30 hrs/week)
Community Bonding (Weeks 1-2)
Collaborate with mentors and the Apache Doris community.
Set up the development environment and review the existing Group Commit implementation.
Document the current Group Commit workflow and proposed optimizations.
Phase 1: Implementation & Testing (Weeks 3-6)
Develop support for flushing data after a configurable number of imports.
Implement the SYNC TABLE syntax to trigger manual Group Commit.
Design and integrate the information_schema.group_commit system table.
Conduct performance benchmarking and rigorous testing.
Phase 2: Refinement & Integration (Weeks 7+)
Address feedback from code reviews and community testing.
Finalize documentation and ensure backward compatibility.
Submit pull requests (PRs) and work toward merging changes into the master branch.
🔹 Total Effort: 210+ hours
Expected Outcomes
Enhanced flexibility in Group Commit with configurable flush triggers (size, time, or import count).
A user-friendly SYNC TABLE SQL command for explicit commit control.
A monitoring system table (information_schema.group_commit) for real-time visibility into commit operations.
Robust performance validation and integration into Apache Doris’s core workflow.
This project will empower users with finer control over data ingestion and visibility while maintaining Doris’s high-throughput capabilities.
Contact Information
Mentor Name: [Yongqiang Yang](dataroaring@apache.org), Apache Doris PMC member
Mentor Name: [Yi Mei](zhangchen@apache.org), Apache HBase Committer
HertzBeat
[GSOC][HertzBeat] AI Agent Based on the MCP Protocol for Monitoring Info Interaction
Website: https://hertzbeat.apache.org/
Github: http://github.com/apache/hertzbeat/
*Background*
Apache HertzBeat is an open-source real-time monitoring tool that supports a wide range of monitoring targets, including web services, databases, middleware, and more. It features high performance, scalability, and security.
With the advancement of artificial intelligence (AI) technologies, integrating AI with monitoring systems can significantly enhance their usability and interactivity. By developing an AI Agent based on the Model Context Protocol (MCP), we aim to enable conversational interaction for querying monitoring information, adding new monitoring tasks, and retrieving monitoring metrics. This will provide a more user-friendly and intelligent monitoring management experience.
*Objectives*
1. Research and Implementation: Develop an AI Agent based on Apache HertzBeat and the MCP protocol to enable conversational interaction with users.
2. Functional Implementation:
- Query Monitoring and Alarm Information: Allow users to query the status of monitoring targets (e.g., normal, abnormal) and retrieve metrics data (e.g., CPU usage, memory usage, response time) and alarm data through conversational commands (a sketch of the tool surface follows this list).
- Add New Monitoring Tasks: Enable users to add new monitoring targets (e.g., web services, databases, middleware) and configure alert thresholds via conversational commands.
- Retrieve Monitoring Metrics Data: Allow users to obtain metrics data for specific monitoring targets and support data visualization via conversational commands.
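For illustration only, here is a Python sketch of the tool surface such an agent might expose via the MCP Python SDK; the project itself recommends Spring AI in Java, and the HertzBeat endpoints and payload fields below are assumptions rather than the actual REST API.

```python
import requests
from mcp.server.fastmcp import FastMCP

HERTZBEAT = "http://localhost:1157"   # placeholder HertzBeat base URL
mcp = FastMCP("hertzbeat-agent")

@mcp.tool()
def query_monitor_status(monitor_name: str) -> dict:
    """Return status and latest metrics for a named monitoring target."""
    # Hypothetical endpoint; adjust to HertzBeat's real API.
    resp = requests.get(f"{HERTZBEAT}/api/monitors", params={"name": monitor_name})
    resp.raise_for_status()
    return resp.json()

@mcp.tool()
def add_monitor(app: str, host: str, interval_seconds: int = 60) -> dict:
    """Create a new monitoring task (e.g., a website or database target)."""
    payload = {"app": app, "host": host, "intervals": interval_seconds}  # assumed fields
    resp = requests.post(f"{HERTZBEAT}/api/monitor", json=payload)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # Serve the tools over stdio so an MCP-capable LLM client can call them.
    mcp.run()
```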
*Requirements Analysis*
- Apache HertzBeat: As the core backend for the monitoring system, it provides functions for data collection, storage, and management.
- MCP Protocol: An open protocol that enables seamless integration between LLM applications and external data sources and tools.
- Front-end Interaction: Develop a user-friendly interface that supports voice or text input and displays monitoring information and interaction results.
*Recommended Skills*
- Java + TypeScript: Apache HertzBeat is built on this technology stack, so familiarity with both is essential for integrating with it.
- Spring AI: It is recommended to use Spring AI to build the AI agent.
- LLM + MCP: You need an understanding of LLMs (Large Language Models) and the MCP protocol. Spring AI appears to support the MCP protocol; alternatively, consider using the MCP SDK directly.
*Size*
- Difficulty: Hard
- Project size: ~350 hours
*Potential Mentors*
- Chao Gong: gongchao@apache.org
- Shenghang Zhang: shenghang@apache.org
Mahout
Apache Mahout Refactoring the Website
Synopsis
Apache Mahout has been evolving, with a recent shift in focus toward Quantum Computing (Qumat). However, the official website does not currently reflect this transition, making it difficult for developers and contributors to engage with Mahout’s new direction. Additionally, legacy components like MapReduce and Samsara are no longer actively developed but still occupy prominent space on the website.
This project aims to refactor the Apache Mahout website to:
- Bring Quantum Computing (Qumat) front and center as the new core focus of the project.
- Deprecate outdated technologies (MapReduce and Samsara) while keeping the documentation intact with clear deprecation warnings.
- Improve website structure, navigation, and content organization to enhance accessibility and usability.
By executing these changes, this project will ensure that new and existing users can quickly access relevant information while keeping historical documentation available in a structured manner.
Benefits to the Community
A well-organized and up-to-date website is essential for any open-source project. This proposal offers multiple benefits to the Apache Mahout community:
1. Highlighting Quantum Computing (Qumat)
- Restructure the website so that Qumat-related content is the primary focus.
- Ensure that all documentation, blogs, and tutorials related to Qumat are easily discoverable from the homepage.
2. Deprecating MapReduce and Samsara
- Add clear deprecation warnings to pages related to MapReduce and Samsara.
- Ensure these technologies remain accessible for historical reference but indicate that they are no longer actively maintained.
3. Improved Navigation and Accessibility
- Design a more intuitive navigation system for easy exploration of different sections.
- Ensure smooth access to documentation, blogs, and learning resources.
4. Updating Outdated Content
- Perform a full website audit to identify obsolete articles, guides, and references.
- Refresh and rewrite content where necessary, focusing on Mahout’s latest advancements.
5. Engaging New Contributors
- A modern, user-friendly website will attract more developers, researchers, and open-source contributors to the project.
Deliverables
1. Website Restructuring
- Modify the homepage and navigation bar to prominently feature Quantum Computing (Qumat) as the main focus.
- Ensure Qumat-related documentation and blog posts are front and center.
2. Deprecation of MapReduce and Samsara
- Add banner notifications on all MapReduce and Samsara pages marking them as deprecated.
- Ensure clear explanations so users understand these technologies are no longer in active development.
3. Content Review & Updates
- Perform a recursive LS audit to identify outdated and redundant content.
- Update old blogs and articles to align with Mahout’s latest developments.
4. Improved Website Navigation
- Implement a modern, responsive, and mobile-friendly navigation system.
- Optimize loading speed and ensure smooth user experience.
5. Documentation Enhancement
- Ensure all essential documentation is accessible from the homepage.
- Improve the readability and structure of the docs.
Technical Details
The project will utilize:
- HTML, CSS, JavaScript for website front-end improvements.
- Modern front-end frameworks (if required) to enhance UX/UI.
- Shell scripting or Python to perform a recursive LS audit of the website structure (see the sketch after this list).
- Version control via GitHub for tracking changes and ensuring collaboration.
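A possible starting point for the audit, sketched in Python: walk the website source tree, list every page, and flag ones that mention deprecated components or have not been modified in a long time. The root path, file extensions, and keyword list are assumptions to be adapted to the actual repository.

```python
import time
from pathlib import Path

SITE_ROOT = Path("website")                # placeholder path to the site sources
DEPRECATED = ("mapreduce", "samsara")      # keywords that should carry deprecation banners
STALE_AFTER_DAYS = 365 * 2

now = time.time()
for page in sorted(SITE_ROOT.rglob("*")):
    if page.suffix.lower() not in {".md", ".html"}:
        continue
    text = page.read_text(errors="ignore").lower()
    flags = [kw for kw in DEPRECATED if kw in text]
    age_days = (now - page.stat().st_mtime) / 86400
    if age_days > STALE_AFTER_DAYS:
        flags.append(f"untouched for {age_days:.0f} days")
    if flags:
        print(f"{page}: {', '.join(flags)}")
```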
Expected Outcomes
✅ A refactored website that clearly emphasizes Quantum Computing (Qumat).
✅ A deprecated but accessible archive for MapReduce and Samsara.
✅ An updated and well-structured content repository for Mahout users and contributors.
✅ An intuitive, user-friendly website that engages both new and existing users.
Timeline (12+ Weeks, Full-Time Commitment - 30 hrs/week)
Community Bonding (Weeks 1-2)
- Engage with mentors and the Mahout community.
- Gather feedback on website restructuring priorities.
- Set up the development environment and review existing website architecture.
Phase 1: Planning & Initial Development (Weeks 3-6)
- Redesign homepage and navigation bar to prioritize Qumat.
- Identify and start modifying MapReduce and Samsara pages with deprecation warnings.
- Conduct a recursive LS audit to locate outdated files and redundant content.
Phase 2: Implementation & Testing (Weeks 7-10)
- Implement the new website navigation and homepage.
- Update and restructure documentation and blog content.
- Optimize the website’s file structure based on LS audit findings.
- Conduct extensive testing for responsiveness, accessibility, and performance.
Phase 3: Content Finalization & Refinement (Weeks 11-12+)
- Finalize deprecation notices for MapReduce and Samsara.
- Ensure all Qumat-related content is easily accessible.
- Perform last-minute optimizations and bug fixes.
- Gather final feedback from the community and document all changes.
🔹 Total Timeline: 350+ hrs
Why This Should Be a GSoC Project
This project directly aligns with Google Summer of Code’s mission to enhance open-source software. By modernizing the Apache Mahout website, we ensure that its new focus on Quantum Computing (Qumat) is clearly reflected, making it easier for developers and researchers to engage with Mahout’s latest advancements.
Additionally, this project is well-scoped for GSoC, combining front-end development, content management, and structured auditing—all crucial aspects for a website overhaul.
Mentorship & Feasibility
- The project has clear, well-defined goals and structured milestones.
- It will be mentored by an experienced Apache Mahout maintainer who is applying for the mentor role.
- The tasks are technically feasible within the GSoC timeframe.
Conclusion
Refactoring the Apache Mahout website is essential for reflecting its new focus on Quantum Computing (Qumat) while ensuring historical documentation remains accessible. By modernizing the site, we enhance usability, improve accessibility, and help new users quickly understand Mahout’s direction.
This project will significantly enhance Mahout’s online presence and ensure the community stays well-informed and engaged.