This page is auto-generated! Please do NOT edit it; all changes will be lost on the next update.
Contents
Alternatives like Postfix...
Given Apache James's recent push to adopt a distributed mail queue based on Pulsar supporting delays (JAMES-3687), it starts making sense to develop MX-related tooling.
I propose to mentor a GSoC project on this topic.
At the end of this GSOC you will...
James ships a couple of MX-related tools as smtp-hooks/mailets in the default packages. It would make sense to me to move those into an extension.
Today, James supports checks against DNS blacklists, for instance via the `DNSRBLHandler` or `URIRBLHandler` SMTP hooks. This can be moved into an extension IMO.
We would need a little performance benchmark to document performance implications of activating DNS-RBL.
Finally, as suggested on Gitter: it would make more sense to implement this as a MailHook rather than a RcptHook, as that would avoid repeating the same work over and over for each recipient. See JAMES-3820.
Grey listing. There's an existing implementation using JDBC as an underlying storage.
Move it as an extension.
Remove the JDBC storage and propose two storage possibilities: in-memory for a single node, Redis for a distributed topology.
Some work around whitelist mailets? Move them into an extension, and propose JPA, Cassandra, and XML-configured implementations? With a route to manage entries in there for JPA + Cassandra?
I would expect a student to do their own little audit and come up with extra suggestions!
https://www.mail-archive.com/server-dev@james.apache.org/msg71462.html
A good long term objective for the PMC is to drop RabbitMQ in favor of Pulsar (third parties could package their own components using RabbitMQ if they wish...)
This means:
While contributions would of course be welcomed on this topic, we could
offer it as part of GSOC 2022, and we could co-mentor it with mentors of
the Pulsar community (see [3])
[3] https://lists.apache.org/thread/y9s7f6hmh51ky30l20yx0dlz458gw259
Would such a plan gain traction around here?
James today provides a command line tool to do administration tasks like creating a domain, listing users, setting quota, etc.
It requires access to the JMX port, and even if many admins are comfortable with such tools, to broaden our user base we should probably expose the same commands over REST and provide a fancy default web UI.
The task requires some basic skills with frontend tools to design an administration board, knowledge of what REST means, and enough Java understanding to add commands to the existing REST backend.
In the team, we have a strong focus on testing (who wants a mail server that is not tested enough?), so we will explain and/or teach the student how to achieve the right test coverage of the features using modern tools like Cucumber, Selenium, rest-assured, etc.
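To make the idea concrete, a command like "list users" exposed over REST could look roughly like the sketch below. This is not James's actual webadmin code; the endpoint, port, and user list are made up, and only the JDK's built-in HTTP server is used:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class AdminRestSketch {
    // Hypothetical in-memory user list standing in for the real user repository.
    static final String[] USERS = {"alice@example.org", "bob@example.org"};

    static String listUsersJson() {
        return "[\"" + String.join("\",\"", USERS) + "\"]";
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8000), 0);
        // GET /users returns the user list as JSON, mirroring the CLI "list users" command.
        server.createContext("/users", exchange -> {
            byte[] body = listUsersJson().getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        System.out.println("Admin REST sketch listening on :8000");
        server.stop(0); // stop immediately in this sketch
    }
}
```

In the real backend, the handler would of course delegate to the same service layer the JMX command already uses.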
Placeholder for tasks that could be undertaken in this year's GSoC.
Ideas:
Add implementations of extended precision floating point numbers.
An extended precision floating point number is a series of non-overlapping floating-point numbers such that:
double-double (a, b): |a| > |b| and a == a + b (b holds the round-off that does not fit in a)
Common representations are double-double and quad-double (see for example David Bailey's paper on a quad-double library: QD).
Many computations in the Commons Numbers and Statistics libraries use extended precision computations where the accumulated error of a double would lead to complete cancellation of all significant bits; or create intermediate overflow of integer values.
This project would formalise the code underlying these use cases into a generic library applicable where the result is expected to be a finite value and where using Java's BigDecimal and/or BigInteger negatively impacts performance.
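To make the technique concrete, here is a small self-contained sketch (illustrative only, not the proposed library API) of Knuth's branch-free TwoSum, which recovers the exact round-off of a double addition and yields the non-overlapping (sum, error) pair described above:

```java
public class TwoSumDemo {

    /** Knuth's TwoSum: returns {s, e} where s = fl(a + b) and s + e == a + b exactly. */
    static double[] twoSum(double a, double b) {
        double s = a + b;
        double bVirtual = s - a;        // the part of b that made it into s
        double aVirtual = s - bVirtual; // the part of a that made it into s
        double err = (a - aVirtual) + (b - bVirtual);
        return new double[] {s, err};
    }

    public static void main(String[] args) {
        // Adding 1.0 to 1e16 loses the 1.0 in plain double arithmetic;
        // TwoSum captures it in the error term.
        double[] r = twoSum(1e16, 1.0);
        System.out.println(r[0]); // 1.0E16
        System.out.println(r[1]); // 1.0
    }
}
```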
An example would be the average of long values where the intermediate sum overflows or the conversion to a double loses bits:
```java
long[] values = {Long.MAX_VALUE, Long.MAX_VALUE};
System.out.println(Arrays.stream(values).average().getAsDouble());
System.out.println(Arrays.stream(values).mapToObj(BigDecimal::valueOf)
    .reduce(BigDecimal.ZERO, BigDecimal::add)
    .divide(BigDecimal.valueOf(values.length)).doubleValue());

long[] values2 = {Long.MAX_VALUE, Long.MIN_VALUE};
System.out.println(Arrays.stream(values2).asDoubleStream().average().getAsDouble());
System.out.println(Arrays.stream(values2).mapToObj(BigDecimal::valueOf)
    .reduce(BigDecimal.ZERO, BigDecimal::add)
    .divide(BigDecimal.valueOf(values2.length)).doubleValue());
```
Outputs:
-1.0
9.223372036854776E18
0.0
-0.5
Placeholder for tasks that could be undertaken in this year's GSoC.
Ideas (extracted from the "dev" ML):
Other suggestions welcome, as well as
As discussed extensively on the "dev" ML[1][2], there are two competing designs (please review them on the dedicated git branch) for the refactoring of the basic functionality currently implemented in the org.apache.commons.math4.legacy.genetics "legacy" package.
TL;DR;
[1] https://markmail.org/message/qn7gq2y7xjoxukzp
[2] https://markmail.org/message/f66iii3a4kmjaprr
A placeholder ticket, to link other issues and organize tasks related to the 1.0 release of Commons Imaging.
The 1.0 release of Commons Imaging has been postponed several times. Now we have a clearer idea of what's necessary for 1.0 (see issues with fixVersion 1.0 and 1.0-alpha3, and other open issues), and the tasks are interesting as they involve both basic and advanced programming, such as organizing how test images are loaded, or working on performance improvements at the byte level while following image format specifications.
The tasks are not too hard to follow, as normally there are example images that need to work with Imaging, as well as other libraries in C, C++, Rust, PHP, etc., that process these images correctly. Our goals with this issue are to a) improve our docs, b) improve our tests, c) fix possible security issues, and d) get the parsers in Commons Imaging ready for the 1.0 release.
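As a small taste of the byte-level, spec-driven work the parsers involve (a standalone sketch, not Commons Imaging's API), image formats are identified by magic bytes that their specifications define:

```java
import java.util.Arrays;

public class MagicBytesSketch {
    // Signatures as defined by the PNG and JPEG specifications.
    static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};
    static final byte[] JPEG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};

    /** Guess the format of an image from its leading bytes. */
    static String detect(byte[] header) {
        if (header.length >= PNG.length
                && Arrays.equals(Arrays.copyOf(header, PNG.length), PNG)) {
            return "PNG";
        }
        if (header.length >= JPEG.length
                && Arrays.equals(Arrays.copyOf(header, JPEG.length), JPEG)) {
            return "JPEG";
        }
        return "unknown";
    }
}
```

PNG files start with the eight-byte signature 89 50 4E 47 0D 0A 1A 0A, and JPEG files with FF D8 FF; real parsers then follow the spec chunk by chunk.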
Assigning the GSoC 2023 label, as a full-time project, although it would also be possible to work part-time on a smaller set of 1.0 tasks.
Github Issue: https://github.com/apache/rocketmq/issues/6282
High-speed storage media, such as solid-state drives (SSDs), are typically more expensive than traditional hard disk drives (HDDs). To minimize storage costs, the local data disk size of a RocketMQ broker is often limited. HDFS can store large amounts of data at a lower cost, and it is better suited to storing and retrieving data sequentially rather than randomly. To preserve message data over a long period or to facilitate message export, the RocketMQ project previously introduced a tiered storage plugin. Now it is necessary to implement a storage plugin that saves data on HDFS.
Anyways, the most important relevant skill is motivation and readiness to learn during the project!
Apache RocketMQ
Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity, and flexible scalability.
Page: https://rocketmq.apache.org
Repo: https://github.com/apache/rocketmq
Background
RocketMQ 5.0 introduced a new component, the Controller, which handles the high-availability master-slave switch in multi-replica scenarios. It uses the DLedger Raft library as a consensus-replicated state machine for metadata. As a completely independent component it runs fine in ordinary scenarios, but large-scale clusters must maintain a large number of broker groups, which is a great challenge for operational capability and wastes resources. When dealing with a large number of broker groups, we need to optimize performance for large-scale scenarios, leveraging DLedger's own high-performance writes and optimizing the current Controller architecture.
Task
1. Polish the usage of DLedger
Currently, on the Controller side, a single-threaded task queue is used for read and write requests to DLedger; that is, only one read/write request can be processed at a time. DLedger itself implements many optimizations for multi-client reads and writes and can guarantee linearizable reads. However, all read and write processing currently goes through a single logical DLedger client, which will become a serious performance bottleneck in large-scale scenarios.
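To illustrate the direction (class and method names here are hypothetical, not the actual Controller or DLedger API), the idea is to replace one-at-a-time queue draining with concurrent asynchronous reads:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelReadSketch {
    // Hypothetical stand-in for a DLedger read; the real client is already asynchronous.
    static CompletableFuture<String> readAsync(int key, Executor pool) {
        return CompletableFuture.supplyAsync(() -> "value-" + key, pool);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        // Instead of draining a single-threaded task queue one request at a time,
        // dispatch all reads concurrently and join the futures.
        List<CompletableFuture<String>> futures = IntStream.range(0, 8)
                .mapToObj(i -> readAsync(i, pool))
                .collect(Collectors.toList());
        List<String> results = futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
        System.out.println(results.size()); // 8
        pool.shutdown();
    }
}
```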
2. Optimization of DLedger features usage
DLedger itself can implement many optimizations, such as ReadIndex reads and Follower reads; once implemented, these let us fully leverage read performance. Currently, all broker nodes communicate with the Leader node of the Controller. In large-scale scenarios, this concentrates the requests of each Controller group on the Leader node, while the Follower nodes share none of the Leader's request load, causing a single-point performance bottleneck on the Leader.
3. Full asynchronous + parallel processing
Currently, DLedger itself is fully asynchronous, but on the Controller side all requests to DLedger are synchronous, and many Controller-side operations, such as heartbeat checks and other timed tasks, run synchronously on a single thread. In large-scale scenarios these single-threaded synchronous operations will block a large number of requests from the broker side, so asynchronous + parallel processing can be used for optimization.
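Applying the same idea to timed tasks (again with hypothetical names; a real check would consult the Controller's heartbeat records), broker liveness checks can be fanned out in parallel instead of running one by one on a single thread:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class AsyncHeartbeatSketch {
    // Hypothetical stand-in for checking one broker's heartbeat; here, even ids
    // are treated as alive so the sketch is deterministic.
    static CompletableFuture<Boolean> checkBrokerAsync(int brokerId, Executor pool) {
        return CompletableFuture.supplyAsync(() -> brokerId % 2 == 0, pool);
    }

    /** Run all broker checks in parallel instead of sequentially on one thread. */
    static long countAlive(int brokers, Executor pool) {
        List<CompletableFuture<Boolean>> checks = IntStream.range(0, brokers)
                .mapToObj(i -> checkBrokerAsync(i, pool))
                .collect(Collectors.toList());
        return checks.stream().filter(CompletableFuture::join).count();
    }
}
```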
4. Correctness testing and performance testing
After completing the above optimizations, it is necessary to conduct correctness testing on the new version and use distributed chaos testing frameworks such as OpenChaos to verify correct operation under fault scenarios such as network partition and random crashes.
After completing the correctness testing, a detailed performance testing report can be produced by comparing the new and old versions.
Skills Required
Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.
Page: https://rocketmq.apache.org
RocketMQ 5.0 has released clients in various languages, including Java, C++, and Golang. To cover all major programming languages, a Python client needs to be implemented.
Related Repo: https://github.com/apache/rocketmq-clients
The developer is required to be familiar with the Java implementation and capable of developing a Python client, while ensuring consistent functionality and semantics.
Relevant Skills
Python language
Basic knowledge of RocketMQ 5.0
Yangkun Ai, PMC of Apache RocketMQ, aaronai@apache.org
Apache RocketMQ
Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.
Page: https://rocketmq.apache.org
Background
With the official release of RocketMQ 5.1.0, tiered storage has arrived as a new independent module in the Technical Preview milestone. This allows users to unload messages from local disks to other cheaper storage, extending message retention time at a lower cost.
Reference RIP-57: https://github.com/apache/rocketmq/wiki/RIP-57-Tiered-storage-for-RocketMQ
In addition, RocketMQ introduced a new high availability architecture in version 5.0.
Reference RIP-44: https://github.com/apache/rocketmq/wiki/RIP-44-Support-DLedger-Controller
However, currently RocketMQ tiered storage only supports single replicas.
Task
Currently, tiered storage only supports single replicas, and there are still the following issues in the integration with the high availability architecture:
So you need to provide a complete plan to solve the above issues and ultimately complete the integration of tiered storage and high availability architecture, while verifying it through the existing tiered storage file version and OpenChaos testing.
Relevant Skills
Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity, and flexible scalability.
Page: https://rocketmq.apache.org
Repo: https://github.com/apache/rocketmq
RocketMQ 5.0 has released a new module called `proxy`, which supports the gRPC and remoting protocols. Additionally, it can be deployed in two modes, namely Local and Cluster mode. The performance tuning task will give contributors a comprehensive understanding of Apache RocketMQ and its intricate data flow, presenting a unique opportunity for beginners to get acquainted with and actively participate in our community.
The task is to tune the RocketMQ proxy for optimal latency and throughput. It requires thorough knowledge of the Java implementation and the ability to fine-tune Netty, gRPC, the operating system, and RocketMQ itself. We expect the developer responsible for this task to produce a performance report with measurements of both latency and throughput.
Basic knowledge of RocketMQ 5.0, Netty, gRPC, and operating system.
Mailing List: dev@rocketmq.apache.org
Mentor
Zhouxiang Zhan, committer of Apache RocketMQ, zhouxzhan@apache.org
RocketMQ Streams is a lightweight stream processing framework; an application gains stream processing abilities by depending on RocketMQ Streams as an SDK.
Repo of RocketMQ Streams: https://github.com/apache/rocketmq-streams
The architecture document of RocketMQ Streams: RocketMQ Streams examples,
The observability needs to be enhanced in the following aspects:
This task needs you to study the implementation details of RocketMQ Streams and identify the key indicators in the stream processing process, then design and implement a complete observability solution and finally use it to diagnose runtime problems.
nize, Committer of Apache RocketMQ, karp@apache.org
Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.
Page: https://rocketmq.apache.org
Github: https://github.com/apache/rocketmq
RocketMQ is a widely used message middleware system in the Java community, which mainly supports Java 8. As Java has evolved, many new features and improvements have been added to the language and the Java Virtual Machine (JVM). However, RocketMQ still lacks compatibility with the latest Java versions, preventing users from taking advantage of new features and performance improvements. Therefore, we are seeking community support to upgrade RocketMQ to support higher versions of Java and enable the use of new features and JVM parameters.
We aim to update the RocketMQ codebase to support newer versions of Java in a cross-compile manner. The goal is to enable RocketMQ to work with Java17, while maintaining backward compatibility with previous versions of Java. This will involve identifying and updating any dependencies that need to be changed to support the new Java versions, as well as testing and verifying that the new version of RocketMQ works correctly. With these updates, users will be able to take advantage of the latest Java features and performance improvements. We hope that the community can come together to support this task and make RocketMQ a more versatile and powerful middleware system.
Yangkun Ai, PMC of Apache RocketMQ, aaronai@apache.org
Apache RocketMQ is a distributed messaging and streaming platform with low latency, high performance and reliability, trillion-level capacity and flexible scalability.
Page: https://rocketmq.apache.org
Github: https://github.com/apache/rocketmq
The RocketMQ 5.0 client has been released recently, and we need to integrate it with Spring.
Related issue: https://github.com/apache/rocketmq-clients/issues/275
Rongtong Jin, PMC of Apache RocketMQ, jinrongtong@apache.org
Yangkun Ai, PMC of Apache RocketMQ, aaronai@apache.org
Apache EventMesh (incubating)
Apache EventMesh is a fully serverless platform used to build distributed event-driven applications.
Website: https://eventmesh.apache.org
GitHub: https://github.com/apache/incubator-eventmesh
Upstream Issue: https://github.com/apache/incubator-eventmesh/issues/3327
Background
Currently, EventMesh has good usability in microservice scenarios. However, EventMesh's support for Kubernetes is still relatively weak. We hope the community can contribute an integration of EventMesh with Kubernetes.
Task
1. Discuss your implementation idea with the mentors
2. Learn the details of the Apache EventMesh project
3. Integrate EventMesh with Kubernetes
Recommended Skills
1. Familiar with Java
2. Familiar with Kubernetes
Mentor
Eason Chen, PPMC of Apache EventMesh, https://github.com/qqeasonchen, chenguangsheng@apache.org
Mike Xue, PPMC of Apache EventMesh, https://github.com/xwm1992, mikexue@apache.org
Apache EventMesh (incubating)
Apache EventMesh is a fully serverless platform used to build distributed event-driven applications.
Website: https://eventmesh.apache.org
GitHub: https://github.com/apache/incubator-eventmesh
Upstream Issue: https://github.com/apache/incubator-eventmesh/issues/3494
Background
Through EventMesh's event bridge feature, we can connect data to heterogeneous data stores. We hope the community can optimize the current event bridge capability of EventMesh to realize data connections between different event stores.
Task
1. Discuss with the mentors what you need to do
2. Learn the details of the Apache EventMesh project
3. Verify the ability of different EventMesh cluster instances to synchronize data, sort out the corresponding verification step documents, and optimize the current EventMesh bridge features
Recommended Skills
1. Familiar with Java
2. Familiar with MQ is better
Mentor
Eason Chen, PPMC of Apache EventMesh, https://github.com/qqeasonchen, chenguangsheng@apache.org
Mike Xue, PPMC of Apache EventMesh, https://github.com/xwm1992, mikexue@apache.org
Apache EventMesh (incubating)
Apache EventMesh is a fully serverless platform used to build distributed event-driven applications.
Website: https://eventmesh.apache.org
GitHub: https://github.com/apache/incubator-eventmesh
Upstream Issue: https://github.com/apache/incubator-eventmesh/issues/3488
Background
We hope that the community can contribute to the maintenance of documents, including archiving the Chinese and English content of documents for different release versions, maintaining the official website documents, improving the project quick start documents, feature introductions, etc.
Task
1. Discuss with the mentors what you need to do
2. Learn the details of the Apache EventMesh project
3. Improve and supplement the documentation on GitHub, maintain the official website documents, and record EventMesh quick-start walkthroughs and feature demonstration videos
Recommended Skills
1. Familiar with Markdown
2. Familiar with Java/Go
Mentor
Eason Chen, PPMC of Apache EventMesh, https://github.com/qqeasonchen, chenguangsheng@apache.org
Mike Xue, PPMC of Apache EventMesh, https://github.com/xwm1992, mikexue@apache.org
Apache EventMesh (incubating)
Apache EventMesh is a fully serverless platform used to build distributed event-driven applications.
Website: https://eventmesh.apache.org
GitHub: https://github.com/apache/incubator-eventmesh
Upstream Issue: https://github.com/apache/incubator-eventmesh/issues/3495
Background
At present, eventmesh-admin provides a management interface for EventMesh storage, but it only implements the management function for RocketMQ, which needs to be further expanded. At the same time, it could provide a CLI for users to quickly start and experience EventMesh.
Task
1. Discuss with the mentors what you need to do
2. Learn the details of the Apache EventMesh project
3. Implement the management interface for other EventMesh storages
4. Support a CLI to quickly start EventMesh
Recommended Skills
1. Familiar with Java
2. Familiar with MQ is better
Mentor
Eason Chen, PPMC of Apache EventMesh, https://github.com/qqeasonchen, chenguangsheng@apache.org
Mike Xue, PPMC of Apache EventMesh, https://github.com/xwm1992, mikexue@apache.org
Apache EventMesh (incubating)
Apache EventMesh is a fully serverless platform used to build distributed event-driven applications.
Website: https://eventmesh.apache.org
GitHub: https://github.com/apache/incubator-eventmesh
Upstream Issue: https://github.com/apache/incubator-eventmesh/issues/3492
Background
Through EventMesh's source/sink connectors, we can connect data to heterogeneous data stores. We hope the community can provide source/sink connector capabilities, such as connecting RocketMQ data to different RocketMQ clusters or Kafka clusters.
Task
1. Discuss with the mentors what you need to do
2. Learn the details of the Apache EventMesh project
3. Implement one of the source/sink connectors based on the above background
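To illustrate the shape of such a connector (interface and class names here are hypothetical, not EventMesh's actual connector API), a source polls records from one store and a sink writes them to another:

```java
import java.util.ArrayList;
import java.util.List;

public class ConnectorSketch {
    /** Hypothetical source connector: polls records from an upstream store. */
    interface SourceConnector {
        List<String> poll();
    }

    /** Hypothetical sink connector: writes records to a downstream store. */
    interface SinkConnector {
        void put(List<String> records);
    }

    static class InMemorySource implements SourceConnector {
        private final List<String> data;
        InMemorySource(List<String> data) { this.data = data; }
        public List<String> poll() { return new ArrayList<>(data); }
    }

    static class InMemorySink implements SinkConnector {
        final List<String> received = new ArrayList<>();
        public void put(List<String> records) { received.addAll(records); }
    }

    /** One bridge pass: move everything the source offers into the sink. */
    static void bridge(SourceConnector source, SinkConnector sink) {
        sink.put(source.poll());
    }
}
```

A real connector pair would replace the in-memory stores with, say, a RocketMQ consumer on the source side and a Kafka producer on the sink side.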
Recommended Skills
1. Familiar with Java
2. Familiar with MQ is better
Mentor
Eason Chen, PPMC of Apache EventMesh, https://github.com/qqeasonchen, chenguangsheng@apache.org
Mike Xue, PPMC of Apache EventMesh, https://github.com/xwm1992, mikexue@apache.org
Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.
StreamPipes has grown significantly over the past few years, with new features and contributors joining the project. However, as the project continues to evolve, e2e test coverage must also be improved to ensure that all features remain functional. Modern frameworks, such as Cypress, make it quite easy and fun to automatically test even complex application functionalities. As StreamPipes approaches its 1.0 release, it is important to improve e2e testing to ensure the robustness of the project and its use in real-world scenarios.
Do not create any account on behalf of Apache StreamPipes in Cypress, or use the name of Apache StreamPipes for any account creation. Your mentor will take care of it.
References
You can find our corresponding issue on GitHub here
Name: Philipp Zehnder
email: zehnder[at]apache.org
community: dev[at]streampipes.apache.org
website: https://streampipes.apache.org/
Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.
StreamPipes has grown significantly throughout recent years. We were able to introduce a lot of new features and attracted both users and contributors. Putting the cherry on the cake, we graduated as an Apache top-level project in December 2022. We will of course continue developing new features and never rest to make StreamPipes even more amazing.
StreamPipes really shines when connecting Industrial IoT data. Such data sources typically originate from machine controllers called PLCs (e.g., Siemens S7). But there are also newer protocols such as OPC-UA which allow browsing the available data within the controller. Our goal is to make connectivity of industrial data sources a matter of minutes.
Currently, data sources can be connected using the built-in module `StreamPipes Connect` from the UI. We provide a set of adapters for popular protocols that can be customized, e.g., connection details can be added.
To make it even easier to connect industrial data sources with StreamPipes, we plan to add an OPC-UA browser. This will be part of the entry page of StreamPipes Connect and should allow users to enter the connection details of an existing OPC-UA server. A new view in the UI then shows the available data nodes from the server, their status, and their current values. Users should be able to select the values that should be part of a new adapter. Afterwards, a new adapter can be created by reusing the current workflow for creating an OPC-UA data source.
This is a really cool project for participants interested in full-stack development who would like to get a deeper understanding of industrial IoT protocols. Have fun!
Anyways, the most important relevant skill is motivation and readiness to learn during the project!
Github issue can be found here: https://github.com/apache/streampipes/issues/1390
Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.
StreamPipes has grown significantly throughout recent years. We were able to introduce a lot of new features and attracted both users and contributors. Putting the cherry on the cake, we graduated as an Apache top-level project in December 2022. We will of course continue developing new features and never rest to make StreamPipes even more amazing. However, as we steam at full speed towards our `1.0` release, we also want the project to become more mature. Therefore, we want to address one of our Achilles' heels: our test coverage.
Don't worry, this issue is not about implementing myriads of tests for our code base. As a first step, we would like to make the status quo transparent. That means we want to measure our code coverage consistently across the whole codebase (backend, UI, Python library) and report the coverage to Codecov. Furthermore, to benchmark ourselves and motivate us to provide tests with every contribution, we would like to lock in the current test coverage as a lower threshold that we always want to meet (meaning CI builds fail in case coverage drops, etc.). Over time we can then increase the required coverage level step by step.
Beyond monitoring our test coverage, we also want to invest in better and cleaner code. Therefore, we would like to adopt SonarCloud for our repository.
Do not create any account on behalf of Apache StreamPipes in SonarCloud or CodeCov, or use the name of Apache StreamPipes for any account creation. Your mentor will take care of it.
References
You can find our corresponding issue on GitHub here
Name: Tim Bossenmaier
email: bossenti[at]apache.org
community: dev[at]streampipes.apache.org
website: https://streampipes.apache.org/
Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.
Page: https://shardingsphere.apache.org/
Github: https://github.com/apache/shardingsphere
There is a proposal about the new CRDs Cluster and ComputeNode as follows:
Currently we are trying to promote ComputeNode as the major CRD to represent a specific ShardingSphere Proxy deployment, and we plan to use Cluster to indicate a specific ShardingSphere Proxy cluster.
This issue is to enhance ComputeNode reconciliation availability. The specific case list is as follows.
ComputeNode IT - https://github.com/apache/shardingsphere-on-cloud/blob/main/shardingsphere-operator/pkg/reconcile/computenode/compute_node_test.go
Liyao Miao, Committer of Apache ShardingSphere, miaoliyao@apache.org
Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org
Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.
Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere
ShardingSphere provides two adapters: ShardingSphere-JDBC and ShardingSphere-Proxy.
Now, ShardingSphere uses logback for logging, but consider the following situations:
Why doesn't a logging facade suffice? Because ShardingSphere provides users with cluster-wide logging configuration (such as changing the log level online), which requires dynamic construction of loggers; this cannot be achieved with a logging facade alone.
1. Design and implement logging SPI to support multiple logging frameworks (such as logback and log4j2)
2. Allow users to choose which logging framework to use through the logging rule
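A minimal sketch of what such an SPI could look like (all names here are hypothetical, not ShardingSphere's actual API; the real implementation would plug into ShardingSphere's own SPI loader and build loggers through each framework's runtime APIs):

```java
import java.util.HashMap;
import java.util.Map;

public class LoggingSpiSketch {
    /** Hypothetical SPI: each logging framework ships one provider implementation. */
    interface LoggingProvider {
        String type(); // e.g. "logback" or "log4j2"
        String createLogger(String name, String level); // stands in for dynamic logger construction
    }

    static class LogbackProvider implements LoggingProvider {
        public String type() { return "logback"; }
        public String createLogger(String name, String level) {
            // Real code would use LoggerContext APIs to build the logger at runtime.
            return "logback:" + name + "@" + level;
        }
    }

    static class Log4j2Provider implements LoggingProvider {
        public String type() { return "log4j2"; }
        public String createLogger(String name, String level) {
            return "log4j2:" + name + "@" + level;
        }
    }

    // A plain registry stands in for the SPI loader in this sketch.
    static final Map<String, LoggingProvider> REGISTRY = new HashMap<>();
    static {
        for (LoggingProvider p : new LoggingProvider[] {new LogbackProvider(), new Log4j2Provider()}) {
            REGISTRY.put(p.type(), p);
        }
    }

    /** The logging rule names the framework; the matching provider builds loggers dynamically. */
    static String buildLogger(String ruleType, String name, String level) {
        return REGISTRY.get(ruleType).createLogger(name, level);
    }
}
```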
1. Master the Java language
2. Basic knowledge of logback and log4j2
3. Maven
Longtao Jiang, Committer of Apache ShardingSphere, jianglongtao@apache.org
Trista Pan, PMC of Apache ShardingSphere, panjuan@apache.org
Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.
Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere
ShardingSphere has designed its own metadata database to simulate metadata queries that support various databases.
More details:
https://github.com/apache/shardingsphere/issues/21268
https://github.com/apache/shardingsphere/issues/22052
Notice, these issues can be a good example.
https://github.com/apache/shardingsphere/pull/22053
https://github.com/apache/shardingsphere/pull/22057/
https://github.com/apache/shardingsphere/pull/22166/
https://github.com/apache/shardingsphere/pull/22182
Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org
Zhengqiang Duan, PMC of Apache ShardingSphere, duanzhengqiang@apache.org
Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.
Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere
The community recently added a CDC (change data capture) feature. After logging in, the change feed is published over the established network connection, where it can then be consumed.
Since Kafka is a popular distributed event streaming platform, it would be useful to import the change feed into Kafka for later processing.
Hongsheng Zhong, PMC of Apache ShardingSphere, zhonghongsheng@apache.org
Xinze Guo, Committer of Apache ShardingSphere, azexin@apache.org
Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layer, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by underlying databases fragmentation.
Page: https://shardingsphere.apache.org/
Github: https://github.com/apache/shardingsphere
Currently we are trying to promote StorageNode as the major CRD to represent a set of storage units for ShardingSphere.
The elementary task is that the storage node controller could manage the lifecycle of a set of storage units, like PostgreSQL, in kubernetes.
We don't hope to create another wheel like pg-operator. So consider using a predefined parameter group to generate the target CRD.
Relevant Skills
1. Master Go language, Ginkgo test framework
2. Have a basic understanding of Apache ShardingSphere Concepts and DistSQL
DistSQL Converter - https://github.com/apache/shardingsphere-on-cloud/blob/main/shardingsphere-operator/pkg/distsql/converter.go, etc.
A struct defined as below:
```golang
type EncryptRule struct{}
func (t EncryptRule) ToDistSQL() string {}
```
Invoking ToDistSQL() will generate a DistSQL statement for an EncryptRule, such as:
```SQL
CREATE ENCRYPT RULE t_encrypt (....
```
Mentor
Liyao Miao, Committer of Apache ShardingSphere, miaoliyao@apache.org
Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org
Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layers, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by the fragmentation of underlying databases.
Page: https://shardingsphere.apache.org/
Github: https://github.com/apache/shardingsphere
There is a proposal about the new CRDs Cluster and ComputeNode as below:
Currently we are promoting StorageNode as the major CRD to represent a set of storage units for ShardingSphere.
The primary task is for the storage node controller to manage the lifecycle of a set of storage units, such as PostgreSQL, in Kubernetes.
We do not want to reinvent the wheel like pg-operator, so consider using a predefined parameter group to generate the target CRD.
1. Master Go language, Ginkgo test framework
2. Have a basic understanding of Apache ShardingSphere Concepts
3. Be familiar with Kubernetes Operator, kubebuilder framework
StorageNode Controller - https://github.com/apache/shardingsphere-on-cloud/blob/main/shardingsphere-operator/pkg/controllers/storagenode_controller.go
Liyao Miao, Committer of Apache ShardingSphere, miaoliyao@apache.org
Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org
Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layers, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by the fragmentation of underlying databases.
Page: https://shardingsphere.apache.org/
Github: https://github.com/apache/shardingsphere
There is a proposal about the background of ChaosEngineering as below:
Introduce ChaosEngineering for ShardingSphere #32
And we also proposed a generic controller for ShardingSphereChaos as below:
[GSoC 2023] Introduce New CRD ShardingSphereChaos #272
The ShardingSphereChaos controller targets different kinds of chaos tests; JVMChaos is an important one.
Write several scripts to implement different JVMChaos for the main features of ShardingSphere. The specific case list is as follows.
JVMChaos Scripts - https://github.com/apache/shardingsphere-on-cloud/chaos/jvmchaos/scripts/
Mentor
Liyao Miao, Committer of Apache ShardingSphere, miaoliyao@apache.org
Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org
Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layers, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by the fragmentation of underlying databases.
Page: https://shardingsphere.apache.org/
Github: https://github.com/apache/shardingsphere
There is a proposal about the background of ChaosEngineering as below:
The ShardingSphereChaos controller targets different kinds of chaos tests.
Propose a generic controller for ShardingSphereChaos, which reconciles the ShardingSphereChaos CRD and prepares, executes, and verifies tests.
1. Master Go language, Ginkgo test framework
2. Have a deep understanding of Apache ShardingSphere concepts and practices.
3. Be familiar with the Kubernetes operator pattern and the kubebuilder framework
ShardingSphereChaos Controller - https://github.com/apache/shardingsphere-on-cloud/shardingsphere-operator/pkg/controllers/chaos_controller.go, etc.
Liyao Miao, Committer of Apache ShardingSphere, miaoliyao@apache.org
Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org
Apache ShardingSphere is positioned as a Database Plus, and aims at building a standard layer and ecosystem above heterogeneous databases. It focuses on how to reuse existing databases and their respective upper layers, rather than creating a new database. The goal is to minimize or eliminate the challenges caused by the fragmentation of underlying databases.
Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere
The ShardingSphere SQL federation engine provides support for complex SQL statements, and it can well support cross-database join queries, subqueries, aggregation queries and other statements. An important part of SQL federation engine is to convert the SQL statement parsed by ShardingSphere into SqlNode, so that Calcite can be used to implement SQL optimization and federated query.
This issue is to solve the MySQL exceptions that occur during SQLNodeConverterEngine conversion. The specific case list is as follows.
You need to compare the difference between actual and expected, and then correct the logic in SQLNodeConverterEngine so that actual is consistent with expected.
After you make changes, remember to add the case to SUPPORTED_SQL_CASE_IDS to ensure it can be tested.
Note: this pull request can serve as a good example.
https://github.com/apache/shardingsphere/pull/14492
1. Master the Java language
2. Have a basic understanding of ANTLR g4 files
3. Be familiar with MySQL and Calcite SqlNode
SQLNodeConverterEngineIT
Zhengqiang Duan, PMC of Apache ShardingSphere, duanzhengqiang@apache.org
Chuxin Chen, Committer of Apache ShardingSphere, tuichenchuxin@apache.org
Trista Pan, PMC of Apache ShardingSphere, panjuan@apache.org
Background
Apache ShenYu is a Java native API Gateway for service proxy, protocol conversion and API governance. Currently, ShenYu has good usability and performance in microservice scenarios. However, ShenYu's support for Kubernetes is still relatively weak.
Tasks
1. Support the registration of microservices deployed in K8s Pod to shenyu-admin and use K8s as the register center.
2. Discuss with mentors, and complete the requirements design and technical design of Shenyu K8s Register Center.
3. Complete the initial version of Shenyu K8s Register Center.
4. Complete the CI test of Shenyu K8s Register Center, verify the correctness of the code.
5. Write the necessary documentation, deployment guides, and instructions for users to connect microservices running inside a K8s Pod to ShenYu.
Relevant Skills
1. Know the use of Apache ShenYu, especially the register center
2. Familiar with Java and Golang
3. Familiar with Kubernetes and can use Java or Golang to develop
Apache ShenYu is a Java native API Gateway for service proxy, protocol conversion and API governance. Currently, ShenYu has good usability and performance in microservice scenarios. However, ShenYu's support for Kubernetes is still relatively weak.
1. Discuss with mentors, and complete the requirements design and technical design of shenyu-ingress-controller.
2. Complete the initial version of shenyu-ingress-controller, implement the reconcile of k8s ingress api, and make ShenYu as the ingress gateway of k8s.
3. Complete the ci test of shenyu-ingress-controller, verify the correctness of the code.
1. Know the use of Apache ShenYu
2. Familiar with Java and Golang
3. Familiar with Kubernetes and can use Java or Golang to develop Kubernetes Controllers
Issues : https://github.com/apache/shenyu/issues/4438
website : https://shenyu.apache.org/
ShenYu is a native API gateway for service proxy, protocol translation and API governance, but it lacks end-to-end (e2e) tests.
Relevant skills:
1. Understand the architecture of ShenYu
2. Understand SpringCloud microservices and the ShenYu SpringCloud proxy plugin.
3. Understand the ShenYu e2e framework and architecture.
How to code
1. Please refer to org.apache.shenyu.e2e.testcase.plugin.DividePluginCases
How to test
1. Start shenyu-admin in Docker
2. Start shenyu-bootstrap in Docker
3. Run the test case org.apache.shenyu.e2e.testcase.plugin.PluginsTest#testDivide
Tasks
1. Develop e2e tests for the SpringCloud plugin.
2. Write ShenYu e2e SpringCloud plugin documentation on the shenyu-website.
3. Refactor the existing plugin test cases.
Links:
website: https://shenyu.apache.org/
issues: https://github.com/apache/shenyu/issues/4474
Apache ShenYu is a Java native API Gateway for service proxy, protocol conversion and API governance. Currently, ShenYu is well extensible in the Java language. However, ShenYu's support for multiple languages is still relatively weak.
The wasm bytecode is designed to be encoded in a size- and load-time-efficient binary format. WebAssembly aims to execute at native speed by taking advantage of common hardware capabilities available on a wide range of platforms.
The goal of WasmPlugin is to be able to run wasm bytecode (wasmer-java is a good choice; if you find a better choice, please discuss with me), so that ShenYu plugins can be written in other languages (such as Rust, Golang, or C++) as long as they can be compiled into wasm bytecode.
More documents on wasm and WASI are as follows:
https://github.com/WebAssembly/design
https://github.com/WebAssembly/WASI
Know the use of Apache ShenYu, especially the plugins
Familiar with Java and another language that can be compiled into wasm bytecode
1. Develop shenyu-wasm-plugin.
2. Write integration tests for shenyu-wasm-plugin.
3. Write wasm plugin documentation on the shenyu-website.
Links:
website: https://shenyu.apache.org/
ShenYu is a native API gateway for service proxy, protocol translation and API governance. It can manage and maintain APIs through shenyu-admin, and supports internationalization in Chinese and English. Unfortunately, shenyu-admin is only internationalized on the front end; the message prompts returned by the back-end interface are still in English. Therefore, we need to implement internationalization support for the back-end interface. This will lay a good foundation for ShenYu to move towards support for more languages.
Relevant classes: java.util.Locale, org.springframework.context.MessageSource, org.springframework.context.support.ResourceBundleMessageSource
### zh request example
```
POST http://localhost:9095/plugin
Content-Type: application/json
Location: cn-zh
X-Access-Token: xxx

{ "name": "test-create-plugin", "role": "test-create-plugin", "enabled": true, "sort": 100 }
```
Response:
```
{ "code": 600, "message": "未登录" }
```
### en request example
```
POST http://localhost:9095/plugin
Content-Type: application/json
Location: en
X-Access-Token: xxx

{ "name": "test-create-plugin", "role": "test-create-plugin", "enabled": true, "sort": 100 }
```
Response:
```
{ "code": 600, "message": "token is error" }
```
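A minimal plain-Java sketch of the locale-based resolution described above (illustrative only: a real implementation would use Spring's MessageSource backed by resource bundles, and the message code `login.required` is hypothetical):

```java
import java.util.Locale;
import java.util.Map;

// Sketch of locale-aware message resolution, mimicking what Spring's
// ResourceBundleMessageSource would do for shenyu-admin responses.
public class MessageResolver {
    // In Spring these entries would live in messages_zh.properties / messages_en.properties.
    private static final Map<String, Map<String, String>> BUNDLES = Map.of(
            "zh", Map.of("login.required", "未登录"),
            "en", Map.of("login.required", "token is error"));

    public static String resolve(String code, Locale locale) {
        // fall back to English when the requested language has no bundle
        Map<String, String> bundle =
                BUNDLES.getOrDefault(locale.getLanguage(), BUNDLES.get("en"));
        // fall back to the raw code when the message is missing
        return bundle.getOrDefault(code, code);
    }
}
```

The controller layer would read the `Location` header, map it to a `Locale`, and pass it through to the resolver.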
Background
At present, ShenYu contributors need to manually check, one by one, whether each license is correct when releasing a version.
Tasks
Relevant Skills
Background
Apache Traffic Control is a Content Delivery Network (CDN) control plane for large scale content distribution.
Traffic Control currently requires Apache Traffic Server as the underlying cache. Help us expand the scope by integrating with the very popular Varnish Cache.
There are multiple aspects to this project:
Skills:
Apache Doris
Apache Doris is a real-time analytical database based on MPP architecture. As a unified platform that supports multiple data processing scenarios, it ensures high performance for low-latency and high-throughput queries, allows for easy federated queries on data lakes, and supports various data ingestion methods.
Page: https://doris.apache.org
Github: https://github.com/apache/doris
Apache Doris accelerates high-concurrency queries utilizing page cache, where the decompressed data is stored.
Currently, the page cache in Apache Doris uses a simple LRU algorithm, which reveals a few problems:
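For reference, the plain LRU policy mentioned above can be sketched in a few lines with an access-ordered LinkedHashMap (illustrative Java only; the actual Doris page cache is implemented in C++):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch: reads move entries to the tail, and the
// eldest (least recently used) entry is evicted once capacity is exceeded.
public class LruPageCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruPageCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

One well-known weakness of plain LRU visible even in this sketch: a single large scan inserts many one-off pages and can evict the entire hot working set, which is what motivates smarter policies (e.g., segmented LRU or LRU-K).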
Page: https://doris.apache.org
Github: https://github.com/apache/doris
Apache Doris
Apache Doris is a real-time analytical database based on MPP architecture. As a unified platform that supports multiple data processing scenarios, it ensures high performance for low-latency and high-throughput queries, allows for easy federated queries on data lakes, and supports various data ingestion methods.
Page: https://doris.apache.org
Github: https://github.com/apache/doris
Apache Doris supports acceleration of queries on external data sources to meet users' needs for federated queries and analysis.
Currently, Apache Doris supports multiple external catalogs including those from Hive, Iceberg, Hudi, and JDBC. Developers can connect more data sources to Apache Doris based on a unified framework.
Task
Phase One:
Phase Two:
Page: https://doris.apache.org
Github: https://github.com/apache/doris
Apache Doris
Apache Doris is a real-time analytical database based on MPP architecture. As a unified platform that supports multiple data processing scenarios, it ensures high performance for low-latency and high-throughput queries, allows for easy federated queries on data lakes, and supports various data ingestion methods.
Page: https://doris.apache.org
Github: https://github.com/apache/doris
In Apache Doris, dictionary encoding is performed during data writing and compaction. Dictionary encoding is applied to string data types by default. The dictionary size of a column for one segment is 1M at most. Dictionary encoding accelerates string handling during queries, for example by converting strings into INTs.
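As an illustration of the technique (not the actual Doris C++ implementation), dictionary-encoding a string column assigns each distinct string an int code, so the column is stored as ints plus a small dictionary:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of dictionary encoding for a string column.
public class DictEncoder {
    private final Map<String, Integer> dict = new HashMap<>();
    private final List<String> values = new ArrayList<>();

    // Returns the int code for a string, assigning a new code on first sight.
    public int encode(String s) {
        return dict.computeIfAbsent(s, k -> {
            values.add(k);
            return values.size() - 1;
        });
    }

    // Maps a code back to its string for query results.
    public String decode(int code) {
        return values.get(code);
    }

    public int dictSize() {
        return values.size();
    }
}
```

Comparisons and grouping can then operate on the int codes instead of the strings, which is where the query speedup comes from.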
Page: https://doris.apache.org
Github: https://github.com/apache/doris
SkyWalking BanyanDB is an observability database that aims to ingest, analyze and store metrics, tracing and logging data.
[1]: EXPLAIN in MySQL
SkyWalking BanyanDB is an observability database that aims to ingest, analyze and store metrics, tracing and logging data.
Currently the deployment methods for SkyWalking are limited: we only have a Helm Chart for users deploying to Kubernetes, and users not on Kubernetes have to do all the housekeeping themselves to set up SkyWalking on, for example, VMs.
This issue aims to add a Terraform provider so that users can conveniently spin up a cluster for demonstration or testing. We should then evolve the provider to allow users to customize it as needed, so that eventually it can be used in production environments.
In this task, we will mainly focus on support for AWS. In the Terraform provider, users provide their access key / secret key, and the provider does the rest: create VMs, create the database/OpenSearch or RDS, download the SkyWalking tarballs, configure SkyWalking, start the SkyWalking components (OAP/UI), create public IPs / domain names, etc.
Currently SkyWalking OAP is bundled as a tarball when releasing, and its start time is long. We are looking for a way to distribute the binary executable more conveniently and to speed up bootstrap time. We found that GraalVM is a good fit: not only can it solve the two aforementioned points, it also brings the benefit that we could rewrite our LAL or even MAL system in the future with a more secure and isolated method, wasm, which GraalVM supports too!
So this task is to adjust OAP, build it with GraalVM, and make all tests in OAP pass.
SkyWalking BanyanDB is an observability database that aims to ingest, analyze and store metrics, tracing and logging data.
The BanyanDB UI is a web interface provided by the BanyanDB server. It is developed with Vue3 and Vite3.
The UI should have a user-friendly Overview page.
The Overview page must display a list of nodes running in a cluster.
For each node in the list, the following information must be shown:
The web app must automatically refresh the node data at a configurable interval to show the most recent information.
Apache SkyWalking is an application performance monitoring tool for distributed systems, especially designed for microservices, cloud native and container-based (Kubernetes) architectures. This year we will proceed with the log clustering implementation with a revised architecture, and this task will require the student to focus on algorithm optimization for the clustering technique.
Apache SkyWalking is an application performance monitoring tool for distributed systems, especially designed for microservices, cloud native and container-based (Kubernetes) architectures. This year we will proceed with the log clustering implementation with a revised architecture, and this task will require the student to focus on Flink and its integration with SkyWalking OAP.
Apache SkyWalking is an application performance monitoring tool for distributed systems, especially designed for microservices, cloud native and container-based (Kubernetes) architectures. This task is about enhancing Python agent performance; the tracking issue can be seen here: https://github.com/apache/skywalking/issues/10408
Today, you can do all sorts of Machine Learning using Apache Beam (https://beam.apache.org/documentation/ml/overview/).
Many of our users, however, have a hard time getting started with ML and understanding how Beam can be applied to their day to day work. The goal of this project is to build out a series of Beam pipelines as Jupyter Notebooks demonstrating real world ML use cases, from NLP to image recognition to using large language models. As you go, there may be bugs or friction points as well which will provide opportunities to contribute back to Beam's core ML libraries.
Mentor for this will be Danny McCormick
There is a community effort to build a Beam runner to run Beam pipelines on top of Ray: https://github.com/ray-project/ray_beam_runner/
This involves pushing that project forward. It will require writing lots of Python code, and specifically going through the list of issues (https://github.com/ray-project/ray_beam_runner/issues) and solving as many of them as possible to make sure the runner is compliant.
Good resource docs:
This project is large.
Beam has an experimental, ongoing implementation for a Rust SDK.
This project involves advancing that implementation and making sure it's compliant with Beam standards.
Good resource materials:
This project is large.
Beam library developers and Beam users would appreciate this : )
This project involves prototyping a few different solutions, so it will be large.
Background
The Apache Teaclave (incubating) project is a cutting-edge solution for confidential computing, providing Function-as-a-Service (FaaS) capabilities that enable the decoupling of data and function providers. Despite its impressive functionality and security features, Teaclave currently lacks a mechanism for data providers to enforce policies on the data they upload. For example, data providers may wish to restrict access to certain columns of data for third-party function providers. Open Policy Agent (OPA) offers flexible control over service behavior and has been widely adopted by the cloud-native community. If Teaclave were to integrate OPA, data providers could apply policies to their data, enhancing Teaclave's functionality.
Another potential security loophole in Teaclave is the absence of a means to verify the expected behavior of a function. This gap leaves the system vulnerable to exploitation by malicious actors. Fortunately, most of Teaclave's interfaces can be reused, with the exception of the function uploading phase, which may require an overhaul to address this issue. Overall, the integration of OPA and the addition of a function verification mechanism would make Teaclave an even more robust and secure solution for confidential computing.
Benefits
If this proposal moves along smoothly, new functionality will be added to the Teaclave project that enables verifying that a function's behavior strictly conforms to a prescribed policy.
Deliverables
Timeline Estimation
Difficulty: Major
Project size: ~350 hour (large)
Potential mentors:
Mingshen Sun, Apache Teaclave (incubating) PPMC, mssun@apache.org
SeaTunnel is a very easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive data. It can synchronize tens of billions of records stably and efficiently every day, and is used in production by nearly 100 companies.
SeaTunnel provides a Connector API that does not depend on a specific execution engine. Connectors (Source, Transform, Sink) developed based on this API can run on many different engines, such as the currently supported SeaTunnel Zeta, Flink, and Spark. SeaTunnel supports more than 100 connectors, and the number is surging.
Website: https://seatunnel.apache.org/
GitHub: https://github.com/apache/incubator-seatunnel
To use SeaTunnel, a user currently needs to first create and write a config file that specifies the engine that runs the job, as well as engine-related parameters, and then define the Source, Transform, and Sink of the job. We hope to provide a client that allows users to define the engine, Source, Transform, and Sink information of the job directly in code, without having to start with a config file. The user can then submit the job definition through the client, and SeaTunnel will run the job. After the job is submitted, the user can obtain the status of the running job through the client. For jobs that are already running, users can use this client to manage them, such as stopping or pausing jobs, and so on.
1. Discuss with the mentors what you need to do
2. Learn the details of the Apache SeaTunnel project
3. Discuss and complete design and development
This is a project to implement a tool for PMC task automation.
This is a large project.
Mentor will be aizhamal.
Github issue: https://github.com/apache/cloudstack/issues/2872
ConfigDrive / cloud-init supports a network_data.json file which can contain network information for a VM.
By providing the network information using ConfigDrive to a VM we can eliminate the need for DHCP and thus the Virtual Router in some use-cases.
An example JSON file:
```json
{
  "links": [
    { "ethernet_mac_address": "52:54:00:0d:bf:93", "id": "eth0", "mtu": 1500, "type": "phy" }
  ],
  "networks": [
    {
      "id": "eth0",
      "ip_address": "192.168.200.200",
      "link": "eth0",
      "netmask": "255.255.255.0",
      "network_id": "dacd568d-5be6-4786-91fe-750c374b78b4",
      "routes": [
        { "gateway": "192.168.200.1", "netmask": "0.0.0.0", "network": "0.0.0.0" }
      ],
      "type": "ipv4"
    },
    {
      "id": "eth0",
      "ip_address": "2001:db8:100::1337",
      "link": "eth0",
      "netmask": "64",
      "network_id": "dacd568d-5be6-4786-91fe-750c374b78b4",
      "routes": [
        { "gateway": "2001:db8:100::1", "netmask": "0", "network": "::" }
      ],
      "type": "ipv6"
    }
  ],
  "services": [
    { "address": "8.8.8.8", "type": "dns" }
  ]
}
```
In Basic Networking and Advanced Networking zones which are using a shared network you wouldn't require a VR anymore.
Github issue: https://github.com/apache/cloudstack/issues/6949
Github issue: https://github.com/apache/cloudstack/issues/6934
Please add a button to test the LDAPS connection, or a button to list some users.
Github issue: https://github.com/apache/cloudstack/issues/4482
NFS Primary Storage mounts are handled by libvirt.
Currently libvirt defaults to NFS version 3 when mounting while it does support NFS version 4 if provided in the XML definition: https://libvirt.org/formatstorage.html#StoragePoolSource
```xml
<source>
  <host name='localhost'/>
  <dir path='/var/lib/libvirt/images'/>
  <format type='nfs'/>
  <protocol ver='4'/>
</source>
```
Maybe pass the argument 'nfsvers' to the URL provided to the Management Server and then pass this down to the Hypervisors which generate the XML for libvirt.
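A sketch of that idea in Java (illustrative only, not actual CloudStack code): extract a hypothetical `nfsvers` query parameter from the primary storage URL and emit the corresponding libvirt `<source>` XML, following the libvirt storage pool format shown above.

```java
// Builds the libvirt <source> XML for an NFS pool; emits <protocol ver='...'/>
// only when the storage URL carries an 'nfsvers' query parameter.
public class NfsPoolSource {
    public static String buildSourceXml(String host, String path, String url) {
        String ver = null;
        int idx = url.indexOf("nfsvers=");
        if (idx >= 0) {
            int end = url.indexOf('&', idx);
            ver = url.substring(idx + "nfsvers=".length(), end < 0 ? url.length() : end);
        }
        StringBuilder xml = new StringBuilder("<source>\n");
        xml.append("  <host name='").append(host).append("'/>\n");
        xml.append("  <dir path='").append(path).append("'/>\n");
        xml.append("  <format type='nfs'/>\n");
        if (ver != null) {
            xml.append("  <protocol ver='").append(ver).append("'/>\n");
        }
        xml.append("</source>");
        return xml.toString();
    }
}
```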
Github issue: https://github.com/apache/cloudstack/issues/6637
The Weave project is looking for maintainers; it may be worth exploring which CNI is widely used and standard/stable for the CKS use-case.
Github issue: https://github.com/apache/cloudstack/issues/3141
Add a new global option to enable Let's Encrypt on the console proxy, plus a Let's Encrypt domain name option for automatic SSL renewal.
Github issue: https://github.com/apache/cloudstack/issues/3065
Extend the Direct Download functionality to work with Ceph storage
Github issue: https://github.com/apache/cloudstack/issues/7142
Description:
With regards to IP info reporting, CloudStack relies entirely on its DHCP databases and so on. When this is not available (L2 networks etc.) no IP information is shown for a given VM.
I propose we introduce a mechanism for "IP autodetection" and try to discover the IPs used inside the machines by means of querying the hypervisors. For example with KVM/libvirt we can simply do something like this:
```
[root@fedora35 ~]# virsh domifaddr win2k22 --source agent
 Name                          MAC address          Protocol     Address
-------------------------------------------------------------------------------
 Ethernet                      52:54:00:7b:23:6a    ipv4         192.168.0.68/24
 Loopback Pseudo-Interface 1                        ipv6         ::1/128
```
Once we have this information we could display it in the UI/API as "Autodetected VM IPs" or something like that.
I imagine it's very similar for VMWare and XCP-ng.
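Turning the `virsh domifaddr` output above into structured data could be sketched like this (illustrative Java only; a real implementation would likely go through the libvirt API or guest agent rather than parsing CLI text, and would handle rows with missing columns):

```java
import java.util.ArrayList;
import java.util.List;

// Parses `virsh domifaddr --source agent` text output into
// {name, mac, protocol, address} rows for display as "autodetected VM IPs".
public class DomIfAddrParser {
    public static List<String[]> parse(String output) {
        List<String[]> result = new ArrayList<>();
        for (String line : output.split("\n")) {
            String trimmed = line.trim();
            // skip blank lines, the header line, and the separator line
            if (trimmed.isEmpty() || trimmed.startsWith("Name") || trimmed.startsWith("---")) {
                continue;
            }
            // columns are aligned with two or more spaces between them
            String[] cols = trimmed.split("\\s{2,}");
            if (cols.length == 4) {
                result.add(cols);
            }
        }
        return result;
    }
}
```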
Thank you
Github issue: https://github.com/apache/cloudstack/issues/7127
Description:
The Import-Export functionality is only allowed for the VMware hypervisor. The functionality is developed within a VM ingestion framework that allows extension to other hypervisors. The Import-Export functionality consists of a few APIs and the UI to interact with them:
The complexity on KVM lies in parsing the existing XML domains into different resources and mapping them in CloudStack to populate the database correctly.
The load of streaming jobs usually fluctuates according to the input rate or operations (e.g., windowing). Supporting automatic scaling could reduce the operational cost of running streaming applications, while minimizing the performance degradation that can be caused by bursty loads.
We can harness cloud resources such as VMs and serverless frameworks to acquire computing resources on demand. To realize automatic scaling, the following features should be implemented.
1) State migration: scaling jobs requires moving tasks (or partitioning a task into multiple ones). In this situation, the internal state of the task should be serialized/deserialized.
2) Input/output rerouting: if a task is moved to a new worker, the input and output of the task should be redirected.
3) Dynamic Executor or Task creation/deletion: Executors or Tasks can be dynamically created or deleted.
4) Scaling policy: a scaling policy that decides when and how to scale out/in should be implemented.
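As a sketch of item 4), a simple watermark-based scaling policy might look like the following (Java; the thresholds, class and enum names are hypothetical illustrations, not Nemo APIs):

```java
// Threshold-based scaling decision: scale out when observed utilization
// (input rate / processing capacity) is high, scale in when it is low.
public class ScalingPolicy {
    public enum Decision { SCALE_OUT, SCALE_IN, STAY }

    private final double highWatermark; // e.g. 0.8: scale out above 80% utilization
    private final double lowWatermark;  // e.g. 0.3: scale in below 30% utilization

    public ScalingPolicy(double highWatermark, double lowWatermark) {
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
    }

    public Decision decide(double inputRate, double processingCapacity) {
        double utilization = inputRate / processingCapacity;
        if (utilization > highWatermark) {
            return Decision.SCALE_OUT;
        }
        if (utilization < lowWatermark) {
            return Decision.SCALE_IN;
        }
        return Decision.STAY;
    }
}
```

A production policy would additionally smooth the metrics over a window and add a cooldown to avoid oscillation.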
This is an umbrella issue to keep track of the issues related to the dynamic task sizing feature on Nemo.
Dynamic task sizing needs to consider a workload and try to decide the optimal task size based on runtime metrics and characteristics. It should affect the parallelism and the partitioning, that is, how many partitions intermediate data should be divided/shuffled into, while effectively handling skew in the meanwhile.
We aim to handle the problem of throttled (heterogeneous) resources and skewed input data. In order to solve this problem, we suggest dynamic work stealing that can dynamically track task statuses and steal workloads among workers. To do this, we have the following action items:
Missing a deadline often has significant consequences for the business, and a simulator can contribute to other approaches for optimization.
So: implement a simulator for stream processing based on functional models.
There are some requirements:
The current SimulatedTaskExecutor is hardly usable, because it needs actual metrics to predict execution time. To increase its utility, we need a new model that predicts task-level execution time with statistical analysis.
Some of the related TODOs are as follows:
Currently, Nemo doesn't have a spill mechanism. This makes executors prone to memory problems such as OOM (Out Of Memory) or GC pressure when task data is large. For example, handling skewed shuffle data in Nemo results in OOM and executor failure, as all data has to be handled in memory.
We need to spill in-memory data to secondary storage when there is not enough memory in the executor.
There are some factors that can affect stage-group-level simulation, such as latency, the rate of skewed data, and the error rate of the executor. It is required to find a reasonable distribution for these factors, such as the normal distribution or the Landau distribution. This makes it possible to approximate the model with a small amount of data from actual runs.
In-memory caching and spilling are essential features in in-memory big data processing frameworks, and Nemo needs one.
If compile time identifies what data can be cached, the runtime requires logic to make this happen.
Implementation needs:
In stream processing, we have many methods, from primitive checkpoint-and-replay to fancier versions of reconfiguration and reinitiation of stream workloads. We aim to find the most effective and efficient way of reconfiguring stream workloads. Sub-issues are to be created later on.
Issues for making it easy to install and use Nemo on Google Dataproc.
I would like to propose a project idea suitable for GSoC and Outreachy, comprising the following subtasks.
1. Complete rest of unfinished work on ArangoDB module - https://issues.apache.org/jira/browse/GORA-650
2. Upgrade HBase driver - https://issues.apache.org/jira/browse/GORA-706
3. Upgrade Hive driver - https://issues.apache.org/jira/browse/GORA-707
We could scope the project by adding/removing subtasks based on the available capacity of the student and the project.
I would like to propose a project idea suitable for GSoC and Outreachy, comprising the following subtasks.
1. Complete rest of unfinished work on Geode module - https://issues.apache.org/jira/browse/GORA-698
2. Upgrade Hadoop version - https://issues.apache.org/jira/browse/GORA-537
We could scope the project by adding/removing subtasks based on the available capacity of the student and the project.
Lombok could help us not only to reduce a large amount of code, but also to fix a couple of inconsistencies in the code base:
The layered architecture of Fineract requires mapping between REST DTO classes and internal entity classes. The current code base contains various strategies to achieve this:
All of these approaches are very manual (and error-prone) and difficult to maintain. MapStruct can help here:
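For illustration, this plain-Java sketch shows the kind of boilerplate mapping code that MapStruct generates at compile time from a `@Mapper` interface; with MapStruct, only the interface would be written by hand (the `Client*` class and field names here are hypothetical, not actual Fineract classes):

```java
// Hand-written equivalent of a MapStruct-generated entity -> DTO mapper.
public class ClientMapper {
    public static class ClientEntity {
        public Long id;
        public String displayName;
    }

    public static class ClientDto {
        public Long id;
        public String displayName;
    }

    public static ClientDto toDto(ClientEntity entity) {
        ClientDto dto = new ClientDto();
        dto.id = entity.id;
        dto.displayName = entity.displayName;
        return dto;
    }
}
```

MapStruct removes exactly this field-by-field copying, and flags mismatched or missing fields at compile time instead of at runtime.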
Challenges:
For a large number of monolithic applications, problems such as performance will be encountered during large-scale deployment. For interface-oriented programming languages, Dubbo provides the capability of RPC remote calls, and we can help applications decouple through interfaces. Therefore, we can provide a deployer to help users realize the decoupling and splitting of microservices during deployment, and quickly provide performance optimization capabilities.
Since Dubbo runs on a distributed architecture, it naturally has the problem of difficult API interface definition management. It is often difficult for us to know which interface is running in the production environment. So we can provide an API-defined reporting platform, and even a management platform. This platform can automatically collect all APIs of the cluster, or can be directly defined by the user, and then unified distribution management is carried out through a mechanism similar to git and maven package management.
Dubbo currently supports a large number of Java language features through Hessian in the Java SDK, such as generics, interfaces, etc. These capabilities will not be compatible when calls cross systems. Therefore, Dubbo needs to provide the ability to inspect an interface definition and determine whether the interface published by the user can be described by native JSON.
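A minimal sketch of such a detection, using reflection over a candidate service method (illustrative only; the json-safe type set here is deliberately tiny, and a real check would also inspect generics, nested fields, and collection element types):

```java
import java.lang.reflect.Method;
import java.util.Set;

// Decides whether a method signature can be described by plain JSON,
// i.e. it uses none of the Java-only constructs such as interfaces.
public class JsonCompatibilityChecker {
    private static final Set<Class<?>> JSON_SAFE = Set.of(
            String.class, Integer.class, Long.class, Double.class, Boolean.class,
            int.class, long.class, double.class, boolean.class);

    public static boolean isJsonDescribable(Method method) {
        for (Class<?> param : method.getParameterTypes()) {
            if (!JSON_SAFE.contains(param)) {
                return false;
            }
        }
        return method.getReturnType() == void.class
                || JSON_SAFE.contains(method.getReturnType());
    }

    // Convenience lookup that hides the checked reflection exception.
    public static boolean check(Class<?> cls, String name, Class<?>... params) {
        try {
            return isJsonDescribable(cls.getMethod(name, params));
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```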
Dubbo currently provides only a very simple performance-testing tool, and for a framework as complex as Dubbo its functional coverage is very low. We urgently need a testing tool that covers multiple complex scenarios. We also hope this tool can run automatically, so that we can track Dubbo's current performance over time.
WebAssembly (abbreviated Wasm) is a binary instruction format for a stack-based virtual machine. For web clients, we can provide a Dubbo Wasm client so that front-end developers can easily initiate Dubbo requests from the browser, unifying the full Dubbo call chain.
This task needs to be implemented in a browser such as Chrome, initiating requests to a Dubbo backend.
At present, Dubbo provides RPC capabilities together with a large number of service-governance capabilities. As a result, Dubbo is a poor fit when some of its own components, or users who need an extremely lightweight footprint, only require the RPC capabilities.
Goal: provide a Dubbo RPC kernel so that users can program directly against service calls and focus on RPC.
As a development framework that sits close to its users, Dubbo provides many functional features (such as configurable timeouts, retries, etc.). We hope to give users a tool that scans which features are used, which are deprecated, which will be deprecated in the future, and so on. Based on this tool, we can offer users a better migration path.
Suggestion: consider an implementation based on static code scanning or a javaagent.
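As a toy illustration of the static-scanning route: grep source or config text for known feature keys and report which are in use. The key list here is invented for illustration, not Dubbo's real feature or deprecation table.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FeatureScanner {

    // Hypothetical feature keys; a real tool would load Dubbo's actual
    // feature/deprecation table.
    private static final List<String> FEATURE_KEYS =
            List.of("timeout", "retries", "loadbalance", "actives");

    /** Returns each known feature key mapped to its number of occurrences. */
    public static Map<String, Integer> scan(String source) {
        Map<String, Integer> hits = new LinkedHashMap<>();
        for (String key : FEATURE_KEYS) {
            Matcher m = Pattern.compile("\\b" + key + "\\s*=").matcher(source);
            int n = 0;
            while (m.find()) n++;
            if (n > 0) hits.put(key, n);
        }
        return hits;
    }
}
```

A javaagent-based variant would instead intercept configuration reads at runtime, which catches features enabled programmatically that a textual scan would miss.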
Dubbo supports gRPC-protocol-based communication through Triple. For this, Dubbo developed a compiler plugin for proto files based on jprotoc. Given the state of jprotoc's maintenance, the current Dubbo compiler does not run well on the latest protobuf versions. We therefore need to implement a new compiler, using gRPC's as a reference.
Dubbo is a development framework that sits close to its users, and many usage patterns can trigger exceptions that Dubbo handles. In such cases, users can usually only diagnose the problem through logs. We hope to provide an i18n localized log output tool that gives users a friendlier troubleshooting experience.
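One possible shape for such a tool, using only the JDK's `ResourceBundle` and `MessageFormat`; the message key and translations below are invented for illustration:

```java
import java.text.MessageFormat;
import java.util.ListResourceBundle;
import java.util.Locale;
import java.util.ResourceBundle;

public class I18nLog {

    /** Default (English) messages; a real tool would ship .properties files. */
    public static class Messages extends ListResourceBundle {
        protected Object[][] getContents() {
            return new Object[][] {
                { "registry.timeout", "Registry {0} timed out after {1} ms" }
            };
        }
    }

    /** Chinese translations, resolved by the standard bundle lookup. */
    public static class Messages_zh extends ListResourceBundle {
        protected Object[][] getContents() {
            return new Object[][] {
                { "registry.timeout", "注册中心 {0} 在 {1} 毫秒后超时" }
            };
        }
    }

    /** Looks up the key in the best-matching bundle and formats the arguments. */
    public static String msg(Locale locale, String key, Object... args) {
        ResourceBundle bundle =
                ResourceBundle.getBundle(I18nLog.class.getName() + "$Messages", locale);
        return MessageFormat.format(bundle.getString(key), args);
    }
}
```

The key idea is that Dubbo's internal code would log stable message keys, and the localized text is resolved at output time from the user's locale.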
As more and more projects are built with Gradle and benefit from it, Dubbo also hopes to migrate to Gradle. This task requires you to convert the dubbo project[1] into a Gradle project.
At present, the abstraction of client connections differs across Dubbo's protocols; for example, there is a large discrepancy between the connection abstractions of the dubbo and triple protocol clients. As a result, enhancing connection-related functionality in the client is complicated, implementations cannot be reused, and a lot of repetitive code must be written when extending the protocol layer.
Goal: reduce the complexity of the client side when extending protocols, and increase reuse of connection-related modules.
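A hypothetical sketch of what a unified connection abstraction could look like; all names here are invented for illustration, not Dubbo's actual API. Each protocol would supply only a factory, while shared behaviour lives in one base class:

```java
import java.io.IOException;

public class ConnectionSketch {

    /** Protocol-agnostic connection contract shared by all protocol clients. */
    public interface Connection extends AutoCloseable {
        void send(byte[] frame) throws IOException;
        boolean isActive();
        @Override void close();
    }

    /** One factory per protocol (e.g. "dubbo", "tri"); in a real design this
     *  would be registered through Dubbo's SPI mechanism. */
    public interface ConnectionFactory {
        String protocol();
        Connection connect(String host, int port) throws IOException;
    }

    /** Shared behaviour (state tracking, reconnect hooks, idle handling)
     *  implemented once instead of per protocol. */
    public abstract static class AbstractConnection implements Connection {
        private volatile boolean active = true;
        public boolean isActive() { return active; }
        public void close() { active = false; }
    }

    /** Trivial in-memory connection showing reuse of the shared logic. */
    public static class InMemoryConnection extends AbstractConnection {
        public final java.util.List<byte[]> sent = new java.util.ArrayList<>();
        public void send(byte[] frame) { sent.add(frame); }
    }
}
```

With this shape, a new protocol only implements framing and transport; lifecycle and connection-management enhancements land once in the shared base.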
Dubbo currently supports protobuf as a serialization method. Protobuf relies on proto files (an IDL) for code generation, but Dubbo currently lacks tooling for managing IDL files. Java users, for example, must feed proto files into every compilation, which is cumbersome; most are used to depending on jar packages instead.
Implement an IDL management platform that automatically generates dependency packages in various languages from IDL files and pushes them to the relevant package repositories.
As a development framework that sits close to its users, any problem Dubbo introduces while iterating can have a huge impact on users. Dubbo therefore needs a complete set of automated regression-testing tools.
Dubbo already has a set of testing tools based on docker-compose, but these cannot test compatibility in a Kubernetes environment. We also need a more reliable test-case construction system to ensure the test cases are sufficiently complete.
Including but not limited to programming patterns, configuration, APIs, documentation, and demos.
Dubbo Admin is Dubbo's console. Dubbo's observability is becoming more and more powerful, and we need to surface Dubbo's metrics directly in Dubbo Admin, and even offer users suggestions for fixing problems.
This task covers maintaining and developing the UI pages of the whole Dubbo Admin project.
WebAssembly (abbreviated Wasm) is a binary instruction format for a stack-based virtual machine. Many Dubbo capabilities support extensions, such as custom interceptors, routing, load balancing, etc. To allow a user's implementation to run on Dubbo's SDKs in multiple languages, we can use Wasm to make extensions cross-platform.
This topic requires a set of Wasm mechanisms for Dubbo covering both the Java and Go implementations, and supporting at least the Filter, Router, and Loadbalance extension points.
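A hypothetical host-side sketch of the extension contracts such a mechanism might expose in Java (the Go SDK would define the analogous contracts); all names are invented, and a plain Java implementation stands in for a Wasm-backed one:

```java
import java.util.List;

public class WasmExtensionSketch {

    /** Filter extension point: transforms a request payload. */
    public interface WasmFilter {
        byte[] onRequest(byte[] payload);
    }

    /** Router extension point: filters candidate providers per invocation. */
    public interface WasmRouter {
        List<String> route(List<String> providerUrls, String invocationKey);
    }

    /** Load-balance extension point: picks one provider by index. */
    public interface WasmLoadBalance {
        int select(List<String> providerUrls);
    }

    /** Plain-Java stand-in; a real adapter would load a Wasm module and call
     *  its exported function through a runtime embedded in the host SDK. */
    public static class RoundRobinLoadBalance implements WasmLoadBalance {
        private int next = 0;
        public int select(List<String> providerUrls) {
            return next++ % providerUrls.size();
        }
    }
}
```

Because both the Java and Go hosts would adapt the same Wasm exports to these interfaces, a user writes the extension once and runs it on either SDK.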
Dubbo currently supports the rest protocol based on HTTP/1 and the triple protocol based on HTTP/2, but these two HTTP-based protocols are implemented independently: neither can swap out its underlying implementation, and each is relatively costly to maintain.
To reduce maintenance costs, we hope to abstract over HTTP so that the underlying HTTP implementation is independent of the protocol and different protocols can reuse the related implementations.
HTTP/3 has been formalized as a standard in the last year. Dubbo, as a framework that supports publishing and invoking Web services, needs to support the HTTP/3 protocol.
This task extends the current rest protocol implementation to support publishing HTTP/3 services and calling HTTP/3 services.
Pixiu acts as a gateway, forwarding traffic to various services.
Pixiu needs to support communication between different applications in the browser, which requires WASM support in the browser; currently, only the HTTP protocol is supported.
This project needs to complete the communication protocols underneath WASM (gRPC is preferred):
1. Support gRPC protocol
2. Support dubbo protocol
For the front end calling gRPC, see https://github.com/grpc/grpc-web
In an istio mesh environment, a public dubbo/dubbo-go provider can be exposed outside the cluster over http/https through the istio ingress gateway. This requires the ingress gateway to convert http to the dubbo protocol, which is Pixiu's main scenario. This project needs to complete:
1. Customize Pixiu so it can act as an istio ingress gateway, proxying http/https requests and converting them into dubbo requests;
2. Gateway support for basic user-authentication methods.
Basic reference: https://istio.io/latest/blog/2019/custom-ingress-gateway/
https://cloud.ibm.com/docs/containers?topic=containers-istio-custom-gateway
Our Commons components use Commons Skin, a skin, or theme, for Apache Maven Site.
Our skin uses Bootstrap 2.x, but Bootstrap is already at its 5.x release, so we are missing several improvements (UI/UX, accessibility, browser compatibility) and JS/CSS bug fixes accumulated over the years.
Work is happening on Apache Maven Skins. Maybe we could adapt/use that one?
https://issues.apache.org/jira/browse/MSKINS-97
Gateway admins need periodic reports for various reporting and planning purposes.
Features Include:
The Airavata Django Portal [1] allows users to create, execute and monitor computational experiments. However, when a user wants to post-process or visualize the output of a computational experiment, they must download the output files and run tools available on their own computer or other systems. By integrating with JupyterHub, the Django Portal can give users an environment in which they can explore an experiment's output data and gain insights.
The main requirements are:
Integrate Apache Superset (https://superset.apache.org/) to visualize Airavata Catalogs (https://github.com/apache/airavata/tree/master/modules/registry)
Advanced Possibilities:
Explore Multi-tenanted JupyterHub
As discussed on the architecture mailing list [1] and summarized at [2], Airavata will need to develop a metascheduler. In the short term, a user request (demeler, gobert) is to have Airavata throttle jobs to resources. In the future, more informed scheduling strategies need to be integrated. Hopefully, the actual scheduling algorithms can be borrowed from third-party implementations.
[1] - http://markmail.org/message/tdae5y3togyq4duv
[2] - https://cwiki.apache.org/confluence/display/AIRAVATA/Airavata+Metascheduler
Complete all transports in MFT
Custos does not have the capability to efficiently back up and restore a live instance. This is essential for highly available services.
Using the SEAGrid Rich Client as an example, develop a native application based on ElectronJS to mimic the Airavata Django Portal.
Reference example - https://github.com/SciGaP/seagrid-rich-client