Contents
- Syncope
- Synapse
- StreamPipes
- Spatial Information Systems
- Solr
- Pulsar
- Expose the broker level message metadata to the client
- OODT
- James Server
- Fineract Cloud Native
- SkyWalking
- ShardingSphere
- IoTDB
- TrafficControl
- DolphinScheduler
- CouchDB
- CloudStack
- Clerezza
- Cassandra
- Script to autogenerate cassandra.yaml
- Add ability to disable schema changes, repairs, bootstraps, etc (during upgrades)
- Add ability to ttl snapshots
- Beam
- Apache NuttX
- Apache Nemo
- Apache Hudi
- Apache Fineract
- APISIX
...
Introduce event windowing to the StreamPipes core/sdk
Apache StreamPipes
Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.
Background
Currently, window logic must be defined individually per pipeline element. The whole windowing logic needs to be declared in the controller, and the runtime logic needs to be added separately based on the selected runtime wrapper (Java, Siddhi, Flink, etc.).
As many data processors benefit from using window functions (e.g., PEs such as Event Counter, Count Aggregation, Rate Limiter), windowing logic is often duplicated because it needs to be implemented for every new pipeline element. In addition, the feature set of supported window operators differs (and often depends on the developer), as it is unclear which windows and parameters should or can be offered.
Therefore, adding support for explicit window semantics to the SDK/Core would make implementing data processors and sinks using windows much easier and less error-prone.
Tasks
- Design and introduce new processor and controller classes for windowed event processors (e.g., WindowedDataProcessor) which handle the windowing logic internally and only expose higher-level methods to users (e.g., onCurrentEvent, onExpiredEvent); see the sketch after this list.
- Implement internal logic for a few window functions (e.g., TimeWindow, LengthWindow, TimeBatchWindow, LengthBatchWindow)
- Write a few sample pipeline-elements using your new API!
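To make the target API more concrete, here is a minimal sketch of what a windowed processor base class could look like, assuming a sliding time window. The class and hook names (WindowedDataProcessor, onCurrentEvent, onExpiredEvent) follow the examples given in the tasks above and are illustrative rather than the actual StreamPipes SDK:

```java
// Illustrative sketch only; not the StreamPipes SDK. Shows how windowing logic could be
// handled internally while exposing only high-level hooks to pipeline element developers.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

public abstract class WindowedDataProcessor {

    private final long windowSizeMillis;
    private final Deque<TimedEvent> window = new ArrayDeque<>();

    protected WindowedDataProcessor(long windowSizeMillis) {
        this.windowSizeMillis = windowSizeMillis;
    }

    /** Called by the runtime wrapper for every incoming event. */
    public final void process(Map<String, Object> event, long timestampMillis) {
        // Evict events that fell out of the sliding time window.
        while (!window.isEmpty()
                && timestampMillis - window.peekFirst().timestamp > windowSizeMillis) {
            onExpiredEvent(window.pollFirst().payload);
        }
        window.addLast(new TimedEvent(event, timestampMillis));
        onCurrentEvent(event);
    }

    /** Hook for the pipeline element developer: a new event entered the window. */
    protected abstract void onCurrentEvent(Map<String, Object> event);

    /** Hook for the pipeline element developer: an event left the window. */
    protected abstract void onExpiredEvent(Map<String, Object> event);

    private static final class TimedEvent {
        final Map<String, Object> payload;
        final long timestamp;

        TimedEvent(Map<String, Object> payload, long timestamp) {
            this.payload = payload;
            this.timestamp = timestamp;
        }
    }
}
```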
Relevant Skills
- Basic knowledge in StreamPipes core (cloning the repo, going through the codebase/documents would do).
- Basic knowledge of stream analytics window functions (this is not a must, but it's awesome if you know your way around analytics window functions).
- Some Java experience.
Learning Material
For StreamPipes:
- https://streampipes.apache.org/docs/
- https://streampipes.apache.org/media.html
- https://github.com/apache/incubator-streampipes
- https://github.com/apache/incubator-streampipes-extensions
For Streaming Analytics:
- https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
- https://www.mikulskibartosz.name/difference-between-tumbling-and-sliding-window/
- https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/developing-storm-applications/content/understanding_sliding_and_tumbling_windows.html
Mentor
- Grainier Perera (grainier [at] apache.org).
More powerful real-time visualizations for StreamPipes
Apache StreamPipes
Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.
Background
Currently, the live dashboard (implemented in Angular) offers an initial set of simple visualizations, such as line charts, gauges, tables and single values. More advanced visualizations, especially those relevant for condition monitoring tasks (e.g., monitoring sensor measurements from industrial machines), are not yet covered. Visualizations can be flexibly created by users, and there is an SDK that allows developers to express requirements (e.g., based on data type or semantic type) for visualizations to better guide users through the creation process.
Tasks
- Extend the set of real-time visualizations in StreamPipes, e.g., by integrating existing visualizations from Apache ECharts.
- Improve the existing dashboard, e.g., by introducing better filtering or more advanced customization options.
Relevant Skills
0. Don't be afraid! We'll guide you through your first steps with StreamPipes.
- Angular
- Basic knowledge of Apache ECharts
Mentor
Dominik Riemer, PPMC Apache StreamPipes (riemer@apache.org)
New Python Wrapper
Apache StreamPipes
Apache StreamPipes (incubating) is a self-service (Industrial) IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams. StreamPipes offers several modules including StreamPipes Connect to easily connect data from industrial IoT sources, the Pipeline Editor to quickly create processing pipelines and several visualization modules for live and historic data exploration. Under the hood, StreamPipes utilizes an event-driven microservice paradigm of standalone, so-called analytics microservices making the system easy to extend for individual needs.
Background
Current wrappers, such as the standalone ones (JVM, Siddhi) or the distributed one (Flink), already allow developing new processors in the given runtime environment. The idea is to extend the list of standalone runtime wrappers to also support pure Python processors. We already have a minimal working version; however, it is pretty inflexible and still relies on Java as a proxy to the pipeline management in the backend service, both for the model declaration in the setup phase (capabilities, requirements, static properties of a processor) and for the actual invocation in the execution phase (receiving the specific configuration from pipeline management when a pipeline is started). This issue is to track the status of the development.
Tasks
- Add API endpoints as an interface for registration/invocation (partly done)
- Port relevant model classes over to Python (declaration + invocation descriptions)
- Implement support for various transport protocols and transport formats
- Implement a dev-friendly alternative to the Java builder pattern for model declaration (see the illustrative sketch after this list)
- Implement overall runtime logic for Python wrapper
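The second-to-last task above refers to the Java builder pattern used for model declaration. The following self-contained sketch only illustrates that pattern; the ProcessorModel and Builder classes are invented for the example and are not the StreamPipes SDK, but they show the shape of declaration a more Python-friendly mechanism (e.g., decorators or plain dictionaries) would have to replace:

```java
// Invented classes for illustration of the builder-based declaration style; not the StreamPipes SDK.
import java.util.ArrayList;
import java.util.List;

final class ProcessorModel {
    final String id;
    final List<String> requiredProperties;
    final List<String> staticProperties;

    ProcessorModel(String id, List<String> requiredProperties, List<String> staticProperties) {
        this.id = id;
        this.requiredProperties = requiredProperties;
        this.staticProperties = staticProperties;
    }

    static Builder builder(String id) {
        return new Builder(id);
    }

    static final class Builder {
        private final String id;
        private final List<String> requiredProperties = new ArrayList<>();
        private final List<String> staticProperties = new ArrayList<>();

        private Builder(String id) {
            this.id = id;
        }

        Builder requiredProperty(String name) {   // capability/requirement of the input stream
            requiredProperties.add(name);
            return this;
        }

        Builder staticProperty(String name) {     // user-configurable parameter
            staticProperties.add(name);
            return this;
        }

        ProcessorModel build() {
            return new ProcessorModel(id, requiredProperties, staticProperties);
        }
    }
}

class DeclarationExample {
    public static void main(String[] args) {
        ProcessorModel model = ProcessorModel.builder("org.example.greeter")
                .requiredProperty("temperature")
                .staticProperty("threshold")
                .build();
        System.out.println(model.id + " requires " + model.requiredProperties);
    }
}
```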
Relevant Skills
0. Don't be afraid! We'll guide you through your first steps with StreamPipes.
- Excellent Python skills
- Excellent understanding of the stream processing paradigm, incl. message brokers such as Kafka, MQTT, etc.
- Good understanding of RESTful web services (HTTP, etc.)
- Basic Java skills to understand existing wrapper logic
Info
- SIP-02 to collect design decisions https://cwiki.apache.org/confluence/display/STREAMPIPES/SIP-02+Python+wrapper
- Current python runtime wrapper implementation: https://github.com/apache/incubator-streampipes/tree/dev/streampipes-wrapper-python
- POC example: https://github.com/apache/incubator-streampipes-examples/tree/dev/streampipes-pipeline-elements-examples-processors-jvm/src/main/java/org/apache/streampipes/pe/examples/jvm/python
Mentor
Patrick Wiener, PPMC Apache StreamPipes (wiener@apache.org)
Spatial Information Systems
Create metadata, CRS and tabular data editors in JavaFX
Create the foundation of a GUI application for Apache SIS based on JavaFX. This application should leverage the functionalities available in Apache SIS 0.8. In particular:
- Read metadata from files in various formats (currently ISO 19139, GeoTIFF, NetCDF, LANDSAT, GPX, Moving Features)
- Get Coordinate Reference System from a registry or from GML or WKT definitions and apply coordinate transformations.
- Show vector data in a tabular format.
Since SIS does not yet have a renderer engine, we can not yet show maps in the application. However the application should be designed with this goal in mind.
This project should create a metadata editor showing the ISO 19115 metadata. We should provide a simplified view with only the essential information, and an advanced view showing all information. The information to be shown should be customizable. The user should be able to edit the metadata and save it in ISO 19139 format.
The project should also create the necessary widgets for showing a Coordinate Reference System (CRS) definition and allow the user to edit it. Another widget should use the CRS definitions for applying coordinate operations (map projections) using the existing Apache SIS referencing engine, and show the result in a table with information about accuracy and domain of validity.
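As a rough illustration of how such a widget could call the existing referencing engine, the following snippet uses the public Apache SIS API to look up two CRS by their EPSG codes, find a coordinate operation between them and transform a single position; treat it as a sketch rather than final application code:

```java
// Sketch of driving the Apache SIS referencing engine from application code.
import org.apache.sis.geometry.DirectPosition2D;
import org.apache.sis.referencing.CRS;
import org.opengis.geometry.DirectPosition;
import org.opengis.referencing.crs.CoordinateReferenceSystem;
import org.opengis.referencing.operation.CoordinateOperation;
import org.opengis.referencing.operation.TransformException;
import org.opengis.util.FactoryException;

public class ProjectionDemo {
    public static void main(String[] args) throws FactoryException, TransformException {
        CoordinateReferenceSystem geographic = CRS.forCode("EPSG:4326");   // WGS 84 (latitude, longitude)
        CoordinateReferenceSystem mercator   = CRS.forCode("EPSG:3395");   // World Mercator

        // Find a coordinate operation between the two CRS (no area of interest given).
        CoordinateOperation op = CRS.findOperation(geographic, mercator, null);

        // Transform a single position; a table widget would do this for every row.
        DirectPosition projected = op.getMathTransform()
                .transform(new DirectPosition2D(geographic, 48.0, 2.0), null);

        System.out.println("Projected position: " + projected);
        System.out.println("Accuracy metadata : " + op.getCoordinateOperationAccuracy());
        System.out.println("Domain of validity: " + op.getDomainOfValidity());
    }
}
```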
Edit (March 2021): A JavaFX application has been created. It has widgets for metadata and vector data, but we still need a widget for Coordinate Reference System definitions. See the SIS wiki for screenshots.
Coordinate operation methods to implement
This is an umbrella task for some coordinate operation methods not yet supported in Apache SIS. Coordinate operations include map projections (e.g. Transverse Mercator, Lambert Conic Conformal, etc.), datum shifts (e.g. transformations from NAD27 to NAD83 in United States), transformation of vertical coordinates, etc. We can of course not list all possible formulas that we do not support, but this JIRA task lists at least some of the operations listed in the EPSG guidance notes.
The main material for this work is the EPSG guidance notes, which can be downloaded freely from the following site:
IOGP Publication 373-7-2 – Geomatics Guidance Note number 7, part 2
Coordinate Conversions and Transformations including Formulas
http://www.epsg.org/GuidanceNotes
Google Summer of Code students interested in this work would need to be reasonably comfortable with the Java language (but not necessarily with the JDK library at large, since this work uses relatively few JDK classes outside Math), and with mathematics. In particular, this work requires a good understanding of affine transforms: their representation as a matrix, and how to map a term in a formula to a coefficient in the affine transform matrix.
Apache SIS has one advanced feature which is not easily found in popular geospatial software or text books: the capability to compute the derivative (or more precisely, the Jacobian) of a transformation at a given point. Implementation of this feature requires the capability to find the analytic derivative of a non-linear formula and to simplify it.
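To illustrate what an analytic derivative of a projection looks like, the following self-contained example (not Apache SIS code) implements a spherical Mercator forward transform together with its Jacobian matrix at a given point; real SIS implementations do the same for far more complex formulas:

```java
// Illustrative spherical Mercator with its analytic Jacobian; not Apache SIS code.
public final class SphericalMercator {

    private final double radius;

    public SphericalMercator(double radius) {
        this.radius = radius;
    }

    /** Forward projection: (longitude, latitude) in radians to (x, y) in metres. */
    public double[] transform(double lambda, double phi) {
        double x = radius * lambda;
        double y = radius * Math.log(Math.tan(Math.PI / 4 + phi / 2));
        return new double[] {x, y};
    }

    /**
     * Jacobian matrix of the forward projection at (lambda, phi):
     *   dx/dlambda = R,  dx/dphi = 0,
     *   dy/dlambda = 0,  dy/dphi = R / cos(phi).
     */
    public double[][] derivative(double lambda, double phi) {
        return new double[][] {
            {radius, 0},
            {0, radius / Math.cos(phi)}
        };
    }

    public static void main(String[] args) {
        SphericalMercator proj = new SphericalMercator(6371007);   // authalic Earth radius in metres
        double[] xy = proj.transform(Math.toRadians(2), Math.toRadians(48));
        double[][] jac = proj.derivative(Math.toRadians(2), Math.toRadians(48));
        System.out.printf("x=%.1f y=%.1f, dy/dphi=%.1f%n", xy[0], xy[1], jac[1][1]);
    }
}
```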
Implementations of those formulas take place in one of the org.apache.sis.referencing.operation sub-packages (projection or transform). Implementations of JUnit tests happen partially in Apache SIS, and partially in the "conformance module" of the GeoAPI project, if possible through the Geospatial Integrity of Geoscience Software (GIGS) tests.
Solr
Refactor test infra to work with a managed SolrClient; ditch TestHarness
This is a proposal to substantially refactor SolrTestCaseJ4 and some of its intermediate subclasses in the hierarchy. In essence, I envision that tests should work with a SolrClient typed "solrClient" field managed by the test infrastructure. With only a few lines of code, a test should be able to pick between an instance based on EmbeddedSolrServer (lighter tests), HttpSolrClient (tests HTTP/Jetty behavior directly or indirectly), SolrCloud, and perhaps a special one for our distributed search tests. STCJ4 would refactor its methods to use the solrClient field instead of TestHarness. TestHarness would disappear as-such; bits of its existing code would migrate elsewhere, such as to manage an EmbeddedSolrServer for testing.
I think we can do a transition like this in stages and furthermore minimally affecting most tests by adding some deprecated shims. Perhaps STCJ4 should become the deprecated shim so that users can still use it during 7.x and to help us with the transition internally too. More specifically, we'd add a new superclass to STCJ4 that is the future – "SolrTestCase".
Additionally, there are a bunch of methods on SolrTestCaseJ4 that I question the design of, especially ones that return XML strings like delI (generates a delete-by-id XML string) and adoc. Perhaps that used to be a fine idea before there was a convenient SolrClient API but we've got one now and a test shouldn't be building XML unless it's trying to test exactly that.
For consulting work I once developed a JUnit4 TestRule managing a SolrClient that is declared in a test with an annotation of @ClassRule. I had a variation for SolrCloud and EmbeddedSolrServer that was easy for a test to choose. Since TestRule is an interface, I was able to make a special delegating SolrClient subclass that implements TestRule. This isn't essential but makes use of it easier since otherwise you'd be forced to call something like getSolrClient(). We could go the TestRule route here, which I prefer (with or without having it subclass SolrClient), or we could alternatively do TestCase subclassing to manage the lifecycle.
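As a rough sketch of that idea (illustrative only, with hypothetical names such as ManagedSolrClientRule; not the actual Solr test code), a rule owning a SolrClient for the duration of a test class could look like this:

```java
// Hypothetical TestRule that manages a SolrClient; names and structure are illustrative.
import java.util.function.Supplier;

import org.apache.solr.client.solrj.SolrClient;
import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

public class ManagedSolrClientRule implements TestRule {

    private final Supplier<SolrClient> factory;   // e.g. embedded, HTTP or cloud client
    private SolrClient client;

    public ManagedSolrClientRule(Supplier<SolrClient> factory) {
        this.factory = factory;
    }

    public SolrClient client() {
        return client;
    }

    @Override
    public Statement apply(Statement base, Description description) {
        return new Statement() {
            @Override
            public void evaluate() throws Throwable {
                client = factory.get();        // create the client before the tests run
                try {
                    base.evaluate();           // run the test class
                } finally {
                    client.close();            // always release resources afterwards
                    client = null;
                }
            }
        };
    }
}

// Usage in a test class (hypothetical):
//   @ClassRule
//   public static ManagedSolrClientRule solr =
//       new ManagedSolrClientRule(() -> /* build an EmbeddedSolrServer or HttpSolrClient */ null);
```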
Initially I'm just looking for agreement and refinement of the approach. After that, sub-tasks ought to be added. I won't have time to work on this for some time.
Pulsar
Integration with Apache Ranger
Currently, Pulsar only supports storing authorization policies in local ZooKeeper. Is it possible to support [Ranger](https://github.com/apache/ranger)? Apache Ranger can provide a framework for central administration of security policies and monitoring of user access.
Throttle the ledger rollover for the broker
In Pulsar, ledger rollover splits the data of a topic into multiple segments. For each ledger rollover operation, the metadata of the topic needs to be updated in ZooKeeper. A high ledger rollover frequency may put the ZooKeeper cluster under a heavy workload. In order to make ZooKeeper run more stably, we should limit the ledger rollover rate.
Support reset cursor by message index
Currently, Pulsar supports resetting the cursor according to time and message-id, e.g. you can reset the cursor to 3 hours ago or reset the cursor to a specific message-id. For cases where users want to reset the cursor to, say, 10,000 messages earlier, Pulsar does not support this operation yet.
PIP-70 https://github.com/apache/pulsar/wiki/PIP-70%3A-Introduce-lightweight-raw-Message-metadata introduced a broker level entry metadata which can support a message index for a topic (or message offset of a topic); this will provide the ability to reset the cursor according to the message index.
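For orientation, the sketch below shows how the feature could surface in the admin client: the first two calls (reset by timestamp and by message-id) reflect the existing Pulsar admin API as far as I know, while resetCursorByIndex(...) is a hypothetical method that only illustrates the proposed addition:

```java
// Sketch only; resetCursorByIndex(...) is hypothetical and shown commented out.
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.MessageId;

public class ResetCursorExample {
    public static void main(String[] args) throws Exception {
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {

            String topic = "persistent://public/default/my-topic";
            String subscription = "my-sub";

            // Existing: reset to 3 hours ago.
            long threeHoursAgo = System.currentTimeMillis() - TimeUnit.HOURS.toMillis(3);
            admin.topics().resetCursor(topic, subscription, threeHoursAgo);

            // Existing: reset to a specific message-id.
            admin.topics().resetCursor(topic, subscription, MessageId.earliest);

            // Proposed (hypothetical method): reset to 10,000 messages earlier,
            // resolved via the PIP-70 broker-level message index.
            // admin.topics().resetCursorByIndex(topic, subscription, 10_000);
        }
    }
}
```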
Support publish and consume avro objects in pulsar-perf
We should use the perf tool to benchmark producing and consuming messages using a Schema.
Improve the message written count metrics for the topic
Currently, Pulsar exposes the message written count metrics through the Prometheus endpoint, but the metrics are maintained in the broker and are not persisted. So if the topic ownership changes or the broker restarts, the message written count of the topic is reset to 0. This confuses users and makes it impossible to get correct message written count metrics.
PIP-70 https://github.com/apache/pulsar/wiki/PIP-70%3A-Introduce-lightweight-raw-Message-metadata introduced a broker level entry metadata which can support a message index for a topic (or message offset of a topic). This provides the ability to calculate the precise message written count for a topic, so we can leverage PIP-70 to improve the message written count metrics for the topic.
Expose the broker level message metadata to the client
PIP-70 https://github.com/apache/pulsar/wiki/PIP-70%3A-Introduce-lightweight-broker-entry-metadata introduced a broker level entry metadata which already supports adding a message index and a broker timestamp to the message. But currently, the client can't get the broker level message metadata because the broker skips this information when dispatching messages to the client. Provide a way to expose the broker level message metadata to the client.
Improve the message backlogs for the topic
In Pulsar, the client usually sends several messages in a batch. On the broker side, the broker receives the batch and writes the batch message to the storage layer.
The message backlog indicates how many messages still need to be handled for a subscription. Unfortunately, the current backlog is based on batches, not messages. This confuses users: they may have pushed 1,000 messages to the topic, but when checking the backlog from the subscription side they get a value lower than 1,000 messages, such as 100 batches. The message-based backlog is not available because it is expensive to calculate the number of messages in each batch.
PIP-70 https://github.com/apache/pulsar/wiki/PIP-70%3A-Introduce-lightweight-raw-Message-metadata introduced a broker level entry metadata which can support a message index for a topic (or message offset of a topic). This provides the ability to calculate the number of messages between one message index and another, so we can leverage PIP-70 to improve the message backlog implementation and make the message-based backlog available.
For an Exclusive or Failover subscription, it is easy to implement by calculating the messages between the mark delete position and the LAC position. For Shared and Key_Shared subscriptions, the individual acknowledgments bring some complexity. We can cache the individual acknowledgment count in the broker memory, so the message backlog for a Shared or Key_Shared subscription is `backlogOfTheMarkdeletePosition` - `IndividualAckCount`.
OODT
Improve OPSUI React.js UI with advanced functionalities
In GSoC 2019, we implemented a new OPSUI UI based on React.js. See the related blog posts [1] [2]. Several advanced features remain to be implemented, including:
- Implement querying functionality at OPSUI side (scope can be determined)
- Show progress of workflows and file ingestions
- Introduce a proper REST API for resource manager component
- Introduce proper packaging (with configurable external REST API URLs) and deployment mechanism (as a docker deployment or an npm package)
In this project, the student will have to work on the UI with React.js and will have to implement several REST APIs using JAX-RS. Furthermore, the student will have to work on making OPSUI easy to deploy.
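As a purely hypothetical sketch of the kind of JAX-RS resource the resource manager REST API task could introduce (the path, return value and class name are invented for illustration and are not existing OODT classes):

```java
// Hypothetical JAX-RS resource; illustrates the shape of a resource manager endpoint only.
import java.util.List;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

@Path("/resourcemgr/nodes")
public class ResourceNodeResource {

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public List<String> listNodeIds() {
        // In a real implementation this would query the resource manager
        // for its registered nodes; here we return a static placeholder.
        return List.of("node-1", "node-2");
    }
}
```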
The existing Wicket-based OPSUI will be replaced by the new React.js-based OPSUI at the end of this project. The linked blog posts are a good starting point to understand what the new React.js-based OPSUI is capable of doing.
[1] https://medium.com/faun/gsoc-2019-apache-oodt-react-based-opsui-dashboard-d93a9083981c
[2] https://medium.com/faun/whats-new-in-apache-oodt-react-opsui-dashboard-4cc6701628a9
[3] https://medium.com/faun/apache-oodt-with-docker-84d32525c798
James Server
[GSOC-2021] Implement Thread support for JMAP
Why?
Mail user agents generally allow displaying emails grouped by conversations (replies, forwards, etc.).
As part of the JMAP RFC-8621 implementation, there is a dedicated concept: threads. We did implement JMAP Threads in a rather naive way: each email is a thread of its own.
This naive implementation is specification compliant but defeats the overall purpose of threads.
I propose myself to mentor the implementation of Threads as part of the James JMAP implementation.
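For orientation, the sketch below shows one common way to derive a thread identity for an email, combining the base subject with the conversation root taken from the References header; it is an illustration of the grouping idea only, not the James implementation:

```java
// Illustrative thread-key derivation; not the James JMAP implementation.
import java.util.Locale;
import java.util.Optional;

public final class ThreadKey {

    /** Strips reply/forward prefixes such as "Re:" or "Fwd:" from the subject. */
    static String baseSubject(String subject) {
        return subject == null ? ""
                : subject.replaceAll("(?i)^\\s*((re|fwd?)\\s*:\\s*)+", "")
                         .trim()
                         .toLowerCase(Locale.ROOT);
    }

    /** The first message-id in References (the conversation root), if present. */
    static Optional<String> conversationRoot(String referencesHeader) {
        if (referencesHeader == null || referencesHeader.isBlank()) {
            return Optional.empty();
        }
        return Optional.of(referencesHeader.trim().split("\\s+")[0]);
    }

    /** Emails with the same key would be grouped into the same JMAP Thread. */
    static String of(String subject, String referencesHeader, String messageId) {
        return conversationRoot(referencesHeader).orElse(messageId) + "|" + baseSubject(subject);
    }

    public static void main(String[] args) {
        // Both calls print the same key, so the reply joins the original message's thread.
        System.out.println(ThreadKey.of("Re: Weekly report", "<root@example.org> <b@example.org>", "<c@example.org>"));
        System.out.println(ThreadKey.of("Weekly report", null, "<root@example.org>"));
    }
}
```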
Fineract Cloud Native
Machine Learning Scorecard for Credit Risk Assessment Phase 4
Mentors
- Lalit Mohan S
- VICTOR ROMERO
Overview & Objectives
Financial organizations using Mifos/Fineract depend on external agencies or their past experience for credit scoring and identification of potential NPAs. Though information from external agencies is required, financial organizations can maintain an internal scorecard for evaluating loans so that preventive/proactive actions can be taken alongside external agency reports. In industry, organizations use rule-based, statistical and machine learning methods for credit scoring, predicting potential NPAs, fraud detection and other activities. This project aims to implement a scorecard based on statistical and ML methods for credit scoring and identification of potential NPAs.
Description
The approach should build on and improve last year's GSoC work (https://gist.github.com/SupreethSudhakaranMenon/a20251271adb341f949dbfeb035191f7) on features/characteristics, criteria and evaluation (link). The design and implementation of the screens should follow Mifos application standards. The project should implement statistical and ML methods with explainability of the decision making, and should be extensible for adding other functionalities such as fraud detection, cross-sell and up-sell, etc.
Helpful Skills
JAVA, Integrating Backend Service, MIFOS X, Apache Fineract, AngularJS, ORM, ML, Statistical Methods, Django
Impact
Streamlined Operations, Better RISK Management, Automated Response Mechanism
Other Resources
2019 Progress: https://gist.github.com/SupreethSudhakaranMenon/a20251271adb341f949dbfeb035191f7
Create Open Banking Layer for Fineract CN Self-Service App
Mentors
Overview & Objectives
Across our ecosystem we're seeing more and more adoption and innovation from fintechs. A huge democratizing force across the financial services sector is the Open Banking movement, providing Open Banking APIs to enable third parties to directly interact with customers of financial institutions. We have recently started providing an Open Banking API layer that will allow financial institutions using Mifos and Fineract to offer third parties access to request account information and initiate payments via these APIs. Most recently the Mojaloop community, led by Google, has led the development of a centralized PISP API. We have chosen to follow the comprehensive UK Open Banking API standard, which is being adopted by a number of countries across Sub-Saharan Africa and Latin America.
Tremendous impact can be had at the Base of the Pyramid by enabling third parties to establish consent with customers to authorize transactions to be initiated or information to be accessed from accounts at their financial institution. This Open Banking API layer would enable any institution using Mifos or Fineract to provide a UK Open Banking API layer to third parties and fintechs.
The API Gateway to connect to is still being chosen (WS02, Gravitee, etc.)
Description
The APIs that are consumed by the reference Fineract 1.x mobile banking application have been documented in the spreadsheets below. The APIs have also been categorized according to whether they are an existing self-service API or a back-office API, and whether they have an equivalent Open Banking API; if so, a link to the corresponding Open Banking API is provided.
For each API with an equivalent Open Banking API, the interns must take the REST API, upload the Swagger definition, do the transformation in the Open Banking Adapter, and publish it on the API gateway.
For back-office and/or self-service APIs with no equivalent Open Banking API, the process is to take the REST API, upload the Swagger definition, and publish it on the API gateway.
For example:
- Submit Loan Application (Self-Service API with Equivalent Open Banking API)
- https://demo.mifos.io/api-docs/apiLive.htm#loans_create
- Used by Fineract 1.x Self-Service App
- Images API (Back-Office API with No Equivalent Open Banking API)
- https://demo.mifos.io/api-docs/apiLive.htm#client_images
- Used by Mifos Mobile and Mobile Wallet
- Fetch Identification Card API (Fineract CN API with no equivalent Open Banking API)
- https://docs.google.com/document/d/15LbxVoQQRoa4uU7QiV7FpJFVjkyyNb9_HJwFvS47O4I/edit?pli=1#heading=h.xfl6jxdpcpy1
Sample APIs to be Documented
-------------------------------------------
Mifos Mobile CN API Matrix (completed by Garvit)
https://docs.google.com/spreadsheets/d/1-HrfPKhh1kO7ojK15Ylf6uzejQmaz72eXf5MzCBCE3M/edit#gid=0
https://docs.google.com/document/d/15LbxVoQQRoa4uU7QiV7FpJFVjkyyNb9_HJwFvS47O4I/edit?pli=1#
Mobile Wallet API Matrix (completed by Devansh)
https://docs.google.com/spreadsheets/d/1VgpIwN2JsljWWytk_Qb49kKzmWvwh6xa1oRgMNIAv3g/edit#gid=0
Helpful Skills
Android development, SQL, Java, Javascript, Git, Spring, OpenJPA, Rest, Kotlin, Gravitee, WSO2
Impact
By providing a standard UK Open Banking API layer we can provide both a secure way for our trusted first party apps to allow customers to authenticate and access their accounts as well as an API layer for third party fintechs to securely access Fineract and request information or initiate transactions with the consent of customers.
Other Resources
CGAP Research on Open Banking: https://www.cgap.org/research/publication/open-banking-how-design-financial-inclusion
Docs: https://mifos.gitbook.io/docs/wso2-1/setup-openbanking-apis
Self-Service APIs: https://demo.mifos.io/api-docs/apiLive.htm#selfbasicauth
- https://cwiki.apache.org/confluence/display/FINERACT/Customer+Self-Service+Phase+2
Open Banking Adapter: https://github.com/openMF/openbanking-adapter - Transforms Open Banking API to Fineract API
- Works with both Fineract 1.x and Fineract CN
- Can connect to different API gateways and can transform against different API standards.
Reference Open Banking Fintech App:
- Backend: https://github.com/openMF/openbanking-tpp-server
- GUI: https://github.com/openMF/openbanking-tpp-client
Google Whitepaper on 3PPI: https://static.googleusercontent.com/media/nextbillionusers.google/en//tools/3PPI-2021-whitepaper.pdf
UK Open Banking API Standard: https://standards.openbanking.org.uk/
Open Banking Developer Zone: https://openbanking.atlassian.net/wiki/spaces/DZ/overview
Examples of Open Banking Apps: https://www.ft.com/content/a5f0af78-133e-11e9-a581-4ff78404524e
Functional Enhancements to Fineract CN Mobile
Mentors
Overview & Objectives
Just as we have a mobile field operations app on Apache Fineract 1.x, we have recently built out on top of the brand new Apache Fineract CN micro-services architecture, an initial version of a mobile field operations app with an MVP architecture and material design. Given the flexibility of the new architecture and its ability to support different methodologies - MFIs, credit unions, cooperatives, savings groups, agent banking, etc - this mobile app will have different flavors and workflows and functionalities.
Description
In 2020, our Google Summer of Code intern worked on additional functionality in the Fineract CN mobile app. In 2021, the student will work on the following tasks:
- Integrate with Payment Hub to enable disbursement via Mobile Money API
- Improve task management features in the app
- Create UI for creating a new account and displaying account details
- Create UI for creating tellers and displaying teller details
- Improve GIS features, such as location tracking and dropping a pin, in the app
- Improve offline mode via Couchbase support
- Write Unit Test, Integration Test and UI tests
Helpful Skills
Android Development, Kotlin, Java, Git, OpenJPA, Rest API
Impact
Allows staff to go directly into the field to connect to the client. Reduces cost of operations by enabling organizations to go paperless and be more efficient.
Other Resources
- Repo on Github: https://github.com/apache/fineract-cn-mobile
- Fineract CN API documentation: https://izakey.github.io/fineract-cn-api-docs-site/
- https://github.com/aasaru/fineract-cn-api-docs
- https://cwiki.apache.org/confluence/display/FINERACT/Fineract+CN
- How to install and run Couchbase: https://gist.github.com/jawidMuhammadi/af6cd34058cacf20b100d335639b3ad8
- GSMA mobile money API: https://developer.mobilemoneyapi.io/1.1/oas3/22466
- Payment Hub: https://github.com/search?q=openMF%2Fph-ee&ref=opensearch
- Some UI designs: https://www.figma.com/file/KHXtZPdIpC3TqvdIVZu8CW/fineract-cn-mobile?node-id=0%3A1
- 2020 GSoC progress report: https://gist.github.com/jawidMuhammadi/9fa91d37b1cbe43d9cdfe165ad8f2102
- https://issues.apache.org/jira/browse/FINCN-241?filter=-2&jql=project%20%3D%20FINCN%20order%20by%20created%20DESC
SkyWalking
Apache SkyWalking: Python agent supports profiling
Apache SkyWalking [1] is an application performance monitor (APM) tool for distributed systems, especially designed for microservices, cloud native and container-based (Docker, K8s, Mesos) architectures.
SkyWalking is based on agent to instrument (automatically) monitored services, for now, we have many agents for different languages, Python agent [2] is one of them, which supports automatic instrumentations.
The goal of this project is to extend the agent's features by supporting profiling [3] of a function's invocation stack, helping users analyze which method costs the most time in a cross-service call.
To complete this task, you must be comfortable with Python and have some knowledge of tracing systems; otherwise you'll have a hard time coming up to speed.
[1] http://skywalking.apache.org
[2] http://github.com/apache/skywalking-python
[3] https://thenewstack.io/apache-skywalking-use-profiling-to-fix-the-blind-spot-of-distributed-tracing/
Apache SkyWalking: Python agent collects and reports PVM metrics to backend
Apache SkyWalking [1] is an application performance monitor (APM) tool for distributed systems, especially designed for microservices, cloud native and container-based (Docker, K8s, Mesos) architectures.
Tracing distributed systems is one of the main features of SkyWalking; with those traces, it can analyze service metrics such as CPM, success rate, error rate, Apdex, etc. SkyWalking also supports receiving metrics from the agent side directly.
In this task, we expect the Python agent to report its Python Virtual Machine (PVM) metrics, including (but not limited to; other useful metrics are also acceptable) CPU usage (%), memory used (MB), (active) thread/coroutine counts, garbage collection count, etc.
To complete this task, you must be comfortable with Python and gRPC, otherwise you'll have a hard time coming up to speed.
Live demo to play around: http://122.112.182.72:8080 (under reconstruction, maybe unavailable but latest demo address can be found at the GitHub index page http://github.com/apache/skywalking)
...
ShardingSphere
ShardingSphere: Proofread the DDL/TCL SQL definitions for ShardingSphere Parser
Apache ShardingSphere
Apache ShardingSphere is a distributed database middleware ecosystem that currently consists of two independent products, ShardingSphere JDBC and ShardingSphere Proxy. Both provide functions of data sharding, distributed transaction, and database orchestration.
Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere
Background
ShardingSphere parser engine helps users parse a SQL to get the AST (Abstract Syntax Tree) and visit this tree to get SQLStatement (Java Object). At present, this parser engine can handle SQLs for `MySQL`, `PostgreSQL`, `SQLServer` and `Oracle`, which means we have to understand different database dialect SQLs.
More details: https://shardingsphere.apache.org/document/current/en/features/sharding/principle/parse/
Task
This issue is to proofread the following definitions,
- All the DDL SQL definitions for Oracle except for ALTER, DROP, CREATE and TRUNCATE.
- All the TCL (Transaction Control Language) SQL definitions for Oracle
You can learn more here.
As we have basic Oracle SQL syntax definitions that do not fully keep in line with the Oracle documentation, we need you to find the vague SQL grammar definitions and correct them by referring to the Oracle documentation.
Note that when you review the target SQLs above, you will find that these definitions involve some basic elements of Oracle SQL. Naturally, these elements are included in this task as well.
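As a hedged illustration of how a corrected grammar rule can be exercised with the ANTLR runtime, the snippet below assumes the generated lexer/parser class names (OracleStatementLexer, OracleStatementParser) and the rule name (createTable); adjust them to whatever ANTLR actually generates from the project's g4 files:

```java
// Sketch only: generated class and rule names below are assumptions, not verified project classes.
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;

public class OracleGrammarSmokeTest {

    public static void main(String[] args) {
        String sql = "CREATE TABLE t_order (order_id NUMBER(10) PRIMARY KEY, user_id NUMBER(10))";

        // Assumed names of the ANTLR-generated lexer/parser for the Oracle grammar.
        OracleStatementLexer lexer = new OracleStatementLexer(CharStreams.fromString(sql));
        OracleStatementParser parser = new OracleStatementParser(new CommonTokenStream(lexer));

        parser.createTable();   // invoke the rule under proofreading (assumed rule name)

        // A proofread definition should parse valid Oracle SQL without syntax errors.
        if (parser.getNumberOfSyntaxErrors() > 0) {
            throw new AssertionError("Grammar rejected valid Oracle SQL: " + sql);
        }
        System.out.println("Parsed OK");
    }
}
```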
Relevant Skills
1. Master JAVA language
2. Have a basic understanding of Antlr g4 file
3. Be familiar with Oracle SQLs
Targets files
1. DDL SQLs g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/DDLStatement.g4
2. TCL SQLs g4 file:
https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/TCLStatement.g4
3. Basic elements g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/BaseRule.g4
References
1. Oracle SQL quick reference: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlqr/SQL-Statements.html#GUID-1FA35EAD-AED2-4619-BFEE-348FF05D1F4A
2. Detailed Oracle SQL info: https://docs.oracle.com/pls/topic/lookup?ctx=en/database/oracle/oracle-database/19/sqlqr&id=SQLRF008
Mentor
Juan Pan, PMC of Apache ShardingSphere, panjuan@apache.org
Apache SkyWalking: Python agent collects and reports PVM metrics to backend
Apache SkyWalking [1] is an application performance monitor (APM) tool for distributed systems, especially designed for microservices, cloud native and container-based (Docker, K8s, Mesos) architectures.
Tracing distributed systems is one of SkyWalking's main features; from those traces it can derive service metrics such as CPM, success rate, error rate, Apdex, etc. SkyWalking also supports receiving metrics directly from the agent side.
In this task, we expect the Python agent to report its Python Virtual Machine (PVM) metrics, including (but not limited to; any other useful metrics are also acceptable) CPU usage (%), memory used (MB), (active) thread/coroutine counts, garbage collection counts, etc.
To complete this task, you must be comfortable with Python and gRPC, otherwise you'll have a hard time coming up to speed.
Live demo to play around with: http://122.112.182.72:8080 (under reconstruction and possibly unavailable, but the latest demo address can be found on the GitHub index page http://github.com/apache/skywalking)
...
Apache ShardingSphere: Add unit test for example
Apache ShardingSphere
Apache ShardingSphere is a distributed database middleware ecosystem that currently consists of two independent products, ShardingSphere-JDBC and ShardingSphere-Proxy. Both provide data sharding, distributed transactions, and database orchestration.
Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere
Background
The examples of ShardingSphere do not have test cases.
After mvn install, a developer only knows that the examples compile; there is no guarantee that the code behaves correctly, especially the configuration for YAML, Spring namespace, and Spring Boot starter.
Task
This issue is to add automated test cases with JUnit that assert both successful startup and correct code logic, as in the sketch below.
Note that the current example code may need to be refactored to make it easier to test.
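As a rough illustration only (not part of the current examples), a test like the following could assert that an example's YAML-configured data source starts up and answers a trivial query. It assumes the example module exposes a sharding YAML file on its test classpath and uses ShardingSphere-JDBC's YamlShardingSphereDataSourceFactory (available in recent 5.x versions); adjust names to the actual example under test.

import static org.junit.Assert.assertTrue;

import java.io.File;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.sql.DataSource;
import org.apache.shardingsphere.driver.api.yaml.YamlShardingSphereDataSourceFactory;
import org.junit.Test;

public class ExampleStartupTest {

    @Test
    public void assertDataSourceStartsAndAnswersQueries() throws Exception {
        // Hypothetical config name; point this at the YAML used by the example under test.
        File yamlConfig = new File(getClass().getResource("/sharding-databases-tables.yaml").toURI());
        DataSource dataSource = YamlShardingSphereDataSourceFactory.createDataSource(yamlConfig);

        // Startup succeeded if we can open a connection and run a trivial statement.
        try (Connection connection = dataSource.getConnection();
             Statement statement = connection.createStatement();
             ResultSet resultSet = statement.executeQuery("SELECT 1")) {
            assertTrue(resultSet.next());
        }
    }
}

A real test would go further and assert the sharding logic itself, e.g. insert rows and verify that they are routed to the expected physical tables.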
Relevant Skills
1. Proficiency in Java
2. Familiarity with the Spring framework
3. A basic understanding of JUnit
Target files
Example repo: https://github.com/apache/shardingsphere/tree/master/examples
Mentor
Liang Zhang, PMC Chair of Apache ShardingSphere, zhangliang@apache.org
Apache ShardingSphere: Proofread the DML SQL definitions for ShardingSphere Parser
Apache ShardingSphere
Apache ShardingSphere is a distributed database middleware ecosystem that currently consists of two independent products, ShardingSphere-JDBC and ShardingSphere-Proxy. Both provide data sharding, distributed transactions, and database orchestration.
Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere
Background
ShardingSphere parser engine helps users parse a SQL to get the AST (Abstract Syntax Tree) and visit this tree to get SQLStatement (Java Object). At present, this parser engine can handle SQLs for `MySQL`, `PostgreSQL`, `SQLServer` and `Oracle`, which means we have to understand different database dialect SQLs.
More details: https://shardingsphere.apache.org/document/current/en/features/sharding/principle/parse/
Task
This issue is to proofread the DML (SELECT/UPDATE/DELETE/INSERT) SQL definitions for Oracle. We already have basic Oracle SQL syntax definitions, but they do not fully keep in line with the Oracle documentation, so we need you to find the vague or incorrect grammar definitions and correct them by referring to the Oracle docs.
Note that when you review these DML (SELECT/UPDATE/DELETE/INSERT) statements, you will find that the definitions involve some basic elements of Oracle SQL. Those elements are included in this task as well.
Relevant Skills
1. Proficiency in Java
2. A basic understanding of ANTLR g4 grammar files
3. Familiarity with Oracle SQL
Target files
1. DML SQLs g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/DMLStatement.g4
2. Basic elements g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/BaseRule.g4
References
1. Oracle SQL quick reference: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlqr/SQL-Statements.html#GUID-1FA35EAD-AED2-4619-BFEE-348FF05D1F4A
2. Detailed Oracle SQL info: https://docs.oracle.com/pls/topic/lookup?ctx=en/database/oracle/oracle-database/19/sqlqr&id=SQLRF008
Mentor
Juan Pan, PMC of Apache ShardingSphere, panjuan@apache.org
IoTDB
Implement PISA index in Apache IoTDB
Apache IoTDB is a highly efficient time series database that supports high-speed query processing, including aggregation queries.
Currently, IoTDB pre-calculates aggregation info, also called summary info (sum, count, max_time, min_time, max_value, min_value), for each page and each Chunk. This info helps aggregation operations and some query filters. For example, if the query filter is value > 10 and the max value of a page is 9, we can skip the page. For another example, if the query is select max(value) and the max values of 3 chunks are 5, 10, and 20, then max(value) is 20.
However, there are two drawbacks:
1. The summary info reduces the data that needs to be scanned to 1/k (suppose each page has k data points), but the time complexity is still O(N). If we store long historical data, e.g., 2 years of data at 500 KHz, an aggregation operation may still be time-consuming. So a tree-based index that reduces the time complexity from O(N) to O(log N) is a good choice. Some basic ideas have been published in [1], but that approach only handles data with a fixed frequency, so improving it and implementing it in IoTDB is the goal.
2. The summary info does not help evaluate a filter like where value > 8 when the max value is 10. If we enrich the summary info, e.g., by storing a data histogram, we can use the histogram to estimate how many points will be returned.
This proposal is mainly about adding an index to speed up aggregation queries; making the summary info more useful would be a bonus.
Note that the premise is that insertion speed must not be slowed down too much!
Also, IoTDB already provides an index framework, so the PISA index should be compatible with it.
You should know:
• IoTDB query process
• TsFile structure and organization
• Basic index knowledge
• Java
difficulty: Major
mentors:
hxd@apache.org
Reference:
[1] https://www.sciencedirect.com/science/article/pii/S0306437918305489
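To make the O(N) vs. O(log N) discussion above concrete, here is a minimal, illustrative sketch of a tree-based pre-aggregation structure (a simplified, segment-tree-style index over per-chunk summaries). It is not IoTDB code; the class and method names are made up, and a real PISA implementation must also handle appends, time ranges, and the other summary fields.

// Simplified sketch: a complete binary tree over per-chunk sums, answering
// range-sum aggregation over chunk summaries in O(log n) instead of O(n).
public final class ChunkSumIndex {
    private final long[] tree; // tree[n..2n-1] hold leaf sums, tree[1..n-1] hold internal sums
    private final int n;

    public ChunkSumIndex(long[] chunkSums) {
        this.n = chunkSums.length;
        this.tree = new long[2 * n];
        System.arraycopy(chunkSums, 0, tree, n, n);
        for (int i = n - 1; i >= 1; i--) {
            tree[i] = tree[2 * i] + tree[2 * i + 1]; // each parent pre-aggregates its children
        }
    }

    // Sum of chunk summaries with index in [from, to), visiting O(log n) nodes.
    public long rangeSum(int from, int to) {
        long sum = 0;
        for (int l = from + n, r = to + n; l < r; l >>= 1, r >>= 1) {
            if ((l & 1) == 1) sum += tree[l++];
            if ((r & 1) == 1) sum += tree[--r];
        }
        return sum;
    }
}

The same shape works for count, min, and max (with a different merge function), which is what lets an aggregation query touch only O(log N) summaries instead of scanning every chunk.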
Apache IoTDB Integration Test
Apache IoTDB is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.
Now, IoTDB uses JUnit for its UT/IT tests.
However, there are two drawbacks:
1. There are many singleton class instances in IoTDB. Therefore, modifying something in one test may impact others, and it requires a lot of cleanup work after each test.
In particular, after we open a server socket (via Thrift), the socket may not be closed quickly even though we have called socket.close (this is controlled by Thrift). If the next test begins before the port is released, a "port is already in use" error occurs.
2. When testing IoTDB's cluster module, we may need to start at least 3 IoTDB instances on one server.
With plain JUnit, the 3 instances run in one JVM, which conflicts with the reality that IoTDB has many singleton instances.
So, next, we want to use Testcontainers, which combines Docker and JUnit (a minimal sketch follows at the end of this section).
This task is for:
1. using TestContainer to re-implement all IT codes of IoTDB;
2. using TestContainer to add some IT codes for IoTDB's cluster module.
Needed skills:
- Java
- Docker (Docker-Compose better)
- Know (or be willing to learn) JUnit and Testcontainers
[1] iotdb.apache.org
[2] https://www.testcontainers.org/
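As a rough illustration of the Testcontainers approach (not existing IoTDB test code), the sketch below starts an IoTDB server in a Docker container for each test and connects to the mapped port. The image tag is an assumption, and the actual client calls are left as comments; adjust both to the project's real image and session API.

import org.junit.Assert;
import org.junit.Rule;
import org.junit.Test;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.DockerImageName;

public class IoTDBContainerIT {

    // Each test gets its own isolated IoTDB process, so singletons and ports
    // cannot leak between tests. The image name/tag below is an assumption.
    @Rule
    public GenericContainer<?> iotdb =
        new GenericContainer<>(DockerImageName.parse("apache/iotdb:latest"))
            .withExposedPorts(6667);

    @Test
    public void serverStartsAndExposesClientPort() {
        // Testcontainers maps the container port to a random free host port,
        // avoiding the "port is already in use" problem described above.
        String host = iotdb.getHost();
        int port = iotdb.getMappedPort(6667);
        Assert.assertTrue(iotdb.isRunning());
        // A real test would now open an IoTDB session against host:port,
        // write a few points and assert the query results.
        System.out.println("IoTDB reachable at " + host + ":" + port);
    }
}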
Apache IoTDB C# library
Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.
IoTDB has two kinds of client interfaces: SQL and the native API (also called the session API).
This task is for the native API.
IoTDB uses Apache Thrift [2] as its RPC framework, so all native APIs can be generated by Thrift. However, to improve performance, we sometimes use raw byte arrays in Thrift rather than a struct, which is not very user-friendly.
That is why we provide our session API, which simply wraps the interfaces of the generated Thrift code. We currently have Java [4], Python, and C++ versions [3]; the C# version is still missing.
This task asks you to provide a C# library for IoTDB.
Needed skills:
- Thrift
- C#
- know Java
[1] iotdb.apache.org
[2] http://thrift.apache.org/
[3] https://iotdb.apache.org/UserGuide/Master/Client/Programming%20-%20Other%20Languages.html
[4] https://iotdb.apache.org/UserGuide/Master/Client/Programming%20-%20Native%20API.html
Apache IoTDB: Metadata (Schema) Storage Engine
Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.
Different from traditional relational databases, IoTDB uses a tree-based structure in memory to manage the schema (a.k.a. metadata), and uses a Write-Ahead-Log-like file structure to persist it.
Currently, each time series takes about 300 bytes in memory; an IoTDB instance may manage more than 100 million time series, which can take more than 30 GB of memory.
Therefore, we'd like to re-design the schema management module:
1. File: persist the tree on disk, similar to a B-tree.
2. WAL: implement a WAL for the metadata, so we can update the on-disk tree in batches rather than one operation at a time.
3. Cache: we may not have enough memory to load the whole tree, so a cache is needed, along with the ability to query the tree on disk (see the cache sketch below).
What knowledge you need to know:
1. Java
2. Basic design idea about Database [2]
[1] https://iotdb.apache.org
[2] http://pages.cs.wisc.edu/~dbbook/openAccess/firstEdition/slides/pdfslides/mod2l1.pdf
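Purely as an illustrative sketch of point 3 (not IoTDB code), an in-memory cache of schema nodes can be as simple as an access-ordered LinkedHashMap with a size bound; on a miss the node would be loaded from the on-disk tree, and a real implementation would flush dirty nodes through the WAL before evicting them. SchemaNodeCache and its type parameters are hypothetical names.

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache for schema nodes keyed by their path.
public class SchemaNodeCache<K, V> extends LinkedHashMap<K, V> {

    private final int capacity;

    public SchemaNodeCache(int capacity) {
        super(16, 0.75f, true); // access-order = true gives LRU iteration order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used node once the bound is exceeded.
        // A real implementation would first persist the node if it is dirty.
        return size() > capacity;
    }
}

On a cache miss, the caller would read the node from the persisted tree and put it back into the cache.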
Apache IoTDB: Complex Arithmetic Operations in SELECT Clauses
Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.
We have recently been working to improve the ease of use of IoTDB. For queries, we hope that IoTDB can provide more powerful analysis capabilities.
IOTDB supports many types of queries: raw data queries, function queries (including UDF queries), and so on. However, currently there is no easy way to combine the results of multiple queries. Therefore, we hope that IoTDB can support complex arithmetic operations in the SELECT clause, which will greatly improve the analysis capabilities.
Function description:
Applied to: raw time series, literal numbers and function outputs.
Applicable data types: all types except TIMESTAMP and TEXT.
Applicable operators: at least five binary operators (+, -, *, /, %) and two unary operators (+, -).
Usage examples:
- raw queries
SELECT -a FROM root.sg.d;
SELECT a, b, c, b * b - 4 * a * c FROM root.sg.d WHERE b > 0;
SELECT a, b, -(bool_value * (a - b)) FROM root.sg.d;
SELECT -3.14 + a / 15 + 926 FROM root.sg.d;
SELECT +a % 3.14 FROM root.sg.d WHERE a < 0;
- function queries
SELECT a + abs(a), sin(a) * cos(a) FROM root.sg.d;
SELECT a, b, sqrt(a) * sqrt(b) / (a * b) FROM root.sg.d WHERE a < 0;
- nested queries
select a, b, a + b + udf(sin(a) * sin(b), cos(a) * cos(b)) FROM root.sg.d;
select a, a + a, sin(sin(sin(a + a))) FROM root.sg.d WHERE a < 0;
Additional requirements:
1. For performance reasons, it's better to perform as few disk read operations as possible.
Example:
SELECT a, sin(a + a) FROM root.sg.d WHERE a < 0;
The series root.sg.d.a should be read only once during the query.
2. For performance reasons, it's better to reuse intermediate calculation results as much as possible.
Example:
SELECT a + a, sin(a + a) FROM root.sg.d WHERE a < 0;
The intermediate calculation result a + a should only be evaluated once during the query.
3. Need to consider memory-constrained scenarios.
What knowledge you need to know:
1. Java
2. Basic database knowledge (such as SQL, etc.)
3. ANTLR
4. IoTDB query process
Links:
[1] iotdb.apache.org
Apache IoTDB: GUI workbench
Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.
As a database, it is good to have a workbench to operate IoTDB using a GUI.
For example, there is a 3rd-part web-based workbench for Apache Cassandra [2]. MySQL supports a more complex workbench application [3].
We also want IoTDB to have a workbench.
Task:
1. execute SQL and show results in Table or Chart.
2. view the schema of IoTDB (how many Storage groups, how many time series etc..)
3. View and modify IoTDB's configuration
4. View IoTDB's dynamic status (e.g., info that JMX can get)
(As we have already integrated IoTDB with Apache Zeppelin, task 1 is essentially done, so we hope this workbench can be more lightweight than using Zeppelin.)
Java is preferred (Python or other languages are also OK).
Needed Skills:
- Java
- Web application development
[1] iotdb.apache.org
[2] https://github.com/avalanche123/cassandra-web
[3] https://www.mysql.com/cn/products/workbench/
Apache IoTDB: integration with Chaos Mesh
Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.
Chaos Mesh [2] is a versatile chaos engineering solution that features all-around fault injection methods for complex systems on Kubernetes [3], covering faults in Pod, network, file system, and even the kernel.
We hope that Chaos Mesh can be used as a versatile chaos testing tool for the IoTDB cluster module, so as to verify its reliability in a production environment.
You should define a series of failure simulations for the cluster using Chaos Mesh, such as network partition, network packet loss, and node crash, and then define a series of operations together with their expected results.
This task asks you to set up an automated framework for chaos testing of the IoTDB cluster module, so that we can detect potential problems in the cluster module and iteratively fix them.
Needed skills:
- Java
- Go
- Kubernetes
- Chaos Mesh
- Know iotdb-benchmark [4]
[1] https://iotdb.apache.org
[2] https://chaos-mesh.org/
[3] https://kubernetes.io
[4] https://github.com/thulab/iotdb-benchmark
Apache IoTDB: use netty as the memory buffer pool to reduce GC problem and take full use of memory
Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.
Memory control is very important for a DBMS.
Currently, we are using a customized memory buffer pool, which contains pools of int[], long[], and boolean[], as well as float[] and double[].
However, there are two things left:
- It is complex to implement a buffer pool for String[] or byte[][], as the sizes of String and byte[] are variable.
- We are using HeapByteBuffer, while in many cases DirectByteBuffer is more efficient.
As Netty provides a highly efficient buffer pool, we'd like to migrate the current buffer pool to the Netty implementation (a minimal usage sketch follows below).
Things you should know:
- Know Java well
- Know Netty well
- read codes of IoTDB (mainly in StorageEngine, StorageGroupProcessor and related classes)
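For orientation only, this is roughly what using Netty's pooled allocator looks like; it is not the IoTDB migration itself, and where the buffer would be wired into StorageEngine/StorageGroupProcessor is left open. The Netty classes shown (PooledByteBufAllocator, ByteBuf) are real; the surrounding usage is a minimal sketch.

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public final class PooledBufferExample {

    public static void main(String[] args) {
        // Netty's default pooled allocator; direct buffers avoid extra heap copies.
        PooledByteBufAllocator allocator = PooledByteBufAllocator.DEFAULT;

        ByteBuf buffer = allocator.directBuffer(64 * 1024);
        try {
            // Variable-length payloads (the case that makes String[]/byte[][] pools hard)
            // are handled naturally: the buffer grows as needed up to its max capacity.
            byte[] value = "example.device.sensor1".getBytes(java.nio.charset.StandardCharsets.UTF_8);
            buffer.writeLong(System.currentTimeMillis()); // timestamp
            buffer.writeInt(value.length);
            buffer.writeBytes(value);
        } finally {
            buffer.release(); // return the memory to the pool instead of waiting for GC
        }
    }
}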
TrafficControl
GSOC: Varnish Cache support in Apache Traffic Control
Background
Apache Traffic Control is a Content Delivery Network (CDN) control plane for large scale content distribution.
Traffic Control currently requires Apache Traffic Server as the underlying cache. Help us expand the scope by integrating with the very popular Varnish Cache.
There are multiple aspects to this project:
- Configuration Generation: Write software to build Varnish configuration files (VCL). This code will be implemented in our Traffic Ops and cache client side utilities, both written in Go.
- Health Monitoring: Implement monitoring of the Varnish cache health and performance. This code will run both in the Traffic Monitor component and within Varnish. Traffic Monitor is written in Go and Varnish is written in C.
- Testing: Adding automated tests for new code
Skills:
- Proficiency in Go is required
- A basic knowledge of HTTP and caching is preferred, but not required for this project.
DolphinScheduler
Apache DolphinScheduler-Parameter coverage
Apache DolphinScheduler
Apache DolphinScheduler is a distributed and extensible workflow scheduler platform with powerful DAG visual interfaces, dedicated to solving complex job dependencies in the data pipeline and providing various types of jobs available out of the box.
Page:https://dolphinscheduler.apache.org
GitHub: https://github.com/apache/incubator-dolphinscheduler
Background:
Configuration parameter override
At present, our parameter configuration is mainly based on configuration files (see PropertiesUtils).
However, important parameters are often injected as Java virtual machine arguments, so we need to support this way of parameter injection as well. Because different injection methods have different priorities, we need to implement configuration override. There are two main sources at present: system properties and local files. System properties should have the highest priority, followed by local files (that is, our various configuration files, such as master.properties).
issue:
https://github.com/apache/incubator-dolphinscheduler/issues/5164
for example:
1: Configure master.max.cpuload.avg=-1 in master.properties
2: Pass the JVM argument -Dmaster.max.cpuload.avg=1
3: PropertiesUtils.get("master.max.cpuload.avg") = 1
Task: implement the configuration parameter override described above
Mentor: CalvinKirs kirs@apache.org
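A minimal sketch of the override priority described above (system properties first, then the local file). PropertyLookup is a hypothetical stand-in, not the actual PropertiesUtils class, and the resource name is only an example.

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public final class PropertyLookup {

    private static final Properties FILE_PROPS = new Properties();

    static {
        // The local file has the lowest priority; load it once at startup.
        try (InputStream in = PropertyLookup.class.getResourceAsStream("/master.properties")) {
            if (in != null) {
                FILE_PROPS.load(in);
            }
        } catch (IOException ignored) {
            // Fall back to defaults if the file cannot be read.
        }
    }

    private PropertyLookup() {
    }

    public static String get(String key, String defaultValue) {
        // JVM arguments such as -Dmaster.max.cpuload.avg=1 win over master.properties.
        String fromSystem = System.getProperty(key);
        if (fromSystem != null) {
            return fromSystem;
        }
        return FILE_PROPS.getProperty(key, defaultValue);
    }
}

With master.max.cpuload.avg=-1 in the file and -Dmaster.max.cpuload.avg=1 on the command line, get("master.max.cpuload.avg", "0") returns 1, matching the example above.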
Integrating Apache IoTDB and Apache Superset
Apache IoTDB [1] is an Open Source IoT database designed to meet the rigorous data, storage, and analytics requirements of large-scale Internet of Things (IoT) and Industrial Internet of Things (IIoT) applications.
Apache Superset [2] is fast, lightweight, intuitive, and loaded with options that make it easy for users of all skill sets to explore and visualize their data, from simple line charts to highly detailed geospatial charts.
We hope that Superset can be used as a data display and analysis tool of IoTDB, which will bring great convenience to analysts of the IoT and IIoT.
For a database engine to be supported in Superset, it requires having a Python compliant SQLAlchemy dialect [3] as well as a DBAPI driver [4] defined. The current Python client of IoTDB is packaged by Apache Thrift generated code and does not follow a certain interface specification. Therefore, the first thing you need to do is to implement a standard SQLAlchemy connector based on the current Python client (or some new interfaces defined and generated by Thrift).
Next, you need to explore how to integrate IoTDB and Superset and document the usage in a user-friendly way. The integration documentation for Apache Kylin and Superset is here [5] for your reference.
What knowledge you need to know:
- Basic database knowledge (SQL)
- Python
[1] https://iotdb.apache.org
[2] https://superset.apache.org/
[3] https://docs.sqlalchemy.org/en/13/dialects/
[4] https://www.python.org/dev/peps/pep-0249/
[5] http://kylin.apache.org/blog/2018/01/01/kylin-and-superset/
...
CouchDB
GSoC: Apache CouchDB and Debezium integration
Apache CouchDB software is a document-oriented database that can be queried and indexed in a MapReduce fashion using JavaScript. CouchDB also offers incremental replication with bi-directional conflict detection and resolution.
Debezium is an open source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases. Debezium is durable and fast, so your apps can respond quickly and never miss an event, even when things go wrong.
CouchDB has a change capture feed as a public HTTP API endpoint. Integrating with Debezium would provide an easy way to translate the _changes feed into a Kafka topic which plugs us into a much larger ecosystem of tools and alleviates the need for every consumer of data in CouchDB to build a bespoke “follower” of the _changes feed.
The project for GSoC 2021 here is to design, implement and test a CouchDB connector for Debezium.
Required skills:
- Java
Nice-to-have skills:
- Erlang
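To illustrate the data flow the connector has to implement (CouchDB's continuous _changes feed in, Kafka topic out), here is a deliberately naive sketch using plain HTTP and the Kafka producer client. A real Debezium connector would instead implement Debezium's connector SPI with proper offset tracking; the database name, topic name, and addresses below are placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Properties;
import java.util.stream.Stream;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public final class ChangesFeedToKafka {

    public static void main(String[] args) throws Exception {
        // Placeholder endpoints; point these at a real CouchDB database and Kafka cluster.
        String changesUrl = "http://localhost:5984/mydb/_changes?feed=continuous&include_docs=true&since=now";

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(changesUrl)).build();
            HttpResponse<Stream<String>> response =
                client.send(request, HttpResponse.BodyHandlers.ofLines());

            // In continuous mode, each non-empty line is one JSON change event.
            response.body()
                .filter(line -> !line.isBlank())
                .forEach(line -> producer.send(new ProducerRecord<>("couchdb.mydb.changes", line)));
        }
    }
}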
CloudStack
CloudStack GSoC 2021 - Clone a Virtual Machine (with all the data disks)
Hi there,
Here is the background of the proposed improvement in the CloudStack.
Currently, there is no straightforward way to clone / create a copy of a VM (with all its data disks) in CloudStack. An operator/admin has to follow a series of steps/API commands to achieve that, and it takes considerable time (waiting for and checking each command's response before proceeding to the next step). Some hypervisors (e.g. VMware) already support a clone-VM operation, and CloudStack can leverage that.
Support for this new functionality can be integrated by introducing a new (admin-only) API to clone a VM, something like cloneVirtualMachine, which provides a direct way to clone / create a copy of the VM (with all its data disks). CloudStack internally performs all the required operations to create the copy of the VM (leveraging the relevant hypervisor operations if necessary) and returns the new VM in the response on success; otherwise it throws the relevant error message.
This improvement will be a good addition to the VM operations supported in the CloudStack. It requires some virtualization/cloud domain knowledge & usage.
More details here: https://github.com/apache/cloudstack/issues/4818
Skills Required:
- Java and Python
- Vue.js (for UI integration)
...
CloudStack GSoC 2021 Ideas
Hello Students! We are the Apache CloudStack project. From our project website: "Apache CloudStack is open source software designed to deploy and manage large networks of virtual machines, as a highly available, highly scalable Infrastructure as a Service (IaaS) cloud computing platform. CloudStack is used by a number of service providers to offer public cloud services, and by many companies to provide an on-premises (private) cloud offering, or as part of a hybrid cloud solution."
2-min video on the Apache CloudStack project - https://www.youtube.com/watch?v=oJ4b8HFmFTc
Here's about an hour-long intro to what is CloudStack - https://www.youtube.com/watch?v=4qFFwyK9hos
The general skills a student would need are Java, Python, and JavaScript/Vue. Idea-specific requirements are mentioned on the idea issue. We're a diverse and welcoming community and we encourage interested students to join the dev ML: http://cloudstack.apache.org/mailing-lists.html (dev@cloudstack.apache.org)
All our Apache CloudStack GSoC2021 ideas are tracked on the project's Github issue: https://github.com/apache/cloudstack/issues?q=is%3Aissue+is%3Aopen+label%3Agsoc2021
Feature | Skills Required | Difficulty Level | Potential Mentor(s) | Details and Discussion |
---|---|---|---|---|
Support Multiple SSH Keys for VMs | Java, Javascript/Vue | Medium | David Jumani david.jumani@shapeblue.com | https://github.com/apache/cloudstack/issues/4813 |
Clone a Virtual Machine | Java, Javascript/Vue | Medium | Suresh Anaparti sureshanaparti@apache.org | https://github.com/apache/cloudstack/issues/4818 |
UI Shortcuts (UX improvements in the UI) | Javascript, Vue | Easy | Boris Stoyanov boris.stoyanov@shapeblue.com, David Jumani david.jumani@shapeblue.com | https://github.com/apache/cloudstack/issues/4798 |
CloudStack OAuth2 Plugin | Java, Javascript/Vue | Medium | Nicolas Vazquez nicovazquez90@gmail.com, Rohit Yadav rohit@apache.org | https://github.com/apache/cloudstack/issues/4834 |
Synchronization of network devices on newly added hosts for Persistent Networks | Java | Medium | Pearl Dsilva pearl.dsilva@shapeblue.com | https://github.com/apache/cloudstack/issues/4814 |
Add SPICE console for vms on KVM/XenServer | Java, Python, Javascript | Hard | Wei Zhou ustcweizhou@gmail.com | |
Configuration parameters and APIs mappings | Java, Python | Hard | Harikrishna Patnala harikrishna@apache.org | |
Add virt-v2v support in CloudStack for VM import to KVM | Java, Python, libvirt, libguestfs | Hard | Rohit Yadav rohit@apache.org | |
We have an onboarding course for students to learn and get started with CloudStack:
https://github.com/shapeblue/hackerbook
Project wiki and other resources:
https://cwiki.apache.org/confluence/display/CLOUDSTACK
https://github.com/apache/cloudstack
...
Prevent and fail fast any attempts to run incremental repair on CDC/MV tables
Running incremental repairs on CDC or MV tables breaks them.
Attempting to run incremental repair on such tables should fail fast and be prevented, with a clear error message.
Add ability to ttl snapshots
It should be possible to add a TTL to snapshots, after which it automatically cleans itself up.
This will be useful together with the auto_snapshot option, where you want to keep an emergency snapshot in case of accidental drop or truncation but automatically remove it after a specified period once it's no longer useful. So in addition to allowing a user to specify a snapshot TTL on nodetool snapshot, we should have an auto_snapshot_ttl option that allows a user to set a TTL for automatic snapshots on drop/truncate.
Add nodetool command to display or export the contents of a virtual table
Several virtual tables were recently added, but they're currently only accessible via cqlsh or programmatically. While this is valuable for many use cases, operators are accustomed to the convenience of querying system metrics with a simple nodetool command.
In addition to that, a relatively common request is to provide nodetool output in different formats (JSON, YAML and even XML) (CASSANDRA-5977, CASSANDRA-12035, CASSANDRA-12486, CASSANDRA-12698, CASSANDRA-12503). However this requires lots of manual labor as each nodetool subcommand needs to be adapted to support new output formats.
I propose adding a new nodetool command that will consistently print to the standard output the contents of a virtual table. By default the command will print the output in a human-readable tabular format similar to cqlsh, but a "--format" parameter can be specified to modify the output to some other format like JSON or YAML.
It should be possible to add a limit to the amount of rows displayed and filter to display only rows from a specific keyspace or table. The command should be flexible and provide simple hooks for registration and customization of new virtual tables.
I propose calling this command nodetool show <virtualtable> (naming bikeshedding welcome), for example:
nodetool show --list
caches
clients
internode_inbound
internode_outbound
settings
sstable_tasks
system_properties
thread_pools
nodetool show clients --format yaml
...
nodetool show internode_outbound --format json
...
nodetool show sstabletasks --keyspace my_ks --table my_table
...
Script to autogenerate cassandra.yaml
It would be useful to have a script that can ask the user a few questions and generate a recommended cassandra.yaml based on their answers. This will help solve issues like selecting num_tokens. It can also be integrated into OS-specific packaging tools such as debconf [1]. Rather than just documenting recommendations on the website, it is best to provide a simple script to auto-generate configuration based on common use cases.
Add ability to disable schema changes, repairs, bootstraps, etc (during upgrades)
There are a lot of operations that aren't supposed to be run in a mixed version cluster: schema changes, repairs, topology changes, etc. However, it's easily possible for these operations to be accidentally run by a script, another user unaware of the upgrade, or an operator that's not aware of these rules.
We should make it easy to follow the rules by making it possible to prevent/disable all of these operations through nodetool commands. At the start of an upgrade, an operator can disable all of these until the upgrade has been completed.
Allow table property defaults (e.g. compaction, compression) to be specified for a cluster/keyspace
During an IRC discussion in cassandra-dev it was proposed that we could have table property defaults stored on a Keyspace or globally within the cluster. For example, this would allow users to specify "all new tables on this cluster should default to LCS with an SSTable size of 320MiB", "all new tables in Keyspace XYZ should have Zstd compression with an 8 KiB block size", or "default_time_to_live should default to 3 days", etc. This way operators can choose the default that makes sense for their organization once (e.g. LCS if they are running on fast SSDs), rather than requiring developers creating the Keyspaces/Tables to make the decision on every creation (often without the context to know which choices are right).
A few implementation options were discussed including:
- A YAML option
- Schema provided at the Keyspace level that would be inherited by any tables automatically
- Schema provided at the Cluster level that would be inherited by any Keyspaces or Tables automatically
In IRC it appears that rough consensus was found in having global -> keyspace -> table defaults which would be stored in schema (no YAML configuration since this isn't node level really, it's a cluster level config).
Expose application_name and application_version in virtual table system_views.clients
The recent java-driver's com.datastax.oss.driver.api.core.session.SessionBuilder respects the ApplicationName and ApplicationVersion properties.
It would be helpful to expose this information via the virtual table system_views.clients and with nodetool clientstats.
Add nodetool command to display or export the contents of a virtual table
Several virtual tables were recently added, but they're currently only accessible via cqlsh or programmatically. While this is valuable for many use cases, operators are accustomed with the convenience of querying system metrics with a simple nodetool command.
In addition to that, a relatively common request is to provide nodetool output in different formats (JSON, YAML and even XML) (CASSANDRA-5977, CASSANDRA-12035, CASSANDRA-12486, CASSANDRA-12698, CASSANDRA-12503). However this requires lots of manual labor as each nodetool subcommand needs to be adapted to support new output formats.
I propose adding a new nodetool command that consistently prints the contents of a virtual table to standard output. By default the command will print the output in a human-readable tabular format similar to cqlsh, but a "--format" parameter can be specified to change the output to another format such as JSON or YAML.
It should be possible to limit the number of rows displayed and to filter the output to a specific keyspace or table. The command should be flexible and provide simple hooks for registering and customizing new virtual tables.
I propose calling this command nodetool show <virtualtable> (naming bikeshedding welcome), for example:
nodetool show --list
caches clients internode_inbound internode_outbound settings sstable_tasks system_properties thread_pools
nodetool show clients --format yaml
...
nodetool show internode_outbound --format json
...
nodetool show sstable_tasks --keyspace my_ks --table my_table
...
Script to autogenerate cassandra.yaml
It would be useful to have a script that asks the user a few questions and generates a recommended cassandra.yaml based on their answers. This would help with issues like selecting num_tokens. It could also be integrated into OS-specific packaging tools such as debconf[1]. Rather than just documenting recommendations on the website, it is better to provide a simple script that auto-generates configuration for common use-cases.
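For illustration only (the script name and the questions are hypothetical), the interaction might look roughly like:
$ tools/bin/generate_cassandra_yaml
Is this a new cluster or are you adding nodes to an existing one? [new]
Are the data drives SSDs or spinning disks? [ssd]
Roughly how many nodes do you expect per datacenter? [12]
Wrote cassandra.yaml with recommended num_tokens, compaction and compression defaults for this profile.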
Per-node overrides for table settings
There are a few cases where it's convenient to set some table parameters on only one or a few nodes. For instance, it's useful for experimenting with settings like caching options, compaction, compression, read repair chance, gcGrace, ... Another case is when you want to completely migrate to a new setting, but want to do it node per node (mainly useful when switching compaction strategy, see CASSANDRA-10898).
I'll note that we can already do some of this through JMX for some settings, as we have methods like ColumnFamilyStoreMBean.setCompactionParameters(), but:
- table settings are initially set in CQL, so having to go to JMX for this feels inconsistent. The fact that we have both ColumnFamilyStoreMBean.setCompactionParameters() and ColumnFamilyStoreMBean.setCompactionParametersJson() (presumably because the former is inconvenient to use) is further evidence to me that JMX isn't terribly appropriate here.
- this could be useful for almost all table settings, but we don't expose JMX methods for all of them, and it would be annoying to have to. The approach suggested below wouldn't have to be updated every time we add a new setting (if done right).
- changing options through JMX is not persistent across restarts. This may arguably be fine in some cases, but if you're trying to migrate your compaction strategy node per node, or want to experiment with a setting over a medium-ish time period, it's mostly a pain.
So what I suggest is adding node overrides to the normal table settings (which would be part of the schema like any other setting). In other words, if you want to set LCS on only one specific node, you'd do:
ALTER TABLE foo WITH node_overrides = {
  '192.168.0.1' : { 'compaction' : { 'class' : 'LeveledCompactionStrategy' } }
};
I'll note that I already suggested this idea on CASSANDRA-10898, but as it's more generic than what that ticket is about, I'm creating a separate ticket for it.
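Building on the proposed (still hypothetical) node_overrides syntax, a node-per-node compaction migration could then look like:
-- switch a couple of nodes at a time and validate the results
ALTER TABLE foo WITH node_overrides = {
  '192.168.0.1' : { 'compaction' : { 'class' : 'LeveledCompactionStrategy' } },
  '192.168.0.2' : { 'compaction' : { 'class' : 'LeveledCompactionStrategy' } }
};
-- once every node has been migrated, flip the table-level default and drop the overrides
ALTER TABLE foo WITH compaction = { 'class' : 'LeveledCompactionStrategy' } AND node_overrides = { };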
Add ability to ttl snapshots
It should be possible to add a TTL to snapshots, after which the snapshot is automatically cleaned up.
This will be useful together with the auto_snapshot option, where you want to keep an emergency snapshot in case of an accidental drop or truncation but automatically remove it after a specified period once it's no longer useful. So, in addition to allowing a user to specify a snapshot TTL on nodetool snapshot, we should have an auto_snapshot_ttl option that allows a user to set a TTL for the automatic snapshots taken on drop/truncate.
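For illustration (the exact flag and option names are still open), the user-facing side might look like:
nodetool snapshot --ttl 5d -t pre_upgrade my_keyspace
together with a cassandra.yaml option such as:
auto_snapshot_ttl: 3d
so that the automatic snapshots created on drop/truncate expire on their own.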
...