GSoC 2021 Ideas list

Apache Hudi

[UMBRELLA] Checkstyle, formatting, warnings, spotless

Umbrella ticket to track all tickets related to checkstyle, spotless, warnings etc.

Difficulty: Major
Potential mentors:
sivabalan narayanan, mail: shivnarayan (at) apache.org
Project Devs, mail: dev (at) hudi.apache.org

[UMBRELLA] Improve CLI features and usabilities

(More details to be added)

Difficulty: Major
Potential mentors:
Raymond Xu, mail: xushiyan (at) apache.org
Project Devs, mail: dev (at) hudi.apache.org

[UMBRELLA] Support Apache Calcite for writing/querying Hudi datasets

(More details to be added)

Difficulty: Major
Potential mentors:
Raymond Xu, mail: xushiyan (at) apache.org
Project Devs, mail: dev (at) hudi.apache.org

[UMBRELLA] Improve source ingestion support in DeltaStreamer

(More details to be added)

Difficulty: Major
Potential mentors:
Raymond Xu, mail: rxu (at) apache.org
Project Devs, mail: dev (at) hudi.apache.org

[UMBRELLA] Survey indexing technique for better query performance

(More details to be added)

Difficulty: Major
Potential mentors:
Raymond Xu, mail: xushiyan (at) apache.org
Project Devs, mail: dev (at) hudi.apache.org

[UMBRELLA] Support schema inference for unstructured data

(More details to be added)

Difficulty: Major
Potential mentors:
Raymond Xu, mail: xushiyan (at) apache.org
Project Devs, mail: dev (at) hudi.apache.org

[UMBRELLA] Support Apache Beam for incremental tailing

(More details to be added)

Difficulty: Major
Potential mentors:
Vinoth Chandar, mail: vinoth (at) apache.org
Project Devs, mail: dev (at) hudi.apache.org

Beam

A Complex Event Processing (CEP) library/extension for Apache Beam

Apache Beam [1] is a unified and portable programming model for data processing jobs. The Beam model [2, 3, 4] has rich mechanisms to process endless streams of events.

Complex Event Processing [5] lets you match patterns of events in streams to detect important patterns in data and react to them.

Typical uses of CEP include fraud detection, which works by spotting unusual patterns of activity such as network intrusion or suspicious banking transactions. Trend detection is another interesting use case, especially for sensor and IoT data.

The goal of this issue is to implement an efficient pattern matching library, inspired by [6] and by existing libraries such as Apache Flink CEP [7], using the Apache Beam Java SDK and following the Beam style guides [8]. Because of the time constraints of GSoC, the first target is simple patterns of the ‘a followed by b followed by c’ kind; if time remains, more advanced ones such as optional, atLeastOne, and oneOrMore can follow.
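To make the ‘a followed by b followed by c’ kind of pattern concrete, here is a minimal sketch in plain Python. It does not use the Beam SDK or the Flink CEP API. The names `Event` and `match_sequence` are hypothetical, and the matcher is deliberately simplified: it tracks a single partial match at a time and skips events that do not advance it, whereas an NFA-based design such as the one in [6] keeps several partial runs alive.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence


@dataclass(frozen=True)
class Event:
    """Hypothetical event type: a kind label plus an ordering timestamp."""
    kind: str
    timestamp: int


Predicate = Callable[[Event], bool]


def match_sequence(events: Iterable[Event],
                   steps: Sequence[Predicate]) -> List[List[Event]]:
    """Find non-overlapping matches of `steps`, in order.

    Events that do not satisfy the next expected step are skipped,
    so 'followed by' here does not require strict adjacency.
    """
    if not steps:
        return []
    matches: List[List[Event]] = []
    partial: List[Event] = []
    for event in events:
        # The next step to satisfy is indexed by how many events matched so far.
        if steps[len(partial)](event):
            partial.append(event)
            if len(partial) == len(steps):
                matches.append(partial)
                partial = []  # start looking for a fresh match
    return matches


def is_kind(kind: str) -> Predicate:
    """Build a predicate that accepts events of the given kind."""
    return lambda event: event.kind == kind
```

Calling `match_sequence(stream, [is_kind("a"), is_kind("b"), is_kind("c")])` returns one list of three events per completed match. Windowing, timeouts, and quantifiers such as oneOrMore are left out of this sketch.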

[1] https://beam.apache.org/
[2] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[3] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
[4] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43864.pdf
[5] https://en.wikipedia.org/wiki/Complex_event_processing
[6] https://people.cs.umass.edu/~yanlei/publications/sase-sigmod08.pdf
[7] https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/cep.html
[8] https://beam.apache.org/contribute/ptransform-style-guide/


Difficulty: P3
Potential mentors:
Ismaël Mejía, mail: iemejia (at) apache.org
Project Devs, mail: dev (at) beam.apache.org

SkyWalking

Apache SkyWalking: Python agent supports profiling

Apache SkyWalking [1] is an application performance monitor (APM) tool for distributed systems, especially designed for microservices, cloud native and container-based (Docker, K8s, Mesos) architectures.

SkyWalking relies on agents to instrument monitored services automatically. Agents currently exist for many languages; the Python agent [2] is one of them and supports automatic instrumentation.

The goal of this project is to extend the agent by supporting profiling [3] of a function's invocation stack, helping users analyze which method takes the most time in a cross-service call.
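As an illustration of what profiling an invocation stack can look like in Python, the sketch below samples a thread's stack at a fixed interval and counts how often each function appears. It uses only the standard library and is not the SkyWalking Python agent's API. The function names `snapshot_stack`, `profile_thread`, and `slow_call` are hypothetical, and periodic stack sampling is an assumed approach here, not a description of the eventual design.

```python
import sys
import threading
import time
import traceback
from collections import Counter
from typing import List


def snapshot_stack(thread_id: int) -> List[str]:
    """Return the function names on a thread's stack, outermost first.

    Returns an empty list if the thread is not currently running.
    """
    frame = sys._current_frames().get(thread_id)
    if frame is None:
        return []
    return [summary.name for summary in traceback.extract_stack(frame)]


def profile_thread(thread_id: int, samples: int, interval: float) -> Counter:
    """Take `samples` stack snapshots, `interval` seconds apart.

    Each function is counted at most once per snapshot, so a count
    approximates the share of samples in which it was on the stack.
    """
    counts: Counter = Counter()
    for _ in range(samples):
        counts.update(set(snapshot_stack(thread_id)))
        time.sleep(interval)
    return counts


def slow_call(stop: threading.Event) -> None:
    """Stand-in for a slow operation inside a traced request."""
    stop.wait()
```

Functions with the highest counts were on the stack in the most samples, which is a rough proxy for where time is spent. A real agent would also need to tie the samples to a trace segment and report them to the backend, which this sketch leaves out.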

To complete this task, you must be comfortable with Python and have some knowledge of tracing systems; otherwise you'll have a hard time coming up to speed.

[1] http://skywalking.apache.org
[2] http://github.com/apache/skywalking-python
[3] https://thenewstack.io/apache-skywalking-use-profiling-to-fix-the-blind-spot-of-distributed-tracing/

Difficulty: Major
Potential mentors:
Zhenxu Ke, mail: kezhenxu94 (at) apache.org
Project Devs, mail: dev (at) skywalking.apache.org

ShardingSphere

Apache ShardingSphere: Proofread the SQL definitions for ShardingSphere Parser

Apache ShardingSphere

Apache ShardingSphere is a distributed database middleware ecosystem that currently includes two independent products, ShardingSphere JDBC and ShardingSphere Proxy. Both provide data sharding, distributed transactions, and database orchestration.
Page: https://shardingsphere.apache.org
Github: https://github.com/apache/shardingsphere

Background

The ShardingSphere parser engine parses a SQL statement into an AST (Abstract Syntax Tree) and visits this tree to produce a SQLStatement (a Java object). At present, the parser engine handles SQL for `MySQL`, `PostgreSQL`, `SQLServer` and `Oracle`, which means it has to understand the SQL dialect of each database.
More details: https://shardingsphere.apache.org/document/current/en/features/sharding/principle/parse/
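The parse-then-visit flow described above can be sketched with a toy example. This is plain Python with a hand-written parser for a single statement shape, not ShardingSphere's ANTLR-generated parser or its Java visitor classes. The names `Node`, `SelectStatement`, `parse_select`, and `visit` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """A parse tree node: a grammar rule name and its children."""
    rule: str
    children: list  # child Nodes or raw token strings


@dataclass
class SelectStatement:
    """The statement object produced by visiting the tree."""
    columns: List[str]
    table: str


def parse_select(sql: str) -> Node:
    """Parse 'SELECT col[, col ...] FROM table' into a small tree."""
    tokens = sql.replace(",", " , ").split()
    if len(tokens) < 4 or tokens[0].upper() != "SELECT":
        raise ValueError("expected a SELECT statement")
    from_index = next(
        (i for i, token in enumerate(tokens) if token.upper() == "FROM"), -1)
    # FROM must follow at least one column and precede exactly one table name.
    if from_index < 2 or from_index != len(tokens) - 2:
        raise ValueError("expected 'FROM <table>' at the end")
    columns = [token for token in tokens[1:from_index] if token != ","]
    return Node("select", [
        Node("projections", columns),
        Node("table", [tokens[from_index + 1]]),
    ])


def visit(node: Node) -> SelectStatement:
    """Walk the tree and build the statement object."""
    projections, table = node.children
    return SelectStatement(columns=list(projections.children),
                           table=table.children[0])
```

With this sketch, `visit(parse_select("SELECT id, name FROM users"))` yields a `SelectStatement` whose columns are `id` and `name` and whose table is `users`. In the real engine the tree comes from grammar rules in g4 files, which is why the accuracy of those definitions matters.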

Task

This issue is to proofread the DML (SELECT/UPDATE/DELETE/INSERT) SQL definitions for Oracle. The existing Oracle SQL syntax definitions are basic and are not in line with the Oracle documentation, so you need to find the vague SQL grammar definitions and correct them by referring to the Oracle documentation.

Note that when you review these DML (SELECT/UPDATE/DELETE/INSERT) definitions, you will find that they involve some basic elements of Oracle SQL. These elements are included in this task as well.

Relevant Skills

1. Proficiency in Java
2. A basic understanding of ANTLR g4 files
3. Familiarity with Oracle SQL

Target files

1. DML SQLs g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/DMLStatement.g4
2. Basic elements g4 file: https://github.com/apache/shardingsphere/blob/master/shardingsphere-sql-parser/shardingsphere-sql-parser-dialect/shardingsphere-sql-parser-oracle/src/main/antlr4/imports/oracle/BaseRule.g4

References

1. Oracle SQL quick reference: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlqr/SQL-Statements.html#GUID-1FA35EAD-AED2-4619-BFEE-348FF05D1F4A
2. Detailed Oracle SQL info: https://docs.oracle.com/pls/topic/lookup?ctx=en/database/oracle/oracle-database/19/sqlqr&id=SQLRF008

Mentor

Juan Pan, PMC of Apache ShardingSphere, panjuan@apache.org

Difficulty: Major
Potential mentors:
Juan Pan, mail: panjuan (at) apache.org
Project Devs, mail: dev (at) shardingsphere.apache.org