Abstract
GraphAr is an open-source and language-independent data file format designed for efficient graph data storage and retrieval.
Proposal
GraphAr provides a standard data file format for graph data storage and exchange with the following features:
- Efficient format design:
  - Chunk-based: GraphAr utilizes a chunk-based design, partitioning the graph data into chunks, with each stored in a separate file. This design allows for easy, parallelized access and distribution of the graph data across different machines.
  - Columnar Storage: GraphAr leverages high-performance columnar storage formats, including Parquet and ORC, for organizing graph data files, which allows for optimized access patterns.
  - Metadata Management: GraphAr maintains graph metadata using a set of YAML files, simplifying the process of understanding and utilizing the schema information.
  - Maintain CSR/CSC Semantics: By sorting edges by source vertex ID and destination vertex ID, GraphAr maintains the CSR/CSC semantics of graph data. This eliminates the need to reconstruct the CSR/CSC structure when loading the graph data.
- Out-of-core queries: GraphAr is designed for out-of-core scenarios, enabling the storage and querying of large-scale graphs outside of memory, such as in data lakes.
- Cross-language support: GraphAr provides libraries in C++, Java, Scala with Spark, and Python with PySpark for generating, accessing, and transforming files in GraphAr format.
GraphAr currently supports C++, Java, Scala, and Python, and loads graph data up to 6 times faster than CSV. It also facilitates the exchange of graph data between different graph systems, such as Nebula Graph, HugeGraph, and Neo4j.
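The chunk-based layout and the CSR semantics described above can be illustrated with a minimal Python sketch. It is not GraphAr's actual API or on-disk layout: the function names (`partition_edges`, `csr_offsets`, `neighbors`), the in-memory lists standing in for chunk files, and the chunk size of 2 are all assumptions made for illustration. The sketch groups edges by the chunk of their source vertex, keeps each chunk sorted by source and then destination ID, and derives a CSR-style offset array so that a vertex's neighbors are a contiguous slice.

```python
def partition_edges(edges, vertex_chunk_size):
    """Group edges by the chunk of their source vertex (hypothetical helper).

    Edges are sorted by (source, destination) first, so every chunk is
    itself sorted and can be read independently of the others.
    """
    chunks = {}
    for src, dst in sorted(edges):
        chunks.setdefault(src // vertex_chunk_size, []).append((src, dst))
    return chunks


def csr_offsets(sorted_edges, first_vertex, num_vertices):
    """Build CSR offsets for one chunk of edges sorted by source vertex.

    offsets[i] and offsets[i + 1] delimit the edges whose source is
    first_vertex + i.
    """
    offsets = [0] * (num_vertices + 1)
    for src, _ in sorted_edges:
        offsets[src - first_vertex + 1] += 1  # count edges per source vertex
    for i in range(num_vertices):
        offsets[i + 1] += offsets[i]  # prefix sum turns counts into offsets
    return offsets


def neighbors(sorted_edges, offsets, first_vertex, vertex):
    """Return the destinations of `vertex` as one contiguous slice."""
    lo = offsets[vertex - first_vertex]
    hi = offsets[vertex - first_vertex + 1]
    return [dst for _, dst in sorted_edges[lo:hi]]


# Illustrative data: 4 vertices, 2 vertices per chunk (assumed values).
edges = [(2, 0), (0, 1), (0, 2), (3, 1), (1, 2)]
chunks = partition_edges(edges, vertex_chunk_size=2)
offsets_chunk0 = csr_offsets(chunks[0], first_vertex=0, num_vertices=2)
```

Because each chunk is already sorted, a reader only needs the offset array to locate a vertex's edges, for example `neighbors(chunks[0], offsets_chunk0, 0, 0)`, without rebuilding the adjacency structure at load time.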
Background
The proliferation of diverse graph processing systems has led to a multitude of unique graph data storage layouts, which complicate the exchange of graph data between different systems.
During the development of GraphScope, we encountered cases where multiple systems need to collaborate, for example, exporting data from a graph database and then importing it into GraphScope for analysis. Technical colleagues from other graph-related systems, such as Apache HugeGraph and Fabarta, have reported the same pain point.
To address this gap, we decided to develop GraphAr as the standard file format for importing/exporting and persistently storing graph data that can be used by diverse existing systems.
Initially, the project was developed by the Alibaba GraphScope team and open-sourced on GitHub in December 2022. A number of systems at Alibaba, including GraphScope and vineyard, have adopted GraphAr as their standard graph data storage format.
Rationale
Numerous graph systems, such as Neo4j, Nebula Graph, and Apache HugeGraph, have been developed in recent years. Each of these systems has its own graph data storage format, complicating the exchange of graph data between different systems. The need for a standard data file format for large-scale graph data storage and processing that can be used by diverse existing systems is evident, as it would reduce overhead when various systems work together.
Our aim is to fill this gap and contribute to the open-source community by providing a standard data file format for graph data storage and exchange, as well as for out-of-core querying. This format, which we have named GraphAr, is engineered to be efficient, cross-language compatible, and to support out-of-core processing scenarios, such as those commonly found in data lakes. Furthermore, GraphAr's flexible design ensures that it can be easily extended to accommodate a broader array of graph data storage and exchange use cases in the future.
Initial Goals
- Build a more diverse community with contributors from different organizations.
- Facilitate the adoption and integration of GraphAr by ensuring its neutrality and durability.
- Collect feedback from the community to standardize the format and improve its efficiency.
Current Status
Meritocracy
This proposal intends to build a community around GraphAr following the ASF meritocracy model. Users and new contributors will be respected and welcomed. They will earn credit by participating in the community and providing quality contributions that move the project forward. Contributions include not only code but also non-code work (documentation, discussion, testing, events, community development, etc.). Those who make long-term, high-quality contributions will be encouraged to become committers.
Community
Currently, GraphAr is being developed by both a development team inside Alibaba Group and individual developers from other companies, who together form an initial community. By incorporating GraphAr into the Apache ecosystem, we anticipate further expansion of the community, benefiting from increased collaboration and adoption of the Apache Way.
Users
GraphAr currently serves a group of users, including Alibaba, Fabarta, and TuGraph. Here are some use cases of GraphAr:
- Alibaba GraphScope uses GraphAr as its graph data archive format. Details can be found in the GraphScope blog.
- The in-memory data manager vineyard uses GraphAr as its graph object archive format.
- Fabarta uses GraphAr as its graph data archive format and as a data source for loading.
- TuGraph is considering using GraphAr as a data format to exchange graph data with Neo4j, and they have a working pull request to add GraphAr support.
The Apache HugeGraph community also shows interest in GraphAr and is considering integrating it.
Developer
GraphAr has attracted 13 contributors since it was open-sourced, five of whom are core developers: Weibin Zeng, Xue Li, Zhe Wang, Semyon Sinchenko, and Tao He. Weibin Zeng and Xue Li are the founders of the GraphAr project, Zhe Wang is the author and maintainer of the GraphAr Java module, Semyon Sinchenko is the author and maintainer of the GraphAr PySpark module, and Tao He is a core developer of GraphAr who has contributed extensively to integrating GraphAr with vineyard.
The community's size and diversity are indeed areas of concern. However, we expect to attract more contributors in the future to address these issues by evolving the software and adhering to the Apache Way.
The need for a standard graph data format is common in the graph computing area, and it provides the potential to form a bigger community.
Core Developers
- Weibin Zeng (GitHub ID: acezen): Co-founder and core developer of the project, GraphScope and vineyard (a CNCF project) committer, from Alibaba.
- Xue Li (GitHub ID: lixueclaire): Co-founder and core developer of the project, GraphScope committer, from Alibaba.
- Zhe Wang (GitHub ID: Thespica): Author of the GraphAr Java module, an open-source enthusiast from Southwest Minzu University.
- Semyon Sinchenko (GitHub ID: SemyonSinchenko): Author of the GraphAr-PySpark module, an individual open-source enthusiast from Raiffeisen Bank International.
- Tao He (GitHub ID: sighingnow): Core developer of the project, vineyard project maintainer, an open-source enthusiast, from Alibaba.
Known Risks
Project Name
GraphAr is short for “Graph Archive”. Based on our search results, the term GraphAr is not used as a trademark under any class, so it is legal to use it as our project name.
Orphaned Products
GraphAr is used as a graph data archive format in Alibaba's GraphScope and vineyard systems. The developers of these systems will continue to improve GraphAr to meet current and future requirements. Other organizations, such as Fabarta, also use GraphAr in their core products. Furthermore, TuGraph and Apache HugeGraph have shown interest in GraphAr and are considering using it as their graph data archive format. Given the extensive need in the graph community for a standardized file format, we believe the developer and user communities will continue to grow, mitigating the risk of GraphAr becoming an orphaned product.
Inexperience with Open Source
The creators of GraphAr have been working on open-source projects, such as GraphScope and vineyard, for many years.
Homogenous Developers
Currently, GraphAr has five core developers from three different organizations. Meanwhile, we are working to further diversify our developer base. We have already received interest from other organizations, such as Fabarta, which are using GraphAr in their core products. We believe that the developer and user communities will continue to grow.
Reliance on Salaried Developers
Most of the developers are paid by their employers to contribute to this project. GraphAr is used at Alibaba, with no internal forked versions. After the donation, Alibaba will maintain its long-term commitment: the developers of GraphScope will continue to improve GraphAr to meet current and future requirements. We also believe that once GraphAr enters the Apache Incubator, it can attract more maintainers and developers from diverse backgrounds and develop further under the Apache Way.
Relationships with Other Apache Products
GraphAr relies on Apache Parquet, Apache ORC, and Apache Arrow for data storage and exchange, and on Apache Spark for connectors to graph databases such as Nebula Graph, Apache HugeGraph, and Neo4j.
GraphAr can also be integrated with several Apache projects for graph data storage and exchange, including:
- Apache HugeGraph (incubating): A convenient, efficient, and adaptable graph system (including graph database, graph computing, and toolchains).
- Apache AGE: A PostgreSQL extension that provides graph database functionality.
- Apache TinkerPop: A graph computing framework for both graph databases (OLTP) and graph analytic systems (OLAP).
We believe that such integrations would benefit the open-source community, and we plan to discuss them with the respective communities to make them happen.
An Excessive Fascination with the Apache Brand
We believe that the Apache Way and its neutrality, not just the brand, will help GraphAr grow. The need for a standard graph data format is common in the graph community and is relevant to many other graph system projects, not just those in Alibaba. A neutral organization like Apache will ultimately better serve the community than a single company.
Documentation
GraphAr documentation is provided here.
Initial Source
GraphAr has been under development since June 2022 by the GraphScope team, a team of engineers at Alibaba. It was open-sourced on GitHub in December 2022 and is available at https://github.com/alibaba/GraphAr. The project is licensed under Apache License 2.0.
Source and Intellectual Property Submission Plan
As soon as GraphAr is approved to join Apache Incubator, our initial committers will submit iCLA(s), SGA, and CCLA(s). The codebase is already licensed under Apache License 2.0.
We will also deprecate the initial source repository and redirect it to the new incubator project repository after approval.
External Dependencies
GraphAr has several external dependencies with various licenses, including Apache 2.0, BSD, BSL-1.0, and MIT.
- Apache 2.0
  - https://github.com/apache/arrow
  - OpenSSL
  - https://github.com/alibaba/fastFFI
  - https://github.com/google/benchmark
  - com.aliyun.odps:hadoop-fs-oss:jar:3.3.8-public
  - com.aliyun.odps:odps-spark-datasource_2.11:jar:3.3.8-public
  - com.aliyun.odps:cupid-sdk:jar:3.3.8-public
  - com.vesoft:nebula-spark-connector_3.0:jar:3.6.0
  - org.neo4j:neo4j-connector-apache-spark_2.12:jar:5.0.0_for_spark_3
  - org.scala-lang.modules:scala-collection-compat_2.12:jar:2.1.1
  - org.yaml:snakeyaml:jar:1.26
  - org.apache.spark:spark-core_2.12:jar:3.2.2
  - org.apache.spark:spark-mllib_2.12:jar:3.2.2
  - org.apache.spark:spark-sql_2.12:jar:3.2.2
  - org.apache.spark:spark-streaming_2.12:jar:3.2.2
  - org.scala-lang:scala-library:jar:2.12.10
  - org.scalatest:scalatest_2.12:jar:3.1.1
  - com.alibaba.fastffi:annotation-processor:jar:0.1.2
  - com.alibaba.fastffi:ffi:jar:0.1.2
  - com.alibaba.fastffi:llvm:jar:0.1.2
  - com.alibaba.fastffi:llvm4jni-runtime:jar:0.1.2
  - org.apache.arrow:arrow-c-data:jar:13.0.0
  - org.apache.arrow:arrow-dataset:jar:13.0.0
  - org.apache.arrow:arrow-memory-netty:jar:13.0.0
  - org.apache.arrow:arrow-vector:jar:13.0.0
  - https://github.com/google/styleguide/blob/gh-pages/cpplint/cpplint.py
  - pyspark
- BSD
  - junit:junit:jar:4.13.2
  - Python
  - py4j
- BSL-1.0
  - Catch2
  - Boost
- MIT
  - https://github.com/bitwizeshift/result
  - https://github.com/jimmiebergmann/mini-yaml
  - org.slf4j:slf4j-simple:jar:1.7.25
  - pytest
  - pytest-cov
  - pyyaml
  - poetry
  - Ruff
  - furo
Cryptography
N/A
Required Resources
Mailing lists
Subversion Directory
N/A
Git Repositories
Upon entering incubation, we want to transfer the existing repositories to the Apache Software Foundation:
- https://github.com/alibaba/GraphAr to https://github.com/apache/incubator-graphar
- https://github.com/GraphScope/gar-test to https://github.com/apache/incubator-graphar-testing
Issue Tracking
The community would like to continue using GitHub Issues.
Other Resources
The community has chosen GitHub Actions as its continuous integration tool.
Initial Committers
- Weibin Zeng (Alibaba, weibin.zen@gmail.com)
- Xue Li (Alibaba, lx_claire@qq.com)
- Zhe Wang (Southwest Minzu University, thespica@qq.com)
- Semyon Sinchenko (Raiffeisen Bank International, ssinchenko@protonmail.com)
- Tao He (Alibaba, sighingnow@gmail.com)
Sponsors
Champion
- Yu Li (liyu@apache.org)
Nominated Mentors
- Calvin Kirs (kirs@apache.org)
- tison (tison@apache.org)
- Xiaoqiao He (hexiaoqiao@apache.org)
- Yu Li (liyu@apache.org)
Sponsoring Entity
We are requesting the Incubator to sponsor this project.