Apache Hadoop Ozone - sub-project to Apache TLP proposal

Abstract

Apache Hadoop Ozone is a filesystem for on-prem and Cloud-native environments optimized for the Hadoop ecosystem.

It provides Object Storage semantics (like Amazon S3) and scales up to billions of objects.

It provides S3, Hadoop Compatible File System, and Container Storage Interface (CSI) interfaces.

The Hadoop community believes that further development on Ozone can be better done as a separate TLP as discussed on the Hadoop email lists.

History

Apache Hadoop Ozone development started on a feature branch of the Hadoop repository (HDFS-7240). In October 2017 a discussion was started about merging it into the Hadoop main branch. After a long discussion, it was merged into Hadoop trunk in March 2018. During the merge discussion, it was suggested multiple times to create a separate project for Ozone. However, at that time:

  1. Ozone was tightly integrated with Hadoop/HDFS.
  2. There was an action plan to use the block layer of Ozone (HDDS, or HDSL at that time) as the block layer of HDFS.
  3. The community of Ozone was a subset of the HDFS community.
  4. The first beta release of Ozone has just been published, and it was agreed that the next release will be announced as the GA release. Now, before the first GA, seems to be a good time to decide the future of the project.

(Note: if you need more information about the history of the project, you can check the detailed version from the source repository)

Move to a separate Apache TLP

During the last two years, Ozone has become more and more independent with respect to both community and code. The separation has been suggested again and again (for example by Owen (2) and Vinod (3)).

From the COMMUNITY point of view:

  1. Fortunately, more and more new contributors are helping Ozone. Originally the Ozone community was a subset of the HDFS project, but now a growing part of the community is related to Ozone only.
  2. It seems to be easier to build the community as a separate project.
  3. A new, younger project might have different practices (communication, committer criteria, development style) compared to a more mature project.
  4. It's easier to communicate (and improve) these standards in a separate project with clean boundaries.
  5. A separate project/brand can help to increase the adoption rate and attract more individual contributors (AFAIK this has been seen in Apache Submarine after a similar move).
  6. The contribution process can be communicated more easily, which makes first-time contributions easier.
  7. The Apache Board is monitoring community activities. As the volume of Ozone contributions is slightly more than 50% of all Hadoop contributions, it seems reasonable to follow community development and activity at the Ozone level (today it's hard to differentiate between the Ozone part and the core Hadoop part).

From the CODE point of view, Ozone has become more and more independent:

  1. Ozone has a different release cycle.
  2. Code is already separated from the Hadoop codebase (apache/hadoop-ozone.git).
  3. It has a separate CI (GitHub Actions).
  4. Ozone uses a different (stricter) coding style (zero tolerance for unit test / checkstyle errors).
  5. The code itself has become more and more independent of Hadoop at the Maven level. Originally it was compiled together with the latest in-tree Hadoop snapshot. Now it depends on released Hadoop artifacts (RPC, Security, Configuration).
  6. It can use multiple versions of Hadoop clients.
  7. The volume of resolved issues is already very high on the Ozone side (Ozone has had slightly more resolved issues than HDFS/YARN/MAPREDUCE/COMMON combined in the last few months).

Current Status of the project

Ozone is already part of a top-level Apache project and has already produced multiple separate releases that were approved by the Hadoop PMC, operating according to the Apache guidelines without any reported issues. As part of such a project, it has already passed the Apache project maturity model. Therefore, in this section, we skip some obvious statements that are a usual part of project adoptions (the Ozone source code is already part of Apache, and it's already governed by an Apache PMC) and focus on what the project has done so far to build a stronger community.

  1. Ozone is a new (sub)project, and it's a top priority to make it more inclusive and contributor friendly.
  2. Ozone Community Calls are weekly calls between Ozone developers and users. This is the same call where the active developers sync with each other, and it is open to anybody.
  3. The meeting minutes are either posted to the mailing list or added to the wiki and referenced from the mailing list, so that there is an open and async conversation about all the related topics.
  4. Recently a new survey was started to make this call more inclusive. (The current time of this call is not China friendly, and we tried to identify the best way to include all the contributors from all the timezones.)
  5. Based on the feedback, we started to record the meetings, and an additional APAC-friendly sync was initiated.
  6. The pull request queue is monitored frequently, and it's the number one priority to keep the number of open pull requests low. All pull requests are looked into within a reasonable time.
  7. The only case in which a pull request is closed/abandoned is when the author is not responsive. Pull requests are not closed due to inactivity if a committer action is required.
  8. The documentation and the developer process try to make development easy and developer-friendly. A "newbie" label is added to issues in Jira that can be taken up by first-time contributors.
  9. Community contributions are monitored, and new committers are proposed based purely on merit.
  10. Ozone is a distributed storage project and, as such, the initial contribution can be hard. To make it easier to understand the main architecture decisions, a new video series has been started for developers and/or users.
  11. To make it easier to understand the earlier decisions, the design docs are added to the documentation page (starting with the next release: https://github.com/apache/hadoop-ozone/tree/master/hadoop-hdds/docs/content/design). Ozone doesn't yet require a very formal proposal process (like those of Flink or Kafka), but it's a continuous effort to make all the design discussions open and transparent.

Building a community is a continuous effort. We are at the beginning of a journey, and moving Ozone to a separate TLP is a very important step. Some of the current challenges:

  1. The ratio of paid developers: Today most of the full-time developers are paid by two companies: a vendor (Cloudera) and a user (Tencent). Based on an analysis of the GitHub contributions, we do see an increasing number of new contributors from other companies, and increasing interest from other companies.
  2. One of the main goals for the coming months is to make the community more diverse.

PMC/Committers

Ozone as a new project requires an initial PMC and committer list. But first, we need to define how the lists are created/selected:

  1. PMC: As Ozone is a Hadoop subproject today, all the existing Hadoop PMC members with noticeable Ozone contributions are added to the initial list. (Definition of noticeable contribution: all the related GitHub / Jira content was downloaded, and we selected all the Hadoop PMC members with at least 30 comments AND/OR commits since the beginning of 2019.)
  2. A discussion was started with these people about:
     a. What are the important factors of being a committer / PMC member?
     b. Who should be added to the initial list?
  3. Committer: similar to Submarine, Hadoop committers can opt in to committer membership of the Ozone project (unless vetoed by the PMC).

Some points that were named as important factors of being a PMC member:

  1. Involvement in releases (being RM, validating and voting on releases, roadmap for future releases)
  2. Being involved constructively in design discussions, keeping the big picture, and project direction in mind.
  3. Investing in build/CI quality. Ensuring that contributors and committers have a solid infrastructure to develop the project.
  4. Responsiveness on security, trademark, copyright issues.
  5. Positive involvement in the community (mailing lists, raising committer candidates).
  6. Keeping an eye on what needs to go better in the project (documentation, test quality, wiki pages). A meta-view beyond regular contributions and releases.
  7. It was also found especially important to include the user community in the project governance. End-users and adopters who actively help the project with feedback during the design discussions should be invited to the PMC, even without code contributions. (During the discussion these were called "user seats" in the PMC.)

The initial selection rules and PMC list were shared on the ozone-dev mailing list (with an explanation added for the people nominated in 2b), where an additional method was suggested (add everybody to the PMC who is a Hadoop PMC member and has contributed at least 10 patches this year) and accepted.

Proposed Chair:

  1. Sammi Chen (sammichen) (Hadoop PMC)

Proposed PMC (Hadoop PMC)

  1. Arpit Agarwal (arp) (member, Hadoop PMC)
  2. Shashikant Banerjee (shashikant) (Hadoop PMC)
  3. Li Cheng (licheng) (Hadoop committer)
  4. Dinesh Chitlangia (dineshc) (Hadoop committer)
  5. Clay Baenziger
  6. Attila Doroszlai (adoroszlai) (Hadoop committer)
  7. Junping Du (junping_du) (member, Hadoop PMC)
  8. Márton Elek (elek) (Hadoop PMC)
  9. Anu Engineer (aengineer) (Hadoop PMC)
  10. Uma Maheswara Rao G (umamahesh) (member, Hadoop PMC)
  11. Lokesh Jain (ljain) (Hadoop PMC)
  12. Hanisha Koneru (hanishakoneru) (Hadoop PMC)
  13. Yiqun Lin (yqlin) (Hadoop PMC)
  14. Siyao Meng (siyao) (Hadoop committer)
  15. Jitendra Nath Pandey (jitendra) (member, Hadoop PMC)
  16. Rakesh Radhakrishnan (rakeshr) (Hadoop PMC)
  17. Matt Sharp
  18. Mukul Kumar Singh (msingh) (Hadoop PMC)
  19. Tsz-Wo Nicholas Sze (szetszwo) (member, Hadoop PMC)
  20. Xiaoyu Yao (xyao) (Hadoop PMC)
  21. Nandakumar Vadivelu (nanda) (Hadoop PMC)
  22. Bharat Viswanadham (bharat) (Hadoop PMC)
  23. Siddharth Wagle (swagle) (Hadoop committer)
  24. Stephen O'Donnell (sodonnell) (Hadoop committer)
  25. Vivek Ratnavel Subramanian (vivekratnavel) (Hadoop committer)
  26. Aravindan Vijayan (avijayan) (Hadoop committer)

Proposed committer list

  1. Wei-Chiu Chuang (weichiu) (Hadoop PMC)
  2. István Fajth
  3. Nilotpal Nandi (nilotpalnandi) (Hadoop committer)
  4. Yisheng Lien (yisheng) (Hadoop committer)
  5. Baoloong Mao (github.com/maobaolong)
  6. Neo Yang (github.com/cku328)
  7. WeiWei Yang (wwei) (Hadoop committer)
  8. Jie Wang (runzhiwang)
  9. Xiang Zhang (github.com/iamabug)
  10. Micah Zhao (github.com/captainzmc)
  11. Masatake Iwasaki (iwasakims) (Hadoop committer)
  12. Prabhu Joseph (prabhujoseph) (Hadoop committer)
  13. Ayush Saxena (ayushsaxena) (Hadoop PMC)
  14. He Xiaoqiao (hexiaoqiao) (Hadoop committer)
  15. Surendra Singh Lilhore (surendralilhore) (Hadoop PMC)
  16. Vinayakumar B (vinayakumarb) (Hadoop PMC)
  17. Bibin A Chundatt (bibinchundatt) (Hadoop PMC)
  18. Hemanth Boyina (hemanthboyina) (Hadoop committer)
  19. Hui Fei
  20. Lisheng Sun

Affiliations

Individuals on the initial PMC are employed by Cloudera, Tencent, Bloomberg, eBay, Target, and SirionLabs. Cloudera and Tencent are known for employing a significant number of full-time Ozone developers.

The committer list also contains individuals employed by Microsoft, Huawei and others (including individual contributors).

Required Resources

  1. Mailing lists: the ozone-dev mailing list already exists; it can be moved to a separate TLP domain.
  2. Source code: the source code is already separated from the main Hadoop repository (apache/hadoop-ozone and apache/hadoop-docker-ozone). It can be moved easily to a separate project.
  3. Issue tracker: Ozone already uses a separate Jira project (HDDS).
  4. GitHub repositories:
     a. Rename apache/hadoop-ozone to apache/ozone
     b. Rename/move apache/hadoop-docker-ozone to apache/ozone-docker-release


