This page doesn't contain any new information; it documents the plan for the hadoop -> hadoop-ozone repository move.


As discussed on the mailing list:

When Ozone was adopted as a new Hadoop subproject it was proposed[1] to 
be part of the source tree but with a separate release cadence, mainly 
because it had hadoop-trunk/SNAPSHOT as a compile-time dependency.

During the last Ozone releases this dependency has been removed to provide 
more stable releases. Instead of using the latest trunk/SNAPSHOT build 
from Hadoop, Ozone uses the latest stable Hadoop (3.2.0 as of now).

As there is no longer a strict dependency between Hadoop trunk SNAPSHOT and 
Ozone trunk, I propose to separate the two code bases from each other by 
creating a new Hadoop git repository (apache/hadoop-ozone.git):

By moving Ozone to a separate git repository:

  * It would be easier to contribute to and understand the build (as of now 
we always need `-f pom.ozone.xml` as a Maven parameter; see the build example after this list)
  * It would be possible to adjust the build process without breaking the 
Hadoop/Ozone builds.
  * It would be possible to use different README/.asf.yaml/GitHub 
templates for Hadoop Ozone and core Hadoop. (For example, the current 
GitHub template [2] has a link to the contribution guideline [3]; Ozone 
has an extended version [4] of this guideline with additional 
information.)
  * Testing would be safer, as it would no longer be possible to change core 
Hadoop and Hadoop Ozone in the same patch.
  * It would be easier to cut branches for Hadoop releases (based on the 
original consensus, Ozone should be removed from all the release 
branches after creating release branches from trunk)
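
As an illustration, the build invocation today and after the split could look like this (the exact Maven goals are just an example, not the official build command):

# today, inside apache/hadoop: the extra -f parameter is required
mvn clean install -DskipTests -f pom.ozone.xml

# after the split, inside apache/hadoop-ozone: a plain Maven build
mvn clean install -DskipTests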

Technical solution

There are two main methods to split the repository:

(1) Keep the git history and push the last commit to a separate repository

This approach doesn't require any special attention: we can simply push the existing source to a separate repository (only the selected branches), as sketched below.
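
A rough sketch of this method (the remote name `ozone` and the published branch name are just examples, assuming the empty apache/hadoop-ozone repository already exists):

# clone the existing Hadoop repository with its full history
git clone https://github.com/apache/hadoop.git
cd hadoop

# add the new, empty repository as a second remote
git remote add ozone git@github.com:apache/hadoop-ozone.git

# push only the selected branch (here trunk, published as master)
git push ozone trunk:master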

Advantages:

  1. Easier to search for 5-year-old changes

Disadvantages:

  1. Git log/history will contain all the irrelevant commits (ancient MapReduce/YARN commits, for example)
  2. Repository size will remain very large (~1 GB for Hadoop), which slows down all of the CI steps


(2) Filter branch and create a new git history

This approach is trickier. In git, we can freely rewrite the history and remove any commits from it. Practically, it means that we would keep only the history of the hadoop-ozone and hadoop-hdds subdirectories.

Advantages:

  1. Clean, shorter, meaningful history
  2. Smaller size (1542 HDDS Jiras, ~15 MB vs the ~1 GB of Hadoop)

Disadvantages

  1. All the changes before the commit "HDFS-13258. Ozone: restructure Hdsl/Ozone code to separated maven subprojects." will be available only from the Hadoop repository.

See this repository as an example: https://github.com/elek/hadoop-ozone

Note: I suggest following the second approach.

(3) Branching (use master)

As this is a new repository, we don't need to follow Hadoop's historical "trunk" naming convention. We can go with master, which makes it easier to use from external tools.

What should I do?

After the repository split you should use the new apache/hadoop-ozone repository.

git remote add new git@github.com:apache/hadoop-ozone.git
git fetch new

Then you can start to use the master branch:

git checkout -b master new/master

To create pull requests, you should fork the new hadoop-ozone repository.
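
For example (the remote name `myfork` and the GitHub user are placeholders for your own fork):

# add your fork of apache/hadoop-ozone as a remote
git remote add myfork git@github.com:<your-github-user>/hadoop-ozone.git

# push your feature branch to your fork and open the pull request from there
git push myfork HDDS-1234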

How can I migrate my work?

(1) cherry-pick always works, even if the source and destination branches have different histories. Just add the old and new remotes to the same repository and cherry-pick between the branches:

git checkout -b mywork new/master
git cherry-pick <original commit id>
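
For reference, having both remotes in the same clone could look like this (the remote names `old` and `new` are just examples):

# the original Hadoop repository with the old history
git remote add old git@github.com:apache/hadoop.git

# the new Ozone repository
git remote add new git@github.com:apache/hadoop-ozone.git

git fetch old
git fetch new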

(2) all of the open pull requests are migrated as branches

This approach doesn't require having both the old and new remotes. Just use your (migrated) branch and continue the work:

# delete your branch if you have it
git branch -D HDDS-1234

# recreate it from the migrated branch
git checkout -b HDDS-1234 new/HDDS-1234

# push it to your fork and create a new pull request.
git push elek HDDS-1234

Please open a new pull request based on your new branches. If you had comments in the existing pull request, please add a link back to the original one.

(3) Rebase your branch

Similar to the cherry-pick, you can use more advanced rebase commands to migrate branches between the different source trees:

git rebase HEAD~3 --onto new/master HDDS-2073

This command migrates my local branch (HDDS-2073) to the top of new/master, but only my last three commits (HEAD~3). With this approach, multiple commits (in this case 3) can be migrated in one step.
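
If you are not sure how many commits belong to your work, an optional sanity check before the rebase is to list the commits that would be moved:

# show the last three commits on the current branch
git log --oneline HEAD~3..HEAD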

(4) convert your history

You almost certainly don't need this, but technically you can transform your own repository. This is the exact transformation which will be applied to the Hadoop source tree (it takes about 1-2 hours):

echo 4e61bc431e297d93c93ede7b42be25259f3ca835 > .git/info/grafts
git filter-branch -f --tree-filter "ls -1 | egrep -v 'hadoop-ozone|hadoop-cblock|hadoop-hdds|hadoop-hdsl|pom.ozone.xml' | xargs -n1 rm -rf" --prune-empty -- --all

The first command makes the transformation faster: it pretends that the 4e61bc commit (the first commit in our new history) doesn't have any parent, so we don't need to scan the whole Hadoop history.

The second command applies the filter to all the commits of all the branches (--all). The filter deletes all the files/directories which are not matched (not hadoop-ozone, not hadoop-hdds, ...). Commits which are empty after the transformation (e.g. YARN-only commits) will be removed (--prune-empty).

This transformation always gives you the same result (in git everything is based on hashes and the filter doesn't modify any other metadata). If you have a local, internal fork in your company, this transformation will turn it into branches which are compatible with hadoop-ozone.
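
Note that after such a rewrite git keeps the backup refs and the old objects around, so the size reduction is not visible immediately. A possible cleanup (standard git housekeeping, not part of the official migration steps) is:

# remove the temporary graft file and the backup refs created by filter-branch
rm -f .git/info/grafts
git for-each-ref --format='%(refname)' refs/original/ | xargs -n1 git update-ref -d

# expire the reflog and garbage collect the now unreachable objects
git reflog expire --expire=now --all
git gc --prune=now --aggressive

# check the new repository size
git count-objects -vH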
