This page doesn't contain any new information: it records the plan for the hadoop -> hadoop-ozone repository move.

As discussed on the mailing list:

When Ozone was adopted as a new Hadoop subproject, it was proposed[1] to 
keep it in the source tree but with a separate release cadence, mainly 
because it had hadoop-trunk/SNAPSHOT as a compile-time dependency.

During the last Ozone releases this dependency was removed to provide 
more stable releases. Instead of using the latest trunk/SNAPSHOT build 
of Hadoop, Ozone uses the latest stable Hadoop release (3.2.0 as of now).

As there is no longer a strict dependency between Hadoop trunk SNAPSHOT and 
Ozone trunk, I propose to separate the two code bases from each other by 
creating a new Hadoop git repository (apache/hadoop-ozone.git):

Moving Ozone to a separate git repository would have the following benefits:

  * It would be easier to contribute and to understand the build (as of now 
we always need `-f pom.ozone.xml` as a Maven parameter)
  * It would be possible to adjust the build process without breaking 
Hadoop/Ozone builds.
  * It would be possible to use different Readme/.asf.yaml/github 
templates for Hadoop Ozone and core Hadoop. (For example, the current 
github template [2] has a link to the contribution guideline [3]. Ozone 
has an extended version [4] of this guideline with additional, 
Ozone-specific details.)
  * Testing would be safer, as it would not be possible to change core 
Hadoop and Hadoop Ozone in the same patch.
  * It would be easier to cut branches for Hadoop releases (based on the 
original consensus, Ozone should be removed from all the release 
branches after they are created from trunk)

Technical solution

There are two main methods to split the repository:

(1) Keep the git history and push the last commit to a separate repository

This approach doesn't require any special attention: we can simply push the existing source to a separate repository (only the selected branches).


Pros:

  1. Easier to search for five-year-old changes

Cons:

  1. Git log/history will contain all the irrelevant commits (ancient mapreduce/yarn commits, for example)
  2. The repository size will remain very large (~1 GB for Hadoop), which slows down all of the CI steps
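Approach (1) can be sketched with plain git commands. The following simulates it with local throwaway repositories (the names, paths, and commit messages are purely illustrative; with the real repositories you would push from a hadoop clone to the new apache/hadoop-ozone remote):

```shell
# Simulate approach (1): push the existing history, as-is, to a new
# empty repository. Only the selected branch is pushed, but its full
# history comes with it.
set -e
tmp=$(mktemp -d) && cd "$tmp"

# Stand-in for the existing hadoop clone, with mixed history.
git init -q hadoop && cd hadoop
git -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "YARN-1. ancient yarn commit"
git -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "HDDS-1. ozone commit"

# Stand-in for the new, empty apache/hadoop-ozone repository.
git init -q --bare ../hadoop-ozone.git

# Push only the selected branch to the new repository.
git remote add new ../hadoop-ozone.git
git push -q new HEAD:master

# All commits, relevant or not, are now in the new repository.
git --git-dir=../hadoop-ozone.git log --oneline master
```

Note that the irrelevant YARN commit travels along with the branch; this is exactly the downside listed above.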

(2) Filter branch and create a new git history

This approach is more tricky. In git we can freely adjust the history and remove any commits from it. Practically, it means that we would keep only the history of the hadoop-ozone and hadoop-hdds subdirectories.


Pros:

  1. Clean, shorter, meaningful history
  2. Smaller size (1542 HDDS Jiras, ~15 MB vs the ~1 GB of Hadoop)

Cons:

  1. All the changes before the commit "HDFS-13258. Ozone: restructure Hdsl/Ozone code to separated maven subprojects." will be available only from the Hadoop repository.
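A minimal sketch of approach (2), assuming a toy layout with a hadoop-ozone/ and a hadoop-yarn/ directory (the actual command used for the real migration is shown below in section (4)):

```shell
# Simulate approach (2): rewrite history so only the ozone
# subdirectory survives; commits that become empty are pruned.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's warning delay
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
export GIT_AUTHOR_NAME=demo  GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

mkdir hadoop-ozone hadoop-yarn
echo a > hadoop-ozone/a.txt && git add . && git commit -q -m "HDDS-1. ozone change"
echo b > hadoop-yarn/b.txt  && git add . && git commit -q -m "YARN-1. yarn change"

# Keep only files under hadoop-ozone; drop commits left empty.
# (xargs -r is GNU xargs: don't run rm when the list is empty.)
git filter-branch -f --tree-filter \
  "ls -1 | egrep -v 'hadoop-ozone' | xargs -r -n1 rm -rf" \
  --prune-empty -- --all

git log --oneline   # only the HDDS commit remains
```

The YARN commit touches only files outside hadoop-ozone, so after filtering it becomes empty and `--prune-empty` drops it from the new history.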

See this repository as an example:

Note: I suggest following the second approach.

(3) Branching (use master)

As this is a new repository, we don't need to follow the Hadoop "trunk" historical naming convention. We can go with `master`, which makes it easier to use from external tools.

What should I do?

After the repository split you should use the new apache/hadoop-ozone repository.

git remote add new https://github.com/apache/hadoop-ozone.git
git fetch new

And you can start to use the master branch:

git checkout -b master new/master

To create pull requests, you should fork the new hadoop-ozone repository.

How can I migrate my work?

(1) cherry-pick always works, even if the source and destination branches have different histories. Just add both the old and the new remote to the same repository and cherry-pick between the branches:

git checkout -b mywork new/master
git cherry-pick <original commit id>

(2) All of the open pull requests are migrated as branches

This approach doesn't require having both the old and new remotes. Just use your (migrated) branch and continue the work:

# delete your branch if you have it
git branch -D HDDS-1234

# recreate it from the migrated branch
git checkout -b HDDS-1234 new/HDDS-1234

# push it to your fork and create a new pull request
git push elek HDDS-1234

Please open a new pull request based on your new branches. If you had comments in the existing pull request, please add a link back to the original one.

(3) Rebase your branch

Similar to cherry-pick, you can use more advanced rebase commands to migrate branches between the different source trees:

git rebase --onto new/master HEAD~3 HDDS-2073

This command migrates my local branch (HDDS-2073) to the top of new/master, but only the last three commits (HEAD~3). With this approach multiple commits (in this case three) can be migrated in one step.
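The same pattern can be tried on a throwaway repository (branch and file names here are purely illustrative):

```shell
# Demonstrate git rebase --onto: replay only the last commits of a
# work branch on top of an unrelated new history.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
export GIT_AUTHOR_NAME=demo  GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

# Old history plus a work branch with two commits on top of it.
echo base > file.txt && git add . && git commit -q -m "old trunk base"
git checkout -q -b HDDS-2073
echo one > work1.txt && git add . && git commit -q -m "HDDS-2073. part 1"
echo two > work2.txt && git add . && git commit -q -m "HDDS-2073. part 2"

# Stand-in for new/master: an unrelated (filtered) history.
git checkout -q --orphan new-master
git rm -q -r -f .
echo newbase > base.txt && git add . && git commit -q -m "new history base"

# Replay only the last two commits of HDDS-2073 onto new-master.
git checkout -q HDDS-2073
git rebase -q --onto new-master HEAD~2 HDDS-2073
git log --oneline   # part 1 + part 2 on top of "new history base"
```

After the rebase, the work branch no longer contains the old base commit; its root is the unrelated new history, just as with the migrated hadoop-ozone branches.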

(4) Convert your history

It's almost certain that you don't need this, but technically you can transform your own repository. This is the exact transformation which will be applied to the Hadoop source tree (it takes about 1-2 hours):

echo 4e61bc431e297d93c93ede7b42be25259f3ca835 > .git/info/grafts
git filter-branch -f --tree-filter "ls -1 | egrep -v 'hadoop-ozone|hadoop-cblock|hadoop-hdds|hadoop-hdsl|pom.ozone.xml' | xargs -n1 rm -rf" --prune-empty -- --all

The first command makes the transformation faster: it pretends that the 4e61bc commit (the first commit in our new history) doesn't have any parent, so we don't need to scan the whole Hadoop history.

The second command applies the filter to all the commits of all the branches (`--all`). The filter deletes all the files/directories which don't match (not hadoop-ozone, not hadoop-hdds, ...). Commits which are empty after the transformation (e.g. YARN-only commits) will be removed (`--prune-empty`).

This transformation always gives you the same result (in git everything is based on hashes, and the filter doesn't modify any metadata). If you have a local, internal fork at your company, this will transform it into a branch which is compatible with hadoop-ozone.
