git merge with renamed files

Developer https://www.devze.com 2022-12-27 19:24 Source: web
I have a large website that I am moving into a new framework and in the process adding git. The current site doesn't have any version control on it.

I started by copying the site into a new git repository. I made a new branch and made all of the changes that were needed to make it work with the new framework. One of those steps was changing the file extension of all of the pages.

In the time that I have been working on the new site, changes have been made to files on the old site. So I switched to master and copied all of those changes in.

The problem is that when I merge the branch with the new framework back onto master, there is a conflict on every file that was changed on the master branch.

I wouldn't be too worried about it, but there are a couple of hundred files with changes. I have tried git rebase and git rebase --merge with no luck.

How can I merge these 2 branches without dealing with every file?


Since git 1.7.4, you can specify the rename threshold for a merge, for example git merge -X rename-threshold=25, so that a similarity of 25% is already enough to consider two files rename candidates. Depending on the case, combining this with -X ignore-space-change may make rename detection more reliable.

However, I wanted more direct control and have been cooking up a related script over the last few days. Maybe it helps; let me know.
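A minimal sketch of how that option might be used, run in a throwaway repository. The file names, the branch name new-framework, and the 16-of-40-lines overlap are assumptions invented for this illustration; only git merge -X rename-threshold=25 comes from the answer itself.

```shell
#!/bin/sh
# Sketch: merge across a rename whose content was heavily rewritten.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # fix the branch name regardless of local defaults
git config user.name "Example"
git config user.email "example@example.com"

# Hypothetical helper: print "<word> line NN" for NN = from..to.
lines() {
  awk -v from="$1" -v to="$2" -v word="$3" \
    'BEGIN { for (i = from; i <= to; i++) printf "%s line %02d\n", word, i }'
}

# Base commit: a 40-line page.
lines 1 40 old > page.html
git add page.html
git commit -qm 'Import page'

# Framework branch: rename the page and rewrite 24 of its 40 lines.
git checkout -q -b new-framework
git mv page.html page.php
{ lines 1 16 old; lines 17 40 new; } > page.php
git commit -qam 'Port page to the new framework'

# Meanwhile, a small fix lands on master under the old name.
git checkout -q master
{ echo 'old line 01 (fixed on master)'; lines 2 40 old; } > page.html
git commit -qam 'Fix on the old site'

# Accept 25% similarity as enough to pair page.html with page.php.
git merge -q --no-edit -X rename-threshold=25 new-framework
```

In this setup the two versions share roughly 40% of their content, which is below Git's default 50% threshold, so without the -X option the same merge would be expected to stop with a modify/delete conflict instead of carrying the master fix into page.php.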

https://gist.github.com/894374


It should have worked automatically, thanks to rename detection. Below is a sample session:

$ git init test
Initialized empty Git repository in /tmp/jnareb/test/.git/
$ cp ~/git/README .    # example file, large enough so that rename detection works
$ git add .
$ git commit -m 'Initial commit'
[master (root-commit) b638320] Initial commit
 1 files changed, 54 insertions(+), 0 deletions(-)
 create mode 100644 README
$ git checkout -b new-feature        
Switched to a new branch 'new-feature'
$ git mv README README.txt
$ git commit -m 'Renamed README to README.txt'
[new-feature ce7b731] Renamed README to README.txt
 1 files changed, 0 insertions(+), 0 deletions(-)
 rename README => README.txt (100%)
$ git checkout master
Switched to branch 'master'
$ sed -e 's/UNIX/Unix/g' <README >README+ && mv -f README+ README
$ git commit -a -m 'README changed'
[master 57b1114] README changed
 1 files changed, 1 insertions(+), 1 deletions(-)
$ git merge new-feature 
Merge made by recursive.
 README => README.txt |    0
 1 files changed, 0 insertions(+), 0 deletions(-)
 rename README => README.txt (100%)

If you were doing "git merge master" on 'new-feature' branch instead of, like above, "git merge new-feature" on 'master', you would get:

$ git merge master
Merge made by recursive.
 README.txt |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

Could you tell us what you did differently?

Note that ordinary "git rebase" (and "git pull --rebase") do not pick up renames: you need to run "git rebase -m" or interactive rebase.
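A small sketch of the "git rebase -m" case, reusing the branch and file names from the session above in a throwaway repository. The file contents are made up for the illustration. In Git 2.26 and later the merge backend is the default for rebase, so -m mainly matters on older versions.

```shell
#!/bin/sh
# Sketch: replay a commit that edits README onto a branch where README was renamed.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # fix the branch name regardless of local defaults
git config user.name "Example"
git config user.email "example@example.com"

printf 'UNIX tools\nline 2\nline 3\nline 4\n' > README
git add README
git commit -qm 'Initial commit'

git checkout -q -b new-feature
git mv README README.txt
git commit -qm 'Renamed README to README.txt'

git checkout -q master
printf 'Unix tools\nline 2\nline 3\nline 4\n' > README
git commit -qam 'README changed'

# -m asks for the merge-based rebase machinery, which follows the rename.
git rebase -q -m new-feature
```

After the rebase, master sits on top of new-feature and the edit made to README has been applied to README.txt.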


Ten+ years later, with Git 2.31 (Q1 2021), a new merge strategy should help: ORT ("Ostensibly Recursive's Twin").

It includes a lot of performance optimization work on the rename detection, which is going beyond the simple rename-threshold.

See commit f78cf97, commit 07c9a7f, commit bd24aa2, commit da09f65, commit a35df33, commit f384525, commit 829514c (14 Feb 2021), and commit f15eb7c (03 Feb 2021) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit 12bd175, 01 Mar 2021)

For instance:

diffcore-rename: compute basenames of source and dest candidates

Signed-off-by: Elijah Newren

We want to make use of unique basenames among remaining source and destination files to help inform rename detection, so that more likely pairings can be checked first.
(src/moduleA/foo.txt and source/module/A/foo.txt are likely related if there are no other 'foo.txt' files among the remaining deleted and added files.)

Add a new function, not yet used, which creates a map of the unique basenames within rename_src and another within rename_dst, together with the indices within rename_src/rename_dst where those basenames show up.

Non-unique basenames still show up in the map, but have an invalid index (-1).

This function was inspired by the fact that in real world repositories, files are often moved across directories without changing names.

Here are some sample repositories and the percentage of their historical renames (as of early 2020) that preserved basenames:

  • linux: 76%
  • gcc: 64%
  • gecko: 79%
  • webkit: 89%

These statistics alone don't prove that an optimization in this area will help or how much it will help, since there are also unpaired adds and deletes, restrictions on which basenames we consider, etc., but it certainly motivated the idea to try something in this area.

And:

diffcore-rename: complete find_basename_matches()

Signed-off-by: Elijah Newren

It is not uncommon in real world repositories for the majority of file renames to not change the basename of the file; i.e. most "renames" are just a move of files into different directories.

We can make use of this to avoid comparing all rename source candidates with all rename destination candidates, by first comparing sources to destinations with the same basenames.

If two files with the same basename are sufficiently similar, we record the rename; if not, we include those files in the more exhaustive matrix comparison.

This means we are adding a set of preliminary additional comparisons, but for each file we only compare it with at most one other file.

For example, if there was a include/media/device.h that was deleted and a src/module/media/device.h that was added, and there are no other device.h files in the remaining sets of added and deleted files after exact rename detection, then these two files would be compared in the preliminary step.

This commit does not yet actually employ this new optimization, it merely adds a function which can be used for this purpose.

Note that this optimization might give us different results than without the optimization, because it's possible that despite files with the same basename being sufficiently similar to be considered a rename, there's an even better match between files without the same basename.

I think that is okay for four reasons:

  • (1) it's easy to explain to the users what happened if it does ever occur (or even for them to intuitively figure out),
  • (2) as the next patch will show, it provides such a large performance boost that it's worth the tradeoff, and
  • (3) it's somewhat unlikely that despite having unique matching basenames that other files serve as better matches.

Reason (4) takes a full paragraph to explain...

If the previous three reasons aren't enough, consider what rename detection already does.
Break detection is not the default, meaning that if files have the same _fullname_, then they are considered related even if they are 0% similar.
In fact, in such a case, we don't even bother comparing the files to see if they are similar let alone comparing them to all other files to see what they are most similar to.

Basically, we override content similarity based on sufficient filename similarity.
Without the filename similarity (currently implemented as an exact match of filename), we swing the pendulum the opposite direction and say that filename similarity is irrelevant and compare a full N x M matrix of sources and destinations to find out which have the most similar contents.
This optimization just adds another form of filename similarity comparison, but augments it with a file content similarity check as well.
Basically, if two files have the same basename and are sufficiently similar to be considered a rename, mark them as such without comparing the two to all other rename candidates.
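The preliminary pairing step described in that commit message can be illustrated with a rough shell sketch. This is not Git's C implementation: the function name candidate_pairs, the input files, and the sample paths other than the two device.h ones are assumptions for the example, and the similarity check that Git runs on each candidate pair is left out.

```shell
#!/bin/sh
# candidate_pairs DELETED ADDED
#   DELETED and ADDED are files listing one path per line (both non-empty).
#   Prints "old<TAB>new" for each basename that occurs exactly once in both lists.
candidate_pairs() {
  awk -F/ '
    NR == FNR { del_n[$NF]++; del_path[$NF] = $0; next }  # first file: deleted paths
              { add_n[$NF]++; add_path[$NF] = $0 }         # second file: added paths
    END {
      for (b in del_n)
        if (del_n[b] == 1 && add_n[b] == 1)
          printf "%s\t%s\n", del_path[b], add_path[b]
    }' "$1" "$2" | sort
}

tmp=$(mktemp -d)
printf '%s\n' include/media/device.h docs/readme.txt old/util.c legacy/util.c > "$tmp/deleted"
printf '%s\n' src/module/media/device.h manual/readme.txt new/util.c > "$tmp/added"
candidate_pairs "$tmp/deleted" "$tmp/added"
```

With this input the sketch pairs the two device.h paths and the two readme.txt paths. util.c is skipped because two deleted files share that basename, so those files would fall through to the exhaustive matrix comparison.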

That leads to:

diffcore-rename: guide inexact rename detection based on basenames

Signed-off-by: Elijah Newren

Make use of the new find_basename_matches() function added in the last two patches, to find renames more rapidly in cases where we can match up files based on basenames.

As a quick reminder (see the last two commit messages for more details), this means for example that docs/extensions.txt and docs/config/extensions.txt are considered likely renames if there are no remaining 'extensions.txt' files elsewhere among the added and deleted files, and if a similarity check confirms they are similar, then they are marked as a rename without looking for a better similarity match among other files.

This is a behavioral change, as covered in more detail in the previous commit message.

We do not use this heuristic together with either break or copy detection.

The point of break detection is to say that filename similarity does not imply file content similarity, and we only want to know about file content similarity.

The point of copy detection is to use more resources to check for additional similarities, while this is an optimization that uses far less resources but which might also result in finding slightly fewer similarities.

So the idea behind this optimization goes against both of those features, and will be turned off for both.

For the testcases mentioned in commit 557ac03 ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28, Git v2.31.0-rc0 -- merge listed in batch #8), this change improves the performance as follows:

Before                  After
 s ±  0.062 s    13.294 s ±  0.103 s
 s ±  0.493 s   187.248 s ±  0.882 s
 s ±  0.019 s     5.557 s ±  0.017 s

With Git 2.32 (Q2 2021), the rename detection rework continues.

See commit 81afdf7, commit 333899e, commit 1ad69eb, commit b147301, commit b6e3d27, commit cd52e00, commit 0c4fd73, commit ae8cf74, commit bde8b9f, commit 37a2514 (27 Feb 2021) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit dd4048d, 22 Mar 2021)

diffcore-rename: compute dir_rename_guess from dir_rename_counts

Reviewed-by: Derrick Stolee
Signed-off-by: Elijah Newren

dir_rename_counts has a mapping of a mapping, in particular, it has

old_dir => { new_dir => count }

We want a simple mapping of

old_dir => new_dir

based on which new_dir had the highest count for a given old_dir.
Compute this and store it in dir_rename_guess.

This is the final piece of the puzzle needed to make our guesses at which directory files have been moved to when basenames aren't unique.

(Final piece based on commit 37a2514)
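The reduction from old_dir => { new_dir => count } to old_dir => new_dir can be sketched in a few lines of shell. This is only an illustration of the idea, not the Git code: the tab-separated input format, the sample directory counts, and the rule that the first-seen directory wins a tie are all assumptions.

```shell
#!/bin/sh
# dir_rename_guess: read "old_dir<TAB>new_dir<TAB>count" lines on stdin and print
# "old_dir<TAB>new_dir" for the new_dir with the highest count per old_dir.
dir_rename_guess() {
  awk -F '\t' '
    !($1 in best) || $3 + 0 > best[$1] { best[$1] = $3 + 0; guess[$1] = $2 }
    END { for (d in guess) printf "%s\t%s\n", d, guess[d] }
  ' | sort
}

# Four sample counts: two candidate targets for each of two old directories.
printf '%s\t%s\t%s\n' \
  drivers pilots 25 \
  drivers misc 3 \
  docs Documentation 7 \
  docs manual 9 | dir_rename_guess
```

For this input the guesses are docs => manual and drivers => pilots, since those targets have the highest counts.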

For the testcases mentioned in commit 557ac03 ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28, Git v2.31.0-rc0 -- merge listed in batch #8), this change improves the performance as follows:

Before                  After
 s ±  0.062 s    12.596 s ±  0.061 s
 s ±  0.284 s   130.465 s ±  0.259 s
 s ±  0.019 s     3.958 s ±  0.010 s

Still with Git 2.32 (Q2 2021), the ort merge backend has been optimized by skipping irrelevant renames.

See commit e4fd06e, commit f89b4f2, commit 174791f, commit 2fd9eda, commit a68e6ce, commit beb0614, commit 32a56df, commit 9799889 (11 Mar 2021) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit 1b31224, 08 Apr 2021)

merge-ort: skip rename detection entirely if possible

Signed-off-by: Elijah Newren

diffcore_rename_extended() will do a bunch of setup, then check for exact renames, then abort before inexact rename detection if there are no more sources or destinations that need to be matched.
It will sometimes be the case, however, that either

  • we start with neither any sources or destinations
  • we start with no relevant sources

In the first of these two cases, the setup and exact rename detection will be very cheap since there are 0 files to operate on.
In the second case, it is quite possible to have thousands of files with none of the source ones being relevant.
Avoid calling diffcore_rename_extended() or even some of the setup before diffcore_rename_extended() when we can determine that rename detection is unnecessary.

For the testcases mentioned in commit 557ac03 ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28, Git v2.31.0-rc0 -- merge listed in batch #8), this change improves the performance as follows:

Before                  After
 s ±  0.048 s     5.708 s ±  0.111 s
 s ±  0.236 s   102.171 s ±  0.440 s
 s ±  0.017 s     3.471 s ±  0.015 s

The work continues with Git 2.33 (Q3 2021), where repeated rename detection in a sequence of mergy operations has been optimized.

See commit 25e65b6, commit cbdca28, commit 86b41b3, commit d509802, commit 19ceb48, commit 64aceb6, commit 2734f2e, commit d29bd6d, commit a22099f, commit f950026, commit caba91c, commit bb80333 (20 May 2021), and commit 15f3e1e (04 May 2021) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit 169914e, 14 Jun 2021)

merge-ort, diffcore-rename: employ cached renames when possible

Signed-off-by: Elijah Newren

When there are many renames between the old base of a series of commits and the new base, the way sequencer.c, merge-recursive.c, and diffcore-rename.c have traditionally split the work resulted in re-detecting the same renames with each and every commit being transplanted.
To address this, the last several commits have been creating a cache of rename detection results, determining when it was safe to use such a cache in subsequent merge operations, adding helper functions, and so on.

For the testcases mentioned in commit 557ac03 ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28, Git v2.31.0-rc0 -- merge listed in batch #8), this change improves the performance as follows:

Before                  After
 s ±  0.129 s     5.622 s ±  0.059 s
 s ±  0.158 s    10.127 s ±  0.073 s
ms ±  6.1  ms   500.3  ms ±  3.8  ms

That's a fairly small improvement, but mostly because the previous optimizations were so effective for these particular testcases; this optimization only kicks in when the others don't.
If we undid the basename-guided rename detection and skip-irrelevant-renames optimizations, then we'd see that this series by itself improved performance as follows:

Before Basename Series   After Just This Series
  13.815 s ±  0.062 s      5.697 s ±  0.080 s
1799.937 s ±  0.493 s    205.709 s ±  0.457 s

Since this optimization kicks in to help accelerate cases where the previous optimizations do not apply, this last comparison shows that this cached-renames optimization has the potential to help significantly in cases that don't meet the requirements for the other optimizations to be effective.


Git 2.33 (Q3 2021) also brings more fix-ups and optimizations to "merge -s ort".

See commit ef68c3d, commit 356da0f, commit 61bf449, commit 5a3743d (08 Jun 2021) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit 89efac8, 16 Jul 2021)

merge-ort: replace string_list_df_name_compare with faster alternative

Signed-off-by: Elijah Newren
Reviewed-by: Derrick Stolee

Rewrite the comparison function in a way that does not require finding out the lengths of the strings when comparing them.
While at it, tweak the code for our specific case -- no need to handle a variety of modes, for example.
The combination of these changes reduced the time spent in "plist special sort" by ~25% in the mega-renames case.

For the testcases mentioned in commit 557ac03 ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28, Git v2.31.0-rc0 -- merge listed in batch #8), this change improves the performance as follows:

Before                  After
 s ±  0.059 s     5.235 s ±  0.042 s
 s ±  0.073 s     9.419 s ±  0.107 s
ms ±  3.8  ms   480.1  ms ±  3.9  ms

And, still with git 2.33:

See commit 2bff554, commit 1aedd03, commit d331dd3, commit c75c423 (22 Jun 2021), and commit 78cfdd0 (15 Jun 2021) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit fdbcdfc, 16 Jul 2021)

diffcore-rename: use a different prefetch for basename comparisons

Signed-off-by: Elijah Newren

merge-ort was designed to minimize the amount of data needed and used, and several changes were made to diffcore-rename to take advantage of extra metadata to enable this data minimization (particularly the relevant_sources variable for skipping "irrelevant" renames).
This effort obviously succeeded in drastically reducing computation times, but should also theoretically allow partial clones to download much less information.
Previously, though, the "prefetch" command used in diffcore-rename had never been modified and downloaded many blobs that were unnecessary for merge-ort.
This commit corrects that.

When doing basename comparisons, we want to fetch only the objects that will be used for basename comparisons.
If after basename fetching this leaves us with no more relevant sources (or no more destinations), then we won't need to do the full inexact rename detection and can skip downloading additional source and destination files.
Even if we have to do that later full inexact rename detection, irrelevant sources are culled after basename matching and before the full inexact rename detection, so we can still avoid downloading the blobs for irrelevant sources.
Rename prefetch() to inexact_prefetch(), and introduce a new basename_prefetch() to take advantage of this.

If we modify the testcase from commit 557ac03 ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2021-01-23, Git v2.31.0-rc0 -- merge listed in batch #8) to pass

--sparse --filter=blob:none

to the clone command, and use the new trace2 "fetch_count" output from a few commits ago to track both the number of fetch subcommands invoked and the number of objects fetched across all those fetches, then for the mega-renames testcase we observe the following:

BEFORE this commit, rebasing 35 patches:

strategy     # of fetches    total # of objects fetched
---------    ------------    --------------------------
recursive    62              11423
ort          30              11391

AFTER this commit, rebasing the same 35 patches:

ort          32                 63

This means that the new code only needs to download less than 2 blobs per patch being rebased.
That is especially interesting given that the repository at the start only had approximately half a dozen TOTAL blobs downloaded to start with (because the default sparse-checkout of just the toplevel directory was in use).

So, for this particular linux kernel testcase that involved ~26,000 renames on the upstream side (drivers/ -> pilots/) across which 35 patches were being rebased, this change reduces the number of blobs that need to be downloaded by a factor of ~180.

Git 2.33 (Q3 2021) also includes further optimization of the "merge -s ort" backend.

See commit 8b09a90, commit 7bee6c1, commit 5e1ca57, commit e0ef578, commit d478f56, commit 528fc51, commit 785bf20 (16 Jul 2021) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit 1a6fb01, 04 Aug 2021)

merge-ort: restart merge with cached renames to reduce process entry cost

Signed-off-by: Elijah Newren

The merge algorithm mostly consists of the following three functions:

collect_merge_info()
detect_and_process_renames()
process_entries()

Prior to the trivial directory resolution optimization of the last half dozen commits, process_entries() was consistently the slowest, followed by collect_merge_info(), then detect_and_process_renames().
When the trivial directory resolution applies, it often dramatically decreases the amount of time spent in the two slower functions.

However, staring at this list of functions and noticing that process_entries() is the most expensive and knowing I could avoid it if I had cached renames suggested a simple idea: change

collect_merge_info()
detect_and_process_renames()
process_entries()

into

collect_merge_info()
detect_and_process_renames()
<cache all the renames, and restart>
collect_merge_info()
detect_and_process_renames()
process_entries()

This may seem odd and look like more work.
However, note that although we run collect_merge_info() twice, the second time we get to employ trivial directory resolves, which makes it much faster, so the increased time in collect_merge_info() is small.
While we run detect_and_process_renames() again, all renames are cached so it's nearly a no-op (we don't call into diffcore_rename_extended() but we do have a little bit of data structure checking and fixing up).
And the big payoff comes from the fact that process_entries(), will be much faster due to having far fewer entries to process.

This restarting only makes sense if we can save recursing into enough directories to make it worth our while.
Introduce a simple heuristic to guide this.

For the testcases mentioned in commit 557ac03 ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28, Git v2.31.0-rc0 -- merge listed in batch #8), this change improves the performance as follows:

Before                  After
ms ±  3.8  ms   204.2  ms ±  3.0  ms
 s ±  0.010 s     1.076 s ±  0.015 s
ms ±  3.9  ms   364.1  ms ±  7.0  ms

With Git 2.34 (Q4 2021) comes the final batch of "merge -s ort" optimizations.

See commit 62a1516 (31 Jul 2021), and commit 092e511, commit f239fff, commit a8791ef, commit 6697ee0, commit 4137c54, commit cdf2241, commit fa0e936, commit 7afc0b0 (30 Jul 2021) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit 08ac213, 24 Aug 2021)

merge-ort: switch our strmaps over to using memory pools

Signed-off-by: Elijah Newren

For all the strmaps (including strintmaps and strsets) whose memory is unconditionally freed as part of clear_or_reinit_internal_opts(), switch them over to using our new memory pool.

For the testcases mentioned in commit 557ac03 ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28, Git v2.31.0-rc0 -- merge listed in batch #8), this change improves the performance as follows:

Before                  After
ms ±  3.2  ms    198.1 ms ±  2.6 ms
 s ±  0.012 s    715.8 ms ±  4.0 ms
ms ±  3.9  ms    276.8 ms ±  4.2 ms


I figured out a fix. Since the renaming of the files was done by a script, I was able to copy the new .php files and rerun the script before the merge. Since the files then had the same names on both branches, the merge worked without conflicts.

Here are the steps for the whole process.

  1. Create the git repo (git init)
  2. Copy the existing files in
  3. Commit
  4. Run the script to rename the files
  5. Commit
  6. Create a branch but don't check it out
  7. Make fixes, committing changes as you go
  8. Check out the branch you made in step 6
  9. Copy in the new versions of the files
  10. Run the script to rename the files (this should replace the ones from the first run)
  11. Commit
  12. Check out master
  13. Merge the branch into master

This works because to git the changes were made to the files with the new name.
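The steps above can be sketched as a script against a throwaway repository. Everything concrete here is an assumption for illustration: the stand-in site directory, the page contents, the branch name old-site-updates, and a rename script that turns .php into .phtml (the real script and extensions are not given in the answer).

```shell
#!/bin/sh
# Sketch of the 13 steps with a stand-in "old site" directory and rename script.
set -e
site=$(mktemp -d)   # stands in for the live old site
repo=$(mktemp -d)

# Hypothetical page generator: first line, eight body lines, last line.
page() { echo "$1"; printf 'body line %d\n' 2 3 4 5 6 7 8 9; echo "$2"; }
# Hypothetical rename script: .php -> .phtml, overwriting earlier results.
rename_pages() { for f in *.php; do mv -f "$f" "${f%.php}.phtml"; done; }

page 'home v1' 'old footer' > "$site/index.php"
page 'about v1' 'old footer' > "$site/about.php"

cd "$repo"
git init -q                                   # step 1
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"
cp "$site"/*.php .                            # step 2
git add -A
git commit -qm 'Import old site'              # step 3
rename_pages                                  # step 4
git add -A
git commit -qm 'Rename pages'                 # step 5
git branch old-site-updates                   # step 6
page 'home v1' 'framework footer' > index.phtml
git commit -qam 'Framework fixes'             # step 7
git checkout -q old-site-updates              # step 8
page 'home v2' 'old footer' > "$site/index.php"   # the old site changed meanwhile
cp "$site"/*.php .                            # step 9
rename_pages                                  # step 10
git add -A
git commit -qm 'Changes from the old site'    # step 11
git checkout -q master                        # step 12
git merge -q --no-edit old-site-updates       # step 13
```

Because both branches already contain the files under their new names, the final merge is an ordinary content merge and does not depend on rename detection at all.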


In my case when rename detection failed, I found that during merge resolution I could do the following:

Given:

fileA: A modified file that was moved to the new place but is currently in the old place.
destB: The location where fileA was moved to. This could include a new filename.

Run these commands:

git add fileA
git mv fileA destB

That's all I had to do. Then I committed and the rebase continued.


Adding to @Tilman's answer: with recent Git, the rename option is -X find-renames=<n>; -X rename-threshold=<n> is still accepted as a deprecated synonym.
