Mergiraf: Syntax-Aware Merging for Git

160 points by Velocifyer 13 days ago

paulirish 3 days ago

Have been using Mergiraf for the past 4 months. It's automatically solved about 70% of my conflicts and, luckily, I've never contested any of them. Pretty pleased.

goku12 3 days ago

> luckily, I've never contested any of them.
That's to be expected. The philosophy behind git merges is that it will merge only if it is absolutely and unambiguously sure that the resolution is correct. That's when there is only one solution for the merge. It will just throw its hands up and leave it to the developer if there is any ambiguity - that's if there's more than one way to do the merge.
Every single chunk of merge is a potential conflict. But have you ever contested the regular merge algorithm (ort by default) when it did work? Like when the merge was fully successful, or the successfully merged chunks within a conflicted merge? You can expect the same experience with any merge algorithm that sticks to the git philosophy of being a git [1]. Problems will happen only if they start using some complex heuristics or LLM or something unpredictable like that for the merge.
> It's automatically solved about 70% of my conflicts
At the risk of explaining the obvious, I'm going to try to explain this. (So please don't get angry at me if you already know this.) Imagine that you're trying to manually merge 2 branches without any sort of merge algorithm. For the first case, just assume that you don't know the programming language (imagine that it's in some foreign script). All you have to go by is the record of when each line was added in each branch. The best 'dumb' strategy you have to go with, is the 3-way merge [2]. The referenced page illustrates this. It clearly shows you the advantage of the 3-way merge algorithm over the traditional 2-way merge that we all are familiar with.
But this method still has a disadvantage. You are looking at the source files simply as a bunch of lines, without the knowledge of their more granular structures like the syntax. (Note: That assumption itself may be wrong. That's why merges and git in general don't work well on binary files.) At best, all you can hope for is that the two branches don't contain any edits on the same or the adjacent lines. You won't even know the order in which the lines should be arranged. Now you have a conflict - a merge that you're leaving for someone else to solve.
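To make the line-only view concrete, here is a minimal Python sketch of a 3-way merge that treats files purely as lists of lines. It is a simplification built on `difflib`, not git's actual ort implementation, and the names `merge3` and `_matches` are made up for illustration. Edits separated by at least one untouched line merge cleanly; edits on the same or adjacent lines end up as a conflict.

```python
from difflib import SequenceMatcher


def _matches(base, other):
    """Map each base line index to the index of its matching line in `other`."""
    mapping = {}
    matcher = SequenceMatcher(a=base, b=other, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
    return mapping


def merge3(base, ours, theirs):
    """Line-based three-way merge. Returns (merged_lines, had_conflict)."""
    in_ours, in_theirs = _matches(base, ours), _matches(base, theirs)
    # Anchors are base lines that survive unchanged on both sides.
    anchors = [i for i in range(len(base)) if i in in_ours and i in in_theirs]
    anchors.append(len(base))  # sentinel so the tail is handled like any chunk

    merged, conflict = [], False
    b = o = t = 0  # cursors into base, ours, theirs
    for a in anchors:
        at_end = a == len(base)
        o_end = len(ours) if at_end else in_ours[a]
        t_end = len(theirs) if at_end else in_theirs[a]
        base_chunk, our_chunk, their_chunk = base[b:a], ours[o:o_end], theirs[t:t_end]

        if our_chunk == their_chunk or their_chunk == base_chunk:
            merged += our_chunk        # same edit on both sides, or only we changed it
        elif our_chunk == base_chunk:
            merged += their_chunk      # only they changed it
        else:
            conflict = True            # both changed it differently: give up
            merged += ["<<<<<<<"] + our_chunk + ["======="] + their_chunk + [">>>>>>>"]

        if not at_end:
            merged.append(base[a])     # the anchor line itself
        b, o, t = a + 1, o_end + 1, t_end + 1
    return merged, conflict
```

Note that nothing here looks inside a line, which is exactly the limitation described above.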
Now assume a second case. You know the programming language this time. But you have no idea what the program does - it's not your project. Even with that limitation, you'll still be able to do a better job than just comparing the lines blindly. The Mergiraf docs have a page full of these examples [3]. You can see how obvious the merges look - there is no way you can go wrong. See if you can resolve them just by looking at the lines. That's why mergiraf gives you much better performance without any errors.
There is of course a deeper level of knowledge - the semantic level. The knowledge of what the program does. You need that knowledge to resolve 100% of the merges. And that ultimate merge algorithm is ... you.
> Pretty pleased.
Understandable. But I see a potential problem here. As you are aware, the files to submit to mergiraf are specified in the gitattributes file. There are two ways this can go wrong. First, someone else with your repo may not have or even know about mergiraf. The second, even bigger problem is that some people have global gitattributes files [4] where you place your default attributes. It's possible to set up mergiraf there. But if you do so, your colleagues may not even get a clue as to why certain merges succeed for you, but fail for all of them.
The above problem becomes a bigger issue because merge and rebase conflicts sometimes reappear in later merges or rebases. If that's something mergiraf can solve and you have it, then everything's fine. But if the conflict reappears for someone without mergiraf, they will have to repeat the manual resolution again and again. This happens because git simply won't commit a merge or rebase until we resolve the conflict manually. Therefore, git has no idea what we did in between to resolve it - that is not recorded anywhere. (Well, git-rerere [5] records it if we ask it to. But that's a local-only solution. Everyone will have to do it once on their system.)
There is actually a known solution to the problem. It's called 'first class conflicts' [6]. The idea is to record the conflicts and their resolutions in the repo itself (the same info that rerere stores, but in the shared repo). This means that a conflict once resolved will not come back again, because the structured information to resolve it is available in the repo. This means not everyone needs mergiraf and nobody needs to repeat a completed manual resolution. It has other advantages too. You can just continue working after a conflicted merge and leave the resolution for later. Or you could send the conflicts to someone else more specialized in that area of the code.
I have seen this feature in Jujutsu [6] and Pijul [7]. Git doesn't have it probably because this wasn't around when it was developed. But Jujutsu uses git repository format and they somehow managed to implement first-class conflicts on it. Meanwhile, the concept is already there in git as rerere. So perhaps first-class conflicts are possible in Git too. It would be awesome if we had that in Git too. So if anybody who sees this knows how to do it, please please take it up as a wish!
[1] https://github.com/git/git/blob/e83c5163316f89bfbde7d9ab23ca...
[2] https://blog.git-init.com/the-magic-of-3-way-merge/
[3] https://mergiraf.org/conflicts.html
[4] https://git-scm.com/docs/git-config#Documentation/git-config...
[5] https://git-scm.com/docs/git-rerere
[6] https://jj-vcs.github.io/jj/latest/conflicts/
[7] https://pijul.com/manual/why_pijul.html#modeling-conflicts
- 1718627440 3 days ago
  
  > But have you ever contested the regular merge algorithm (ort by default) when it did work?
  Depends on what you mean by 'contested', but yes. You can have "merge conflicts", that are even correct as far as the syntax is concerned, but are garbage on a semantic level.
  - goku12 3 days ago
    
    I'm not talking about the conflicts. I'm talking about the hunks that were resolved successfully. Sometimes they're part of successful merges. Sometimes they're part of conflicted merges where some other hunk was in conflict.
    
    1718627440 3 days ago
    
    Me too. A merge can be entirely without merge conflicts and still wrong, because it has (semantic or architectural) "merge conflicts".
Sesse__ 3 days ago

This is my experience as well. Not a gamechanger, but definitely on the positive side.

mentalgear 3 days ago

- Related in fine-grained diffing approach: Git heatmap: diff viewer for code reviews

> Heatmap color-codes every diff line/token by how much human attention it probably needs. Unlike PR-review bots, we try to flag not just by “is it a bug?” but by “is it worth a second look?” (examples: hard-coded secret, weird crypto mode, gnarly logic).

https://0github.com/

worldsayshi 3 days ago

Hmm, it would be nice to just see a heatmap over how many times a line has been changed. There must be some easy-ish way to do that right?
- Cthulhu_ 3 days ago
  
  I think you'd need to write a tool that goes through all revisions of a file and does a count, but if that's cached then it's doable. There's a few tools to view that by file though, including some Git commands, it's a valuable tool to determine which files are edited the most (see also the word "churn").
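A rough sketch of the per-file version of that count, in Python. It assumes the input is the output of `git log --numstat --pretty=format:` (one tab-separated `added deleted path` line per touched file); the function names `churn_from_numstat` and `repo_churn` are hypothetical. It counts how many commits touched each file, not per-line changes, and renamed files show up under whatever path text git prints.

```python
import subprocess
from collections import Counter


def churn_from_numstat(numstat_text):
    """Count how many commits touched each path in `git log --numstat` output."""
    touches = Counter()
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip the blank lines between commits
        # parts[0] and parts[1] are added/deleted counts ("-" for binary files)
        touches[parts[2]] += 1
    return touches


def repo_churn(repo="."):
    """Run git in `repo` and return per-file touch counts (requires git on PATH)."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--numstat", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    )
    return churn_from_numstat(result.stdout)
```

Something like `repo_churn().most_common(10)` would then list the ten most frequently edited files.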
Valodim 3 days ago

The idea is cool but boy does it make you blind to anything the AI doesn't deem noteworthy. Comes down to whether you trust a human reviewer more, or the LLM

pavelai 13 days ago

Very impressive enhancement. Not a panacea though. It uses a tree-sitter approach to handle situations where two users change the same line of code. For example, one changes a function name and the other adds a new argument; it will merge that without conflicts. It still has trouble with complex cases where it can't know the authors' intentions, but it can significantly simplify developers' lives. Not sure it will land in git very soon, since git would need to know all the parsers you need. But definitely worth adding.

Velocifyer 13 days ago

This is a separate tool that one can tell git to use.
1718627440 3 days ago

What does it do when the change in function name means that the number of spaces before each parameter (alignment) changed?
- pavelai 2 days ago
  
  If one developer changed the function name and the other changed the alignment of parameters on the same line, this tool would recognize the changes and merge the line without conflicts. The regular git algorithm would turn it into a conflict because both changes happened on the same line.
  - 1718627440 2 days ago
    
    The idea is that the alignment and function name changes happen on the same side, since the alignment is caused by the function name. The other side e.g. adds another parameter. Does the new parameter get the correct alignment, or that of the old function name?

Valodim 3 days ago

fyi, comes configured in jj by default. Just `jj resolve --tool mergiraf` and some conflicts go away :)

indentit 3 days ago

I tried using Mergiraf a year or so ago, and ended up with so many weird problems that I eventually tracked down to being caused by it, that I disabled and uninstalled it and never looked back - it was more hassle than it was worth

0x7cfe 2 days ago

What kind of problems did you encounter? Could you provide an example?

jayd16 3 days ago

I wish there were a lot more syntax aware merges built into git (et al). Why are separate columns on the same row of a CSV or multiple appends to a list (in any language you don't want a trailing comma) so annoying to merge?

It could be so much better.
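For the CSV case, a cell-level 3-way merge is easy to sketch in Python. This is a toy under strong assumptions: all three versions have the same rows and columns in the same order (no inserted or deleted rows), and the function name `merge_csv` is made up for illustration. It is not how git or Mergiraf handle CSV.

```python
import csv
import io


def merge_csv(base, ours, theirs):
    """Three-way merge CSV text cell by cell.

    Returns (merged_text, conflicts), where conflicts is a list of
    (row, column) positions that both sides changed differently.
    """
    def parse(text):
        return list(csv.reader(io.StringIO(text)))

    b_rows, o_rows, t_rows = parse(base), parse(ours), parse(theirs)
    if not (len(b_rows) == len(o_rows) == len(t_rows)):
        raise ValueError("this sketch only handles files with the same number of rows")

    merged, conflicts = [], []
    for r, (b_row, o_row, t_row) in enumerate(zip(b_rows, o_rows, t_rows)):
        if not (len(b_row) == len(o_row) == len(t_row)):
            raise ValueError(f"row {r} has a different number of columns")
        row = []
        for c, (b, o, t) in enumerate(zip(b_row, o_row, t_row)):
            if o == t or t == b:
                row.append(o)          # both agree, or only our side changed
            elif o == b:
                row.append(t)          # only their side changed
            else:
                conflicts.append((r, c))
                row.append(o)          # keep ours, but report the conflict
        merged.append(row)

    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(merged)
    return out.getvalue(), conflicts
```

With this, one side editing the `name` column and the other editing `qty` on the same row merges cleanly, where a line-based merge would conflict.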

gritzko 3 days ago

> After extracting a list of every merge conflict in the kernel's Git history, I tried using Mergiraf to resolve them. 6,987 still resulted in conflicts, but 428 were resolved successfully. A much larger fraction of merge conflicts were still partially resolved.

bjackman 3 days ago

Take this with a grain of salt as I haven't tested this claim, but I think C might be a pretty weak language for this tool because you can't really parse it without running the whole preprocessor, which it can't do:
https://codeberg.org/mergiraf/mergiraf/issues/612#issuecomme...
So I think in a more sensible language you might get much better results than this.
- gritzko 3 days ago
  
  Another aspect is the fact this repo reflects Torvalds’ view of the world. He operates in large-ish changesets.

mnemonet 3 days ago

This is a very interesting idea that could save a lot of time and pain in big projects.

The example shown reminds me of Zed's CRDTs [1], and their journey to build a fine-grained version control system for agentic development [2]. I imagine this work could prove useful to the Zed/Cursor team, and likely shares a lot of functionality with DeltaDB [2].

- [1]: https://zed.dev/blog/crdts

- [2]: https://zed.dev/blog/sequoia-backs-zed

vinnyhaps 3 days ago

I’m pretty sure one of the Zed founders wrote tree-sitter, so I’m sure there’s some overlap
It’s really cool to see tree-sitter unlock so many of these use cases. I love using [difftastic] for my diffing tool to get context aware diffs. So in the example from the article, the diff would highlight the `void` and `int` changes with a heavier background of red and green respectively
[difftastic]: https://github.com/Wilfred/difftastic
- conartist6 3 days ago
  
  Max Brunsfeld in fact, yep. He went along to Zed from the Atom team.
  But curiously Zed hasn't been very interested in Tree-sitter. They don't seem to see it as having much strategic value to their company, which is odd because lots of other people do see it as a valuable platform. You have Tweag building code formatting on it, you had GitHub building stack graphs on it, you have Mergiraf. You even have some really "out there" stuff like the Software Evolution Library!
  - olejorgenb 3 days ago
    
    They use it quite a bit in Zed though. What do you count as "not very interested"?
    
    conartist6 2 days ago
    
    It comes down to tree sitter being the heart of a semantic IDE. If you use Tree sitter's data to apply a fix for a formatting problem or a lint error you are making a semantic edit to your code using it: you aren't describing that change in terms of the line/col in a text buffer then, but first in terms of the path to the node you wish to adjust in the syntax tree and the semantic rules used to target it.
    Zed doesn't want to build a semantic IDE. They've said it a million times, they want to build a text editor, so they just aren't going to put the tree representation at the center of the experience. A text editor's UX is built around the text buffer so that it emulates the experience of coding while sitting at a typewriter filling out punch cards. We can do better than the typewriter as the anchoring metaphor for all UX!
    I think those projects I listed that build on top of Tree-sitter (all ignored by Zed) all see the potential of semantic changes and of Tree-sitter as a platform for making them.
    
    conartist6 2 days ago
    
    Think about it. Tree-sitter is an IDE.
    I don't mean a standalone syntax highlighter, I mean it's a whole environment in which you can write software and in which things integrate. An Integrated Development Environment.
    But Zed doesn't want that product. That product, if they cared that they owned it, would compete with Zed

1718627440 3 days ago

> Therefore, this merge conflict can be resolved automatically by putting the lines in any order. The resulting merged program has the same behavior either way.

That means that if I the programmer care about the order, I must now review lines, where no merge conflict is indicated. I am not sure I would like that.

PoignardAzur a day ago

Yeah, that's a bad example, there's a bunch of ways field order matters in Rust.
Import order would have been a better example (they're always supposed to be sorted).
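A small Python sketch of why sorted imports are the easy case: if the lines form a set with a canonical order, a 3-way merge is just set arithmetic followed by a sort. The function name `merge_sorted_set` is hypothetical, and plain lexicographic sorting is an assumption here; it is not necessarily the exact order rustfmt would produce.

```python
def merge_sorted_set(base, ours, theirs):
    """Three-way merge of lines whose order is canonical (e.g. sorted imports)."""
    base_s, ours_s, theirs_s = set(base), set(ours), set(theirs)
    # A line is gone if either side removed it relative to base.
    removed = (base_s - ours_s) | (base_s - theirs_s)
    # Keep everything either side has, minus removals, in canonical order.
    return sorted((ours_s | theirs_s) - removed)
```

Because the result is re-sorted, it does not matter which side's additions come first, so there is no ordering decision left for a human to review.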

scoodah 3 days ago

Way back in the day when I primarily wrote C# I used to use a tool called SemanticMerge. It was pretty cool, it actually parsed the code and could pick up refactors like moving a method to a different class and whatnot. This kinda reminds me of that a bit.

Cthulhu_ 3 days ago

Yeah, the article mentions a similar project for Java; I'm a bit surprised / disappointed that there's no more language specific merge tools tbh, or a super-tool that has plugins for individual languages. Maybe this article will attract more attention though.

James_K 3 days ago

Very interesting to see Tree-sitter starting to get used for more things.

virajk_31 3 days ago

I really liked the last section of your article, thanks for the numbers

sysguest 3 days ago

finally...

I've been using 1-arg-1-line to avoid most conflicts

Cthulhu_ 3 days ago
I've been doing some SQL again and one technique I learned years ago was having each thing on its own line, both to reduce churn in version control and allow for easier reordering and commenting out.
Instead of
```
    SELECT foo, bar, quux FROM baz WHERE storge = 'grault';
```
do
```
    SELECT
       foo
      ,bar
      ,quux
    FROM
      baz
    WHERE
      storge = 'grault'
    ;
```
It's pretty hideous in this example but for bigger queries maintained over a long period of time it can be beneficial. I assume, it's been nearly 20 years since I did anything more serious with SQL.

KuhaLeyka 2 days ago

Don't use it when you write code for critical infrastructure or aviation please. :)

ltbarcly3 3 days ago

claude "resolve merge conflicts"

littlestymaar 3 days ago

Using 30s worth of H100 GPU instead of <10ms worth of an entry-level CPU, for a worse result.
Well done.
- ltbarcly3 3 days ago
  
  "Compositing text into graphical data to display it on a 2D array of millions of 32bit RGB pixels instead of just using a pencil and a 50 cent notebook."
  Actually I've done this a hundred times now and it has yet to make a single mistake. I don't give a crap how much GPU it uses, grandpa.
Cthulhu_ 3 days ago

OK, I'm going to try and resolve these merge conflicts for you!
First, let me pull up the diff and git status
......
....
...
.
Hmm, that didn't quite work, let me try that again!
- ltbarcly3 3 days ago
  
  I've resolved hundreds of conflicted merges this way and I don't remember it making a single mistake.
  - n4r9 3 days ago
    
    Might that be because LLMs are potentially negatively impacting your memory?
    
    ltbarcly3 3 days ago
    
    Yea that's probably it. Or you're wrong? One of those for sure.