Bug 6528 - On the fly corruption of file-revision with copy information during exchange with 5.8
Summary: On the fly corruption of file-revision with copy information during exchange ...
Status: RESOLVED FIXED
Alias: None
Product: Mercurial
Classification: Unclassified
Component: Mercurial (show other bugs)
Version: 5.8
Hardware: PC Linux
: critical bug
Assignee: Bugzilla
URL:
Keywords: regression
: 6603 (view as bug list)
Depends on:
Blocks:
 
Reported: 2021-06-08 23:56 UTC by Gary Kramlich
Modified: 2022-12-20 05:00 UTC (History)
5 users (show)

See Also:
Python Version: ---


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Gary Kramlich 2021-06-08 23:56:56 UTC
There was a lot of discussion on this in #mercurial on freenode but I'll do my best to summarize it. I don't expect this issue to be widespread as based on the commit I bisected too, it sounds like the patch just wasn't ready for some of the crazy stuff that has happened in the pidgin repository as it went from cvs -> svn -> mtn -> hg with some manual patching in between some of those steps as well.

I was bootstrapping a Google Summer of Code student today and noticed that there were more modified files in the repository than there should have been. 26 files in fact that are very old and haven't been touched in a long time.

I had created the repo from them using `hg init ssh://keep.imfreedom.org/prateek/pidgin` and then pushed a fresh clone of https://keep.imfreedom.org/pidgin/pidgin into it. keep.imfreedom.org is running my hgkeeper servering solution for mercurial. While I did all of this, keep was running hg 5.8.

Anyways, I just did a commit by specifying the files we changed and then started debugging where these files came from. (This is important for later).

I started trying to clone from other machines thinking it was something local at first. Interestingly enough, our mac build agent running hg 5.3.2 with python 2.7.17 never reproduced the issue, but a linux machine running hg 4.8.2 with python 2.7.18 did. I had a few others try with various combinations and all of them reproduced the issue as well, 26 modified files after a clone. However, Zash in #mercurial only saw 25 modified files.

I ended up creating a tarball of the main pidgin repo which is currently available at https://data.imfreedom.org/pidgin/pidgin-raw.tar.xz. Using the raw repo I did a local directory clone `hg clone pidgin-raw pidgin-local` and verified the issue was not present. I then served the repo with `hg --cwd pidgin-raw serve` and cloned it with `hg clone http://localhost:8000 pidgin-local` and verified the issue was present.

With that data point in hand I downgraded my local hg to 5.7.1 and performed the same test. This time, both clone did not express the issue. I then downgraded hg on keep to 5.7.1 and have been unable to reproduce for any non-poisoned repo there. I say non-poisoned because the repo I created for the GSoC student still shows the modified files on a fresh clone even with 5.7.1.  That repo is https://keep.imfreedom.org/prateek/pidgin.

I then bisected starting with tag:5.7.1 marked as good and tag:5.8 marked as bad. During each step, I ran `pip install -U --user .` in the hg source directory. After installation, I served a clean repo with `hg serve` and clone it with `hg clone http://localhost:8000 local` until bisect identified revision https://www.mercurial-scm.org/repo/hg/rev/49fd21f32695. The description seems like a likely candidated, but mharbison in #mercurial hasn't been able to reproduce on that revision.

So as of right now, I have hg 5.7.1 running on keep.imfreedom.org to stop the spread of the poison repos.  The poisoning isn't horrible, but it's not great either.

So a remote init and push from a poisoned repo results in a poisoned repo. That is `hg init ssh://foo; hg push ssh://foo`.  However, a local init and push does not poison. That is `hg init ../foo; hg push ../foo`.  And as you may expect, a local clone of a poisoned repo poisons the new repo. That is `hg clone . ../foo`.
Comment 1 Anton Shestakov 2021-06-13 22:04:44 UTC
AFAIU it was confirmed in the IRC, and is related to storing copy/move metadata for files and the order of parents (kudos to nbjoerg for investigating).
Comment 2 Pierre-Yves David 2021-07-06 13:57:37 UTC
This is actively corrupting revision on the fly. So we need:

1) an immediate fix or backout of the faulty code
2) a script to audit an existing repository and fix up the parent information.
Comment 3 HG Bot 2021-07-08 20:55:09 UTC
Fixed by https://mercurial-scm.org/repo/hg/rev/bd0a2a919bf8
Pierre-Yves David <pierre-yves.david@octobus.net>
corruption: add a test for issue6528

The initial reproduction script was provided by Charles Chamberlain from Jane
Street.

Differential Revision: https://phab.mercurial-scm.org/D10996

(please test the fix)
Comment 4 HG Bot 2021-07-08 20:55:14 UTC
Fixed by https://mercurial-scm.org/repo/hg/rev/411dc27fd9fd
Pierre-Yves David <pierre-yves.david@octobus.net>
corruption: backout changeset 49fd21f32695 (issue6528)

inverting the parent is masking copy information leading to bad content being
fetched and bad status result.

Since 49fd21f32695, exchange can actively swap these parent corrupting existing
changesets and triggering the corruption.

Data corruption are considered critical so backing this out and doing and
unscheduled release seems in order.

Differential Revision: https://phab.mercurial-scm.org/D10995

(please test the fix)
Comment 5 Pierre-Yves David 2021-07-12 16:06:25 UTC
I am discussing the next steps here: https://bz.mercurial-scm.org/show_bug.cgi?id=6538
Comment 6 Bugzilla 2021-07-21 04:00:05 UTC
Bug was set to TESTING for 7 days, resolving
Comment 7 Pierre-Yves David 2021-07-26 20:17:42 UTC
This bug affect version Mercurial 5.8 only and is fixed in Mercurial 5.8.1

Corruption that happened using 5.8 will however remains and keep propagating with the associated changes.
Comment 8 HG Bot 2021-08-18 14:40:06 UTC
Fixed by https://mercurial-scm.org/repo/hg/rev/b30a53ffbf9b
Raphaël Gomès <rgomes@octobus.net>
debugcommands: introduce a debug command to repair repos affected by issue6528

This command is quite basic and slow, it will loop over the entirety of the
filelogs in the repository and check each revision for corruption, then fixes
the affected filelogs. It takes under 25 minutes for Mozilla-Central on my
not-top-of-the-line laptop, using the `--to-report` and `--from-report` options
will make this pretty tolerable to use, I think.

This change also introduces a test for the fix.

Differential Revision: https://phab.mercurial-scm.org/D11239

(please test the fix)
Comment 9 HG Bot 2021-08-18 14:40:11 UTC
Fixed by https://mercurial-scm.org/repo/hg/rev/32e21ac3adb1
Raphaël Gomès <rgomes@octobus.net>
repair: improve performance of detection of revisions affected by issue6528

Explanations inside the patch. I've tested this on Mozilla-Central and it's
5 times faster than the naive approach on my laptop.

Differential Revision: https://phab.mercurial-scm.org/D11262

(please test the fix)
Comment 10 HG Bot 2021-08-18 14:40:16 UTC
Fixed by https://mercurial-scm.org/repo/hg/rev/5b046c2e3000
Pierre-Yves David <pierre-yves.david@octobus.net>
issue6528: implement _is_revision_affected using callback

The delta comming from a bundle/stream does not exists in the revlog yet, so we
will need other way to retrieve the same information.

To prepare for this we split the function to use callbacks in the core logic.

Differential Revision: https://phab.mercurial-scm.org/D11267

(please test the fix)
Comment 11 HG Bot 2021-08-18 14:40:20 UTC
Fixed by https://mercurial-scm.org/repo/hg/rev/c02ce6def30c
Pierre-Yves David <pierre-yves.david@octobus.net>
issue6528: implement _is_revision_affected_fast using callback

The delta comming from a bundle/stream does not exists in the revlog yet, so we
will need other way to retrieve the same information.

To prepare for this we split the function to use callbacks in the core logic.

Differential Revision: https://phab.mercurial-scm.org/D11268

(please test the fix)
Comment 12 HG Bot 2021-08-18 14:40:25 UTC
Fixed by https://mercurial-scm.org/repo/hg/rev/c30ca163b45e
Pierre-Yves David <pierre-yves.david@octobus.net>
issue6528: also filter delta on the fly when applying a changegroup

This ensure that corrupted clone does not spread corruption to "fixed" version.

This might come at a performance cost, we will had a config option to control
this behavior in the next changesets.

Differential Revision: https://phab.mercurial-scm.org/D11270

(please test the fix)
Comment 13 HG Bot 2021-08-18 14:40:29 UTC
Fixed by https://mercurial-scm.org/repo/hg/rev/2813d406b036
Pierre-Yves David <pierre-yves.david@octobus.net>
issue6528: add a config option to control the fixing on the fly

This will allow people who know to be safe to avoid any performance overhead
(and other potential issue).

Differential Revision: https://phab.mercurial-scm.org/D11271

(please test the fix)
Comment 14 Bugzilla 2021-08-26 04:00:04 UTC
Bug was set to TESTING for 7 days, resolving
Comment 15 HG Bot 2021-10-12 19:35:32 UTC
Fixed by https://mercurial-scm.org/repo/hg/rev/b669e40fbbd6
Raphaël Gomès <rgomes@octobus.net>
help: update help text for debug-repair-issue6528

The changegroup fix was put in 5.9.1, this is now out of date. Alson, this can
maybe encourage people to upgrade?

Differential Revision: https://phab.mercurial-scm.org/D11392

(please test the fix)
Comment 16 Georges Racinet 2021-10-15 13:55:01 UTC
*** Bug 6603 has been marked as a duplicate of this bug. ***
Comment 17 Bugzilla 2021-10-23 04:00:47 UTC
Bug was set to TESTING for 7 days, resolving
Comment 18 HG Bot 2021-12-16 22:40:09 UTC
Fixed by https://mercurial-scm.org/repo/hg/rev/3eb3aef6d3bf
Simon Sapin <simon.sapin@octobus.net>
rhg: Mark it as expected that the issue6528 bug is not reproduced

Differential Revision: https://phab.mercurial-scm.org/D11909

(please test the fix)
Comment 19 Bugzilla 2021-12-24 05:01:10 UTC
Bug was set to TESTING for 7 days, resolving
Comment 20 HG Bot 2022-04-07 08:35:07 UTC
Fixed by https://mercurial-scm.org/repo/hg/rev/5b65721a75eb
Joerg Sonnenberger <joerg@bec.de>
revlog: recommit 49fd21f32695 with a fix for issue6528

`filelog.size` currently special cases two forms of metadata encoding:
- copy data via the parent order as flag bit
- censor data by peaking into the raw delta
All other forms of metadata encoding including the empty metadata block
are mishandled. In `basefilectx.cmp` the empty metadata block is
explicitly checked to compensate for this.

Restore 49fd21f32695, but disable it for filelog, so that the original
flag bit use contines to work. Document all this mess for now in
preparation of a proper rework.

Differential Revision: https://phab.mercurial-scm.org/D11203

(please test the fix)
Comment 21 Bugzilla 2022-04-15 04:00:34 UTC
Bug was set to TESTING for 7 days, resolving
Comment 22 HG Bot 2022-11-30 20:50:10 UTC
Fixed by https://mercurial-scm.org/repo/hg/rev/e92de86cf4f8
Pierre-Yves David <pierre-yves.david@octobus.net>
emitrevision: consider ancestors revision to emit as available base

This should make more delta base valid. This notably affects:

* case where we skipped some parent with empty delta to directly delta against
  an ancestors
* case where an intermediate snapshots is stored.

This change means we could sent largish intermediate snapshots over the wire.
However this is actually a sub goal here. Sending snapshots over the wire means
the client have a high odd of simply storing the pre-computed delta instead of
doing a lengthy process that will… end up doing the same intermediate snapshot.

In addition the overall size of snapshot (or any level) is "only" some or the
overall delta size. (0.17% for my mercurial clone, 20% for my clone of Mozilla
try). So Sending them other the wire is unlikely to change large impact on the
bandwidth used.

If we decide that minimising the bandwidth is an explicit goal, we should
introduce new logic to filter-out snapshot as delta. The current code has no
notion explicite of snapshot so far, they just tended to fall into the wobbly
filtering options.

In some cases, this patch can yield large improvement to the bundling time:

    ### data-env-vars.name      = mozilla-try-2019-02-18-zstd-sparse-revlog
      # benchmark.name          = perf-bundle
      # benchmark.variants.revs = last-100000
    before: 68.787066 seconds
    after:  47.552677 seconds (-30.87%)

That translate to large improvement to the pull time :

    ### data-env-vars.name           = mozilla-try-2019-02-18-zstd-sparse-revlog
      # benchmark.name               = pull
      # benchmark.variants.issue6528 = disabled
      # benchmark.variants.revs      = last-100000
    before: 142.186625 seconds
    after:   75.897745 seconds  (-46.62%)

No significant negative impact have been observed.

(please test the fix)
Comment 23 Bugzilla 2022-12-08 05:00:13 UTC
Bug was set to TESTING for 7 days, resolving
Comment 24 HG Bot 2022-12-12 13:40:10 UTC
Fixed by https://mercurial-scm.org/repo/hg/rev/f463eb675e85
Pierre-Yves David <pierre-yves.david@octobus.net>
emitrevision: consider ancestors revision to emit as available base

This should make more delta base valid. This notably affects:

* case where we skipped some parent with empty delta to directly delta against
  an ancestors
* case where an intermediate snapshots is stored.

This change means we could sent largish intermediate snapshots over the wire.
However this is actually a sub goal here. Sending snapshots over the wire means
the client have a high odd of simply storing the pre-computed delta instead of
doing a lengthy process that will… end up doing the same intermediate snapshot.

In addition the overall size of snapshot (or any level) is "only" some or the
overall delta size. (0.17% for my mercurial clone, 20% for my clone of Mozilla
try). So Sending them other the wire is unlikely to change large impact on the
bandwidth used.

If we decide that minimising the bandwidth is an explicit goal, we should
introduce new logic to filter-out snapshot as delta. The current code has no
notion explicite of snapshot so far, they just tended to fall into the wobbly
filtering options.

In some cases, this patch can yield large improvement to the bundling time:

    ### data-env-vars.name      = mozilla-try-2019-02-18-zstd-sparse-revlog
      # benchmark.name          = perf-bundle
      # benchmark.variants.revs = last-100000
    before: 68.787066 seconds
    after:  47.552677 seconds (-30.87%)

That translate to large improvement to the pull time :

    ### data-env-vars.name           = mozilla-try-2019-02-18-zstd-sparse-revlog
      # benchmark.name               = pull
      # benchmark.variants.issue6528 = disabled
      # benchmark.variants.revs      = last-100000
    before: 142.186625 seconds
    after:   75.897745 seconds  (-46.62%)

No significant negative impact have been observed.

(please test the fix)
Comment 25 Bugzilla 2022-12-20 05:00:12 UTC
Bug was set to TESTING for 7 days, resolving