
[RFC] repack: add --filter=<filter-spec> #1206

Open
wants to merge 4 commits into master from jc-repack-filter

Conversation

@john-cai (Contributor) commented Jan 26, 2022

This patch series makes partial clone more useful by making it possible to run repack to remove objects from a repository (replacing them with promisor objects). This is useful when we want to offload large blobs from a git server onto another git server, or even use an http server through a remote helper.

This was originally submitted as a patch series [A] that did not have a clear explanation of the goal or motivation for the change. Based on Stolee's feedback, I added RFC to the subject line to indicate that feedback is needed on the direction of this series before diving into implementation details.

In [B], a --refilter option on fetch and fetch-pack is being discussed where either a less restrictive or more restrictive filter can be used. In the more restrictive case, the objects that already exist will not be deleted. But one can imagine that users might want the ability to delete objects when they apply a more restrictive filter in order to save space, and this patch series would also allow that.

There are a few things we need to adjust to make this possible. This series has four parts.

  1. Allow --filter in pack-objects without --stdout
  2. Add a --filter flag for repack
  3. Allow missing promisor objects in upload-pack
  4. Tests that demonstrate the ability to offload objects onto an http remote
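
The mechanics the series relies on can be sketched with stock git commands. Everything below is long-standing partial-clone behaviour except the final, commented-out `repack --filter`, which is the new option proposed here; repository and file names are illustrative:

```shell
set -e

# A server repository with one blob, allowing filtered fetches.
git init -q server
git -C server -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m init
echo content1 > server/file1
git -C server add file1
git -C server -c user.name=t -c user.email=t@example.com commit -qm add-file1
git -C server config uploadpack.allowFilter true

# A blobless partial clone: file1's blob stays on the server.
git clone -q --bare --no-local --filter=blob:none server client.git

# Missing objects are listed with a '?' prefix rather than fetched.
git -C client.git rev-list --objects --all --missing=print

# Proposed by this series: shed blobs from an existing repository,
# marking the remaining packs as promisor packs (requires the patch):
# git -C client.git repack -a -d --filter=blob:none
```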

Changes since v2:

  • Nothing; just fixed up the CC line.

Changes since v1:

  • Added patch to allow missing promisor objects in upload-pack.
  • Clarified motivation in commit messages.
  • Added test that demonstrates process of offloading objects onto http remote.

A. https://lore.kernel.org/git/a62a007f-7c61-68eb-c0e6-548dc9b6f671@gmail.com/
B. https://lore.kernel.org/git/pull.1138.git.1643730593.gitgitgadget@gmail.com/

cc: Christian Couder christian.couder@gmail.com
cc: Derrick Stolee stolee@gmail.com
cc: Robert Coup robert@coup.net.nz
cc: Robert Coup robert.coup@koordinates.com
cc: Taylor Blau me@ttaylorr.com

@john-cai john-cai force-pushed the jc-repack-filter branch 3 times, most recently from 4d032f6 to 5efc2fa Compare Jan 27, 2022
@john-cai commented Jan 27, 2022

/preview

gitgitgadget-git (bot) commented Jan 27, 2022

@john-cai commented Jan 27, 2022

/preview

gitgitgadget-git (bot) commented Jan 27, 2022

@john-cai commented Jan 27, 2022

/submit

gitgitgadget-git (bot) commented Jan 27, 2022

Submitted as pull.1206.git.git.1643248180.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-git-1206/john-cai/jc-repack-filter-v1

To fetch this version to local tag pr-git-1206/john-cai/jc-repack-filter-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-git-1206/john-cai/jc-repack-filter-v1

@@ -126,6 +126,11 @@ depth is 4095.
a larger and slower repository; see the discussion in


On the Git mailing list, Derrick Stolee wrote (reply to this):

On 1/26/2022 8:49 PM, John Cai via GitGitGadget wrote:
> From: John Cai <johncai86@gmail.com>
> 
> Currently, repack does not work with partial clones. When repack is run
> on a partially cloned repository, it grabs all missing objects from
> promisor remotes. This also means that when gc is run for repository
> maintenance on a partially cloned repository, it will end up getting
> missing objects, which is not what we want.

This shouldn't be what is happening. Do you have a demonstration of
this happening? repack_promisor_objects() should be avoiding following
links outside of promisor packs so we can safely 'git gc' in a partial
clone without downloading all reachable blobs.

> In order to make repack work with partial clone, teach repack a new
> option --filter, which takes a <filter-spec> argument. repack will skip
> any objects that are matched by <filter-spec> similar to how the clone
> command will skip fetching certain objects.

This is a bit misleading, since 'git clone' doesn't "skip fetching",
but instead requests a filter and the server can choose to write a
pack-file using that filter. I'm not sure if it's worth how pedantic
I'm being here.

The thing that I find confusing here is that you are adding an option
that could be run on a _full_ repository. If I have a set of packs
and none of them are promisor (I have every reachable object), then
what is the end result after 'git repack -adf --filter=blob:none'?
Those existing pack-files shouldn't be deleted because they have
objects that are not in the newly-created pack-file.

I'd like to see some additional clarity on this before continuing
to review this series.

> The final goal of this feature, is to be able to store objects on a
> server other than the regular git server itself.
> 
> There are several scripts added so we can test the process of using a
> remote helper to upload blobs to an http server:
> 
> - t/lib-httpd/list.sh lists blobs uploaded to the http server.
> - t/lib-httpd/upload.sh uploads blobs to the http server.
> - t/t0410/git-remote-testhttpgit a remote helper that can access blobs
>   on an http server. Copied over from t/t5801/git-remote-testhttpgit
>   and modified to upload blobs to an http server.
> - t/t0410/lib-http-promisor.sh convenience functions for uploading
>   blobs

I think these changes to the tests should be extracted to a new
patch where this can be discussed in more detail. I didn't look
too closely at them because I want to focus on whether this
--filter option is a good direction for 'git repack'.

>  		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
> @@ -819,6 +824,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>  		if (line.len != the_hash_algo->hexsz)
>  			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
>  		string_list_append(&names, line.buf);
> +		if (po_args.filter) {
> +			char *promisor_name = mkpathdup("%s-%s.promisor", packtmp,
> +							line.buf);
> +			write_promisor_file(promisor_name, NULL, 0);

This code is duplicated in repack_promisor_objects(), so it would be
good to extract that logic into a helper method called by both places.

> +		}
>  	}
>  	fclose(out);
>  	ret = finish_command(&cmd);

> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
> index e489869dd94..78cc1858cb6 100755
> --- a/t/t7700-repack.sh
> +++ b/t/t7700-repack.sh
> @@ -237,6 +237,26 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
>  	test_must_be_empty actual
>  '
>  
> +test_expect_success 'repack with filter does not fetch from remote' '
> +	rm -rf server client &&
> +	test_create_repo server &&
> +	git -C server config uploadpack.allowFilter true &&
> +	git -C server config uploadpack.allowAnySHA1InWant true &&
> +	echo content1 >server/file1 &&
> +	git -C server add file1 &&
> +	git -C server commit -m initial_commit &&
> +	expected="?$(git -C server rev-parse :file1)" &&
> +	git clone --bare --no-local server client &&

You could use "file://$(pwd)/server" here instead of "server".

> +	git -C client config remote.origin.promisor true &&
> +	git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&

This isn't testing what you want it to test, because your initial
clone doesn't use --filter=blob:none, so you already have all of
the objects in the client. You would never trigger a need for a
fetch from the remote.

> +	git -C client rev-list --objects --all --missing=print >objects &&
> +	grep "$expected" objects &&
> +	git -C client repack -a -d &&
> +	expected="$(git -C server rev-parse :file1)" &&

This is signalling to me that you are looking for a remote fetch
now that you are repacking everything, and that can only happen
if you deleted objects from the client during your first repack.
That seems incorrect.

> +	git -C client rev-list --objects --all --missing=print >objects &&
> +	grep "$expected" objects
> +'

Based on my current understanding, this patch seems unnecessary (repacks
should already be doing the right thing when in the presence of a partial
clone) and incorrect (we should not delete existing reachable objects
when repacking with a filter).

I look forward to hearing more about your intended use of this feature so
we can land on a better way to solve the problems you are having.

Thanks,
-Stolee


On the Git mailing list, John Cai wrote (reply to this):

Hi Stolee,

Thanks for taking the time to review this patch! I added some points of clarification
down below.

On 27 Jan 2022, at 10:03, Derrick Stolee wrote:

> On 1/26/2022 8:49 PM, John Cai via GitGitGadget wrote:
>> From: John Cai <johncai86@gmail.com>
>>
>> Currently, repack does not work with partial clones. When repack is run
>> on a partially cloned repository, it grabs all missing objects from
>> promisor remotes. This also means that when gc is run for repository
>> maintenance on a partially cloned repository, it will end up getting
>> missing objects, which is not what we want.
>
> This shouldn't be what is happening. Do you have a demonstration of
> this happening? repack_promisor_objects() should be avoiding following
> links outside of promisor packs so we can safely 'git gc' in a partial
> clone without downloading all reachable blobs.

You're right, sorry I was mistaken about this detail of how partial clones work.
>
>> In order to make repack work with partial clone, teach repack a new
>> option --filter, which takes a <filter-spec> argument. repack will skip
>> any objects that are matched by <filter-spec> similar to how the clone
>> command will skip fetching certain objects.
>
> This is a bit misleading, since 'git clone' doesn't "skip fetching",
> but instead requests a filter and the server can choose to write a
> pack-file using that filter. I'm not sure if it's worth how pedantic
> I'm being here.

Thanks for the more precise description of the mechanics of partial clone.
I'll improve the wording in the next version of this patch series.

>
> The thing that I find confusing here is that you are adding an option
> that could be run on a _full_ repository. If I have a set of packs
> and none of them are promisor (I have every reachable object), then
> what is the end result after 'git repack -adf --filter=blob:none'?
> Those existing pack-files shouldn't be deleted because they have
> objects that are not in the newly-created pack-file.
>
> I'd like to see some additional clarity on this before continuing
> to review this series.

Apologies for the lack of clarity. Indeed, I can see why this is the most important
detail of this patch to provide enough context on, as it involves deleting
objects from a full repository as you said.

To back up a little, the goal is to be able to offload large
blobs to a separate http server. Christian Couder has a demo [1] that shows
this in detail.

If we had the following:
A. an http server to use as a generalized object store
B. a server update hook that uploads large blobs to 1.
C. a git server
D. a regular job that runs `git repack --filter` to remove large
blobs from C.

Clients would need to configure both C) and A) as promisor remotes to
be able to get everything. When they push new large blobs, they can
still push them to C), as B) will upload them to A), and D) will
regularly remove those large blobs from C).

This way with a little bit of client and server configuration, we can have
a native way to support offloading large files without git LFS.
It would be more flexible as you can easily tweak which blobs are considered large
files by tweaking B) and D).

>
>> The final goal of this feature, is to be able to store objects on a
>> server other than the regular git server itself.
>>
>> There are several scripts added so we can test the process of using a
>> remote helper to upload blobs to an http server:
>>
>> - t/lib-httpd/list.sh lists blobs uploaded to the http server.
>> - t/lib-httpd/upload.sh uploads blobs to the http server.
>> - t/t0410/git-remote-testhttpgit a remote helper that can access blobs
>>   on an http server. Copied over from t/t5801/git-remote-testhttpgit
>>   and modified to upload blobs to an http server.
>> - t/t0410/lib-http-promisor.sh convenience functions for uploading
>>   blobs
>
> I think these changes to the tests should be extracted to a new
> patch where this can be discussed in more detail. I didn't look
> too closely at them because I want to focus on whether this
> --filter option is a good direction for 'git repack'.
>
>>  		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
>> @@ -819,6 +824,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
>>  		if (line.len != the_hash_algo->hexsz)
>>  			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
>>  		string_list_append(&names, line.buf);
>> +		if (po_args.filter) {
>> +			char *promisor_name = mkpathdup("%s-%s.promisor", packtmp,
>> +							line.buf);
>> +			write_promisor_file(promisor_name, NULL, 0);
>
> This code is duplicated in repack_promisor_objects(), so it would be
> good to extract that logic into a helper method called by both places.

Thanks for pointing this out. I'll incorporate this into the next version.
>
>> +		}
>>  	}
>>  	fclose(out);
>>  	ret = finish_command(&cmd);
>
>> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
>> index e489869dd94..78cc1858cb6 100755
>> --- a/t/t7700-repack.sh
>> +++ b/t/t7700-repack.sh
>> @@ -237,6 +237,26 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
>>  	test_must_be_empty actual
>>  '
>>
>> +test_expect_success 'repack with filter does not fetch from remote' '
>> +	rm -rf server client &&
>> +	test_create_repo server &&
>> +	git -C server config uploadpack.allowFilter true &&
>> +	git -C server config uploadpack.allowAnySHA1InWant true &&
>> +	echo content1 >server/file1 &&
>> +	git -C server add file1 &&
>> +	git -C server commit -m initial_commit &&
>> +	expected="?$(git -C server rev-parse :file1)" &&
>> +	git clone --bare --no-local server client &&
>
>> You could use "file://$(pwd)/server" here instead of "server".

good point, thanks

>
>> +	git -C client config remote.origin.promisor true &&
>> +	git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
> This isn't testing what you want it to test, because your initial
> clone doesn't use --filter=blob:none, so you already have all of
> the objects in the client. You would never trigger a need for a
> fetch from the remote.

right, so this test is actually testing that repack --filter would shed objects to show
that it can be used as D) as a regular cleanup job for git servers that utilize another
http server to host large blobs.

>
>> +	git -C client rev-list --objects --all --missing=print >objects &&
>> +	grep "$expected" objects &&
>> +	git -C client repack -a -d &&
>> +	expected="$(git -C server rev-parse :file1)" &&
>
> This is signalling to me that you are looking for a remote fetch
> now that you are repacking everything, and that can only happen
> if you deleted objects from the client during your first repack.
> That seems incorrect.
>
>> +	git -C client rev-list --objects --all --missing=print >objects &&
>> +	grep "$expected" objects
>> +'
>
> Based on my current understanding, this patch seems unnecessary (repacks
> should already be doing the right thing when in the presence of a partial
> clone) and incorrect (we should not delete existing reachable objects
> when repacking with a filter).
>
> I look forward to hearing more about your intended use of this feature so
> we can land on a better way to solve the problems you are having.

Thanks for the callouts on the big picture of this proposed change. Looking
forward to getting your thoughts on this!
>
> Thanks,
> -Stolee


On the Git mailing list, Christian Couder wrote (reply to this):

On Sat, Jan 29, 2022 at 8:14 PM John Cai <johncai86@gmail.com> wrote:

> Apologies for the lack of clarity. Indeed, I can see why this is the most important
> detail of this patch to provide enough context on, as it involves deleting
> objects from a full repository as you said.
>
> To back up a little, the goal is to be able to offload large
> blobs to a separate http server. Christian Couder has a demo [1] that shows
> this in detail.

You might have forgotten to provide a link for [1], also I am not sure
if you wanted to link to the repo:

https://gitlab.com/chriscool/partial-clone-demo/

or the demo itself in the repo:

https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt

> If we had the following:
> A. an http server to use as a generalized object store
> B. a server update hook that uploads large blobs to 1.

s/1./A./

> C. a git server
> D. a regular job that runs `git repack --filter` to remove large
> blobs from C.
>
> Clients would need to configure both C) and A) as promisor remotes to

Maybe s/C)/C./ and s/A)/A./

Also note that configuring A. as a promisor remote requires a remote helper.

> be able to get everything. When they push new large blobs, they can
> still push them to C), as B) will upload them to A), and D) will
> regularly remove those large blobs from C).
>
> This way with a little bit of client and server configuration, we can have
> a native way to support offloading large files without git LFS.
> It would be more flexible as you can easily tweak which blobs are considered large
> files by tweaking B) and D).

Yeah, that's the idea of the demo.

Thanks for working on this!


On the Git mailing list, John Cai wrote (reply to this):

Sorry forgot to include the link to Christian's demo. included below

On 29 Jan 2022, at 14:14, John Cai wrote:

> [...]

[1] https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt
gitgitgadget-git (bot) commented Jan 27, 2022

User Derrick Stolee <stolee@gmail.com> has been added to the cc: list.

gitgitgadget-git (bot) commented Feb 7, 2022

There are issues in commit 7fec293:
repack: add --filter=<filter-spec> option
Lines in the body of the commit messages should be wrapped between 60 and 76 characters.

@john-cai john-cai force-pushed the jc-repack-filter branch 2 times, most recently from 9dbfdea to 307deba Compare Feb 8, 2022
gitgitgadget-git (bot) commented Feb 8, 2022

There are issues in commit 844dc6e:
repack: add --filter=<filter-spec> option
Lines in the body of the commit messages should be wrapped between 60 and 76 characters.

gitgitgadget-git (bot) commented Feb 9, 2022

There are issues in commit 844dc6e:
repack: add --filter=<filter-spec> option
Lines in the body of the commit messages should be wrapped between 60 and 76 characters.

@john-cai commented Feb 9, 2022

/preview

gitgitgadget-git (bot) commented Feb 9, 2022

There are issues in commit 844dc6e:
repack: add --filter=<filter-spec> option
Lines in the body of the commit messages should be wrapped between 60 and 76 characters.

john-cai added 4 commits Feb 9, 2022

pack-objects: allow --filter without --stdout

9535ce7 taught pack-objects to use filtering, but required that
--stdout be used, since a partial clone mechanism was not yet in place
to handle missing objects. Since then, changes like 9e27bea and others
added support to dynamically fetch objects that were missing.

Remove the --stdout requirement so that in the next commit, repack can
pass --filter to pack-objects to omit certain objects from the packfile.

Based-on-patch-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: John Cai <johncai86@gmail.com>
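
A sketch of the distinction this commit message draws (the repository name `demo` and file names are illustrative): the `--stdout` form has long been supported, while the on-disk form, shown commented out, is what this commit newly allows:

```shell
set -e
git init -q demo
echo payload > demo/data.bin
git -C demo add data.bin
git -C demo -c user.name=t -c user.email=t@example.com commit -qm add-blob

# Long-standing: a filtered pack written to stdout. It carries the
# commit and tree, but no blobs.
git -C demo rev-parse HEAD |
    git -C demo pack-objects --revs --filter=blob:none --stdout > filtered.pack

# With this patch applied, the same filter could produce an on-disk pack:
# git -C demo rev-parse HEAD |
#     git -C demo pack-objects --revs --filter=blob:none .git/objects/pack/pack
```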
repack: add --filter=<filter-spec> option

In order to use a separate http server as a remote to offload large
blobs, imagine the following:

A. an http server to use as a generalized object store.
B. a server update hook that uploads large blobs to (A).
C. a git server
D. a remote helper that knows how to download objects from the http
server
E. a regular job that runs `git repack --filter` to remove large
blobs from (C).

Clients would need to configure both (C) and (A) as promisor remotes to
be able to get everything. When they push new large blobs, they can
still push them to (C), as (B) will upload them to (A), and (E) will
regularly remove those large blobs from (C).

This way with a little bit of client and server configuration, we can
have a native way to support offloading large files without git LFS.
It would be more flexible as you can easily tweak which blobs are
considered large files by tweaking (B) and (E).

A fuller demo can be found at http://tiny.cc/object_storage_demo

Based-on-patch-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: John Cai <johncai86@gmail.com>
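
As a rough illustration of the client side of this setup: the remote name `objstore` and the helper URL below are hypothetical, not from the patch; only the config keys (`remote.<name>.promisor`, `remote.<name>.partialclonefilter`) are existing git configuration:

```shell
set -e
git init -q client

# (C) the regular git server, and (A) the object store reached through
# a remote helper; URLs are placeholders.
git -C client remote add origin https://git.example.com/repo.git
git -C client remote add objstore "testhttpgit::https://objects.example.com/repo"

# Mark the object store as a promisor remote holding the filtered-out blobs.
git -C client config remote.objstore.promisor true
git -C client config remote.objstore.partialclonefilter blob:none

# Missing objects may now be demand-fetched from a promisor remote.
git -C client config --get remote.objstore.promisor   # prints "true"
```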
upload-pack: allow missing promisor objects

When a git server (A) is being used alongside an http server (B) remote
that stores large blobs, and a client fetches objects from both (A) as
well as (B), we do not want (A) to fetch missing objects during object
traversal.

Add a config value uploadpack.allowmissingpromisor that, when set to
true, will allow (A) to skip fetching missing objects.

Based-on-patch-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: John Cai <johncai86@gmail.com>
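
Sketched below is how the proposed knob might sit alongside the existing upload-pack settings in the server repository's config file; `allowMissingPromisor` comes from this patch series and is not part of released git:

```ini
# Server repository config; allowMissingPromisor is proposed by this series.
[uploadpack]
	allowFilter = true
	allowMissingPromisor = true
```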
tests for repack --filter mode

This patch adds tests covering repack --filter both in isolation (in
t7700-repack.sh) and as a way to offload large blobs (in
t0410-partial-clone.sh).

There are several scripts added so we can test the process of using a
remote helper to upload blobs to an http server.

- t/lib-httpd/list.sh lists blobs uploaded to the http server.
- t/lib-httpd/upload.sh uploads blobs to the http server.
- t/t0410/git-remote-testhttpgit a remote helper that can access blobs
  on an http server. Copied over from t/t5801/git-remote-testhttpgit
  and modified to upload blobs to an http server.
- t/t0410/lib-http-promisor.sh convenience functions for uploading
  blobs

Based-on-patch-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: John Cai <johncai86@gmail.com>
@john-cai commented Feb 9, 2022

/preview

gitgitgadget-git (bot) commented Feb 9, 2022

@john-cai changed the title from "repack: add --filter=<filter-spec>" to "[RFC] repack: add --filter=<filter-spec>" on Feb 9, 2022
@john-cai commented Feb 9, 2022

/submit

gitgitgadget-git (bot) commented Feb 9, 2022

Submitted as pull.1206.v2.git.git.1644372606.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-git-1206/john-cai/jc-repack-filter-v2

To fetch this version to local tag pr-git-1206/john-cai/jc-repack-filter-v2:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-git-1206/john-cai/jc-repack-filter-v2

@gitgitgadget-git gitgitgadget-git bot commented Feb 9, 2022

On the Git mailing list, John Cai wrote (reply to this):

Hi Johannes

I'm not sure where I went wrong on GGG. Somehow the cc list didn't get translated into
cc fields. Here's the PR: https://github.com/git/git/pull/1206. Thanks!

cc'ing folks I meant to cc for this patch series

On 8 Feb 2022, at 21:10, John Cai via GitGitGadget wrote:

> This patch series makes partial clone more useful by making it possible to
> run repack to remove objects from a repository (replacing it with promisor
> objects). This is useful when we want to offload large blobs from a git
> server onto another git server, or even use an http server through a remote
> helper.
>
> In [A], a --refilter option on fetch and fetch-pack is being discussed where
> either a less restrictive or more restrictive filter can be used. In the
> more restrictive case, the objects that already exist will not be deleted.
> But, one can imagine that users might want the ability to delete objects
> when they apply a more restrictive filter in order to save space, and this
> patch series would also allow that.
>
> There are a couple of things we need to adjust to make this possible. This
> patch has three parts.
>
>  1. Allow --filter in pack-objects without --stdout
>  2. Add a --filter flag for repack
>  3. Allow missing promisor objects in upload-pack
>  4. Tests that demonstrate the ability to offload objects onto an http
>     remote
>
> cc: Christian Couder christian.couder@gmail.com
> cc: Derrick Stolee stolee@gmail.com
> cc: Robert Coup robert@coup.net.nz
>
> A.
> https://lore.kernel.org/git/pull.1138.git.1643730593.gitgitgadget@gmail.com/
>
> John Cai (4):
>   pack-objects: allow --filter without --stdout
>   repack: add --filter=<filter-spec> option
>   upload-pack: allow missing promisor objects
>   tests for repack --filter mode
>
>  Documentation/git-repack.txt   |   5 +
>  builtin/pack-objects.c         |   2 -
>  builtin/repack.c               |  22 +++--
>  t/lib-httpd.sh                 |   2 +
>  t/lib-httpd/apache.conf        |   8 ++
>  t/lib-httpd/list.sh            |  43 +++++++++
>  t/lib-httpd/upload.sh          |  46 +++++++++
>  t/t0410-partial-clone.sh       |  81 ++++++++++++++++
>  t/t0410/git-remote-testhttpgit | 170 +++++++++++++++++++++++++++++++++
>  t/t7700-repack.sh              |  20 ++++
>  upload-pack.c                  |   5 +
>  11 files changed, 395 insertions(+), 9 deletions(-)
>  create mode 100644 t/lib-httpd/list.sh
>  create mode 100644 t/lib-httpd/upload.sh
>  create mode 100755 t/t0410/git-remote-testhttpgit
>
>
> base-commit: 38062e73e009f27ea192d50481fcb5e7b0e9d6eb
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1206%2Fjohn-cai%2Fjc-repack-filter-v2
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1206/john-cai/jc-repack-filter-v2
> Pull-Request: https://github.com/git/git/pull/1206
>
> Range-diff vs v1:
>
>  1:  0eec9b117da = 1:  f43b76ca650 pack-objects: allow --filter without --stdout
>  -:  ----------- > 2:  6e7c8410b8d repack: add --filter=<filter-spec> option
>  -:  ----------- > 3:  40612b9663b upload-pack: allow missing promisor objects
>  2:  a3166381572 ! 4:  d76faa1f16e repack: add --filter=<filter-spec> option
>      @@ Metadata
>       Author: John Cai <johncai86@gmail.com>
>
>        ## Commit message ##
>      -    repack: add --filter=<filter-spec> option
>      +    tests for repack --filter mode
>
>      -    Currently, repack does not work with partial clones. When repack is run
>      -    on a partially cloned repository, it grabs all missing objects from
>      -    promisor remotes. This also means that when gc is run for repository
>      -    maintenance on a partially cloned repository, it will end up getting
>      -    missing objects, which is not what we want.
>      -
>      -    In order to make repack work with partial clone, teach repack a new
>      -    option --filter, which takes a <filter-spec> argument. repack will skip
>      -    any objects that are matched by <filter-spec> similar to how the clone
>      -    command will skip fetching certain objects.
>      -
>      -    The final goal of this feature, is to be able to store objects on a
>      -    server other than the regular git server itself.
>      +    This patch adds tests to test both repack --filter functionality in
>      +    isolation (in t7700-repack.sh) as well as how it can be used to offload
>      +    large blobs (in t0410-partial-clone.sh)
>
>           There are several scripts added so we can test the process of using a
>      -    remote helper to upload blobs to an http server:
>      +    remote helper to upload blobs to an http server.
>
>           - t/lib-httpd/list.sh lists blobs uploaded to the http server.
>           - t/lib-httpd/upload.sh uploads blobs to the http server.
>      @@ Commit message
>           Based-on-patch-by: Christian Couder <chriscool@tuxfamily.org>
>           Signed-off-by: John Cai <johncai86@gmail.com>
>
>      - ## Documentation/git-repack.txt ##
>      -@@ Documentation/git-repack.txt: depth is 4095.
>      - 	a larger and slower repository; see the discussion in
>      - 	`pack.packSizeLimit`.
>      -
>      -+--filter=<filter-spec>::
>      -+	Omits certain objects (usually blobs) from the resulting
>      -+	packfile. See linkgit:git-rev-list[1] for valid
>      -+	`<filter-spec>` forms.
>      -+
>      - -b::
>      - --write-bitmap-index::
>      - 	Write a reachability bitmap index as part of the repack. This
>      -
>      - ## builtin/repack.c ##
>      -@@ builtin/repack.c: struct pack_objects_args {
>      - 	const char *depth;
>      - 	const char *threads;
>      - 	const char *max_pack_size;
>      -+	const char *filter;
>      - 	int no_reuse_delta;
>      - 	int no_reuse_object;
>      - 	int quiet;
>      -@@ builtin/repack.c: static void prepare_pack_objects(struct child_process *cmd,
>      - 		strvec_pushf(&cmd->args, "--threads=%s", args->threads);
>      - 	if (args->max_pack_size)
>      - 		strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
>      -+	if (args->filter)
>      -+		strvec_pushf(&cmd->args, "--filter=%s", args->filter);
>      - 	if (args->no_reuse_delta)
>      - 		strvec_pushf(&cmd->args, "--no-reuse-delta");
>      - 	if (args->no_reuse_object)
>      -@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
>      - 				N_("limits the maximum number of threads")),
>      - 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
>      - 				N_("maximum size of each packfile")),
>      -+		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
>      -+				N_("object filtering")),
>      - 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
>      - 				N_("repack objects in packs marked with .keep")),
>      - 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
>      -@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
>      - 		if (line.len != the_hash_algo->hexsz)
>      - 			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
>      - 		string_list_append(&names, line.buf);
>      -+		if (po_args.filter) {
>      -+			char *promisor_name = mkpathdup("%s-%s.promisor", packtmp,
>      -+							line.buf);
>      -+			write_promisor_file(promisor_name, NULL, 0);
>      -+		}
>      - 	}
>      - 	fclose(out);
>      - 	ret = finish_command(&cmd);
>      -
>        ## t/lib-httpd.sh ##
>       @@ t/lib-httpd.sh: prepare_httpd() {
>        	install_script error-smart-http.sh
>      @@ t/t0410-partial-clone.sh: test_expect_success 'fetching of missing objects from
>       +	git -C server rev-list --objects --all --missing=print >objects &&
>       +	grep "$sha" objects
>       +'
>      ++
>      ++test_expect_success 'fetch does not cause server to fetch missing objects' '
>      ++	rm -rf origin server client &&
>      ++	test_create_repo origin &&
>      ++	dd if=/dev/zero of=origin/file1 bs=801k count=1 &&
>      ++	git -C origin add file1 &&
>      ++	git -C origin commit -m "large blob" &&
>      ++	sha="$(git -C origin rev-parse :file1)" &&
>      ++	expected="?$(git -C origin rev-parse :file1)" &&
>      ++	git clone --bare --no-local origin server &&
>      ++	git -C server remote add httpremote "testhttpgit::${PWD}/server" &&
>      ++	git -C server config remote.httpremote.promisor true &&
>      ++	git -C server config --remove-section remote.origin &&
>      ++	git -C server rev-list --all --objects --filter-print-omitted \
>      ++		--filter=blob:limit=800k | perl -ne "print if s/^[~]//" \
>      ++		>large_blobs.txt &&
>      ++	upload_blobs_from_stdin server <large_blobs.txt &&
>      ++	git -C server -c repack.writebitmaps=false repack -a -d \
>      ++		--filter=blob:limit=800k &&
>      ++	git -C server config uploadpack.allowmissingpromisor true &&
>      ++	git clone -c remote.httpremote.url="testhttpgit::${PWD}/server" \
>      ++	-c remote.httpremote.fetch='+refs/heads/*:refs/remotes/httpremote/*' \
>      ++	-c remote.httpremote.promisor=true --bare --no-local \
>      ++	--filter=blob:limit=800k server client &&
>      ++	git -C client rev-list --objects --all --missing=print >client_objects &&
>      ++	grep "$expected" client_objects &&
>      ++	git -C server rev-list --objects --all --missing=print >server_objects &&
>      ++	grep "$expected" server_objects
>      ++'
>       +
>        # DO NOT add non-httpd-specific tests here, because the last part of this
>        # test script is only executed when httpd is available and enabled.
>
> -- 
> gitgitgadget

@john-cai john-cai commented Feb 14, 2022

/preview

@gitgitgadget-git gitgitgadget-git bot commented Feb 14, 2022

john-cai changed the title from "[RFC] repack: add --filter=<filter-spec>" to "repack: add --filter=<filter-spec>" on Feb 14, 2022
john-cai changed the title from "repack: add --filter=<filter-spec>" to "[RFC] repack: add --filter=<filter-spec>" on Feb 14, 2022
@john-cai john-cai commented Feb 14, 2022

/preview

@gitgitgadget-git gitgitgadget-git bot commented Feb 14, 2022

@john-cai john-cai commented Feb 14, 2022

/submit

@gitgitgadget-git gitgitgadget-git bot commented Feb 14, 2022

Error: d76faa1 was already submitted

@gitgitgadget-git gitgitgadget-git bot commented Feb 16, 2022

On the Git mailing list, Robert Coup wrote (reply to this):

Hi John,

On Wed, 9 Feb 2022 at 02:41, John Cai via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This patch series makes partial clone more useful by making it possible to
> run repack to remove objects from a repository (replacing it with promisor
> objects). This is useful when we want to offload large blobs from a git
> server onto another git server, or even use an http server through a remote
> helper.
>
> In [A], a --refilter option on fetch and fetch-pack is being discussed where
> either a less restrictive or more restrictive filter can be used. In the
> more restrictive case, the objects that already exist will not be deleted.
> But, one can imagine that users might want the ability to delete objects
> when they apply a more restrictive filter in order to save space, and this
> patch series would also allow that.

This all makes sense to me, and the implementation is remarkably short -
gluing together capabilities that are already there, and writing tests.

*But*, running `repack --filter` drops objects from the object db.
That seems like
a capability Git shouldn't idly expose without people understanding the
consequences - mostly that they really have another copy elsewhere or they
will lose data, and it won't necessarily be obvious for a long time. Otherwise
it is a footgun.

I don't know whether that is just around naming (--delete-filter /
--drop-filter /
--expire-filter ?), and/or making the documentation very explicit that
this isn't so
much "omitting certain objects from a packfile" as irretrievably
deleting objects.

Rob :)

@gitgitgadget-git gitgitgadget-git bot commented Feb 16, 2022

User Robert Coup <robert.coup@koordinates.com> has been added to the cc: list.

@@ -136,6 +136,8 @@ prepare_httpd() {
install_script error-smart-http.sh


On the Git mailing list, Robert Coup wrote (reply to this):

Hi John,

Minor, but should we use oid rather than sha1 in the list.sh/upload.sh
scripts? wrt sha256 slowly coming along the pipe.

> diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
> index e489869dd94..78cc1858cb6 100755
> --- a/t/t7700-repack.sh
> +++ b/t/t7700-repack.sh
> @@ -237,6 +237,26 @@ test_expect_success 'auto-bitmaps do not complain if unavailable' '
>         test_must_be_empty actual
>  '
>
> +test_expect_success 'repack with filter does not fetch from remote' '
> +       rm -rf server client &&
> +       test_create_repo server &&
> +       git -C server config uploadpack.allowFilter true &&
> +       git -C server config uploadpack.allowAnySHA1InWant true &&
> +       echo content1 >server/file1 &&
> +       git -C server add file1 &&
> +       git -C server commit -m initial_commit &&
> +       expected="?$(git -C server rev-parse :file1)" &&
> +       git clone --bare --no-local server client &&
> +       git -C client config remote.origin.promisor true &&
> +       git -C client -c repack.writebitmaps=false repack -a -d --filter=blob:none &&

Does writing bitmaps have any effect/interaction here?

> +       git -C client rev-list --objects --all --missing=print >objects &&
> +       grep "$expected" objects &&

This is testing the object that was cloned initially is gone after the
repack, ok.

> +       git -C client repack -a -d &&
> +       expected="$(git -C server rev-parse :file1)" &&
> +       git -C client rev-list --objects --all --missing=print >objects &&
> +       grep "$expected" objects

But I'm not sure what you're testing here? A repack wouldn't fetch
missing objects for a promisor pack anyway... and because there's no
'^' in the pattern the grep will succeed regardless of whether the
object is missing/present.

Rob :)
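Rob's point about the unanchored grep follows from the shape of `rev-list --missing=print` output, which prefixes missing objects with `?` while present objects appear bare. A minimal illustration (the oid below is a hypothetical stand-in for a full object id):

```shell
# `git rev-list --missing=print` emits missing objects as "?<oid>" and
# present objects as plain "<oid>". An unanchored grep matches either
# form; anchoring the pattern with "^" distinguishes the two.
echo "?1234abcd" >objects            # simulate a missing object
grep "1234abcd" objects              # matches whether missing or present
grep "^1234abcd" objects || echo "object not present"
# prints: ?1234abcd
# prints: object not present
```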

@gitgitgadget-git gitgitgadget-git bot commented Feb 26, 2022

On the Git mailing list, John Cai wrote (reply to this):

Thank you for everyone's feedback. Really appreciate the collaboration!

On 23 Feb 2022, at 14:31, Junio C Hamano wrote:

> Christian Couder <christian.couder@gmail.com> writes:
>
>>> For what it's worth, I am fine having a mode of repack which allows us
>>> to remove objects that we know are stored by a promisor remote. But this
>>> series doesn't do that, so users could easily run `git repack -d
>>> --filter=...` and find that they have irrecoverably corrupted their
>>> repository.
>>
>> In some cases we just know the objects we are removing are stored by a
>> promisor remote or are replicated on different physical machines or
>> both, so you should be fine with this.
>
> So, we need to decide if an object we have that is outside the
> narrowed filter definition was (and still is, but let's keep the
> assumption the whole lazy clone mechanism makes: promisor remotes
> will never shed objects that they once served) available at the
> promisor remote, but I suspect we have too little information to
> reliably do so.  It is OK to assume that objects in existing packs
> taken from the promisor remotes and everything reachable from them
> (but missing from our object store) will be available to us from
> there.  But if we see an object that is outside of _new_ filter spec
> (e.g. you fetched with "max 100MB", now you are refiltering with
> "max 50MB", narrowing the spec, and you need to decide for an object
> that weigh 70MB), can we tell if that can be retrieved from the
> promisor or is it unique to our repository until we push it out?  I
> am not sure.  For that matter, do we even have a way to compare if
> the new filter spec is a subset, a superset, or neither, of the
> original filter spec?

let me try to summarize (perhaps oversimplify) the main concern folks
with this feature, so please correct me if I'm wrong!

As a user, if I apply a filter that ends up deleting objects that it turns
out do not exist anywhere else, then I have irrecoverably corrupted my
repository.

Before git allows me to delete objects from my repository, it should be pretty
certain that I have path to recover those objects if I need to.

Is that correct? It seems to me that, put another way, we don't want to give
users too much rope to hang themselves.

I can see why we would want to do this. In this case, there have been a couple
of alternative ideas proposed throughout this thread that I think are viable and
I wanted to get folks thoughts.

1. split pack file - (Rob gave this idea and Taylor provided some more detail on
   how using pack-objects would make it fairly straightforward to implement)

when a user wants to apply a filter that removes objects from their repository,
split the packfile into one containing objects that are filtered out, and
another packfile with objects that remain.

pros: simple to implement
cons: does not address the question "how sure am I that the objects I want to
filter out of my repository exist on a promisor remote?"

2. check the promisor remotes to see if they contain the objects that are about
   to get deleted. Only delete objects that we find on promisor remotes.

pros: provides assurance that objects I am about to delete remain accessible
from a promisor remote.
cons: more complex to implement. [*]

Out of these two, I like 2 more for the aforementioned pros.

* I am beginning to look into how fetches work and am still pretty new to the
codebase so I don't know if this is even feasible, but I was thinking perhaps
we could write a function that fetches with a --filter and creates a promisor
packfile containing promisor objects (this operation would have to somehow
ignore the presence of the actual objects in the repository). Then, we would
have a record of objects we have access to. Then, repack --filter can remove
only the objects contained in this promisor packfile.

>
>> If you are not fine with this because sometimes a user might use it
>> without knowing, then why are you ok with commands deleting refs not
>> checking that there isn't a regular repack removing dangling objects?
>
> Sorry, I do not follow this argument.  Your user may do "branch -D"
> because the branch deleted is no longer needed, which may mean that
> objects only reachable from the deleted branch are no longer needed.
> I do not see what repack has anything to do with that.  As long as
> the filter spec does not change (in other words, before this series
> is applied), the repack that discards objects that are known to be
> reachable from objects in packs retrieved from promisor remote, the
> objects that are no longer reachable may be removed and that will
> not lose objects that we do not know to be retrievable from there
> (which is different from objects that we know are unretrievable).
> But with filter spec changing after the fact, I am not sure if that
> is safe.  IOW, "commands deleting refs" may have been OK without
> this series, but this series may be what makes it not OK, no?
>
> Puzzled.

@gitgitgadget-git gitgitgadget-git bot commented Feb 26, 2022

On the Git mailing list, Taylor Blau wrote (reply to this):

On Sat, Feb 26, 2022 at 11:01:46AM -0500, John Cai wrote:
> let me try to summarize (perhaps over simplify) the main concern folks
> have with this feature, so please correct me if I'm wrong!
>
> As a user, if I apply a filter that ends up deleting objects that it
> turns out do not exist anywhere else, then I have irrecoverably
> corrupted my repository.
>
> Before git allows me to delete objects from my repository, it should
> be pretty certain that I have path to recover those objects if I need
> to.
>
> Is that correct? It seems to me that, put another way, we don't want
> to give users too much rope to hang themselves.

I wrote about my concerns in some more detail in [1], but the thing I
was most unclear on was how your demo script[2] was supposed to work.

Namely, I wasn't sure if you had intended to use two separate filters to
"re-filter" a repository, one to filter objects to be uploaded to a
content store, and another to filter objects to be expunged from the
repository. I have major concerns with that approach, namely that if
each of the filters is not exactly the inverse of the other, then we
will either upload too few objects, or delete too many.

My other concern was around what guarantees we currently provide for a
promisor remote. My understanding is that we expect an object which was
received from the promisor remote to always be fetch-able later on. If
that's the case, then I don't mind the idea of refiltering a repository,
provided that you only need to specify a filter once.

So the suggestion about splitting a repository into two packs was a
potential way to mediate the "two filter" problem, since the two packs
you get exactly correspond to the set of objects that match the filter,
and the set of objects that _don't_ match the filter.

In either case, I tried to use the patches in [1] and was able to
corrupt my local repository (even when fetching from a remote that held
onto the objects I had pruned locally).

Thanks,
Taylor

[1]: https://lore.kernel.org/git/YhUeUCIetu%2FaOu6k@nand.local/
[2]: https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt#L47-52

@gitgitgadget-git gitgitgadget-git bot commented Feb 26, 2022

User Taylor Blau <me@ttaylorr.com> has been added to the cc: list.

@gitgitgadget-git gitgitgadget-git bot commented Feb 26, 2022

On the Git mailing list, John Cai wrote (reply to this):

Hi Taylor,

(resending this because my email client misbehaved and set the mime type to html)

On 26 Feb 2022, at 12:29, Taylor Blau wrote:

> On Sat, Feb 26, 2022 at 11:01:46AM -0500, John Cai wrote:
>> let me try to summarize (perhaps over simplify) the main concern folks
>> have with this feature, so please correct me if I'm wrong!
>>
>> As a user, if I apply a filter that ends up deleting objects that it
>> turns out do not exist anywhere else, then I have irrecoverably
>> corrupted my repository.
>>
>> Before git allows me to delete objects from my repository, it should
>> be pretty certain that I have path to recover those objects if I need
>> to.
>>
>> Is that correct? It seems to me that, put another way, we don't want
>> to give users too much rope to hang themselves.
>
> I wrote about my concerns in some more detail in [1], but the thing I
> was most unclear on was how your demo script[2] was supposed to work.
>
> Namely, I wasn't sure if you had intended to use two separate filters to
> "re-filter" a repository, one to filter objects to be uploaded to a
> content store, and another to filter objects to be expunged from the
> repository. I have major concerns with that approach, namely that if
> each of the filters is not exactly the inverse of the other, then we
> will either upload too few objects, or delete too many.

Thanks for bringing this up again. I meant to write back regarding what you raised
in the other part of this thread. I think this is a valid concern. To attain the
goal of offloading certain blobs onto another server(B) and saving space on a git
server(A), then there will essentially be two steps. One to upload objects to (B),
and one to remove objects from (A). As you said, these two need to be the inverse of each
other or else you might end up with missing objects.

Thinking about it more, there is also an issue of timing. Even if the filters
are exact inverses of each other, let's say we have the following order of
events:

- (A)'s large blobs get uploaded to (B)
- large blob (C) gets added to (A)
- (A) gets repacked with a filter

In this case we could lose (C) forever. So it does seem like we need some
built-in guarantee that we only shed objects from the repo if we know we can
retrieve them later on.

>
> My other concern was around what guarantees we currently provide for a
> promisor remote. My understanding is that we expect an object which was
> received from the promisor remote to always be fetch-able later on. If
> that's the case, then I don't mind the idea of refiltering a repository,
> provided that you only need to specify a filter once.

Could you clarify what you mean by re-filtering a repository? By that I assumed
it meant specifying a filter eg: 100mb, and then narrowing it by specifying a
50mb filter.
>
> So the suggestion about splitting a repository into two packs was a
> potential way to mediate the "two filter" problem, since the two packs
> you get exactly correspond to the set of objects that match the filter,
> and the set of objects that _don't_ match the filter.
>
> In either case, I tried to use the patches in [1] and was able to
> corrupt my local repository (even when fetching from a remote that held
> onto the objects I had pruned locally).
>
> Thanks,
> Taylor

Thanks!
John

>
> [1]: https://lore.kernel.org/git/YhUeUCIetu%2FaOu6k@nand.local/
> [2]: https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt#L47-52

@gitgitgadget-git gitgitgadget-git bot commented Feb 26, 2022

On the Git mailing list, Taylor Blau wrote (reply to this):

On Sat, Feb 26, 2022 at 03:19:11PM -0500, John Cai wrote:
> Thanks for bringing this up again. I meant to write back regarding what you raised
> in the other part of this thread. I think this is a valid concern. To attain the
> goal of offloading certain blobs onto another server(B) and saving space on a git
> server(A), then there will essentially be two steps. One to upload objects to (B),
> and one to remove objects from (A). As you said, these two need to be the inverse of each
> other or else you might end up with missing objects.

Do you mean that you want to offload objects both from a local clone of
some repository, _and_ the original remote it was cloned from?

I don't understand what the role of "another server" is here. If this
proposal was about making it easy to remove objects from a local copy of
a repository based on a filter provided that there was a Git server
elsewhere that could act as a promisor remote, then that makes sense to
me.

But I think I'm not quite understanding the rest of what you're
suggesting.

> > My other concern was around what guarantees we currently provide for a
> > promisor remote. My understanding is that we expect an object which was
> > received from the promisor remote to always be fetch-able later on. If
> > that's the case, then I don't mind the idea of refiltering a repository,
> > provided that you only need to specify a filter once.
>
> Could you clarify what you mean by re-filtering a repository? By that I assumed
> it meant specifying a filter eg: 100mb, and then narrowing it by specifying a
> 50mb filter.

I meant: applying a filter to a local clone (either where there wasn't a
filter before, or a filter which matched more objects) and then removing
objects that don't match the filter.

But your response makes me think of another potential issue. What
happens if I do the following:

    $ git repack -ad --filter=blob:limit=100k
    $ git repack -ad --filter=blob:limit=200k

What should the second invocation do? I would expect that it needs to do
a fetch from the promisor remote to recover any blobs between (100, 200]
KB in size, since they would be gone after the first repack.

This is a problem not just with two consecutive `git repack --filter`s,
I think, since you could cook up the same situation with:

    $ git clone --filter=blob:limit=100k git@github.com:git
    $ git -C git repack -ad --filter=blob:limit=200k

I don't think the existing patches handle this situation, so I'm curious
whether it's something you have considered or not before.

(Unrelated to the above, but please feel free to trim any quoted parts
of emails when responding if they get overly long.)

Thanks,
Taylor

@gitgitgadget-git gitgitgadget-git bot commented Feb 26, 2022

On the Git mailing list, John Cai wrote (reply to this):

Hi Taylor,

On 26 Feb 2022, at 15:30, Taylor Blau wrote:

> On Sat, Feb 26, 2022 at 03:19:11PM -0500, John Cai wrote:
>> Thanks for bringing this up again. I meant to write back regarding what you raised
>> in the other part of this thread. I think this is a valid concern. To attain the
>> goal of offloading certain blobs onto another server(B) and saving space on a git
>> server(A), then there will essentially be two steps. One to upload objects to (B),
>> and one to remove objects from (A). As you said, these two need to be the inverse of each
>> other or else you might end up with missing objects.
>
> Do you mean that you want to offload objects both from a local clone of
> some repository, _and_ the original remote it was cloned from?

yes, exactly. The "another server" would be something like an http server, OR another remote
which hosts a subset of the objects (let's say the large blobs).
>
> I don't understand what the role of "another server" is here. If this
> proposal was about making it easy to remove objects from a local copy of
> a repository based on a filter provided that there was a Git server
> elsewhere that could act as a promisor remote, than that makes sense to
> me.
>
> But I think I'm not quite understanding the rest of what you're
> suggesting.

Sorry for the lack of clarity here. The goal is to make it easy for a remote to offload a subset
of its objects to __another__ remote (either a Git server or an http server through a remote helper).
>
>>> My other concern was around what guarantees we currently provide for a
>>> promisor remote. My understanding is that we expect an object which was
>>> received from the promisor remote to always be fetch-able later on. If
>>> that's the case, then I don't mind the idea of refiltering a repository,
>>> provided that you only need to specify a filter once.
>>
>> Could you clarify what you mean by re-filtering a repository? By that I assumed
>> it meant specifying a filter eg: 100mb, and then narrowing it by specifying a
>> 50mb filter.
>
> I meant: applying a filter to a local clone (either where there wasn't a
> filter before, or a filter which matched more objects) and then removing
> objects that don't match the filter.
>
> But your response makes me think of another potential issue. What
> happens if I do the following:
>
>     $ git repack -ad --filter=blob:limit=100k
>     $ git repack -ad --filter=blob:limit=200k
>
> What should the second invocation do? I would expect that it needs to do
> a fetch from the promisor remote to recover any blobs between (100, 200]
> KB in size, since they would be gone after the first repack.
>
> This is a problem not just with two consecutive `git repack --filter`s,
> I think, since you could cook up the same situation with:
>
>     $ git clone --filter=blob:limit=100k git@github.com:git
>     $ git -C git repack -ad --filter=blob:limit=200k
>
> I don't think the existing patches handle this situation, so I'm curious
> whether it's something you have considered or not before.

I have not; I will have to think through this case, but this sounds similar to
what [1] is about.

>
> (Unrelated to the above, but please feel free to trim any quoted parts
> of emails when responding if they get overly long.)
>
> Thanks,
> Taylor

Thanks
John

1. https://lore.kernel.org/git/pull.1138.v2.git.1645719218.gitgitgadget@gmail.com/

@gitgitgadget-git gitgitgadget-git bot commented Feb 26, 2022

On the Git mailing list, Taylor Blau wrote (reply to this):

On Sat, Feb 26, 2022 at 04:05:37PM -0500, John Cai wrote:
> Hi Taylor,
>
> On 26 Feb 2022, at 15:30, Taylor Blau wrote:
>
> > On Sat, Feb 26, 2022 at 03:19:11PM -0500, John Cai wrote:
> >> Thanks for bringing this up again. I meant to write back regarding what you raised
> >> in the other part of this thread. I think this is a valid concern. To attain the
> >> goal of offloading certain blobs onto another server (B) and saving space on a git
> >> server (A), there will essentially be two steps: one to upload objects to (B),
> >> and one to remove objects from (A). As you said, these two need to be the inverse
> >> of each other or else you might end up with missing objects.
> >
> > Do you mean that you want to offload objects both from a local clone of
> > some repository, _and_ the original remote it was cloned from?
>
> yes, exactly. The "another server" would be something like an http server, OR another remote
> which hosts a subset of the objects (let's say the large blobs).
> >
> > I don't understand what the role of "another server" is here. If this
> > proposal was about making it easy to remove objects from a local copy of
> > a repository based on a filter provided that there was a Git server
> > elsewhere that could act as a promisor remote, then that makes sense to
> > me.
> >
> > But I think I'm not quite understanding the rest of what you're
> > suggesting.
>
> Sorry for the lack of clarity here. The goal is to make it easy for a remote to offload a subset
> of its objects to __another__ remote (either a Git server or an http server through a remote helper).

Does the other server then act as a promisor remote in conjunction with
the Git server? I'm having trouble understanding why the _Git_ remote
you originally cloned from needs to offload its objects, too.

So I think the list would benefit from understanding some more of the
details and motivation there. But it would also benefit us to have some
understanding of how we'll ensure that any objects which are moved out
of a Git repository make their way to another server.

I am curious to hear Jonathan Tan's thoughts on this all, too.

Thanks,
Taylor
