
BuckEvent: provide a Buck Event Publisher proto #685

Open · sluongng wants to merge 1 commit into main

Conversation

sluongng
Contributor

Provide a BuckEvent Publisher service which emits BuckEvents to an
external server implementation.

Closes #226

@facebook-github-bot added the CLA Signed label Jun 17, 2024
@sluongng sluongng marked this pull request as ready for review June 17, 2024 14:02
@facebook-github-bot
Contributor

@cjhopman has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@JakobDegen
Contributor

Copying some comments left on phabricator:

I think we need the rest of the stack to accept this.
It is not clear how this will be used.
(Also I'm not 100% sure grpc is the right choice, maybe simple piping to stdin would be sufficient).

I think I agree that I'd like to see a little more code before landing this just yet


message BuckEventResponse {
  // A trace-unique 64-bit integer identifying the stream.
  uint64 stream_id = 1;
Contributor
Instead of having stream_id, we can just create a new gRPC call for each stream; that would be more natural.

Contributor
Actually, to partially respond to Stiopa on this: That seems hard to practically implement on the buck2 side, since we may create a number of different instances of the client for each command. And even if we didn't do that, it seems unwise to prevent ourselves from doing that in the future

@JakobDegen
Contributor

Actually, I have another thing to consider: Internally, we've found that the size of the event logs can be extremely substantial and that keeping the entire mechanism reliable hasn't been super straightforward. Now, while I don't know if anyone else's repo is big enough to be affected by that, and I'm sure that some of the unreliability is a result of misbehaviors specific to our infra, I can also imagine that some of the lessons learned from that are more generally useful.

To go into detail a bit, the way our scribe client works is roughly that it allocates the "real" client as a global, and gets a handle to it for each instance of the type you see. The global then keeps a queue of messages that is used to 1) batch writes, and 2) rate limit itself. While the rate limiting is probably not necessary in OSS, the write batching might be of interest. That might be something to consider for this API. Hopefully that information also explains what send_now does. I'm going to try and get our client open sourced for you to at least look at and evaluate what parts of it you want to reuse or not.

facebook-github-bot pushed a commit that referenced this pull request Jul 2, 2024
Summary: This won't be able to build in OSS, but we can at least make the code available. Of interest specifically to #685

Reviewed By: ndmitchell

Differential Revision: D59262895

fbshipit-source-id: e4e72a402e174dedd08715a627a84e7b16d1a225
@sluongng
Contributor Author

sluongng commented Jul 3, 2024

There seem to be 2 concerns here:

  1. Client initialization and stream_id:

I plan to mimic Bazel's approach here. In Bazel, each "invocation" gets a unique stream ID. As events are sent to the server, the server replies with an "ack" (identified via trace_id) for each event that has been persisted. A network disruption could happen, and in such cases the client might want to retry sending the events that were not acked by the server. Upon resend, the events could be sent to a different server instance, so the stream_id is needed to identify the retry stream as a continuation of the previously disrupted stream (see the sketch after this comment).

  2. Batching:

I am not sure why batching is needed. Multiple gRPC streams can multiplex over a single HTTP/2 connection, so the overhead of sending events one by one is relatively low. I could see a potential benefit in compressing multiple events together to send as a batch, but I would argue that this should be left to the server implementation to optimize (e.g. some implementations could provide a local forwarding proxy that handles the batching).

In Bazel, it's better to stream the events as soon as they arrive, since some events contain console log information, which enables the server to implement an alternative build UI on its website. I could imagine that we might be able to extend BuckEvent to support this use case (akin to the removed ControlEvent).

Could you please elaborate a bit more on why batch sending needs to be implemented on the client side? Perhaps you are thinking of potential non-grpc client implementations?
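
(As a rough illustration of the ack/retry flow described in point 1, the messages could look something like the sketch below. All names and field numbers here are hypothetical, loosely modeled on Bazel's Build Event Service, and are not the definitions proposed in this PR.)

service BuckEventPublisher {
  // One bidirectional stream per invocation. A client that loses its
  // connection opens a new stream carrying the same stream_id, so the
  // server can treat it as a resumption of the interrupted stream.
  rpc PublishBuckEvents(stream PublishBuckEventRequest)
      returns (stream PublishBuckEventResponse);
}

message PublishBuckEventRequest {
  // Identifies the logical stream across retries, even if the resent events
  // end up on a different server instance.
  uint64 stream_id = 1;
  // Monotonically increasing within a stream; the server acks by this number.
  uint64 sequence_number = 2;
  // The existing BuckEvent payload (import assumed).
  BuckEvent event = 3;
}

message PublishBuckEventResponse {
  uint64 stream_id = 1;
  // Highest sequence number the server has durably persisted. On reconnect,
  // the client resends everything after this point.
  uint64 acked_sequence_number = 2;
}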

@sluongng
Contributor Author

sluongng commented Jul 3, 2024

I think we need the rest of the stack to accept this.
It is not clear how this will be used.
(Also I'm not 100% sure grpc is the right choice, maybe simple piping to stdin would be sufficient).

From my perspective, it could potentially be a huge time investment to code the rest of the implementation without agreeing on the spec first. At least some agreement on the gRPC service would be a positive signal for me to put more time into this.

The current design is modeled after PublishBuildToolEventStream (https://github.com/bazelbuild/bazel/blob/0b280ac442aa51c7cca1c311658dbd1a6346dff5/third_party/googleapis/google/devtools/build/v1/publish_build_event.proto#L72-L84), which is a battle-tested RPC with multiple server implementations available in the REAPI ecosystem. A sample client implementation is available in Bazel (Java), and there are plenty of proxy implementations available on GitHub.
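
(For reference, the shape of the Bazel RPC linked above is approximately the following; see the linked file for the authoritative definition. Each request wraps an OrderedBuildEvent carrying a stream ID and a sequence number, and the server acknowledges persisted events by echoing those back.)

service PublishBuildEvent {
  rpc PublishBuildToolEventStream(stream PublishBuildToolEventStreamRequest)
      returns (stream PublishBuildToolEventStreamResponse);
}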

With that said, I do understand that this is going to be a Buck2-specific API. So if there are any alternative proposals, I would love to hear them.

@JakobDegen
Contributor

Could you please elaborate a bit more on why batch sending needs to be implemented on the client side?

So the thing to keep in mind is that our event sending is done synchronously in a bunch of latency sensitive places. While a benchmark would certainly be much better than guessing, my expectation is that doing a network op for each of these events might be prohibitively slow, at least in some of our builds.

That being said, I don't know that we have to figure this out now. Happy to let you experiment and see if it matters or not.
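
(If per-event network operations do turn out to be too slow, one way batching could surface at the API level is a request message that carries several events at once; the names below are hypothetical and not part of this PR.)

message PublishBuckEventBatchRequest {
  uint64 stream_id = 1;
  // Sequence number of the first event in the batch; the rest follow in order.
  uint64 first_sequence_number = 2;
  repeated BuckEvent events = 3;
}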

From my perspective, it could potentially be a huge time investment to code the rest of the implementation without agreeing on the spec first. At least some agreement on the gRPC service would be a positive signal for me to put more time into this.

Yeah, sorry, I was a bit too brief before in just asking for more code. I think you're right to ask for alignment on use of a GRPC API in general, and I'll bring that up at our team meeting tomorrow to make sure we're all good to move forward in general.

The main thing I would ask for from your side is that the service definition itself is clearly marked as being unstable until we've gathered some experience with using it. The way we've reported events to our own service has changed quite a lot in the past, and I imagine that much of that configuration may at some point need room in the API (the first thing that comes to mind is that we only send a subset of events right now, and I imagine not all consumers are going to agree on which subset is the right one...).

Other than that, I don't think it makes much of a difference whether we merge this PR now or wait for some additional code on the client before doing so, but hopefully the things I mentioned above are enough for you to be able to continue work on your side

@sluongng
Contributor Author

sluongng commented Jul 8, 2024

Yeah, sorry, I was a bit too brief before in just asking for more code. I think you're right to ask for alignment on use of a GRPC API in general, and I'll bring that up at our team meeting tomorrow to make sure we're all good to move forward in general.

Thank you so much. Looking forward to this.

So the thing to keep in mind is that our event sending is done synchronously in a bunch of latency sensitive places. While a benchmark would certainly be much better than guessing, my expectation is that doing a network op for each of these events might be prohibitively slow, at least in some of our builds.

This is a good call out.

In Bazel, there is a flag that switches between sending build events asynchronously and synchronously. The reason for synchronous sending is that in many OSS setups, the CI worker is an ephemeral container/VM that gets shut down right after the build is finished. In those scenarios, it might benefit folks to turn on synchronous event sending so that their builds wait for all events to be sent before exiting and initiating container shutdown.

I think sending events in async/sync mode is tangential to sending events via a stream vs. a batch request. Let's not conflate the two. There are ways to implement an async event sink using a gRPC bidi stream, as well as a batch request, while still supporting queue bypassing for send_now. For example, we could maintain an internal priority queue in the Buck daemon, and send_now events would always get the highest priority. I think we can hash this out in future PRs when we dive into implementation details.

the service definition itself is clearly marked as being unstable until we've gathered some experience with using it.

I have no problem with this. If there is any existing convention in Buck2 to mark an unstable service, I would be happy to follow it. If there isn't, I will name the configuration key that activates this something like unstable_buck_event_service = grpcs://my.service.com/buckevent or experimental_buck_event_service = grpcs://my.service.com/buckevent. We could also rename the .proto file or the service name itself, although I think that's a bit overkill.

Let me know what you prefer here.

Other than that, I don't think it makes much of a difference whether we merge this PR now or wait for some additional code on the client before doing so, but hopefully the things I mentioned above are enough for you to be able to continue work on your side

I think the main difference will be who has to pay the rebase cost.

If the review time for future PRs is short, I have no problem leaving this unmerged. However, if the lead time is long, I will have to rebase my PR stack and account for existing refactoring (codemods) in the Buck2 repo, which could be a pain to deal with. So I would prefer to have this merged earlier rather than later.

@sluongng
Contributor Author

@JakobDegen friendly ping. I am looking for a confirmation on the gRPC API direction before investing more time into this.

@JakobDegen
Contributor

Alright, sorry for the delay. We had some back and forth on this internally and are ourselves not quite sure what we think the right answer is.

The reason for synchronous sending is that in many OSS setups, the CI worker is an ephemeral container/VM that gets shut down right after the build is finished. In those scenarios, it might benefit folks to turn on synchronous event sending so that their builds wait for all events to be sent before exiting and initiating container shutdown.

So we actually have the same problem, which is why even though our event sending is asynchronous in the sense that it doesn't do a network op on each send call, we will still make sure to flush all our buffers at the end of the command before returning to the user. I don't think this is the reason you should be electing to go full sync.

I think sending events in async/sync mode is tangential to sending events via a stream vs. a batch request. Let's not conflate the two. There are ways to implement an async event sink using a gRPC bidi stream, as well as a batch request, while still supporting queue bypassing for send_now. For example, we could maintain an internal priority queue in the Buck daemon, and send_now events would always get the highest priority. I think we can hash this out in future PRs when we dive into implementation details.

👍

I have no problem with this. If there is any existing convention in Buck2 to mark an unstable service

I think a comment at the top of the proto file would be more than enough
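
(For example, something along these lines at the top of the .proto file; the exact wording is only a suggestion.)

// WARNING: This service definition is experimental and UNSTABLE. It may
// change in incompatible ways without notice while we gather experience
// with external consumers.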

If the review time for future PRs is short, I have no problem leaving this unmerged. However, if the lead time is long, I will have to rebase my PR stack and account for existing refactoring (codemods) in the Buck2 repo, which could be a pain to deal with. So I would prefer to have this merged earlier rather than later.

Yeah that's fair. Let's merge this then and we can iterate

I'll add one additional point that came up internally:

Given how big logs can sometimes be, and that users are often working from home on not-great internet (or god forbid, from a hotel or plane or something), the default state was that we had complete data for not even 90% of builds. Fixing that to get 99% complete data was a significant investment for us. The machine-local queues we currently have in a couple of different places, which allow us to bridge periods of reduced internet connectivity, were a big contributing part of that. It's not blocking, but I wanted to at least raise awareness of it so you get a chance to consider it.

If you're otherwise comfortable with this, let me know and I'll go ahead and merge it

Labels: CLA Signed
Successfully merging this pull request may close these issues: API design for BuckEvent data
3 participants