[Bug]: PreparePubsubWriteDoFn does not guarantee that a message will not exceed hard request limits #31800

Open
sjvanrossum opened this issue Jul 8, 2024 · 0 comments

What happened?

While working on cleaning up PubsubIO.Write to support writes with ordering keys without explicit bucketing, I noticed that PreparePubsubWriteDoFn does not guarantee that a single message fits within the limits of a PublishRequest for either gRPC or REST clients.

The validation routine in PreparePubsubWriteDoFn only validates the explicit message size and counts 6 bytes of overhead for every attribute. This works often enough if maxPublishBatchSize is set to < 9 MB for the gRPC client, but it is more likely to fail for REST clients because there is additional overhead to consider: the data field is sent as a base64-encoded string, and string fields like attribute keys/values and ordering keys may require escape sequences (minimally required: \", \\, \b, \t, \n, \f, \r, and \uXXXX for control characters 0-31). For JSON clients the data field must be < 7.5 MB, attribute size depends on the escape-sequence overhead, and requests should not exceed 10 MiB.
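
For illustration, here is a minimal sketch (not the actual PreparePubsubWriteDoFn code) of how the JSON/REST size of a single message could be estimated, assuming the data field is base64 encoded and attribute keys/values and the ordering key are JSON escaped; the class and method names are made up for this example:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical estimator, not part of Beam: approximates the serialized size of one
// message in a JSON PublishRequest, accounting for base64 expansion of the data field
// and escape sequences in attribute keys/values and the ordering key.
public class JsonMessageSizeEstimator {

  // Base64 expands every 3 input bytes to 4 output characters.
  static int base64Length(int payloadBytes) {
    return ((payloadBytes + 2) / 3) * 4;
  }

  // UTF-8 byte length of a string after JSON escaping (\", \\, \b, \t, \n, \f, \r, \uXXXX).
  static int jsonEscapedLength(String s) {
    StringBuilder escaped = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case '"': escaped.append("\\\""); break;
        case '\\': escaped.append("\\\\"); break;
        case '\b': escaped.append("\\b"); break;
        case '\t': escaped.append("\\t"); break;
        case '\n': escaped.append("\\n"); break;
        case '\f': escaped.append("\\f"); break;
        case '\r': escaped.append("\\r"); break;
        default:
          if (c < 0x20) {
            escaped.append(String.format("\\u%04x", (int) c)); // other control characters
          } else {
            escaped.append(c);
          }
      }
    }
    return escaped.toString().getBytes(StandardCharsets.UTF_8).length;
  }

  // Rough per-message size in a JSON publish request, ignoring fixed field-name overhead.
  static int estimateJsonMessageSize(byte[] payload, Map<String, String> attributes, String orderingKey) {
    int size = base64Length(payload.length);
    for (Map.Entry<String, String> attribute : attributes.entrySet()) {
      // Quotes around key and value, the colon and a separating comma add roughly 6 bytes.
      size += jsonEscapedLength(attribute.getKey()) + jsonEscapedLength(attribute.getValue()) + 6;
    }
    if (orderingKey != null && !orderingKey.isEmpty()) {
      size += jsonEscapedLength(orderingKey);
    }
    return size;
  }
}
```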

For gRPC clients, the static overhead of 6 bytes per attribute is valid for any map entry < 128 B in protobuf wire format. The TLV overhead may grow to 3 bytes each for the entry, key and value (attribute field tag (1 B) + map entry length (1-2 B) + key field tag (1 B) + key length (1-2 B) + value field tag (1 B) + value length (1-2 B)). The total serialized size of any single-message publish request currently must not exceed 10 MB. The API accepts requests up to 10 MiB for both JSON and protobuf and may omit the protobuf encoding overhead from request validation at some point in the future, in which case a single message will not exceed 10 MiB since its encoding overhead is less than 1 KiB.
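
To make the wire-format arithmetic above concrete, here is a small sketch (again not Beam code) of how the per-attribute overhead could be computed; the 6-byte figure only holds while the entry, key and value lengths each fit in a single-byte varint:

```java
// Hypothetical helper, not part of Beam: computes the protobuf wire-format overhead of a
// single attributes map entry in a PubsubMessage (map entries are nested messages with
// field 1 = key and field 2 = value, wrapped in the length-delimited attributes field).
public class ProtobufAttributeOverhead {

  // Number of bytes a varint needs to encode the given non-negative value.
  static int varintSize(int value) {
    int size = 1;
    while ((value >>>= 7) != 0) {
      size++;
    }
    return size;
  }

  // Overhead of one attributes map entry, excluding the key/value bytes themselves.
  static int attributeEntryOverhead(int keyBytes, int valueBytes) {
    // Inside the entry: key tag + key length varint + key + value tag + value length varint + value.
    int entryPayload = 1 + varintSize(keyBytes) + keyBytes + 1 + varintSize(valueBytes) + valueBytes;
    // Outside the entry: attributes field tag + entry length varint, plus the inner overhead.
    return 1 + varintSize(entryPayload) + (entryPayload - keyBytes - valueBytes);
  }

  public static void main(String[] args) {
    System.out.println(attributeEntryOverhead(10, 20));    // 6, matches the static estimate
    System.out.println(attributeEntryOverhead(200, 5000)); // 9, once lengths need 2-byte varints
  }
}
```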

Batched requests should be validated again by PubsubClient to ensure a single PublishRequest does not exceed these limits. That's not as simple either, since a JSON PublishRequest that does not exceed 10 MB or 10 MiB may still exceed those limits when the request is transcoded into a protobuf PublishRequest before it hits Pub/Sub.
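
As a rough illustration of what the batching check could look like, here is a sketch of an accumulator that flushes before the pending request would exceed the hard limit; the class name, the 10 MB cap and the caller-supplied size estimator are assumptions for this example, not the actual PubsubClient refactor:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical batching accumulator, not part of Beam: flushes the pending batch before
// adding a message would push the serialized PublishRequest over the hard request limit.
public class PublishBatcher<T> {

  // Caller supplies the per-message size estimate (e.g. serialized protobuf or JSON size).
  interface SizeEstimator<M> {
    int estimateSize(M message);
  }

  // Assumed conservative cap; the actual limit depends on transport and transcoding.
  private static final int MAX_REQUEST_BYTES = 10 * 1000 * 1000;

  private final SizeEstimator<T> estimator;
  private final List<T> batch = new ArrayList<>();
  private int batchBytes = 0;

  PublishBatcher(SizeEstimator<T> estimator) {
    this.estimator = estimator;
  }

  // Adds a message; if the pending request would exceed the limit, the current batch is
  // flushed first and returned so the caller can publish it as its own request.
  List<T> add(T message) {
    int messageBytes = estimator.estimateSize(message);
    List<T> readyToPublish = new ArrayList<>();
    if (!batch.isEmpty() && batchBytes + messageBytes > MAX_REQUEST_BYTES) {
      readyToPublish = flush();
    }
    batch.add(message);
    batchBytes += messageBytes;
    return readyToPublish;
  }

  // Returns the pending batch and resets the accumulator.
  List<T> flush() {
    List<T> pending = new ArrayList<>(batch);
    batch.clear();
    batchBytes = 0;
    return pending;
  }
}
```

Even this only approximates the limit, since the accumulated estimate still has to match whatever encoding the request is ultimately transcoded into.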

Filing this for visibility since I was working on fixes for this while collaborating on #31608: one to correct the validation in PreparePubsubWriteDoFn and one to refactor batching in PubsubClient to account for this.

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner