Release Notes - Beam - Version 2.21.0 - HTML format

Sub-task

  • [BEAM-2902] - Add user state support for ParDo.Multi for the Dataflow runner
  • [BEAM-3543] - Fn API metrics in Java SDK Harness
  • [BEAM-3544] - Fn API metrics in Python SDK Harness
  • [BEAM-3545] - Fn API metrics in Go SDK harness
  • [BEAM-3742] - Support for running a streaming SDF in Python SDK
  • [BEAM-3837] - Python SDK harness should understand a BundleSplitRequest and respond with a BundleSplit before bundle finishes
  • [BEAM-4287] - SplittableDoFn: splitAtFraction() API for Java
  • [BEAM-4655] - Update pipeline translation for timers inside Python SDK
  • [BEAM-4657] - Python SDK harness should support user timers
  • [BEAM-4660] - Add well known timer coder for Python SDK
  • [BEAM-4737] - SplittableDoFn dynamic rebalancing in Dataflow
  • [BEAM-8925] - Beam Dependency Update Request: org.apache.tika:tika-core
  • [BEAM-9035] - BIP-1: Typed options for Row Schema and Fields
  • [BEAM-9044] - BIP-1: Convert protobuf options to Schema options
  • [BEAM-9056] - Staging artifacts from environment
  • [BEAM-9147] - [Java] PTransform that integrates Video Intelligence functionality
  • [BEAM-9430] - Migrate from ProcessContext#updateWatermark to WatermarkEstimators
  • [BEAM-9458] - Make Dataflow executed UnboundedSources using SDF as the default
  • [BEAM-9537] - Refactor FnApiRunner into its own package
  • [BEAM-9569] - Coder inference should be disabled for Row types
  • [BEAM-9604] - BIP-1: Remove schema metadata usage for Protobuf extension
  • [BEAM-9605] - BIP-1: Rename setRowOption to setOption on Option builder
  • [BEAM-9608] - Add context managers for FnApiRunner to manage execution of each bundle
  • [BEAM-9687] - Names of temporary files created by interactive runner include characters invalid on some platforms.

Bug

  • [BEAM-4582] - Incorrectly translates apache_beam.runners.dataflow.native_io.streaming_create.DecodeAndEmitDoFn when creating the Dataflow pipeline json description
  • [BEAM-6189] - Deprecate and cleanup BeamFnApi.Metrics from Dataflow Worker
  • [BEAM-6451] - Portability Pipeline eventually hangs on bundle registration
  • [BEAM-6661] - FnApi gRPC setup/teardown glitch
  • [BEAM-7074] - FnApiRunner fails to wire multiple timer specs in single pardo
  • [BEAM-8280] - re-enable IOTypeHints.from_callable
  • [BEAM-8458] - BigQueryIO.Read needs permissions to create datasets to be able to run queries
  • [BEAM-8645] - TimestampCombiner incorrect in beam python
  • [BEAM-9125] - Update BigQuery Storage API documentation
  • [BEAM-9294] - Failure to validate schema in RowJsonSerializer looks like it came from RowJsonDeserializer
  • [BEAM-9297] - ToJson and JsonToRow should fail earlier for schema validation errors
  • [BEAM-9319] - ResourceExhausted: topics-per-project
  • [BEAM-9360] - Schema FieldType should not consider metadata for equivalence
  • [BEAM-9382] - TestStreamTranscriptTest relies on non-deterministic behavior
  • [BEAM-9398] - Python type hints: AbstractDoFnWrapper does not wrap setup
  • [BEAM-9419] - KafkaIO fails when user specify request timeout as string value
  • [BEAM-9420] - Configurable timeout for Kafka setupInitialOffset()
  • [BEAM-9446] - FlinkRunner discards parallelism and execution_mode_for_batch pipeline options
  • [BEAM-9448] - Misleading log line: says "downloading" when using cache
  • [BEAM-9464] - WithKeys should respect fetching coders for parameterized types instead of relying on the raw type
  • [BEAM-9474] - Environment cleanup is not robust enough and may leak resources
  • [BEAM-9476] - KinesisIO DescribeStream transient errors are not retried
  • [BEAM-9481] - Expansion-service fails when executed with --verify
  • [BEAM-9488] - Python SDK sending unexpected MonitoringInfo
  • [BEAM-9490] - Environment may be cleaned up prematurely when using environment expiration
  • [BEAM-9495] - DataCatalogTableProvider creates DataCatalogClient that is never closed
  • [BEAM-9499] - test_multi_triggered_gbk_side_input is failing on head
  • [BEAM-9509] - Subprocess job server treats missing local file as remote URL
  • [BEAM-9511] - ArrayScanToUncollectConverter: ResolvedParameter cannot be cast to ResolvedLiteral
  • [BEAM-9512] - Anonymous structs have name collision in schema
  • [BEAM-9524] - ib.show() spins forever when cells are re-executed
  • [BEAM-9529] - Remove googledatastore package
  • [BEAM-9540] - Rename beam:source:runner:0.1/beam:sink:runner:0.1 to beam:runner:source:v1/beam:runner:sink:v1
  • [BEAM-9553] - portableWordCount test using incorrect job server
  • [BEAM-9574] - NamedTuple instances generated from schemas cannot be pickled
  • [BEAM-9578] - Enumerating artifacts is too expensive in Java
  • [BEAM-9580] - Nexmark Query 12 in streaming mode is stuck on Flink Runner with Flink 1.10
  • [BEAM-9596] - Flink metric results may not be populated on pipeline failures
  • [BEAM-9606] - Example in gradle's combine test lacks of parameters
  • [BEAM-9631] - Flink 1.10 test execution is broken due to premature test cluster shutdown
  • [BEAM-9638] - False positives in worker region & zone tests
  • [BEAM-9645] - Python flinkrunner cannot inspect container
  • [BEAM-9647] - No MQTT connection possible because clientId is too long
  • [BEAM-9648] - DirectRunner waitUntilFinish does not return null on timeout
  • [BEAM-9651] - StreamingDataflowWorker stuck waiting for org.apache.beam.runners.dataflow.worker.windmill.DirectStreamObserver.onNext
  • [BEAM-9652] - BigQueryIO MultiPartitionsWriteTables fails with ClassCastException: java.lang.Object cannot be cast to org.apache.beam.sdk.values.KV
  • [BEAM-9660] - StreamingDataflowWorker has confusing exception on commits over 2GB
  • [BEAM-9670] - CoGroup transform should allow widening nullability in key schemas
  • [BEAM-9677] - Fix ArtifactUrlPayload path -> url typo
  • [BEAM-9686] - SparkCommonPipelineOptions should not depend of a child class to resolve the tmp checkpoint dir
  • [BEAM-9691] - Ensure Dataflow BQ Native sink are not used on FnApi
  • [BEAM-9725] - Perfomance regression in reshuffle
  • [BEAM-9726] - Don't require --region for non-service Dataflow endpoints.
  • [BEAM-9733] - ImpulseSourceFunction does not emit a final watermark
  • [BEAM-9734] - Revert https://github.com/apache/beam/pull/11122 which is a potential regression
  • [BEAM-9735] - Performance regression in Python Batch pipeline in Reshuffle
  • [BEAM-9736] - NameError: name 'from_container_image' is not defined
  • [BEAM-9738] - Python Dataflow runner omits capabilities.
  • [BEAM-9744] - Python performance tests failing
  • [BEAM-9749] - beam_PostCommit_SQL failing (missing region)
  • [BEAM-9756] - beam_PostCommit_Java_Nexmark (non-Dataflow) failing
  • [BEAM-9764] - :sdks:java:container:generateThirdPartyLicenses failing
  • [BEAM-9765] - :vendor:calcite-1_20_0:validateVendoring fails
  • [BEAM-9769] - Ensure JSON imports are the default behavior for BigQuerySink and WriteToBigQuery in Python
  • [BEAM-9773] - Update Dataflow Debug Capture to use Google API client Jackson 2
  • [BEAM-9778] - beam_PostCommit_XVR_Spark failing
  • [BEAM-9794] - Flink pipeline with RequiresStableInput fails after Short.MAX_VALUE checkpoints.
  • [BEAM-9797] - license_script.sh calls pip install/uninstall in local env
  • [BEAM-9801] - Setting a timer from a timer callback fails
  • [BEAM-9812] - WriteToBigQuery issue causes pipelines with multiple load jobs to work erroneously
  • [BEAM-9870] - Beam protos created without fn_api set incompatible with Dataflow on FnAPI.
  • [BEAM-9874] - Portable timers can't be cleared in batch mode
  • [BEAM-9877] - Eager size estimation of large group-by-key iterables cause expensive / duplicate reads
  • [BEAM-9880] - touch: build/target/third_party_licenses/skip: No such file or directory
  • [BEAM-9913] - Cross-language ValidatesRunner tests are failing due to failure of ':sdks:java:container:pullLicenses'
  • [BEAM-9947] - Timer coder contains a faulty key coder leading to a corrupted encoding
  • [BEAM-10018] - Windowing katas are failing because timestamps are being calculated in local timezones

New Feature

  • [BEAM-2645] - Implement DisplayData translation to/from protos
  • [BEAM-3194] - Support annotating that a DoFn requires stable / deterministic input for replay/retry
  • [BEAM-4374] - Update existing metrics in the FN API to use new Metric Schema
  • [BEAM-5599] - Convert all Java BoundedSources into SplittableDoFns
  • [BEAM-5600] - Splitting for SplittableDoFn should be exposed within runner shared libraries
  • [BEAM-6597] - Put MonitoringInfos/metrics in the Java SDK ProcessBundleProgressResponse
  • [BEAM-9295] - Add Flink 1.10 build target and Make FlinkRunner compatible with Flink 1.10
  • [BEAM-9300] - parse struct literal in ZetaSQL
  • [BEAM-9411] - Use BigQuery DIRECT_READ by default for SQL
  • [BEAM-9432] - Create a separate expansion service package.
  • [BEAM-9433] - Create an expansion service artifact for common IOs
  • [BEAM-9562] - Remove timer from PCollection and treat timers as Elements
  • [BEAM-9654] - Update Jet Runner to 4.0
  • [BEAM-9770] - Add BigQuery DeadLetter pattern to Patterns Page
  • [BEAM-10003] - Need two PR to submit snippets to website

Improvement

  • [BEAM-950] - DoFn Setup and Teardown methods should have access to PipelineOptions
  • [BEAM-1210] - PipelineRunners should receive a copy of PipelineOptions rather than the original Options
  • [BEAM-3097] - Allow BigQuerySource to take a ValueProvider as a table input.
  • [BEAM-5422] - Update BigQueryIO DynamicDestinations documentation to clarify usage of getDestination() and getTable()
  • [BEAM-6142] - Improve ByteBuddy DoFnInvoker implementation wrt SplittableDoFns / BundleFinalization
  • [BEAM-7350] - Update Python Datastore example to use v1new Datastore IO
  • [BEAM-7387] - Clean-up URNs to be of the form beam:yyy:....
  • [BEAM-8015] - Get logs for SDK worker Docker containers
  • [BEAM-8201] - clean up the current container API
  • [BEAM-8382] - Add polling interval to KinesisIO.Read
  • [BEAM-8539] - Clearly define the valid job state transitions
  • [BEAM-8587] - Add TestStream support for Dataflow runner
  • [BEAM-8800] - Add advance_watermark_to_infinity to end of TestStreams
  • [BEAM-8841] - Add ability to perform BigQuery file loads using avro
  • [BEAM-9001] - Allow setting environment ID to all transforms in the SDK
  • [BEAM-9014] - Update CachingShuffleBatchReader to record weights by size in bytes
  • [BEAM-9199] - Make --region a required flag for DataflowRunner
  • [BEAM-9261] - Add LICENSE and NOTICE to Docker images
  • [BEAM-9279] - Make HBase.ReadAll based on Reads instead of HBaseQuery
  • [BEAM-9286] - Create validation tests for metrics based on MonitoringInfo if applicable
  • [BEAM-9325] - UnownedOutputStream not overriding Array write method.
  • [BEAM-9342] - Update bytebuddy to version 1.10.8
  • [BEAM-9384] - Add SchemaRegistry.getSchemaCoder to get SchemaCoders for registered types
  • [BEAM-9434] - Improve Spark runner reshuffle translation to maximize parallelism
  • [BEAM-9477] - RowCoder should be hashable and picklable
  • [BEAM-9498] - RowJson exception for unsupported types should list the relevant fields
  • [BEAM-9535] - Remove unused ParDoPayload.Parameters
  • [BEAM-9536] - Return type of window start/end functions is incorrectly inferred to be INT64
  • [BEAM-9544] - Add Pandas-compatible Dataframe API
  • [BEAM-9545] - MVP: DataframeTransform
  • [BEAM-9552] - TestPubsub ACK deadline is too short
  • [BEAM-9558] - Make end of data channel explicit
  • [BEAM-9568] - Move Beam SQL to use the schema join transforms
  • [BEAM-9609] - Upgrade to ZetaSQL 2020.03.2
  • [BEAM-9618] - Allow SDKs to pull process bundle descriptors.
  • [BEAM-9714] - [Go SDK] Require --region flag in Dataflow runner
  • [BEAM-9716] - Alias zone to worker_zone and warn
  • [BEAM-9718] - Update documentation for windowed value coder
  • [BEAM-9724] - Use consistent short package names for protos in Beam Go
  • [BEAM-9805] - Containers fail to stage when staging directory not present.
  • [BEAM-9945] - Use consistent element count for progress counter.
  • [BEAM-10028] - [Java SDK] Support state backed iterables within the SDK harness

Test

  • [BEAM-9287] - Python Validates runner tests for Unified Worker
  • [BEAM-9470] - :sdks:java:io:kinesis:test is flaky
  • [BEAM-9565] - WatermarkEstimatorsTest.testThreadSafeWatermarkEstimator is flaky
  • [BEAM-9827] - Test SplittableDoFnTest#testPairWithIndexBasicBounded is flaky

Task

  • [BEAM-9136] - Add LICENSES and NOTICES to docker images
  • [BEAM-9298] - Drop support for Flink 1.7
  • [BEAM-9299] - Upgrade Flink Runner to 1.8.3 and 1.9.2
  • [BEAM-9636] - Run Dataflow ValidatesRunner (:runners:google-cloud-dataflow-java:validatesRunnerLegacyWorkerTest) failing
  • [BEAM-9685] - Don't release Go SDK container until Go is officially supported.
  • [BEAM-9727] - Auto populate required feature experiment flags for enable dataflow runner v2
  • [BEAM-9876] - Migrate the Beam website from Jekyll to Hugo to enable localization of the site content

Edit/Copy Release Notes

The text area below allows the project release notes to be edited and copied to another document.