Release Notes - Beam - Version 2.3.0

Sub-task

  • [BEAM-2528] - BeamSql: support create table
  • [BEAM-2600] - Python SDK harness container
  • [BEAM-2876] - Add provision api proto
  • [BEAM-2881] - GCS artifact server proxy
  • [BEAM-2904] - Python SDK support for portable progress reporting
  • [BEAM-2906] - Dataflow supports portable progress reporting
  • [BEAM-2949] - Initial splitting
  • [BEAM-2963] - Use portable ParDoPayload in DataflowRunner
  • [BEAM-3125] - Portable flattens in Java SDK Harness
  • [BEAM-3143] - Fix type inference in Python 3 for generators
  • [BEAM-3179] - [Nexmark][SQL] Refactor Generator
  • [BEAM-3181] - [Nexmark][SQL] Implement a basic pass-through SQL query
  • [BEAM-3422] - The jar files uploaded to maven do not work with Java 9
  • [BEAM-3427] - Upgrade Maven and Gradle to build using Java 8
  • [BEAM-3428] - Merge java and java8 examples all together
  • [BEAM-3430] - Update website documentation about Apache Beam 2.3.0+ being Java 8
  • [BEAM-3432] - Merge sdks/java/io/hadoop/jdk1.8-tests into sdks/java/io/hadoop/input-format
  • [BEAM-3461] - Drop redundant "beam-" from project names in gradle
  • [BEAM-3466] - Remove Java 7 and any related task from Jenkins
  • [BEAM-3467] - Remove Java 7 from the docker development (reproducible) build images / website
  • [BEAM-3534] - Add a spark validates runner test for metrics sink in streaming mode
  • [BEAM-3563] - Revise Fn API metrics protos

Bug

  • [BEAM-409] - Incorrect use of Math.ceil in ApproximateQuantiles
  • [BEAM-682] - Invoker Class should be created in Thread Context Classloader
  • [BEAM-793] - JdbcIO can create a deadlock when parallelism is greater than 1
  • [BEAM-1187] - GCP Transport not performing timed backoff after connection failure
  • [BEAM-1487] - BufferingStreamObserverTest, BeamFnLoggingClientTest repeatedly times out in precommit
  • [BEAM-1868] - CreateStreamTest testMultiOutputParDo is flaky on the Spark runner
  • [BEAM-2101] - Reading public GCS files requires authentication
  • [BEAM-2102] - deprecated use of pip --download command
  • [BEAM-2257] - KafkaIO write without key requires a producer fn
  • [BEAM-2270] - Examples archetype bundles Hadoop 2.6 in its jar for ApexRunner; cannot run on Hadoop 2.7?
  • [BEAM-2271] - Release guide or pom.xml needs update to avoid releasing Python binary artifacts
  • [BEAM-2273] - mvn clean doesn't fully clean up archetypes.
  • [BEAM-2304] - State declared with one class cannot be accessed as a superclass (applies to BagState|CombiningState <: GroupingState)
  • [BEAM-2320] - Update DataflowRunner dependency check for Google / Apache Beam Distribution
  • [BEAM-2498] - Dataflow runner should shade Runner/Fn API protos
  • [BEAM-2566] - Java SDK harness should not depend on any runner
  • [BEAM-2576] - Split up portability model artifacts
  • [BEAM-2607] - Enforce that SDF must return stop() after a failed tryClaim() call
  • [BEAM-2704] - KafkaIO: NPE without key serializer set
  • [BEAM-2779] - PipelineOptionsFactory should prevent non PipelineOptions interfaces from being constructed.
  • [BEAM-2870] - BQ Partitioned Table Write Fails When Destination has Partition Decorator
  • [BEAM-2872] - AvroIO.TypedWrite#to() method produces a compilation error under JDK 1.7
  • [BEAM-2957] - Fix flaky ElasticsearchIOTest.testSplit in beam-sdks-java-io-elasticsearch-tests-5
  • [BEAM-2996] - Metric names should not be null or empty
  • [BEAM-3013] - The Python worker should report lulls
  • [BEAM-3016] - BeamFnLoggingClient shutdown stability
  • [BEAM-3030] - watchForNewFiles() can emit a file multiple times if it's growing
  • [BEAM-3048] - RAND_RANGE in WindowedWordCount is not used
  • [BEAM-3049] - Java SDK Harness bundles non-relocated code, including Dataflow runner
  • [BEAM-3051] - no beta on gcloud
  • [BEAM-3052] - ReduceFnRunner sets end-of-window hold even when no data is buffered
  • [BEAM-3054] - org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest is flaky
  • [BEAM-3088] - BigQuery source should consider streaming buffer when determining estimated sizes of tables
  • [BEAM-3107] - Python Fnapi based workloads failing
  • [BEAM-3110] - The transform Read(UnboundedKafkaSource) is currently not supported
  • [BEAM-3118] - Fix thread leaks in ElasticsearchIO 5 write tests
  • [BEAM-3120] - Jenkins postcommit test suite triggered for a release branch uses code from master branch.
  • [BEAM-3121] - Dockerized jekyll server fails
  • [BEAM-3130] - View.asMap() causes a ClassCastException in Apex runner
  • [BEAM-3137] - BigQueryIO.write() should better verify user schemas
  • [BEAM-3139] - Update dataflow.version in beam root pom.xml
  • [BEAM-3155] - Python SDKHarness does not use separate thread for progress reporting
  • [BEAM-3160] - Type based coder inference incorrectly assumes that a coder for one type is equivalent to every other coder for that type.
  • [BEAM-3161] - Cannot output with timestamp XXXX
  • [BEAM-3174] - Master python sdk seems broken with test_harness_override_present_in_dataflow_distributions on Py 2.7.6
  • [BEAM-3186] - In-flight data loss when restoring from savepoint
  • [BEAM-3187] - Spark runner does not respect ParDo's lifecycle in case of exceptions
  • [BEAM-3196] - Python postcommit test_wordcount_fnapi_it failing on Dataflow
  • [BEAM-3202] - Multiple deserializations of PipelineOptions leaks memory
  • [BEAM-3206] - Spark runner does not shut down after pipeline completion
  • [BEAM-3219] - DataflowRunner: @Setup not called for batch stateful DoFn
  • [BEAM-3240] - Quickstart examples archetype dependencies conflict for Apex on YARN
  • [BEAM-3244] - Flink runner does not respect ParDo's lifecycle in case of exceptions
  • [BEAM-3282] - MqttIO reader should use receive with timeout
  • [BEAM-3284] - Python SDK, dataflow runner, the method modify_job_status is calling the wrong API endpoint
  • [BEAM-3336] - MqttIO read tests are flaky
  • [BEAM-3348] - Beam SQL DSL does not support non-ASCII characters in SQL
  • [BEAM-3354] - Timer setRelative() does not reset previously set timer.
  • [BEAM-3357] - Python SDK head fails to run tests due to Requirement.parse('protobuf<=3.4.0,>=3.2.0')
  • [BEAM-3362] - Create an example pipeline that uses State.
  • [BEAM-3366] - FnApiDoFnRunner should be registered for the ParDo URN
  • [BEAM-3369] - Python HEAD fails tests due to a ValueError
  • [BEAM-3382] - Validate count for trigger AfterCount
  • [BEAM-3387] - Bigtable integration tests failed due to client version dependency change
  • [BEAM-3389] - Clean receive queue in dataplane
  • [BEAM-3391] - Upgrade apitools to >= 0.5.18
  • [BEAM-3397] - beam_PreCommit_Java_MavenInstall failing on Dataflow integration test because of too long commandline
  • [BEAM-3411] - Test apache_beam.examples.wordcount_it_test.WordCountIT times out
  • [BEAM-3416] - File is not properly closed in VcfSource when exception is thrown
  • [BEAM-3436] - RetryHttpRequestInitializerTest takes 4min to complete
  • [BEAM-3438] - KinesisReaderIT fails due to a missing PipelineOptions property
  • [BEAM-3450] - RemoteGrpcPorts should contain the wire format
  • [BEAM-3452] - NullPointerException when executing BigQuerySourceBase.split()
  • [BEAM-3458] - Go SDK beam.Create & beam.CreateList should support complex types
  • [BEAM-3478] - Flink checkstyle broken
  • [BEAM-3486] - Progress reporting for python sdk fix
  • [BEAM-3492] - Spark Integration Tests fail with a Closed Connection
  • [BEAM-3499] - Watch can make no progress if a single poll takes more than checkpoint interval
  • [BEAM-3510] - Python Runner API serialization for CombinePerKey drops arguments to CombineFn
  • [BEAM-3511] - Eager evaluation of overridden transforms in Python SDK does not work
  • [BEAM-3537] - Remove DirectRunner-specific internal PValue cache, allow more general eager in-process pipeline execution
  • [BEAM-3569] - SpannerIO.write throws on delete mutations
  • [BEAM-3570] - More elegant interface for RuntimeValueProvider
  • [BEAM-3584] - Java dataflow job fails with 2.3.0 RC1, due to missing worker image
  • [BEAM-3585] - Python dataflow job fails with 2.3.0 RC1, due to missing worker image
  • [BEAM-3589] - Flink runner breaks with ClassCastException on UnboundedSource
  • [BEAM-3592] - Spark-runner profile is broken on Nexmark after move to Spark 2.x
  • [BEAM-3668] - Apache Spark Java Quickstart fails 2.3.0 RC2
  • [BEAM-3958] - beam-sdks-java-io-amazon-web-services may be global pollution.

New Feature

  • [BEAM-2430] - Java FnApiDoFnRunner to share across runners
  • [BEAM-2500] - Add support for S3 as an Apache Beam FileSystem
  • [BEAM-2718] - Add bundle retry logic to the DirectRunner
  • [BEAM-2774] - Add I/O source for VCF files (python)
  • [BEAM-2806] - support View.CreatePCollectionView in FlinkRunner
  • [BEAM-2865] - Implement FileIO.write()
  • [BEAM-3008] - BigtableIO should use ValueProviders
  • [BEAM-3009] - Implement context access from user code closures
  • [BEAM-3035] - Extract ReifyTimestampsAndWindows from GatherAllPanes
  • [BEAM-3171] - convert a join into lookup

Improvement

  • [BEAM-675] - Introduce message mapper in JmsIO
  • [BEAM-1630] - Add Splittable DoFn to Python SDK
  • [BEAM-1847] - KafkaIO can't specify both max records and max duration.
  • [BEAM-1872] - implement Reshuffle transform in python, make it experimental in Java
  • [BEAM-1920] - Add Spark 2.x support in Spark runner
  • [BEAM-2482] - CodedValueMutationDetector should use the coders structural value
  • [BEAM-2674] - Runner API translators should own their rehydration
  • [BEAM-2804] - support TIMESTAMP in Sort
  • [BEAM-2875] - Portable SDK harness containers
  • [BEAM-3005] - Provision API should optionally include SDK harness memory/cpu/disk limits
  • [BEAM-3018] - Remove duplicated methods in StructuredCoder
  • [BEAM-3041] - Add portable Python SDK container setup support
  • [BEAM-3063] - VoidCoder should implement structuralValue instead of forcing encoding of zero bytes
  • [BEAM-3111] - Upgrade ElasticsearchIO elastic dependencies to 5.6.3
  • [BEAM-3112] - improve logs in ElasticsearchIO test utils
  • [BEAM-3113] - Disable stack trace optimization in java container
  • [BEAM-3114] - Do not generate ApiService text proto string manually in containers
  • [BEAM-3133] - Staging files on Dataflow should not block forever without producing updates
  • [BEAM-3135] - Adding futures dependency to beam python SDK
  • [BEAM-3170] - support topicPartition in BeamKafkaTable
  • [BEAM-3184] - Not able to access GCS API when submitting Python jobs behind corporate firewall
  • [BEAM-3189] - Python Fnapi - SDK harness speedup
  • [BEAM-3209] - Update documentation on support for reading/writing BQ date partitioned tables
  • [BEAM-3238] - [SQL] Add builder to BeamRecordSqlType
  • [BEAM-3239] - Enable debug server for python sdk workers
  • [BEAM-3267] - Return file names from TFRecordIO write
  • [BEAM-3275] - Support Kafka 1.0.0
  • [BEAM-3308] - Improve Go exec runtime error handling
  • [BEAM-3340] - Update Flink Runner to Flink 1.4.0
  • [BEAM-3347] - Add worker ID to the provisioning API
  • [BEAM-3356] - Add varint coder
  • [BEAM-3373] - Add serviceEndpoint parameter to KinesisIO
  • [BEAM-3384] - Make DataflowRunner.replaceTransforms protected
  • [BEAM-3388] - Reduce Go runtime reflective overhead
  • [BEAM-3404] - Update KinesisIO to use AWS SDK 1.11.255 and KCL 1.8.8
  • [BEAM-3405] - Make maxNumRecords a long and validate if stream exists for KinesisIO
  • [BEAM-3454] - Make Unbounded to Bounded conversions on IOs able to trigger on maxNumRecords and maxReadTime combined
  • [BEAM-3502] - Avoid use of proto.Builder.clone() in DatastoreIO
  • [BEAM-3507] - [JdbcIO] Allow to define the batch size in WriteFn
  • [BEAM-3533] - Replace hard-coded UTF-8 Strings
  • [BEAM-3539] - BigtableIO.Write javadoc of some methods is incorrect
  • [BEAM-3560] - Switch to use BigInteger/BigDecimal.ZERO/ONE/TEN

Test

  • [BEAM-3392] - Verify default container image names in Python SDK

Wish

  • [BEAM-3551] - Add -parameters flag to javac (and test)

Task

  • [BEAM-2994] - Refactor TikaIO
  • [BEAM-3243] - multiple anonymous DoFn lead to conflicting names
  • [BEAM-3294] - Move to graph.External and remove Source/Sink
  • [BEAM-3426] - Java 8 support
