Release Notes - Beam - Version 2.1.0 - HTML format

Sub-task

  • [BEAM-2175] - Support state API in Spark batch mode.
  • [BEAM-2410] - Remove TransportClient from ElasticSearchIO to decouple IO and ES versions

Bug

  • [BEAM-163] - When a PCollection (via Window.into) has smaller allowedLateness than upstream, might lose data
  • [BEAM-190] - Dead-letter drop for bad BigQuery records
  • [BEAM-318] - AvroCoder may be affected by AVRO-607
  • [BEAM-567] - FlinkRunner pollutes beam-examples-java[8] modules with SLF4J backend dependencies which are not optional (e.g. test/provided/system scoped, or marked optional)
  • [BEAM-659] - WindowFn#isCompatible should provide a meaningful reason
  • [BEAM-975] - Issue with MongoDBIO
  • [BEAM-991] - DatastoreIO Write should flush early for large batches
  • [BEAM-1151] - BigQueryIO.Write has no way of handling failures
  • [BEAM-1231] - Use well known coder types in Java for WindowedValue, GlobalWindow, and LengthPrefix
  • [BEAM-1282] - DoFnTester should allow output() calls in start/finishBundle
  • [BEAM-1938] - Side Inputs should be part of the expanded inputs
  • [BEAM-2153] - Move connection management in JmsIO to setup/teardown methods
  • [BEAM-2179] - Archetype generate-sources.sh should delete old files
  • [BEAM-2246] - JmsIO should use CLIENT_ACK instead of AUTO_ACK
  • [BEAM-2290] - CompressedSource doesn't forward timestamps from the underlying Source's output
  • [BEAM-2300] - Upgrade python sdk dependencies
  • [BEAM-2301] - Standard expansion of SDF should be in runners-core-construction
  • [BEAM-2302] - WriteFiles with runner-determined sharding and large numbers of windows causes OOM errors
  • [BEAM-2316] - GcsUtil matches GCS glob * as though it is **
  • [BEAM-2317] - Incorrect Javadoc for GroupAlsoByWindowEvaluatorFactory
  • [BEAM-2318] - Flake in HBaseIOTest
  • [BEAM-2319] - HBaseIO Source is not correct when read multiple times
  • [BEAM-2334] - OutOfMemoryError in RandomAccessData.java:350
  • [BEAM-2338] - GCS filepattern wildcard broken in Python SDK
  • [BEAM-2344] - Re-rename fat jar in archetypes
  • [BEAM-2359] - SparkTimerInternals inputWatermarkTime does not get updated in cluster mode
  • [BEAM-2364] - PCollectionView should not be a PValue, it should expand to its underlying PCollection
  • [BEAM-2365] - Use highest protocol for pickle coder.
  • [BEAM-2366] - Post commit failure: import not found gen_protos
  • [BEAM-2369] - HadoopFileSystem: NullPointerException on match of non existing resource
  • [BEAM-2373] - AvroSource: Premature End of stream Exception on SnappyCompressorInputStream
  • [BEAM-2379] - SpannerIO tests are failing
  • [BEAM-2380] - Flink Batch Runner does not forward additional outputs to operator
  • [BEAM-2389] - GcpCoreApiSurfaceTest isn't testing right module
  • [BEAM-2398] - Increasing latency within DirectRunner caused by cumulated TransformWatermarks
  • [BEAM-2405] - Create a BigQuery sink for streaming using PTransform
  • [BEAM-2406] - NullPointerException when writing an empty table to BigQuery
  • [BEAM-2407] - Flink CoderTypeSerializer ConfigSnapshot cannot be deserialised
  • [BEAM-2408] - Flink unbounded source does not emit watermarks when there are multiple Readers
  • [BEAM-2437] - quickstart.py docs is missing the path to MANIFEST.in
  • [BEAM-2439] - Datastore writer can fail to progress if Datastore is slow
  • [BEAM-2485] - Reject stateful ParDo / DoFn in merging windows at construction time in Dataflow runner
  • [BEAM-2488] - Elasticsearch IO should read also in replica shards
  • [BEAM-2490] - ReadFromText function is not taking all data with glob operator (*)
  • [BEAM-2497] - TextIO can't read concatenated gzip files
  • [BEAM-2502] - Processing time timers for expired windows are not ignored
  • [BEAM-2504] - Processing time timers with timestamp past GC time are interpreted as GC timers
  • [BEAM-2505] - When EOW != GC and the timers fire in together, the output is not marked as the final pane
  • [BEAM-2508] - Fix javaDoc of Stateful DoFn
  • [BEAM-2509] - Fn API Runner hangs in grpc controller mode
  • [BEAM-2521] - Simplify packaging for python distributions
  • [BEAM-2529] - SpannerWriteIT failing in postcommit
  • [BEAM-2533] - GCP IO: Use new bigtable client version
  • [BEAM-2534] - KafkaIO should allow gaps in message offsets
  • [BEAM-2537] - Can't run IO ITs - conflicting project vs projectId pipeline options
  • [BEAM-2551] - KafkaIO reader blocks indefinitely if servers are not reachable
  • [BEAM-2557] - DirectRunner should not have a hard dependency on Hamcrest
  • [BEAM-2570] - Python Post commit tests are failing
  • [BEAM-2571] - Flink ValidatesRunner failing CombineTest.testSlidingWindowsCombineWithContext
  • [BEAM-2575] - ApexRunner doesn't emit watermarks for additional outputs
  • [BEAM-2587] - Build fails due to python sdk
  • [BEAM-2595] - WriteToBigQuery does not work with nested json schema
  • [BEAM-2636] - user_score on DataflowRunner is broken
  • [BEAM-2642] - Upgrade to Google Auth 0.7.1
  • [BEAM-2662] - Quickstart for Spark Java not working
  • [BEAM-2670] - ParDoTest.testPipelineOptionsParameter* validatesRunner tests fail on Spark runner
  • [BEAM-2690] - Make hadoop and hive dependencies provided for HCatalogIO
  • [BEAM-2703] - KafkaIO: watermark outside the bounds of BoundedWindow
  • [BEAM-2708] - Decompressing bzip2 files with multiple "streams" only reads the first stream
  • [BEAM-2726] - Typo in programming guide
  • [BEAM-3390] - unable to serialize org.apache.beam.sdk.io.jdbc.JdbcIO$Read$ReadFn
  • [BEAM-3687] - Reject stateful ParDo / DoFn in merging windows at construction time in Java Direct runner

New Feature

  • [BEAM-210] - Allow control of empty ON_TIME panes analogous to final panes
  • [BEAM-245] - Create Cassandra IO
  • [BEAM-1237] - Create AmqpIO
  • [BEAM-1377] - Support Splittable DoFn in Dataflow streaming runner
  • [BEAM-1472] - Use cross-language serialization schema for triggers in Python SDK
  • [BEAM-1476] - Support MapState in Flink runner
  • [BEAM-1483] - Support SetState in Flink runner
  • [BEAM-1542] - Need Source/Sink for Spanner
  • [BEAM-1702] - Add support for local execution to BigtableIO using the google cloud emulator
  • [BEAM-2059] - Implement Metrics support for streaming Dataflow runner
  • [BEAM-2150] - Support for recursive wildcards in GcsPath
  • [BEAM-2248] - KafkaIO support to use start read time to set start offset
  • [BEAM-2276] - make it easier to specify windowed filename policy
  • [BEAM-2333] - Rehydrate Pipeline from Runner API proto
  • [BEAM-2357] - Add HCatalogIO (Hive)
  • [BEAM-3615] - Dynamic/Default Coder For Data

Improvement

  • [BEAM-788] - Execute ReduceFn directly, not via OldDoFn wrapper
  • [BEAM-1347] - Basic Java harness capable of understanding process bundle tasks and sending data over the Fn Api
  • [BEAM-1348] - Model the Fn Api
  • [BEAM-1498] - Use Flink-native side outputs
  • [BEAM-1663] - Generated package names should be deterministic and consistent across runs.
  • [BEAM-1779] - Port UnboundedSourceWrapperTest to use Flink operator test harness
  • [BEAM-1925] - Make DoFn invocation logic of Python SDK more extensible
  • [BEAM-2134] - expose AUTO_COMMIT to KafkaIO.read()
  • [BEAM-2252] - maven-shade-plugin should be defined in pluginManagement instead of plugins so that the plugin execution order defined by the root pom.xml is used
  • [BEAM-2253] - maven archetype poms should have versions controlled automatically based upon root pom.xml versions
  • [BEAM-2258] - BigtableIO should use AutoValue for read and write
  • [BEAM-2293] - Top.Largest and Top.Smallest naming is counter-intuitive
  • [BEAM-2343] - ExecutionContext and StepContext have a lot of unused cruft that obscures their use
  • [BEAM-2392] - Avoid use of proto builder clone
  • [BEAM-2401] - Update Flink Runner to Flink 1.3.0
  • [BEAM-2411] - Make the write transform of HBaseIO simpler
  • [BEAM-2412] - Update HBaseIO to use HBase client 1.2.6
  • [BEAM-2423] - Abstract StateInternalsTest for the different state internals/Runners
  • [BEAM-2473] - Dataflow runner should reject unsupported state & timers (MapState, SetState) at construction time, and documentation should be more prominent
  • [BEAM-2481] - Update commons-lang3 dependency to version 3.6
  • [BEAM-2486] - Should throws some useful messages when statefulParDo use non-KV input
  • [BEAM-2494] - Remove 'GroupedShuffleRangeTracker' which is unused in the SDK
  • [BEAM-2514] - Improve error message for missing required options of Beam pipeline

Test

  • [BEAM-1268] - Add integration tests for CassandraIO
  • [BEAM-2489] - Use dynamic ES port in HIFIOWithElasticTest in module IO :: Hadoop :: jdk1.8-tests

Task

  • [BEAM-2284] - Completely remove OldDoFn from Beam
  • [BEAM-2361] - Add TikaIO to the list of in-progress transforms
  • [BEAM-2522] - upgrading jackson
  • [BEAM-2552] - Update Dataflow container version for 2.1.0 release

Edit/Copy Release Notes

The text area below allows the project release notes to be edited and copied to another document.