
[connector/servicegraph] Unit test failure: TestConnectorConsume/test_fix_failed_label_not_work #33998

Closed
crobert-1 opened this issue Jul 9, 2024 · 15 comments · Fixed by #34070 or #34076
Labels: ci-cd (CI, CD, testing, build issues), connector/servicegraph, flaky test (a test is flaky)

Comments

@crobert-1
Member

Component(s)

connector/servicegraph

Describe the issue you're reporting

Failing CI/CD action link

The failure occurred in an unrelated PR, running on Ubuntu.

Failure output:

Running target 'test-with-cover' in module 'connector/servicegraphconnector' as part of group 'connector'
make --no-print-directory -C connector/servicegraphconnector test-with-cover
mkdir -p /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/coverage/unit
/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/.tools/gotestsum --rerun-fails=1 --packages="./..." -- -race -timeout 600s -parallel 4 --tags="" -cover -covermode=atomic -args -test.gocoverdir="/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/coverage/unit"
✖  . (79ms) (coverage: 87.3% of statements)
✓  internal/metadata (1.02s) (coverage: 83.3% of statements)
✓  internal/store (1.121s) (coverage: 92.7% of statements)

DONE 31 tests, 2 failures in 7.887s

✖  . (38ms) (coverage: 87.3% of statements)

=== Failed
make[2]: *** [../../Makefile.Common:131: test-with-cover] Error 1
=== FAIL: . TestConnectorConsume/test_fix_failed_label_not_work (0.01s)
make[1]: *** [Makefile:179: connector/servicegraphconnector] Error 2
    logger.go:146: 2024-07-09T18:45:05.793Z	INFO	Started servicegraphconnector
    logger.go:146: 2024-07-09T18:45:05.793Z	INFO	Shutting down servicegraphconnector
    logger.go:146: 2024-07-09T18:45:05.796Z	DEBUG	edge completed	{"client_service": "foo", "server_service": "bar", "connection_type": "", "trace_id": "b3a53e9c2ac533afc48b511c2f368dc1"}
    logger.go:146: 2024-07-09T18:45:05.796Z	DEBUG	edge completed	{"client_service": "foo", "server_service": "bar", "connection_type": "", "trace_id": "c6aafb2dcf4d71c81ae52e50e8f805ee"}
make: *** [Makefile:131: gotest-with-cover] Error 2
    logger.go:146: 2024-07-09T18:45:05.796Z	DEBUG	edge completed	{"client_service": "foo", "server_service": "bar", "connection_type": "", "trace_id": "7af657de319d6a57967bc241882ea94f"}
    connector_test.go:129: 
        	Error Trace:	/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/connector/servicegraphconnector/connector_test.go:129
        	Error:      	Received unexpected error:
        	            	the following errors occurred:
        	            	 -  resource "map[]": scope "traces_service_graph": metric "traces_service_graph_request_client_seconds": missing expected datapoint: map[client:foo connection_type: failed:false server:bar]
        	            	 -  resource "map[]": scope "traces_service_graph": metric "traces_service_graph_request_client_seconds": unexpected datapoint: map[client:foo connection_type: failed:true server:bar]
        	            	 -  resource "map[]": scope "traces_service_graph": metric "traces_service_graph_request_client_seconds": missing expected datapoint: map[client:foo connection_type: failed:true server:bar]
        	            	 -  resource "map[]": scope "traces_service_graph": metric "traces_service_graph_request_client_seconds": unexpected datapoint: map[client:foo connection_type: failed:false server:bar]
        	            	 -  resource "map[]": scope "traces_service_graph": metric "traces_service_graph_request_server_seconds": missing expected datapoint: map[client:foo connection_type: failed:false server:bar]
        	            	 -  resource "map[]": scope "traces_service_graph": metric "traces_service_graph_request_server_seconds": unexpected datapoint: map[client:foo connection_type: failed:true server:bar]
        	            	 -  resource "map[]": scope "traces_service_graph": metric "traces_service_graph_request_server_seconds": missing expected datapoint: map[client:foo connection_type: failed:true server:bar]
        	            	 -  resource "map[]": scope "traces_service_graph": metric "traces_service_graph_request_server_seconds": unexpected datapoint: map[client:foo connection_type: failed:false server:bar]
        	Test:       	TestConnectorConsume/test_fix_failed_label_not_work

=== FAIL: . TestConnectorConsume (0.01s)

=== FAIL: . TestConnectorConsume/test_fix_failed_label_not_work (re-run 1) (0.01s)
    logger.go:146: 2024-07-09T18:45:07.902Z	INFO	Started servicegraphconnector
    logger.go:146: 2024-07-09T18:45:07.902Z	INFO	Shutting down servicegraphconnector
    logger.go:146: 2024-07-09T18:45:07.904Z	DEBUG	edge completed	{"client_service": "foo", "server_service": "bar", "connection_type": "", "trace_id": "b3a53e9c2ac533afc48b511c2f368dc1"}
    logger.go:146: 2024-07-09T18:45:07.905Z	DEBUG	edge completed	{"client_service": "foo", "server_service": "bar", "connection_type": "", "trace_id": "c6aafb2dcf4d71c81ae52e50e8f805ee"}
    logger.go:146: 2024-07-09T18:45:07.905Z	DEBUG	edge completed	{"client_service": "foo", "server_service": "bar", "connection_type": "", "trace_id": "7af657de319d6a57967bc241882ea94f"}
    connector_test.go:129: 
        	Error Trace:	/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/connector/servicegraphconnector/connector_test.go:129
        	Error:      	Received unexpected error:
        	            	the following errors occurred:
        	            	 -  resource "map[]": scope "traces_service_graph": metric "traces_service_graph_request_server_seconds": missing expected datapoint: map[client:foo connection_type: failed:false server:bar]
        	            	 -  resource "map[]": scope "traces_service_graph": metric "traces_service_graph_request_server_seconds": unexpected datapoint: map[client:foo connection_type: failed:true server:bar]
        	            	 -  resource "map[]": scope "traces_service_graph": metric "traces_service_graph_request_server_seconds": missing expected datapoint: map[client:foo connection_type: failed:true server:bar]
        	            	 -  resource "map[]": scope "traces_service_graph": metric "traces_service_graph_request_server_seconds": unexpected datapoint: map[client:foo connection_type: failed:false server:bar]
        	Test:       	TestConnectorConsume/test_fix_failed_label_not_work

=== FAIL: . TestConnectorConsume (re-run 1) (0.01s)

DONE 2 runs, 33 tests, 4 failures in 8.913s
make[1]: Leaving directory '/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib'
Error: Process completed with exit code 2.
crobert-1 added the needs triage (New item requiring triage) label on Jul 9, 2024
Contributor

github-actions bot commented Jul 9, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1
Member Author

crobert-1 commented Jul 10, 2024

@jpkrohling
Member

I talked to @t00mas and he'll be looking into this soon.

jpkrohling added the ci-cd (CI, CD, testing, build issues) label and removed the needs triage (New item requiring triage) label on Jul 11, 2024
@evantorrie
Contributor

The problem is that the servicegraph connector unit tests compare against a golden set of metrics using pmetrictest.CompareMetrics() with the IgnoreMetricsOrder() option, but the generated payload contains two instances of a metric with the same name and different datapoint attributes (this seems to mostly, perhaps only, affect comparison of histograms).

pmetrictest.CompareScopeMetrics() considers two metrics from the same scope to be "matched" simply by comparing their names.

When the servicegraph connector test fails, it is because consecutive metrics share the same name, but the datapoint attributes differ between the actual metric and the expected metric from the golden dataset.

Perhaps IgnoreMetricAttributeValue() should be passed during this test?

Alternatively, perhaps pmetrictest.CompareScopeMetrics() with IgnoreMetricsOrder() should consider more than just the name when deciding whether two metrics are the same.
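
For context, here is a minimal sketch of how such a golden-file comparison is typically written; the golden file path, helper name, and exact option set are assumptions for illustration, not the actual servicegraphconnector test code.

```go
// Sketch of a golden-file metrics comparison; path, helper name, and option
// set are assumptions, not the actual servicegraphconnector test code.
package sketch

import (
	"path/filepath"
	"testing"

	"github.com/open-telemetry/opentelemetry-collector-contrib/pkg/golden"
	"github.com/open-telemetry/opentelemetry-collector-contrib/pkg/pdatatest/pmetrictest"
	"github.com/stretchr/testify/require"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

func verifyMetrics(t *testing.T, actual pmetric.Metrics) {
	expected, err := golden.ReadMetrics(filepath.Join("testdata", "expected.yaml"))
	require.NoError(t, err)

	// IgnoreMetricsOrder() pairs expected and actual metrics by name only,
	// which is where the flakiness comes from when two metrics share a name
	// but carry different datapoint attributes.
	require.NoError(t, pmetrictest.CompareMetrics(expected, actual,
		pmetrictest.IgnoreMetricsOrder(),
		pmetrictest.IgnoreTimestamp(),
		pmetrictest.IgnoreStartTimestamp(),
	))
}
```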

@jpkrohling
Member

I think that IgnoreMetricsOrder() should also ignore the order of the attributes.

@evantorrie
Contributor

I don't completely understand the servicegraph metrics implementation, but on further investigation I wonder whether the way the metrics are created in calls like collectServerLatencyMetrics and collectClientLatencyMetrics is correct.

In each case the code ranges over a map of keys and creates a new metric on every iteration, even though the metric name is the same in all cases.

My understanding is that if this were inside an application using the OTel SDK, the SDK would aggregate all of those keys (which I think represent different attribute sets) into a single metric entry with multiple datapoints. The servicegraphconnector code, on the other hand, seems to create multiple entries for the same metric, with just one datapoint per metric entry.

Perhaps this breaks a precondition that pmetrictest.CompareScopeMetrics() expects, which appears to be only one entry per unique metric name within a single ScopeMetrics list of metrics.
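
To make the two shapes concrete, here is a minimal sketch contrasting one metric entry per attribute set with one metric that carries one datapoint per attribute set; the function names and attribute values are hypothetical, only the metric name is taken from the failure output above.

```go
// Sketch contrasting the two metric shapes discussed above; function names
// and attribute values are hypothetical.
package sketch

import "go.opentelemetry.io/collector/pdata/pmetric"

// onePerKey mirrors the pattern being questioned: a new metric entry is
// appended for every key, all entries sharing the same name.
func onePerKey(sm pmetric.ScopeMetrics, clients []string) {
	for _, client := range clients {
		m := sm.Metrics().AppendEmpty()
		m.SetName("traces_service_graph_request_server_seconds")
		dp := m.SetEmptyHistogram().DataPoints().AppendEmpty()
		dp.Attributes().PutStr("client", client)
	}
}

// coalesced is the SDK-style aggregation: one metric entry with one
// datapoint per attribute set.
func coalesced(sm pmetric.ScopeMetrics, clients []string) {
	m := sm.Metrics().AppendEmpty()
	m.SetName("traces_service_graph_request_server_seconds")
	hist := m.SetEmptyHistogram()
	for _, client := range clients {
		dp := hist.DataPoints().AppendEmpty()
		dp.Attributes().PutStr("client", client)
	}
}
```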

@songy23
Member

songy23 commented Jul 16, 2024

+2 freq: #34092 #34094

jpkrohling pushed a commit that referenced this issue Jul 16, 2024
…Order option to CompareMetricsOption (#34076)

By introducing the new option and using it in the `servicegraphconnector` test, the flakiness of the test, and therefore the false positives, is resolved.

Fixes #33998

---------

Signed-off-by: odubajDT <[email protected]>
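
As a rough illustration, the change amounts to passing one more option in the connector test's comparison. The exact identifier added by #34076 is truncated in the commit title above, so the option name below is an assumed stand-in; the surrounding call mirrors the comparison sketch earlier in this thread.

```go
// Sketch only: "IgnoreMetricDataPointsOrder" is an assumed stand-in for the
// option introduced by #34076 (its exact name is truncated above); the other
// options are the ones already discussed in this thread.
func verifyMetricsWithNewOption(t *testing.T, expected, actual pmetric.Metrics) {
	require.NoError(t, pmetrictest.CompareMetrics(expected, actual,
		pmetrictest.IgnoreMetricsOrder(),
		pmetrictest.IgnoreMetricDataPointsOrder(), // assumed name of the new option
		pmetrictest.IgnoreTimestamp(),
		pmetrictest.IgnoreStartTimestamp(),
	))
}
```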
@mwear
Member

mwear commented Jul 16, 2024

@crobert-1
Member Author

+1 freq: #33994

@songy23
Member

songy23 commented Jul 16, 2024

Given the frequency of the flakiness, I would suggest skipping this test for now: #34120

codeboten pushed a commit that referenced this issue Jul 16, 2024
jpkrohling pushed a commit that referenced this issue Jul 17, 2024
…peMetrics metric entry (#34070)

Tests in `connector/servicegraph` were failing because the servicegraph
`buildMetrics()` code was creating multiple metric entries for the same
metric name within a single ScopeMetrics. Although this may not be
forbidden by the OTel specification, I think there is a general
assumption that a metric name does not appear more than once within the
same ScopeMetrics.
Instead, different values (e.g. with different sets of attribute values)
should be recorded as separate datapoints within the same metric.

The `pmetrictest.CompareScopeMetrics` test functionality is not designed
to handle multiple metric entries with the same `Name()`. Instead, it is
assumed that in cases where Order is ignored, the [first entry found in
the actual metrics which matches the name of the expected
metrics](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/ca9a8d14df471ed6a7dda7562ddfe659ec52127d/pkg/pdatatest/pmetrictest/metrics.go#L186-L195)
*must* be the metric to compare.

This fix changes the `buildMetrics()` code to create one metric per name
within a scope, and to add one datapoint to that metric per unique
datapoint attribute set (i.e. all entries in the internal maps
`serviceGraphConnector.req{Total,FailedTotal,ServerDurationSecondsCount}`
are coalesced into a single named metric as appropriate).
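
A rough sketch of the coalescing approach described above; the map layout, function name, and attribute handling are simplified stand-ins for the connector's internal fields, not the actual servicegraph code.

```go
// Rough sketch of coalescing per-key histogram state into a single metric
// with one datapoint per attribute set; names and map layout are simplified
// stand-ins for the connector's internal fields.
package sketch

import (
	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

func buildServerLatencyMetric(
	sm pmetric.ScopeMetrics,
	counts map[string]uint64,
	sums map[string]float64,
	buckets map[string][]uint64,
	dimensions map[string]pcommon.Map,
) {
	// A single metric entry per name within the scope...
	m := sm.Metrics().AppendEmpty()
	m.SetName("traces_service_graph_request_server_seconds")
	hist := m.SetEmptyHistogram()
	hist.SetAggregationTemporality(pmetric.AggregationTemporalityCumulative)

	// ...with one datapoint per unique attribute-set key, instead of one
	// metric entry per key.
	for key, count := range counts {
		dp := hist.DataPoints().AppendEmpty()
		dp.SetCount(count)
		dp.SetSum(sums[key])
		dp.BucketCounts().FromRaw(buckets[key])
		dimensions[key].CopyTo(dp.Attributes())
	}
}
```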

_Note_: I don't have any past experience working with
servicegraphconnector; I am just observing that
`collectClientLatencyMetrics()` and `collectServerLatencyMetrics()` both
range over the same map, `p.reqServerDurationSecondsCount`, although
the values collected in `collectClientLatencyMetrics()` come from
`p.reqServerDurationSeconds{Count,Sum,BucketCounts}` and the values
collected in `collectServerLatencyMetrics()` come from
`p.reqClientDurationSeconds{Count,Sum,BucketCounts}`.

This seems a little asymmetrical, but I don't have enough experience to
say whether this is an error or not.

**Description:**  Fixes #33998 

**Testing:** All other unit tests now complete, and the previously
failing unit test now works reliably.

**Documentation:** No documentation added. This is a unit test fix.