refreshUserImpactJob logs mysterious fatal errors
Closed, Resolved (Public)

Description

RefreshUserImpactJob is a job that refreshes user impact data for display at Special:Homepage and Special:Impact; see https://www.mediawiki.org/wiki/Growth/Positive_reinforcement for project details.

According to @Tgr's and @Krinkle's comments at T341658 (the first task of this kind):

Most of these errors […] are about file access. […] it's probably overrunning some sort of quota that applies to open files + connections?

Nice catch, and yes, I believe this is exactly what's happening. See also T230245, where a maintenance script that generates captcha images got a fatal error:

Fatal error: require(/srv/mediawiki/php-1.34.0-wmf.8/includes/json/FormatJson.php): File not found in /srv/mediawiki/php-1.34.0-wmf.8/includes/AutoLoader.php on line 109

Notably, this fatal error happens as part of error reporting, after the script fails to upload a file to Swift over HTTP. Both failures happen for the same reason: EMFILE ("Too many open files"), a system-level restriction typically set by the operating system.
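For illustration, here is a minimal sketch (not from the task) of how a PHP process runs into EMFILE: each tmpfile() call keeps one descriptor open, and once the soft RLIMIT_NOFILE value is reached, further opens fail.

<?php
// Illustrative only: exhaust the per-process soft limit on open files.
$handles = [];
while ( true ) {
	$h = @tmpfile(); // each successful call holds one file descriptor
	if ( $h === false ) {
		// tmpfile() fails with EMFILE once RLIMIT_NOFILE is reached
		echo 'Failed after ' . count( $handles ) . " open handles\n";
		break;
	}
	$handles[] = $h;
}

With a soft limit of 1024 (as later reported for the jobrunners in this task), the loop fails slightly below 1024 iterations, since stdin/stdout/stderr and other descriptors are already open.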

In the case of T230245, the problem was that no concurrency limit was set. The script was correctly uploading files in batches, but it was preparing the temp files all in parallel and keeping their file handles open.
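As a hedged sketch of that fix pattern (the $items input and uploadBatch() helper are hypothetical placeholders, not the actual T230245 code), the idea is to prepare temp files one batch at a time and close them before the next batch, so at most one batch's worth of descriptors is open at any time:

<?php
// Hypothetical sketch of bounded-concurrency batching; $items and
// uploadBatch() are illustrative placeholders.
$batchSize = 20;
foreach ( array_chunk( $items, $batchSize ) as $batch ) {
	$handles = [];
	foreach ( $batch as $item ) {
		$h = tmpfile();
		fwrite( $h, $item ); // prepare the temp file for upload
		$handles[] = $h; // at most $batchSize descriptors held here
	}
	uploadBatch( $handles ); // hypothetical HTTP upload helper
	foreach ( $handles as $h ) {
		fclose( $h ); // release descriptors before the next batch
	}
}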

Filing this task as an umbrella task for all RefreshUserImpactJob-related tasks that might be explained by the "too many descriptors" root cause, to centralize the discussion around the root cause while keeping the individual symptoms (error messages) logged as separate tasks, so they remain findable.

The job makes a lot of AQS requests and performs other file-descriptor-consuming actions (such as opening database connections) to fetch all the data that can be precalculated and cached in growthexperiments_user_impact. Unfortunately, the job seems to be consuming too many file descriptors, and either the job or a library it uses might be forgetting to close something.

The goal of this task is to determine the root cause of the problem.

Event Timeline

Urbanecm_WMF renamed this task from RefreshUserImpactJob consumes too many file handlers to RefreshUserImpactJob consumes too many file descriptors.Aug 17 2023, 11:32 AM
Urbanecm_WMF updated the task description.

Reviewing the logs, this happens within the mediawiki_job_growthexperiments-userImpactUpdateRecentlyEdited periodic job (the errors are logged at 07:XX, and the RecentlyEdited variant starts at 07:45, while the RecentlyRegistered variant starts at 05:15).

I tried running mwscript extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --wiki=enwiki --registeredWithin=1year --editedWithin=2week --hasEditsAtLeast=3 --verbose --use-job-queue to see if the same errors are logged. I don't see any so far.

Mentioned in SAL (#wikimedia-operations) [2023-09-05T21:38:09Z] <urbanecm> mwmaint1002: /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/growthexperiments.dblist extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --registeredWithin=1year --editedWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=6hour --verbose --use-job-queue (trying to reproduce T344428)

Mentioned in SAL (#wikimedia-operations) [2023-09-05T22:11:09Z] <urbanecm> mwmaint1002: /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/growthexperiments.dblist extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --batch-size=20 --registeredWithin=1year --editedWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=6hour --verbose --use-job-queue (debugging T344428, lowered batch size [100 -> 20])

Findings so far:

  • The error can be reproduced when running the Puppet-scheduled script manually (/usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/growthexperiments.dblist extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --registeredWithin=1year --editedWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=6hour --verbose --use-job-queue).
  • The error doesn't seem to appear when not using the job queue.
  • I'm unable to reproduce the issue while running the script for only one wiki (I tried enwiki and ruwiki). I also didn't manage to reproduce it when running multiple wikis at once (scheduling jobs for multiple wikis at the same time).
  • The error happens less frequently when the batch size is lowered (I got ~5 related errors with a batch size of 20 and ~1 with a batch size of 10).

I thought that this might be caused by the limit being exhausted because more jobs run at the same time. But according to Grafana, the job processing rate during the regular morning run is 1.62, which is not that high, so maybe not? Or am I misreading the charts? I'm confused by the fact that executing a lot of user impact jobs seems to be necessary to trigger this issue.

I'm inclined to try dedicating a queue for refreshUserImpactJob anyway, as it is a fairly frequent job (it runs on every edit made by a Homepage user). I'm not sure it would resolve this issue, but it seems worth doing either way?

Change 955319 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/deployment-charts@master] changeprop: Rule for refreshUserImpactJob

https://gerrit.wikimedia.org/r/955319

Change 955319 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: Rule for refreshUserImpactJob

https://gerrit.wikimedia.org/r/955319

Mentioned in SAL (#wikimedia-operations) [2023-09-07T10:56:35Z] <urbanecm> mwmaint1002: /usr/local/bin/mw-cli-wrapper /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/growthexperiments.dblist extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --registeredWithin=1year --editedWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=1second --verbose --use-job-queue (T344428, testing with r955319 deployed)

I'm inclined to try dedicating a queue for refreshUserImpactJob anyway, as it is a fairly frequent job (it runs on every edit made by a Homepage user). I'm not sure it would resolve this issue, but it seems worth doing either way?

I did that in r955319 and started the script again (with --ignoreIfUpdatedWithin=1second to ensure we attempt to update all users). Logstash:

Time	level	channel	host	wiki	message
Sep 7, 2023 @ 10:58:15.001	ERROR	exception	mw1486	dewiki	[4ca2e70d560c09fabfab51e9] /rpc/RunSingleJob.php   Error: Class 'Cdb\Exception' not found
Sep 7, 2023 @ 10:57:30.854	ERROR	exception	mw1485	arwiki	[a5956761b15336e3f49de58a] /rpc/RunSingleJob.php   Error: Class 'Cdb\Exception' not found
Sep 7, 2023 @ 10:54:14.697	ERROR	exception	mw1457	arwiki	[7bd49600ca4fdf07812b8ab5] /rpc/RunSingleJob.php   Error: Class 'Cdb\Exception' not found

so... dedicating a lane for refreshUserImpactJob didn't seem to help. Observing JobExecutor.log, the concurrency seems to be 3 (as set). Something else has to be playing a role...

Bummer. Thanks for continuing to investigate, @Urbanecm_WMF!

Any chance we can pick your brain on this one, @Tgr? Or are there any other engineers we can pull in to provide other ideas to investigate?

This is blocking us from scaling the new Impact module, and I would really like us to wrap that up by the end of the month if possible.

Maybe SREs know the answer off-hand. There might also be error messages in OS-level logs. Otherwise I'd try to create a minimal reproduction. I can think of a few hypotheses:

  • Guzzle doesn't close its connections (although the code does look correct - CurlFactory::release() is called on all execution paths as far as I can see). Since the job runner executes multiple jobs in the same process, at some point it runs out of file handles.
  • The connections get closed, but they are still counted against the limit due to some curl or PHP or OS bug or quirk.
  • There is some very low quota for all PHP processes together. Guzzle keeps 5 connections open; when enough instances of it run in parallel, the quota runs out.

Some ways to try to differentiate between those: debug-log open resource counts (get_resources() and/or lsof), lower CurlFactory::$maxHandles, use some kind of lock to prevent job parallelism (or maybe the job queue can do that), or limit (if possible) the number of jobs run in the same process.
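A minimal sketch of the debug-logging idea (the helper name is made up; get_resources() is a standard PHP function, and /proc/self/fd is Linux-specific):

<?php
// Sketch: log PHP-level resource counts and kernel-level FD counts,
// e.g. before and after each job, to spot leaking descriptors.
function logOpenDescriptors( string $label ): void {
	$phpResources = count( get_resources() ); // all live PHP resources
	// subtract '.' and '..'; note scandir itself briefly opens one FD
	$kernelFds = count( scandir( '/proc/self/fd' ) ) - 2;
	error_log( "$label: php_resources=$phpResources kernel_fds=$kernelFds" );
}

logOpenDescriptors( 'before job' );
// ... $job->run(); ...
logOpenDescriptors( 'after job' );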

Thanks for your thoughts, @Tgr! I agree SRE expertise would be helpful for resolving the underlying issue, but before dragging them in, I want(ed?) to have a small command (or a set of steps) that triggers the issue. Unfortunately, I'm having a hard time reproducing it in any way other than running the same command Puppet runs :-(. Thank you for the list of possible hypotheses; please find my thoughts on them below. My thinking around those options may have a flaw somewhere that I'm not seeing, so please feel free to correct me on anything.

Guzzle doesn't close its connections (although the code does look correct - CurlFactory::release() is called on all execution paths as far as I can see). Since the job runner executes multiple jobs in the same process, at some point it runs out of file handles.
The connections get closed, but they are still counted against the limit due to some curl or PHP or OS bug or quirk.

If either of those possibilities were the case, I'd expect disabling the job queue (i.e. running the command without --use-job-queue) to trigger the issue as well, because in that case all of the revalidation runs within a single process on a single server. So far (and surprisingly), using the job queue seems to be a prerequisite for reproducing the issue. To reach a more definitive conclusion, I executed a slightly altered version of the script:

[urbanecm@mwmaint1002 ~]$ git diff refreshUserImpactData.php /srv/mediawiki/php-1.41.0-wmf.25/extensions/GrowthExperiments/maintenance/refreshUserImpactData.php
diff --git a/refreshUserImpactData.php b/srv/mediawiki/php-1.41.0-wmf.25/extensions/GrowthExperiments/maintenance/refreshUserImpactData.php
index c978760..9741795 100644
--- a/refreshUserImpactData.php
+++ b/srv/mediawiki/php-1.41.0-wmf.25/extensions/GrowthExperiments/maintenance/refreshUserImpactData.php
@@ -86,19 +86,18 @@ class RefreshUserImpactData extends Maintenance {
                                        $this->output( "  ...would refresh user impact for user {$user->getId()}\n" );
                                }
                                continue;
-                       } elseif ( $this->hasOption( 'use-job-queue' ) ) {
+                       } elseif ( $this->hasOption( 'use-job-queue' ) || true ) {
                                $users[$user->getId()] = null;
                                if ( count( $users ) >= $this->getBatchSize() ) {
                                        if ( $this->hasOption( 'verbose' ) ) {
                                                $usersText = implode( ', ', array_keys( $users ) );
-                                               $this->output( " ... enqueueing refreshUserImpactJob for users $usersText\n" );
+                                               $this->output( " ... running refreshUserImpactJob for users $usersText\n" );
                                        }
-                                       $this->jobQueueGroupFactory->makeJobQueueGroup()->lazyPush(
-                                               new RefreshUserImpactJob( [
-                                                       'impactDataBatch' => $users,
-                                                       'staleBefore' => $this->ignoreAfter,
-                                               ] )
-                                       );
+                                       $job = new RefreshUserImpactJob( [
+                                               'impactDataBatch' => $users,
+                                               'staleBefore' => $this->ignoreAfter,
+                                       ] );
+                                       $job->run();
                                        $users = [];
                                }
                        } else {
[urbanecm@mwmaint1002 ~]$

No errors. Admittedly, foreachwikiindblist starts a dedicated process for each wiki, while job workers are longer-running; but especially for big wikis like enwiki, where thousands of users need to be updated, I'd expect this error to happen even without the job queue being involved. For some reason, it is not happening. That leads me to believe this hypothesis is not the cause.

There is some very low quota for all PHP processes together. Guzzle keeps 5 connections open; when enough instances of it run in parallel, the quota runs out.

I thought so too; this is why I lowered the job concurrency to 3. Previously, the user impact job was running as a low-traffic job (all low-traffic jobs have a combined concurrency of 50). The data refresh now takes significantly longer (as expected), but it still errors out, see Logstash. If we want to be super-certain, we can temporarily lower the job concurrency to 1 (and probably stop ImpactHooks::refreshUserImpactDataInDeferredUpdate from scheduling the job on each edit), but I believe a concurrency of 3 is low enough -- the quota would have to be terribly low for the error to still be present at 3.

Maybe the job queue has a different PHP environment, or the jobrunner servers have a different configuration? Seems unlikely.

At this point, it might be worth ascertaining that the issue really is caused by open file limits. There is probably something in either the PHP error log or syslog about it.

Tagging SRE for assistance with this issue, as it is definitely system-related in some way. A quick summary of the discussion above:

  • This is one of the most frequent errors on the Growth team's Logstash dashboard.
  • We've been suspecting the issue is caused by exceeding process resource limits. This is not proven by the logs, as nothing more than Class not found / Cannot access the database is visible in Logstash, see example error. Maybe there's something logged in syslog on the server itself? File permissions suggest it is readable only by root; perhaps an SRE could investigate (or advise how syslog can be accessed by deployers)?
  • The issue can be reliably reproduced by running /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/growthexperiments.dblist extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --registeredWithin=1year --editedWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=1second --verbose --use-job-queue on a maintenance server. This command is executed as a maintenance job every UTC morning.
  • Attempts to reproduce it on a single server (without the job queue) or on a single wiki (without foreachwikiindblist) have failed.
  • The issue didn't clear up when the job concurrency was significantly reduced (set to 3 in 955319).

We're asking SREs to help both with verifying that the file resource limit suspicion is correct and with investigating the cause of the issue. Thanks in advance for any help you can provide here!

mw2381:

$ ulimit -Hn
1048576
$ ulimit -Sn
1024
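Note that these are the interactive shell's limits; a php-fpm child may run with different values. A small sketch (Linux-specific, reading /proc/self/limits) for logging the limit as the worker actually sees it:

<?php
// Sketch: report RLIMIT_NOFILE as seen by this PHP process, which may
// differ from the shell's ulimit output shown above.
foreach ( file( '/proc/self/limits' ) as $line ) {
	if ( strpos( $line, 'Max open files' ) !== false ) {
		// e.g. "Max open files    1024    1048576    files"
		error_log( trim( $line ) );
	}
}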

I didn't see anything particularly interesting in syslog, and the journal rolled over before I could see the results of today's run. It might be worth asking a serviceops engineer in a timezone closer to 07:45 UTC to have a look at the journal before it truncates.

Found one:

Sep 21 09:31:52 mw2381 php7.4-fpm[44244]: [WARNING] [pool www-7.4] child 44256, script '/srv/mediawiki/rpc/RunSingleJob.php' (request: "POST /rpc/RunSingleJob.php") executing too slow (18.839211 sec), logging
Sep 21 09:31:52 mw2381 php7.4-fpm[44244]: [NOTICE] child 44256 stopped for tracing
Sep 21 09:31:52 mw2381 php7.4-fpm[44244]: [NOTICE] about to trace 44256
Sep 21 09:31:52 mw2381 php7.4-fpm[44244]: [NOTICE] finished trace of 44256
[21-Sep-2023 09:31:52]  [pool www-7.4] pid 44256
script_filename = /srv/mediawiki/rpc/RunSingleJob.php
[0x00007f68cf2149e0] curl_exec() /srv/mediawiki/php-1.41.0-wmf.27/vendor/guzzlehttp/guzzle/src/Handler/CurlHandler.php:44
[0x00007f68cf214940] __invoke() /srv/mediawiki/php-1.41.0-wmf.27/vendor/guzzlehttp/guzzle/src/Handler/Proxy.php:28
[0x00007f68cf214890] GuzzleHttp\Handler\{closure}() /srv/mediawiki/php-1.41.0-wmf.27/vendor/guzzlehttp/guzzle/src/Handler/Proxy.php:48
[0x00007f68cf2147e0] GuzzleHttp\Handler\{closure}() /srv/mediawiki/php-1.41.0-wmf.27/vendor/guzzlehttp/guzzle/src/PrepareBodyMiddleware.php:64
[0x00007f68cf214700] __invoke() /srv/mediawiki/php-1.41.0-wmf.27/vendor/guzzlehttp/guzzle/src/Middleware.php:31
[0x00007f68cf214650] GuzzleHttp\{closure}() /srv/mediawiki/php-1.41.0-wmf.27/vendor/guzzlehttp/guzzle/src/RedirectMiddleware.php:55
[0x00007f68cf2145b0] __invoke() /srv/mediawiki/php-1.41.0-wmf.27/vendor/guzzlehttp/guzzle/src/Middleware.php:61
[0x00007f68cf214510] GuzzleHttp\{closure}() /srv/mediawiki/php-1.41.0-wmf.27/vendor/guzzlehttp/guzzle/src/HandlerStack.php:75
[0x00007f68cf214480] __invoke() /srv/mediawiki/php-1.41.0-wmf.27/vendor/guzzlehttp/guzzle/src/Client.php:331
[0x00007f68cf214330] transfer() /srv/mediawiki/php-1.41.0-wmf.27/vendor/guzzlehttp/guzzle/src/Client.php:107
[0x00007f68cf2142b0] sendAsync() /srv/mediawiki/php-1.41.0-wmf.27/vendor/guzzlehttp/guzzle/src/Client.php:137
[0x00007f68cf214230] sendRequest() /srv/mediawiki/php-1.41.0-wmf.27/vendor/wikimedia/shellbox/src/Client.php:162
[0x00007f68cf214050] sendRequest() /srv/mediawiki/php-1.41.0-wmf.27/vendor/wikimedia/shellbox/src/Client.php:107
[0x00007f68cf213e80] call() /srv/mediawiki/php-1.41.0-wmf.27/extensions/WikibaseQualityConstraints/src/ConstraintCheck/Checker/FormatChecker.php:199
[0x00007f68cf213dc0] runRegexCheckUsingShellbox() /srv/mediawiki/php-1.41.0-wmf.27/extensions/WikibaseQualityConstraints/src/ConstraintCheck/Checker/FormatChecker.php:181
[0x00007f68cf213d20] runRegexCheck() /srv/mediawiki/php-1.41.0-wmf.27/extensions/WikibaseQualityConstraints/src/ConstraintCheck/Checker/FormatChecker.php:139
[0x00007f68cf213bf0] checkConstraint() /srv/mediawiki/php-1.41.0-wmf.27/extensions/WikibaseQualityConstraints/src/ConstraintCheck/DelegatingConstraintChecker.php:568
[0x00007f68cf213af0] getCheckResultFor() /srv/mediawiki/php-1.41.0-wmf.27/extensions/WikibaseQualityConstraints/src/ConstraintCheck/DelegatingConstraintChecker.php:546
[0x00007f68cf2139c0] checkConstraintsForReferences() /srv/mediawiki/php-1.41.0-wmf.27/extensions/WikibaseQualityConstraints/src/ConstraintCheck/DelegatingConstraintChecker.php:389
[0x00007f68cf2138a0] checkStatement() /srv/mediawiki/php-1.41.0-wmf.27/extensions/WikibaseQualityConstraints/src/ConstraintCheck/DelegatingConstraintChecker.php:342

Some host logs from the same general time: https://logstash.wikimedia.org/goto/8fe070310612f8b8640d644a33dbd35f

I didn't see anything particularly interesting in syslog, and the journal rolled over before I could see the results of today's run. It might be worth asking a serviceops engineer in a timezone closer to 07:45 UTC to have a look at the journal before it truncates.

Thanks for checking! Just to confirm, did you check at mwmaint2002 only, or did you check what's happening at the job executors as well? The mwmaint-facing part only schedules jobs; it's the jobs (at the job executors) that fail with this mysterious error. A list of failures is available in Logstash: https://logstash.wikimedia.org/goto/f25a08980e8a64bf48b23be5fbcd3d63.

Thanks for checking! I also looked at Logstash for system logs (thanks @colewhite for providing the link via IRC!), but I do not see anything that would explain why the job is unable to open a code file it should have access to :-/.

@Urbanecm_WMF as I said on IRC, there are two main differences when running in the jobqueue:

  1. We use opcache
  2. We use APC

What I fail to understand is how, if this were an open file limits problem, it would only show up when you run it for all wikis. Unless we spawn 1000 jobs in parallel, which, well, we should avoid doing (and AFAICT we have no special configuration for such jobs in changeprop-jobqueue), I would expect the problem to show up even if you just select two large wikis.

This is unless the job opens a huge flurry of files, of course.
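One way to compare the two environments would be a probe along these lines (a sketch; opcache_get_status() and apcu_enabled() are the standard extension functions, assuming the extensions are available):

<?php
// Sketch: log whether opcache and APCu are active in this process, to
// compare the CLI maintenance environment with the php-fpm jobrunners.
$opcacheActive = function_exists( 'opcache_get_status' )
	&& opcache_get_status( false ) !== false;
$apcuActive = function_exists( 'apcu_enabled' ) && apcu_enabled();
error_log( 'opcache active: ' . var_export( $opcacheActive, true ) );
error_log( 'APCu active: ' . var_export( $apcuActive, true ) );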

It's hard to assess the issue as things stand, so I would ask you to do the following:

  • run the job, via the jobqueue, for say enwiki and dewiki with a custom dblist, and see if it fails consistently this way.
  • run the job locally from mwmaint in coordination with an SRE and have them monitor the list of open files by that php process (a monitoring sketch follows this list).
  • we can temporarily bump up the number of open files for php-fpm on the jobrunners to determine if that solves your issue.
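A hypothetical helper for the monitoring step above: poll another process's open-FD count once per second (reading /proc/<pid>/fd for another user's process typically requires root):

<?php
// Hypothetical fdwatch.php: print the open-FD count of a PID each second.
// Usage: php fdwatch.php <pid>
$pid = (int)( $argv[1] ?? 0 );
while ( true ) {
	$entries = @scandir( "/proc/$pid/fd" );
	if ( $entries === false ) {
		fwrite( STDERR, "process $pid exited or is unreadable\n" );
		exit( 1 );
	}
	echo date( 'H:i:s' ) . ' open fds: ' . ( count( $entries ) - 2 ) . "\n";
	sleep( 1 );
}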

Also, for every run where you encounter a problem, please include the full command line you used and a link to the exceptions on Logstash; that would help me try to make sense of this - currently it does not!

Thanks for the advice @Joe!

What I fail to understand is how, if this were an open file limits problem, it would only show up when you run it for all wikis. Unless we spawn 1000 jobs in parallel, which, well, we should avoid doing (and AFAICT we have no special configuration for such jobs in changeprop-jobqueue), I would expect the problem to show up even if you just select two large wikis.

It seems it's happening with two wikis as well (dewiki and enwiki, as suggested below). I had been testing only single-wiki executions.

run the job, via the jobqueue, for say enwiki and dewiki with a custom dblist, and see if it fails consistently this way.

I did the following:

[urbanecm@mwmaint2002 ~]$ cat /srv/mediawiki/dblists/growth-biggest.dblist
dewiki
enwiki
[urbanecm@mwmaint2002 ~]$ foreachwikiindblist /srv/mediawiki/dblists/growth-biggest.dblist extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --registeredWithin=1year --editedWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=1second --verbose --use-job-queue
[...]
[urbanecm@mwmaint2002 ~]$

Errors are reproducible with this sequence. Logstash link: https://logstash.wikimedia.org/goto/16ee5640cea6b2041944b6d410712ba1. Note that this is just what happened in the couple of minutes after I triggered the command; further exceptions might appear after the timestamp in the Logstash link I just shared. The exceptions are: [2462ee711f87091674836688] /rpc/RunSingleJob.php   Error: Class 'Cdb\Exception' not found.

Does this help with understanding more context here, @Joe?

run the job locally from mwmaint in coordination with an SRE and have them monitor the list of open files by that php process.

I'm not sure what the SREs would do, but observing the list of open file descriptors in /proc/PID/fd while running the script locally at mwmaint didn't show an ever-growing list of open FDs. Happy to repeat this with an SRE. Would you be willing to help with this from the SRE end of things?

we can temporarily bump up the number of open files for php-fpm on the jobrunners to determine if that solves your issue.

I'm not 100% convinced it's open files causing the issue, but it's the only explanation that makes sense to me given that the error is "failed to open XYZ file" / "failed to connect to XYZ server". I think this would, at the very least, serve to rule out a potential cause.

Sorry for the silence; I was first at a conference and then in bed sick (and I'm still not in great health).

I'm happy to bump the open files number on the jobrunners, but I'm almost convinced the problem has to do with some opcache clashing I don't fully understand. It's not a big issue to add an override in Puppet to raise the number of open files for php-fpm anyway, so let's try that first.

Change 967870 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] jobrunner: increase open files limit

https://gerrit.wikimedia.org/r/967870

Change 967870 merged by Giuseppe Lavagetto:

[operations/puppet@production] jobrunner: increase open files limit

https://gerrit.wikimedia.org/r/967870

Mentioned in SAL (#wikimedia-operations) [2023-10-25T10:24:17Z] <urbanecm> mwmaint2002: foreachwikiindblist /srv/mediawiki/dblists/growth-biggest.dblist extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --registeredWithin=1year --editedWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=1second --verbose --use-job-queue (T344428; with higher file limit)

Mentioned in SAL (#wikimedia-operations) [2023-10-25T10:56:40Z] <urbanecm> mwmaint2002: foreachwikiindblist /srv/mediawiki/dblists/growthexperiments.dblist extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --registeredWithin=1year --editedWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=1second --verbose --use-job-queue (T344428; all wikis, higher file limit)

Change 968636 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/deployment-charts@master] changeprop: Increase refreshUserImpactJob concurrency

https://gerrit.wikimedia.org/r/968636

Thanks @Joe for the FD limit change! All the tests I've done so far suggest that the errors tracked in the subtasks no longer happen. Let's wait for tomorrow's UTC morning to see what happens when the systemd timers fire (05:15 UTC, 07:45 UTC).

In the meantime, let's increase the job concurrency. It was limited to 3 in this patch as part of debugging this issue, and that very low concurrency is no longer serving any meaningful purpose. I've uploaded a patch to do the increase.

Assuming the errors don't reoccur when the timer kicks in, this would allow us to finally go ahead with the rest of T344143 and T336203. We would, however, still need to investigate why the previous FD limit was exhausted, fix that problem, and restore the previous limit.

Change 968636 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: Increase refreshUserImpactJob concurrency

https://gerrit.wikimedia.org/r/968636

Mentioned in SAL (#wikimedia-operations) [2023-10-26T08:49:26Z] <urbanecm> mwmaint2002: foreachwikiindblist /srv/mediawiki/dblists/growthexperiments.dblist extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --registeredWithin=1year --editedWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=1second --verbose --use-job-queue (testing T344428; after enabling backend on all Wikipedias)

Still no errors. I've increased the job concurrency to 10, enabled the new Impact backend on all Wikipedias, and triggered a manual run of the refresh script. Jobs seem to be both enqueued and processed, and Logstash reports no errors so far. Let's check again in a couple of hours.

Urbanecm_WMF renamed this task from RefreshUserImpactJob consumes too many file descriptors to refreshUserImpactJob logs mysterious fatal errors.Oct 26 2023, 10:49 AM
Urbanecm_WMF updated the task description.

Let's be optimistic and call this resolved, since the errors have disappeared. I've filed T349809: refreshUserImpactJob requires a high number of file descriptors to identify and resolve the exact cause of the FD exhaustion problem.