Page MenuHomePhabricator

Security review for Cognate Extension
Closed, ResolvedPublic

Description

Project Information

Description of the tool/project

An extension to provide automatic inter language links for Wiktionary projects

Description of how the tool will be used at WMF

The extension will be deployed to all Wiktionary sites

Dependencies

List dependencies, or upstream projects that this project relies on

None

Has this project been reviewed before?

please link to tasks or wiki pages of previous reviews

Only regular CR

Working test environment

please link or describe setup process for setting up a test environment

Labs test environment (which you can be given access to)

http://enwiktionary-cognate.wmflabs.org
http://dewiktionary-cognate.wmflabs.org
http://frwiktionary-cognate.wmflabs.org

The extension require a multi wiki setup.

Post-deployment

name of team responsible for tool/project after deployment and primary contact

WMDE / Wikidata / @Lydia_Pintscher

Related Objects

StatusSubtypeAssignedTask
OpenFeatureNone
OpenFeatureNone
OpenFeatureNone
OpenFeatureNone
OpenNone
ResolvedLydia_Pintscher
ResolvedLydia_Pintscher
OpenNone
ResolvedLydia_Pintscher
ResolvedLydia_Pintscher
ResolvedLydia_Pintscher
ResolvedLydia_Pintscher
ResolvedLydia_Pintscher
ResolvedLydia_Pintscher
Resolvedaude
ResolvedAddshore
ResolvedAddshore
ResolvedAddshore
ResolvedAddshore
ResolvedAddshore
ResolvedAddshore
ResolvedAddshore
Resolvedaude
ResolvedNone
OpenNone

Event Timeline

daniel subscribed.

Added T150404 as a blocked, since that should be merged before the security review. I plan to do this this week.

Bawolff subscribed.

Security review of Cognate extension (As of a187d934b8c6)

Normal issues

  • Use sha256 (or some other modern hash) instead of md5. Even in contexts where the attacks against it is not important we should be phasing out md5.

Issues of unclear severity

64-bits is not enough for collision resistance against a malicious adversary. If I have my math right (It is very possible that I mixed this up) There are 5 million (~2^22) titles on enwiktionary. 2^64/2^22 = 2^42. Based on googling, a 4 GPU system can do roughly 2^33 hashes/second. 2^42/2^33 = 2^11 seconds ~ half an hour to find a collision. So a vandal could make malicious page titles that collide with titles on enwiktionary, for the purpose of vandalism via the lang links. I'm not sure how serious an issue this is. Its clearly a lot more work for less disruption than most forms of vandalism, but it could be confusing to users who wouldn't understand what's going on.

I think it may be an acceptable risk, but users should be consulted about the possibility before the extension is enabled. The risk should also be documented.

Alternatively the risk could be prevented by using autoincrement ids (would require storing both the normalized and raw title though) or by increasing the size of the hash to 128 bits (e.g. Having two BIG INT fields and joining on both). A different alternative might be using an hmac with a secret key (that may be annoying to tool labs users).

Minor issues

  • line 68 of populateCognatePages.php Its preferred to use the MW DB abstraction layer instead of hand constructing sql where possible (instead of "page_namespace IN " . ..., do "page_namespace" => $namespaces).

Non-security issues

  • CacheUpdateJob doesn't update page_touched. More generally shouldn't this use built-in HTMLCacheUpdateJob for forward compatibility with any future changes to how caching works in MediaWiki? Also this would allow core's deduplication code to be used.
    • At one point I thought that this would be unnecessary, since the language links from this extension are generated on every (non-varnish cached) view (as opposed to stored in parser cache). However it actually is necessary since MW will give 304 not modified responses based on page_touched.
  • This assumes that all sites have same interwiki structure for language links or things will break. This is fine since its true on wiktionary, however I think this requirement should be clearly documented.
  • The indexe on cognatePages for the queries from selectLinkDetailsForPage, selectSitesForPage seem to be not quite right. I think the index should be (cpga_title, cpga_namespace, cpga_sites).
  • I assume this is not using the LinkUpdates related hooks because it only wants to deal with new pages not edits (?) However it seems that this still triggers on all edits. Perhaps it should look at the EDIT_NEW flag in the onPageSaveComplete hook to avoid some unnecessary database interaction.

Change 334031 had a related patch set uploaded (by Addshore):
populateCognatePages use mw db layer for IN query

https://gerrit.wikimedia.org/r/334031

Change 334032 had a related patch set uploaded (by Addshore):
Add note about interwiki lang link assumption to README

https://gerrit.wikimedia.org/r/334032

Change 334033 had a related patch set uploaded (by Addshore):
Adjust index on pages table

https://gerrit.wikimedia.org/r/334033

  • I assume this is not using the LinkUpdates related hooks because it only wants to deal with new pages not edits (?) However it seems that this still triggers on all edits. Perhaps it should look at the EDIT_NEW flag in the onPageSaveComplete hook to avoid some unnecessary database interaction.

Cognate needs to know when pages are edited to determine if the redirect state has changed:

  • redirect -> regular page = needs to be added to the db
  • regular page -> redirect = needs to be removed from the db

If I had both the new and the last content then I would be able to check the redirect state of each and only act where needed here.
I'll have a further look.

Change 334041 had a related patch set uploaded (by Addshore):
Switch to sha256 hash

https://gerrit.wikimedia.org/r/334041

Everything from the security review now either has a patch linked to this ticket or a subtask has been created.

In response to the issue with hash conflicts, pages that create duplicate hashes will simply not be written to the database / ignored, thus it should not be an attack vector in terms of vandalism.
I have created a sub task to confirm this, and add some logging when conflicts do happen (if they do) and add some extra testing in the area.

Change 334031 merged by jenkins-bot:
populateCognatePages use mw db layer for IN query

https://gerrit.wikimedia.org/r/334031

Change 334032 merged by jenkins-bot:
Add note about interwiki lang link assumption to README

https://gerrit.wikimedia.org/r/334032

  • I assume this is not using the LinkUpdates related hooks because it only wants to deal with new pages not edits (?) However it seems that this still triggers on all edits. Perhaps it should look at the EDIT_NEW flag in the onPageSaveComplete hook to avoid some unnecessary database interaction.

Cognate needs to know when pages are edited to determine if the redirect state has changed:

  • redirect -> regular page = needs to be added to the db
  • regular page -> redirect = needs to be removed from the db

If I had both the new and the last content then I would be able to check the redirect state of each and only act where needed here.
I'll have a further look.

Oh. This makes sense to me now. When i was first looking at it I didnt realize it needed to do that

Oh. This makes sense to me now. When i was first looking at it I didnt realize it needed to do that

The patch attached to T156239 reduces the db interaction.

I'll have a follow up comment about the hashes / conflict situation tomorrow.

Other than that I believe everything else already has patches up / merged patches.

Addshore claimed this task.

So, it looks like all subtasks are resolved and thus I am closing this!