[tracking] Community feedback for the WDQS Split the Graph project
Closed, Invalid, Public

Description

This ticket is created as a tracking task for the WDQS Split the Graph project. Specific issues with specific queries should be reported as subtasks of this task; general discussion of the graph split can happen in comments to this task.

Event Timeline

Gehel triaged this task as High priority. Feb 6 2024, 2:52 PM
Gehel moved this task from Incoming to Current work on the Wikidata-Query-Service board.
Gehel moved this task from Incoming to Blocked/Waiting on the Discovery-Search (Current work) board.

Just here to say that I have positive feedback from querying for "collaboration platforms by creation date, with images".

Prod: https://w.wiki/94fS

Full: https://w.wiki/96$n

Main: https://w.wiki/96$q (a bit faster than prod, I see)


I also tried a highly customized query: "software licenses, grouped by logical ones, by software count, with approved by OSI/FSF". URL shortening fails for this one, so you have to copy-paste:

# Software licenses
# Taking into consideration license sub-editions and counting software using them.
# Taking into consideration direct usage of that license.
# Taking into consideration FSF and OSI approval.
# Author: [[User:Valerio Bozzolan]] and contributors
# Date: 2023
# License: CC 0, public domain
# https://phabricator.wikimedia.org/P52339
# https://www.wikidata.org/wiki/User:Valerio_Bozzolan

SELECT 
  ?count_broad_software
  ?count_exact_software
  ?license
  ?licenseLabel
  ?min_license_date
  ?approved_fsf
  ?approved_osi
WHERE
{

  # START SUB-QUERY: NO-LABEL
  {
    SELECT 
      ?license
      (COUNT (DISTINCT ?broad_software) AS ?count_broad_software)
      (SAMPLE(?count_exact_software)    AS ?count_exact_software)
      (SAMPLE(?min_license_date)        AS ?min_license_date)
    WHERE 
    {

      # START SUB-QUERY: SOFTWARE COUNTER
      {
        SELECT
          ?license
          (COUNT(DISTINCT ?software)  AS ?count_exact_software)
          (MIN   (?license_date) AS ?min_license_date)
        WHERE
        {

          # START SUB-QUERY: LICENSE
          {
            SELECT ?license WHERE {
              # This is a license.
              ?license wdt:P31/wdt:P279* wd:Q207621.
              
              # The license must not be confused with a software (it happens).
              MINUS {
                ?license wdt:P31/wdt:P279* wd:Q7397.
              }
            } GROUP BY ?license
          }
          # STOP SUB-QUERY: LICENSE

          # License must be used by software.
          ?software wdt:P275 ?license.
          wd:Q7397 ^wdt:P279*/^wdt:P31 ?software.

          # The license may have a publication date.
          OPTIONAL {
            ?license wdt:P577 ?license_date.
          }
          
        } GROUP BY ?license
      }
      # STOP SUB-QUERY: SOFTWARE COUNTER

      # License may have editions.
      # Software may use this license edition.
      OPTIONAL {
        ?child_license wdt:P629*/wdt:P279* ?license.
        ?broad_software wdt:P275 ?child_license.
        wd:Q7397 ^wdt:P279*/^wdt:P31 ?broad_software.
      }
    } GROUP BY ?license
  }
  # STOP SUB-QUERY: NO-LABEL  

  # The license may be approved by OSI / FSF.
  BIND (EXISTS{?license wdt:P790 wd:Q48413. } AS ?exists_fsf )
  BIND (EXISTS{?license wdt:P790 wd:Q845918.} AS ?exists_osi )
  BIND (IF(?exists_fsf, "✅ FSF", "❌ FSF")   AS ?approved_fsf)
  BIND (IF(?exists_osi, "✅ OSI", "❌ OSI")   AS ?approved_osi)

  # Helps get the label in your language, if not, then en language
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY DESC(?count_broad_software)

(Spoiler: GNU GPL wins)

This one is also a bit faster on the "main" endpoint. So, thanks for this promising work.

Hi - how does the federation work? I'm experimenting with this by trying to get the list of names of authors on a scholarly article - the article data itself is in the scholarly article subgraph, but the human items for the authors are in the main one. So I need to do a federated query but it's not clear how? Can you provide an example? Do I start on the main graph and federate to the scholarly one, or vice versa?

Ok, I got federation to work - sort of. From the main query service I can query the scholarly subgraph - but if I try to use the resulting values I always get a timeout.

SELECT ?author WHERE {
  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    wd:Q56977964 wdt:P50 ?author .
  }
}

works fine, but even

SELECT ?author ?b ?c WHERE {
  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    wd:Q56977964 wdt:P50 ?author .
  }
  ?author ?b ?c .
} LIMIT 1

times out. What's going on here???
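
One possible factor, offered as an assumption rather than a diagnosis: if the query optimizer evaluates the unconstrained `?author ?b ?c` pattern before the SERVICE call, it has to scan far more triples than the single author lookup needs. Blazegraph's `hint:Query hint:optimizer "None"` hint makes the engine evaluate patterns in the order written, so a sketch worth trying is shown here. The endpoint URL and item ID are copied from the queries above; whether this avoids the timeout on the experimental endpoints has not been verified here.

```sparql
SELECT ?author ?b ?c WHERE {
  # Disable join reordering so the patterns run top to bottom.
  hint:Query hint:optimizer "None" .

  # Step 1: fetch the authors from the scholarly subgraph.
  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    wd:Q56977964 wdt:P50 ?author .
  }

  # Step 2: look up local statements only for the authors bound above.
  ?author ?b ?c .
} LIMIT 1
```

If the timeout persists with the optimizer disabled, join order is probably not the cause.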

I tried to get the federation working, but got timeouts too. The problem is that the current setup splits at the statement level. That is, given statements with some property (e.g. P2860 and P1433), some results are in one QS instance and some are in the other. That means a lot of federation-union combinations to get all results. I posted an example query that is affected (the first I tried) in this issue report: https://github.com/WDscholia/scholia/issues/2423

success criteria

I have tried to understand the graph split "experiment", but I don't understand the success criteria. My recommendation would be to work out the success criteria in more detail before starting the user feedback.

relation to movement strategy

In addition, I don't understand how this activity supports the Wikimedia Movement Strategy. Making it more difficult to write SPARQL queries does not seem very inclusive to me.

alternatives

I wonder if Blazegraph (in the current configuration) is still the best solution. Coincidentally, I recently saw a talk about another large graph, one covering all software source code, with 34 billion nodes. The approach there was to rewrite the software, which was written in Java, in Rust. I imagine rewriting Blazegraph in Rust might give a similar (one-time) performance gain and might make the split unnecessary.

Another alternative is to translate SPARQL queries to PHP code and execute it on the MediaWiki runners. Maybe some MariaDB graph query extension could also be helpful. While implementing the SPARQL endpoint would take some effort, it would eliminate the effort of syncing the data from MariaDB to Blazegraph.

I tried to get the federation working, but got timeouts too. The problem is that the current setup splits at the statement level. That is, given statements with some property (e.g. P2860 and P1433), some results are in one QS instance and some are in the other. That means a lot of federation-union combinations to get all results. I posted an example query that is affected (the first I tried) in this issue report: https://github.com/WDscholia/scholia/issues/2423

I got this query rewritten at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split/Federated_Queries_Examples#Number_of_articles_with_CiTO-annotated_citations_by_year. I agree that, given the current split strategy, we have to UNION the main and scholarly articles graphs most of the time.
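
The UNION pattern described above can be sketched roughly as follows, run from the main endpoint. This is a minimal illustration, not the rewritten Scholia query: the property P2860 is one of those named earlier in this thread, the item Q56977964 and the SERVICE URL are reused from the earlier federation test, and the variable name ?citing is made up for the example.

```sparql
SELECT DISTINCT ?citing WHERE {
  {
    # Matching statements that live in the local (main) graph.
    ?citing wdt:P2860 wd:Q56977964 .
  }
  UNION
  {
    # The same pattern, evaluated against the scholarly subgraph.
    SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
      ?citing wdt:P2860 wd:Q56977964 .
    }
  }
}
```

Because statements for one property can sit on either side of the split, every such triple pattern in a larger query needs its own local and federated branch, which is why the number of combinations grows quickly.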

@Physikerwelt thanks for your feedback.

Blazegraph is definitely not the best solution, and the work to move off Blazegraph should be tracked under https://phabricator.wikimedia.org/T330525 (see the initial exploration we have done). The solutions you suggest might be better discussed in their own tickets as subtasks of T335067.
This particular ticket is about collecting feedback regarding use cases that might be affected by the split. This split is one of the solutions we want to experiment with to address the scalability issues of WDQS. We are aware of the usability issues that you raise, but at this point we are more focused on understanding the feasibility and limitations of federation with such a split. It is worth noting that one goal is to ensure that use cases not relying on the scientific articles still work without federation.

Thank you for your response.

This particular ticket is about collecting feedback regarding use cases that might be affected by the split. This split is one of the solutions we want to experiment with to address the scalability issues of WDQS. We are aware of the usability issues that you raise, but at this point we are more focused on understanding the feasibility and limitations of federation with such a split. It is worth noting that one goal is to ensure that use cases not relying on the scientific articles still work without federation.

As a scientist, I find it hard to understand how feedback can be collected without properly defined success criteria. I am also a bit concerned about discussing the strategy on a technical level, where you cannot just buy a bigger machine to mitigate the problem until a real solution is found. To me, it seems that WDQS is a non-essential service for wikipedia.org, so migrating to something new could be done with service interruptions.

On a less detailed level, it seems very hard to understand why citations should be split off when [citation needed] became a catchphrase for the Wikimedia movement at large. Overall this experiment seems to be a waste of donation money to me.

One reason is that citations are a large corpus with a fairly narrow range of schemas and uses, so it could conceivably be implemented with optimizations that can't be applied across the board. But I do think that after a split into a separate citations wikibase, we should then formally evaluate the split and consider whether re-merging after a migration to a new graph db makes sense [I would hope it does -- 1B entities is not too much for a single db in other communities and contexts]

The consultation phase is over, summary in https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/June_2024_scaling_update. We will have more conversations during the transition, but this will be handled outside of this task. Exact communication channels need to be decided and communicated, but as always it will most probably be a mix of individual phab tasks for specific issues and talk pages.

Physikerwelt changed the task status from Resolved to Declined. Fri, Jun 28, 9:02 AM

It seems that the community feedback was ignored rather than taken into account. Thus I think declined is a better status for this task.

Sannita changed the task status from Declined to Resolved. Fri, Jun 28, 9:07 AM

Community feedback was considered and taken into account, not ignored in the slightest. Please, do not change the status of the task without proper consensus.

Could you please elaborate on how consensus was achieved to resolve this task, so that I can learn how to reach consensus for changing the status to declined? I could not find this information here.

I think it is totally okay to be honest and say that WMF doesn't have the resources to provide one SPARQL endpoint for all of Wikidata. Everyone can understand that, especially since this endpoint is not essential for Wikipedia.

I can understand that for every idea there is someone who says this is a waste of resources, but I think it would be better to be transparent about that.

The report linked a week ago gives a summary of the decisions taken, but not a summary of the community feedback. One thing that could be clarified in that report is which community feedback it does and does not summarize. For example, I have the impression it only refers to feedback from that period and excludes earlier feedback (e.g. mine, that Scholia will stop working). It is also unclear whether feedback that was given within the period but via other channels was included (for example the feedback by Lane). These things are not clear to me, and I therefore understand that what "resolved" means is also unclear (it is to me). So, even if "resolved" just means "task completed", it is not fully clear to me whether it really was completed or just ended. In the latter case, "resolved" does not quite capture that for me either. Similarly, "was considered" and "not ignored" can mean a lot of things and say little about what was done with the feedback. Some of my concerns are not addressed, and the conclusion for me is then: "yeah, sorry, we know Scholia will break, but it is necessary". Really, believe me, I can live with that. But let's be clear about that. Only then can we plan action.

I think the discussion up to now was very technical and detailed.

From a movement strategy point of view, I feel that the option to hand over the SPARQL endpoint from WMF to a partner with the resources to focus on that job has not been discussed extensively. I understand that the alternatives are discussed in subtasks of T335067, in particular T330525, but this disregards the option to discontinue an in-house SPARQL endpoint. Also, the overlap between the people who would be involved in a strategic discussion and those who use Phabricator is probably limited. Gathering community feedback in the way it was done might contribute to a misleading picture for the management that oversees different needs within the foundation.

"Resolved" means that the task has been completed, in this case. "Declined" would mean that somebody reported something, and that the developer team declined to act on it, or that a task has been merged into another one.

I'm sorry that the measures we have taken to ensure that your feedback was integrated into our current plan are deemed insufficient, but we did our best to take your feedback into proper consideration, while also providing a robust plan for the months of work that we have ahead. Over these months, we held painstakingly long discussions to evaluate all possible solutions, and we had to come to terms with the fact that some of them were not realistically feasible and had to be excluded from discussion.

We did have months of discussion though, on several platforms, on wiki and in person, also with people from Scholia, to allow as many people as possible to have their say, so I would say this period of feedback is closed and can be marked as "resolved". This doesn't mean that we will stop listening to incoming feedback, or that we will not try our best to integrate that feedback into our plan. It means only that the current phase of feedback is over.

Physikerwelt changed the task status from Resolved to Invalid. Wed, Jul 3, 6:36 PM

I am changing the status to invalid, as the task was worked on (for months) but not completed in a measurable way. From previous discussions with @Aklapper I understand that investing effort alone doesn't justify resolving a task. In that sense, I am now getting the impression that this is more a discussion thread than a task with a measurable outcome.

When reading the task description

This ticket is created as a tracking task for the WDQS Split the Graph project. Specific issues with specific queries should be reported as subtasks of this task; general discussion of the graph split can happen in comments to this task.

it still seems inappropriate to me to mark this as resolved.

I hope this status can reach consensus; otherwise, please feel free to reopen the issue.

Depends...but I don't know what the scope of this task is/was as there's no verb in the title (collect? implement?) and what to achieve by having this very ticket...
In general, I'd personally say that tasks are not meant for general discussion of or collecting catch-all feedback on an entire project - that's what talk pages are for.