Spike: Investigate "Improve export of electronic books" [8 hours]
Closed, Resolved (Public)

Description

As a Wikisource user, I want the team to investigate the "Improve export of electronic books" wish, so they can consider the various options and risks of this top Wikisource wish.

Background: In the 2020 Community Wishlist Survey, the #1 wish was to "Improve export of electronic books." This was also requested in the 2019 Community Wishlist Survey as the #4 wish. This has been a repeated pain point for the Wikisource community. While we have done work to improve the process, they are still experiencing issues. For this reason, we want to take the time to deeply investigate the potential options available to improve reliability and formatting for users.

Relevant Resources:

Acceptance Criteria:

  • Review the wishes from 2020 and 2019, as well as relevant Phabricator tasks (see links above)
  • Provide an analysis of potential risks associated with this project from a technical perspective
  • Provide an analysis of potential dependencies associated with this project from a technical perspective
  • Provide a recommendation for implementation of this change
  • Provide a rough estimate/sense of difficulty or effort required by this project
  • Investigate various options outlined in the All Hands brainstorms doc, which includes:
    • Upgrade the version of Calibre and see the new output
    • Pandoc
    • Move to VPS (more details in T242760)

Investigation

This wish focuses on two key aspects of the export tool: uptime/reliability and ebook formatting.

1. Reliability

We dealt with uptime a fair bit last year, and we're in the process of moving the tool to its own VPS so that it can have more resources and be less affected by Toolforge maintenance.

The other big thing we can do to improve reliability is to move to a job queue system, so that the book generation processes are handled separately from both the web frontend and each other. This is a large refactor, but I think it's one we understand reasonably well (it's similar to what we built for #EventMetrics).
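As a rough illustration of the job queue idea (not the tool's actual design), here is a minimal in-memory sketch in Python. All names (`ExportJob`, `generate_book`, `run_jobs`) are hypothetical, `generate_book` is a stand-in for the real export step, and threads stand in for what would more likely be separate worker processes backed by a database table. The point it shows is that the frontend only enqueues jobs, and a failure in one job does not affect the others.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class ExportJob:
    """One requested book export (all field names are hypothetical)."""
    title: str
    fmt: str
    status: str = "queued"  # queued -> done | failed
    error: str = ""


def generate_book(job):
    """Stand-in for the real (slow, failure-prone) book generation."""
    if not job.title:
        raise ValueError("empty title")


def worker(jobs):
    """Take jobs off the queue one at a time until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        try:
            generate_book(job)
            job.status = "done"
        except Exception as exc:
            # A failure is recorded on the job itself; the worker keeps going.
            job.status = "failed"
            job.error = str(exc)
        finally:
            jobs.task_done()


def run_jobs(job_list, n_workers=2):
    """Enqueue every job, let the workers drain the queue, return the jobs."""
    jobs = queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs,)) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for job in job_list:
        jobs.put(job)
    for _ in threads:
        jobs.put(None)  # one shutdown sentinel per worker
    jobs.join()
    for thread in threads:
        thread.join()
    return job_list
```

In a real deployment the web frontend would only create the job record and return straight away, then report the job's `status` when the user checks back.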

Possible actions:

2. Formatting

A wide range of formatting errors have been reported, such as:

  • Missing text at end of page or beginning of page (in plain text or in table) T244825
  • Duplication of text at end of page or beginning of page
  • Table titles don't appear
  • Table alignment in a page (centered) not respected
  • Text alignment in table cell not respected
  • Style in table not respected in MOBI format

There are four main places where we're getting formatting errors in ebooks:

  1. The original HTML of the wiki, from things such as misnested tags or incorrect CSS in templates etc.
  2. How we process the wiki HTML into epub XHTML.
  3. The secondary output formats such as PDF, introduced by Calibre's internal conversion.
  4. Ereader rendering of epubs.

The first two are the only ones we can do much about.
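To make the first category concrete, the following is a small sketch of how misnested tags in wiki HTML could be detected before conversion. It uses Python's standard `html.parser`; the class and function names are hypothetical and this is not how the tool currently works. It deliberately treats HTML's optional end tags (such as an unclosed `<p>`) as problems, on the assumption that the output has to become well-formed XHTML anyway.

```python
from html.parser import HTMLParser

# Elements that never take an end tag in HTML.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class NestingChecker(HTMLParser):
    """Track open tags on a stack and record end tags that do not match."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.problems = []  # (line, closing tag, tag that was expected)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
            return
        line, _column = self.getpos()
        expected = self.stack[-1] if self.stack else None
        self.problems.append((line, tag, expected))
        # Recover: if the tag is open further down, close everything above it.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def find_misnested(html):
    """Return a list of (line, closing tag, expected tag) mismatches."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.problems
```

For example, `find_misnested("<p><b><i>text</b></i></p>")` reports two mismatches, one for `</b>` and one for `</i>`, while a correctly nested fragment yields an empty list.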

Possible actions:

  • T244837: Upgrade Calibre on wsexport VPSs. This has been done as part of the move to VPS: Toolforge has Calibre 2.75.1 and the VPS has 3.39.1. The latest Calibre is 4.10.1, so we should still upgrade further.
  • Come up with simple example pages that demonstrate each of the formatting issues.
  • Fix easy errors in templates that are widely used.
  • Add direct display of epubcheck output (GH #190) and fix prevalent issues (such as T244694, T244448)
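One way to find the "prevalent issues" in epubcheck output is to group its messages by message code. The sketch below assumes epubcheck's plain-text messages have the shape `SEVERITY(CODE): path(line,col): message`; that shape, the function names, and the sample report used to try it out are assumptions for illustration, not taken from the tickets above.

```python
import re
from collections import Counter

# Assumed message shape: SEVERITY(CODE): path(line,col): message
MESSAGE_RE = re.compile(
    r"^(?P<severity>[A-Z]+)\((?P<code>[A-Z]+-\d+)\):\s*"
    r"(?P<location>.*?\(-?\d+,-?\d+\)):\s*(?P<message>.*)$"
)


def parse_report(text):
    """Return one dict per recognised message line; other lines are skipped."""
    messages = []
    for line in text.splitlines():
        match = MESSAGE_RE.match(line.strip())
        if match:
            messages.append(match.groupdict())
    return messages


def most_common_codes(messages, n=3):
    """Return the n most frequent message codes with their counts."""
    return Counter(m["code"] for m in messages).most_common(n)


# Made-up sample report, only to exercise the parser.
SAMPLE_REPORT = """\
ERROR(RSC-005): book.epub/OPS/c1.xhtml(12,8): attribute "align" not allowed here
ERROR(RSC-005): book.epub/OPS/c2.xhtml(40,3): attribute "align" not allowed here
WARNING(OPF-003): book.epub(-1,-1): item "x" exists in the EPUB, but is not declared
Check finished with errors
"""
```

Running `most_common_codes(parse_report(report))` over reports from many books would give a ranked list of codes, which could help decide which template or conversion fixes to do first.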

Misc.

  • Switching to Pandoc for epub generation: this looks likely to not give us enough control over the epub contents (crucially the ToC, but also other metadata).
  • Switching to Pandoc for converting epubs to other formats: for example, compare a PDF converted from an epub by Calibre (F31607711) with one converted by Pandoc (F31607712). Calibre is better in almost all ways, but books with lots of tables, for example, might fare better with Pandoc (e.g. F31607741). Pandoc would also give us lots of other output formats that aren't supported by Calibre. Perhaps we could just add a parameter so that people could choose which conversion system they want? But this is not something that we should worry about too much.
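The "parameter to choose the conversion system" idea could look roughly like the sketch below. The function name and the `converter` parameter are hypothetical. The two command shapes are the basic invocations of Calibre's `ebook-convert` (input file, then output file) and of `pandoc` (input file, then `-o` and the output file); any extra options the tool would need are left out. The function only builds the argument list and does not run anything.

```python
from pathlib import Path


def build_convert_command(epub_path, output_format, converter="calibre"):
    """Build the argument list for converting an epub with the chosen tool."""
    source = Path(epub_path)
    target = source.with_suffix("." + output_format)
    if converter == "calibre":
        # Calibre infers both formats from the file extensions.
        return ["ebook-convert", str(source), str(target)]
    if converter == "pandoc":
        # Pandoc infers the output format from the -o file extension.
        return ["pandoc", str(source), "-o", str(target)]
    raise ValueError(f"unknown converter: {converter!r}")
```

The resulting list could be passed to `subprocess.run`, with Calibre as the default and Pandoc used only when a user explicitly asks for it.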

Event Timeline

ifried renamed this task from Spike: Investigate "Improve export of electronic books" [placeholder] to Spike: Investigate "Improve export of electronic books". Feb 3 2020, 11:54 PM
ifried updated the task description.
ifried renamed this task from Spike: Investigate "Improve export of electronic books" to Spike: Investigate "Improve export of electronic books" [8 hours]. Feb 5 2020, 12:41 AM
ifried moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.
Samwilson added a project: WS Export.
Samwilson updated the task description.

This is ready for review... otherwise I'm just going to keep going further down the rabbit hole while investigating bits of it! I think the main outcomes are (in rough priority order):

The job queue is probably the biggest code change, and in the previous survey it was suggested that the whole tool needs a big rewrite. I'm not sure if we've got the resources for that (and anyway it'd just result in more bugs, like rewrites always do!) but we certainly could look at splitting out some parts into separate libraries (e.g.), and using more modern components in the code (e.g. switching to symfony/console for the CLI wouldn't be hard, and might be a nice precursor to adding the job runner). We've already got a database layer, which is now using MySQL, so if we extend that any further we will probably want to use a standard system for that too.

That said, it really looks like most issues with wsexport are around formatting, and lots of the fixes I've found for those are on-wiki, in templates and elsewhere. Some are worth fixing; some are just part of an infinitely long tail that's always going to be there.

Does anyone have any more thoughts on this? Should I go ahead and create tickets for any of the above?

This is good stuff. One thought I have is that maybe we separate this work into two categories: 1) the formatting issues and 2) the reliability and scalability issues.

Then, we can coordinate tasks underneath each category and prioritize the work appropriately once we see the other constraints we'll have.

Okay, I've added a bit more detail above and created some more tickets. Moving to product sign-off as there's nothing to review or QA here. The various tickets can be dealt with separately; some may be unnecessary or invalid, but might help with ongoing discussions.

I think there are going to be more formatting issues, but they're hard to figure out (lots of them are on-wiki or specific to particular devices).

ifried moved this task from Product sign-off to Done on the Community-Tech (Kanban-2019-20-Q4) board.

Thank you for creating the tickets & conducting the investigation, @Samwilson! As we have now launched the project page for this work, and separate tickets have been created, I'm marking this investigation as Done.