Discussion about the data download feature - challenges and possible route forward

Question

tlongers opened this issue 2 years ago · comments

The download of data from WhoWasInCommand has been a difficult feature to realise inside the platform. This is in part because we have very little user feedback about this feature and have had to make some quite big guesses about needs. In this issue I'm going to go over a few of the current challenges I see with the downloads feature. In outlining these, my intention is not to be critical of any of the work done to date. Rather, my intention is to understand the current situation and attempt to frame a route forward.

What are the challenges?

Performance issues with on-demand downloads

Our initial 2018 specification for the download feature was implemented, although it was not possible to include the source fields in on-demand downloads because of disk space and performance issues that could not be resolved effectively at the time.

This means that users cannot probe the evidence base underlying the downloaded data, which removes the data integrity measure into which we invest the most time.

Outputs lag behind model changes

In response to a user request, I drew up some documentation for the platform's download feature. In doing so, it became apparent that since the initial work on this feature we have not propagated the major model changes (see PRs #714 and #716) into the download files. Neither the changes to how we represent relationships nor the changes to how we represent locations are reflected in the download files. Here, for example, is the fieldset for the "Areas of Operation" download:

unit:id:admin
[...]
unit:area_ops_id
unit:area_ops_name
unit:area_ops_country
unit:area_ops_feature_type
unit:area_ops_admin_level
unit:area_ops_admin_level_1_id
unit:area_ops_admin_level_1_name
unit:area_ops_admin_level_2_id
unit:area_ops_admin_level_2_name

This form for expressing areas of operation was replaced in #716 by a new, distinct location model. The data in the download is correct. However, under the new model the fieldset for this download would be:

unit:id:admin
[...]
unit:location_type
unit:base_name
unit:location
unit:location_first_cited_date
unit:location_first_cited_date_founding
unit:location_last_cited_date
unit:location_open

In keeping with the current pattern for downloads, we would also provide a separate download of the location data in tabular and GeoJSON formats.
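One way to catch this kind of drift early is to compare a download's header row against the fieldset the current model implies. The following is a minimal sketch, not part of the platform: the function name `check_header` is hypothetical, the downloads are assumed to be CSV files, and only the fields listed in this issue are checked (the elided `[...]` fields are left out).

```python
import csv
import io

# Fieldset copied from this issue; the "[...]" elision is not filled in,
# so only the fields named here are checked.
EXPECTED_LOCATION_FIELDS = [
    "unit:location_type",
    "unit:base_name",
    "unit:location",
    "unit:location_first_cited_date",
    "unit:location_first_cited_date_founding",
    "unit:location_last_cited_date",
    "unit:location_open",
]

# Prefix shared by the fields of the older "Areas of Operation" form.
LEGACY_PREFIX = "unit:area_ops_"


def check_header(csv_text):
    """Return (missing, legacy) field names for a download's header row.

    `missing` lists expected location fields absent from the header;
    `legacy` lists header fields that still use the old area_ops form.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    missing = [f for f in EXPECTED_LOCATION_FIELDS if f not in header]
    legacy = [f for f in header if f.startswith(LEGACY_PREFIX)]
    return missing, legacy
```

A check like this could run whenever the model changes, so that a download still carrying `unit:area_ops_*` columns is flagged rather than discovered while writing documentation.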

Download outputs are still confusing

The download feature currently offers users seven different reports. As our documentation shows, these correspond to different dimensions of the dataset: just units, then units' relationships, then their operational areas and sites; then persons, and so on. We made the decision to do it this way as it appeared to us to be more user-friendly than our first, more complex offering (see #282). However, what feedback we do have indicates that the array of downloads is confusing: it is not clear to users what they are seeing in the downloads, or what they can and need to do with the data to make use of it.

Confusing UX for accessing downloads

As originally framed, the download feature was made available to the user in two places: on search results, and on specific records. This pattern is still present on the platform, as you can see in the screenshots of each scenario:

Both of the download buttons currently point to the same bulk download page. This page offers the user the choice of country and specific report, rather than either the results set or specific record advertised on the buttons that brought them to the download page. This is a bit misleading for users.

A path forward

We don't have the user data to support any particular direction for a flexible download mechanism, nor are we likely to obtain it in time to inform a decision. The route suggesting itself to us is to simplify things, as below:

Rather than draw up a download from the database, just offer the original spreadsheets from which the platform data is assembled. I took a run at something similar to this before in #355, but we can improve on it. I'd add that we should include in these downloads a topsheet with metadata about the spreadsheet, and a sheet containing a data dictionary. This means we do not have to leave out anything (such as the sources). This may be as easy as maintaining a static page manually, or making a few tweaks to the import scripts.
Greatly improve our documentation about how to use those specific worksheets to answer questions. Whilst our worksheets are complicated, it's more sensible for us to explain an imperfect system once, and well, in our documentation than to maintain documentation on both our data model and a set of download reports.
Remove the download links from search results and the record pages, and place a link to the download page prominently in the main navigation.
Improve the presentation of a printable view of specific records so that a user can at least take home a legible file for a specific record they are interested in. We touched on this before in #531, but didn't flesh it out. My own view is that good practice here is captured by gov.uk's "Easy Read" versions of public health data, which are accompanied by a PDF.

Sam McAlilly · Answer 1 · Thu May 12 2022 06:01:54 GMT+0800 (China Standard Time)

Hi, @tlongers –

It seems like we’re discussing two things: revising the self-service bulk download, and creating a more ergonomic way to download data about a single record.

Bulk download

I'm hearing you say a minority of users want to download data in bulk. In your view, does the anticipated benefit of self-service bulk downloads justify continued investment, or is the scale of downloads such that SFM could field requests and use those interactions to gather information on what questions data users are trying to answer?

For context, according to Matomo, the downloads page accounted for only 171 of 8,307 (2 percent) of page views in April. Of those, the majority were referred from organization pages, followed by people pages. There were virtually no downloads referred from search results or incident pages, and only a few navigated to the download page directly.

If we do move forward with a revised bulk download, I expect the original spreadsheets would be quite confusing to the layperson, even with expanded documentation. Do you have a sense, based on the conversations you have had, what questions bulk data users are trying to answer? This could help inform a better format for users.

Record download

I agree removing the download links and providing a way to download a PDF of a given record would be a simple way for users to download individual pieces of data.

The public health site is a good example. Thanks! One big difference is that SFM records have a lot of related data. Would it be useful to expand or reformat some of the related records in a print version, for instance showing additional information about people related to an organization, or would a PDF snapshot of the webpage with clickable links be enough?

Curious to hear your thoughts!

Tom Longley · Answer 2 · Fri May 13 2022 01:55:45 GMT+0800 (China Standard Time)

Do you have a sense, based on the conversations you have had, what questions bulk data users are trying to answer?

Yes, it's the same set of questions that motivate us in monitoring state security and defence forces: When something terrible has happened, who can be said to be responsible? Usually this relates to a specific incident (so time and place are the key filters). Also, some are curious about what they can do with data on a force as a whole, such as relating it to other datasets (like training, corrupt acts, sanctions list, politically exposed persons, materiel lists, uniforms/camo, insignia). However, in many cases it's more vague and exploratory: they want to see what we have, as a whole; or the fact of this dataset, which is very novel, is simply interesting.

There are also doctrinal reasons: autonomy and curiosity, security and assurance: do the analysis yourself, get as much data as you can when you can, for it may be a precious and fleeting resource! For investigative journalists it's a basic safeguard to always have the source of your information. By and large, this particular group will also want to wade into the data themselves, and will likely either have access to some developer time, or have some data wrangling skills (ranging from solid spreadsheeting, to some level of scripting). I would never write off a very motivated reporter and their ability to wrangle and organize data, and also ask us for help!

The dilemma we face, I think, is whether for this data model we can design a downloadable dataset that meets these broad needs and is any better than the raw sheets we feed into WhoWasInCommand. The evidence so far is that we can't. However, there is a need from our side too - for transparency. We want all our data out there in as rich a form as possible.

would a PDF snapshot of the webpage with clickable links be enough?

Good question. I think an initial PDF should be just the immediate record the person is looking at, with its substantive data and its sources organized. Clickable links to linked records would be good. The key, I think, is the information design of the download. It may have to be quite different from the webpage layout.

Tom Longley · Answer 3 · Sat May 21 2022 00:11:31 GMT+0800 (China Standard Time)

We took this into an off-issue conversation to settle the path forward. @smcalilly @hancush does this capture the conversation ok?

Development wise, this is the path SFM would like to take:

Add downloads as top-level link.
Remove entry points to download page from search results and individual record pages.
Adapt import process to stash original import sheets in a more persistent location.
Redesign download page so the user can grab a single zip (or workbook) containing the complete dataset in original format, for a specific country.
Download page will have better text along the lines of "this data is complex, but here's how to do things with it", with links to a set of guides for working with the data.
Plenty of modalities for this download page (long page, big table, dropdown, etc), which need wireframing for discussion.

Accompanying this, SFM needs to update the Research Handbook to include:

description of what you'll find in the workbook.
finding your way around the workbook, including useful filters/pivots/charts.
some advice on using common scripting tools to work with the data (when ready, perhaps, links to SFM internal tooling).
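As an illustration of the kind of scripting advice the Handbook could carry, here is a minimal sketch of the "time and place" filter described earlier in this thread. It is an assumption-laden example rather than SFM tooling: the function name `units_at_location` is hypothetical, the data is assumed to be CSV with full ISO dates (YYYY-MM-DD), and the column names are borrowed from the location fieldset listed in this issue, which may not match the original worksheets.

```python
import csv
import io
from datetime import date


def units_at_location(csv_text, place, on_date):
    """Return rows whose cited date range at `place` covers `on_date`.

    Column names are assumed from the fieldset in this issue. Rows with
    partial or missing dates are skipped in this sketch, not interpreted.
    """
    matches = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Case-insensitive substring match on the location name.
        if place.lower() not in row["unit:location"].lower():
            continue
        try:
            first = date.fromisoformat(row["unit:location_first_cited_date"])
            last = date.fromisoformat(row["unit:location_last_cited_date"])
        except ValueError:
            continue
        if first <= on_date <= last:
            matches.append(row)
    return matches
```

Note that first and last cited dates record when sources place a unit somewhere, so a match is a lead to check against the sources rather than proof of presence on that day.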

hannah cushman garland · Answer 4 · Sat May 21 2022 01:19:40 GMT+0800 (China Standard Time)

That's an excellent summary, @tlongers, thank you! We'll get going on wireframes and have something for you first half of next week.

hannah cushman garland · Answer 5 · Thu May 26 2022 06:00:03 GMT+0800 (China Standard Time)

I think it makes most sense to maintain one ZIP archive that contains all the country sheets, plus the sources sheet and a README with links to examples and relevant documentation.
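A rough sketch of how such an archive could be assembled is below. It is only an illustration under stated assumptions: the function name `build_archive`, the directory layout, and the idea that the sheets are stored as CSV files are all hypothetical, and the real import process may keep the sheets in another format or location.

```python
import zipfile
from pathlib import Path


def build_archive(sheet_dir, archive_path, readme_text):
    """Bundle every CSV sheet in sheet_dir plus a README into one ZIP.

    Returns the list of member names written, in order.
    """
    sheet_dir = Path(sheet_dir)
    names = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # Sorted so the archive contents are stable between runs.
        for sheet in sorted(sheet_dir.glob("*.csv")):
            archive.write(sheet, arcname=sheet.name)
            names.append(sheet.name)
        # The README is written from a string, so it need not exist on disk.
        archive.writestr("README.txt", readme_text)
        names.append("README.txt")
    return names
```

Building the archive as the last step of an import would keep the download in step with whatever data the platform is currently showing.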

Here's the first draft of a more thorough Download page: https://app.moqups.com/BVsRXncIlxyCWygvQcc2m8BUsSiyRqCM/view/page/a774936d3

I tried to do a couple of things:

Center the download and documentation
Offer help
Specify what's in the data and how one might use it, in digestible sections (N.b., we should draft the content in these sections for optimal discoverability. As of now, there's a lot of placeholder text and copy stolen from the About page.)
Offer help again!

@smcalilly @tlongers @tonysecurityforcemonitor Very interested to hear your thoughts!

Tom Longley · Answer 6 · Wed Jun 08 2022 19:11:56 GMT+0800 (China Standard Time)

@hancush This is great; I like the confidence it exudes! For implementation, let's divide it into two steps:

implement the new download mechanism itself at ~/downloads with a single additional link to our docs.
augment the page with the supporting text pulling out more information on specifics. My reasoning for making this a second step is that it is certainly better for us to do a more comprehensive overhaul of what information goes where between WWIC, our organizational site, and our docs.