cfpb / cfgov-lighthouse

Home Page:https://cfpb.github.io/cfgov-lighthouse/

create-report-index fails for URLs without a trailing slash

thibaudcolas opened this issue · comments

Some of the logic in create-report-index does not support URLs without a trailing slash. I spotted this while testing another site, and it doesn’t affect any of the URLs currently being monitored on consumerfinance.gov, so feel free to discard this issue if irrelevant.

Current behavior

If you test a URL like /news/news.php?feature=7790, the Lighthouse auditing works as expected, but create-report-index will fail with:

2021-01-04T15:01:15.332Z error: 	TypeError: Cannot read property '1' of null
TypeError: Cannot read property '1' of null
    at /cfgov-lighthouse/scripts/lib/reports.js:81:25
    at Array.map (<anonymous>)
    at processManifestRuns (/cfgov-lighthouse/scripts/lib/reports.js:76:15)
    at reducer (/cfgov-lighthouse/scripts/create-reports-index.js:59:27)
    at async /cfgov-lighthouse/scripts/create-reports-index.js:68:19

(see #21 for the separate issue of error reporting).

Expected behavior

No crash for any valid URL. Again, this works as expected for all URLs currently tested in this repository; it only fails when adding new URLs that do not have a trailing slash.

Steps to replicate behavior (include URLs)

  1. Run an audit on a URL that does not have a trailing slash, for example https://www.jpl.nasa.gov/news/news.php?feature=7790. This will create a report filename of www_jpl_nasa_gov-_news_news_php-2021_01_04_14_58_51.report.json.
  2. Run create-report-index

Looking at the code, I can see the issue comes from logic that extracts the report’s slug and date from the filename:

function processManifestRuns( runs ) {
  return runs.map( run => {
    const runFilename = path.basename( run.jsonPath );
    const runDirectory = path.basename( path.dirname( run.jsonPath ) );
    // Report filenames are in the format: URL_YYYY_MM_DD_HH_MM_SS.report.json
    const details = runFilename.match( /(.+)_\-(\d\d\d\d_\d\d_\d\d)_\d\d_\d\d_\d\d\.report\.json/ );
    const slug = details[1];

The regex assumes slugs end with _-. They won’t if the URL has no trailing slash (www_jpl_nasa_gov-_news_news_php-2021_01_04_14_58_51.report.json).
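To illustrate the failure and one possible way around it, here is a minimal sketch. The helper name parseReportFilename and the relaxed pattern are my own assumptions for illustration; they are not code from the repository, and not necessarily the fix that should be adopted. The second filename in the usage note is hypothetical.

```javascript
// Pattern quoted above: requires "_-" immediately before the date.
const STRICT_PATTERN = /(.+)_\-(\d\d\d\d_\d\d_\d\d)_\d\d_\d\d_\d\d\.report\.json/;

// Hypothetical relaxed pattern: the underscore before the hyphen is optional,
// so slugs from URLs with or without a trailing slash both match.
const RELAXED_PATTERN = /^(.+?)_?-(\d{4}_\d\d_\d\d)_\d\d_\d\d_\d\d\.report\.json$/;

// Hypothetical helper: returns null instead of throwing when nothing matches.
function parseReportFilename( filename, pattern = RELAXED_PATTERN ) {
  const details = filename.match( pattern );
  if ( details === null ) {
    return null;
  }
  return { slug: details[1], date: details[2] };
}

const noTrailingSlash = 'www_jpl_nasa_gov-_news_news_php-2021_01_04_14_58_51.report.json';

// The strict pattern finds no match, which is what leads to the TypeError.
console.log( parseReportFilename( noTrailingSlash, STRICT_PATTERN ) ); // null

// The relaxed pattern extracts the slug and date.
console.log( parseReportFilename( noTrailingSlash ) );
// { slug: 'www_jpl_nasa_gov-_news_news_php', date: '2021_01_04' }
```

For a filename that does end its slug with _- (say, www_consumerfinance_gov-_about_us_-2021_01_04_14_58_51.report.json), both patterns yield the same slug, so the relaxed version should not change results for URLs with a trailing slash.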

For my case I decided to fix this by changing the report filename pattern so there is a more predictable separator. There might be other viable approaches. Here is the relevant part of my lighthouserc.js:

module.exports = {
  ci: {
    /* […] */
    upload: {
      target: 'filesystem',
      outputDir: path.join(REPORTS_ROOT, timestamp),
      reportFilenamePattern:
        '%%HOSTNAME%%-%%PATHNAME%%___%%DATETIME%%.report.%%EXTENSION%%',
    },
  },
}

You’d then need to update the corresponding logic to match that ___ separator.
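As a rough sketch of what that update could look like: the function name parseSeparatedFilename is hypothetical, and the example filename is what I would expect the pattern above to produce for the JPL URL, assuming the hostname, pathname, and datetime are sanitized the same way as in the filename shown earlier.

```javascript
// Hypothetical pattern for filenames that use "___" between slug and datetime.
const SEPARATOR_PATTERN = /^(.+?)___(\d{4}_\d\d_\d\d)_\d\d_\d\d_\d\d\.report\.json$/;

// Hypothetical helper: returns { slug, date }, or null if the name does not match.
function parseSeparatedFilename( filename ) {
  const details = filename.match( SEPARATOR_PATTERN );
  if ( details === null ) {
    return null;
  }
  return { slug: details[1], date: details[2] };
}

// Assumed output filename for the pattern in lighthouserc.js above.
const example = 'www_jpl_nasa_gov-_news_news_php___2021_01_04_14_58_51.report.json';
console.log( parseSeparatedFilename( example ) );
// { slug: 'www_jpl_nasa_gov-_news_news_php', date: '2021_01_04' }
```

Note that with this pattern a URL with a trailing slash would keep a trailing underscore in its slug, which may or may not matter for how slugs are used downstream.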

In addition to this trailing-slash issue, I think there is also a problem with the "form factor" logic for URLs that already contain a query string. As above, this isn’t an issue with URLs currently tested in the repository; I only stumbled upon it while testing another site and reviewing the code.

The code that generates the URLs for Lighthouse to test handles the query string correctly and generates the appropriate URL, since it uses the URL interface rather than processing URLs as strings. Here is the generated URL from the Lighthouse logs:

Running Lighthouse 3 time(s) on https://www.jpl.nasa.gov/news/news.php?feature=7790&mobile=1

You can see that the parameter is appended as &mobile=1, not the ?mobile=1 that the report index code expects:

url: run.url.replace( '?mobile=1', '' ),
jsonPath: `${ runDirectory }/${ runFilename }`,
formFactor: run.url.includes( '?mobile=1' ) ? 'mobile' : 'desktop',

I imagine it would work to use the URL interface here as well to remove the query parameter / check for its presence regardless of where it is in the query string.
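Something along these lines, as a minimal sketch: describeRun is a hypothetical helper name, and I am assuming the mobile runs are always marked with mobile=1, as in the log line above.

```javascript
// Hypothetical helper: derives the original URL and the form factor from a
// run URL, regardless of where "mobile=1" sits in the query string.
function describeRun( runUrl ) {
  const url = new URL( runUrl );
  const formFactor = url.searchParams.get( 'mobile' ) === '1' ? 'mobile' : 'desktop';
  // Drop the marker parameter so both form factors map to the same page URL.
  url.searchParams.delete( 'mobile' );
  return { url: url.toString(), formFactor };
}

console.log( describeRun( 'https://www.jpl.nasa.gov/news/news.php?feature=7790&mobile=1' ) );
// { url: 'https://www.jpl.nasa.gov/news/news.php?feature=7790', formFactor: 'mobile' }

console.log( describeRun( 'https://www.consumerfinance.gov/?mobile=1' ) );
// { url: 'https://www.consumerfinance.gov/', formFactor: 'mobile' }
```

One caveat: url.toString() returns the normalized form of the URL, so the result could differ slightly from a plain string replacement for URLs that are not already normalized.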

Thanks for reporting these @thibaudcolas! We really appreciate it and it's nice to see others trying to use this code. I'll give your suggestions a shot and open some PRs to address.

Lovely, let me know if further details would help.

@thibaudcolas I've opened #24 to address the first issue you mention -- would this solution work for you?

Indeed, judging from the code alone, this looks like it would work just as well, and it is much simpler!

#24 above has been merged, addressing the title of this issue.

@thibaudcolas I've opened #25 to address the second issue you discovered about query strings in tested URLs. Please give it a try if you get a chance!

Hopefully fixed by #24 and #25!