Slow performance and server errors from searching resources for an identifier
kevdoan opened this issue · comments
Describe the bug
We have a workflow set up where we first search for a resource based on an identifier:
If it exists, we update the resource.
If it doesn't, we create it.
However, doing a GET to resources with an identifier like this: /fhir-server/api/v4/Coverage?identifier=29754927
, we see slow performance and eventually get 500 server errors.
failed to execute HTTP Get. Status Code: 500. Reason: Internal Server Error. Response: {"resourceType":"OperationOutcome","id":"a-a-2c-bc-d7b1421c-8328-41f5-a052-5056f5fcdb54","issue":[{"severity":"fatal","code":"exception","details":{"text":"FHIRPersistenceDBConnectException: Failure acquiring connection for datasource"}}]}
Upon investigation, and looking at pg_stat_activity, we identified this query as the bottleneck:
utility_fhir=# explain analyze SELECT R.RESOURCE_ID, R.LOGICAL_RESOURCE_ID, R.VERSION_ID, R.LAST_UPDATED, R.IS_DELETED, R.DATA, LR.LOGICAL_ID
FROM (
SELECT LR0.LOGICAL_RESOURCE_ID, LR0.LOGICAL_ID, LR0.CURRENT_RESOURCE_ID
FROM fhirdata.Coverage_LOGICAL_RESOURCES AS LR0
WHERE LR0.IS_DELETED = 'N'
AND EXISTS (
SELECT 1
FROM fhirdata.Coverage_LOGICAL_RESOURCES AS LR1
INNER JOIN fhirdata.Coverage_RESOURCE_TOKEN_REFS AS P2 ON P2.LOGICAL_RESOURCE_ID = LR1.LOGICAL_RESOURCE_ID
AND P2.PARAMETER_NAME_ID = 20000
AND ((P2.COMMON_TOKEN_VALUE_ID IN (46163125,46163127,46163128)))
WHERE LR1.LOGICAL_RESOURCE_ID = LR0.LOGICAL_RESOURCE_ID)) AS LR
INNER JOIN fhirdata.Coverage_RESOURCES AS R ON LR.CURRENT_RESOURCE_ID = R.RESOURCE_ID
ORDER BY LR.LOGICAL_RESOURCE_ID
LIMIT 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=3313.12..8745.36 rows=10 width=880) (actual time=19821.291..19821.300 rows=1 loops=1)
-> Nested Loop (cost=3313.12..279814.02 rows=509 width=880) (actual time=19821.289..19821.297 rows=1 loops=1)
-> Merge Semi Join (cost=3312.69..276930.53 rows=509 width=65) (actual time=19819.612..19819.619 rows=1 loops=1)
Merge Cond: (lr0.logical_resource_id = p2.logical_resource_id)
-> Index Scan using coverage_logical_resources_pk on coverage_logical_resources lr0 (cost=0.43..266459.74 rows=4985905 width=65) (actual time=0.461..19284.621 rows=4589654 loops=1)
Filter: (is_deleted = 'N'::bpchar)
-> Sort (cost=3312.26..3313.53 rows=509 width=16) (actual time=4.096..4.099 rows=2 loops=1)
Sort Key: lr1.logical_resource_id
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=1.00..3289.37 rows=509 width=16) (actual time=4.066..4.081 rows=2 loops=1)
-> Index Only Scan using idx_coverage_resource_token_refs_tplr on coverage_resource_token_refs p2 (cost=0.57..580.32 rows=509 width=8) (actual time=2.321..2.329 rows=2 loops=1)
Index Cond: ((common_token_value_id = ANY ('{46163125,46163127,46163128}'::bigint[])) AND (parameter_name_id = 20000))
Heap Fetches: 2
-> Index Only Scan using coverage_logical_resources_pk on coverage_logical_resources lr1 (cost=0.43..5.32 rows=1 width=8) (actual time=0.872..0.872 rows=1 loops=2)
Index Cond: (logical_resource_id = p2.logical_resource_id)
Heap Fetches: 2
-> Index Scan using coverage_resources_prf_in1 on coverage_resources r (cost=0.43..5.67 rows=1 width=823) (actual time=1.669..1.669 rows=1 loops=1)
Index Cond: (resource_id = lr0.current_resource_id)
Planning Time: 40.273 ms
Execution Time: 19821.522 ms
(20 rows)
Environment
4.10.2
To Reproduce
Steps to reproduce the behavior:
- Load about 5million coverage resources into the database.
- Do a search for a specific identifier, and you will notice slow response times (2-3seconds for a single GET, 20+ seconds when doing multiple concurrent GETs).
Expected behavior
We're currently using a workaround, which is to supply a system to the identifier search to be more specific, and is much faster (2-3ms instead of 2second response for a single GET) : /fhir-server/api/v4/Coverage?identifier=https://www.abchealthplan.com/fhir/coverageid|29754927
"https://www.abchealthplan.com/fhir/coverageid" is a static value and works for us for now, but in the future it will be dynamic and we won't always know what it is.
So, we expect the workaround and the original query to have similar performance. We noticed that the SQL query is doing a subselect which grabs all ~5million of our records from the table, since nothing is deleted:
SELECT LR0.LOGICAL_RESOURCE_ID, LR0.LOGICAL_ID, LR0.CURRENT_RESOURCE_ID
FROM fhirdata.Coverage_LOGICAL_RESOURCES AS LR0
WHERE LR0.IS_DELETED = 'N'
Your workaround matches what we mention in our performance guide https://linuxforhealth.github.io/FHIR/guides/FHIRPerformanceGuide#65-search-examples
We think the postgres optimizer is chosing a poor plan in this case.
Disabling merge join on your postgresql instance might help, but could have impacts on other queries.