Slow performance and server errors from searching resources for an identifier

Question

Slow performance and server errors from searching resources for an identifier

kevdoan opened this issue 2 years ago · comments

Describe the bug
We have a workflow set up where we first search for a resource based on an identifier:
If it exists, we update the resource.
If it doesn't, we create it.

However, doing a GET to resources with an identifier like this: /fhir-server/api/v4/Coverage?identifier=29754927, we see slow performance and eventually get 500 server errors.

failed to execute HTTP Get. Status Code: 500. Reason: Internal Server Error. Response: {"resourceType":"OperationOutcome","id":"a-a-2c-bc-d7b1421c-8328-41f5-a052-5056f5fcdb54","issue":[{"severity":"fatal","code":"exception","details":{"text":"FHIRPersistenceDBConnectException: Failure acquiring connection for datasource"}}]}

Upon investigation, and looking at pg_stat_activity, we identified this query as the bottleneck:

utility_fhir=# explain analyze SELECT R.RESOURCE_ID, R.LOGICAL_RESOURCE_ID, R.VERSION_ID, R.LAST_UPDATED, R.IS_DELETED, R.DATA, LR.LOGICAL_ID 
         FROM (                                                                                                       
       SELECT LR0.LOGICAL_RESOURCE_ID, LR0.LOGICAL_ID, LR0.CURRENT_RESOURCE_ID                                        
         FROM fhirdata.Coverage_LOGICAL_RESOURCES AS LR0                                                                       
        WHERE LR0.IS_DELETED = 'N'                                                                                    
          AND EXISTS (                                                                                                
       SELECT 1                                                                                                       
         FROM fhirdata.Coverage_LOGICAL_RESOURCES AS LR1                                                                       
   INNER JOIN fhirdata.Coverage_RESOURCE_TOKEN_REFS AS P2 ON P2.LOGICAL_RESOURCE_ID = LR1.LOGICAL_RESOURCE_ID                  
          AND P2.PARAMETER_NAME_ID = 20000                                                                            
          AND ((P2.COMMON_TOKEN_VALUE_ID IN (46163125,46163127,46163128)))                                            
        WHERE LR1.LOGICAL_RESOURCE_ID = LR0.LOGICAL_RESOURCE_ID)) AS LR                                               
   INNER JOIN fhirdata.Coverage_RESOURCES AS R ON LR.CURRENT_RESOURCE_ID = R.RESOURCE_ID                                       
     ORDER BY LR.LOGICAL_RESOURCE_ID                                                                                  
  LIMIT 10;
                                                                                                  QUERY PLAN                                                                                                  
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3313.12..8745.36 rows=10 width=880) (actual time=19821.291..19821.300 rows=1 loops=1)
   ->  Nested Loop  (cost=3313.12..279814.02 rows=509 width=880) (actual time=19821.289..19821.297 rows=1 loops=1)
         ->  Merge Semi Join  (cost=3312.69..276930.53 rows=509 width=65) (actual time=19819.612..19819.619 rows=1 loops=1)
               Merge Cond: (lr0.logical_resource_id = p2.logical_resource_id)
               ->  Index Scan using coverage_logical_resources_pk on coverage_logical_resources lr0  (cost=0.43..266459.74 rows=4985905 width=65) (actual time=0.461..19284.621 rows=4589654 loops=1)
                     Filter: (is_deleted = 'N'::bpchar)
               ->  Sort  (cost=3312.26..3313.53 rows=509 width=16) (actual time=4.096..4.099 rows=2 loops=1)
                     Sort Key: lr1.logical_resource_id
                     Sort Method: quicksort  Memory: 25kB
                     ->  Nested Loop  (cost=1.00..3289.37 rows=509 width=16) (actual time=4.066..4.081 rows=2 loops=1)
                           ->  Index Only Scan using idx_coverage_resource_token_refs_tplr on coverage_resource_token_refs p2  (cost=0.57..580.32 rows=509 width=8) (actual time=2.321..2.329 rows=2 loops=1)
                                 Index Cond: ((common_token_value_id = ANY ('{46163125,46163127,46163128}'::bigint[])) AND (parameter_name_id = 20000))
                                 Heap Fetches: 2
                           ->  Index Only Scan using coverage_logical_resources_pk on coverage_logical_resources lr1  (cost=0.43..5.32 rows=1 width=8) (actual time=0.872..0.872 rows=1 loops=2)
                                 Index Cond: (logical_resource_id = p2.logical_resource_id)
                                 Heap Fetches: 2
         ->  Index Scan using coverage_resources_prf_in1 on coverage_resources r  (cost=0.43..5.67 rows=1 width=823) (actual time=1.669..1.669 rows=1 loops=1)
               Index Cond: (resource_id = lr0.current_resource_id)
 Planning Time: 40.273 ms
 Execution Time: 19821.522 ms
(20 rows)

Environment
4.10.2

To Reproduce
Steps to reproduce the behavior:

Load about 5million coverage resources into the database.
Do a search for a specific identifier, and you will notice slow response times (2-3seconds for a single GET, 20+ seconds when doing multiple concurrent GETs).

Expected behavior
We're currently using a workaround, which is to supply a system to the identifier search to be more specific, and is much faster (2-3ms instead of 2second response for a single GET) : /fhir-server/api/v4/Coverage?identifier=https://www.abchealthplan.com/fhir/coverageid|29754927

"https://www.abchealthplan.com/fhir/coverageid" is a static value and works for us for now, but in the future it will be dynamic and we won't always know what it is.

So, we expect the workaround and the original query to have similar performance. We noticed that the SQL query is doing a subselect which grabs all ~5million of our records from the table, since nothing is deleted:

SELECT LR0.LOGICAL_RESOURCE_ID, LR0.LOGICAL_ID, LR0.CURRENT_RESOURCE_ID                                        
FROM fhirdata.Coverage_LOGICAL_RESOURCES AS LR0                                                                       
WHERE LR0.IS_DELETED = 'N'

Lee Surprenant · Answer 1 · Tue Jan 24 2023 21:40:51 GMT+0800 (China Standard Time)

Your workaround matches what we mention in our performance guide https://linuxforhealth.github.io/FHIR/guides/FHIRPerformanceGuide#65-search-examples

We think the postgres optimizer is chosing a poor plan in this case.
Disabling merge join on your postgresql instance might help, but could have impacts on other queries.