Java/ Load big XER files: performance issue
alex-matatov opened this issue · comments
Dear Jon,
First of all, thank you SO much for your work on the MPXJ library! It is great! We recently started using it, and so far so good, except in one case. When a P6 project file is huge (for example, we have a 73 MB XER), it takes more than 5 minutes just to read it:
```java
var project = new UniversalProjectReader().read(content); // 5..6 minutes
```
The cause is `ProjectEntityContainer.getByUniqueID()`. This method accounts for about 95% of the total execution time:
```java
public T getByUniqueID(Integer id)
{
   if (m_uniqueIDMap.size() != size())
   {
      clearUniqueIDMap();
      for (T item : this)
      {
         m_uniqueIDMap.put(item.getUniqueID(), item);
      }
   }
   return m_uniqueIDMap.get(id);
}
```
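To see why this pattern hurts, here is a minimal, self-contained sketch (my own illustrative classes, not MPXJ's actual code) of a container whose lookup map is invalidated on every mutation and rebuilt in full on the next lookup. Interleaving adds and lookups for n items then does O(n²) total rebuild work:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for the pattern described above: the lookup map
// is rebuilt from scratch whenever its size no longer matches the backing
// list, so interleaving writes and lookups makes each lookup cost O(n).
class NaiveContainer
{
   private final List<int[]> list = new ArrayList<>(); // [uniqueID, payload]
   private final Map<Integer, int[]> uniqueIDMap = new HashMap<>();
   static long rebuildSteps = 0; // counts map-rebuild work, for illustration only

   void add(int uniqueID)
   {
      list.add(new int[] { uniqueID, 0 });
      uniqueIDMap.clear(); // mimics invalidateCache() on every mutation
   }

   int[] getByUniqueID(int id)
   {
      if (uniqueIDMap.size() != list.size())
      {
         uniqueIDMap.clear();
         for (int[] item : list) // full rebuild on every cache miss
         {
            uniqueIDMap.put(item[0], item);
            rebuildSteps++;
         }
      }
      return uniqueIDMap.get(id);
   }
}
```

With 100 interleaved add/lookup pairs the rebuild loop runs 1 + 2 + ... + 100 = 5050 times; with tens of thousands of tasks this dominates the load time, which matches the ~95% profile share mentioned above.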
As far as I understand, `m_uniqueIDMap` is used to cache tasks during parsing (which is fine), but it is cleared on each iteration and then refilled with all previously loaded tasks (which is not fine).
`PrimaveraReader.processTasks()`:

```java
...
// First loop: load parent tasks (WBS)
for (Row row : wbs)
{
   // An empty Task is created and added to the list of tasks (TaskContainer)
   Task task = m_project.addTask();
   ...
   // Populate the Task with data from the file. clearUniqueIDMap() is called here!
   // (AbstractFieldContainer.invalidateCache -> Task.invalidateCache)
   // That is not a big deal yet; the problem comes in the second loop.
   processFields(m_wbsFields, row, task);
   ...
}
...
// Second loop: load tasks
for (Row row : tasks)
{
   Task task;
   Integer parentTaskID = row.getInteger("wbs_id");
   // The problem is here. Even though we have already loaded N tasks,
   // m_uniqueIDMap is empty, so getTaskByUniqueID() re-populates the
   // whole map again (see above).
   Task parentTask = m_project.getTaskByUniqueID(parentTaskID);
   ...
   processFields(m_taskFields, row, task); // m_uniqueIDMap is cleared here!
}
```
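As a reader-level illustration of the cost being paid per row, the parent lookup in the second loop could instead be resolved against an index built once, in O(n). This is a hedged sketch with hypothetical names (`ParentLookupSketch`, the `Task` stand-in), not MPXJ's actual fix:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only, not MPXJ classes.
class ParentLookupSketch
{
   static class Task
   {
      final int uniqueID;
      final String name;

      Task(int uniqueID, String name)
      {
         this.uniqueID = uniqueID;
         this.name = name;
      }
   }

   // Build the unique-ID index once, before the per-row loop: O(n) total,
   // instead of a per-row lookup that may rebuild its cache each time.
   static Map<Integer, Task> indexByUniqueID(List<Task> tasks)
   {
      Map<Integer, Task> index = new HashMap<>();
      for (Task t : tasks)
      {
         index.put(t.uniqueID, t);
      }
      return index;
   }
}
```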
To fix it, I would propose these changes (of course, it is up to you how it is done):

1. `ProjectEntityContainer` (`ListWithCallbacks`) could use only `m_uniqueIDMap` instead of `m_list`. I would say it is not necessary to duplicate the data between two collections, and I don't think `invalidateCache()` would be needed in that case.
2. Change the logic around creating a `Task` (and the other project entities: `Calendar`, `Resource`, ...). Ideally:
   - Step 1: create the `Task` and fill it with data.
   - Step 2: add it to `TaskContainer`. With point #1 it would be added to `m_uniqueIDMap` (`Map<UniqueID, Task/Resource/Calendar>`); otherwise it would be added to `m_list` (current implementation) and to `m_uniqueIDMap`. Also, `processFields()` should not clear `m_uniqueIDMap`.
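A minimal sketch of proposal #1, assuming the container can be backed by a single insertion-ordered map (the class and method names here are illustrative, not MPXJ's):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch of proposal #1: keep entities in one insertion-ordered
// map keyed by unique ID, so lookups are O(1) and no cache invalidation
// or rebuild is ever needed.
class SimpleEntityContainer<T>
{
   // LinkedHashMap preserves insertion order, like m_list would.
   private final Map<Integer, T> byUniqueID = new LinkedHashMap<>();

   void add(int uniqueID, T item)
   {
      byUniqueID.put(uniqueID, item); // map maintained at insert time
   }

   T getByUniqueID(int uniqueID)
   {
      return byUniqueID.get(uniqueID); // constant time, no rebuild
   }

   int size()
   {
      return byUniqueID.size();
   }
}
```

The design point is simply that the index is maintained incrementally on every add, so there is never a moment when the map is "stale" and needs a full O(n) rebuild.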
By the way, maybe this issue is somehow related to #266.
Cheers,
Alexander
@alex-matatov thanks for opening the issue, I will take a look as soon as I can. Would you be able to email me a large sample XER file to validate any changes I make (happy to NDA if required).
I think you can reproduce it with any P6 file (both XML and XER). Basically, the problem is the cache-clear operation in `ProjectEntityContainer`.

By the way, I made a quick-and-dirty fix that avoids the cache clearing, and that XER loaded in 4 seconds instead of 5 minutes.
@alex-matatov thanks again for opening the issue, I've merged some changes to address this and improve performance. They'll be available in the next MPXJ release.
Hi Jon,
Thank you SO much!
I've just checked "master", and indeed the 73 MB XER processing time was reduced to ~10 seconds. Good job :) .
Cheers,
Alex