Java/ Load big XER files: performance issue
alex-matatov opened this issue · comments
Dear Jon,
First of all, thank you SO much for your work on the MPXJ library! It is great! We recently started using it, and so far so good, except in one case. When a P6 project file is huge (for example, we have a 73 MB XER), it takes more than 5 minutes just to read it:
```java
var project = new UniversalProjectReader().read(content); // 5..6 minutes
```
The cause is `ProjectEntityContainer.getByUniqueID()`. This method accounts for about 95% of the total execution time:
```java
public T getByUniqueID(Integer id)
{
   if (m_uniqueIDMap.size() != size())
   {
      clearUniqueIDMap();
      for (T item : this)
      {
         m_uniqueIDMap.put(item.getUniqueID(), item);
      }
   }
   return m_uniqueIDMap.get(id);
}
```
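To see why this pattern hurts, here is a minimal, self-contained sketch (my own illustrative classes, not MPXJ's actual code) of a container whose lookup map is invalidated on every mutation and rebuilt in full on the next lookup. Interleaving adds and lookups for n items then does O(n²) total rebuild work:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for the pattern described above: the lookup map
// is rebuilt from scratch whenever its size no longer matches the backing
// list, so interleaving writes and lookups makes each lookup cost O(n).
class NaiveContainer
{
   private final List<int[]> list = new ArrayList<>(); // [uniqueID, payload]
   private final Map<Integer, int[]> uniqueIDMap = new HashMap<>();
   static long rebuildSteps = 0; // counts map-rebuild work, for illustration only

   void add(int uniqueID)
   {
      list.add(new int[] { uniqueID, 0 });
      uniqueIDMap.clear(); // mimics invalidateCache() on every mutation
   }

   int[] getByUniqueID(int id)
   {
      if (uniqueIDMap.size() != list.size())
      {
         uniqueIDMap.clear();
         for (int[] item : list) // full rebuild on every cache miss
         {
            uniqueIDMap.put(item[0], item);
            rebuildSteps++;
         }
      }
      return uniqueIDMap.get(id);
   }
}
```

With 100 interleaved add/lookup pairs the rebuild loop runs 1 + 2 + ... + 100 = 5050 times; with tens of thousands of tasks this dominates the load time, which matches the ~95% profile share mentioned above.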
As far as I understand, `m_uniqueIDMap` is used to cache tasks during parsing (which is fine), but it is cleared on each iteration and then refilled with all previously loaded tasks (which is not fine).
`PrimaveraReader.processTasks()`:

```java
...
// First loop: load parent tasks (WBS)
for (Row row : wbs)
{
   // An empty Task is created and added to the list of tasks (TaskContainer)
   Task task = m_project.addTask();
   ...
   // Populate the Task with data from the file. clearUniqueIDMap() is called here!
   // (AbstractFieldContainer.invalidateCache -> Task.invalidateCache)
   // That is not a big deal yet; the problem comes in the second loop.
   processFields(m_wbsFields, row, task);
   ...
}
...
// Second loop: load tasks
for (Row row : tasks)
{
   Task task;
   Integer parentTaskID = row.getInteger("wbs_id");
   // The problem is here. Even though we have already loaded N tasks,
   // m_uniqueIDMap is empty, so getTaskByUniqueID() re-populates the
   // whole map again (see above).
   Task parentTask = m_project.getTaskByUniqueID(parentTaskID);
   ...
   processFields(m_taskFields, row, task); // m_uniqueIDMap is cleared here!
}
```
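As a reader-level illustration of the cost being paid per row, the parent lookup in the second loop could instead be resolved against an index built once, in O(n). This is a hedged sketch with hypothetical names (`ParentLookupSketch`, the `Task` stand-in), not MPXJ's actual fix:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only, not MPXJ classes.
class ParentLookupSketch
{
   static class Task
   {
      final int uniqueID;
      final String name;

      Task(int uniqueID, String name)
      {
         this.uniqueID = uniqueID;
         this.name = name;
      }
   }

   // Build the unique-ID index once, before the per-row loop: O(n) total,
   // instead of a per-row lookup that may rebuild its cache each time.
   static Map<Integer, Task> indexByUniqueID(List<Task> tasks)
   {
      Map<Integer, Task> index = new HashMap<>();
      for (Task t : tasks)
      {
         index.put(t.uniqueID, t);
      }
      return index;
   }
}
```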
To fix it, I would propose these changes (of course, it is up to you how it is done):

1. `ProjectEntityContainer` (`ListWithCallbacks`) could use only `m_uniqueIDMap` instead of `m_list`. I would say it is not necessary to duplicate the data between two collections, and I don't think `invalidateCache()` would be needed in that case.
2. Change the logic around creating a `Task` (and the other project entities: `Calendar`, `Resource`, ...). Ideally:
   - Step 1: create the `Task` and fill it with data.
   - Step 2: add it to `TaskContainer`. With point #1 it would be added to `m_uniqueIDMap` (`Map<UniqueID, Task/Resource/Calendar>`); otherwise it would be added to `m_list` (current implementation) and to `m_uniqueIDMap`. Also, `processFields()` should not clear `m_uniqueIDMap`.
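A minimal sketch of proposal #1, assuming the container can be backed by a single insertion-ordered map (the class and method names here are illustrative, not MPXJ's):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch of proposal #1: keep entities in one insertion-ordered
// map keyed by unique ID, so lookups are O(1) and no cache invalidation
// or rebuild is ever needed.
class SimpleEntityContainer<T>
{
   // LinkedHashMap preserves insertion order, like m_list would.
   private final Map<Integer, T> byUniqueID = new LinkedHashMap<>();

   void add(int uniqueID, T item)
   {
      byUniqueID.put(uniqueID, item); // map maintained at insert time
   }

   T getByUniqueID(int uniqueID)
   {
      return byUniqueID.get(uniqueID); // constant time, no rebuild
   }

   int size()
   {
      return byUniqueID.size();
   }
}
```

The design point is simply that the index is maintained incrementally on every add, so there is never a moment when the map is "stale" and needs a full O(n) rebuild.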
By the way, maybe this issue is somehow related to #266.
Cheers,
Alexander
@alex-matatov thanks for opening the issue, I will take a look as soon as I can. Would you be able to email me a large sample XER file to validate any changes I make (happy to NDA if required).
I think you can reproduce it with any P6 file (both XML and XER). Basically, the problem is the cache-clear operation in `ProjectEntityContainer`.

By the way, I made a quick-and-dirty fix that avoids the cache clearing, and that XER loaded in 4 seconds instead of 5 minutes.
@alex-matatov thanks again for opening the issue, I've merged some changes to address this and improve performance. They'll be available in the next MPXJ release.
Hi Jon,
Thank you SO much!
I've just checked "master", and indeed the 73 MB XER processing time was reduced to ~10 seconds. Good job :) .
Cheers,
Alex