eclipse / xtext

Eclipse Xtext™ is a language development framework

Home Page: http://www.eclipse.org/Xtext

Timeout on Tycho 4/5 builds

cdietrich opened this issue · comments

Tycho 4/5 builds regularly get killed, apparently because of hanging tests.
This needs to be investigated.

cc @LorenzoBettini @szarnekow

https://ci.eclipse.org/xtext/job/xtext/job/cd_tycho40/
https://ci.eclipse.org/xtext/job/xtext/job/cd_tycho50/

SeveralEditorsQueuedBuildTest seems to be one culprit, SaveWithReconciliationQueuedBuildDataTest another one

Only in Jenkins? Or also on GitHub Actions?

Does this run on Actions daily at all? Not to my knowledge.

I meant the workflows that ran the first time you pushed those branches.

These branches are months old and are updated only rarely.
But I usually don't check the GitHub Actions results for them.

wonder if there is an option to get a jstack / thread dump on timeout

We might also have to set a test timeout.
Right now it's the Jenkins job that times out.
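
One way to get a jstack-like dump without external tooling is to ask the JVM itself through `ThreadMXBean`. The sketch below is only an illustration of that idea, not something taken from the Xtext code base: the class `HangWatchdog` and its methods are hypothetical names, and it assumes the watchdog runs inside the same JVM as the hanging test.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Xtext or Tycho.
public final class HangWatchdog {

    /** Renders the stacks of all live threads, similar in spirit to jstack output. */
    public static String dumpAllThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Only request lock details when the JVM supports them.
        ThreadInfo[] infos = threads.dumpAllThreads(
                threads.isObjectMonitorUsageSupported(),
                threads.isSynchronizerUsageSupported());
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : infos) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("\tat ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    /** Runs the action; if it is still running after the timeout, prints a dump to stderr. */
    public static void runWithDump(Runnable action, long timeout, TimeUnit unit) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "hang-watchdog");
            t.setDaemon(true); // must not keep the JVM alive on its own
            return t;
        });
        ScheduledFuture<?> pending = timer.schedule(
                () -> System.err.println(dumpAllThreads()), timeout, unit);
        try {
            action.run();
        } finally {
            pending.cancel(false); // finished in time (or failed): no dump needed
            timer.shutdownNow();
        }
    }
}
```

This only reports where the threads are stuck; it does not abort the test, so the Jenkins timeout (or a separate test timeout) would still be needed to end the run.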

I don't know if that's the case here, but the last time I saw a stuck job on Xtext GitHub Actions, which I had to cancel, it was due to a small bug I introduced in my fork that caused Xtend renaming to throw an exception. Of course, I could reproduce the bug locally: everything was stuck because of the Eclipse dialog showing the uncaught exception. With that dialog in the way, the tests did not proceed at all...

hang can be reproduced locally
@szarnekow any idea
a2.log
what are we waiting for there

maybe
org.eclipse.xtend.ide.tests.builder.JavaEditorExtension.waitForElementChangedEvent(int, ()=>void)
should get a timeout.
but that would lead to flaky tests i guess

I'm pretty sure there should be an event since we change the content of the editor.
Very surprising.

If I understand correctly this problem shows up only since Tycho 4, right?
And it is a race condition?
I wonder if Tycho introduced some default configuration for parallel tests that breaks these tests.

saw it run as plugin test.

What do you mean?
It runs fine when executed from Eclipse?

No, it hung on the 10th attempt or so.

Maybe trying with org.eclipse.xtend.ide.tests.builder.JavaEditorExtension.VERBOSE set to true and see what happens? I cannot reproduce it locally.

By the way,

	def waitForElementChangedEvent(int eventMask, =>void producer) {
		if (VERBOSE) {
			println('''start waiting for an element changed event: «eventMask»''')
		}
		val changed = new AtomicBoolean(false)
		JavaCore.addElementChangedListener(
			[
				JavaCore.removeElementChangedListener(self)
				if (!changed.get) {
					changed.set(true)
					if (VERBOSE) {
						println(it)
					}
				}
			], eventMask)
		producer.apply
		while (!changed.get) {
			if (Display.getCurrent() !== null) {
				while (Display.getDefault().readAndDispatch()) {
					// process queued ui events
				}
			}
		}
		if (VERBOSE) {
			println('''end waiting for an element changed event: «eventMask»''')
		}
	}

maybe it should never be the case but if Display.getCurrent() is null we never exit that while loop.

log says start waiting for an element changed event: 1
with verbose.
15th attempt this time
which tp do you use?

Maybe something about reconciliation has changed in recent JDT, so that the event never happens.

The latest. Reloaded this morning.

am also using latest

What does the screenshot want to show? Is the program stuck there?

yes. so the event is never produced

I cannot reproduce it on macOS m1 nor on a slower laptop with Linux...

On the screenshot, the console shows that the event has been caught and printed on the screen.

Could you please try this?

		while (!changed.get && Display.getDefault().readAndDispatch()) {
			if (VERBOSE) {
				println("readAndDispatch")
			}
		}

leads to test errors with

/*******************************************************************************
 * Copyright (c) 2024 itemis AG (http://www.itemis.eu) and others.
 * This program and the accompanying materials are made available under the
 * terms of the Eclipse Public License 2.0 which is available at
 * http://www.eclipse.org/legal/epl-2.0.
 * 
 * SPDX-License-Identifier: EPL-2.0
 *******************************************************************************/
package org.eclipse.xtend.ide.tests.builder;

import org.junit.runner.RunWith;
import org.junit.runners.Suite;
import org.junit.runners.Suite.SuiteClasses;

/**
 * @author dietrich - Initial contribution and API
 */
@RunWith(Suite.class)
@SuiteClasses({
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
	SeveralEditorsQueuedBuildTest.class,	
})
public class DullySuite {

}

Yes, on this line (assertThereAreNotDeltas)

		barEditor.save
		queuedBuildDataContribution.unconfirmedDeltas.assertThereAreDeltas("mypackage.Bar")
		queuedBuildData.andRemovePendingDeltas.assertThereAreNotDeltas

...but only sometimes on my computer (though I'm not using your suite)

see the hang also with the suite sometimes

I managed to see the hang with your suite.
Twice, it hangs in closeEditorWithChanges.
If I close the welcome view at that point, the editor is not closed.
If I manually close the editor, the suite goes on with that test failed at:

java.lang.AssertionError: There are not deltas
	at org.junit.Assert.fail(Assert.java:89)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.eclipse.xtend.ide.tests.builder.AbstractQueuedBuildDataTest.assertThereAreDeltas(AbstractQueuedBuildDataTest.java:99)
	at org.eclipse.xtend.ide.tests.builder.SeveralEditorsQueuedBuildTest.closeEditorWithChanges(SeveralEditorsQueuedBuildTest.java:130)

I can't seem to be able to reproduce it if I manually close the welcome screen when the suite starts;
could you please try as well?

I thought we already close the welcome screen somewhere in a base class.

does not help

@cdietrich I pushed my branch based on yours and merged with main.

I think it's better to do what I was proposing (245619f): the nested while is prone to looping forever, and I would say it's wrong from the concurrency/synchronization point of view.
It's better to exit the while and let the test fail; we can rely on our configuration for re-running flaky tests.
Let's see how the build is going.

I'm not familiar with those tests, but it looks like it's due to some platform/jdt synchronizations.

yes i hope for some input from @szarnekow

On GitHub Actions with my change a flake is detected (https://pipelinesghubeus7.actions.githubusercontent.com/xwd7D2dTH7KjkIQFqrWM6jPOzIKPN7cIWrsK5HktZyQ1MMY4jS/_apis/pipelines/1/runs/1292/signedlogcontent/5?urlExpires=2024-01-09T17%3A27%3A18.7778195Z&urlSigningMethod=HMACV1&urlSignature=HdvChU33DWHSe%2F5nHRm4MK91p%2BI%2BSlUjBbrcbFMkphI%3D):

Flakes: 
org.eclipse.xtend.ide.tests.builder.SeveralEditorsQueuedBuildTest.saveSeveralEditorsOneByOne
Run 1: SeveralEditorsQueuedBuildTest.saveSeveralEditorsOneByOne:66->AbstractQueuedBuildDataTest.assertThereAreDeltas:99->Assert.assertTrue:42->Assert.fail:89 There are not deltas
Run 2: PASS

On Jenkins with your branch with Tycho 5 everything runs smoothly... did you change anything on that branch?

Note that the test was sane and should succeed in any case since we certainly do want to see a JDT event within a reasonable time budget. I'm not aware of any JDT changes that would explain why this should no longer be the case.

I see that the linked commit #2895 uses Display.getDefault yet it's still failing as flaky. So it's unlikely that Tycho 4 uses a different thread for the display.

On Jenkins it succeeds: https://ci.eclipse.org/xtext/job/xtext/job/lb_tycho40_2895/1/console

The nested while though can lead to an infinite loop as was happening in a few CI runs.
I mean, if there's a problem in JDT, with the current shape, if we don't get a JDT event the code loops forever.
With a single while, the test fails without looping forever.
If I understand it correctly.

I believe that this is really related to a bug in platform, not our fault.

As I said, when I see the test hang, or fail with my commit 245619f, the dirty editor is not actually closed, even though the call to close the editor returned correctly; the editor is still there and usable. The same holds when the hanging or failing test programmatically saves an editor: you can see that the editor still appears dirty.

I've been hit by the same problem in my Xtext development workspace with 2023-04 M1: I open an editor, make some changes, and save, and the editor is still dirty. If I close the editor, I'm asked to save the changes, and I accept. I reopen the file, and the file has not been saved. Unfortunately, I constantly see that every day on the first opened file after I started Eclipse :(

In any case, I think we should protect our UI tests from such bugs, so that a test cannot hang the CI. Note that when it hangs, it is not simply in a deadlock: it's eating all the resources since it continuously executes the while loop, so it's a critical problem.

My original proposal makes the test too flaky, I agree; but I can propose an alternative solution that at least does not hang and fails after some attempts. When we hit the problem, the test becomes flaky, but it's less likely, and at least the build will try to re-execute that.
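
To make the idea of "does not hang and fails after some attempts" concrete, here is a minimal sketch of a bounded wait in plain Java. It is not the code from the commits mentioned in this thread: `BoundedWait`, `waitUntil`, and the `pumpEvents` callback are hypothetical names, and the callback only stands in for draining the SWT event queue (for example a loop over `Display.readAndDispatch()`), so that the example stays free of Eclipse dependencies.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper illustrating a wait loop with a deadline.
public final class BoundedWait {

    /**
     * Polls the condition until it holds or the timeout elapses.
     *
     * @param condition  what we are waiting for (e.g. "the changed flag is set")
     * @param pumpEvents stand-in for processing queued UI events between polls
     * @return true if the condition held in time, false on timeout or interrupt
     */
    public static boolean waitUntil(BooleanSupplier condition, Runnable pumpEvents,
            long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // give up instead of spinning forever
            }
            pumpEvents.run();
            try {
                // Short pause so the loop does not burn a full CPU core.
                // Assumption: in a real SWT test, Display.sleep() may be the better fit.
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }
}
```

A caller would turn a `false` result into a test failure (for instance with `Assert.fail`), so a missing JDT event shows up as a failed, re-runnable test rather than as a build that is killed hours later by the CI timeout.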

Besides what I wrote above, it's not a Tycho 4/5 problem anymore: it officially appears also on the main branch: https://ci.eclipse.org/xtext/job/xtext/job/main/820/console:

08:08:22  Running org.eclipse.xtend.ide.tests.builder.SeveralEditorsQueuedBuildTest
12:44:40  Cancelling nested steps due to timeout
12:44:40  Sending interrupt signal to process