MicrosoftResearch / Naiad

The Naiad system provides fast incremental and iterative computation for data-parallel workloads

Home Page: http://microsoftresearch.github.io/Naiad/


https://github.com/MicrosoftResearch/Naiad/issues/20

khgl opened this issue · comments

Hello,

We are running calculations on Naiad, and the computation fails on large datasets. With a single thread and small datasets there is no problem. We are running Naiad in a virtual machine on a Mac.
Can you help us find the reason?

Thank you.

Error:

    Logging initialized to console
    00:07:57.5913840, Graph 0 failed on scheduler 1 with exception:
    System.NullReferenceException: Object reference not set to an instance of an object.
       at Microsoft.Research.Naiad.Dataflow.VertexOutputBufferPerTime`2.Send(TRecord record) in c:\Users\khgl\Desktop\codebase\Naiad\Naiad\Frameworks\StandardVertices.cs:line 313
       at Microsoft.Research.Naiad.Runtime.Progress.ProgressUpdateAggregator.ConsiderFlushingBufferedUpdates() in c:\Users\khgl\Desktop\codebase\Naiad\Naiad\Runtime\Progress\UpdateAggregator.cs:line 157
       at Microsoft.Research.Naiad.Runtime.Progress.ProgressUpdateAggregator.OnRecv(Dictionary`2 deltas) in c:\Users\khgl\Desktop\codebase\Naiad\Naiad\Runtime\Progress\UpdateAggregator.cs:line 77
       at Microsoft.Research.Naiad.Runtime.Progress.ProgressUpdateProducer.Start() in c:\Users\khgl\Desktop\codebase\Naiad\Naiad\Runtime\Progress\UpdateProducer.cs:line 92
       at Microsoft.Research.Naiad.Scheduling.Scheduler.DrainMessagesForComputation(Int32 computationIndex) in c:\Users\khgl\Desktop\codebase\Naiad\Naiad\Runtime\Scheduling\Scheduler.cs:line 343
    00:07:57.5941910, Cancelling execution of graph 0, due to exception:
    System.NullReferenceException: Object reference not set to an instance of an object.
       (same stack trace as above)


Hello,

Can you check that you are running the most recent version? There was a race condition in approximately that part of the code which was fixed fairly recently (1-2 weeks ago).

Assuming it is not that, it also looks like an issue we've recently seen reported by others. Line 313 is trying to write into a payload array, and that array is never supposed to be un-allocated or inappropriately sized. Some other folks have seen what looks like memory corruption, and we haven't tracked down whether it is Naiad's unsafe code, or Mono, or something else.

You'll notice that all access to this.Output in UpdateAggregator.cs is right around line 157 (other than where it is assigned, in the constructor), and under a global lock. So, it is especially vexing that something is able to null out a field in the object.

I'll take a closer look, and keep you up to date with what we learn in the other case. We are trying to get a reproducible test case, or something reproducing on a non-Mono runtime, just to narrow down whose fault it is and where we can look to fix it. For clarity, which OS are you hosting in the VM?

Ok, reviewed a bit of code, and I suspect this is totally our fault, not mono or any other nice people. Let me explain:

The ProgressUpdateAggregator is one of the few places in Naiad where multiple threads in a process actually work with shared state. It is where they dump their collective notes on how much work they've done, and some "smarts" get used to accumulate the notes until a meaningful update has happened. Each of the threads is very well disciplined, and acquires all sorts of locks before they start working with the shared state. They also make sure to only send data on this.Output while under a lock, because we don't really think our channels are thread-safe.

This is all well and good, but nobody told the general purpose vertex flushing code about that lock. So, if anyone calls ProgressUpdateAggregator.Flush() while someone is working, it can rip the buffer out from under them. I'm not super clear when this would happen (the vertex doesn't need to be flushed, as it flushes its output explicitly), but it seems to fit the symptoms.
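To make the suspected race concrete, here is a minimal, hypothetical sketch (all class and member names are invented for illustration; this is not Naiad's actual code): one thread appends to a buffer under a lock, while a general-purpose flush path that ignores the lock swaps the buffer out from under it.

```csharp
using System;

// Hypothetical illustration of the suspected race. The "disciplined"
// Send path always takes the lock before touching shared state, but a
// general-purpose Flush() that was never told about the lock can rip
// the buffer out mid-send, producing a NullReferenceException.
class RacyAggregator
{
    private readonly object bigLock = new object();
    private int[] buffer = new int[1024];
    private int count = 0;

    // Disciplined path: only touches the buffer under the lock.
    public void Send(int record)
    {
        lock (this.bigLock)
        {
            this.buffer[this.count++] = record;   // NRE here if Flush() ran concurrently
            if (this.count == this.buffer.Length)
                this.count = 0;                   // pretend we shipped the buffer
        }
    }

    // Undisciplined path: nobody told this code about bigLock.
    public void Flush()
    {
        var full = this.buffer;
        this.buffer = null;            // buffer is briefly null: the race window
        // ... ship `full` somewhere ...
        this.buffer = new int[1024];
        this.count = 0;
    }
}
```

If `Flush()` is ever invoked while another thread is inside `Send()`, the null window is exactly the kind of thing that would surface as the `NullReferenceException` at StandardVertices.cs line 313.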

If you would be so kind, would you consider adding

    this.AddOnFlushAction(() => Console.Error.WriteLine("Flushing UpdateAggregator?!?"));

in between lines 170 and 171 of UpdateAggregator.cs (in the constructor)? Vertex Flush should only be called by an endpoint directed at the vertex (or explicitly, by the vertex), and I don't see the code path that would do this (the vertex is not constructed with any inputs). But, if this prints out it would suggest that it is happening. If so, could you snag a stack trace (just throw an exception there, and we'll report it back)? The code behavior here is a bit data-dependent, and just because I don't see it doesn't mean it isn't happening to you.

In the meantime, I'll start to figure out a fix, for example just not registering this.Output as something to flush, but Michael is the only person who could check anything in, and he is on the road at the moment. I'll post again once we've managed to do something productive.

Thanks very much for the bug report!

Hello Frank,
I have updated the Naiad DLL, but the same error occurred again. The OS is Microsoft Windows 8.

I can put my code here; it is a really simple one, a benchmark of our use case, so running big datasets is very important for us:
Vertex :

public static class ExtensionMethods
{
    public static Stream<double, Epoch> StreamingRevenue(this Stream<LINEITEM, Epoch> stream)
    {
        return stream.NewUnaryStage(
            (i, s) => new Query6Vertex(i, s), 
            x => x.GetHashCode(), 
            null, "Revenue");
    }

    internal class Query6Vertex: UnaryVertex<LINEITEM, double, Epoch>
    {
        double REVENUE = 0;
        DateTime c1 = new DateTime(1994, 1, 1);
        DateTime c2 = new DateTime(1995, 1, 1);

        public override void OnReceive(Message<LINEITEM, Epoch> message)
        {
            double agg1 = 0;
            for (int i = 0; i < message.length; i++)
            {
                LINEITEM lineitem = message.payload[i];

                if (lineitem.shipdate >= c1 &&
                    c2 > lineitem.shipdate &&
                    lineitem.discount >= 0.05 &&
                    0.07 >= lineitem.discount &&
                    24L > lineitem.quantity)
                    agg1 += lineitem.function * (lineitem.extendedprice * lineitem.discount);
            }
            REVENUE = agg1;
            var output = this.Output.GetBufferForTime(message.time);
            output.Send(REVENUE);
        }

        public override void OnNotify(Epoch time)
        {
            var output = this.Output.GetBufferForTime(time);
            output.Send(REVENUE);
        }

        public Query6Vertex(int index, Stage<Epoch> stage) : base(index, stage) { }
    }
}
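The Main code below also calls `StreamingRevenueReduce()`, which is not shown in the thread. A hypothetical sketch of what such a stage might look like, following the same `NewUnaryStage`/`UnaryVertex` pattern as `Query6Vertex` above (this is an assumption for illustration, not the reporter's actual code):

```csharp
// Hypothetical sketch of the unshown StreamingRevenueReduce(): route all
// partial revenues to a single vertex (partition key 0) and sum them,
// emitting the total when the epoch completes.
public static Stream<double, Epoch> StreamingRevenueReduce(this Stream<double, Epoch> stream)
{
    return stream.NewUnaryStage(
        (i, s) => new ReduceVertex(i, s),
        x => 0,            // send every partial sum to one vertex
        null, "RevenueReduce");
}

internal class ReduceVertex : UnaryVertex<double, double, Epoch>
{
    double total = 0;

    public override void OnReceive(Message<double, Epoch> message)
    {
        for (int i = 0; i < message.length; i++)
            total += message.payload[i];
        this.NotifyAt(message.time);   // ask for a callback once the epoch is done
    }

    public override void OnNotify(Epoch time)
    {
        this.Output.GetBufferForTime(time).Send(total);
        total = 0;
    }

    public ReduceVertex(int index, Stage<Epoch> stage) : base(index, stage) { }
}
```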

Main:

        {

            var source = new BatchedDataSource<LINEITEM>();
            var query6 = computation.NewInput(source).StreamingRevenue().StreamingRevenueReduce();
            query6.Subscribe(list => { foreach (var element in list) Console.WriteLine("Revenue : " + element); });


            int batchSize = 10;
            LINEITEM[] data = getLineItems();
            for (int i = 0; i < data.Length; )
            {
                int nextSize = Math.Min(data.Length - i, batchSize);
                LINEITEM[] nextBatch = new LINEITEM[nextSize];
               // Console.WriteLine(i + " " + nextSize);
                for (int j = 0; j < nextSize; j++)
                    nextBatch[j] = data[i++];

                source.OnNext(nextBatch);
            }
            computation.Activate();       // activate the execution of this graph (no new stages allowed).
            source.OnCompleted();   // signal the end of the input.
            computation.Join();           // waits until the graph has finished executing.
        }

That seems pretty simple, and I have a Windows 8 VM! Is the input data something that can be generated / shared?

It is the TPC-H lineitem table for the big dataset. For download: https://www.dropbox.com/s/3atragjp6pr5d9r/lineitem_big.csv?dl=0

Thanks. I've grabbed it. I don't suppose you would be willing to share your project too, so that I can just run it and see it explode (and not have to re-implement things like LINEITEM)?

Thanks!

I created a simple Naiad program in which the error occurs in my VM: https://www.dropbox.com/s/pw2lp64c2k8gp0c/TestNaiad.zip?dl=0

The program needs to be run with the parameter -t 4.

Best,
Khayyam.

Thanks very much! I'll fire it up and report back.

Well, actually I think in the short term I'll let Michael see if he can reproduce it. My copy of Visual Studio has expired, the "Community" version installer errors out with "can't find package", and ... yeah. MSFT.

I'll see if I can get it up and running on Mono and have it explode similarly, but we'll need to wait for Michael to revive from his travels otherwise.

Just a quick comment, that might help in the meantime: The program you've sent uses pretty small batches (10 records) resulting in 60,000 epochs. Naiad currently scales pretty badly (linearly) with the number of outstanding work items, so it's taking maybe 60,000x longer than it should. If you change the batch size to 1,000 the program completes in about a second on mono with four threads.
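Concretely, the only change needed in the batching loop from earlier in the thread is the constant (a sketch of the workaround, not a fix for the underlying bug):

```csharp
// Larger batches mean far fewer epochs: with the ~600,000 records
// implied above, batchSize = 10 yields 60,000 epochs, while
// batchSize = 1000 yields 600.
int batchSize = 1000;               // was 10
for (int i = 0; i < data.Length; )
{
    int nextSize = Math.Min(data.Length - i, batchSize);
    LINEITEM[] nextBatch = new LINEITEM[nextSize];
    for (int j = 0; j < nextSize; j++)
        nextBatch[j] = data[i++];
    source.OnNext(nextBatch);
}
```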

That isn't a solution to the bug, but it might help you work around it for the moment. I'm back to trying to get it to reproduce. (How long before the crash for you, usually? It looks like it will run for a while due to the above issue.)

Thanks very much! It works when we increase the batch size. However, there is a behavior that is difficult for me to understand: in all cases, running with more threads is slower than with a single thread. As I understand it, -t 4 will create at least 4 vertices, and it should be considerably faster than running with one thread, which uses only one vertex.
It takes at least 4-5 minutes before the crash.

Hi,

I get the same running time for both -t 1 and -t 4 under a Debug build (Release is broken for me at the moment).

Done in 00:00:07.1995030

For something with very little computation, most of the time ends up being spent in data ingress. All but 0.5s is spent in getLineItems(), which is code outside of Naiad. One variation you could do, for example, is to rewrite the computation to do the parsing of getLineItems() in parallel, using a LINDI Select. When I do this, writing a function:

    public static LINEITEM parseLineItem(string s) { /* per-record code from getLineItems() */ }

the time improves with two threads to

Done in 00:00:05.2192670

When I take it up to four threads, it slows down, which makes sense for me, at least, because I have two real cores and two hyper threaded cores. There is nothing in the code causing memory misses, so the hyper threading isn't buying me anything. Given that it takes about 1.3s for me to load up the strings in the first place, this is a difference of 6s -> 4s, which isn't horrible.
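A sketch of that parallel-parsing variation, assuming Naiad's Lindi `Select` operator is available and that `parseLineItem` holds the per-record parsing code (the file name, batch handling, and `parseLineItem` itself are illustrative assumptions):

```csharp
// Sketch: ingest raw CSV lines and parse them inside the dataflow with
// Select, so parsing runs on all worker threads instead of up front in
// getLineItems(). Assumes using System.IO and System.Linq.
var lineSource = new BatchedDataSource<string>();
var query6 = computation.NewInput(lineSource)
                        .Select(line => parseLineItem(line))   // parallel parse
                        .StreamingRevenue()
                        .StreamingRevenueReduce();
query6.Subscribe(list => { foreach (var element in list) Console.WriteLine("Revenue : " + element); });

computation.Activate();
string[] lines = File.ReadAllLines("lineitem_big.csv");
int batchSize = 1000;
for (int i = 0; i < lines.Length; i += batchSize)
    lineSource.OnNext(lines.Skip(i).Take(batchSize).ToArray());
lineSource.OnCompleted();
computation.Join();
```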

If it is still bad on a machine with multiple cores (I'm not sure what you are using), let me know and I'll see if I can help out.

As regards the crash, I let it run for about 45 minutes and ... nothing. Could you indicate the precise steps to reproduce? Like, Release/Debug build, if Debug, running with/without debugging (a separate thing from the build), etc.

I should also say, thread performance scaling might be all sorts of weird when using a virtual machine. It will depend a lot on how many cores the VM decides to use, for example.

I calculate the running time after reading the input, because in the real application we will not have all the data in memory; we will receive it in batches.
I have 5 cores, and I get worse results no matter how many cores Naiad uses; maybe, as you indicated, it is related to the VM.
It is exactly the same code, running with parameter -t 4 and batch size = 10:
[Screenshot attached: 2014-12-18, 11:17 AM]