palle-k / DL4S

Accelerated tensor operations and dynamic neural networks based on reverse mode automatic differentiation for every device that can run Swift - from watchOS to Linux

Home Page: https://palle-k.github.io/DL4S/

New "You need some help?"

philipturner opened this issue

Use this thread from now on for continued discussion about DL4S

@palle-k another code style I'd like to change:

Turn this:
Array.map {$0...}
into:
Array.map{ $0... }

Some people add a space between map and the first bracket, but I only do that when it goes onto multiple lines. Are you okay with following my convention?
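To illustrate with a throwaway snippet (not DL4S code):

```swift
// Throwaway example of the proposed convention; not taken from the DL4S codebase.
let words = ["a", "bb", "ccc"]

// Single-line closure: no space before the brace, spaces inside it.
let counts = words.map{ $0.count }

// Multi-line closure: a space before the brace is fine here.
let described = words.map { word in
    "\(word) has \(word.count) characters"
}
```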

SwiftLint doesn't seem to have support for this convention. I'm going through all of the code manually right now. Doing this will help me get familiar with the code, so it's okay not to use SwiftLint. How about we prioritize reformatting code for the time being?

Also, I changed my mind again about the GPU: the GPU type will just be the integrated GPU, acting as a dynamic device that offloads ops to the CPU whenever statistics show that's faster. The switch to the CPU will be heavily optimized for Apple silicon, since Accelerate uses the AMX, which can drastically change performance characteristics (although Intel's AVX2 isn't bad either). I'm used to switching between CPU and GPU in compact/allocate for scene color reconstruction, so I'm familiar with dynamic computation behavior.
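As a rough sketch of the kind of dispatch heuristic I mean (none of these types exist in DL4S; purely illustrative):

```swift
// Hypothetical sketch of a dynamic CPU/GPU dispatch heuristic.
// None of these types exist in DL4S; this only illustrates the idea.
enum Backend {
    case cpu
    case gpu
}

struct DispatchStatistics {
    // Running averages of measured execution time per op name, in seconds.
    var cpuTime: [String: Double] = [:]
    var gpuTime: [String: Double] = [:]

    // Pick whichever backend has historically been faster for this op;
    // fall back to a simple size threshold when no statistics exist yet.
    func backend(for op: String, elementCount: Int) -> Backend {
        if let cpu = cpuTime[op], let gpu = gpuTime[op] {
            return cpu <= gpu ? .cpu : .gpu
        }
        return elementCount < (1 << 16) ? .cpu : .gpu
    }
}
```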

If I deem it worth my time, I will later add a separate dGPU device, which runs Metal using private storage mode for all resources. This will be easier than setting static properties, and it will allow the compiler to optimize away what would otherwise have been a big mess of conditionals.
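For reference, the storage mode is just a resource option at buffer-creation time; a minimal plain-Metal sketch (not DL4S code):

```swift
import Metal

// Minimal sketch of the storage-mode difference for a hypothetical dGPU backend;
// this is plain Metal, not DL4S code.
guard let device = MTLCreateSystemDefaultDevice() else { fatalError("No Metal device") }

// Private storage: GPU-only memory, fastest for a discrete GPU, not CPU-accessible.
let privateBuffer = device.makeBuffer(length: 1024 * MemoryLayout<Float>.stride,
                                      options: .storageModePrivate)

// Shared storage: visible to both CPU and GPU, convenient on integrated GPUs.
let sharedBuffer = device.makeBuffer(length: 1024 * MemoryLayout<Float>.stride,
                                     options: .storageModeShared)
```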

I have a gut feeling that I'll never get around to implementing dGPU, but I might be wrong. I will be careful to leave room in naming conventions for a dGPU (in case anybody else implements it in the future), just like you designed DL4S from the start with GPU acceleration in mind via the generic device.

Mentioning @RahulBhalley because he hasn't been mentioned in this thread.

You commented out a crucial part of the CPU allocator’s code - the part of the free() function that actually adds buffers to the cache. Why did you do that? It prevents the cache from ever being used.

commented

I don't think formatting should be a priority, TBH.

Regarding the cache, if I remember correctly, malloc and free are actually so fast that there is no benefit to caching allocations on the CPU.

Then we need to remove the cache-dictionary lookup in the allocate() function, as it pointlessly wastes clock cycles and generates an extra function call. allocate() and free() should inline down to malloc() and free().
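In other words, the simplified allocator would collapse to something like this (a hypothetical sketch, not the actual DL4S implementation):

```swift
import Foundation

// Hypothetical sketch of a cache-free CPU allocator; not DL4S's actual implementation.
enum SimpleCPUAllocator {
    // allocate() forwards straight to malloc; no cache-dictionary lookup.
    @inline(__always)
    static func allocate(byteCount: Int) -> UnsafeMutableRawPointer {
        guard let pointer = malloc(byteCount) else {
            fatalError("Allocation of \(byteCount) bytes failed")
        }
        return pointer
    }

    // deallocate() forwards straight to free; nothing is retained in a cache.
    @inline(__always)
    static func deallocate(_ pointer: UnsafeMutableRawPointer) {
        free(pointer)
    }
}
```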

I'll put my PRs in the feature/metal branch starting with the change to the CPU allocator. That's the first commit that changes the behavior of DL4S. In the meantime, could you approve my PR to the master branch?

May I remove the #if DEBUG directives around assertions that will be removed in release mode anyway?
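For context, assert() already compiles to a no-op in release (-O) builds, so the wrapper is redundant; a quick illustration (not a specific DL4S call site):

```swift
// Quick illustration; not a specific DL4S call site.
let shape = [2, 3, 4]

// Current style: the #if DEBUG wrapper is belt-and-suspenders.
#if DEBUG
assert(shape.allSatisfy { $0 > 0 }, "Tensor shape must be positive")
#endif

// Proposed style: assert() is already stripped in release (-O) builds,
// so dropping the wrapper doesn't change behavior.
assert(shape.allSatisfy { $0 > 0 }, "Tensor shape must be positive")
```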

I can't test whether DL4S works on Linux after importing ARHeadsetUtil. Could you verify that?

Some XCTests were freezing. All of the model tests froze when initializing the model. I commented out the broken ones so you could at least run the entire suite in a reasonable amount of time. None of these changes are in the pending pull request yet.

commented

Make sure to run the tests in release mode. In debug mode, you're gonna have to wait a lot.

The compiler keeps giving errors when I attempt to test in release mode

I'll investigate further. Also, are you fine with removing unnecessary instances of #if DEBUG like I described above?

commented

Regarding Linux: I don't have a Linux machine here either. Do you have the option to test in docker?

commented

If you compile the tests in release mode, you need to pass the -enable-testing flag to the Swift compiler. That's not possible if you open the Swift package in Xcode (only in a way that breaks a lot of other things). Either generate an Xcode project or put the -Xswiftc -enable-testing flag in the swift test command.

Doing that right now - just un-commenting the tests.

I don't have docker set up. I think ARHeadsetUtil will compile fine, since I put guards around the import statements to Apple frameworks.

Again, can I remove unnecessary #if DEBUG?

@palle-k when you're done reading this comment, please let me know that you have read this. I don't want to type all of this just for it to get lost in the comment history, and it concerns you especially.

Mentioning @RahulBhalley, @palle-k, @ryendu, @digantamisra98 because this is important to all of you. Stuff is finally getting real, so @palle-k please create a DL4S organization and invite us. If you're strapped for time, I can make and own it instead—just let me know in a comment below.


More importantly, there is some major bad news for DL4S. You designed it from the ground up for code reuse, where an accelerator could just be inserted in place of the CPU at key entry points with everything else untouched.

I just read a research paper on the Mish activation function, and a highly optimized CUDA version of Mish caught my interest. I realized that complex activation functions need assembly-level optimizations. Under the current model, you would read values from and store values to memory at each mathematical component of Mish (e.g. log(), exp()...). That prevents the kind of assembly-level optimization I may need (depending on what made Mish-CUDA fast).
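For reference, Mish is x * tanh(softplus(x)) = x * tanh(ln(1 + e^x)); a plain-Swift sketch (not DL4S code) shows why the unfused version touches memory once per intermediate:

```swift
import Foundation

// Plain-Swift sketch of Mish, mish(x) = x * tanh(ln(1 + exp(x))).
// Not DL4S code; it only illustrates how an unfused version touches memory
// once per intermediate instead of once overall.
func mishUnfused(_ input: [Double]) -> [Double] {
    let softplus = input.map { log(1 + exp($0)) }   // pass 1: read input, write softplus
    let gated = softplus.map { tanh($0) }           // pass 2: read softplus, write gated
    return zip(input, gated).map { $0 * $1 }        // pass 3: read both, write output
}

// A fused version keeps every intermediate in registers and makes one pass.
func mishFused(_ input: [Double]) -> [Double] {
    input.map { x in x * tanh(log(1 + exp(x))) }
}
```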

Although there might be an easy solution for the above problem, there is something much worse I have been debating over the last few days. Metal command buffers have an insane amount of overhead - tens of microseconds - while CPU function calls have orders of magnitude less. That will become a massive bottleneck, particularly for smaller models but possibly for medium or large ones too.

I could examine the distribution of times for atomic operations in DL4S over a wide variety of use cases. Doing so would consume a lot of time. However, I already have a gut feeling that the investigation would prove we need to rewrite everything for the GPU anyway, so I won't waste precious time on it.

In short, (almost) all the code you planned to share between CPU and GPU internal to DL4S may unfortunately have been in vain. Although the GPU may be a dynamic device using both CPU and GPU where appropriate, we will need to build everything over again from scratch (with the exception of high-level stuff like the public API, DL4S-Tensorboard, and the Swift package tests). To use the CPU and GPU at the same time, we will need multithreading and CPU-GPU pipelining. For the GPU itself, we will need indirect Metal command buffers compiled once from a graph of functions, and concurrent execution of data within a batch. We may also need multiple shader variants for combinations of atomic operations, to reduce the computational cost of frequent operations.
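To make the command-buffer overhead concrete, here is a rough Metal sketch (the pipeline states and buffers are assumed to be created elsewhere; nothing here is DL4S code) of encoding a whole chain of kernels into a single command buffer instead of committing one buffer per op:

```swift
import Metal

// Hypothetical sketch: amortize command-buffer overhead by encoding a chain of
// element-wise kernels into a single command buffer, rather than paying tens of
// microseconds to commit one buffer per operation.
func encodeFusedChain(queue: MTLCommandQueue,
                      pipelines: [MTLComputePipelineState],
                      data: MTLBuffer,
                      elementCount: Int) {
    guard let commandBuffer = queue.makeCommandBuffer(),
          let encoder = commandBuffer.makeComputeCommandEncoder() else { return }

    let threadsPerGroup = MTLSize(width: 64, height: 1, depth: 1)
    let groups = MTLSize(width: (elementCount + 63) / 64, height: 1, depth: 1)

    // One encoder, many dispatches: each kernel reads and writes `data` in place.
    for pipeline in pipelines {
        encoder.setComputePipelineState(pipeline)
        encoder.setBuffer(data, offset: 0, index: 0)
        encoder.dispatchThreadgroups(groups, threadsPerThreadgroup: threadsPerGroup)
    }

    encoder.endEncoding()
    commandBuffer.commit()   // a single commit for the entire chain
}
```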

I really hope this unfortunate news isn't the case, as it means we have to debug a whole lot all over again (on top of the GPU being exceptionally troublesome to debug). Your plan for shared code between CPU and GPU wasn't entirely useless, though, especially at the higher level, which this rewrite will not affect.

My next steps will be to further scrutinize the source code of DL4S, and possibly play around with the 3 DL sample projects you posted on GitHub. I need to know DL4S by heart before trying to reconstruct it on the GPU. Also, there is still the minuscule chance I could see an opportunity to integrate GPU with what already works :-)


If you read this long rant carefully, thanks. @palle-k I'm going to assume you're okay with harmless removals of #if DEBUG around assert statements, unless you tell me otherwise.

Metal support seems like it’s much more complex than scene color reconstruction. I estimate it’ll take the greater part of 1000 hours even with my knowledge of GPGPU, and won’t be ready until spring 2022. But it’ll be fun for me!

commented

@philipturner thanks for your estimation.

I guess the Engine part of the library could be reworked in major ways. When I designed DL4S, I was striving for the simplest solution that works, so fused ops were not used.

One option I considered, if it ever came to a proper GPU implementation (not the stuff in feature/metal), was the following approach:

  1. Don't actually execute operations eagerly, only ensure that they will be completed when a synchronization point is hit.
  2. Do not have a 1-to-1 mapping between Engine function calls and kernel invocations.
  3. Implement the compilation of parts of compute graphs into fused kernels, à la MetalPerformanceShadersGraph.
  4. (the hard part): Identify which parts of the compute graph to fuse, and identify instances where fused kernels can be reused.

This approach could theoretically work well especially for deep learning applications, which typically involve the same code getting executed with different data over and over again.

DL4S would basically include a neural network compiler at this point, which could easily cover cases like the optimized Mish activation.

Stuff like this has been done before (TVM and others); one hard part is doing it on dynamic graphs (eager mode).
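As a toy sketch of points 1-4 above (hypothetical types, nowhere near what a real compiler pass would need):

```swift
// Toy sketch of the lazy-evaluation idea; hypothetical types, not a DL4S design.
indirect enum LazyOp {
    case input(name: String)
    case unary(name: String, operand: LazyOp)   // e.g. "exp", "log1p", "tanh"
}

struct LazyTensor {
    let op: LazyOp

    // Applying an op only records a graph node; nothing executes here (point 1).
    func applying(_ name: String) -> LazyTensor {
        LazyTensor(op: .unary(name: name, operand: op))
    }

    // A synchronization point: walk the recorded chain and collapse consecutive
    // element-wise ops into a single fused-kernel description (points 2-4).
    func fusedKernel() -> [String] {
        var chain: [String] = []
        var current = op
        while case let .unary(name, operand) = current {
            chain.insert(name, at: 0)
            current = operand
        }
        return chain
    }
}

// Usage: three recorded ops collapse into one fused kernel description.
let x = LazyTensor(op: .input(name: "x"))
let fused = x.applying("exp").applying("log1p").applying("tanh").fusedKernel()
// fused == ["exp", "log1p", "tanh"]
```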

Just to be clear, fused ops were one of the 3 main reasons why I needed to rework the code. The others were:

  • insane command buffer latencies
  • the need for concurrent execution within a batch or rapid alternation between CPU and GPU (staggering and switching between elements in a batch)

Just want to make sure we are on the same page.

commented

Yes. By compiling graphs and fusing operations, I mean that I want to go beyond stuff like sigmoid, Mish, or convolutions being fused into one kernel.

I'd assume that this idea can be taken way further, meaning that subsequent operations (e.g. max pooling followed by ReLU) may be fused together.

TBH, I'm not super happy with the current design of the engine; I think it could be done better. I noticed that overhead was the main reason my Metal implementation attempt failed. Again, it was the simplest solution for CPU support at the time. Can you explain the overall design of the engine replacement that you want to build?

This plan is getting very long. I'm putting it in a Google Doc and I'll post the link when I'm done.

I'll make a GitHub repo that lists all of our plans for implementing Metal support in DL4S. When plans change, I can just update that.

People who scan over the repo might want to know what we're up to. So, would you be willing to link to that repo in DL4S's README?

commented

One thing: DL4S is already taken as an organization name (DeepLearning4Search).

Any suggestions? DL4Swift? DeepLearning4Swift? Rename the whole thing?

I guess we can follow the strategies of other community-managed open-source projects and create a DL4S-Evolution repo. In it, we can create proposals that outline certain features.

Definitely! I was just thinking of putting DL4S plans in an organization, but I was going to create it myself. While I finish the draft for Metal implementation plans, could you look over my pending pull request? I just wanted to sync some code formatting changes, so it's fine if you just close the PR.

I'd rather set up the organization myself, since I have experience doing so. You and I will be co-owners, and other members will just have view access. I need to message all members to get them to make their organization membership public, as it's private by default (meaning the public can't see that you're a member).

DL4S-Evolution sounds like a good repo name. We will still need to put it under an organization's control, because personally owned repos can't manage individual contributors' access. I suggest calling the organization dl4s-contributors ("DL4S Contributors"), much like I made an arheadsetkit-alpha-testers organization during its alpha stage. Let me know whether you approve of this name. We can always change it.

commented

I can do some code reviews, no problem, though I'm busy today.

DL4S-Contributors is fine for now. DL4S-Team, DL4S-Project or DL4S-Community would also be acceptable for me (with the first being my preferred name).

I sent an invite request to you and the other people I mentioned in the massive comment.

Also removed the pull request. It's really not necessary to add right now.

The plans have been released under the organization, and are publicly visible. Please let me know what you think or if it's beyond your area of expertise.

Because of my new plans for Metal, I have some major, exciting news: a massive change to the public API in DL4S 1.0 that brings it closer in line with Swift for TensorFlow. I still have to finalize my thoughts before I share it.

And it’s all only possible because we scrapped the scheme of sharing code between the CPU and GPU. Hope you’re excited too!

Going to post my idea on the evolution repo; I’ll comment here when it’s ready.

Also, a second major change that lets you compile Swift code into Metal code (via a DSL), letting beginners create custom operators with full hardware acceleration. Very exciting!

@palle-k the new plans are out. Check out "Tensor<Float>" and "Lambda functions" under the redesigned DL4S Evolution home page.

I'm encountering problems with how to implement the new API that eliminates the Device generic parameter. The only way to do that is to make a breaking change to the public API. There are a few solutions:

  1. Include two Swift modules in the DL4S package: DL4S and DL4Sv2. They are mutually exclusive, in that importing both will fail to compile. DL4S keeps Tensor<Float, CPU>, while DL4Sv2 offers Tensor<Float>. This provides the least friction and is the approach GPUImage used.
  2. Like option 1, but the new API is "DL4S" and the old is "DL4S_Old". This will make a breaking change to the public API, and break open-source Swift packages that rely on DL4S.
  3. Transfer ownership of DL4S to the DL4S organization or me, which means old Swift package references to it just point to the old API.

In all 3 above options, we would create a namespace in the newer package (e.g. Old) for accessing the device-specific operations.
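To sketch what option 1 would look like from a user's perspective (module names as proposed above; the initializer spelling is only illustrative, and none of this exists yet):

```swift
// Sketch of option 1 from a user's point of view (proposed module names; nothing exists yet).
//
// Existing users keep the device-generic API:
//
//     import DL4S
//     let a = Tensor<Float, CPU>([1, 2, 3], requiresGradient: true)
//
// New users opt into the device-erased API:
//
//     import DL4Sv2
//     let b = Tensor<Float>([1, 2, 3], requiresGradient: true)
//
// Importing both modules in the same target would be a compile-time error.
```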

There are ways to get option 2 to work. You could go through all of your sample code and DL4S-Tensorboard and import "DL4S_Old". Deprecation/unavailability attributes will not help here, so there would have to be a noticeable pointer in the README.

In addition, we would have to search for any open-source packages that rely on DL4S's master branch and contact them about this change. I think the magic of Package.resolved should prevent the packages from completely breaking unexpectedly, but we definitely need to warn their owners. Given how small DL4S's user base currently is and that they are likely to go to our README when it fails to compile, option 2 is very achievable.

The third option may work because I tried swapping repository owners on a test project before, and the reference to the old repo URL just refused to update with newer commits. The same behavior should happen with DL4S, and all public links to DL4S would redirect to my page.


I know this sounds like a lot, and you may be ambivalent about the change. However, given that it makes DL4S closely mirror S4TF, it is vital that we try as hard as we can to implement this change. I am leaning toward option 2.

Since DL4S 1.0 is months away, we could alert any users now and make a warning on the README about the future change within a week. We could introduce a minor update to DL4S that exposes the old Swift module as something identical to the new one, in order to prepare active users for the coming change.

Even better option: rename DL4S's default branch from "master" to "main" at the same time we roll out DL4S 1.0. That solves all of our problems with legacy Swift package dependencies. We might need to put up some notices on the README right now, but there would never be any unexpected changes in anybody's code.

This plan has the same benefits as option 3 without the drawbacks.

I'm seeing option 1 as more favorable now. It would cause the least friction and headache. I think you'll approve of my API changes, since they're in a separate "DL4Sv2" module, and the Tensor<Float, GPU> style still works under the regular "DL4S" module. We might still be able to brand this as DL4S 1.0 (and not 2.0), since different major versions of Python were supported simultaneously for a long time (among other precedents).

Ultimately, I'm thinking of creating a "TensorFlow" Swift package that re-exports DL4Sv2 as a resurrection of S4TF. It'll be hosted by my "S4TF" organization. The reconstruction of the DL4S engine from the ground up will be entirely my own work, but the CPU fallback engine currently present is yours. I'd also like to export DL4S-Tensorboard as "Tensorboard" under the same organization.
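Mechanically, the re-export could be a one-file wrapper module; a minimal sketch (using the underscored but commonly used @_exported attribute; the DL4Sv2 module doesn't exist yet):

```swift
// Hypothetical single source file of the proposed "TensorFlow" wrapper package.
// @_exported re-publishes DL4Sv2's symbols, so clients can write
// `import TensorFlow` and see the DL4Sv2 API directly.
@_exported import DL4Sv2
```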

Therefore, I would need your consent before doing something like resurrecting Swift for TensorFlow. I would give you credit and say that both of us resurrected it (you can have owner access to S4TF). This has been a big dream of mine ever since I saw Google kill S4TF. Please tell me how you feel about this. I'll add GPU acceleration either way.

@RahulBhalley since you're a big fan of S4TF, I'll give you an honorable mention as well.

commented

My favored approach is to transfer DL4S to the DL4S team and leave behind an archived version on my account, linking to its new home. That way, everyone currently using DL4S will not see their project break due to API changes and so on.

In the new repo, we can mention that APIs are subject to change, so that users are prepared for any breaking changes; if they want, they can pin their DL4S dependency to a specific commit/release to prevent their code from breaking. If the CPU engine remains in DL4S in a refactored form to support other platforms, I don't think there's much need to maintain 2 versions.
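For reference, pinning looks roughly like this in a downstream Package.swift (the version and commit below are placeholders):

```swift
// swift-tools-version:5.3
// Minimal downstream Package.swift; the pinned version and commit are placeholders.
import PackageDescription

let package = Package(
    name: "MyModel",
    dependencies: [
        // Pin to an exact release so a future DL4S 1.0 cannot break the build:
        .package(url: "https://github.com/palle-k/DL4S.git", .exact("0.4.0")),
        // Or pin to a specific commit instead:
        // .package(url: "https://github.com/palle-k/DL4S.git", .revision("<commit-hash>")),
    ],
    targets: [
        .target(name: "MyModel", dependencies: ["DL4S"]),
    ]
)
```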

If you want to use DL4S for your own TensorFlow adventures, feel free to do so. I set the license of DL4S to MIT deliberately so that anyone can do pretty much anything they want with it.

Lastly, note that TensorFlow (and any related names) are trademarks of Google.

Thanks. That's a really good way to go. It also gives me more freedom to commit to DL4S. I also changed the organization's name from "dl4s-contributors" to "dl4s-team".