Skip to content
This repository has been archived by the owner on Jul 1, 2023. It is now read-only.

Patching to enable building swift-apis on current nightly Swift #1184

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

ProfFan
Copy link

@ProfFan ProfFan commented Oct 23, 2021

Currently this builds in Debug on the latest nightly toolchain on Linux. macOS is untested.

All unit tests pass except one, which is not crucial (only a floating point difference of 1e-6).

Building on Release is blocked by swiftlang/swift#38745 (comment)

@google-cla
Copy link

google-cla bot commented Oct 23, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@@ -118,7 +118,7 @@ if(ENABLE_PYTHON_SUPPORT)
GIT_REPOSITORY
git://github.com/pvieito/PythonKit
GIT_TAG
master
6a05a15
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6a05a15 is the last commit before CMake is removed.

@@ -189,7 +189,7 @@ if(NOT X10_FOUND AND NOT USE_BUNDLED_X10)
COMMAND
rm -rf <SOURCE_DIR>/bazel-bin # ${CMAKE_COMMAND} -E rm -Rrf <SOURCE_DIR>/bazel-bin
COMMAND
bazel build ${VISIBILITY_FLAGS} -c opt --define framework_shared_object=false //tensorflow/compiler/tf2xla/xla_tensor:x10 --nocheck_visibility
bazel build ${VISIBILITY_FLAGS} -c opt --define framework_shared_object=false //tensorflow:tensorflow //tensorflow/compiler/tf2xla/xla_tensor:x10 --nocheck_visibility
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This enables one to extract a standalone X10 after a successful CMake build, so subsequent builds can be made with the Swift Package Manager.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem right. You should be able to do that irrespective. This is just building an additional label, which means that there is potentially a dependency that is missing?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this I think libtensorflow.so.xxx is not generated and I cannot copy them to a standalone /Library/usr/lib for a SPM build, following the current docs/Development.md. Interestingly this is not needed for the CMake build, which I don't know the reason.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, the tensorflow core itself is not needed for the build, it is only needed at runtime.

// for (i, x) in zip(lhs.indices, rhs) {
// lhs[i] .*= x
// }
// }
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't get this to work, need investigation

@inlinable
@inline(__always)
@_semantics("autodiff.nonvarying")
public func withoutDerivative<T, R>(at x: T, in body: (T) -> R) -> R {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

withoutDerivative is removed in language, so copy-pasted to library

@@ -159,8 +159,7 @@ struct Statistics {
}
}

@differentiable
public func _defaultLossFunction(_ ŷ: Tensor<Float>, _ y: Tensor<Int32>) -> Tensor<Float> {
@differentiable(reverse)public func _defaultLossFunction(_ ŷ: Tensor<Float>, _ y: Tensor<Int32>) -> Tensor<Float> {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIXME

public func forward(_ input: Tensor<Scalar>) -> Tensor<Scalar> {
let positiveAxis = (input.rank + axis) % input.rank
@differentiable(reverse)
public func callAsFunction(_ input: Tensor<Scalar>) -> Tensor<Scalar> {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This place triggers a compiler crash in release due to debug info being generated twice. Core reason should be that the autodiff is not doing proper variable cloning.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you isolate this into a standalone test case like Dan and I have done in the past (see some of the previously mentioned SRs).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me try that after I got more time (have a CVPR deadline)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the same duplicate debug info validation failure I've been seeing on internal code (and it seems like it, given that it's triggered in release builds with debug info enabled), we started seeing this with Swift commit 9a8f2ed64221635646541757b7e38b3b4e02e4b3. Patching out the two lines in InstOptUtils.cpp that call swift::salvageDebugInfo(inst) prevents this validation failure, but that's obviously addressing a symptom not the cause. Dan and I had seen some duplicate SIL instructions being generated in the presence of inouts and control flow, but I've had a heck of a time getting a reproducer created. This might provide an excellent clue, so thanks for finding it. I'll take a look at this myself to see if I can minimize something out of it.


@inlinable
public mutating func replaceSubrange<C>(_ subrange: Range<Self.Index>, with newElements: C) where C : Collection, Self.Element == C.Element {
fatalError("withUnsafeBufferPointer unimplemented because TensorBuffer is abstract")
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this shameless copy-pasta. I just can't get this new protocol requirement to work...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the additions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MutableCollection had a recently fixed loophole where you could inherit default conformances that triggered an infinite recursion. You have to provide these yourself now in recent toolchains, which I believe is what is happening here.

@ProfFan
Copy link
Author

ProfFan commented Oct 23, 2021

@googlebot I signed it!

Copy link
Contributor

@compnerd compnerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still a large number of places where you have unnecessary whitespace changes. Could you please fix them all and update the diff?

@@ -189,7 +189,7 @@ if(NOT X10_FOUND AND NOT USE_BUNDLED_X10)
COMMAND
rm -rf <SOURCE_DIR>/bazel-bin # ${CMAKE_COMMAND} -E rm -Rrf <SOURCE_DIR>/bazel-bin
COMMAND
bazel build ${VISIBILITY_FLAGS} -c opt --define framework_shared_object=false //tensorflow/compiler/tf2xla/xla_tensor:x10 --nocheck_visibility
bazel build ${VISIBILITY_FLAGS} -c opt --define framework_shared_object=false //tensorflow:tensorflow //tensorflow/compiler/tf2xla/xla_tensor:x10 --nocheck_visibility
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem right. You should be able to do that irrespective. This is just building an additional label, which means that there is potentially a dependency that is missing?

README.md Outdated Show resolved Hide resolved
Documentation/X10/SUMMARY.md Outdated Show resolved Hide resolved
/// An internal workaround for SR-13263: debug info generation crash.
internal var _sr13263Workaround: SR13263Workaround?
// /// An internal workaround for SR-13263: debug info generation crash.
// internal var _sr13263Workaround: SR13263Workaround?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why comment this out? Is it no longer needed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving this as-is triggers a compiler crash, so I removed this workaround.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a workaround for an actual issue. I think that it would be better to verify if the underlying issue has been resolved or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The underlying issue has migrated in that you can now trigger the same crash by inserting this placeholder, rather than avoiding the crash by having it. I've seen this with other types. Unclear when the behavior inverted.

Copy link

@philipturner philipturner Jul 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing this happen with or without SR13263Workaround commented out, on current Swift toolchains.

Nevermind. I reproduced the crash on recent nightly toolchains, using the reproducer from swiftlang/swift#55703.

Copy link

@philipturner philipturner Jul 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing the crash affect S4TF, with or without SR13623Workaround commented out. I'm running swift build and swift test, and it's not activating the crash on recent toolchains. Am I doing this right?

export TENSORFLOW_USE_RELEASE_TOOLCHAIN=1
cd s4tf
swift build -Xswiftc -DTENSORFLOW_USE_STANDARD_TOOLCHAIN \
  -c release -Xswiftc -g -Xcc -I${DESTDIR}/usr/include -Xlinker -L${DESTDIR}/usr/lib

rm -f .build/release/libx10.dylib
cp ${DESTDIR}/usr/lib/libx10.dylib .build/release/libx10.dylib

swift test -Xswiftc -DTENSORFLOW_USE_STANDARD_TOOLCHAIN \
    -c release -Xswiftc -g -Xcc -I${DESTDIR}/usr/include -Xlinker -L${DESTDIR}/usr/lib

Sources/TensorFlow/Layer.swift Outdated Show resolved Hide resolved
Sources/TensorFlow/Layers/Pooling.swift Outdated Show resolved Hide resolved
Sources/TensorFlow/Layers/Pooling.swift Outdated Show resolved Hide resolved
Sources/TensorFlow/Layers/Pooling.swift Outdated Show resolved Hide resolved
Sources/TensorFlow/Layers/Pooling.swift Outdated Show resolved Hide resolved
Sources/TensorFlow/Layers/Pooling.swift Outdated Show resolved Hide resolved
@ProfFan
Copy link
Author

ProfFan commented Oct 25, 2021

@compnerd Sorry for all the whitespace diffs, made a mistake in regex replacement.

@@ -1297,7 +1297,7 @@ extension Tensor {
}

@inlinable
@differentiable(wrt: self where Scalar: TensorFlowFloatingPoint)
// @differentiable(reverse, wrt: self where Scalar: TensorFlowFloatingPoint)
internal subscript(_ indexPath: IndexPath) -> Tensor {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were there problems with differentiating through subscripts that led to the removal of the attribute and VJP?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it seems that in nightly the where clause is somewhat broken, and the VJP seems to apply to the non-TensorFlowFloatingPoint case as well, resulting in a protocol non-compliance. I haven't got the time to investigate deeper.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the where clauses broken in nightlies before or after this PR: swiftlang/swift#39826 ? I'm wondering if that fixed or caused the issue here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, let me investigate. Thanks Brad :)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a manifestation of swiftlang/swift#59135, a bug that appeared in September 2021. All you had to do was comment out the @differentiable(...) attribute on subscript(_: IndexPath). The other declarations - subscript(_: TensorRangeExpression...) and _vjpSubscript - are fine.

@ProfFan
Copy link
Author

ProfFan commented Oct 28, 2021

@rxwei @BradLarson @dan-zheng I tried really hard to isolate the failure case, but failed. I believe this is because this bug only happens where there is some kind of inlining optimization, hence that Release only failure mode.

@ProfFan ProfFan requested a review from compnerd November 1, 2021 03:13
Copy link
Contributor

@compnerd compnerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this would be easier to review if you split it out into logical commits which add the reverse modifier to the @differentiable, renames forward to callAsFunction, removes the TangentVector, etc. Theoretically those would have happened individually when the change was made upstream.

@@ -189,7 +189,7 @@ if(NOT X10_FOUND AND NOT USE_BUNDLED_X10)
COMMAND
rm -rf <SOURCE_DIR>/bazel-bin # ${CMAKE_COMMAND} -E rm -Rrf <SOURCE_DIR>/bazel-bin
COMMAND
bazel build ${VISIBILITY_FLAGS} -c opt --define framework_shared_object=false //tensorflow/compiler/tf2xla/xla_tensor:x10 --nocheck_visibility
bazel build ${VISIBILITY_FLAGS} -c opt --define framework_shared_object=false //tensorflow:tensorflow //tensorflow/compiler/tf2xla/xla_tensor:x10 --nocheck_visibility
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, the tensorflow core itself is not needed for the build, it is only needed at runtime.

/// An internal workaround for SR-13263: debug info generation crash.
internal var _sr13263Workaround: SR13263Workaround?
// /// An internal workaround for SR-13263: debug info generation crash.
// internal var _sr13263Workaround: SR13263Workaround?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a workaround for an actual issue. I think that it would be better to verify if the underlying issue has been resolved or not.


@inlinable
public mutating func replaceSubrange<C>(_ subrange: Range<Self.Index>, with newElements: C) where C : Collection, Self.Element == C.Element {
fatalError("withUnsafeBufferPointer unimplemented because TensorBuffer is abstract")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the additions?

@compnerd
Copy link
Contributor

compnerd commented Nov 28, 2021

Seeing as this PR is stale, I'm going to comment about something the people involved with this PR ought to know about. I know this might not be the most appropriate way to reach out, but please bear with me.

I think that this is rather an inappropriate forum for this. I don't particularly consider this PR as stale btw, there are still items to be resolved and investigated before it can be merged.

I'm working on turning DL4S into a successor to Swift for TensorFlow. I'm rebuilding it from the ground up for GPU acceleration through Metal. I will also heavily modify its current API, adding fragments of S4TF (using @differentiable for custom layers, Tensor<Float> for tensors, including the Sequential function builder API) and new activation functions. @ProfFan would you be willing to redirect your interest in resurrecting S4TF onto our effort at DL4S? I could invite you to our organization.

I just realized that I have an opportunity to bring DL4S to Windows with ArrayFire, which would make it closer to replacing S4TF. If it succeeds, it could be a major milestone in support for Swift on Windows. @compnerd I thought you might want to know about this. Please tell me what you think or if you could suggest anybody who might want to help us.

The project doesn't seem particularly enticing to me - the "rebuilding from the ground up for GPU acceleration through Metal" is a big component of that. If you were designing it for optimization for DCOM+ (DirectX is effective accessed through COM), that might be different. GPU environments are quite fragmented and designing around a particular one is a choice that I do not particularly agree with.

This also seems very much as other platforms are an afterthought. The overall structure of S4TF is far better from a design standpoint - it generically solves the problem in software and has abstractions over a compilation pipeline to separate the concern of optimization and device/runtime independence.

Additionally, I don't think that you realize the complexity of using ArrayFire. That codebase has a core in C++ - which means that it is a non-starter. Until the underlying ABI issues are resolved, C++ interop cannot be reliably used.

@philipturner
Copy link

I don't particularly consider this PR as stale btw, there are still items to be resolved and investigated before it can be merged.

The S4TF team archived this project in February and said they were no longer accepting pull requests to it. Merging this PR would seem to be in violation of Google’s policy. Furthermore, this PR is based on a branch called “resurrection", implying that it is an effort to un-archive S4TF.

When I made that comment, I had only skimmed through the edits and got the wrong impression that @ProfFan was trying to bring S4TF out of archive mode. I thought that because of his intention, you were never going to accept the PR. I apologize for this misunderstanding.

@philipturner
Copy link

If Swift for TensorFlow is open for changes again, I will gladly help you out with this pull request. I have plenty of time, which is something all of you seem to greatly lack. I assume that’s why this PR has been inactive for 3 weeks. Could you outline what must be done before this pull request is approved?

@rxwei @BradLarson @compnerd I have more to say on this topic and refactored it into an issue to prevent cluttering this PR. Could you please take a look at that?

@philipturner
Copy link

philipturner commented Jan 1, 2022

What bugs specifically are blocking this swift-apis PR? I’m trying to build this repo on Linux by redoing the build system, and I need to know exactly where autodiff will cause trouble.

What workarounds can I do for each of the bugs? I have to use an official Swift release toolchain, and I’m dealing with the bugs present in Swift 5.5.2.

Also, any explanation for this?
https://bugs.swift.org/browse/SR-15675

@philipturner
Copy link

philipturner commented Jan 2, 2022

@ProfFan @rxwei @dan-zheng @BradLarson someone please help me. Please bear with my rushed tone and run-on sentences; I have no way to continue working on S4TF because of this problem. I have tried building S4TF, but all of the solutions are failing. Bazel fails because it tries to dowload an arm64 version of Bazel 3.7.2 (only x86_64 exists). I have Bazelisk installed on my system and can't find out how to get it off. I suspect that if I could uninstall Bazelisk, I could force Bazel to download as something that works.

Then, I tried just using CMake instead of SwiftPM. According to a comment in @ProfFan's PR, I could then use SwiftPM to re-build the non-X10 stuff afterward. I'm trying to rip out X10 and build a new backend in its place (possibly named PlaceholderDevice), so I won't modify X10 after building. The Swift development snapshot seemed to have something wrong. I checked and made the terminal use the development snapshot (swift --version showed 5.6-dev) by putting export TOOLCHAINS=... in the .zshrc file. When it built in the terminal, it emitted errors that there was "no such module _Differentiation" and the froze indefinitely without an explanation. I checked the activity monitor, and no code compilation process was taking up anywhere close to 100% of my CPU. It seemed to stop at completely random points in the build process. Also, the Swift REPL crashed because libArgumentParser.dylib wasn't found somewhere.


At this point, I'm thinking of modifying S4TF to alternatively build on a release toolchain, using my Differentiation package. I can't do that yet because I have no way to reproduce the original build process. I also can't download the pre-built X10 binary and run it from Rosetta because I'm building from a personal fork with this PR's commits merged into it. @ProfFan modified something in the X10 directory(ies), and the existing binaries don't have these changes. If you have an Apple silicon Mac, could you send me the arm64 binaries you built in a package structured similarly to the pre-built x64 macOS X10 binaries?

@BradLarson
Copy link
Contributor

@philipturner - Regarding blocking compiler crashers, one was recently fixed, but two still remain (the 12/23/2021 nightly shapshot is the first toolchain to include the linked fix). The two remaining crashers are both due to failing assertions around debug symbols, one that triggers around mutating functions and control flow, and the other due to an optimization pass behaving oddly during release builds. The former can be worked around by moving control flow out of any mutating function that triggers the assertion failure (BatchNorm was one place we hit that here), and the latter can be avoided by disabling the generation of debug symbols when building for release. Both can be prevented if you build a non-assertion-enabled toolchain.

Fixing these has been slow going, because I've had a hard time creating a concrete reproducer for the first (I do have SIL dumps, which points to a possible cause, but I need a failing test to verify a fix), and the latter only occurs under very specific optimization conditions. I do know the exact commit and lines that introduced the latter failure, but not the root cause.

By "release toolchain", do you mean the ones integrated with Xcode, or the swift.org release toolchains? In either case, the 5.5 branch does not have some necessary autodiff fixes in it that have been put in place over the last year (the branching for that was a while ago), and I know that there was another compiler crasher that directly blocked tensorflow/swift-apis from building which has been fixed in main and is present in the nightly toolchain builds. I don't think switching to a release toolchain (Xcode or swift.org) will help you much in trying to build the project.

If you're getting errors of "no such module _Differentiation", that sounds like the build process is still trying to use the default Swift Xcode toolchain and not one from Swift.org, so you could try to manually place the path to the correct toolchain in your PATH ahead of the Xcode toolchain in order to force the use of the right toolchain.

Regarding building TensorFlow, getting Bazel set up can be a bit of a pain, and building TensorFlow from source can take as long as building the entire Swift toolchain. If you're trying to build this on an M1, I'm not sure that all even works with the version of TensorFlow you'd need to build to connect to the old Swift for TensorFlow interfaces (2.4.0 if I'm reading things right). I think you'll have to build that on X86_64. You may also need to match up the exact version of Bazel that had been used to build TensorFlow (somewhere between 3.7.2 and 3.99.0, but it sounds like you've got the correct 3.7.2).

@philipturner
Copy link

@BradLarson I'm continuing this conversation in the "requesting assistance" issue to prevent cluttering this PR.

@brettkoonce
Copy link
Contributor

start with x86 rather than jump to m1!

@philipturner
Copy link

philipturner commented Jan 4, 2022 via email

@brettkoonce
Copy link
Contributor

@philipturner wrt to #1185 (comment) + going off of what brad was saying, you should make sure you have the same cmake version as well!

@philipturner
Copy link

philipturner commented Jan 6, 2022

@ProfFan (or @compnerd or @BradLarson if you reproduced his steps while reviewing in October) could you let me know which Swift nightly snapshot was used while building this fix? I'm having trouble reproducing your results on Linux. I attached the Colab notebook where I'm facing issues right now. It's untidy, but hopefully you can use it to reproduce my conditions (please run on Google Colab and read the comments).

CompilingSwift2.ipynb.zip

If you'd rather not run it, I downloaded CMake 3.22.1, Bazel 3.1.0, and took the CMake route to compiling with a pre-compiled Ubuntu x64 binary. I pulled source code from ProfFan's fan/resurrection fork. I got CMake to start compiling the x10 library at runtime, but it was so slow that I expected it to take hours (a bad idea, especially in Colab).

I'm facing problems with it freezing after logging the command below. I faced a similar silent freezing error when attempting to build S4TF on my Mac mini (with a different cause), and I later found that the program had actually stopped and would log the error if run slightly differently. I can't figure out how to reveal that error in the Colab environment though.

[18/30] Linking Swift static library lib/libTensorTests.a

On the Swift releases website, it doesn't show snapshots earlier than November. Yet, these patches were made in October. I just need the date of the snapshot used, and the snapshot from then might still be downloadable. This is all to narrow down the cause of the error and reproduce the exact same environment that @ProfFan somehow successfully built S4TF in.

@philipturner
Copy link

I got S4TF to build successfully once in a Colab environment, although I used a pre-compiled x10.

@philipturner
Copy link

philipturner commented Jan 7, 2022

@ProfFan there was a build failure on the 2021-12-23 snapshot that didn't occur on the 2021-11-12 snapshot. This should be worth investigating to add commits to your PR. The crash log is in one of the later cells of this notebook:
CompilingSwift2 (3).ipynb.zip

The crash log was related to Differentiable, Requirement machine, and RecurrentLayerCell. Ignore the first two cells that instruct you to install CMake and Bazel, as those aren't necessary. @BradLarson is this crash something new?

I also tested it on 2022-01-06, with the same result.

@BradLarson
Copy link
Contributor

If I'm reading these notebook results correctly, it looks like the 12/23/2021 nightly snapshot toolchain got a lot farther in building than the 11/12/2021 one. I vaguely remember being able to use the 2021-07-07 nightly (downloadable from here for X86 Ubuntu 18.04) to build this branch before, but I could be wrong about that. That was a good, stable snapshot for autodiff for a while.

Looking at the failure on the 12/23/2021 toolchain, it appears that the RequirementMachine is unable to solve out the TangentVector for RecurrentLayer (type defined at swift-apis/Sources/TensorFlow/Layers/Recurrent.swift, lines 370 - 469). If that could be minimized, it'd be great to report upstream for Richard and / or Slava to look at. If this persists with the just-released 1/6/2022 toolchain, let me know and I can minimize and report that.

As a workaround, you might be able to define a custom TangentVector for RecurrentLayer that explicitly states what the compiler is currently generating for that TangentVector and that might help the solver here.

@philipturner
Copy link

philipturner commented Jan 7, 2022

2021-11-12 actually completed the build (in fact, it was the only toolchain that succeeded so far). I had some problems with it previously, but they were my fault and I resolved them. I didn’t verify whether it compiles on 2021-07-07.

The problem persists in the 2022-01-06 toolchain. Below is another Colab notebook that ran on 2021-01-06. Is this sufficient content for you to generate your bug report?
CompilingSwift2 (4).ipynb.zip

@philipturner
Copy link

philipturner commented Jan 8, 2022

As a workaround, you might be able to define a custom TangentVector for RecurrentLayer that explicitly states what the compiler is currently generating for that TangentVector and that might help the solver here.

@BradLarson could you give super precise details on how to implement the workaround? I anticipate that my lack of experience with differentiation will unnecessarily waste time if I tackle the problem alone. Plus, it takes 2 minutes to reach the build failure, unlike Xcode where you can cache build products (I might change that by moving to my Mac mini if this gets serious enough).

@BradLarson
Copy link
Contributor

@BradLarson could you give super precise details on how to implement the workaround?

I haven't verified that this will compile, but you'll want to add something like the following before line 376 in Recurrent.swift:

public struct TangentVector: Differentiable, AdditiveArithmetic {
  public typealias TangentVector = RecurrentLayer.TangentVector
  public var cell: Cell.TangentVector

  public init(cell: Cell.TangentVector) {
    self.cell = cell
  }
}

mutating func move(by offset: TangentVector) {
  self.cell.move(by: offset.cell)
}

That should provide what you need to satisfy a manual TangentVector for the RecurrentLayer type. Again, not entirely sure that'll compile, but something close to that is what you'd need.

@philipturner
Copy link

philipturner commented Jan 8, 2022 via email

@philipturner
Copy link

philipturner commented Jan 8, 2022

@BradLarson the crash still occurs (2022-01-06), even with your workaround. It says this in the crash log:

5.	While generating protocol witness thunk SIL function "@AD__$s10TensorFlow14RecurrentLayerVyxGAA0D0A2aEP14callAsFunctiony6OutputQz5InputQzFTW_jvp_SS".
 for 'callAsFunction(_:)' (at /content/swift-apis/Sources/TensorFlow/Layers/Recurrent.swift:464:10)

I noticed that it says jvp at the end of the mangled name, which shouldn't exist anymore. @rxwei made requests on JIRA for someone to remove JVP protocol witnesses:

It happened at this place (around line 440), which doesn't have a manually defined derivative. Will manually defining a derivative fix it, and what code should I put as the derivative?

@differentiable(reverse)
public func callAsFunction(_ inputs: [Cell.TimeStepInput]) -> [Cell.TimeStepOutput] {
  let initialState = withoutDerivative(at: cell.zeroState(for: inputs[0]))
  return self(inputs, initialState: initialState)
}

Colab notebook:
CompilingSwift2 (5).ipynb.zip

@philipturner

This comment has been minimized.

@ProfFan
Copy link
Author

ProfFan commented Jan 13, 2022

You have to click "Review" after you make review comments.

@philipturner

This comment has been minimized.

* Update Normalization.swift
@philipturner
Copy link

@ProfFan do any of your current projects depend on S4TF? If so, wouldn't your work be impacted by the new crash that appeared in 2021-12-23?

@ProfFan
Copy link
Author

ProfFan commented Jan 13, 2022

No I am not doing any S4TF in my current work

@philipturner

This comment has been minimized.

@ProfFan
Copy link
Author

ProfFan commented Jan 13, 2022

I have no bandwidth for this now. Maybe I will have time after 2 months.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants