Issues faced while getting mongodb test suite running locally #17

Open
rhishikeshj opened this issue Dec 3, 2020 · 23 comments

@rhishikeshj

Here are some issues I faced while getting this MongoDB Jepsen suite running locally with Docker. The code I am using:

Jepsen : commit a2bcad59f0df5bd39cea1e61d9b64376c479df9c (HEAD -> main)
MongoDB : commit 83548bb8e054170ecc4b8fda70390e40fcca5e30 (origin/master, origin/HEAD)

Initially I did not have enough nodes (by default Jepsen starts 5 nodes in Docker), as is evident from the function jepsen.mongodb.db/shard-node-plan. I fixed that by adding 2 more nodes.
Then I hit another roadblock: while installing MongoDB on each node, it errored out saying that a required dependency, libcurl3, couldn't be found. Apparently libcurl4 and libcurl3 don't work well together, and in spite of my efforts I wasn't able to get libcurl3 and Mongo running. So I changed the way Jepsen installs MongoDB and followed the official documentation, which installs Mongo 4.2. That worked.
But I am still unable to run the tests: every time, there seems to be some SSH-related exception saying the control node can't reach the DB nodes.

I changed the installation instructions for MongoDB since the default instructions in setup! were erroring out due to a libcurl3 dependency. These are the instructions I have coded into setup! instead:

(defn install!
  "Installs MongoDB on the current node."
  [test]
  (c/su
   (c/exec :mkdir :-p "/tmp/jepsen")
   (let [version   (:version test)
         ;; Repository series (major.minor), derived from a hardcoded "4.2.10"
         m-version (str/join "." (butlast (str/split "4.2.10" #"\.")))
         ;; Builds an apt package spec such as :mongodb-org=<version>
         versioner #(keyword (str "mongodb-" %1 "=" version))]
     ;; Repair any half-configured packages left on the node
     (c/exec :dpkg :--configure :-a)
     (c/exec :apt :-y :--fix-broken :install)
     ;; Add MongoDB's signing key and apt repository
     (c/exec :apt-get :install :gnupg)
     (c/exec :wget :-qO :-
             (str "https://www.mongodb.org/static/pgp/server-" m-version ".asc")
             :| :apt-key :add :-)
     (c/exec :echo (str "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/" m-version " multiverse") :| :tee (str "/etc/apt/sources.list.d/mongodb-org-" m-version ".list"))
     (c/exec :apt-get :update)
     ;; Install the pinned MongoDB packages
     (c/exec :apt-get :install :-y
             (versioner "org")
             (versioner "org-server")
             (versioner "org-shell")
             (versioner "org-mongos"))
     (c/exec :systemctl :daemon-reload))))
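
One possible tidy-up of the version handling above, offered as a sketch rather than the suite's actual code: the repository series (4.2) could be derived from the requested version instead of a hardcoded literal, and the package specs built as plain strings. The helper names release-series and package-specs are hypothetical and do not exist in Jepsen or in this repository.

```clojure
(require '[clojure.string :as str])

(defn release-series
  "Returns the major.minor series of a dotted version string,
  e.g. \"4.2.10\" -> \"4.2\"."
  [version]
  (->> (str/split version #"\.")
       (take 2)
       (str/join ".")))

(defn package-specs
  "Builds apt package specs such as \"mongodb-org-server=4.2.10\"
  for the given package suffixes."
  [version suffixes]
  (mapv #(str "mongodb-" % "=" version) suffixes))

;; Example calls, using the version and package names from the post above
(release-series "4.2.10")
(package-specs "4.2.10" ["org" "org-server" "org-shell" "org-mongos"])
```

Unlike butlast, taking the first two components also leaves a two-part version such as "4.2" unchanged.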
@rhishikeshj
Author

@aphyr And what do you know, just as I went to run the tests again hoping to send you a stack trace, they worked! 🍻 🙂
I will try running them again to see if there is some instability. Other than that, if you see any obvious steps that I have missed, do let me know.
I'll paste the SSH-related exceptions here as soon as I encounter them :)

Some of the code here, for example the --fix-broken stuff, is for fixing some weird state that my Debian nodes were getting into. Please ignore it.

@aphyr
Contributor

aphyr commented Dec 3, 2020

Huh, okay... I can say that the test is designed for a specific version of debian--it's been a while since I poked my head into the docker and mongo tests, but this miiiight be due to a mismatch between those versions? The libcurl transition has been a real bear: some systems need 3, some 4, etc. etc.

@rhishikeshj
Author

If this change (the change in setup! to install MongoDB) works well, can I open a PR to submit it? What other kinds of tests do you require before taking contributions? Are there any guides or other instructions for code contributions?

@aphyr
Contributor

aphyr commented Dec 3, 2020

I think it'd be good to figure out what version of Debian worked before, and what version it works with now, and to document that in the README, for starters! I do apologize, this was a rush job in my free time, and I wasn't as diligent about future-proofing things as I should have been!

@rhishikeshj
Author

Running it 5 times resulted in 1 crash of the test suite:

com.mongodb.MongoSocketOpenException: Exception opening socket
        at com.mongodb.internal.connection.SocketStream.open(SocketStream.java:70) ~[mongodb-driver-core-4.0.2.jar:na]
        at com.mongodb.internal.connection.InternalStreamConnection.open(InternalStreamConnection.java:127) ~[mongodb-driver-core-4.0.2.jar:na]
        at com.mongodb.internal.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:131) ~[mongodb-driver-core-4.0.2.jar:na]
        at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]
Caused by: java.net.ConnectException: Connection refused (Connection refused)
        at java.base/java.net.PlainSocketImpl.socketConnect(Native Method) ~[na:na]
        at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399) ~[na:na]
        at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242) ~[na:na]
        at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224) ~[na:na]
        at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403) ~[na:na]
        at java.base/java.net.Socket.connect(Socket.java:609) ~[na:na]
        at com.mongodb.internal.connection.SocketStreamHelper.initialize(SocketStreamHelper.java:63) ~[mongodb-driver-core-4.0.2.jar:na]
        at com.mongodb.internal.connection.SocketStream.initializeSocket(SocketStream.java:79) ~[mongodb-driver-core-4.0.2.jar:na]
        at com.mongodb.internal.connection.SocketStream.open(SocketStream.java:65) ~[mongodb-driver-core-4.0.2.jar:na]
        ... 3 common frames omitted
WARN [2020-12-03 16:40:00,246] main - jepsen.core Test crashed!
java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: Cannot write jepsen.control$session$fn__3025@54bb1068 as tag null
        at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[na:na]
        at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[na:na]
        at clojure.core$deref_future.invokeStatic(core.clj:2300) ~[clojure-1.10.0.jar:na]
        at clojure.core$future_call$reify__8439.deref(core.clj:6974) ~[clojure-1.10.0.jar:na]
        at clojure.core$deref.invokeStatic(core.clj:2320) ~[clojure-1.10.0.jar:na]
        at clojure.core$deref.invoke(core.clj:2306) ~[clojure-1.10.0.jar:na]
        at clojure.core$map$fn__5851.invoke(core.clj:2753) ~[clojure-1.10.0.jar:na]
        at clojure.lang.LazySeq.sval(LazySeq.java:42) ~[clojure-1.10.0.jar:na]
        at clojure.lang.LazySeq.seq(LazySeq.java:51) ~[clojure-1.10.0.jar:na]
        at clojure.lang.RT.seq(RT.java:531) ~[clojure-1.10.0.jar:na]
        at clojure.core$seq__5387.invokeStatic(core.clj:137) ~[clojure-1.10.0.jar:na]
        at clojure.core$dorun.invokeStatic(core.clj:3133) ~[clojure-1.10.0.jar:na]
        at clojure.core$dorun.invoke(core.clj:3133) ~[clojure-1.10.0.jar:na]
        at jepsen.store$save_1_BANG_.invokeStatic(store.clj:376) ~[jepsen-0.1.19.jar:na]
        at jepsen.store$save_1_BANG_.invoke(store.clj:372) ~[jepsen-0.1.19.jar:na]
        at jepsen.core$run_BANG_$fn__10005$fn__10012.invoke(core.clj:633) ~[jepsen-0.1.19.jar:na]
        at jepsen.core$run_BANG_$fn__10005.invoke(core.clj:619) ~[jepsen-0.1.19.jar:na]
        at jepsen.core$run_BANG_.invokeStatic(core.clj:605) ~[jepsen-0.1.19.jar:na]
        at jepsen.core$run_BANG_.invoke(core.clj:531) ~[jepsen-0.1.19.jar:na]
        at jepsen.cli$test_all_run_tests_BANG_$fn__10790.invoke(cli.clj:422) ~[jepsen-0.1.19.jar:na]
        at clojure.core$map_indexed$mapi__8533$fn__8534.invoke(core.clj:7308) ~[clojure-1.10.0.jar:na]
        at clojure.lang.LazySeq.sval(LazySeq.java:42) ~[clojure-1.10.0.jar:na]

I think the exceptions I was seeing earlier were of a similar nature.

@aphyr
Contributor

aphyr commented Dec 3, 2020

Ah, well that looks like there's a problem in the MongoDB setup process--it's not accepting connections. Likely a race condition between the code and MongoDB itself, if it's sporadic. Maybe there needs to be some additional health checks during db/setup!...

@rhishikeshj
Author

From the Dockerfile, I can see that the Docker image is based on this Debian Docker image: https://github.com/jgoerzen/docker-debian-base-standard

I am not sure I understand what you mean by "what version of Debian worked before, and what version it works with now". Do you mean which MongoDB Debian distribution the code is pulling from, i.e. https://repo.mongodb.org/apt/debian/dists/stretch/mongodb-org/4.2/main/binary-amd64/ ?

If I can help with updating the README, do let me know; I can do that.

@rhishikeshj
Author

> Ah, well that looks like there's a problem in the MongoDB setup process--it's not accepting connections. Likely a race condition between the code and MongoDB itself, if it's sporadic. Maybe there needs to be some additional health checks during db/setup!...

Do you mean something like

echo 'db.runCommand("ping").ok' | mongo localhost:27017/test --quiet

To check if the mongo service is up and running ?
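
A check along those lines would typically be polled until it succeeds or a deadline passes. The following is a minimal sketch, not the suite's code, that polls a TCP port from the JVM instead of shelling out to mongo. The names port-open? and await-port are hypothetical, and the default timings are arbitrary assumptions. Note that an open port only shows the process is listening, not that a replica set has finished electing a primary.

```clojure
(import '(java.net InetSocketAddress Socket))

(defn port-open?
  "True if a TCP connection to host:port succeeds within timeout-ms."
  [host port timeout-ms]
  (try
    (with-open [sock (Socket.)]
      (.connect sock
                (InetSocketAddress. ^String host (int port))
                (int timeout-ms))
      true)
    ;; Connection refused, timeouts, and unknown hosts are all IOExceptions
    (catch java.io.IOException _ false)))

(defn await-port
  "Polls host:port every interval-ms until it accepts connections or
  deadline-ms has elapsed. Returns true on success, false on timeout."
  [host port {:keys [deadline-ms interval-ms]
              :or   {deadline-ms 30000 interval-ms 500}}]
  (let [deadline (+ (System/currentTimeMillis) deadline-ms)]
    (loop []
      (cond
        (port-open? host port interval-ms)       true
        (>= (System/currentTimeMillis) deadline) false
        :else (do (Thread/sleep (long interval-ms))
                  (recur))))))
```

A call such as (await-port "localhost" 27017 {:deadline-ms 60000}) would then block for up to a minute, using the port from the command above.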

@aphyr
Contributor

aphyr commented Dec 3, 2020

More that I'm not sure whether this ever worked with the Docker setup, and if you're having problems, it might be because this version of Jepsen and the version of Mongo it installs were intended to run on, say, Jessie, when the Docker env is giving you, say, Bullseye. I honestly forget, so much has happened this year. I'd love to go dig into this for you but I am scrambling to keep up with waaaay too much client stuff right now!

@aphyr
Contributor

aphyr commented Dec 3, 2020

> echo 'db.runCommand("ping").ok' | mongo localhost:27017/test --quiet

Maybe. I think the current code probably does its own health checks already... lemme check. Ah, yes, here it is:

; Wait for all nodes to be reachable
(.close (client/await-open node port))
(jepsen/synchronize test)

We've got blocking on individual node startup, blocking on cluster join, blocking on elections, blocking on the cluster, blocking on the primary. That is, apparently, not enough blocking! This isn't just you: Mongo's... historically been difficult to set up reliably.

@rhishikeshj
Author

> More that I'm not sure whether this ever worked with the Docker setup, and if you're having problems, it might be because this version of Jepsen and the version of Mongo it installs were intended to run on, say, Jessie, when the Docker env is giving you, say, Bullseye. I honestly forget, so much has happened this year. I'd love to go dig into this for you but I am scrambling to keep up with waaaay too much client stuff right now!

Aah, that makes sense. FWIW, the Debian version that the current Jepsen main branch sets up is Buster.
If you do come across some pointers on what this was originally supposed to run on, let me know. I can look at the Jepsen code as of the time this MongoDB test suite was initially created. Maybe that can give some pointers on the Debian version?

@aphyr
Contributor

aphyr commented Dec 3, 2020

Ooof, yeah. Again, I'm sorry. This is a holdover from an older time in Jepsen when Debian versions lasted forever (compared to the lifetime of a test) and were often cross-compatible: we never really established a convention around OS versioning. Now that people are trying to dredge up tests written n years ago (or even 7 months ago!), those assumptions don't always hold.

This is a good reminder to me to write more of that documentation, and start splitting out future jepsen.os.debian/os objects into specific versions.

It looks like this test uses jepsen 0.1.19, which... I think should be using Jessie. Jepsen 0.2.1 transitioned to Buster.

@rhishikeshj
Author

From this commit, it seems the control node used Ubuntu and the DB nodes used Stretch around the time these Mongo tests were written.
Am I looking at this correctly?

@aphyr
Contributor

aphyr commented Dec 3, 2020

Oh, yeah, but that doesn't (and I am so sorry, I know this is confusing) mean this test was supposed to work with Docker. The docker directory was contributed by other people--I hadn't used it myself, and its maintainers drifted off to do other things, so it drifted behind. I test primarily using LXC and AWS, and was running Jessie at the time, I think. That's why this test was written for Jessie, and probably won't work with either the old or new docker setups, since they're for Stretch and Buster.

So, I think you've got two options here. One is if you get a Jessie environment going (are the mirrors still around?) you should be able to run the test as-is. The other is using Buster and figuring out how to port the test forward to Buster, which miiight be as simple as bumping the version of jepsen in project.clj to 0.2.1+.
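
For the second option, the version bump itself would be a one-line change in project.clj. The fragment below is only a sketch: the project name, project version, and the rest of the dependency list are placeholders rather than the repository's real values, and the old versions are the ones visible in the stack traces earlier in this thread.

```clojure
;; project.clj (fragment). Project name and version are placeholders,
;; not taken from the repository.
(defproject jepsen.mongodb "0.1.0-SNAPSHOT"
  :dependencies [;; Clojure 1.10.0 is what the stack traces above show
                 [org.clojure/clojure "1.10.0"]
                 ;; was [jepsen "0.1.19"] according to the same stack traces
                 [jepsen "0.2.1"]])
```

Bumping the dependency is only the starting point; any namespaces that changed between 0.1.19 and 0.2.1 would still need porting.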

@rhishikeshj
Author

Okay, I understand now. Thanks.

As regards the 2 options, I would say bringing the tests up to date is more fruitful in the long run. I can give that a crack to see what else needs changing. Right off the bat, I think some code changes might be needed.
Currently mongodb.clj seems to depend on [jepsen.generator.pure :as gen] which isn't there in jepsen/0.2.1

Strangely, in the source code, I see this namespace mentioned in the docs but only see it used in the dgraph code.
Where does this namespace come from in the latest jepsen code ?

@aphyr
Contributor

aphyr commented Dec 3, 2020

> Currently mongodb.clj seems to depend on [jepsen.generator.pure :as gen] which isn't there in jepsen/0.2.1

Ah, now THIS I actually have good docs for! https://github.com/jepsen-io/jepsen/releases/tag/0.2.0
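
As a rough illustration of the kind of change those release notes imply, and assuming the pure generators took over the jepsen.generator namespace in 0.2.0, the require in mongodb.clj might change as sketched below. The namespace name jepsen.mongodb is a guess from the file name, and this fragment needs Jepsen on the classpath to load.

```clojure
;; Before (Jepsen 0.1.19), as quoted above:
;;   (:require [jepsen.generator.pure :as gen])

;; After (Jepsen 0.2.x), ASSUMING the pure generators now live in
;; jepsen.generator; check the linked release notes before relying on this.
(ns jepsen.mongodb
  (:require [jepsen.generator :as gen]))
```

Keeping the gen alias means call sites would not need renaming, though individual generator functions may still have changed.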

@aphyr
Contributor

aphyr commented Dec 3, 2020

(also be advised there's a bug in 0.2.0 that might affect generators--best to jump straight to 0.2.1 I think)

@rhishikeshj
Author

Okay, so this morning I seem to be getting the original SSH-related exceptions rather frequently:

WARN [2020-12-04 03:17:54,150] jepsen node n4 - jepsen.control Encountered error with conn [:control "n4"]; reopening
java.lang.InterruptedException: sleep interrupted
        at java.base/java.lang.Thread.sleep(Native Method)
        at clj_ssh.ssh$ssh_exec.invokeStatic(ssh.clj:690)
        at clj_ssh.ssh$ssh_exec.invoke(ssh.clj:670)
        at clj_ssh.ssh$ssh.invokeStatic(ssh.clj:723)
        at clj_ssh.ssh$ssh.invoke(ssh.clj:699)
        at jepsen.control.SSHRemote.execute_BANG_(control.clj:331)
        at jepsen.control$ssh_STAR_$fn__3063.invoke(control.clj:172)
        at jepsen.control$ssh_STAR_.invokeStatic(control.clj:172)
        at jepsen.control$ssh_STAR_.invoke(control.clj:168)
        at jepsen.control$exec_STAR_.invokeStatic(control.clj:194)
        at jepsen.control$exec_STAR_.doInvoke(control.clj:191)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at clojure.core$apply.invokeStatic(core.clj:665)
        at clojure.core$apply.invoke(core.clj:660)
        at jepsen.control$exec.invokeStatic(control.clj:210)
        at jepsen.control$exec.doInvoke(control.clj:204)
        at clojure.lang.RestFn.invoke(RestFn.java:436)
        at jepsen.db$tcpdump$reify__3446.teardown_BANG_(db.clj:112)
        at jepsen.mongodb.db.ShardedDB.teardown_BANG_(db.clj:406)
        at jepsen.db$fn__3273$G__3269__3277.invoke(db.clj:11)
        at jepsen.db$fn__3273$G__3268__3282.invoke(db.clj:11)
        at clojure.core$partial$fn__5824.invoke(core.clj:2625)
        at jepsen.control$on_nodes$fn__3161.invoke(control.clj:430)

This is for node n4 but similar exceptions happen for all nodes.
A simple ssh n4 from the control node seems to work so there isn't an obvious problem with the docker cluster.
Any pointers for me to explore here ?

@aphyr
Contributor

aphyr commented Dec 4, 2020 via email

@Tsunaou

Tsunaou commented Dec 4, 2020

> @aphyr And what do you know, just as I went to run the tests again hoping to send you a stack trace, they worked! 🍻 🙂
> I will try running them again to see if there is some instability. Other than that, if you see any obvious steps that I have missed, do let me know.
> I'll paste the SSH-related exceptions here as soon as I encounter them :)

Oh bro! It is exciting that you have dealt with the problem of running the MongoDB Jepsen test in docker-compose, even though the test may crash in some situations.
In my previous work, I rented some servers to run this test suite, which was expensive, so I didn't continue.
You have done two things, and only two things, to fix the bug, right?

  1. Adding 2 more nodes
  2. Changing the installation instructions in the setup! function

I am interested in your work, and it would be helpful if you could share your config and fixes. Thanks.

@aphyr
Contributor

aphyr commented Dec 6, 2020

I had a chance to go through the mongo code today and get everything fixed up for the latest Jepsen and Debian Buster.

@rhishikeshj
Author

@aphyr nice! 😊 Would love to see that happen. Also, I have opened a pull request making some of the changes for Jepsen 0.2.1.
Let me know if that's mergeable.
