Changes required to run SIMX on HPCAC #71

3 changes: 2 additions & 1 deletion docker/fc34/basic-setup.sh
@@ -69,4 +69,5 @@ cat <<EOF > /etc/sysctl.d/hugepages.conf
vm.nr_hugepages=2
EOF

rpm -U /opt/rpms/*.rpm
#rpm -U /opt/rpms/*.rpm
rpm -U --force /opt/rpms/*.rpm
Collaborator

Fixed in c2f86ca

12 changes: 11 additions & 1 deletion docker/fc34/kvm.Dockerfile
@@ -61,6 +61,14 @@ RUN \
unzip \
valgrind \
wget \
autoconf \
automake \
libtool \
g++ \
vim \
iperf \
crash \
zstd \

These packages are a development convenience.

It might be useful to let users supply their own Dockerfile that incrementally adds whatever they need on top of the existing image.

Collaborator

It was always in the back of my mind, but I never investigated how to do it without rebuilding all the images.

Can you just apply something on top of the existing image?
For example, run one additional Dockerfile over the existing image to extend it, and make the new image the "current" one?

&& dnf clean dbcache packages

COPY --from=rpms /opt/rpms /opt/rpms
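
The idea raised in this thread, layering user packages on top of an already built image instead of editing kvm.Dockerfile, could look roughly like the sketch below. It only generates Dockerfile text. The function name `overlay_dockerfile` and any image tag passed to it are hypothetical and are not part of mkt; the `dnf clean dbcache packages` step is copied from the existing Dockerfiles.

```python
def overlay_dockerfile(base_image, packages):
    """Return Dockerfile text that installs extra packages on top of base_image.

    base_image is whatever tag the existing image was built with; this
    sketch does not know how mkt names its images.
    """
    if not packages:
        raise ValueError("at least one package is required")
    lines = ["FROM %s" % base_image, "RUN dnf install -y \\"]
    # One package per line, sorted, matching the style of the existing Dockerfiles.
    lines += ["    %s \\" % pkg for pkg in sorted(packages)]
    lines.append("    && dnf clean dbcache packages")
    return "\n".join(lines) + "\n"
```

Because the generated file starts from the existing image, only the extra layer is built; the base images are not rebuilt.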
@@ -69,4 +77,6 @@ ADD sshd_config ssh_host_rsa_key /etc/ssh/

ADD basic-setup.sh kvm-setup.sh /root/

RUN /root/basic-setup.sh && /root/kvm-setup.sh
RUN /root/basic-setup.sh

RUN /root/kvm-setup.sh


can be ignored

6 changes: 3 additions & 3 deletions docker/fc34/support-simx.sh
@@ -1,7 +1,7 @@
#!/bin/bash
# ---
# git_url: http://l-gerrit.mtl.labs.mlnx:8080/simx
# git_commit: 41f602dc05b3c115b176ac3f7869e8bd390cbd92
# git_url: /global/home/users/ztiffany/test/simx
# git_commit: 3f3c2c9338f3bbb73cf3bd298152e020e394086f


can be ignored


cat <<EOF > mlx-simx.spec
%global debug_package %{nil}
@@ -18,7 +18,7 @@ From simx.git
%build
./mlnx_infra/config.status.mlnx --target=x86 --prefix=/opt/simx
make %{?_smp_mflags}
make %{?_smp_mflags} -C mellanox/
make %{?_smp_mflags} -C mellanox/ SIMX_PROJECT=mlx5


This tells SimX to only build the NIC part. I think it makes sense unless the switch part is planned to be used.

Collaborator

Yeah, it makes sense for now. A long time ago I pitched this project to the switch team. They even tried it, but decided to stick with VMs because of the difference in technical expertise between the development team and the verification team.


#%install
make DESTDIR=%{buildroot} install
4 changes: 2 additions & 2 deletions docker/fc34/support-smatch.sh
@@ -1,7 +1,7 @@
#!/bin/bash
# ---
# git_url: git://repo.or.cz/smatch.git
# git_commit: 9bb66fa2d7c73b3338a27fd6b38d7d509b2a1c1b
# git_url: /global/home/users/artemp/scratch/.cache/mellanox/mkt/smatch.git
# git_commit: 72c21a144a812cadbe349801da1b24bc331af256
@artpol84, Jun 15, 2021

For some reason, the site where we were building this can't access the original URL.
This is specific to that site and probably shouldn't be considered,
especially given that "mkt images" is not a must.

Contributor

IIRC, if you preload the normal cache directory, it doesn't require network access as long as the commit_id is already present. So these odd disconnected cases are solved by transferring the cache directory from a network-connected machine.

What is the normal cache directory?

Collaborator

➜ kernel git:(master) ls ~/.cache/mellanox/mkt
iproute2-next.git rdma-core.git simx.git smatch.git sparse.git tc-build.git

That's the issue: it fails to download there from git, saying "connection refused".
Again, I don't think we should consider it in MKT. This is obviously an issue at that site.

We just haven't cleaned up the version we ended up with, to save time.


cat <<EOF > smatch.spec
Name: smatch
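
For the disconnected-site case discussed in the thread above, one way to confirm that a preloaded cache already holds the pinned commit is to ask git directly. This is a minimal sketch, assuming the cache entries are git repositories named `<name>.git` under `~/.cache/mellanox/mkt` (as the `ls` output in the thread suggests) and that `git` is on the PATH. The helper names are hypothetical, and this is not how mkt itself performs the check.

```python
import os
import subprocess

CACHE_DIR = os.path.expanduser("~/.cache/mellanox/mkt")


def commit_check_cmd(repo_dir, commit):
    """Build a git command that exits 0 only if `commit` names a commit in repo_dir."""
    return ["git", "--git-dir=" + repo_dir, "cat-file", "-e", commit + "^{commit}"]


def cache_has_commit(name, commit, cache_dir=CACHE_DIR):
    """Return True if <cache_dir>/<name>.git exists and contains `commit`."""
    repo_dir = os.path.join(cache_dir, name + ".git")
    if not os.path.isdir(repo_dir):
        return False
    result = subprocess.run(commit_check_cmd(repo_dir, commit),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0
```

Calling `cache_has_commit("smatch", "<git_commit from support-smatch.sh>")` on the disconnected machine would show whether the transferred cache is sufficient before a build is attempted.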
4 changes: 2 additions & 2 deletions docker/fc34/support-sparse.sh
@@ -1,7 +1,7 @@
#!/bin/bash
# ---
# git_url: git://git.kernel.org/pub/scm/devel/sparse/sparse.git
# git_commit: 8af2432923486c753ab52cae70b94ee684121080
# git_url: /global/home/users/artemp/scratch/.cache/mellanox/mkt/sparse.git


same as above

# git_commit: 49c98aa3ed1b315ed2f4fbe44271ecd5bdd9cbc7

cat <<EOF > sparse.spec
Name: sparse
3 changes: 3 additions & 0 deletions docker/fc34/support.Dockerfile
@@ -59,4 +59,7 @@ RUN dnf install -y \
uuid-devel \
valgrind-devel \
zlib-devel \
autoconf \
automake \
libtool \
&& dnf clean dbcache packages
16 changes: 14 additions & 2 deletions plugins/do-build.py
@@ -42,6 +42,12 @@ def make_simx(args):

subprocess.call(cmd + ['-j%d' %(args.num_jobs)])

def make_rdmo_app(args):
Author

@ztiffany, Jun 15, 2021

I started throwing in some stuff to make MKT build my rdmo app. I abandoned that, though. Ignore references to rdmo-app and the packages added to support.Dockerfile.

I added packages to the VM image to build rdmo-app inside my VM instead.

Collaborator

Building inside the VM looks simpler, but it misses the MKT concept. We wanted to separate the build environment from the run environment. That lets us benefit from specific optimizations and makes runs fast.

if args.clean:
subprocess.check_output(['rm', '-rf', 'build'])
return
subprocess.call(['./build.sh'])

def switch_to_user(args):
with open("/etc/passwd","a") as F:
F.write(args.passwd + "\n");
@@ -79,9 +85,13 @@ def setup_from_pickle(args, pickle_params):
subprocess.check_output(['make', 'headers_install',
'INSTALL_HDR_PATH=/usr'], cwd=args.kernel)

if not os.path.isdir('/images/ztiffany/ccache'):


This is likely not needed based on my later experience

subprocess.check_output(['mkdir', '/images/ztiffany/ccache'])
subprocess.check_output(['chmod', '0777', '/images/ztiffany/ccache'])

switch_to_user(args)
if os.path.isdir('/ccache'):
os.environ['CCACHE_DIR'] = '/ccache'
if os.path.isdir('/images/ztiffany/ccache'):
os.environ['CCACHE_DIR'] = '/images/ztiffany/ccache'

if args.shell:
os.execvp('/bin/bash', ['/bin/bash'])
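
If a per-site ccache location does turn out to be needed, the hard-coded `/images/ztiffany/ccache` path in this hunk could be replaced by a list of candidates. The sketch below is only an illustration: `pick_ccache_dir` and `ensure_dir` are hypothetical helpers, not existing mkt functions, and the first existing candidate wins, whereas in the hunk above the later check overrides the earlier one.

```python
import os


def pick_ccache_dir(candidates):
    """Return the first path in candidates that is an existing directory, else None."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None


def ensure_dir(path, mode=0o777):
    """Create path (and parents) if missing and set its permission bits."""
    os.makedirs(path, exist_ok=True)
    # The mode given to makedirs is filtered by the umask, so chmod explicitly.
    os.chmod(path, mode)
    return path
```

A caller would then set `os.environ['CCACHE_DIR']` only when `pick_ccache_dir` returns a path, which avoids spawning `mkdir` and `chmod` as subprocesses.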
@@ -97,3 +107,5 @@ def setup_from_pickle(args, pickle_params):
make_rdma_core(args)
if args.project == "simx":
make_simx(args)
if args.project == "rdmo-app":
Author

Ignore this.

make_rdmo_app(args)
12 changes: 10 additions & 2 deletions plugins/do-kvm.py
@@ -64,11 +64,17 @@ def remove_mounts():


def is_passable_mount(v):
Author

On HPCAC, this was needed to get the rdma-core directory passed through:

mkt run --dir /images/ztiffany/src/rdma-core/

Is this expected?

Collaborator

No. It means your config file is incomplete, or there is another bug. We mount the whole src directory.

It could be because /images is on tmpfs, but I don't think we saw even an attempt to mount it.

print ("Checking mount: {}".format(v))
if v[2] == "nfs" or v[2] == "nfs4":
Author

qemu-system-x86_64: -fw_cfg etc/sercon-port,string=2: warning: externally provided fw_cfg item names should be prefixed with "opt/"
qemu-system-x86_64: -device virtio-9p-pci,fsdev=host_bind_fs0,mount_tag=bind0: cannot initialize fsdev 'host_bind_fs0': failed to open '<snip>': Permission denied

Collaborator

"Permission denied": let's debug this, it shouldn't happen.

@artpol84, Jun 16, 2021

If I am root on the node, I cannot ls my user's home directory.

ls fails with "Permission denied" as well.

return False
if v[1].startswith("/images/"):
print ("YES!!!")
return True
if v[1].startswith("/plugins"):
Author

HPCAC nodes are diskless. Here is how plugins are mounted:

Evaluating: /plugins
v is: ['tmpfs', '/plugins', 'tmpfs', 'ro,relatime,mode=555', '0', '0']
Passing: /plugins

Here is from a working system:

['/dev/sda5', '/plugins', 'ext3', 'ro,relatime', '0', '0']

Collaborator

AFAIK, docker can't mount tmpfs. We need to think about a workaround.


It does work if we add the above


So I think it should work as is

print ("YES!!!")
return True
if not v[0].startswith("/"):
return False
if v[1] == "/lab_tools":
print ("NOT START WITH")
return False
return True
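
Stripped of the debug prints, the mount filter from this hunk can be written as below. The rules are taken from the diff; each `v` is an mtab entry in the form shown in the thread (`[device, mountpoint, fstype, options, ...]`). The `can_list` helper is a hypothetical addition, not part of this PR: it probes a directory the same way `ls` does, which is one way to catch the "Permission denied" case before the path is handed to qemu.

```python
import os


def is_passable_mount(v):
    """Decide whether an mtab entry should be passed through to the VM."""
    if v[2] in ("nfs", "nfs4"):
        return False
    # Diskless nodes expose these as tmpfs, so accept them regardless of device.
    if v[1].startswith("/images/") or v[1].startswith("/plugins"):
        return True
    # Pseudo filesystems (proc, sysfs, tmpfs, ...) have no device path.
    if not v[0].startswith("/"):
        return False
    if v[1] == "/lab_tools":
        return False
    return True


def can_list(path):
    """Return True if the current user can list path, False on any OS error."""
    try:
        os.listdir(path)
    except OSError:
        return False
    return True
```

With these rules, both the tmpfs `/plugins` entry from HPCAC and the ext3 `/plugins` entry from the working system are accepted.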

@@ -106,8 +112,10 @@ def setup_fs():
# Copy over local bind mounts, eg from docker -v
cnt = 0
for dfn, v in get_mtab().items():
print ("Evaluating: {}".format(v[1]));
if not is_passable_mount(v):
continue
print ("Passing: {}".format(v[1]));

qemu_args["-fsdev"].add(
"local,id=host_bind_fs%u,security_model=passthrough,path=%s" %
2 changes: 2 additions & 0 deletions utils/build.py
@@ -67,6 +67,7 @@ def run_ci_cmd(self, supos):
"rdma": "iproute2",
"kernel": "kernel",
"mlnx_infra": "simx",
"rdmo-app": "rdmo-app",
Author

Ignore.

}

def build_list():
@@ -78,6 +79,7 @@ def set_args_project(args, section):

# "custom" project can't be sensed and must be provided explicitly
for key, value in project_marks.items():
print("comparing {} and {}".format(key, args.project))
if os.path.isdir(key):
args.project = value

2 changes: 1 addition & 1 deletion utils/cmdline.py
@@ -27,7 +27,7 @@ def get_cache_fn(fn):
an impact on the operation of mkt - at worst it will run slower."""
global cache_dir
if cache_dir is None:
cache_dir = os.path.expanduser("~/.cache/mellanox/mkt/")
cache_dir = '/images/ztiffany/.cache/mellanox/mkt/'

The home directory is too small to hold these caches.
Is there a way to point the cache somewhere else?

Collaborator

@rleon, Jun 16, 2021

.cache is a general mechanism; it is worth making a symlink.


Makes sense

# In MTL network, user home directories are located on /labhome
# and doesn't have enough space to build cache efficiently.
# Do nasty hack and replace labhome with swgwork
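
Besides the symlink suggested in the thread, another way to avoid a hard-coded per-user path is to honor the `XDG_CACHE_HOME` environment variable, which the XDG Base Directory convention defines as the override for `~/.cache`. The sketch below is an assumption about how that could look. Nothing in this PR indicates that mkt reads this variable today, and the function name `get_cache_dir` is hypothetical.

```python
import os


def get_cache_dir(environ=None):
    """Return the mkt cache directory, honoring XDG_CACHE_HOME when it is set."""
    environ = os.environ if environ is None else environ
    # An unset or empty XDG_CACHE_HOME falls back to ~/.cache.
    base = environ.get("XDG_CACHE_HOME") or os.path.expanduser("~/.cache")
    # The trailing empty component keeps the trailing slash of the original path.
    return os.path.join(base, "mellanox", "mkt", "")
```

On a site with a small home directory, exporting `XDG_CACHE_HOME=/images/<user>/.cache` would then relocate the cache without patching the source.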