
Add scripts for plotting throughput measurements #375

Open · wants to merge 3 commits into main
Conversation

guilhermeAlmeida1
Collaborator

@guilhermeAlmeida1 guilhermeAlmeida1 commented Mar 29, 2023

This serves as an alternative to the plotting script using pyROOT in #344, using matplotlib instead (originally gnuplot) for users who might not have ROOT or who prefer this interface. It also works well with the script in that same PR for actually running lots of throughput measurements, which takes care of creating all of these results in an automated fashion.

It reads the logs that our throughput executables dump using the --log_results option and turns these into a series of different plots:

  • For each log file (= each device tested) and each mu value, plot throughput vs. number of CPU threads.
  • For each mu value, plot throughput vs. number of CPU threads for all devices in the same plot.
  • Plot throughput vs. mu for all devices using their peak performance.

@guilhermeAlmeida1
Collaborator Author

As a side note, I did first go with using pyGnuplot directly, but the whole interface is a bit weird and the package doesn't seem to get much support. Given that the user would have to install an additional package either way, I didn't see how installing gnuplot would be much different. Either way, I'm open to criticism.

@stephenswat
Member

Excellent initiative, I will take a look at this later today. 👍

Member

@stephenswat stephenswat left a comment


First, let me make it clear that this is a very solid attempt at creating a plotting script, and it works well! However, I think the Python code is a little over-complicated, and the gnuplot could be made more robust. The following is provided as constructive comments which will hopefully help you in the future.

Starting on the Python side, I would strongly recommend not creating directories from your Python script unless absolutely necessary. It is usually a good idea to rely on the user to make the appropriate directories for you, and then you simply write to those directories. This also allows you to make the output directory configurable.

I also would recommend not globbing the CSV files in the Python script. Rather, you should rely on the user's shell to do this for you. This gives the user more control over what they do and do not include. If the user has a CSV file in their working directory which is of the wrong format (which is not uncommon) then the script breaks, and there is no way for the user to fix this. It is more elegant to use the shell's globbing, such that the user can write:

python plot_data.py *.csv

Or, if they only want to plot a subset of files:

python plot_data.py a.csv b.csv my_prefix*.csv

Thankfully, parsing arguments like this in Python is super easy. You can import the argparse module and add code like this:

import argparse
import pathlib

parser = argparse.ArgumentParser(
    description="Guilherme's data prep"
)

parser.add_argument(
    "inputs",
    nargs='+',
    type=pathlib.Path,
    help="input CSV files",
)
parser.add_argument(
    "output",
    type=pathlib.Path,
    help="output directory",
)

args = parser.parse_args()

Then, you should get into the habit of using pathlib to work with paths in Python, as this is more robust than simply working with string representations of paths. For example, instead of...

output_csv = "clean/{}".format(input_csv)

You could write...

output_csv = pathlib.Path("clean") / input_csv

Which is a little more robust.
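To make the contrast concrete, here is a small sketch of the same idea (the file names are made up for illustration):

```python
from pathlib import Path

# Hypothetical input file, for illustration only
input_csv = Path("results/gpu_a100.csv")

# Join paths with the / operator instead of string formatting
output_csv = Path("clean") / input_csv.name

# .stem drops the extension; .with_suffix swaps it
label = input_csv.stem
plot_file = input_csv.with_suffix(".pdf")

print(output_csv, label, plot_file)
```

Methods like .stem and .with_suffix also replace fragile string slicing such as `input_csv[0:len(input_csv)-4]` for stripping a ".csv" suffix.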

There are a few comments on the use of Pandas, but I have put those inline.

Your Python script should really not be invoking gnuplot via a subprocess. Rather, leave that to the user to do via their shell. It is possible, for example, that someone might want to use your data processing script, but then plot with something other than gnuplot. Here, I recommend you follow the Unix philosophy of "do one thing and do it well".

I would also recommend putting the code that you want to run during the execution of your script in a top-level if __name__ == "__main__": block. This prevents your code from being executed if someone imports the script elsewhere. In this case, that is unlikely, but it is considered good practice in Python.

Finally, I would recommend you adopt Black as your Python formatter. It is, in my opinion, by far the superior choice and it has a variety of useful features when developing code collaboratively. Again, not super important but if you ever want (or need) to do Python this is a good habit to build up.

All in all, here is what I would write for the Python code:

#!/usr/bin/env python3

import pandas as pd
import argparse
import pathlib

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Guilherme's data prep")

    parser.add_argument(
        "inputs",
        nargs="+",
        type=pathlib.Path,
        help="input CSV files",
    )
    parser.add_argument(
        "output",
        type=pathlib.Path,
        help="output directory",
    )

    args = parser.parse_args()

    peaks = pd.DataFrame()

    for input_csv in args.inputs:
        # Create output filename in the output directory
        output_csv = args.output / input_csv

        # Read raw input data
        raw = pd.read_csv(input_csv)

        # Get int value from input directory name
        raw["mu"] = raw["directory"].str.extract(r"(\d+)", expand=False).astype(int)

        # Calculate throughput
        raw["throughput (events/s)"] = (
            raw["processed_events"] * 1e9 / raw["processing_time"]
        )

        # Calculate mean throughput and respective standard deviation for each mu/thread pair
        clean = (
            raw.groupby(["mu", "threads"])
            .agg({"throughput (events/s)": [("stddev", "std"), ("mean", "mean")]})
            .fillna(0.0)
        )

        final = clean.unstack(level=0)
        final.columns = [f"{x}_{y}" for _, x, y in final.columns.to_flat_index()]

        # Print to output csv file
        final.to_csv(output_csv)

        # Add maximum values to global comparison table
        peak = clean["throughput (events/s)"].groupby("mu").max()

        # pathlib's .stem strips the ".csv" suffix robustly
        stem = input_csv.stem
        peaks[stem] = peak["mean"]
        peaks["stddev_{}".format(stem)] = peak["stddev"]

    peaks.to_csv(args.output / "peaks.csv")

I understand that some of this might be hard to grasp, so please do not feel discouraged, learning Python takes time. Adopt as much or as little of this advice as you want, I am happy to put your code into the repository regardless.

Concerning the gnuplot code, I like that a lot! The only strong recommendation I would make is to refer to columns by name rather than by number. For example, you're currently in a situation where the column for the mean throughput of mu 20 is 2i and the deviation is 2i+1, but if you put in a new column (say mu 10), this ordering shifts completely. I would recommend writing column("throughput_mu20") or something similar instead.
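The same name-based lookup applies if the plotting is ever done from pandas/matplotlib instead of gnuplot. A minimal sketch, assuming a hypothetical cleaned-CSV layout with one mean/stddev column pair per mu value (the mean_mu20-style names are an assumption, not the PR's actual schema):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend for script use
import matplotlib.pyplot as plt

# Hypothetical cleaned data: one mean/stddev column pair per mu value
df = pd.DataFrame(
    {
        "threads": [1, 2, 4],
        "mean_mu20": [100.0, 190.0, 350.0],
        "stddev_mu20": [5.0, 8.0, 12.0],
    }
)

fig, ax = plt.subplots()
# Referring to columns by name survives any reordering or insertion of columns
ax.errorbar(df["threads"], df["mean_mu20"], yerr=df["stddev_mu20"], label="mu = 20")
ax.set_xlabel("CPU threads")
ax.set_ylabel("throughput (events/s)")
ax.legend()
fig.savefig("throughput_mu20.pdf")
```

Selecting by column name rather than position means adding a mu = 10 pair later leaves this plot unchanged.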

print("Printed", input_csv, " output to ", output_csv)

# Add maximum values to global comparison table
idx = clean.groupby('mu')['throughput (events/s)'].transform(max) == clean['throughput (events/s)']
Member


This can be written more simply:

Suggested change
idx = clean.groupby('mu')['throughput (events/s)'].transform(max) == clean['throughput (events/s)']
peak = clean["throughput (events/s)"]["mean"].groupby("mu").max()

Collaborator Author


I went for something a bit different here because I wanted the stddev as well, so I need to keep the index as well.

peak = clean.loc[
            clean["throughput (events/s)"]["mean"].groupby("mu").idxmax()
        ].reset_index("threads")
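For what it's worth, this idxmax selection can be exercised on a toy table (all numbers invented for illustration): because it keeps the full (mu, threads) row index of the winning row, the matching stddev comes along with the peak mean.

```python
import pandas as pd

# Toy mean/stddev throughput per (mu, threads) pair
clean = pd.DataFrame(
    {
        ("throughput (events/s)", "mean"): [100.0, 180.0, 90.0, 150.0],
        ("throughput (events/s)", "stddev"): [4.0, 6.0, 3.0, 5.0],
    },
    index=pd.MultiIndex.from_tuples(
        [(20, 1), (20, 2), (60, 1), (60, 2)], names=["mu", "threads"]
    ),
)

# idxmax returns the (mu, threads) label of the best row per mu group,
# so .loc recovers the whole row, stddev included
peak = clean.loc[
    clean["throughput (events/s)"]["mean"].groupby("mu").idxmax()
].reset_index("threads")
```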

@stephenswat stephenswat added feature New feature or request extras Changes related to the extras labels Mar 30, 2023
@guilhermeAlmeida1
Collaborator Author

Thank you for the thorough comments, I'll get to it soon enough.

@stephenswat
Member

As I mentioned, feel free to incorporate as much or as little of this as you want; I left this here as a learning exercise not as a request for changes so if you can't be bothered I will gladly approve this.

Member

@krasznaa krasznaa left a comment


Let me ping you guys on this one.

Similar to #370, I would prefer the "main script" to be put into the extras/ directory, with a name something like extras/traccc_throughput_mt_gnuplotter.py. 😛 (Feel free to rename extras/traccc_throughput_mt_plotter.py to extras/traccc_throughput_mt_rootplotter.py along the way. 😉)

Structurally I would also just auto-generate the .gp file in the .py script, but if you want to use the pre-made file, I can live with that. I'm even fine with that file staying in a sub-directory inside of extras/. Even if I myself would do it a bit differently. 😛

@stephenswat
Member

Nooooo

@guilhermeAlmeida1
Collaborator Author

I did not mean to close it; GitHub was being smarter than I wanted it to be.

@stephenswat
Member

Okay, give a shout if you want another review on this.

@guilhermeAlmeida1
Collaborator Author

I ended up deciding not to use gnuplot but matplotlib instead, because gnuplot was limiting the functionality I wanted this to have.

@guilhermeAlmeida1
Collaborator Author

@stephenswat I would like another review, yes. This still needs the visuals cleaned up to produce plots that aren't an eyesore, but functionality-wise it's where I wanted it to be. Of course the primitive user interface is very easy to break, but I'm not really concerned with that.

@guilhermeAlmeida1 guilhermeAlmeida1 marked this pull request as ready for review June 16, 2023 09:44
@guilhermeAlmeida1
Collaborator Author

guilhermeAlmeida1 commented Jun 16, 2023

This is now ready to be reviewed.
The script dumps a lot of plots which hopefully would satisfy most use-cases. If not, fine-tuning the code should not be too complicated for the user.
An example of the kind of plots this produces:
mu200_fp32.pdf
peaks_fp32.pdf
PS: these plots represent data from release v0.4.0, not our current main branch.
