Column Names and Translation Dictionaries #68

Open
bt- opened this issue Aug 16, 2020 · 8 comments
Labels
enhancement New feature or request

Comments

@bt-

bt- commented Aug 16, 2020

@cwhanse and @wfvining, I'm considering whether to pull some of the functionality of pvcaptest out into a separate package, and I'd like your feedback on whether pvanalytics might be a good home for it or whether I should create a new package.

There are two closely related features that I'm considering pulling out:

  • a class and set of functions for cleaning up column names, see the column_renaming branch of my fork of pvcaptest, and
  • functions that parse column names to categorize each measurement by equipment type and/or physical value measured creating a 'translation dictionary'

A substantial amount of pvcaptest functionality depends on having a translation dictionary (CapData.column_groups). This approach was originally inspired by the Pecos package. Pecos enables using the translation dictionary concept, but doesn't generate them.
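To make the concept concrete, here is a minimal sketch of the shape such a dictionary might take: a mapping from a measurement category id to the column names in that category. The category ids and column names are made up for illustration and are not taken from pvcaptest or Pecos.

```python
# Hypothetical translation dictionary: category id -> raw column names.
# Category ids and column names are made up for illustration.
translation = {
    "irr_poa": ["WS1 POA", "WS2 POA"],
    "temp_amb": ["WS1 Ambient Temp"],
    "power_inv": ["Inv1 kW", "Inv2 kW"],
}

# Inverting the dictionary gives a per-column lookup of the category.
column_to_category = {
    column: category
    for category, columns in translation.items()
    for column in columns
}

# Selecting every column in one category is a plain dictionary lookup.
poa_columns = translation["irr_poa"]
```

Either direction carries the same information, so a tool could produce whichever form the downstream package expects.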

The more performance engineering work I do, especially on tests with longer time frames, the more I think it would be valuable to use Pecos. To facilitate this, I think it would make sense to move automatic translation dictionary generation out of pvcaptest into a more general purpose package (pvanalytics?) that can output a translation dictionary that can be used in both pvcaptest and pecos.

The pvcaptest code that generates translation dictionaries is contained in the translation dictionaries, the group_columns function, and the __series_type function. This algorithm works surprisingly well given how rudimentary it is, but it could definitely be improved.

I started the column-renaming tools because of how much variety there is in column names coming from a wide range of DAS/SCADA vendors and projects. I think renaming has to be the first step to get any reliable results from the algorithm that automatically generates the translation dictionary.

Look forward to hearing your thoughts!

@wfvining
Collaborator

I think this would be a very good addition to PVAnalytics! It seems almost indispensable for any kind of automated analysis.

I like the group_columns() function. In the PVAnalytics style it would probably need to take either a DataFrame or a list of names and return a dict mapping the input names to 'categorical names'. Pretty much what the function is already doing. What do you think of a pvanalytics.quality.names module for this and other related functionality (e.g. inferring units from raw column names)?
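As a rough sketch of that interface (not the pvcaptest implementation), the function below takes any iterable of names and returns a dict from each input name to a categorical name. The `CATEGORY_KEYWORDS` table and the category names are hypothetical, and simple substring matching stands in for whatever parsing the real algorithm does.

```python
from typing import Dict, Iterable

# Hypothetical keyword table: categorical name -> substrings that suggest it.
# These keywords are illustrative assumptions, not the pvcaptest definitions.
CATEGORY_KEYWORDS = {
    "irr_poa": ("poa", "plane of array"),
    "irr_ghi": ("ghi", "global horizontal"),
    "temp_amb": ("ambient", "tamb"),
    "power": ("power", "kw"),
}


def group_columns(names: Iterable[str]) -> Dict[str, str]:
    """Map each raw column name to a categorical name, or 'unknown'."""
    result = {}
    for name in names:
        lowered = name.lower()
        result[name] = "unknown"
        # The first category with a matching keyword wins.
        for category, keywords in CATEGORY_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                result[name] = category
                break
    return result


columns = ["Weather Station POA (W/m2)", "Ambient Temp (C)", "Meter kW", "Wind Speed"]
mapping = group_columns(columns)
```

Because iterating a pandas DataFrame yields its column labels, a DataFrame with string labels could be passed in place of the list.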

@cwhanse
Member

cwhanse commented Aug 17, 2020

@bt- is the scope to host translation tools and also a library of known translation dicts?

@bt-
Author

bt- commented Aug 17, 2020

@cwhanse, I am thinking primarily of hosting tools to create "translation dictionaries", where the translation dictionary is the mapping from measurement category id to groups of column names.

But, I do think there should be a library of dictionaries to facilitate renaming columns. As an example, this would be helpful for renaming data from AlsoEnergy, where they seem to consistently use sun to identify POA irradiance, to something more like poa irradiance. A basic version of this type of dictionary exists here. Based on my experience, it will be more effective to rename columns and then try to group them.
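A minimal sketch of what a rename dictionary could look like in use, assuming regex-based rules. The `RENAMES` table and `rename_columns` helper are hypothetical; only the sun to poa irradiance rule comes from the example above.

```python
import re

# Hypothetical vendor-specific rename table: regex pattern -> replacement.
# The 'sun' -> 'poa irradiance' entry follows the example in the text;
# the second entry is an assumption for illustration.
RENAMES = {
    r"\bsun\b": "poa irradiance",
    r"\btemp\b": "temperature",
}


def rename_columns(names):
    """Return {original name: cleaned name} after applying each rename rule."""
    cleaned = {}
    for name in names:
        new_name = name.lower().strip()
        for pattern, replacement in RENAMES.items():
            new_name = re.sub(pattern, replacement, new_name)
        cleaned[name] = new_name
    return cleaned


renamed = rename_columns(["Weather Station - Sun", "Sunrise Flag", "Module Temp"])
```

The word-boundary anchors keep a name like "Sunrise Flag" from being rewritten. The returned mapping has the shape that pandas `DataFrame.rename(columns=...)` accepts, so the cleaned names could be applied to the data before grouping.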

@wfvining, when I review the library overview the intuitive location to me is under system. I think the end point of this work would be the ability to automatically extrapolate system characteristics (type and quantity of sensors and equipment) as much as possible. But, I'll defer to your understanding of how the library is structured. It would be nice to have a name for the module that conveys that the grouping functionality is in it, but I haven't thought of anything better than names.

@wfvining
Collaborator

I was thinking about this in terms of quality control on the column names, not so much about identifying which sensors/equipment exist. I could see it going in system when you put it like that. In that case system.names doesn't make much sense to me, maybe system.sensors?

@cwhanse
Member

cwhanse commented Aug 17, 2020

Maybe an io or iotools module, since this feature is motivated by getting data into shape for the pvanalytics functions?

@wfvining
Collaborator

When I read io I think of functions for interacting with some external resource (a database, a file, or something on the network) as opposed to just identifying/manipulating column names for data that is already in memory.

@bt-
Author

bt- commented Aug 17, 2020

I'm having trouble thinking of a good name that encompasses the renaming and the grouping functionalities without falling back to a name like util.

What about one of these options:

  • system.utils
  • system.data_utils
  • system.prep_data

Or, I could see io being a good location as well because

  • I'd like to be able to save the data with renamed columns and save the translation dictionary of {categories: [renamed columns]}, maybe as a JSON file
  • and I agree that using these functions might be a first step prior to using other parts of the package

I envision the exported renamed data and translation dictionary being helpful if you wanted to use them in Pecos in one workflow/notebook and then reuse the same translation dictionary in pvcaptest or another workflow.

Maybe io.naming and io.grouping?
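The JSON export mentioned above could look roughly like this, using only the standard library. The dictionary contents and file name are placeholders, and nothing here reflects an existing pvanalytics, pvcaptest, or Pecos function.

```python
import json
import os
import tempfile

# Hypothetical translation dictionary: {category: [renamed columns]}.
translation = {
    "irr_poa": ["ws1 poa irradiance", "ws2 poa irradiance"],
    "power": ["meter power"],
}

# Write it out in one workflow...
path = os.path.join(tempfile.mkdtemp(), "translation.json")
with open(path, "w") as f:
    json.dump(translation, f, indent=2)

# ...and load it back in another.
with open(path) as f:
    loaded = json.load(f)
```

Since the dictionary holds only strings and lists of strings, it survives the round trip unchanged.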

@wfvining
Collaborator

We already have a pvanalytics.util module that could work. It doesn't currently have any public API functions, but I don't see why it couldn't.

wfvining added the enhancement label Aug 18, 2020