nestordemeure/pandas2numpy

Pandas2Numpy

Converts Pandas dataframes into NumPy tensors so that they can be fed to deep learning frameworks. This library aims to be modular, easy to use with any deep learning framework (it builds on the common NumPy API), and unsurprising in its behaviour.

Installation

You can install the library with:

pip install git+https://github.com/nestordemeure/pandas2numpy.git

Usage

The Pandas2numpy class takes an example dataframe and lists of column names, and builds an object that can encode and decode dataframes consistently. It also takes the names of columns that should be normalized, put in log scale, or that might contain NA values that need to be handled.

NA values in categorical columns are encoded by adding an extra category. NA values in continuous columns are replaced by the median of the column (as computed on the example dataframe) and flagged in a dedicated categorical column.
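
The NA handling described above can be sketched in plain pandas and NumPy. This is a minimal illustration, not the library's implementation: the toy dataframe, the variable names, and the choice of code 0 for the NA category are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical example dataframe with missing values in both column kinds.
example = pd.DataFrame({
    "sepal_width": [3.0, np.nan, 3.4, 3.2],
    "species": ["setosa", "virginica", None, "setosa"],
})

# Continuous column: fill NA with the median of the example dataframe
# and keep a 0/1 marker column recording where the NA values were.
median = example["sepal_width"].median()  # NA values are skipped by default
width_filled = example["sepal_width"].fillna(median).to_numpy(dtype=np.float64)
width_was_na = example["sepal_width"].isna().to_numpy().astype(np.int64)

# Categorical column: pandas encodes NA as code -1, so shifting every
# code by one reserves 0 as the extra "NA" category (an assumed layout).
categories = sorted(example["species"].dropna().unique())
codes = pd.Categorical(example["species"], categories=categories).codes
species_codes = codes.astype(np.int64) + 1
```

Because the median comes from the example dataframe, the same fill value would be reused for any dataframe encoded later, which keeps the encoding stable across batches.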

import pandas
from pandas2numpy import Pandas2numpy

# example dataframe
df = pandas.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

# continuous variables to be encoded
continuous_columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
# categorical_columns to be encoded
categorical_columns = ['species']
# columns that should be set to mean=0, std=1
normalized_columns = ['sepal_length', 'sepal_width']
# columns that might contain NA
NA_columns = ['sepal_width', 'species']
# columns to which a logarithm should be applied
logscale_columns = ['sepal_length', 'petal_length']

# builds an encoder from an example dataframe, which is used to extract
# the normalization statistics and the set of possible categories
# adds 1e-25 to columns that are put in log scale to avoid NaN on 0.0 values
tabularEncoder = Pandas2numpy(df, continuous_columns=continuous_columns, categorical_columns=categorical_columns,
                              normalized_columns=normalized_columns, NA_columns=NA_columns,
                              logscale_columns=logscale_columns, log_epsilon=1e-25)
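
To make the role of log_epsilon and of the normalization statistics concrete, here is a small NumPy sketch. It assumes the logarithm is applied before normalization and that the statistics are the mean and standard deviation of the reference data; the library may order or compute these differently, and the encode and decode helpers are hypothetical names.

```python
import numpy as np

log_epsilon = 1e-25  # same value as in the constructor call above

# Hypothetical reference column containing an exact zero.
reference = np.array([0.0, 1.0, 10.0, 100.0])

# Log scale: the epsilon keeps log(0.0) from producing -inf.
logged = np.log(reference + log_epsilon)

# Normalization statistics are computed once, on the reference data.
mean, std = logged.mean(), logged.std()

def encode(values):
    """Apply the log scale, then standardize with the stored statistics."""
    return (np.log(values + log_epsilon) - mean) / std

def decode(encoded):
    """Invert encode: undo the standardization, then the logarithm."""
    return np.exp(encoded * std + mean) - log_epsilon

encoded = encode(reference)
restored = decode(encoded)
```

Note that in float64 an epsilon of 1e-25 only changes values that are exactly (or extremely close to) zero: 1.0 + 1e-25 is still 1.0.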

Once constructed, you can use the to_numpy methods to convert dataframes and rows into numpy tensors. We also provide methods that deal with categorical and continuous variables only.

# converts a dataframe into a tensor of floats and a tensor of ints
tensor_continuous,tensor_categorical = tabularEncoder.to_numpy(df)

# converts only continuous data into a tensor
tensor_continuous2 = tabularEncoder.continuous_to_numpy(df)

# converts a row (only the categorical data in this example)
# note the `df.iloc[[0]]` syntax, which ensures that the row is a DataFrame and not a Series
tensor_categorical2 = tabularEncoder.categorial_to_numpy(df.iloc[[0]])
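
The difference between the two indexing forms is standard pandas behaviour and can be checked without the library. The two-row dataframe below is a hypothetical stand-in for the iris data.

```python
import pandas as pd

# Hypothetical two-row dataframe standing in for the iris data.
df = pd.DataFrame({"sepal_length": [5.1, 4.9], "species": ["setosa", "setosa"]})

single = df.iloc[0]     # integer index: a Series indexed by column name
wrapped = df.iloc[[0]]  # list index: a one-row DataFrame with the same columns

print(type(single).__name__, single.shape)    # Series (2,)
print(type(wrapped).__name__, wrapped.shape)  # DataFrame (1, 2)
```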

The from_numpy methods convert arrays back to dataframes.

# converts tensors back into a dataframe (note that the order of columns might change)
df = tabularEncoder.from_numpy(tensor_continuous,tensor_categorical)

# converts the continuous tensor back into a dataframe (that will only include continuous columns)
df_continuous2 = tabularEncoder.continuous_from_numpy(tensor_continuous2)

# converts the categorical tensor into a one row dataframe
row_categorical2 = tabularEncoder.categorial_from_numpy(tensor_categorical2)

Pandas2numpy also has a nb_category_per_categorical_column member: a NumPy array of integers giving the number of categories in each column of the categorical tensor (which is useful for building embeddings).

nb_categories = tabularEncoder.nb_category_per_categorical_column
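
As a rough illustration of how these counts can size embeddings, the sketch below builds one embedding table per categorical column in plain NumPy. The arrays are hypothetical stand-ins for the encoder's outputs, and the rule used for the embedding width is an arbitrary choice for this example.

```python
import numpy as np

# Hypothetical stand-ins for the encoder outputs: one categorical column
# with 4 categories, and five rows of integer category codes.
nb_categories = np.array([4])
tensor_categorical = np.array([[1], [2], [0], [1], [3]])

rng = np.random.default_rng(0)

# One embedding table per categorical column; the width rule is an
# arbitrary choice for this sketch, not something the library prescribes.
tables = [rng.normal(size=(n, min(50, (n + 1) // 2))) for n in nb_categories]

# Look up each column's codes in its table and concatenate the results.
embedded = np.concatenate(
    [table[tensor_categorical[:, j]] for j, table in enumerate(tables)],
    axis=1,
)
```

In a deep learning framework, each table would typically be replaced by the framework's trainable embedding layer, with the number of rows taken from the corresponding entry of nb_categories.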

For further information, we invite you to read the documentation of the individual functions.

