
Append ignores time series index when data is identical? #69

Open
josa69 opened this issue Oct 23, 2023 · 1 comment

josa69 commented Oct 23, 2023

Given that pystore is still maintained...

A pandas DataFrame with dates as its index (type DatetimeIndex) is stored with PyStore; the last stored date is e.g. 2020-01-01.
If I call append to add a row with index 2020-01-02 whose values in every column are identical (np.nan) to those of the row with index 2020-01-01, then only the last row (2020-01-02) is kept.
I suspect the line `combined = dd.concat([current.data, new]).drop_duplicates(keep="last")` in collection.py is the reason.

In real life it's perhaps unlikely that two days have 100% identical data (this is EOD stock data), but shouldn't the time series index be honored in this case?
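
A minimal pandas sketch (outside pystore, using an illustrative 'close' column) of why that line would do this: drop_duplicates compares column values only, so the index plays no part in what counts as a duplicate. Deduplicating on the index instead, e.g. via index.duplicated, would keep both days:

import numpy as np
import pandas as pd

# two rows with different dates but identical (NaN) values,
# mirroring the scenario described above
df = pd.DataFrame(
    {'close': [np.nan, np.nan]},
    index=pd.to_datetime(['2020-01-01', '2020-01-02']),
)

# drop_duplicates looks at column values only -- the 2020-01-01 row is dropped
df.drop_duplicates(keep='last')
            close
2020-01-02    NaN

# an index-aware alternative: drop rows only when the *index* repeats
df[~df.index.duplicated(keep='last')]
            close
2020-01-01    NaN
2020-01-02    NaN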

gnzsnz commented May 4, 2024

I can't reproduce your issue, but I get something quite similar: basically, after an append the index gets out of order. The data is OK-ish, but not the index.

In a small file this is not an issue, as the results can be sorted, but on large files sorting is very slow.

import pandas as pd
import pystore

# create new store
pystore.set_path(path='/tmp/store')
store = pystore.store(datastore='datastore', engine='pyarrow')
test_eod = store.collection(collection='TEST.EOD')

# generate sample data
df = pd.DataFrame(
    list(range(10)), index=pd.date_range(start='2020-1-1', periods=10), columns=['data']
)
df
            data
2020-01-01     0
2020-01-02     1
2020-01-03     2
2020-01-04     3
2020-01-05     4
2020-01-06     5
2020-01-07     6
2020-01-08     7
2020-01-09     8
2020-01-10     9
# generate 2 overlapping sets of data
dfa = df[:-3].copy()
dfb = df[-5:].copy()

dfa
            data
2020-01-01     0
2020-01-02     1
2020-01-03     2
2020-01-04     3
2020-01-05     4
2020-01-06     5
2020-01-07     6

dfb
            data
2020-01-06     5
2020-01-07     6
2020-01-08     7
2020-01-09     8
2020-01-10     9

# write data
test_eod.write('STOCK', data=dfa)
# query data
test_eod.item('STOCK').to_pandas()
            data
2020-01-01     0
2020-01-02     1
2020-01-03     2
2020-01-04     3
2020-01-05     4
2020-01-06     5
2020-01-07     6
# append data
test_eod.append('STOCK', data=dfb)
# query again
test_eod.item('STOCK').to_pandas()
            data
2020-01-03     2
2020-01-04     3
2020-01-08     7
2020-01-01     0
2020-01-02     1
2020-01-05     4
2020-01-06     5
2020-01-07     6
2020-01-09     8
2020-01-10     9

pystore manages overlapping data, so I'm quite sure that's not the issue. The problem seems to be coming from dask; however, there is no sort_index in dask, only sort_values and set_index, which is not what we need here.
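
That said, a sort by index can be expressed in dask by round-tripping through a column. A sketch, assuming item('STOCK').data is the underlying dask DataFrame (as the collection.py line quoted above suggests) and that the unnamed DatetimeIndex comes back from reset_index as a column named 'index':

ddf = test_eod.item('STOCK').data
# set_index sorts the frame globally on that column (via a shuffle),
# which is the expensive step on large files mentioned above
ddf_sorted = ddf.reset_index().set_index('index')
ddf_sorted.compute()

The shuffle behind set_index is exactly why this gets slow on large items.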

I wonder if this is caused by dask dropping support for fastparquet.
