
Append ignores time series index when data is identical? #69

Open
josa69 opened this issue Oct 23, 2023 · 1 comment

josa69 commented Oct 23, 2023

Given that pystore is still maintained...

A pandas DataFrame with dates as its index (type DatetimeIndex) is stored with PyStore; the last stored date is e.g. 2020-01-01.
If I call append to add a row with index 2020-01-02 whose values in every column are identical (np.nan) to those of the row with index 2020-01-01, then only the last row (2020-01-02) is kept.
I suspect the line `combined = dd.concat([current.data, new]).drop_duplicates(keep="last")` in collection.py is the reason.

In real life it's perhaps unlikely that two days have 100% identical data (this is EOD stock data), but shouldn't the time series index be honored in this case?
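
A minimal pandas sketch (outside pystore, using an illustrative 'close' column) of why that line would do this: drop_duplicates compares column values only, so the index plays no part in what counts as a duplicate. Deduplicating on the index instead, e.g. via index.duplicated, would keep both days:

import numpy as np
import pandas as pd

# two rows with different dates but identical (NaN) values,
# mirroring the scenario described above
df = pd.DataFrame(
    {'close': [np.nan, np.nan]},
    index=pd.to_datetime(['2020-01-01', '2020-01-02']),
)

# drop_duplicates looks at column values only -- the 2020-01-01 row is dropped
df.drop_duplicates(keep='last')
            close
2020-01-02    NaN

# an index-aware alternative: drop rows only when the *index* repeats
df[~df.index.duplicated(keep='last')]
            close
2020-01-01    NaN
2020-01-02    NaN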

gnzsnz commented May 4, 2024

I can't reproduce your issue, but I get something quite similar: basically, after an append the index gets out of order. The data is OK-ish, but not the index.

In a small file this is not an issue, as the results can be sorted, but on large files sorting is very slow.

import pandas as pd
import pystore

# create new store
pystore.set_path(path='/tmp/store')
store = pystore.store(datastore='datastore', engine='pyarrow')
test_eod = store.collection(collection='TEST.EOD')

# generate sample data
df = pd.DataFrame(
    list(range(10)), index=pd.date_range(start='2020-1-1', periods=10), columns=['data']
)
df
            data
2020-01-01     0
2020-01-02     1
2020-01-03     2
2020-01-04     3
2020-01-05     4
2020-01-06     5
2020-01-07     6
2020-01-08     7
2020-01-09     8
2020-01-10     9
# generate 2 overlapping sets of data
dfa = df[:-3].copy()
dfb = df[-5:].copy()

dfa
            data
2020-01-01     0
2020-01-02     1
2020-01-03     2
2020-01-04     3
2020-01-05     4
2020-01-06     5
2020-01-07     6

dfb
            data
2020-01-06     5
2020-01-07     6
2020-01-08     7
2020-01-09     8
2020-01-10     9

# write data
test_eod.write('STOCK', data=dfa)
# query data
test_eod.item('STOCK').to_pandas()
            data
2020-01-01     0
2020-01-02     1
2020-01-03     2
2020-01-04     3
2020-01-05     4
2020-01-06     5
2020-01-07     6
# append data
test_eod.append('STOCK', data=dfb)
# query again
test_eod.item('STOCK').to_pandas()
            data
2020-01-03     2
2020-01-04     3
2020-01-08     7
2020-01-01     0
2020-01-02     1
2020-01-05     4
2020-01-06     5
2020-01-07     6
2020-01-09     8
2020-01-10     9

pystore manages overlapping data, so I'm quite sure that's not the issue. The problem seems to be coming from dask; however, there is no sort_index in dask, only sort_values and set_index, which is not what we need here.
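
That said, a sort by index can be expressed in dask by round-tripping through a column. A sketch, assuming item('STOCK').data is the underlying dask DataFrame (as the collection.py line quoted above suggests) and that the unnamed DatetimeIndex comes back from reset_index as a column named 'index':

ddf = test_eod.item('STOCK').data
# set_index sorts the frame globally on that column (via a shuffle),
# which is the expensive step on large files mentioned above
ddf_sorted = ddf.reset_index().set_index('index')
ddf_sorted.compute()

The shuffle behind set_index is exactly why this gets slow on large items.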

I wonder if this is caused by dask dropping support for fastparquet.
