
VersionStore appends very slow in symbol with many rows #625

Open
bmoscon opened this issue Sep 10, 2018 · 6 comments


@bmoscon
Collaborator

bmoscon commented Sep 10, 2018

The symbol consists of a datetime-indexed dataframe (a single index column, date). There are 7 other data columns:

1 of type int64, 1 of type float64, and 5 of type object/string.

Once the symbol grew past 50-60 million rows, appends started taking very large amounts of time: 10-20 minutes. Each append was on the order of 90k-150k rows.

I've attached code profiler output that shows where the bottlenecks are. Some are in Arctic, some are on the network side. The machines are connected over GigE links, so the amount of time spent on network traffic leads me to believe large parts of the data are being read back and forth (I'm not sure why that would be).
profile.txt
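For anyone who wants to capture a similar profile, a minimal sketch of the pattern using the standard library's cProfile (here `dummy_append` is a hypothetical stand-in for `lib.append`, which needs a running Arctic/MongoDB instance):

```python
import cProfile
import io
import pstats

def profile_call(fn, *args, **kwargs):
    """Run fn under cProfile and return (result, cumulative-time report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = fn(*args, **kwargs)
    profiler.disable()
    buf = io.StringIO()
    # sort by cumulative time so the slowest call chains appear first
    pstats.Stats(profiler, stream=buf).sort_stats('cumulative').print_stats(20)
    return result, buf.getvalue()

def dummy_append(n):
    # stand-in for lib.append('test-data', df)
    return sum(range(n))

result, report = profile_call(dummy_append, 100_000)
print(report)
```

Swapping `dummy_append` for the real append call and dumping the report to a file is enough to reproduce a profile like the one attached.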

@bmoscon
Collaborator Author

bmoscon commented Sep 10, 2018

I'll try to generate some test code that reproduces the issue.

@bmoscon
Collaborator Author

bmoscon commented Sep 12, 2018

import random
import time
import uuid
import sys

from arctic import Arctic
import pandas as pd


def gen_data_no_str(start: str, stop: str):
    # yield one dataframe per 6-hour interval, with numeric columns only
    interval = pd.interval_range(start=pd.Timestamp(start), end=pd.Timestamp(stop), freq='6H')
    for i in interval:
        start = i.left
        end = i.right

        length = random.randint(90000, 150000)
        index = pd.date_range(start=start, end=end, periods=length)
        df = pd.DataFrame({'venue': [146342] * length,
                           'symbol': [7777] * length,
                           'price': [random.randint(1, 100) + random.random() for _ in range(length)],
                           'amount': [random.randint(1, 1000) for _ in range(length)],
                           'id': [random.randint(1, 10000000000000) for _ in range(length)],
                           'side': [random.randint(0,1) for _ in range(length)],
        }, index=pd.DatetimeIndex(index, name='date'))

        yield df


def gen_data(start: str, stop: str):
    # same as gen_data_no_str, but with string/object columns as well
    def side():
        # randomly pick a trade side
        return 'BUY' if random.randint(1, 2) == 1 else 'SELL'
    interval = pd.interval_range(start=pd.Timestamp(start), end=pd.Timestamp(stop), freq='6H')
    for i in interval:
        start = i.left
        end = i.right

        length = random.randint(90000, 150000)
        index = pd.date_range(start=start, end=end, periods=length)
        df = pd.DataFrame({'timestamp': [str(i) for i in index],
                           'venue': ['NYSE'] * length,
                           'symbol': ['QQQ'] * length,
                           'price': [random.randint(1, 100) + random.random() for _ in range(length)],
                           'amount': [random.randint(1, 1000) for _ in range(length)],
                           'id': [str(uuid.uuid4()) for _ in range(length)],
                           'side': [side() for _ in range(length)],
        }, index=pd.DatetimeIndex(index, name='date'))

        yield df



def main(f):
    random.seed(time.time())
    a = Arctic('127.0.0.1')
    if 'repro' in a.list_libraries():
        a.delete_library('repro')
    a.initialize_library('repro')

    lib = a['repro']
    size = 0
    for data in f('2018-01-01', '2018-09-01'):
        start = time.time()
        lib.append('test-data', data)
        end = time.time()
        size += len(data)
        print("Wrote dataframe of len {}. Took {} seconds".format(len(data), end-start))
        print("Total size: {}".format(size))


if __name__ == '__main__':
    func = gen_data_no_str
    if len(sys.argv) == 2 and sys.argv[1] == 'str':
        func = gen_data
    
    main(func)

You can run the code as-is and it will generate dataframes without string columns. Append time still grows linearly, but more slowly than it does for dataframes with strings. You can enable the string version by adding str on the command line after the name of the Python script.

@bmoscon
Collaborator Author

bmoscon commented Sep 12, 2018

@dimosped @jamesblackburn

@bmoscon
Collaborator Author

bmoscon commented Sep 12, 2018

For dataframes with no strings, it's taking about 3 to 3.2 seconds per append by the time the symbol reaches 17m rows (from a start of about 0.2 seconds).

For dataframes with strings it starts off similarly (0.5 seconds per append) but reaches 3+ seconds per append by about 5m rows. By the time it hits 17m rows, each append takes about 10 seconds.
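The linear growth above can be turned into a rough per-row cost with an ordinary least-squares slope (a sketch; the two data points are just the approximate no-strings figures quoted above):

```python
def growth_rate(total_rows, seconds):
    """Least-squares slope: extra seconds of append time per extra row in the symbol."""
    n = len(total_rows)
    mean_x = sum(total_rows) / n
    mean_y = sum(seconds) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(total_rows, seconds))
    den = sum((x - mean_x) ** 2 for x in total_rows)
    return num / den

# approximate figures from above: ~0.2 s near 0 rows, ~3.2 s at 17m rows (no strings)
per_row = growth_rate([0, 17_000_000], [0.2, 3.2])
print(f"{per_row * 1e6:.3f} microseconds of append time added per row in the symbol")
```

With more of the printed timings as input, the same function would show whether the growth stays linear or bends upward.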

@bmoscon
Collaborator Author

bmoscon commented Sep 13, 2018

Also, I'm not sure what happened here, but there's a major spike in append times!

Note: total size is the total number of rows in the symbol.

Wrote dataframe of len 137756. Took 35.16470694541931 seconds
Total size: 45751841
Wrote dataframe of len 90127. Took 41.80634117126465 seconds
Total size: 45841968
Wrote dataframe of len 96921. Took 55.48118996620178 seconds
Total size: 45938889
Wrote dataframe of len 97367. Took 77.72691798210144 seconds
Total size: 46036256
Wrote dataframe of len 146128. Took 131.6923909187317 seconds
Total size: 46182384
Wrote dataframe of len 145066. Took 169.6080777645111 seconds
Total size: 46327450
Wrote dataframe of len 131074. Took 227.51464676856995 seconds
Total size: 46458524
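One way to see the spike is to normalize each append by its row count. A sketch that parses the log lines above and reports milliseconds of append time per row written (the per-row cost rises roughly seven-fold across these seven appends, even though the symbol grows by under 2%):

```python
import re

log = """\
Wrote dataframe of len 137756. Took 35.16470694541931 seconds
Wrote dataframe of len 90127. Took 41.80634117126465 seconds
Wrote dataframe of len 96921. Took 55.48118996620178 seconds
Wrote dataframe of len 97367. Took 77.72691798210144 seconds
Wrote dataframe of len 146128. Took 131.6923909187317 seconds
Wrote dataframe of len 145066. Took 169.6080777645111 seconds
Wrote dataframe of len 131074. Took 227.51464676856995 seconds"""

pattern = re.compile(r"len (\d+)\. Took ([\d.]+) seconds")
rates = []
for rows, secs in pattern.findall(log):
    # milliseconds of append time per row written in this append
    rates.append(float(secs) / int(rows) * 1000)

print([round(r, 3) for r in rates])
```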

@bmoscon
Collaborator Author

bmoscon commented Feb 2, 2019

@dimosped I know we discussed this a while back. Any chance we can try to narrow this down and fix it?
