
VersionStore appends very slow in symbol with many rows #625

Open
bmoscon opened this issue Sep 10, 2018 · 6 comments


@bmoscon
Collaborator

bmoscon commented Sep 10, 2018

The symbol consists of a datetime-indexed dataframe (a single index column, date). There are 7 other data columns:

1 of type int64, 1 of type float64, and 5 of type object/string.

Once the symbol grew past 50-60 million rows, appends started taking very large amounts of time: 10-20 minutes. Each append was on the order of 90k-150k rows.

I've attached code profiler output that shows where the bottlenecks are. Some are in Arctic, some are on the network side. The machines are connected over GigE links, so the amount of time spent on network traffic leads me to believe large parts of the data are being read back and forth (I'm not sure why that would be).
profile.txt
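For anyone who wants to capture a similar profile, a minimal sketch of the pattern using the standard library's cProfile (here `dummy_append` is a hypothetical stand-in for `lib.append`, which needs a running Arctic/MongoDB instance):

```python
import cProfile
import io
import pstats

def profile_call(fn, *args, **kwargs):
    """Run fn under cProfile and return (result, cumulative-time report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = fn(*args, **kwargs)
    profiler.disable()
    buf = io.StringIO()
    # sort by cumulative time so the slowest call chains appear first
    pstats.Stats(profiler, stream=buf).sort_stats('cumulative').print_stats(20)
    return result, buf.getvalue()

def dummy_append(n):
    # stand-in for lib.append('test-data', df)
    return sum(range(n))

result, report = profile_call(dummy_append, 100_000)
print(report)
```

Swapping `dummy_append` for the real append call and dumping the report to a file is enough to reproduce a profile like the one attached.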

@bmoscon
Collaborator Author

bmoscon commented Sep 10, 2018

I'll try to generate some test code that reproduces the issue.

@bmoscon
Collaborator Author

bmoscon commented Sep 12, 2018

import random
import time
import uuid
import sys

from arctic import Arctic
import pandas as pd


def gen_data_no_str(start: str, stop: str):
    # yield one dataframe per 6-hour interval, with numeric columns only
    interval = pd.interval_range(start=pd.Timestamp(start), end=pd.Timestamp(stop), freq='6H')
    for i in interval:
        start = i.left
        end = i.right

        length = random.randint(90000, 150000)
        index = pd.date_range(start=start, end=end, periods=length)
        df = pd.DataFrame({'venue': [146342] * length,
                           'symbol': [7777] * length,
                           'price': [random.randint(1, 100) + random.random() for _ in range(length)],
                           'amount': [random.randint(1, 1000) for _ in range(length)],
                           'id': [random.randint(1, 10000000000000) for _ in range(length)],
                           'side': [random.randint(0,1) for _ in range(length)],
        }, index=pd.DatetimeIndex(index, name='date'))

        yield df


def gen_data(start: str, stop: str):
    # same as gen_data_no_str, but with string/object columns as well
    def side():
        # randomly pick a trade side
        return 'BUY' if random.randint(1, 2) == 1 else 'SELL'
    interval = pd.interval_range(start=pd.Timestamp(start), end=pd.Timestamp(stop), freq='6H')
    for i in interval:
        start = i.left
        end = i.right

        length = random.randint(90000, 150000)
        index = pd.date_range(start=start, end=end, periods=length)
        df = pd.DataFrame({'timestamp': [str(i) for i in index],
                           'venue': ['NYSE'] * length,
                           'symbol': ['QQQ'] * length,
                           'price': [random.randint(1, 100) + random.random() for _ in range(length)],
                           'amount': [random.randint(1, 1000) for _ in range(length)],
                           'id': [str(uuid.uuid4()) for _ in range(length)],
                           'side': [side() for _ in range(length)],
        }, index=pd.DatetimeIndex(index, name='date'))

        yield df



def main(f):
    random.seed(time.time())
    a = Arctic('127.0.0.1')
    if 'repro' in a.list_libraries():
        a.delete_library('repro')
    a.initialize_library('repro')

    lib = a['repro']
    size = 0
    for data in f('2018-01-01', '2018-09-01'):
        start = time.time()
        lib.append('test-data', data)
        end = time.time()
        size += len(data)
        print("Wrote dataframe of len {}. Took {} seconds".format(len(data), end-start))
        print("Total size: {}".format(size))


if __name__ == '__main__':
    func = gen_data_no_str
    if len(sys.argv) == 2 and sys.argv[1] == 'str':
        func = gen_data
    
    main(func)

You can run the code as-is and it will generate dataframes without string columns. Append time still grows linearly, but more slowly than it does for dataframes with strings. You can enable the string version by adding str on the command line after the name of the Python script.

@bmoscon
Collaborator Author

bmoscon commented Sep 12, 2018

@dimosped @jamesblackburn

@bmoscon
Collaborator Author

bmoscon commented Sep 12, 2018

For dataframes with no strings, it's taking about 3 to 3.2 seconds per append by the time the symbol reaches 17m rows (from a start of about 0.2 seconds).

For dataframes with strings it starts off similarly (0.5 seconds per append) but reaches 3+ seconds per append by about 5m rows. By the time it hits 17m rows, each append takes about 10 seconds.
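The linear growth above can be turned into a rough per-row cost with an ordinary least-squares slope (a sketch; the two data points are just the approximate no-strings figures quoted above):

```python
def growth_rate(total_rows, seconds):
    """Least-squares slope: extra seconds of append time per extra row in the symbol."""
    n = len(total_rows)
    mean_x = sum(total_rows) / n
    mean_y = sum(seconds) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(total_rows, seconds))
    den = sum((x - mean_x) ** 2 for x in total_rows)
    return num / den

# approximate figures from above: ~0.2 s near 0 rows, ~3.2 s at 17m rows (no strings)
per_row = growth_rate([0, 17_000_000], [0.2, 3.2])
print(f"{per_row * 1e6:.3f} microseconds of append time added per row in the symbol")
```

With more of the printed timings as input, the same function would show whether the growth stays linear or bends upward.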

@bmoscon
Collaborator Author

bmoscon commented Sep 13, 2018

Also, I'm not sure what happened here, but there's a major spike in append times!

Note: total size is the total number of rows in the symbol.

Wrote dataframe of len 137756. Took 35.16470694541931 seconds
Total size: 45751841
Wrote dataframe of len 90127. Took 41.80634117126465 seconds
Total size: 45841968
Wrote dataframe of len 96921. Took 55.48118996620178 seconds
Total size: 45938889
Wrote dataframe of len 97367. Took 77.72691798210144 seconds
Total size: 46036256
Wrote dataframe of len 146128. Took 131.6923909187317 seconds
Total size: 46182384
Wrote dataframe of len 145066. Took 169.6080777645111 seconds
Total size: 46327450
Wrote dataframe of len 131074. Took 227.51464676856995 seconds
Total size: 46458524
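One way to see the spike is to normalize each append by its row count. A sketch that parses the log lines above and reports milliseconds of append time per row written (the per-row cost rises roughly seven-fold across these seven appends, even though the symbol grows by under 2%):

```python
import re

log = """\
Wrote dataframe of len 137756. Took 35.16470694541931 seconds
Wrote dataframe of len 90127. Took 41.80634117126465 seconds
Wrote dataframe of len 96921. Took 55.48118996620178 seconds
Wrote dataframe of len 97367. Took 77.72691798210144 seconds
Wrote dataframe of len 146128. Took 131.6923909187317 seconds
Wrote dataframe of len 145066. Took 169.6080777645111 seconds
Wrote dataframe of len 131074. Took 227.51464676856995 seconds"""

pattern = re.compile(r"len (\d+)\. Took ([\d.]+) seconds")
rates = []
for rows, secs in pattern.findall(log):
    # milliseconds of append time per row written in this append
    rates.append(float(secs) / int(rows) * 1000)

print([round(r, 3) for r in rates])
```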

@bmoscon
Collaborator Author

bmoscon commented Feb 2, 2019

@dimosped I know we discussed this a while back. Any chance we can try to narrow this down and fix it?
