VersionStore appends very slow in symbol with many rows #625
I'll try and generate some test code that reproduces the issue:

```python
import random
import sys
import time
import uuid

import pandas as pd
from arctic import Arctic


def gen_data_no_str(start: str, stop: str):
    interval = pd.interval_range(start=pd.Timestamp(start), end=pd.Timestamp(stop), freq='6H')
    for i in interval:
        start = i.left
        end = i.right
        length = random.randint(90000, 150000)
        index = pd.date_range(start=start, end=end, periods=length)
        df = pd.DataFrame({'venue': [146342] * length,
                           'symbol': [7777] * length,
                           'price': [random.randint(1, 100) + random.random() for _ in range(length)],
                           'amount': [random.randint(1, 1000) for _ in range(length)],
                           'id': [random.randint(1, 10000000000000) for _ in range(length)],
                           'side': [random.randint(0, 1) for _ in range(length)],
                           }, index=pd.DatetimeIndex(index, name='date'))
        yield df


def gen_data(start: str, stop: str):
    def side():
        if random.randint(1, 2) == 1:
            return 'BUY'
        return 'SELL'

    interval = pd.interval_range(start=pd.Timestamp(start), end=pd.Timestamp(stop), freq='6H')
    for i in interval:
        start = i.left
        end = i.right
        length = random.randint(90000, 150000)
        index = pd.date_range(start=start, end=end, periods=length)
        df = pd.DataFrame({'timestamp': [str(i) for i in index],
                           'venue': ['NYSE'] * length,
                           'symbol': ['QQQ'] * length,
                           'price': [random.randint(1, 100) + random.random() for _ in range(length)],
                           'amount': [random.randint(1, 1000) for _ in range(length)],
                           'id': [str(uuid.uuid4()) for _ in range(length)],
                           'side': [side() for _ in range(length)],
                           }, index=pd.DatetimeIndex(index, name='date'))
        yield df


def main(f):
    random.seed(time.time())
    a = Arctic('127.0.0.1')
    if 'repro' in a.list_libraries():
        a.delete_library('repro')
    a.initialize_library('repro')
    lib = a['repro']
    size = 0
    for data in f('2018-01-01', '2018-09-01'):
        start = time.time()
        lib.append('test-data', data)
        end = time.time()
        size += len(data)
        print("Wrote dataframe of len {}. Took {} seconds".format(len(data), end - start))
        print("Total size: {}".format(size))


if __name__ == '__main__':
    func = gen_data_no_str
    if len(sys.argv) == 2 and sys.argv[1] == 'str':
        func = gen_data
    main(func)
```

You can run the code as-is and it will generate dataframes with no string columns. The time still grows linearly for each append, but more slowly than it does for the dataframes with strings. You can enable the string dataframes by passing `str` as the first command-line argument.
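As a side note on why the string case may be costlier: `object`-dtype columns carry a separate Python object per cell rather than a fixed-width numeric buffer. A quick sketch (independent of arctic, using hypothetical column data of the same shape as the repro) comparing the in-memory footprint of an int column versus a UUID-string column of equal length:

```python
import random
import uuid

import pandas as pd

length = 100_000

# int64 column: one fixed-width numpy buffer
ints = pd.Series([random.randint(1, 10**13) for _ in range(length)])

# object column: one Python str object per row, as in the 'id' column above
strs = pd.Series([str(uuid.uuid4()) for _ in range(length)])

# deep=True accounts for the per-object overhead of the strings
print(ints.dtype, ints.memory_usage(deep=True))
print(strs.dtype, strs.memory_usage(deep=True))
```

On a typical build the string column's deep footprint is several times larger than the int column's, which at least gestures at why serializing and shipping string-heavy chunks would cost more per append.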
For dataframes with no strings, it's taking about 3 to 3.2 seconds per append by the time it gets to 17m rows (from a start of about 0.2 seconds). For dataframes with strings it starts off similarly (0.5 seconds per append at the start) and reaches 3+ seconds per append by the time it gets to about 5m rows. By the time it hits 17m rows it's taking about
Also, not sure what happened here, but there's a major spike in append times!
@dimosped I know we discussed this a while back - any chance we can try and narrow this down and fix it?
The symbol consists of a datetime-indexed dataframe (single index column, date). There are 7 other data columns: 1 of type int64, 1 of type float64, and 5 of type object/string.
Once there were more than 50m-60m rows, the appends started taking a very long time - 10-20 minutes each, for appends on the order of 90k-150k rows.
I've attached code profiler output that shows where the bottlenecks are. Some are in arctic, some are on the network side. The machines are connected on GigE links, so the fact that so much time was spent on network traffic leads me to believe large parts of the data are being read back and forth (not sure why that would be).
profile.txt
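For anyone who wants to capture a similar profile around an append, a minimal sketch using the standard-library `cProfile`/`pstats` (the exact invocation behind the attachment isn't stated in this thread; `slow_append` below is a stand-in for the real `lib.append('test-data', df)` call):

```python
import cProfile
import io
import pstats


def slow_append(n):
    # stand-in workload; in the real repro, profile lib.append instead
    return sum(i * i for i in range(n))


pr = cProfile.Profile()
pr.enable()
slow_append(200_000)
pr.disable()

# sort by cumulative time and dump the top offenders, as in profile.txt
out = io.StringIO()
pstats.Stats(pr, stream=out).sort_stats('cumulative').print_stats(5)
report = out.getvalue()
print(report)
```

Sorting by cumulative time is what surfaces the network-bound calls, since time spent waiting on the socket accumulates in the callers.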