Proposal to add sink feature to data downloaders #71

Open
Arkoniak opened this issue Apr 3, 2021 · 9 comments

Comments


Arkoniak commented Apr 3, 2021

I've been playing with the yahoo data source, and one thing occurred to me: in its current implementation the user is locked into TimeArray. That is not always convenient; a user may prefer to work with other data formats such as DataFrames, Temporal, or some custom format. What I am proposing is an interface like this:

data = yahoo("SPY", <SINK>)
# for example
data = yahoo("SPY", DataFrame) # download data and export it to DataFrame
data = yahoo("SPY", Temporal)    # download data and export it to Temporal
...

Here SINK can be anything: DataFrame, TimeArray, or whatever the user wants. We can emit TimeArray by default, for example, but that wouldn't limit the user.

To do that, we can wrap CSV.File in a special structure that conforms to the Tables.jl protocol. The idea is that if we define yahoo as, for example,

function yahoo(sym::AbstractString = "^GSPC", opt::YahooOpt = YahooOpt(), sink = DataFrame)
    host = rand(["query1", "query2"])
    url  = "https://$host.finance.yahoo.com/v7/finance/download/$sym"
    res  = HTTP.get(url, query = opt)
    @assert res.status == 200
    csv = CSV.File(res.body, missingstrings = ["null"])
    return sink(csv)
end

then this function provides a DataFrame sink by default. For it to work with TimeArray, one only needs to implement a csv -> TimeArray conversion, which could look like

function TimeArray(csv)
    sch = TimeSeries.Tables.schema(csv)
    TimeArray(csv, timestamp = first(sch.names)) |> cleanup_colname!
end

and something similar for Temporal.

The problem with this direct approach is that it does not generalize. If some other data source does not put the datetime column in the first position, it will break. So we can do something smarter, like defining a structure

struct TimeDataWrapper{T1, T2}
   meta::T1
   data::T2
end

and use it

  sch = (; schema = TimeSeries.Tables.schema(csv), timestamp = 1) # or something similar
  timedatawrapper = TimeDataWrapper(sch, csv)
  return sink(timedatawrapper)

This structure should implement the corresponding Tables.jl methods and at the same time provide the necessary information in the meta field (such as where the datetime column is located). That way, every sink that understands this structure can convert the data source to its own format without any problems.
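A minimal sketch of what "implement the Tables.jl methods" could mean for the wrapper, assuming Tables.jl is installed. The accessor `timestamp_name` is a hypothetical helper, not part of Tables.jl, and a NamedTuple of vectors with made-up sample values stands in for the parsed CSV:

```julia
using Tables
using Dates

# Wrapper from the proposal: `meta` carries time-series information,
# `data` is any Tables.jl-compatible source.
struct TimeDataWrapper{T1, T2}
    meta::T1
    data::T2
end

# Forward the column-access part of the Tables.jl interface to the wrapped source.
Tables.istable(::Type{<:TimeDataWrapper}) = true
Tables.columnaccess(::Type{<:TimeDataWrapper}) = true
Tables.columns(w::TimeDataWrapper) = Tables.columns(w.data)
Tables.schema(w::TimeDataWrapper) = Tables.schema(w.data)

# Hypothetical helper (an assumption, not a Tables.jl function): name of the
# datetime column, looked up from the position stored in `meta.timestamp`.
timestamp_name(w::TimeDataWrapper) =
    Tables.columnnames(Tables.columns(w))[w.meta.timestamp]

# Stand-in for the parsed CSV; the values are made up for illustration.
raw = (Date = [Date(2021, 4, 1), Date(2021, 4, 5)], Close = [100.0, 101.5])
wrapped = TimeDataWrapper((; timestamp = 1), raw)

# Any Tables.jl sink can consume the wrapper; `Tables.columntable` is the simplest.
table = Tables.columntable(wrapped)
```

A generic sink only sees an ordinary table here, while a time-series sink can additionally ask the wrapper where the time column is.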

We can do it in a few small steps

  1. Make this change in MarketData.jl. As long as TimeDataWrapper lives inside MarketData.jl, functions like TimeArray(x::TimeDataWrapper) are not type piracy. As a result, we get a function that can export its data to the TimeArray and DataFrame formats. Just to clarify, DataFrame support comes from the fact that TimeDataWrapper follows the Tables.jl API.
  2. Extract this functionality into a separate lightweight package, MarketDataInterface.jl, and ask the owner of TimeSeries.jl to provide support for this package.
  3. We can try to work with the owner of Temporal.jl and ask him to provide support.
  4. I am currently reviving Timestamps.jl and can write the necessary support for it as well.
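To make step 1 a bit more concrete, here is a rough sketch using only Base and Dates. `SimpleSeries` is a hypothetical stand-in for a time-series type such as TimeArray, and `fake_yahoo` is a hypothetical stand-in for the downloader that returns fixed, made-up rows instead of making an HTTP request:

```julia
using Dates

struct TimeDataWrapper{T1, T2}
    meta::T1
    data::T2   # in this sketch: a NamedTuple of column vectors
end

# Hypothetical stand-in for a time-series type such as TimeArray.
struct SimpleSeries
    timestamp::Vector{Date}
    columns::NamedTuple
end

# Sink constructor: it dispatches on the wrapper, so it can read the position
# of the time column from `meta` instead of assuming it is the first one.
function SimpleSeries(w::TimeDataWrapper)
    names = keys(w.data)
    tname = names[w.meta.timestamp]
    rest = Tuple(n for n in names if n != tname)
    return SimpleSeries(w.data[tname], NamedTuple{rest}(w.data))
end

# Hypothetical downloader: no network access, just made-up sample values.
function fake_yahoo(sink = identity)
    data = (Date = [Date(2021, 4, 1), Date(2021, 4, 5)], Close = [100.0, 101.5])
    return sink(TimeDataWrapper((; timestamp = 1), data))
end

series = fake_yahoo(SimpleSeries)
```

The downloader never mentions `SimpleSeries`; any type with a constructor that accepts the wrapper can be passed as the sink.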

As a result, we will have a generic method that works with multiple sinks. Instead of being forced to choose a particular package for financial data, users will be able to use a single package for data sourcing and any package they like for further data processing. It's a win-win situation.

As a further step, Quandl.jl can be revived and it can go through the same procedure. So we will have multiple financial data sources with the same consistent logic.

If this proposal is ok, I can try to go with the first step and we will see how it works out.


iblislin commented Apr 4, 2021

I think the current TimeArray already supports the Tables.jl protocol, so the output can easily be converted to DataFrame or CSV.


iblislin commented Apr 4, 2021

And about the issue: changing signature into the form yahoo(ticker, opt, sink).

> The problem with this direct approach is that it does not generalize. If some other data source does not put the datetime column in the first position, it will break.

From my point of view, AbstractTimeSeries is the only type that cares about the datetime column position. So just making a generated function for it, treating it as a special sink, is enough.

> 4. I am currently reviving Timestamps.jl and can write the necessary support for it as well.

oh, what is the blueprint of this pkg? (And I never investigated it before)

> As a result, we will have a generic method that works with multiple sinks. Instead of being forced to choose a particular package for financial data, users will be able to use a single package for data sourcing and any package they like for further data processing. It's a win-win situation.

I'm still not sure whether this kind of generic method is worth the cost.

  • If AbstractTimeSeries is the only special sink, I will make a generated function. And, IIRC, there is already a yahoo function implemented in Temporal. So what is the scenario in which a user would need this function to work with Timestamps?
  • If not, then I will consider your solution.


Arkoniak commented Apr 5, 2021

Well, here are some more thoughts in support of this proposal.

  1. Performance. It is really great that TimeArrays are compatible with the Tables.jl interface, but converting through them is an extra step. In the proposed scheme you have HTTP.body -> CSV.Rows -> lightweight Wrapper -> Sink. Here CSV.Rows acts as a very performant and lightweight wrapper around the HTTP response body, and Wrapper is in turn a lightweight wrapper around CSV.Rows (or any other Tables.jl format), so the transformation from request to final sink requires a minimal amount of resources. But if we go through TimeArray, we perform a resource-heavy data materialization only to throw it away at the next step. It may not be a big deal compared to the other steps one performs on the data, but it is still rather inefficient.

  2. Time array specialty. As I see it, Tables.jl is a generic table interface, so it doesn't specialize for time series. And time series have additional features compared to generic tables, since they have a DateTime axis. This makes the pure Tables.jl interface a one-way road. You can easily transform a time array to any table (for example TimeArray -> DataFrame), but you can't go back without diving into implementation details, i.e. you have to say explicitly which axis is the time axis. The proposed interface is a small wrapper over Tables.jl that also provides this extra information, so one can convert between different time array representations just as easily as TimeArray -> DataFrame. And this covers not only the current Temporal and TimeArray implementations; it should work for any data format that may appear in the future. And we have no guarantee that one won't appear.

  3. Human factor and other data sources. Aside from yahoo/google/quandl, which are implemented to some degree here and in Temporal.jl, there are a lot of other data sources which may already exist in the form of a package or can appear at some point. To name a few:

and so on.

Now, when a new author creates a package, there is usually a high barrier to including large external dependencies. I think it is much easier to convince a package maintainer to include a lightweight and stable interface that provides conversion abilities for his package than to convince him to include the full-fledged TimeSeries.jl, especially if he is not going to use it (some may prefer to work with DataFrames, others may prefer to roll their own formats).

I see this proposal as a step toward uniting authors so that they do not fracture the ecosystem. And MarketData.jl can serve as a good example of how to do things the right way.


Arkoniak commented Apr 5, 2021

> oh, what is the blueprint of this pkg? (And I never investigated it before)

I guess it's better to move that to a corresponding issue/zulip, so as not to deviate from the main topic.

But in short, there is no blueprint as a written document, except for the original documentation: https://timestampsjl.readthedocs.io/en/latest/. It is going to change to some degree, of course, but mainly you can think of it as a row-wise table. It has some drawbacks, but it also provides huge advantages: https://julialang.zulipchat.com/#narrow/stream/282925-backtesting/topic/Timeseries.20format/near/232870653 I've used this approach in my implementation of a backtesting strategy (it's called TimedEvent, but it is absolutely the same thing as Timestamp):

By the way, that is one of the reasons why I proposed this change to MarketData.jl: it was much easier to convert CSV.Rows to Timestamps directly than by going through TimeArray.


Arkoniak commented Apr 8, 2021

I have made a small prototype that one can toy with to get a feel for whether this approach is good or bad.

https://github.com/Arkoniak/ProtoMarketData.jl

@iblislin
Member

> Well, here are some more thoughts in support of this proposal.
>
>   1. Performance. ...
>   2. Time array specialty.
>   3. Human factor and other data sources.

Ah, okay. I think your point is that you want an interface to bridge the gap between time series data structures and normal table-like data, right? (But I think the scope is quite a bit bigger than the topic in the original post :p)

hmm, I just recalled that I asked about a related issue (JuliaData/Tables.jl#40)

> https://github.com/Arkoniak/ProtoMarketData.jl

Great, I'm going to try it out.

@Arkoniak
Member Author

> hmm, I just recalled that I asked about a related issue (JuliaData/Tables.jl#40)

Ah, that's quite an interesting discussion. But after playing with a "timeseries" interface for some time, I do not think it is necessary to ask for such functionality to be added to Tables.jl (of course that would make the system less disjointed, but that's all). One can use an additional package which:

  1. Defines a notion of timeaxis, a function that timeseries-related packages may overload as they see fit.
  2. For timeseries-agnostic packages (like DataFrames.jl), treats the first column as the index column. That will cover 99% of cases.
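The two points above could be sketched roughly like this, using only Base and Dates. The function name `timeaxis` comes from the list above, but its signature and return value (the column name as a Symbol) are my assumptions, and `TradeLog` is a hypothetical type with made-up values:

```julia
using Dates

# Assumed shape of the interface function: return the name of the time column.
# Fallback for timeseries-agnostic tables: treat the first column as the index.
timeaxis(table) = first(propertynames(table))

# Hypothetical timeseries-aware type that stores its time column last.
struct TradeLog
    price::Vector{Float64}
    time::Vector{DateTime}
end

# A timeseries-related package overloads the function for its own type.
timeaxis(::TradeLog) = :time

plain = (Date = [Date(2021, 4, 1)], Close = [100.0])        # generic table
trades = TradeLog([100.0], [DateTime(2021, 4, 1, 9, 30)])   # custom type

plain_axis = timeaxis(plain)     # uses the first-column fallback
trades_axis = timeaxis(trades)   # uses the overload
```

With something like this, a conversion routine only has to call `timeaxis(x)` and does not need to know which package `x` comes from.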

> Ah, okay. I think your point is that you want an interface to bridge the gap between time series data structures and normal table-like data, right? (But I think the scope is quite a bit bigger than the topic in the original post :p)

Well, I am more interested in gluing together time series data structures (current and future ones). And yes, it's bigger than the original topic, but at the same time it is a side effect of the original intention to have a "sink"-agnostic data source. It's rather tiresome to have different packages where each one invents its own way to store the resulting data.


iblislin commented May 8, 2021

@Arkoniak I'm still busy on TimeSeries.jl (JuliaStats/TimeSeries.jl#482).

How about just going with your interface package (https://github.com/Arkoniak/ProtoMarketData.jl) right now?
Maybe transfer it to this org and apply the new interface for MarketData.jl.
