Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add automatic time-series conversion to read_table #238

Closed
milktrader opened this issue Apr 7, 2013 · 15 comments
Closed

Add automatic time-series conversion to read_table #238

milktrader opened this issue Apr 7, 2013 · 15 comments

Comments

@milktrader
Copy link

There are two ways to go about this. One is to add kwargs to readtable and the other is to add a new method call it readtimetable or some such.

This is important because it will decide how DataFrames prefers to handle time series. For some background, here is how R's read.zoo is designed.

> args(read.zoo)
function (file, format = "", tz = "", FUN = NULL, regular = FALSE, 
 index.column = 1, drop = TRUE, FUN2 = NULL, split = NULL,    
 aggregate = FALSE, ..., text) 

If readtable is enhanced, the kwarg time=true can be added. I think the important args that should be adopted include format, which tells Calendar.parse how to treat the string, tz, which allows users to override their local setting and index.column which defines which column is the time-series (most cases it's the first column)

This references a separate issue on the Quandl package, which opens up an API to open source time-based data in a wide variety of disciplines.

milktrader/Quandl.jl#1 (comment)

@milktrader
Copy link
Author

Some ideas on coding.

 #converting a column of strings to type Calendar

time_array = Calendar.parse("yyyy-MM-dd", string_array[:,1])

# above line taking the format argument (probably called fmt) and column to convert (timecol)

time_array = Calendar.parse(fmt, string_array[:,timecol])

# build the DataFrame called df here

# check the order is ascending

if time_array[1] > time_array[2]
  flipud!(df)
end

@HarlanH
Copy link
Contributor

HarlanH commented Apr 7, 2013

What are the implications for dependencies on other packages?

I'd generally prefer using kwargs for something like this, instead of
another method, everything else being equal...

On Sun, Apr 7, 2013 at 1:55 PM, milktrader [email protected] wrote:

There are two ways to go about this. One is to add kwargs to readtableand the other is to add a new method call it
readtimetable or some such.

This is important because it will decide how DataFrames prefers to handle
time series. For some background, here is how R's read.zoo is designed.

args(read.zoo)function (file, format = "", tz = "", FUN = NULL, regular = FALSE, index.column = 1, drop = TRUE, FUN2 = NULL, split = NULL, aggregate = FALSE, ..., text)

If readtable is enhanced, the kwarg time=true can be added. I think the
important args that should be adopted include format, which tells
Calendar.parse how to treat the string, tz, which allows users to
override their local setting and index.column which defines which column
is the time-series (most cases it's the first column)

This references a separate issue on the Quandl package, which opens up an
API to open source time-based data in a wide variety of disciplines.

milktrader/Quandl.jl#1milktrader/Quandl.jl#1 (comment)


Reply to this email directly or view it on GitHubhttps://github.com//issues/238
.

@milktrader
Copy link
Author

Calendar would be a dependency, but it's lightweight and stable. Yeah, I prefer add time=true to readtable as well.

@milktrader
Copy link
Author

I have this working in a DataFrames branch with the following code (and alias). I didn't add a kwargs to the method (still need to see that implemented). This is from the io.jl file.

#function read_table{T <: String}(filename::T)
function read_table{T<:String}(filename::T, fmt::String, indexby::Int)
    # Do inference for missing configuration settings
    separator = determine_separator(filename)
    quotation_character = DEFAULT_QUOTATION_CHARACTER
    missingness_indicators = DEFAULT_MISSINGNESS_INDICATORS
    header = true
    nrows = determine_nrows(filename, header)
    io = open(filename, "r")
    column_names = determine_column_names(io, separator, quotation_character, header)
    df = read_table(io,
                    separator,
                    quotation_character,
                    missingness_indicators,
                    header,
                    column_names,
                    nrows)
  close(io)
#############################
  within!(df, quote
               Date = parse_date($fmt, $df[:,$indexby])
               end)
  df["Date"] = IndexedVector(df["Date"])
  if df[1,1] > df[2,1]
    flipud!(df)
  end
 #############################
  return df
end

 read_table(filename) = read_table(filename, "yyyy-MM-dd", 1)

This also requires Calendar to work

@tshort
Copy link
Contributor

tshort commented Apr 8, 2013

After looking at it, I think I'd prefer that this be in a separate function. Although, I'd use this functionality, I think it clutters read_table too much. I think a more Julian approach is to have more (smaller) functions.

read_table will probably undergo some API churn now that we have kwargs, so I could see this being in or out of read_table depending on what happens with other decisions.

@milktrader
Copy link
Author

What about putting a read_time_table in TimeSeries instead? This would avoid the need to depend on Calendar for just one function. TimeSeries already depends on Calendar

@tshort
Copy link
Contributor

tshort commented Apr 8, 2013

That's probably a good idea. Also, a function that operates on an existing
DataFrame might be good.

On Sun, Apr 7, 2013 at 10:11 PM, milktrader [email protected]:

What about putting a read_time_table in TimeSeries instead? This would
avoid the need to depend on Calendar for just one function. TimeSeriesalready depends on
Calendar


Reply to this email directly or view it on GitHubhttps://github.com//issues/238#issuecomment-16029415
.

@milktrader
Copy link
Author

Alright, I'd be willing to do that. I'll just copy and paste (I think it was John's) io.jl code and adapt it for time series. I'm not sure how it does what it does, but I can sort out the pieces. I'll leave this open for someone else to close, but I think the Calendar dependency will convince most this is the best course.

Also, if anyone wants to write it before I do, please feel free. I'm glad to add any contributors.

@tshort
Copy link
Contributor

tshort commented Apr 8, 2013

Do you really have to adapt the io.jl code? I was thinking you might just
be able to call read_table and manipulate the results to get it in the
form you want. If that's not the case, we should design read_table to be
flexible enough to allow this.

On Mon, Apr 8, 2013 at 7:49 AM, milktrader [email protected] wrote:

Alright, I'd be willing to do that. I'll just copy and paste (I think it
was John's) io.jl code and adapt it for time series. I'm not sure how it
does what it does, but I can sort out the pieces. I'll leave this open for
someone else to close, but I think the Calendar dependency will convince
most this is the best course.

Also, if anyone wants to write it before I do, please feel free. I'm glad
to add any contributors.


Reply to this email directly or view it on GitHubhttps://github.com//issues/238#issuecomment-16045953
.

@johnmyleswhite
Copy link
Contributor

Tom has a point: why can't you build up a DataFrame and then handle the conversion to time series columns after the fact?

@milktrader
Copy link
Author

Okay, much better. Thanks.

@milktrader
Copy link
Author

Well it turns out I implemented this already for testing, so just cleaning it up and now I'll export read_time

@milktrader
Copy link
Author

Now that we have Datetime I thought it might be worth revisiting this issue. The current readtable can parse a Float or an Int but when it comes to a string that is clearly a date, it simply parses it as a string of a date. The Datetime package can handle a reasonable amount of date strings and the code to convert the string to Datetime is basically a one-liner.

This does require a dependency on Datetime obviously, but that would be the only drawback. If (when?) Datetime gets merged into Base this would become a non-issue. The benefit is that it provides a cohesive, consistent interface to using a time-aware Dataframe in time modeling (JuliaLang/METADATA.jl#336 (comment)), online datasets (milktrader/Quandl.jl#8) and even time plotting (GiovineItalia/Gadfly.jl#78)

@milktrader milktrader reopened this Oct 29, 2013
@milktrader
Copy link
Author

Okay, I said a one-liner (I think you can still do this but why) and here is a few lines that would recognize and convert a Time column.

    for col in colnames(x)
       ismatch(r"(?i)date", col)? 
       x[col] = Date[date(d) for d in x[col]]:
       Nothing
     end

h/t @dcjones for the constructor comprehension.

@milktrader
Copy link
Author

Added a pull request #386

Maybe conversation about this idea can move over there? I'll close this one since it's pretty old (it should still have a reference on the pull request).

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants