Add automatic time-series conversion to read_table #238

milktrader · 2013-04-07T17:55:10Z

There are two ways to go about this. One is to add kwargs to readtable and the other is to add a new method call it readtimetable or some such.

This is important because it will decide how DataFrames prefers to handle time series. For some background, here is how R's read.zoo is designed.

> args(read.zoo)
function (file, format = "", tz = "", FUN = NULL, regular = FALSE, 
 index.column = 1, drop = TRUE, FUN2 = NULL, split = NULL,    
 aggregate = FALSE, ..., text)

If readtable is enhanced, the kwarg time=true can be added. I think the important args that should be adopted include format, which tells Calendar.parse how to treat the string, tz, which allows users to override their local setting and index.column which defines which column is the time-series (most cases it's the first column)

This references a separate issue on the Quandl package, which opens up an API to open source time-based data in a wide variety of disciplines.

milktrader/Quandl.jl#1 (comment)

The text was updated successfully, but these errors were encountered:

milktrader · 2013-04-07T18:00:31Z

Some ideas on coding.

 #converting a column of strings to type Calendar

time_array = Calendar.parse("yyyy-MM-dd", string_array[:,1])

# above line taking the format argument (probably called fmt) and column to convert (timecol)

time_array = Calendar.parse(fmt, string_array[:,timecol])

# build the DataFrame called df here

# check the order is ascending

if time_array[1] > time_array[2]
  flipud!(df)
end

HarlanH · 2013-04-07T18:00:42Z

What are the implications for dependencies on other packages?

I'd generally prefer using kwargs for something like this, instead of
another method, everything else being equal...

On Sun, Apr 7, 2013 at 1:55 PM, milktrader [email protected] wrote:

There are two ways to go about this. One is to add kwargs to readtableand the other is to add a new method call it
readtimetable or some such.

This is important because it will decide how DataFrames prefers to handle
time series. For some background, here is how R's read.zoo is designed.

args(read.zoo)function (file, format = "", tz = "", FUN = NULL, regular = FALSE, index.column = 1, drop = TRUE, FUN2 = NULL, split = NULL, aggregate = FALSE, ..., text)

If readtable is enhanced, the kwarg time=true can be added. I think the
important args that should be adopted include format, which tells
Calendar.parse how to treat the string, tz, which allows users to
override their local setting and index.column which defines which column
is the time-series (most cases it's the first column)

This references a separate issue on the Quandl package, which opens up an
API to open source time-based data in a wide variety of disciplines.

milktrader/Quandl.jl#1 milktrader/Quandl.jl#1 (comment)

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/238
.

milktrader · 2013-04-07T18:02:02Z

Calendar would be a dependency, but it's lightweight and stable. Yeah, I prefer add time=true to readtable as well.

milktrader · 2013-04-07T22:20:28Z

I have this working in a DataFrames branch with the following code (and alias). I didn't add a kwargs to the method (still need to see that implemented). This is from the io.jl file.

#function read_table{T <: String}(filename::T)
function read_table{T<:String}(filename::T, fmt::String, indexby::Int)
    # Do inference for missing configuration settings
    separator = determine_separator(filename)
    quotation_character = DEFAULT_QUOTATION_CHARACTER
    missingness_indicators = DEFAULT_MISSINGNESS_INDICATORS
    header = true
    nrows = determine_nrows(filename, header)
    io = open(filename, "r")
    column_names = determine_column_names(io, separator, quotation_character, header)
    df = read_table(io,
                    separator,
                    quotation_character,
                    missingness_indicators,
                    header,
                    column_names,
                    nrows)
  close(io)
#############################
  within!(df, quote
               Date = parse_date($fmt, $df[:,$indexby])
               end)
  df["Date"] = IndexedVector(df["Date"])
  if df[1,1] > df[2,1]
    flipud!(df)
  end
 #############################
  return df
end

 read_table(filename) = read_table(filename, "yyyy-MM-dd", 1)

This also requires Calendar to work

tshort · 2013-04-08T02:05:23Z

After looking at it, I think I'd prefer that this be in a separate function. Although, I'd use this functionality, I think it clutters read_table too much. I think a more Julian approach is to have more (smaller) functions.

read_table will probably undergo some API churn now that we have kwargs, so I could see this being in or out of read_table depending on what happens with other decisions.

milktrader · 2013-04-08T02:11:29Z

What about putting a read_time_table in TimeSeries instead? This would avoid the need to depend on Calendar for just one function. TimeSeries already depends on Calendar

tshort · 2013-04-08T11:17:17Z

That's probably a good idea. Also, a function that operates on an existing
DataFrame might be good.

On Sun, Apr 7, 2013 at 10:11 PM, milktrader [email protected]:

What about putting a read_time_table in TimeSeries instead? This would
avoid the need to depend on Calendar for just one function. TimeSeriesalready depends on
Calendar

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/238#issuecomment-16029415
.

milktrader · 2013-04-08T11:49:41Z

Alright, I'd be willing to do that. I'll just copy and paste (I think it was John's) io.jl code and adapt it for time series. I'm not sure how it does what it does, but I can sort out the pieces. I'll leave this open for someone else to close, but I think the Calendar dependency will convince most this is the best course.

Also, if anyone wants to write it before I do, please feel free. I'm glad to add any contributors.

tshort · 2013-04-08T12:22:57Z

Do you really have to adapt the io.jl code? I was thinking you might just
be able to call read_table and manipulate the results to get it in the
form you want. If that's not the case, we should design read_table to be
flexible enough to allow this.

On Mon, Apr 8, 2013 at 7:49 AM, milktrader [email protected] wrote:

Alright, I'd be willing to do that. I'll just copy and paste (I think it
was John's) io.jl code and adapt it for time series. I'm not sure how it
does what it does, but I can sort out the pieces. I'll leave this open for
someone else to close, but I think the Calendar dependency will convince
most this is the best course.

Also, if anyone wants to write it before I do, please feel free. I'm glad
to add any contributors.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/238#issuecomment-16045953
.

johnmyleswhite · 2013-04-08T12:47:10Z

Tom has a point: why can't you build up a DataFrame and then handle the conversion to time series columns after the fact?

milktrader · 2013-04-08T14:36:17Z

Okay, much better. Thanks.

milktrader · 2013-04-08T15:55:42Z

Well it turns out I implemented this already for testing, so just cleaning it up and now I'll export read_time

milktrader · 2013-10-29T21:13:37Z

Now that we have Datetime I thought it might be worth revisiting this issue. The current readtable can parse a Float or an Int but when it comes to a string that is clearly a date, it simply parses it as a string of a date. The Datetime package can handle a reasonable amount of date strings and the code to convert the string to Datetime is basically a one-liner.

This does require a dependency on Datetime obviously, but that would be the only drawback. If (when?) Datetime gets merged into Base this would become a non-issue. The benefit is that it provides a cohesive, consistent interface to using a time-aware Dataframe in time modeling (JuliaLang/METADATA.jl#336 (comment)), online datasets (milktrader/Quandl.jl#8) and even time plotting (GiovineItalia/Gadfly.jl#78)

milktrader · 2013-10-29T21:19:08Z

Okay, I said a one-liner (I think you can still do this but why) and here is a few lines that would recognize and convert a Time column.

    for col in colnames(x)
       ismatch(r"(?i)date", col)? 
       x[col] = Date[date(d) for d in x[col]]:
       Nothing
     end

h/t @dcjones for the constructor comprehension.

milktrader · 2013-10-30T18:18:26Z

Added a pull request #386

Maybe conversation about this idea can move over there? I'll close this one since it's pretty old (it should still have a reference on the pull request).

milktrader closed this as completed Apr 8, 2013

milktrader reopened this Oct 29, 2013

milktrader closed this as completed Oct 30, 2013

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add automatic time-series conversion to read_table #238

Add automatic time-series conversion to read_table #238

milktrader commented Apr 7, 2013

milktrader commented Apr 7, 2013

HarlanH commented Apr 7, 2013

milktrader commented Apr 7, 2013

milktrader commented Apr 7, 2013

tshort commented Apr 8, 2013

milktrader commented Apr 8, 2013

tshort commented Apr 8, 2013

milktrader commented Apr 8, 2013

tshort commented Apr 8, 2013

johnmyleswhite commented Apr 8, 2013

milktrader commented Apr 8, 2013

milktrader commented Apr 8, 2013

milktrader commented Oct 29, 2013

milktrader commented Oct 29, 2013

milktrader commented Oct 30, 2013

Add automatic time-series conversion to read_table #238

Add automatic time-series conversion to read_table #238

Comments

milktrader commented Apr 7, 2013

milktrader commented Apr 7, 2013

HarlanH commented Apr 7, 2013

milktrader commented Apr 7, 2013

milktrader commented Apr 7, 2013

tshort commented Apr 8, 2013

milktrader commented Apr 8, 2013

tshort commented Apr 8, 2013

milktrader commented Apr 8, 2013

tshort commented Apr 8, 2013

johnmyleswhite commented Apr 8, 2013

milktrader commented Apr 8, 2013

milktrader commented Apr 8, 2013

milktrader commented Oct 29, 2013

milktrader commented Oct 29, 2013

milktrader commented Oct 30, 2013