This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

No such file or directory: 'sample_logs/counter_2018-05-01.log' #7

Open
pdurbin opened this issue Jan 30, 2019 · 3 comments

Comments

@pdurbin
Contributor

pdurbin commented Jan 30, 2019

I'm finding that when I run Counter Processor for the first time, I have to touch (create) files for all the days leading up to the first day that I actually have logs for.

For example, if counter_2018-05-08.log is my only log file, I'll touch the files for 01 through 07 like this:

touch sample_logs/counter_2018-05-01.log
touch sample_logs/counter_2018-05-02.log
touch sample_logs/counter_2018-05-03.log
touch sample_logs/counter_2018-05-04.log
touch sample_logs/counter_2018-05-05.log
touch sample_logs/counter_2018-05-06.log
touch sample_logs/counter_2018-05-07.log

In practice, I just touch all the possible dates like this:

for i in {01..31}; do
  # avoid errors like: No such file or directory: 'sample_logs/counter_2018-05-01.log'
  touch sample_logs/counter_2018-05-$i.log
done
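
A narrower variant of the loop above, as a minimal sketch: it creates empty placeholder files only for the days before the first real log in the month, respects the actual length of the month, and never touches an existing file. The helper name `touch_missing_logs` and its parameters are hypothetical and not part of Counter Processor; only the `counter_yyyy-mm-dd.log` naming comes from the examples above.

```python
import calendar
from pathlib import Path


def touch_missing_logs(log_dir, year, month, prefix="counter_"):
    """Create empty placeholder logs for the days before the first real log.

    Hypothetical helper, not part of Counter Processor.
    Returns the list of paths that were created.
    """
    log_dir = Path(log_dir)
    days_in_month = calendar.monthrange(year, month)[1]
    paths = [
        log_dir / f"{prefix}{year:04d}-{month:02d}-{day:02d}.log"
        for day in range(1, days_in_month + 1)
    ]
    # Index of the first day that already has a log file, if any.
    first_existing = next(
        (i for i, path in enumerate(paths) if path.exists()), None
    )
    if first_existing is None:
        return []  # no real logs for this month, so create nothing
    created = paths[:first_existing]
    for path in created:
        path.touch()
    return created
```

With `counter_2018-05-08.log` as the only log, `touch_missing_logs("sample_logs", 2018, 5)` would create the files for 01 through 07 and leave everything else alone.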

Is this a bug or am I simply confused or doing something wrong?

@sfisher
Contributor

sfisher commented Jan 31, 2019

Hi Philip,

Sorry for the problems. By default, it processes every daily log for a monthly report that it thinks it needs and that hasn't been processed yet (up until yesterday). It tracks which daily logs have already been processed for the month in state/statefile.json, and it will not re-process days that have already gone into the database (based on the last_processed_day value).

It has a bunch of automation in there for bringing the logs up to date through yesterday or the last day of the month, whichever comes first.

One option is to set the state in the JSON file so that it thinks it has already processed up to day 7 for that month and year. Having to edit state/statefile.json each time is pretty ugly, though. Your solution of just creating blank files is probably better.

Another option, I think, is to explicitly change the log_name_pattern to process just one file, or to process files in a more manual way. It replaces the "(yyyy-mm-dd)" string with actual dates, and if that string isn't in the filename pattern, no date is substituted.

Something like this might do the job, since the pattern doesn't contain a string like (yyyy-mm-dd) to indicate a filename replacement.

YEAR_MONTH=2018-05 LOG_NAME_PATTERN="log/counter_2018-05-08.log" UPLOAD_TO_HUB=False ./main.py

I'm looking at the code at https://github.com/CDLUC3/counter-processor/blob/6f9459f25cb1c7c01660edbf744a4c049ce8abed/config/config.py
and the "def filenames_to_process():" method. It may produce the same filename multiple times (once for each day it thinks it hasn't processed yet), but it might be easy to change that behavior by returning early with the literal filename if no date replacement pattern is found.
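
To make the early-return idea concrete, here is a minimal sketch of date-placeholder expansion. It is not the code in config/config.py: the function name `expand_log_pattern`, its parameters, and the `PLACEHOLDER` constant are assumptions made for illustration.

```python
from datetime import date, timedelta

# Placeholder string described in this thread.
PLACEHOLDER = "(yyyy-mm-dd)"


def expand_log_pattern(pattern, start, end):
    """Expand a log name pattern into one filename per day, start to end.

    Illustrative only. If the pattern has no date placeholder, it is
    returned once as a literal filename instead of once per day.
    """
    if PLACEHOLDER not in pattern:
        return [pattern]  # early return avoids repeating the same file
    names = []
    day = start
    while day <= end:
        names.append(pattern.replace(PLACEHOLDER, day.isoformat()))
        day += timedelta(days=1)
    return names
```

Under these assumptions, a literal pattern such as `log/counter_2018-05-08.log` yields a single filename no matter how many days fall in the range.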

The other thing: I'm not sure what dates it would think it had processed already for that month. Clearly if someone is going to feed different files in manually then they'll need to track stuff themselves and not re-process data that has already gone into the database. Most likely it would eliminate the duplicate lines as "double-clicks" anyway, but it would be a waste of time to re-read duplicate log files.

Another option might be to explicitly let people set the "last_processed_day" for the month manually (via an environment variable or similar), so it would only process log files with names after the last_processed_day until the end date.

I believe the SIMULATE_DATE option just tells it to suck in files up until the day before that date. If not set, then the end of the day before the current day or the end of the month is the cutoff for the end date.
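
The date logic described above can be sketched roughly as follows. This is a hypothetical reconstruction from the description in this thread, not the tool's actual implementation; `processing_window` and its parameters are invented names, and `last_processed_day=0` is assumed to mean that nothing has been processed yet.

```python
import calendar
from datetime import date, timedelta


def processing_window(year, month, last_processed_day=0, simulate_date=None):
    """Return (first_day, last_day) to process, or None if nothing is left.

    Hypothetical sketch of the behaviour described in this thread:
    - start on the day after last_processed_day
    - stop on the day before "today" (or the simulated date),
      but never later than the last day of the month
    """
    today = simulate_date or date.today()
    month_end = date(year, month, calendar.monthrange(year, month)[1])
    last_day = min(today - timedelta(days=1), month_end)
    first_day = date(year, month, 1) + timedelta(days=last_processed_day)
    if first_day > last_day:
        return None
    return first_day, last_day
```

For example, with a simulated date of 2019-01-28 and nothing processed yet, the window would run from 2019-01-01 through 2019-01-27.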

Probably in the long term it would make sense to have more manual ways to specify some of these things for people who want to manage processing more manually. Also, splitting the option for sucking log lines into the database vs calculating stats would be nice.

I can look and test a little more in the morning to see if there is an easy way to skip some daily log files earlier in a month.

@sfisher
Contributor

sfisher commented Jan 31, 2019

Thanks for the feedback, Philip.

I've simplified the processing model a little for those who don't want to do full, in-order, daily log processing for past data for a month. It's in the branch https://github.com/CDLUC3/counter-processor/tree/non-daily-logs . It's kind of a quick fix and there are ways it could be more clear, but I think it works. Give it a try and if things look OK to you I'll merge it into our master branch and create another release.

Here is an example like yours that I tested with the slightly revised code, and it worked for me.

YEAR_MONTH=2019-01 LOG_NAME_PATTERN="/path/to/my/log/counter_2019-01-27.log" UPLOAD_TO_HUB=False SIMULATE_DATE=2019-01-28 ./main.py
  • Year and month is set manually.
  • The log name pattern doesn't have the (yyyy-mm-dd) string in it, which would normally be replaced, so the log name is taken literally as only one log file.
  • The simulated date is one day after the end of the data you wish to process. This will save the state so it indicates everything up until the end of 2019-01-27 has been processed.
  • You do not need to touch and create empty files this way if you're missing data at the beginning of a month and that missing data will not be included in the report. (Of course you want to process full data for the month, in order, if you have it.)
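
The settings in the example above can be derived from the log filename itself. The sketch below only assembles the environment variables and does not run Counter Processor; `single_log_settings` is a hypothetical helper, and it assumes the first yyyy-mm-dd date found in the path is the day the log covers.

```python
import re
from datetime import date, timedelta


def single_log_settings(log_path):
    """Build the environment settings for processing one daily log file.

    Hypothetical helper: it assumes the first yyyy-mm-dd date in the path
    is the day the log covers, and sets SIMULATE_DATE to the day after.
    """
    match = re.search(r"(\d{4})-(\d{2})-(\d{2})", log_path)
    if match is None:
        raise ValueError(f"no yyyy-mm-dd date found in {log_path!r}")
    log_day = date(*map(int, match.groups()))
    return {
        "YEAR_MONTH": f"{log_day:%Y-%m}",
        "LOG_NAME_PATTERN": log_path,
        "UPLOAD_TO_HUB": "False",
        "SIMULATE_DATE": (log_day + timedelta(days=1)).isoformat(),
    }
```

The resulting dictionary could be merged into a copy of `os.environ` and passed as `env` to `subprocess.run` when invoking `./main.py`.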

We will also be doing something like this soon for Dryad, which is producing full monthly files (rather than daily ones) for back-processing its old usage data.

I assume we'll be using this too, probably running something like the following for each month, one month at a time, rather than using daily logs for their historical data.

YEAR_MONTH=2014-01 LOG_NAME_PATTERN="/path/to/my/log/counter_2014-01.log" UPLOAD_TO_HUB=False ./main.py

PS. I believe "simulate date" is unnecessary when doing an old full-month report, since the default end date is the end of the month as long as the current clock time is after the end of that month.
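
For back-processing several months in a row, the per-month settings could be generated in a loop. This is a sketch only: `monthly_settings` and its `template` parameter are hypothetical, and the default path simply mirrors the placeholder path in the command above.

```python
def monthly_settings(start_year, start_month, end_year, end_month,
                     template="/path/to/my/log/counter_{ym}.log"):
    """Yield one settings dict per month, oldest first (inclusive range).

    Hypothetical helper: it only builds the environment variables shown
    in this thread and does not run Counter Processor.
    """
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        ym = f"{year:04d}-{month:02d}"
        yield {
            "YEAR_MONTH": ym,
            "LOG_NAME_PATTERN": template.format(ym=ym),
            "UPLOAD_TO_HUB": "False",
        }
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
```

Each dictionary corresponds to one run of `./main.py`, processed one month at a time and in order.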

@pdurbin
Contributor Author

pdurbin commented Feb 1, 2019

@sfisher thanks for the detailed brain dump and the proposed code change at 92f7d8f. It looks like it will give us more flexibility in choosing between yyyy-mm-dd or not. For my reference, the code on our end is here:

https://github.com/IQSS/dataverse/blob/3d7fdcce0fef4bef39158ecd39a1d74617a3758d/src/main/java/edu/harvard/iq/dataverse/makedatacount/MakeDataCountLoggingServiceBean.java#L46

@matthew-a-dunlap and I discussed all of this yesterday, and I think we're going to stick to our touch solution for now rather than changing our code and docs. Again, though, we appreciate the flexibility we will have in the future if you merge that commit to master.

Thanks again, especially simply for confirming that some kind of workaround or code change is necessary. This was on our list to ask you about! (Please stay tuned for at least one more unrelated question. 😄 ) Thanks!
