R - Formatting R Dates from a log date/time stamp
17/09/09 15:26 Filed in: R Language | statistics
Scenario - I have Snort IDS log data which has a date/time stamp that looks like this "Tue Sep 15 09:22:09 -0600 2009" I need to be able to turn the character string date/time stamp into an “R” Date object so I can do comparisons and subset extracts.
Simple date conversions I have down, no problem:
But when it comes to the type of date/time string format I have above, I can't figure out a format string that will work.
The timezone offset is one part that causes problems. Building up to a working format string for the full time stamp string, I can make it as far as:
(apparently year defaults to current year when it's not specified in the format string). Because the Year comes after the timezone offset, I have to deal with the timezone offset in the format string.
But when I get to the timezone offset value I can't use "%z" or "%Z" because those are "output only"
I'm close, but can't incorporate the timezone offset field in the date/time stamp string.
What am I missing? Tweet me @esawdust with any suggestions for how to get around this problem with just the as.Date method.
I suppose one workaround is to split the date/time string into its component parts, reassemble it into a string as.Date can deal with, but that seems to defeat one of the purposes of as.Date's format capability.
Any advice for how to translate a "Tue Sep 15 09:22:09 -0600 2009" into an R Date object?
[Update 9/18/09 - posted my question to the Nabble R-Help forum (which is great, BTW)]
This works out to be close - Kudos go to Gabor for the basic solution. However there are some issues with the code as you will see in a bit.
In order to use the gsubfn you have to first install the package:
after a fairly lengthy download with dependencies, you can run the R code:
You can see the day from the result is 21 not 15 - this is because the offset used was 600 (hours) instead of 6 hours. The fix is to interpret the offset value as a time in hours/minutes (some timezones are not on an even hour - but on a 1/2 hour for example.) But for most timezone purposes, you can divide the offset by 100 and get the right offset.
That offset fix (except for TZ’s that live in a 1/2 hour offset - a case I don’t have to worry about) looks like this:
Date is now correct, but the time seems odd - it’s the same as given. However, if I change the offset in the time stamp string, the resulting chron time also changes, so it’s being used by as.chron as follows:
More playing with as.POSIXct and as.chron is in order to figure out what’s going on.
Get the current time on the system (which lives in the MDT TZ), as a time relative to GMT:
You can see these two times are 6 hours apart (local system is -6):
That generates the correct local time, so if the time is converted to GMT it works fine. However, if you do the computation with another TZ,
It’s clear that as.chron is not internally using the timezone in the given time object, p.
Other examples of “R” times not behaving well (at least according to the docs), if you look at the ISOdate() function, it takes a Timezone that’s to be used in conversion. It appears to be completely ignored as well, as the following examples show:
At least this is a codeable solution to get Chron or DateTime class objects from a log timestamp like "Tue Sep 15 09:22:09 -0600 2009" if the time is first converted to GMT. So, that’s unfortunate, but is the reality.
What’s is nice I found is that it will work just as well on a collection of dates. Say you created multiple dates in a collection:
Then hand strapply() the dates vector instead of a literal string and it will convert both in the same call:
Summary
All told with the “R” date/time classes, my conclusion is this:
If you can do your date conversion in the data outside of “R” you are probably better off to convert the date in the data before import. The “R” date/time classes are clunky and produce unexpected results in most cases except the simplest (pure GMT or local times, but arbitrary timezones are not well handled in “R”.)
Simple date conversions I have down, no problem:
But when it comes to the type of date/time string format I have above, I can't figure out a format string that will work.
The timezone offset is one part that causes problems. Building up to a working format string for the full time stamp string, I can make it as far as:
(apparently year defaults to current year when it's not specified in the format string). Because the Year comes after the timezone offset, I have to deal with the timezone offset in the format string.
But when I get to the timezone offset value I can't use "%z" or "%Z" because those are "output only"
I'm close, but can't incorporate the timezone offset field in the date/time stamp string.
What am I missing? Tweet me @esawdust with any suggestions for how to get around this problem with just the as.Date method.
I suppose one workaround is to split the date/time string into its component parts, reassemble it into a string as.Date can deal with, but that seems to defeat one of the purposes of as.Date's format capability.
Any advice for how to translate a "Tue Sep 15 09:22:09 -0600 2009" into an R Date object?
[Update 9/18/09 - posted my question to the Nabble R-Help forum (which is great, BTW)]
This works out to be close - Kudos go to Gabor for the basic solution. However there are some issues with the code as you will see in a bit.
In order to use the gsubfn you have to first install the package:
after a fairly lengthy download with dependencies, you can run the R code:
You can see the day from the result is 21 not 15 - this is because the offset used was 600 (hours) instead of 6 hours. The fix is to interpret the offset value as a time in hours/minutes (some timezones are not on an even hour - but on a 1/2 hour for example.) But for most timezone purposes, you can divide the offset by 100 and get the right offset.
That offset fix (except for TZ’s that live in a 1/2 hour offset - a case I don’t have to worry about) looks like this:
Date is now correct, but the time seems odd - it’s the same as given. However, if I change the offset in the time stamp string, the resulting chron time also changes, so it’s being used by as.chron as follows:
More playing with as.POSIXct and as.chron is in order to figure out what’s going on.
Get the current time on the system (which lives in the MDT TZ), as a time relative to GMT:
You can see these two times are 6 hours apart (local system is -6):
That generates the correct local time, so if the time is converted to GMT it works fine. However, if you do the computation with another TZ,
It’s clear that as.chron is not internally using the timezone in the given time object, p.
Other examples of “R” times not behaving well (at least according to the docs), if you look at the ISOdate() function, it takes a Timezone that’s to be used in conversion. It appears to be completely ignored as well, as the following examples show:
At least this is a codeable solution to get Chron or DateTime class objects from a log timestamp like "Tue Sep 15 09:22:09 -0600 2009" if the time is first converted to GMT. So, that’s unfortunate, but is the reality.
What’s is nice I found is that it will work just as well on a collection of dates. Say you created multiple dates in a collection:
Then hand strapply() the dates vector instead of a literal string and it will convert both in the same call:
Summary
All told with the “R” date/time classes, my conclusion is this:
If you can do your date conversion in the data outside of “R” you are probably better off to convert the date in the data before import. The “R” date/time classes are clunky and produce unexpected results in most cases except the simplest (pure GMT or local times, but arbitrary timezones are not well handled in “R”.)
asdfasdf