R Language

R - Accessing elements of a list from strsplit

Scenario - In order to extract certain parts of a string, it’s useful to use the R strsplit function, but it returns a list and it’s not immediately obvious how to access what you want in the list. Here are some attempts followed by the solution.



It's not clear how to access any aspect of the split list elements.


Finally, stumble on the form needed to address an element of the split list:
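A minimal sketch of that form, using a made-up comma-separated string (the variable names are mine, purely for illustration). strsplit returns a list with one character vector per input string, so the double bracket picks the vector and a single bracket then picks the piece:

```r
# Hypothetical input string, just for illustration
line <- "alpha,beta,gamma"

parts <- strsplit(line, ",")   # a list of length 1

class(parts)      # the outer container is a list
parts[[1]]        # the character vector of pieces for the first string
parts[[1]][2]     # the second piece of the first string
```

The same pattern applies when strsplit is given a vector of strings: parts[[i]][j] is piece j of string i.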


R - Formatting R Dates from a log date/time stamp

Scenario - I have Snort IDS log data which has a date/time stamp that looks like this: "Tue Sep 15 09:22:09 -0600 2009". I need to be able to turn the character string date/time stamp into an “R” Date object so I can do comparisons and subset extracts.

Simple date conversions I have down, no problem:
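For example, something along these lines (the literal dates are made up for illustration):

```r
# ISO-style dates need no format string
d1 <- as.Date("2009-09-15")

# Other layouts need an explicit format
d2 <- as.Date("09/15/2009", format = "%m/%d/%Y")

d1 == d2                      # both describe the same day
d1 > as.Date("2009-09-01")    # Date objects compare directly
```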


But when it comes to the type of date/time string format I have above, I can't figure out a format string that will work.

The timezone offset is one part that causes problems.  Building up to a working format string for the full time stamp string, I can make it as far as:
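A sketch of that kind of partial format string, which stops just before the offset. It assumes an English locale so that %a and %b match "Tue" and "Sep", and the variable names are mine:

```r
stamp <- "Tue Sep 15 09:22:09 -0600 2009"

# Format covers weekday, month, day and time only; the rest of the
# string (offset and year) is left unparsed.
partial <- strptime(stamp, format = "%a %b %d %H:%M:%S", tz = "GMT")

format(partial, "%m-%d %H:%M:%S")   # month, day and time as parsed
format(partial, "%Y")               # the year was never read from the string
```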



(apparently year defaults to current year when it's not specified in the format string).  Because the Year comes after the timezone offset, I have to deal with the timezone offset in the format string.

But when I get to the timezone offset value I can't use "%z" or "%Z", because the documentation lists those as "output only".


I'm close, but can't incorporate the timezone offset field in the date/time stamp string.

What am I missing?   Tweet me @esawdust with any suggestions for how to get around this problem with just the as.Date method.

I suppose one workaround is to split the date/time string into its component parts and reassemble it into a string as.Date can deal with, but that seems to defeat one of the purposes of as.Date's format capability.
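A rough sketch of that workaround, assuming space-separated fields in the order shown above and an English locale for the month name. Note that it throws away the time and the offset, since a Date only holds a day:

```r
stamp <- "Tue Sep 15 09:22:09 -0600 2009"

# Fields: weekday, month, day, time, offset, year
fields <- strsplit(stamp, " ")[[1]]

# Reassemble just the pieces as.Date needs: year, month name, day
rebuilt <- paste(fields[6], fields[2], fields[3])

day_only <- as.Date(rebuilt, format = "%Y %b %d")
day_only
```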

Any advice for how to translate a "Tue Sep 15 09:22:09 -0600 2009" into an R Date object?
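For what it is worth, a sketch that may work directly on a current R release, on the assumption that its ?strptime documents %z as accepted on input (check the help page for your version). It also assumes an English locale for %a and %b:

```r
stamp <- "Tue Sep 15 09:22:09 -0600 2009"

# %z consumes the "-0600" offset, so %Y can follow it
when <- as.POSIXct(stamp, format = "%a %b %d %H:%M:%S %z %Y", tz = "GMT")

format(when, "%Y-%m-%d %H:%M:%S", tz = "GMT")   # the instant expressed in GMT
as.Date(when, tz = "GMT")                       # the calendar day in GMT
```

Going through as.POSIXct first keeps the time of day; as.Date is applied only at the end, when just the day is wanted.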

[Update 9/18/09 - posted my question to the Nabble R-Help forum (which is great, BTW)]


This works out to be close - Kudos go to Gabor for the basic solution. However, there are some issues with the code, as you will see in a bit.

In order to use the gsubfn you have to first install the package:
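Something like the following should do it. Including chron in the list is my assumption, based on as.chron being used further down:

```r
# Packages assumed to be needed: gsubfn (for strapply) and chron (for as.chron)
needed <- c("gsubfn", "chron")

# Install only the ones that are not already available
missing <- needed[!vapply(needed, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing) > 0) install.packages(missing)
```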



After a fairly lengthy download with dependencies, you can run the R code:



You can see the day in the result is 21, not 15. This is because the offset was applied as 600 hours instead of 6 hours. The fix is to interpret the offset value as hours and minutes (some timezones sit on a half-hour offset rather than a whole hour). For whole-hour timezones, dividing the offset by 100 gives the right number of hours.

That offset fix (which ignores timezones with a half-hour offset, a case I don’t have to worry about) looks like this:
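For the half-hour case, a sketch of treating the offset as hours and minutes rather than dividing by 100. The helper name offset_to_seconds is hypothetical, not part of gsubfn or chron:

```r
# Convert an offset such as "-0600" or "+0530" into seconds east of GMT.
# (offset_to_seconds is a made-up helper name for illustration.)
offset_to_seconds <- function(offset) {
  value   <- as.integer(offset)     # "-0600" becomes -600
  hours   <- abs(value) %/% 100     # leading two digits
  minutes <- abs(value) %% 100      # trailing two digits
  sign(value) * (hours * 3600 + minutes * 60)
}

offset_to_seconds("-0600")   # six hours behind GMT
offset_to_seconds("+0530")   # five and a half hours ahead of GMT
```

Working on the absolute value and restoring the sign afterwards avoids the surprises that integer division and modulo give with negative numbers.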



Date is now correct, but the time seems odd - it’s the same as given. However, if I change the offset in the time stamp string, the resulting chron time also changes, so it’s being used by as.chron as follows:


More playing with as.POSIXct and as.chron is in order to figure out what’s going on.

Get the current time on the system (which lives in the MDT TZ), as a time relative to GMT:



You can see these two times are 6 hours apart (local system is -6):
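A small sketch of the same comparison using a fixed instant instead of the system clock, so the result is repeatable. The zone name America/Denver is my assumption for the MDT system described above:

```r
# One instant, stored in GMT
p <- as.POSIXct("2009-09-15 15:22:09", tz = "GMT")

# The same instant rendered in two timezones
gmt_view   <- format(p, tz = "GMT", usetz = TRUE)
local_view <- format(p, tz = "America/Denver", usetz = TRUE)

gmt_view
local_view
```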


That generates the correct local time, so if the time is converted to GMT it works fine. However, if you do the computation with another TZ,



It’s clear that as.chron is not internally using the timezone in the given time object, p.

Here is another example of “R” times not behaving as the docs suggest: the ISOdate() function takes a timezone argument that is to be used in the conversion. It appears to be completely ignored as well, as the following examples show:
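One thing that may be worth checking, and this is only an assumption about a possible cause: the tz argument is matched against timezone names known to the system, such as "America/Denver", so an abbreviation or a raw offset string may not be recognised. A sketch comparing a named zone against GMT for the same wall-clock time:

```r
# Same wall-clock fields, interpreted in two different zones
denver <- ISOdatetime(2009, 9, 15, 9, 22, 9, tz = "America/Denver")
gmt    <- ISOdatetime(2009, 9, 15, 9, 22, 9, tz = "GMT")

# Difference between the two instants, in hours
hours_apart <- (as.numeric(denver) - as.numeric(gmt)) / 3600
hours_apart
```

If the timezone were ignored, the two instants would be identical and the difference would be zero.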



At least this is a codeable solution to get Chron or DateTime class objects from a log timestamp like "Tue Sep 15 09:22:09 -0600 2009" if the time is first converted to GMT. So, that’s unfortunate, but is the reality.

What is nice, I found, is that it will work just as well on a collection of dates. Say you created multiple dates in a collection:



Then hand strapply() the dates vector instead of a literal string and it will convert both in the same call:
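The same idea can be sketched in base R, without strapply(), since sub() and as.POSIXct are vectorised. This is my own stand-in rather than the original call: it parses each stamp as if it were GMT and then subtracts the offset. The two sample stamps are made up, and an English locale is assumed for the month names:

```r
dates <- c("Tue Sep 15 09:22:09 -0600 2009",
           "Wed Sep 16 18:05:30 -0600 2009")

# Groups: 1 month, 2 day, 3 time, 4 offset, 5 year
pattern <- "^[A-Za-z]+ ([A-Za-z]+) ([0-9]+) ([0-9:]+) ([+-][0-9]+) ([0-9]+)$"

local_part <- sub(pattern, "\\5 \\1 \\2 \\3", dates)   # "2009 Sep 15 09:22:09"
offsets    <- as.integer(sub(pattern, "\\4", dates))   # -600

# Offset as hours and minutes, converted to seconds
offset_secs <- sign(offsets) *
  ((abs(offsets) %/% 100) * 3600 + (abs(offsets) %% 100) * 60)

# Parse the wall-clock time as GMT, then shift by the offset to get true GMT
gmt <- as.POSIXct(local_part, format = "%Y %b %d %H:%M:%S", tz = "GMT") - offset_secs

format(gmt, "%Y-%m-%d %H:%M:%S", tz = "GMT")
```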


Summary

All told with the “R” date/time classes, my conclusion is this:

If you can do your date conversion on the data outside of “R”, you are probably better off converting the dates before import. The “R” date/time classes are clunky and produce unexpected results in all but the simplest cases: pure GMT or local times work, but arbitrary timezones are not handled well.

R - Showing a frequency table in a matrix format

Scenario - Attack types in Snort IDS, class types, are reported in the event logs. To get a frequency analysis, after parsing the log data into “R”, I can extract the classtypes of all the log entries. From that it’s handy to get a frequency table showing how many times a particular classification of attack occurred. But the output format is not very readable or easy to parse by other scripts. This recipe shows how to dump a frequency table as a one-column matrix with the class types as row names.

# lists the types of attacks found in the data - based on Classtype
> classtypes = factor( snortabbrev$Classtype )

> str(classtypes)
 Factor w/ 9 levels "","attempted-admin",..: 4 4 6 6 6 6 2 2 6 6 ...

> table(classtypes)
classtypes
                                  attempted-admin          attempted-recon           attempted-user            misc-activity              misc-attack 
                      18                       93                       35                       21                       30                       12 
 protocol-command-decode        unsuccessful-user web-application-activity 
                       2                        2                      287 

> as.matrix(table(classtypes))
                         [,1]
                           18
attempted-admin            93
attempted-recon            35
attempted-user             21
misc-activity              30
misc-attack                12
protocol-command-decode     2
unsuccessful-user           2
web-application-activity  287

Works just as well with a summary command:

> as.matrix(summary(classtypes))
                         [,1]
                           18
attempted-admin            93
attempted-recon            35
attempted-user             21
misc-activity              30
misc-attack                12
protocol-command-decode     2
unsuccessful-user           2
web-application-activity  287
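If another script needs the names and counts as two real columns, a data frame may be easier to work with than the matrix. A sketch with a small made-up factor standing in for the Snort data (the column names classtype and count are my choice):

```r
# Stand-in data; the real classtypes come from snortabbrev$Classtype
classtypes <- factor(c("misc-attack", "attempted-recon", "misc-attack",
                       "", "attempted-user", "misc-attack"))

# One row per level, with the level name and its frequency
freq <- as.data.frame(table(classtypes), stringsAsFactors = FALSE)
names(freq) <- c("classtype", "count")

freq
```

From here write.csv(freq, row.names = FALSE) would give a file that is easy for other tools to read.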

R - Extracting all Factors except empty element ""

Scenario: In a project I’m working on, there are security attacks shown in system logs. As part of the log data, snort alerts, there is a class type which is the general category or classification of an attack. In some cases, the attack type is apparently unknown and comes out as an “empty”. There are times I want to see everything but the unknown attack types.


# lists the types of attacks found in the data - based on Classtype
classtypes = factor( snortabbrev$Classtype )

Some factors come back with an empty element, such as:

> levels(classtypes)
[1] ""                         "attempted-admin"          "attempted-recon"          "attempted-user"           "misc-activity"            "misc-attack"              "protocol-command-decode"
[8] "unsuccessful-user"        "web-application-activity"
> as.matrix(levels(classtypes))
      [,1]                     
 [1,] ""                       
 [2,] "attempted-admin"        
 [3,] "attempted-recon"        
 [4,] "attempted-user"         
 [5,] "misc-activity"          
 [6,] "misc-attack"            
 [7,] "protocol-command-decode"
 [8,] "unsuccessful-user"      
 [9,] "web-application-activity"
>

Here's how to get rid of the "" if you don't want to consider it:

> levels(classtypes) != ""
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
> classtypes_culled = levels(classtypes)[ levels(classtypes) != ""]
> classtypes_culled
[1] "attempted-admin"          "attempted-recon"          "attempted-user"           "misc-activity"            "misc-attack"              "protocol-command-decode"  "unsuccessful-user"      
[8] "web-application-activity"
> 

> as.matrix( classtypes_culled )
     [,1]                     
[1,] "attempted-admin"        
[2,] "attempted-recon"        
[3,] "attempted-user"         
[4,] "misc-activity"          
[5,] "misc-attack"            
[6,] "protocol-command-decode"
[7,] "unsuccessful-user"      
[8,] "web-application-activity"
> 
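The recipe above only trims the vector of level names. If the empty entries should also disappear from the factor itself, so that they no longer show up in table() output, something like this sketch may help. It uses a small made-up factor in place of the Snort data:

```r
# Stand-in data; the real classtypes come from snortabbrev$Classtype
classtypes <- factor(c("", "attempted-admin", "misc-attack", "", "misc-attack"))

# Keep only the non-empty observations, then drop the now-unused "" level
known <- droplevels(classtypes[classtypes != ""])

levels(known)
table(known)
```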

R Language Recipes

Synopsis - I started this R Language Recipes section of the site because I am learning “R” - a very powerful, open source language for statistical number crunching. I’m also reviewing the manuscript “R in Action” - an upcoming book from Manning Publications and so I’m learning the language through that exercise as well.

Even as a career software developer, there are things about the R language which are not obvious. In the process of stumbling onto solutions, I want to save what I’m learning. These are little recipes I’ve learned the hard way and don’t want to have to look up again or figure out when I need them again. So, this is my stash of “R” recipes or phrases. If you get something from them, great, otherwise, I just needed a place for reference and notes.