• r

Extracting IP addresses from traffic logs

I’ve got multiple zip files, each containing daily logs of traffic info to a website. Each line is one entry, containing ip address, date, and many other things about the visitor. What I need is an output of 1 csv file, with date and ip address as columns.

First I need to unzip the files, read the unzipped files into one giant data frame, so I can easily extract the info I need.

Here goes:

#specify zip directory
>fileList <- list.files(path="/Users/csmith/Dropbox/zip", pattern=".zip")
#unzip and create new directory
+	unzip(i,exdir="/Users/csmith/Dropbox/unzip")

#specify unzip directory
>fileList<- list.files(path="/Users/csmith/Dropbox/unzip")
#read tables, create distinct rows
>df<-lapply(fileList, function(i){
+	read.table(i,header=FALSE,sep="\t")

#bind tables

Now I have one giant data frame, with 1 column, V1. I’m going to use V1 as a general template to extract the data that I need.

#create column for date, which is a duplicate of V1, except I remove everything after the colon.
>df$date<-gsub( ":.*$", "", df$V1)

Here I get rid of extra stuff in the date column.

#remove everything before space in column date
>df$date<-gsub(".* ", "", df$date)
#remove bracket in column date

Now that I have the date column the way I want it, I need an ip address column.

#create column for ip address, remove everything after space in column V1
>df$ip<-gsub(" .*$", "", df$V1)
#remove column V1 because I don't need it.

Some ip addresses show up twice because they take multiple actions on the site. I don’t need duplicates.

#only show unique ip/date combos

I also only want IPv4 IP addresses.

#remove IPv6 IP addresses
>df <- df[-grep("\\:",df$ip),]
# write to csv

There you have it.