So recently Bureau of Labor Statistics released the Oct. 2011 unemployment data. This is not a discussion of it’s validity nor it’s impact, but it is a post on how to visualize it. This post is also for my posterity, I’ve wanted to be able to do this for a while, and it’ll serve as a reference i.e. the map is my own, but the methods are pieced together from other sources.
So you can go over to the BLS Local Area Stats Page and get the data if you’d like to follow along.
First the data need to form-up so we can use it in R, where we’ll create the map. I (there may be better ways) copied the chart from the link into macvim. Then through a couple
s///g‘s I was able to get the file into csv format, which means we’re ready to open R.
There are two libraries we’ll be using to help us with this visualization,
ggplot2  and
So of course we’ll load them into our session:
Now that we have the library uploaded, we need to get the unemployment data in the session.
unemp <- read.csv("data.csv", header = F)
names(unemp) <- c("region", "percent")
unemp$region <- tolower(unemp$region)
1 north dakota 3.5
2 nebraska 4.2
3 south dakota 4.5
4 new hampshire 5.3
5 vermont 5.6
6 wyoming 5.7
So what we’re going to do next is create a single data.frame from two merged ones.
ggplot2 uses long and lat to map the data to the states, so we’ll need to associate the unemployment numbers with those long and lat number.
state_df <- map_data("state")
merged <- merge(state_df, unemp, by="region")
merged <- merged[order(merged$order),]
Great, so now the only step left is to create the map.
ggplot(merged, aes(long, lat, group = group)) +
+ geom_polygon(aes(fill = unemp), colour = alpha("white", 1/2), size = 0.2) +
+ geom_polygon(data = state_df, colour = "white", fill = NA)
And the finished product should look something like this:
 Hint: the space between the state name and the number is a tab, \t.
 I’ve been using ggplot2 for a couple weeks now, and it is awesome – highly recommended.
If there are two (can be generalized to n) classes and both follow the same distribution (but with different parameters) it is possible to predict which class an observations comes from.
Here I’ll try to predict a sample’s gender based on their height. The distribution of a person’s height is more or less normal. There are two parameters of a normal distribution. I’ll consider the easy case in this post: males and females have different average heights, but the distributions have the same standard deviation.
For the graphs and subscripts, male = 1, female = 0.
There are 3 things to do:
- Make sure our data is roughly normal. If our prediction is predicated on the data being normal, it data better be normal.
- Derive the decision rule.
- Test how well our rule works.
The data set we’ll be using is from the Journal of Statistics Education. I’ve stripped out most the the information except for height, and gender.
Again this looks good.
Deriving the Decision Rule
Great, so the the data is normal, but what’s next. We’ll make the decision to classify a case to a gender if the probability of that case being male is greater than that case being female. Or, formally,
Because we’ve assumed normality let’s put the pdf’s the inequality.
Remember, we assumed that the standard deviations were the same.
It’s fairly obvious form the equations that when,
the original inequality will hold. Now if we do some algebra we can see that when
the case will be classified as male. To visualize this, it would be a vertical line through the average of the means. Anything on the right male, on the left female.
To see how well our decision rule works the data needs to be split into a training set – to put actual numbers to the rule – then a testing set to see how well the prediction works.
I’ll be using R to do the analysis. All the data is in a data.frame
First we’ll split the data.frame into the training and testing sets.
> nr <- nrow(hw)
> hw.shuffle <- hw[sample.int(nr),]
> hw.train <- hw.shuffle[1:as.integer(nr*.7),]
> hw.test <- hw.shuffle[as.integer(nr*.7):nr,]
So now that the data is split into the two separate sets the mean of the training set can be tested against the test set.
> tapply(hw.train$height, hw.train$gender, mean)
Which means from decision rule derived above anything larger than the average of the 164.79 and 177.77, which is 171.28, will be classified as male, and under will be classified as female.
Now to set up the classification.
> hw.train.mean <- mean(c(164.79,177.77))
> hw.test$classify <- rep(0, (nrows(hw.test))
> hw.test$classify <- ifelse(hw.test$height > hw.train.mean, 1, 0)
> hw.test$classify <- as.factor(hw.test$classify)
> hw.test$classify <- as.factor(hw.test$classify)
> tab <- table(hw.test$gender, hw.test$classify)
0 62 12
1 17 61
The table shows the number predicted vs the actual number. Meaning there are 74 females in our test, and we correctly predicted 62 of them.
This is pretty good, of 152 test cases, the decision rule correctly predicted 123 correct or ~81%. It could be made potentially better by assuming a different standard deviations between factors.
And to wrap it up, and nice graph showing the rule overlaid with polynomial density.
- There are much better ways to check for normality, but this’ll do there.
- Remember that when you multiply both sides of an inequality by a negative number you switch the inequality.