Introduction

I was inspired by the White Wine data set but wanted to do something different, so I spent about a month scraping, extracting, and cleaning a collection of beer reviews from ratebeer.com. Without going into gory detail, I used a Python module (called ratebeer) to scrape and extract information about 13,000 breweries, 147,000 beers, and 2,140,000 beer reviews. Unfortunately, midway through, ratebeer changed its interface and the scripts failed. As a result, I have only about half of the breweries (6,358) accounted for.

I spent a couple of weeks debugging and trying to rewrite the Python module, but decided to proceed with this analysis using the partial data set: the point of the project is not debugging screen-scraping code in Python. So everything that follows must be taken with a grain of salt. Although I believe I have a representative sample, roughly 50% of my data set is missing. For example, there are no breweries whose names start with the letters A, C, F, G, H, or L, and B, M, and S are only partially extracted.

It’s really important to note that, because of this, I’ve lost some big commercial breweries such as Anheuser-Busch. As a result, we’re exploring the data set but cannot really make strong statements about the state of the beer world as a whole.

Detailed data cleaning takes place in process_data.R (submitted with this assignment), which produces the tables loaded by this markdown document. I only extracted reviews for beers with more than nine reviews. After extracting reviews, I also used the Data Science Toolkit to geotag the breweries and reviewers. Afterward, I calculated the distance between each reviewer and the brewery and stored that value in a column called reviewdist.
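The distance step can be sketched as a great-circle (haversine) calculation. This is only an illustration and not necessarily what process_data.R does: the function name `haversine_km`, the spherical Earth with a mean radius of 6371 km, and the choice of kilometres as the unit are all assumptions. The inputs are the rounded reviewer and brewery coordinates from the sample row shown further down.

```r
# Great-circle distance between two points given in decimal degrees.
# Assumption: a spherical Earth with a mean radius of 6371 km.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Rounded coordinates from the sample review: reviewer first, then brewery
d <- haversine_km(52.2, 1, 52, -0.5)
```

With these rounded inputs the result is roughly 105 km, which is in line with the reviewdist of 104 in the sample row if that column is measured in kilometres. The function is vectorised, so it could be applied to whole latitude and longitude columns at once.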

This is our foundation data set: one row per review per beer per brewery. It has a lot of repetition but lets us use dplyr’s grouping and summary tools easily.

## 'data.frame':    1 obs. of  51 variables:
##  $ appearance           : int 3
##  $ aroma                : int 6
##  $ beer_url             : chr "/beer/bt-5-hop-bitter/39628/"
##  $ date                 : chr "2010-12-03"
##  $ overall              : int 11
##  $ palate               : int 3
##  $ rating               : num 2.8
##  $ taste                : int 5
##  $ text                 : chr "HP-The Dove,BSE,golden with no head,aroma of hops and pine,taste of strong hops,grapefruit,pine and cedar and some alcohol.."
##  $ user_location        : chr "Suffolk, ENGLAND"
##  $ user_name            : chr "Garrat"
##  $ user_lat             : num 52.2
##  $ user_lon             : num 1
##  $ user_country         : chr "United Kingdom"
##  $ user_continent       : chr "Europe"
##  $ beers_has_fetched    : chr "True"
##  $ abv                  : num 4.1
##  $ brewed_at            : chr ""
##  $ brewery              : chr "B&T Brewery"
##  $ brewery_url          : chr "/brewers/bt-brewery/1948/"
##  $ calories             : int 123
##  $ description          : chr "Cask; Seasonal - Autumn. Has also been available bottle conditioned. \nSeasonal beer brewed with green hops. \nUses Challenger,"| __truncated__
##  $ ibu                  : int NA
##  $ img_url              : chr "http://res.cloudinary.com/ratebeer/image/upload/w_120,c_limit/beer_39628.jpg"
##  $ mean_rating          : num 3.08
##  $ beer_name            : chr "B&T 5 Hop Bitter"
##  $ num_ratings          : int 10
##  $ overall_rating       : int 44
##  $ seasonal             : chr "Autumn"
##  $ style                : chr "Bitter"
##  $ style_rating         : int 55
##  $ style_url            : chr "/beerstyles/bitter/20/"
##  $ tags                 : chr "[bramling cross, fuggles, cascade, challenger, bottle conditioned]"
##  $ weighted_avg         : num NA
##  $ breweries_has_fetched: logi TRUE
##  $ city                 : chr "Shefford"
##  $ brewery_name         : chr "B&T Brewery"
##  $ postal_code          : chr "SG17 5DZ"
##  $ state                : chr "Bedfordshire"
##  $ street               : chr "B & T Brewery, 3E-3F St. Francis Way"
##  $ telephone            : chr "01462 815080"
##  $ type                 : Factor w/ 6 levels "Brew Pub","Brew Pub/Brewery",..: 6
##  $ web                  : chr "http://www.banksandtaylor.com/"
##  $ location             : chr "Shefford Bedfordshire England"
##  $ lat                  : num 52
##  $ lon                  : num -0.5
##  $ country              : chr "United Kingdom"
##  $ continent            : chr "Europe"
##  $ first_letter         : chr "B"
##  $ reviewdist           : num 104
##  $ beer_type            : Factor w/ 3 levels "Ale","Lager",..: 1
## [1] 351599     51
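
Because every row is a single review, per-beer or per-brewery figures come from grouping and summarising. The following is a small sketch with dplyr, using a made-up three-row stand-in for the review table. The data frame name `reviews`, the output name `per_beer`, and the toy values are assumptions; the column names come from the structure listing above.

```r
library(dplyr)

# Toy stand-in for the review-level table; all values are invented.
reviews <- data.frame(
  brewery   = c("Brewery X", "Brewery X", "Brewery Y"),
  beer_name = c("Bitter One", "Bitter One", "Lager Two"),
  rating    = c(2.8, 3.4, 3.9),
  stringsAsFactors = FALSE
)

# Collapse the repeated review rows to one row per beer.
per_beer <- reviews %>%
  group_by(brewery, beer_name) %>%
  summarise(
    n_reviews   = n(),
    mean_rating = mean(rating),
    .groups     = "drop"
  )
```

The same pattern, grouped by brewery or by style instead, would give brewery-level or style-level summaries without reshaping the underlying table.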

The data set contains multitudes. We dropped miscellaneous producers of sake, mead, and cider. We only kept those beers that had 10 or more reviews. After dropping non-beer entries and those with invalid geodata or other attributes, we have:

Univariate Plots Section

You can see how gap-toothed the brewery list is, thanks to the forced stop to data collection when ratebeer.com changed their interface:
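
One way to check those gaps numerically is to tabulate brewery initials against the full alphabet. This is a sketch under assumptions: the vector below is invented and stands in for the first_letter column with one entry per brewery, and `letter_counts` and `missing_letters` are illustrative names.

```r
# Hypothetical stand-in for the first_letter column, one entry per brewery.
first_letter <- c("B", "B", "D", "E", "M", "S", "S", "T")

# Count breweries per letter, keeping letters that have zero breweries.
letter_counts <- table(factor(first_letter, levels = LETTERS))

# Letters with no breweries at all
missing_letters <- names(letter_counts)[as.vector(letter_counts) == 0]
```

Using `factor(..., levels = LETTERS)` is what keeps the empty letters visible; a plain `table()` would silently omit them. Names that begin with a digit or a non-ASCII character would become NA here and would need separate handling.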

Our data set is strongly right-skewed: in almost every instance, whether we’re counting beers, reviewers, or styles, a large number of entries have only a handful of instances, with notable outliers far to the right. Most of the histograms that follow therefore use a log10 x-axis.
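
To illustrate why a log10 axis helps with this kind of shape, here is a minimal base R sketch on simulated data. The log-normal draw and the name `counts` are assumptions standing in for something like the number of reviews per beer; none of the values come from the real data set.

```r
# Simulated right-skewed counts (assumption: roughly log-normal).
set.seed(42)
counts <- ceiling(rlnorm(1000, meanlog = 3, sdlog = 1.2))

# Binning log10(counts) is comparable to a histogram with a log10 x-axis:
# the long right tail is compressed and the bulk of the data spreads out.
h <- hist(log10(counts), breaks = 20, plot = FALSE)
```

Calling `plot(h, xlab = "log10(count)")` draws the result. In ggplot2 the equivalent is adding `scale_x_log10()` to a `geom_histogram()` plot. Either way, values of zero have to be dropped or offset first, since log10(0) is not finite.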

Beers come in a variety of styles. The ‘counter’ variable below shows how many beers per style are included in the data set.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     5.0   100.8   195.0   318.8   418.0  1978.0
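
A per-style count like this could be produced along the following lines. This is a sketch under assumptions, not the code behind the summary above: the toy rows and beer URLs are invented, `beers` and `style_counts` are illustrative names, and only the column names beer_url and style and the variable name counter come from the document.

```r
# Toy review-level rows: two reviews of one beer, one review each of two others.
reviews <- data.frame(
  beer_url = c("/beer/a/1/", "/beer/a/1/", "/beer/b/2/", "/beer/c/3/"),
  style    = c("Bitter", "Bitter", "Bitter", "Stout"),
  stringsAsFactors = FALSE
)

# Reduce to one row per beer so heavily reviewed beers are not overcounted.
beers <- unique(reviews[, c("beer_url", "style")])

# Number of distinct beers per style.
style_counts <- as.data.frame(table(style = beers$style),
                              responseName = "counter",
                              stringsAsFactors = FALSE)

# Five-number summary plus mean of the per-style counts
summary(style_counts$counter)
```

The deduplication step matters because the foundation table has one row per review; counting rows directly would measure reviews per style rather than beers per style.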