Introduction

I was inspired by the White Wine data set but wanted to do something different. So I spent a month or so scraping, extracting, and cleaning a collection of beer reviews from ratebeer.com. Without going into gory detail, I used a Python module (called ratebeer) to scrape and extract information about 13,000 breweries, 147,000 beers, and 2,140,000 beer reviews. Unfortunately, midway through, ratebeer changed their interface, and the scripts failed. As a result, I have approximately half the breweries (6,358) accounted for.

I spent a couple of weeks debugging and trying to rewrite the Python module, but decided to proceed with this analysis using the partial data set: the point of the project is not debugging screen-scraping code in Python. So everything that follows must be taken with a grain of salt: although I believe I have a representative sample, about 50% of my data set is missing. For example, there are no breweries whose names start with the letters A, C, F, G, H, or L, and B, M, and S are only partial extracts.

It’s really important to note that, because of this, I’ve lost some big commercial breweries such as Anheuser-Busch. As a result, we’re exploring the data set but cannot really make strong statements about the state of the beer world as a whole.

Detailed data cleaning takes place in process_data.R (submitted with this assignment), which produces the tables loaded by this markdown document. I only extracted reviews for beers with ten or more reviews. After extracting reviews, I also used the Data Science Toolkit to geotag the breweries and reviewers. Afterward, I calculated the distance between the reviewer and the brewery and put that value into a column called reviewdist.
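As an illustration of the distance step, here is a minimal sketch of a great-circle (haversine) calculation in base R. The function name haversine_km and the choice of formula are my assumptions for this example; process_data.R may compute reviewdist differently.

```r
# Great-circle (haversine) distance in kilometres between two points.
# Assumption: this is an illustrative formula, not necessarily the one
# used in process_data.R.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1.
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Rounded coordinates from the sample row shown below:
# reviewer in Suffolk (52.2, 1.0), brewery in Shefford (52.0, -0.5).
reviewdist_example <- haversine_km(52.2, 1.0, 52.0, -0.5)
```

With those rounded coordinates the result is on the order of 100 km, which is in line with the reviewdist value in the sample row. The function is vectorized, so it can be applied to whole columns of latitudes and longitudes at once.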

This is our foundation data set - one row per review per beer per brewery. It has a lot of repetition but allows us to use dplyr’s grouping and summary tools easily.

## 'data.frame':    1 obs. of  51 variables:
##  $ appearance           : int 3
##  $ aroma                : int 6
##  $ beer_url             : chr "/beer/bt-5-hop-bitter/39628/"
##  $ date                 : chr "2010-12-03"
##  $ overall              : int 11
##  $ palate               : int 3
##  $ rating               : num 2.8
##  $ taste                : int 5
##  $ text                 : chr "HP-The Dove,BSE,golden with no head,aroma of hops and pine,taste of strong hops,grapefruit,pine and cedar and some alcohol.."
##  $ user_location        : chr "Suffolk, ENGLAND"
##  $ user_name            : chr "Garrat"
##  $ user_lat             : num 52.2
##  $ user_lon             : num 1
##  $ user_country         : chr "United Kingdom"
##  $ user_continent       : chr "Europe"
##  $ beers_has_fetched    : chr "True"
##  $ abv                  : num 4.1
##  $ brewed_at            : chr ""
##  $ brewery              : chr "B&T Brewery"
##  $ brewery_url          : chr "/brewers/bt-brewery/1948/"
##  $ calories             : int 123
##  $ description          : chr "Cask; Seasonal - Autumn. Has also been available bottle conditioned. \nSeasonal beer brewed with green hops. \nUses Challenger,"| __truncated__
##  $ ibu                  : int NA
##  $ img_url              : chr "http://res.cloudinary.com/ratebeer/image/upload/w_120,c_limit/beer_39628.jpg"
##  $ mean_rating          : num 3.08
##  $ beer_name            : chr "B&T 5 Hop Bitter"
##  $ num_ratings          : int 10
##  $ overall_rating       : int 44
##  $ seasonal             : chr "Autumn"
##  $ style                : chr "Bitter"
##  $ style_rating         : int 55
##  $ style_url            : chr "/beerstyles/bitter/20/"
##  $ tags                 : chr "[bramling cross, fuggles, cascade, challenger, bottle conditioned]"
##  $ weighted_avg         : num NA
##  $ breweries_has_fetched: logi TRUE
##  $ city                 : chr "Shefford"
##  $ brewery_name         : chr "B&T Brewery"
##  $ postal_code          : chr "SG17 5DZ"
##  $ state                : chr "Bedfordshire"
##  $ street               : chr "B & T Brewery, 3E-3F St. Francis Way"
##  $ telephone            : chr "01462 815080"
##  $ type                 : Factor w/ 6 levels "Brew Pub","Brew Pub/Brewery",..: 6
##  $ web                  : chr "http://www.banksandtaylor.com/"
##  $ location             : chr "Shefford Bedfordshire England"
##  $ lat                  : num 52
##  $ lon                  : num -0.5
##  $ country              : chr "United Kingdom"
##  $ continent            : chr "Europe"
##  $ first_letter         : chr "B"
##  $ reviewdist           : num 104
##  $ beer_type            : Factor w/ 3 levels "Ale","Lager",..: 1
## [1] 351599     51

The data set contains multitudes. We dropped miscellaneous producers of sake, mead, and cider. We only kept those beers that had 10 or more reviews. After dropping non-beer entries and those with invalid geodata or other attributes, we have:

Univariate Plots Section

You can see how gap-toothed the brewery list is, thanks to the forced stop to data collection when ratebeer.com changed their interface:

Our data set is very long-tailed to the right - in almost every instance, whether we’re counting the distribution of the numbers of beers, reviewers, or styles, a large number of entries have a tiny number of instances, with notable outliers to the right. The following histograms are almost always transformed with a log10 x-axis.
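To show why the log10 axis helps, here is a small sketch using simulated counts rather than the real data. The log-normal parameters are arbitrary assumptions chosen only to produce a right-skewed shape; when plotting with ggplot2, scale_x_log10() is the usual way to get the same effect.

```r
set.seed(42)
# Hypothetical right-skewed review counts (log-normal), standing in for the real data.
review_count <- ceiling(rlnorm(5000, meanlog = 2, sdlog = 1.2))

# Binning on the raw scale: nearly everything piles into the first bin.
raw_hist <- hist(review_count, breaks = 30, plot = FALSE)

# Binning log10(count) spreads the bulk of the data across many bins.
log_hist <- hist(log10(review_count), breaks = 30, plot = FALSE)

# Share of observations that land in the single fullest bin under each scale.
raw_peak_share <- max(raw_hist$counts) / length(review_count)
log_peak_share <- max(log_hist$counts) / length(review_count)
```

Comparing raw_peak_share with log_peak_share gives a rough measure of how much of the distribution is hidden in one bar on the untransformed axis.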

Beers come in a variety of styles. The ‘counter’ variable below shows how many beers per style are included in the data set.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     5.0   100.8   195.0   318.8   418.0  1978.0

We have a long-tailed distribution of beer styles. While India Pale Ales are very commonly made (1,978 beers), about half of the styles are represented by fewer than 200 beers (the median is 195).

Let’s look at other aspects of beers:

##    beer_url          review_count      brewery          brewery_url       
##  Length:25504       Min.   :  1.00   Length:25504       Length:25504      
##  Class :character   1st Qu.:  2.00   Class :character   Class :character  
##  Mode  :character   Median :  7.00   Mode  :character   Mode  :character  
##                     Mean   : 13.79                                        
##                     3rd Qu.: 15.00                                        
##                     Max.   :456.00                                        
##                                                                           
##   beer_name                          style      
##  Length:25504       India Pale Ale (IPA): 1978  
##  Class :character   American Pale Ale   : 1267  
##  Mode  :character   Bitter              : 1171  
##                     Golden Ale/Blond Ale: 1165  
##                     Imperial Stout      :  914  
##                     Imperial IPA        :  872  
##                     (Other)             :18137  
##              brewery_type   brewery_country    brewery_continent 
##  Brew Pub          : 3486   Length:25504       Length:25504      
##  Brew Pub/Brewery  : 3254   Class :character   Class :character  
##  Client Brewer     :  837   Mode  :character   Mode  :character  
##  Commercial Brewery: 4632                                        
##  Contract Brewer   :   48                                        
##  Microbrewery      :13247                                        
##                                                                  
##       abv              ibu        first_letter        mean_rating   
##  Min.   : 0.010   Min.   :  1.0   Length:25504       Min.   :0.500  
##  1st Qu.: 4.800   1st Qu.: 24.0   Class :character   1st Qu.:2.950  
##  Median : 5.500   Median : 37.0   Mode  :character   Median :3.256  
##  Mean   : 6.205   Mean   : 44.5                      Mean   :3.201  
##  3rd Qu.: 7.100   3rd Qu.: 63.0                      3rd Qu.:3.550  
##  Max.   :57.700   Max.   :240.0                      Max.   :5.000  
##  NA's   :1119     NA's   :21108                                     
##  weight_factor     beer_type          lat              lon           
##  Min.   :   0.50   Ale  :13481   Min.   :-37.86   Min.   :-159.7199  
##  1st Qu.:   7.60   Lager: 3859   1st Qu.: 40.00   1st Qu.: -91.9673  
##  Median :  20.70   Other: 1624   Median : 44.77   Median : -75.4955  
##  Mean   :  44.92   NA's : 6540   Mean   : 44.63   Mean   : -54.2495  
##  3rd Qu.:  47.60                 3rd Qu.: 51.25   3rd Qu.:  -0.2417  
##  Max.   :1832.80                 Max.   : 67.09   Max.   : 159.9500  
## 

The overwhelming majority of beers have only a handful of reviews. When we start crunching information about reviews, we’ll look only at beers with enough reviews for the ratings to be reasonably reliable. So let’s lop off the low-review beers.
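A minimal sketch of that filter, using a made-up beer table. The beer names and values are hypothetical; the cutoff of 10 matches the threshold stated earlier in this document.

```r
# Hypothetical beer rollup: one row per beer with its review count.
beers <- data.frame(
  beer_name    = c("Beer A", "Beer B", "Beer C", "Beer D"),
  review_count = c(1, 7, 10, 456),
  mean_rating  = c(4.5, 3.1, 3.3, 3.6)
)

# Keep only beers with at least `min_reviews` reviews.
min_reviews <- 10
reviewed_beers <- beers[beers$review_count >= min_reviews, ]
```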

Note the log10 transformation on the x-axis, as this is a very right-skewed distribution - the beers that do have thousands of reviews are lost otherwise.

Other aspects of the reviewed beers:

Alcohol: ABV (Alcohol by Volume) looks normal, except for a few insane outliers. I looked them up and they all came out of the same brewery and are clearly a specialty of theirs.

## Source: local data frame [6 x 3]
## 
##     style                                        beer_name   abv
##    (fctr)                                            (chr) (dbl)
## 1 Eisbock                              Schorschbock Ice30% 30.00
## 2 Eisbock      Schorschbräu Schorschbock 31% Black Edition 31.00
## 3 Eisbock                    Schorschbräu Schorschbock 31% 30.86
## 4 Eisbock                    Schorschbräu Schorschbock 40% 39.44
## 5 Eisbock                    Schorschbräu Schorschbock 43% 43.38
## 6 Eisbock Schorschbräu Schorschbock 57% finis coronat opus 57.70

Here it is in a little finer detail with the outliers pulled:

Bitterness:

IBU (International Bitterness Units) have an interesting spiky look - as I increased the number of bins, it became more clear that the spikes come from people rounding IBU to the nearest 5. (An IBU of 100 is really, really, really bitter. Like double India Pale Ale bitter. And it looks like a beer with an IBU of 100 is a good marketing hook.)
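One quick way to check the rounding hunch without a plot is to measure how many IBU values are exact multiples of 5. The values below are invented for illustration.

```r
# Hypothetical IBU values; NA marks beers with no published IBU.
ibu <- c(20, 25, 37, 40, 45, 60, 63, 100, NA, 35)

# Share of the known IBU values that are exact multiples of 5.
known_ibu <- ibu[!is.na(ibu)]
share_multiple_of_5 <- mean(known_ibu %% 5 == 0)

# If IBUs were spread evenly over the integers, only about 1 in 5 would be
# a multiple of 5, so a much larger share points to rounding.
```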

Let’s have a look at where all these breweries are:

And while we’re at it, let’s find out who the reviewers tend to be.

##   user_name          review_count    user_location      mean_user_rating
##  Length:2262        Min.   :   1.0   Length:2262        Min.   :0.500   
##  Class :character   1st Qu.:   1.0   Class :character   1st Qu.:3.176   
##  Mode  :character   Median :   4.0   Mode  :character   Median :3.502   
##                     Mean   : 155.4                      Mean   :3.541   
##                     3rd Qu.:  32.0                      3rd Qu.:3.967   
##                     Max.   :9037.0                      Max.   :5.000   
##     user_lat         user_lon       user_country       user_continent    
##  Min.   :-43.53   Min.   :-123.12   Length:2262        Length:2262       
##  1st Qu.: 35.13   1st Qu.:-118.02   Class :character   Class :character  
##  Median : 43.70   Median : -77.04   Mode  :character   Mode  :character  
##  Mean   : 43.46   Mean   : -45.63                                        
##  3rd Qu.: 52.50   3rd Qu.:  10.66                                        
##  Max.   : 69.65   Max.   : 175.28

Even with a log10 scale, we see a typical social media pattern, in which a great many people produce a very small number of contributions, but a few outliers create a significant portion of the overall body of work. (The managers of ratebeer.com swear that it is, indeed, possible for a qualified beer expert to have thoughtfully considered 3,000 beers over the time the site has been up and going.) This kind of distribution is referred to as a power law, or a Pareto distribution. (See Lehman, “User participation in social media: Digg study.”)

The Pareto principle (also known as the 80–20 rule) states that, for many events, roughly 80% of the effects come from 20% of the causes. In this case, the top 20% of the reviewers have produced 96% of the reviews.
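The share quoted above can be computed along these lines. The counts here are made up, so the resulting share is not the 96% from the real data.

```r
# Hypothetical review counts, one entry per reviewer.
review_count <- c(1, 1, 1, 2, 3, 4, 8, 30, 400, 9000)

# Sort from most to least active, then take the top 20% of reviewers.
sorted_counts <- sort(review_count, decreasing = TRUE)
n_top <- ceiling(0.20 * length(sorted_counts))

# Fraction of all reviews written by that top group.
top_share <- sum(sorted_counts[seq_len(n_top)]) / sum(sorted_counts)
```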

Where are they?

It’s worth noting that the geographic data for reviewers is pretty terrible relative to the address-specific information we have for breweries. Many people, for example, just put “California” in their location, so they have all geocoded on top of each other in the Los Angeles area. Otherwise, we don’t know much about the individuals beyond their locations. We’ll look further into the reviews below.

Univariate Analysis

What is the structure of your dataset?

The dataset is a denormalized, wide listing of every review of every beer included in the data. We’ve used dplyr to build out summary tables of beers and breweries, based on this information. The underlying relationship is that each brewery has one or more beers, each of which has ten or more reviews. This is important: although we recorded the number of reviews each beer has, we only collected reviews for beers with ten or more.
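To make the rollup idea concrete, here is a sketch that collapses a review-level table into one row per beer. It uses base R's aggregate() to stay dependency-free, whereas the analysis itself uses dplyr; the rows are hypothetical.

```r
# Hypothetical slice of the review-level table: one row per review.
reviews <- data.frame(
  brewery   = c("B&T Brewery", "B&T Brewery", "B&T Brewery", "Other Brewery"),
  beer_name = c("Bitter A", "Bitter A", "Stout B", "Lager C"),
  rating    = c(2.8, 3.2, 3.6, 3.0)
)

# One row per beer: mean rating across its reviews.
beer_rollup <- aggregate(rating ~ brewery + beer_name, data = reviews, FUN = mean)
names(beer_rollup)[names(beer_rollup) == "rating"] <- "mean_rating"

# Same grouping, counting reviews instead; the row order matches beer_rollup.
beer_counts <- aggregate(rating ~ brewery + beer_name, data = reviews, FUN = length)
beer_rollup$review_count <- beer_counts$rating
```

The same pattern, grouped by brewery alone, would give the brewery-level summary.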

We have many, many categorical variables, and fewer continuous ones (principally the ratings and the ABV/IBU numbers). This means that our opportunities for numerical correlation are somewhat limited relative to our ability to consider one categorical attribute in light of one or more others.

What is/are the main feature(s) of interest in your dataset?

I am most interested in two facets of the dataset. First, the relationship between various attributes of the beer and its rating, and second, as mentioned in the introduction, whether beers have any kind of ‘home field advantage’ based on how close the reviewer is to the brewery.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

For the analysis of whether a beer’s attributes impact its rating, I will consider some sub-ratings (such as aroma and taste), style, abv, ibu, brewery type, and brewery location. For the analysis of whether a reviewer rates local beers higher, I will start with the absolute distance between a reviewer and a brewery. I may also investigate filtering on country and/or state. People in Kalamazoo, Michigan are far more likely to love the Tigers than the Cubs, even though Chicago is much closer. Do people close to Bell’s Brewery in Kalamazoo rate it more highly than residents of Colorado do?

Did you create any new variables from existing variables in the dataset?

I created one convenience variable for the sake of breaking up the data set by alpha, and then I calculated distance between the reviewer and the brewery from the coordinates I looked up when geo-tagging each review and brewery.

During the rollup procedure, I also created a weighted mean rating for breweries, so that beers with many reviews aren’t overshadowed by beers with one or two.
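A sketch of what a review-count-weighted brewery rating can look like. The exact weighting used in process_data.R is not shown in this document, so treat the choice of review_count as the weight as an assumption; the numbers are invented.

```r
# Hypothetical beers from one brewery: mean rating and number of reviews each.
beers <- data.frame(
  brewery      = c("X", "X", "X"),
  mean_rating  = c(4.8, 3.2, 3.0),
  review_count = c(1, 120, 80)
)

# Unweighted: the single-review beer counts as much as the others.
simple_mean <- mean(beers$mean_rating)

# Weighted by review count: each review contributes equally.
weighted_rating <- weighted.mean(beers$mean_rating, w = beers$review_count)
```

In this example the lone 4.8 review pulls the simple mean up noticeably, while the weighted rating stays close to the two heavily reviewed beers.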

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I did extensive tidying of the originally scraped data in order to get it into the form read by this paper - details of the tidying process are in process_data.R.

Virtually all the distributions in this data set are heavily right-skewed. For the purposes of visual review and analysis, I put the histograms on a log10 x-axis in order to make them more meaningful.

Bivariate Plots Section

The strongest relationships here are between alcohol and bitterness (and that makes intuitive sense: Most high-ABV beers are highly hopped as well), between alcohol and rating, and to a lesser degree, between bitterness and rating. So, we can say loosely that high-ABV and highly-hopped beers are generally better-rated. It’s also true that India Pale Ales (hoppy and often high-alcohol) are fashionable right now, which increases their numbers and apparently their ratings.

For the sake of readability: The brewery types are:

##   levels.breweryrollup.brewery_type.
## 1                           Brew Pub
## 2                   Brew Pub/Brewery
## 3                      Client Brewer
## 4                 Commercial Brewery
## 5                    Contract Brewer
## 6                       Microbrewery

From this splatter of data, I’m most interested in how the different types of brewers separated in the ratings. The big commercial breweries don’t come out looking too good.

Let’s dig into the reviewers a bit:

American reviewers are somewhat more positive about beers than their counterparts in other places (a culture of grade inflation at work?), but there is no relationship between the number of reviews produced and their ratings.

Let’s have a look at the ratings. What influences ratings on a given beer?

It wasn’t until I got to this plot that I realized the fun pattern in the Review Distance: as a general rule, lots of beers are reviewed within 2500 km of the brewery (or on the same continent). Then there’s a big gap and another smaller batch of beers are between 5,000 and 10,000 km. (Or one ocean away). This amuses me.

Here’s a more specific look at the correlations between various attributes. This plot offers two ways of visualizing correlation. The deeper the blue, the stronger the positive correlation between two factors. (Negative correlations would appear in shades of red, but none came up.) The correlation ellipses tell a more detailed story about the strength of each correlation and our confidence in any predictions arising from it.

            appearance      aroma     palate     rating      taste        abv        ibu reviewdist
appearance   1.0000000  0.4253924  0.4980460  0.5807414  0.4327424  0.2360711  0.1802814  0.0374421
aroma        0.4253924  1.0000000  0.5285586  0.8734295  0.7961256  0.3843898  0.2625306  0.1162562
palate       0.4980460  0.5285586  1.0000000  0.7307274  0.6166851  0.2967273  0.2090170  0.0690214
rating       0.5807414  0.8734295  0.7307274  1.0000000  0.9210155  0.4071846  0.2904858  0.1131280
taste        0.4327424  0.7961256  0.6166851  0.9210155  1.0000000  0.3648189  0.2465901  0.1115072
abv          0.2360711  0.3843898  0.2967273  0.4071846  0.3648189  1.0000000  0.4318264  0.2293455
ibu          0.1802814  0.2625306  0.2090170  0.2904858  0.2465901  0.4318264  1.0000000  0.0199245
reviewdist   0.0374421  0.1162562  0.0690214  0.1131280  0.1115072  0.2293455  0.0199245  1.0000000
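A matrix like the one above can be produced with cor(). The sketch below uses simulated data, and the choice of use = "pairwise.complete.obs" is an assumption on my part: it is one reasonable way to cope with the many missing IBU values, not necessarily the option used for the real table.

```r
set.seed(1)
n <- 200
# Hypothetical review data: rating is built to track taste closely and abv weakly.
taste  <- sample(1:10, n, replace = TRUE)
abv    <- round(runif(n, 3, 12), 1)
rating <- 0.4 * taste + 0.05 * abv + rnorm(n, sd = 0.3)
ibu    <- ifelse(runif(n) < 0.5, NA, round(runif(n, 5, 100)))  # many IBUs missing

reviews <- data.frame(taste, abv, rating, ibu)

# Pairwise Pearson correlations; rows with NA are dropped per pair, not globally.
corr_matrix <- cor(reviews, use = "pairwise.complete.obs")
```

With pairwise deletion, the taste/rating correlation still uses every row even though about half the rows have no IBU.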

For the purposes of this exercise, I really wanted there to be a ‘home field advantage’ for beers reviewed near home. Unfortunately, there is exactly zero evidence supporting my theory. On the other hand, there is a clear relationship between taste and aroma and the beer’s eventual rating. While we noted that ABV and IBU were correlated with rating, it’s easier to see in this broader context that the relationship is relatively weak. ABV also appears to be correlated with high ratings, but only to a point. High ABV doesn’t necessarily continue to mean high ratings.

Instead, what principally arose (from that earlier plot with color corresponding to Beer Type) was something I did not expect: people don’t like lagers as much as they do ales. I’m going to take a deeper look at the lager/ale question in the multivariate section, as I’d like to see whether the problem is that many beer connoisseurs do not care for commercial lagers (such as Bud Light), or whether lagers are just less favored.

Let’s take a closer look at ratings in light of specific aspects of the beer:

The commercial/client/contract breweries fall short of the microbreweries and brewpubs: their median ratings are generally lower than the smaller breweries’ 25th percentiles.

Let’s consider ratings in terms of where the beers are from:

There is wide variation amongst the different countries. For kicks (yes, this is a multivariate plot, but it followed my train of thought while I was bivariat-ing), I’ve overlaid one point per review, color-coding it with the type of beer reviewed. We get British ales, German/European lagers, and a blend of American ales and lagers.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

We came into this exercise wondering what influenced reviews of beers. On the whole, findings confirmed our assumptions going in (which is kind of disappointing and has clobbered the potential for this paper to grow into anything career-revolutionizing).

  • Large commercial breweries’ beers are not as well reviewed as those produced by brewpubs and microbreweries.
  • People aren’t that fond of lagers.
  • Reviewers’ ‘taste’ and ‘aroma’ sub-ratings are tightly correlated with the beer’s eventual overall rating.
  • To a lesser extent, higher alcohol content (to a point) and bitterness are correlated with higher reviews.
  • Despite my highest hopes, breweries do not enjoy a home-field advantage: the distance between the reviewer and the brewery is essentially unrelated to beer ratings (correlation: 0.113).

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I hadn’t expected there to be such variation among the countries’ reviews. Why, for instance, do reviewers love Norwegian beer but not Swedish beer?

And I had not come into this expecting such clear separation between ales and lagers. I want to look into that more.

What was the strongest relationship you found?

Most definitely between the sub-attributes “taste” and “aroma” and the overall rating.

Multivariate Plots Section

Let’s explore more about what people do or do not like about lagers: (Appearance and Palate are marked on a 1-5 scale, Aroma and Taste on a 1-10 scale. The scores add into an overall score that digests down into the 1-5 Rating.)

Let’s have a closer look at just ratings and how each factor impacts them:

I’ve put the attributes that make up the overall rating on a common scale by doubling Appearance and Palate, which are scored on a scale of 1-5 rather than 1-10.
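A minimal sketch of that rescaling on a few made-up reviews. It also shows a side effect worth keeping in mind when reading the plot: a doubled 1-5 score can never fall below 2.

```r
# Hypothetical sub-ratings for three reviews.
reviews <- data.frame(
  appearance = c(1, 3, 5),   # scored 1-5
  palate     = c(2, 3, 4),   # scored 1-5
  aroma      = c(2, 6, 9),   # scored 1-10
  taste      = c(1, 5, 10)   # scored 1-10
)

# Double the 1-5 attributes so all four share a top score of 10.
five_point <- c("appearance", "palate")
normalized <- reviews
normalized[five_point] <- reviews[five_point] * 2

# Note the floor: a doubled 1-5 score cannot go below 2,
# while aroma and taste can reach 1.
min_doubled <- min(normalized$appearance)
```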

Taste and Aroma track Rating very tightly. For lower-rated beers, people seem willing to give somewhat higher scores to Appearance and Palate; for better-rated beers, those scores fall into line with the final rating as well. It’s possible that, since these attributes are scored on a 1-5 scale, the minimum score is higher relative to the range of possible scores. I don’t think we’re looking at anything significant.

brewery_type        beer_type  median_rating  number_of_ratings         sd  lager_ale_rating_ratio
Brew Pub            Ale             3.340000               2821  0.4345558               0.8525234
Brew Pub            Lager           3.036667                488  0.4431040               0.1474766
Brew Pub/Brewery    Ale             3.400000               2578  0.4395995               0.8270773
Brew Pub/Brewery    Lager           3.050000                539  0.4241495               0.1729227
Client Brewer       Ale             3.100000                528  0.5671294               0.6633166
Client Brewer       Lager           2.376667                268  0.7237597               0.3366834
Commercial Brewery  Ale             3.028571               2399  0.5346603               0.5305175
Commercial Brewery  Lager           2.737500               2123  0.5655992               0.4694825
Contract Brewer     Ale             3.085000                 42  0.4815728               0.9130435
Contract Brewer     Lager           2.350000                  4  0.7192299               0.0869565
Microbrewery        Ale             3.385714              11298  0.3986643               0.8943952
Microbrewery        Lager           3.100000               1334  0.4791834               0.1056048

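From this table and plot we can conclude that the client and commercial breweries produce a large share of the reviewed lagers and that their lagers are rated lower than their ales. For client brewers the gap in median rating (about 0.72) is roughly one standard deviation; for commercial breweries (about 0.29) it is closer to half of one.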

Let’s simplify a bit.

This information helps us see that the big commercial breweries and client brewers (also high-volume operations) are making the majority of the lagers reviewed, and their ratio of lagers to ales is much higher than for the microbreweries. Let’s refine some more and see whether it’s the lagers themselves pulling the commercial breweries down or some other factor.
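The lager_ale_rating_ratio column in the table above appears to be each beer type's share of the ratings within its brewery type. The sketch below reproduces it for two brewery types using the counts from that table; the variable names are my own.

```r
# Rating counts for two brewery types, copied from the table above.
counts <- data.frame(
  brewery_type      = c("Brew Pub", "Brew Pub",
                        "Commercial Brewery", "Commercial Brewery"),
  beer_type         = c("Ale", "Lager", "Ale", "Lager"),
  number_of_ratings = c(2821, 488, 2399, 2123)
)

# Total ratings per brewery type, repeated on every row of that type.
totals <- ave(counts$number_of_ratings, counts$brewery_type, FUN = sum)

# Share of each brewery type's ratings that go to ales vs. lagers.
counts$share <- counts$number_of_ratings / totals
```

Lagers account for roughly 15% of the ratings at brew pubs but about 47% at commercial breweries, which is the imbalance described above.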

brewery_type        beer_type  mean_rating  median_rating  number_of_ratings  ratio
Brew Pub            Lager         3.082077       3.082077                488      1
Brew Pub/Brewery    Lager         3.098543       3.098543                539      1
Client Brewer       Lager         2.545358       2.545358                268      1
Commercial Brewery  Lager         2.884549       2.884549               2123      1
Contract Brewer     Lager         2.456156       2.456156                  4      1
Microbrewery        Lager         3.159099       3.159099               1334      1

The Commercial breweries certainly get a large number of reviews for lagers, and their reviews just aren’t very good. Do commercial breweries’ beers get more reviews? It stands to reason that their products are available across a larger market than the microbreweries, and so are more likely to be reviewed.

I’ve removed one 400-beer outlier (a microbrewery) to help this plot be more readable:

So, the commercial brewers still stand out with their lower ratings in general - and it appears that producing more beers is loosely tied to better ratings. This plot turned out to be a big disappointment, though. Although it has lots of interesting dimensions and pretty colours, it fails to tell much of a compelling story and I think I’ll leave it in a draft state.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation.

As I dug into this, I was able to tease out more information about the breweries and the ratings associated with their beers. On the whole, commercial brewers’ products don’t fare well.

Patterns arose that initially show that commercial breweries are not rated as highly as the microbreweries. That relationship, however, is clouded by the accompanying fact that lagers are generally much lower-rated than ales; and commercial breweries’ portfolios are much more biased toward lagers.

We don’t have access to sales statistics, so we don’t have a key part of teasing this analysis apart. I cannot tell whether certain beers get a relatively high number of reviews in light of how many are actually on the market. It’s impossible to tell whether the American Commercial Lagers (like Bud) are pulling brewery ratings down in proportion to their place in the market.

Were there any interesting or surprising interactions between features?

I got more and more interested in the lager/ale imbalance and, as noted above, would like to dig more into the state of the market. I think the plot that shows only lager ratings helps us see that micro-lager is generally better regarded than commercial lager, which makes me think the Budweiser factor is indeed at work.


Final Plots and Summary

Plot One

Description One

Here is a look at the various styles of beer in a way that helps you see how they are positioned in terms of alcohol content and bitterness. (A beer connoisseur would say the bitterness scale is actually a continuum between ‘malty’ (low IBU) and ‘hoppy’ (high IBU).)

I’ve shown the ales as circles and lagers as triangles. The colour is coded to the style’s mean rating. It is on a continuum from yellow (terrible) to dark blue (highly-rated). I think this is an interesting look at the overall world of beer styles and where your favourite brew might fit in the picture.

Plot Two

type                review_count  rating_mean  rating_median  reviewdist_mean  reviewdist_median  Corr_Dist_Rating
Brew Pub                   17583     3.309828            3.4         3313.818          1315.8946         0.1024090
Brew Pub/Brewery           27908     3.400369            3.4         3819.123          2205.4558         0.1923701
Client Brewer              13102     2.921661            3.0         2306.509           881.5625         0.0386581
Commercial Brewery         99035     2.996840            3.1         2356.545           925.6238         0.0446273
Contract Brewer              734     2.944142            3.0         2027.733           991.3055        -0.0172197
Microbrewery              193237     3.391272            3.4         2691.654           945.5683         0.1235815

Description Two

I wanted to take a closer look at the hypothesis that a reviewer would favor a brewery that was closer to him or her. We’d expect a negative correlation between distance and the rating. As we saw in the initial analysis, the correlation is, in fact, positive and almost nonexistent (0.113). This plot and the accompanying table break down reviewer distance and ratings into a few more parts and some interesting things arose.
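The per-type correlation in the last column of the table can be computed by splitting the reviews by brewery type and correlating within each group. This sketch uses random data, so the values it produces mean nothing in themselves.

```r
set.seed(7)
# Hypothetical review-level data with a brewery type, a distance, and a rating.
reviews <- data.frame(
  type       = rep(c("Microbrewery", "Commercial Brewery"), each = 100),
  reviewdist = runif(200, 0, 10000),
  rating     = round(runif(200, 0.5, 5), 1)
)

# Pearson correlation between distance and rating, within each brewery type.
corr_by_type <- sapply(
  split(reviews, reviews$type),
  function(d) cor(d$reviewdist, d$rating)
)
```

A home-field advantage would show up here as clearly negative values: ratings falling as distance grows.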

All these relationships are very, very weak, but it appears that the farther a reviewer gets from the micros and the brewpubs, the better the ratings are, especially for Brew Pubs/Breweries. This makes no sense to me, as I’d have assumed that a brew pub would be most likely to have a rabid local following. Now, the median review distance for Brew Pubs/Breweries is almost double that of either Brew Pubs or Micros, and so is the positive relationship between distance and rating. I cannot explain why the “Brew Pub/Brewery” class stands out, but this would bear more exploration. It’s possible that this is a data flaw from the incomplete data set I collected.

Conversely, although we have shown that people aren’t too keen on the big commercial breweries (especially their lagers), reviewers are faintly more likely to give them a higher score if they are far, far away from the brewery. A glance at the plot seems to show a small bump in ratings once a reviewer is on another continent, and another if they can get to be on the opposite side of the planet from a commercial brewery.

Plot Three

Description Three

This plot finally stitches together the lager/ale story in a cleaner way. Although it was tempting (as with the final plot in the exploratory plots) to try to add in as many factors as possible, the picture just got too cluttered and didn’t tell much of a story. This plot helps us see that the differences between brewery types almost vanish when we just look at ales - but that the commercial lagers really pull down the ratings. The second quartile of the commercial brews is noticeably wide, indicating the number of low-performing beers in their portfolios.


Reflection

This exercise was complicated by both too much and too little data.

It bears repeating that this analysis is of only about half the total data set, thanks to the ratebeer.com site reformat that stopped my data gathering cold. I would like to believe that the patterns arising would hold true across the entire world of beer, but it’s impossible to know.

I then found that I had many different directions to analyze - beers, breweries, geographic, quantitative, qualitative, etc., and got caught in a position where I didn’t take a really deep dive into any one area but spread my analysis across the whole data set. The biggest surprise came with the lager/ale split. With more time, I’d dig into the various subtypes of beer (IPA vs. Scotch Ale vs. American Light Lager etc.) and how they are rated and reviewed.

With more data, specifically data that would help me understand sales volumes for the various beers, I’d start normalizing the numbers so that I could look at a more consistent rating scale - I’d very much like to know whether the ratio of reviews to sales for Bud Light is the same as for some of the well-regarded microbreweries.

This would affect how we weight the reviews and cast more light on whether specific beers create disproportionate pressure on a brewery’s or a style’s rating.

Finally, I’m still tempted to go back and fix the screen-scraping Python code, because it bugs me that I haven’t been able to get the whole data set - but the effort-to-reward ratio may be pretty poor.