I was inspired by the White Wine data set but wanted to do something different. So I spent a month or so scraping, extracting, and cleaning a collection of beer reviews from ratebeer.com. Without going into gory detail, I used a Python module (called ratebeer) to scrape and extract information about 13,000 breweries, 147,000 beers, and 2,140,000 beer reviews. Unfortunately, midway through, ratebeer changed their interface and the scripts failed. As a result, I have approximately half the breweries (6,358) accounted for.
I spent a couple of weeks debugging and trying to rewrite the Python module, but have decided to proceed with the partial data set: the point of the project is not debugging screen-scraping code in Python. So everything beyond this must be taken with a grain of salt - although I believe I have a representative sample, 50% of my data set is missing. For example, there are no breweries whose names start with the letters A, C, F, G, H, and L; and B, M, and S are partial extracts.
It’s really important to note that because of this, I’ve lost some big commercial breweries such as Anheuser-Busch. As a result, we’re exploring the data set but cannot really make strong statements about the state of the beer world as a whole.
Detailed data cleaning takes place in process_data.R (submitted with this assignment), which produces the tables loaded by this markdown document. I only extracted reviews for beers with ten or more reviews. After extracting reviews, I used the Data Science Toolkit to geotag the breweries and reviewers, then calculated the distance between each reviewer and brewery and stored that value in a column called reviewdist.
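The distance itself can be computed from the geotagged coordinates with the haversine formula. Here is a minimal sketch in Python for illustration (the actual calculation lives in process_data.R; the function name and radius constant are my own):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = Earth's mean radius

# A reviewer near (52.2, 1.0) and a brewery near (52.0, -0.5), as in the
# sample row below, come out at roughly 105 km -- close to the ~104 km
# stored in reviewdist.
```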
This is our foundation data set - one row per review per beer per brewery. It has a lot of repetition but allows us to use dplyr’s grouping and summary tools easily.
## 'data.frame': 1 obs. of 51 variables:
## $ appearance : int 3
## $ aroma : int 6
## $ beer_url : chr "/beer/bt-5-hop-bitter/39628/"
## $ date : chr "2010-12-03"
## $ overall : int 11
## $ palate : int 3
## $ rating : num 2.8
## $ taste : int 5
## $ text : chr "HP-The Dove,BSE,golden with no head,aroma of hops and pine,taste of strong hops,grapefruit,pine and cedar and some alcohol.."
## $ user_location : chr "Suffolk, ENGLAND"
## $ user_name : chr "Garrat"
## $ user_lat : num 52.2
## $ user_lon : num 1
## $ user_country : chr "United Kingdom"
## $ user_continent : chr "Europe"
## $ beers_has_fetched : chr "True"
## $ abv : num 4.1
## $ brewed_at : chr ""
## $ brewery : chr "B&T Brewery"
## $ brewery_url : chr "/brewers/bt-brewery/1948/"
## $ calories : int 123
## $ description : chr "Cask; Seasonal - Autumn. Has also been available bottle conditioned. \nSeasonal beer brewed with green hops. \nUses Challenger,"| __truncated__
## $ ibu : int NA
## $ img_url : chr "http://res.cloudinary.com/ratebeer/image/upload/w_120,c_limit/beer_39628.jpg"
## $ mean_rating : num 3.08
## $ beer_name : chr "B&T 5 Hop Bitter"
## $ num_ratings : int 10
## $ overall_rating : int 44
## $ seasonal : chr "Autumn"
## $ style : chr "Bitter"
## $ style_rating : int 55
## $ style_url : chr "/beerstyles/bitter/20/"
## $ tags : chr "[bramling cross, fuggles, cascade, challenger, bottle conditioned]"
## $ weighted_avg : num NA
## $ breweries_has_fetched: logi TRUE
## $ city : chr "Shefford"
## $ brewery_name : chr "B&T Brewery"
## $ postal_code : chr "SG17 5DZ"
## $ state : chr "Bedfordshire"
## $ street : chr "B & T Brewery, 3E-3F St. Francis Way"
## $ telephone : chr "01462 815080"
## $ type : Factor w/ 6 levels "Brew Pub","Brew Pub/Brewery",..: 6
## $ web : chr "http://www.banksandtaylor.com/"
## $ location : chr "Shefford Bedfordshire England"
## $ lat : num 52
## $ lon : num -0.5
## $ country : chr "United Kingdom"
## $ continent : chr "Europe"
## $ first_letter : chr "B"
## $ reviewdist : num 104
## $ beer_type : Factor w/ 3 levels "Ale","Lager",..: 1
## [1] 351599 51
The data set contains multitudes. We dropped miscellaneous producers of sake, mead, and cider. We only kept those beers that had 10 or more reviews. After dropping non-beer entries and those with invalid geodata or other attributes, we have:
You can see how gap-toothed the brewery list is, thanks to the forced stop to data collection when ratebeer.com changed their interface:
Our data set is very long-tailed to the right - in almost every instance, whether we’re counting the distribution of the numbers of beers, reviewers, or styles, a large number of entries have a tiny number of instances, with notable outliers to the right. The following histograms are almost always transformed with a log10 x-axis.
Beers come in a variety of styles. The ‘counter’ variable below shows how many beers per style are included in the data set.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.0 100.8 195.0 318.8 418.0 1978.0
We have a long-tailed distribution of beer styles. While India Pale Ales are very commonly made (1978 beers), most styles are represented by fewer than 200 instances.
Let’s look at other aspects of beers:
## beer_url review_count brewery brewery_url
## Length:25504 Min. : 1.00 Length:25504 Length:25504
## Class :character 1st Qu.: 2.00 Class :character Class :character
## Mode :character Median : 7.00 Mode :character Mode :character
## Mean : 13.79
## 3rd Qu.: 15.00
## Max. :456.00
##
## beer_name style
## Length:25504 India Pale Ale (IPA): 1978
## Class :character American Pale Ale : 1267
## Mode :character Bitter : 1171
## Golden Ale/Blond Ale: 1165
## Imperial Stout : 914
## Imperial IPA : 872
## (Other) :18137
## brewery_type brewery_country brewery_continent
## Brew Pub : 3486 Length:25504 Length:25504
## Brew Pub/Brewery : 3254 Class :character Class :character
## Client Brewer : 837 Mode :character Mode :character
## Commercial Brewery: 4632
## Contract Brewer : 48
## Microbrewery :13247
##
## abv ibu first_letter mean_rating
## Min. : 0.010 Min. : 1.0 Length:25504 Min. :0.500
## 1st Qu.: 4.800 1st Qu.: 24.0 Class :character 1st Qu.:2.950
## Median : 5.500 Median : 37.0 Mode :character Median :3.256
## Mean : 6.205 Mean : 44.5 Mean :3.201
## 3rd Qu.: 7.100 3rd Qu.: 63.0 3rd Qu.:3.550
## Max. :57.700 Max. :240.0 Max. :5.000
## NA's :1119 NA's :21108
## weight_factor beer_type lat lon
## Min. : 0.50 Ale :13481 Min. :-37.86 Min. :-159.7199
## 1st Qu.: 7.60 Lager: 3859 1st Qu.: 40.00 1st Qu.: -91.9673
## Median : 20.70 Other: 1624 Median : 44.77 Median : -75.4955
## Mean : 44.92 NA's : 6540 Mean : 44.63 Mean : -54.2495
## 3rd Qu.: 47.60 3rd Qu.: 51.25 3rd Qu.: -0.2417
## Max. :1832.80 Max. : 67.09 Max. : 159.9500
##
The overwhelming majority of beers had a tiny number of reviews. When we start crunching information about reviews, we’ll restrict ourselves to beers with enough reviews to be reasonably reliable. So let’s lop off the low-review beers.
Note the log10 transformation on the x-axis, as this is a very right-skewed distribution - the beers that do have thousands of reviews are lost otherwise.
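In process_data.R this cut is a dplyr filter; the same idea sketched in Python (the threshold and variable names are illustrative):

```python
import math

# Toy version of the per-beer review counts.
review_counts = [1, 1, 2, 7, 15, 456]

# Lop off the low-review beers.
MIN_REVIEWS = 10
kept = [n for n in review_counts if n >= MIN_REVIEWS]  # [15, 456]

# The log10 transform used for the histogram x-axis compresses the long
# right tail so heavily-reviewed outliers remain visible.
log_counts = [math.log10(n) for n in kept]
```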
Other aspects of the reviewed beers:
Alcohol: ABV (Alcohol by Volume) looks normal, except for a few insane outliers. I looked them up and they all came out of the same brewery and are clearly a specialty of theirs.
## Source: local data frame [6 x 3]
##
## style beer_name abv
## (fctr) (chr) (dbl)
## 1 Eisbock Schorschbock Ice30% 30.00
## 2 Eisbock Schorschbräu Schorschbock 31% Black Edition 31.00
## 3 Eisbock Schorschbräu Schorschbock 31% 30.86
## 4 Eisbock Schorschbräu Schorschbock 40% 39.44
## 5 Eisbock Schorschbräu Schorschbock 43% 43.38
## 6 Eisbock Schorschbräu Schorschbock 57% finis coronat opus 57.70
Here it is in a little finer detail with the outliers pulled:
Bitterness:
IBU (International Bitterness Units) have an interesting spiky look - as I increased the number of bins, it became more clear that the spikes come from people rounding IBU to the nearest 5. (An IBU of 100 is really, really, really bitter. Like double India Pale Ale bitter. And it looks like a beer with an IBU of 100 is a good marketing hook.)
Let’s have a look at where all these breweries are:
And while we’re at it, let’s find out who the reviewers tend to be.
## user_name review_count user_location mean_user_rating
## Length:2262 Min. : 1.0 Length:2262 Min. :0.500
## Class :character 1st Qu.: 1.0 Class :character 1st Qu.:3.176
## Mode :character Median : 4.0 Mode :character Median :3.502
## Mean : 155.4 Mean :3.541
## 3rd Qu.: 32.0 3rd Qu.:3.967
## Max. :9037.0 Max. :5.000
## user_lat user_lon user_country user_continent
## Min. :-43.53 Min. :-123.12 Length:2262 Length:2262
## 1st Qu.: 35.13 1st Qu.:-118.02 Class :character Class :character
## Median : 43.70 Median : -77.04 Mode :character Mode :character
## Mean : 43.46 Mean : -45.63
## 3rd Qu.: 52.50 3rd Qu.: 10.66
## Max. : 69.65 Max. : 175.28
Even with a log10 scale, we see a typical social media pattern, in which a great many people produce a very small number of contributions, but a few outliers create a significant portion of the overall body of work. (The managers of ratebeer.com swear that it is, indeed, possible for a qualified beer expert to have thoughtfully considered 3,000 beers over the time the site has been up and going). The distribution is referred to as a power law, or a Pareto distribution. (See Lerman, “User participation in social media: Digg study.”)
The Pareto principle (also known as the 80–20 rule) states that, for many events, roughly 80% of the effects come from 20% of the causes. In this case, the top 20% of the reviewers have produced 96% of the reviews.
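That top-20% share is easy to compute directly; a sketch with a toy set of per-reviewer counts (the numbers are invented for illustration):

```python
def top_share(counts, fraction=0.2):
    """Fraction of total output produced by the top `fraction` of contributors."""
    ordered = sorted(counts, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)

# One prolific reviewer and many one-offs, echoing the real distribution.
reviews_per_user = [9037, 500, 100, 32, 4, 4, 2, 1, 1, 1]
share = top_share(reviews_per_user)  # top 2 of 10 reviewers produce the bulk
```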
Where are they?
It’s worth noting that the geographic data for reviews is pretty terrible relative to the address-specific information we have for breweries. Many people, for example, just put “California” in their location, so they have all geocoded on top of each other in the Los Angeles area. Otherwise, we don’t know much about the individuals beyond their locations. We’ll look further into the reviews below.
The dataset is a denormalized, wide listing of every review of every beer included in the data. We’ve used dplyr to build out summary tables of beers and breweries, based on this information. The underlying relationship is that each brewery has one or more beers, each of which has ten or more reviews. This is important: although we recorded the number of reviews each beer has, we only collected the reviews themselves for beers with ten or more.
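The dplyr rollup is a group_by/summarise pass over the review rows; the same pattern sketched in Python (the second beer name is invented for illustration):

```python
from collections import defaultdict

reviews = [
    {"brewery": "B&T Brewery", "beer_name": "B&T 5 Hop Bitter", "rating": 2.8},
    {"brewery": "B&T Brewery", "beer_name": "B&T 5 Hop Bitter", "rating": 3.2},
    {"brewery": "B&T Brewery", "beer_name": "Hypothetical Mild", "rating": 3.5},
]

# group_by(brewery, beer_name) %>% summarise(mean_rating = mean(rating))
groups = defaultdict(list)
for r in reviews:
    groups[(r["brewery"], r["beer_name"])].append(r["rating"])
beer_rollup = {key: sum(v) / len(v) for key, v in groups.items()}
```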
We have many, many categorical variables, and fewer continuous ones (principally the ratings and the ABV/IBU numbers). This means that our opportunities for numerical correlation are somewhat limited relative to our ability to consider one categorical attribute in light of one or more others.
I am most interested in two facets of the dataset. First, the relationship between various attributes of the beer and its rating, and second, as mentioned in the introduction, whether beers have any kind of ‘home field advantage’ based on how close the reviewer is to the brewery.
For the analysis of whether a beer’s attributes impact its rating, I will consider some sub-ratings (such as aroma and taste), style, abv, ibu, brewery type, and brewery location. For the analysis of whether a reviewer rates local beers higher, I will start with the absolute distance between a reviewer and a brewery. I may also investigate filtering on country and/or state. People in Kalamazoo, Michigan are far more likely to love the Tigers than the Cubs, even though Chicago is much closer. Do people close to Bell’s Brewery in Kalamazoo rate it more highly than residents of Colorado do?
I created one convenience variable (first_letter) for the sake of breaking up the data set alphabetically, and then calculated the distance between reviewer and brewery from the coordinates I looked up when geotagging each review and brewery.
During the rollup procedure, I also created a weighted mean rating for breweries, so that beers with many reviews aren’t overshadowed by beers with one or two.
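A sketch of that weighting, assuming each beer’s mean rating is weighted by its number of reviews (field names are illustrative):

```python
def brewery_weighted_rating(beers):
    """Brewery-level rating with each beer weighted by its review count."""
    total = sum(b["mean_rating"] * b["num_ratings"] for b in beers)
    n = sum(b["num_ratings"] for b in beers)
    return total / n

beers = [
    {"mean_rating": 4.0, "num_ratings": 200},  # flagship with many reviews
    {"mean_rating": 2.0, "num_ratings": 10},   # obscure one-off
]
# A naive unweighted mean would report 3.0; the weighted figure stays
# close to the heavily-reviewed flagship (about 3.90).
```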
I did extensive tidying of the originally scraped data in order to get it into the form read by this paper - details of the tidying process are in process_data.R.
Virtually all the distributions in this data set are heavily right-skewed. For the purposes of visual review and analysis, I put the histograms on a log10 x-axis in order to make them more meaningful.
The strongest relationships here are between alcohol and bitterness (and that makes intuitive sense: Most high-ABV beers are highly hopped as well), between alcohol and rating, and to a lesser degree, between bitterness and rating. So, we can say loosely that high-ABV and highly-hopped beers are generally better-rated. It’s also true that India Pale Ales (hoppy and often high-alcohol) are fashionable right now, which increases their numbers and apparently their ratings.
For the sake of readability, the brewery types are:
## levels.breweryrollup.brewery_type.
## 1 Brew Pub
## 2 Brew Pub/Brewery
## 3 Client Brewer
## 4 Commercial Brewery
## 5 Contract Brewer
## 6 Microbrewery
From this splatter of data, I’m most interested in how the different types of brewers separated in the ratings. The big commercial breweries don’t come out looking too good.
Let’s dig into the reviewers a bit:
American reviewers are somewhat more positive about beers than their counterparts in other places (a culture of grade inflation at work?), but there is no relationship between the number of reviews produced and their ratings.
Let’s have a look at the ratings. What influences ratings on a given beer?
It wasn’t until I got to this plot that I realized the fun pattern in the Review Distance: as a general rule, lots of beers are reviewed within 2500 km of the brewery (or on the same continent). Then there’s a big gap and another smaller batch of beers are between 5,000 and 10,000 km. (Or one ocean away). This amuses me.
Here’s a more specific look at the correlation factors between various aspects. This plot offers two ways of visualizing correlation: the deeper the blue, the stronger the positive correlation between two factors. (Negative correlations would appear in shades of red, but none came up.) The correlation ellipses tell a more detailed story about the strength of each correlation and our confidence in predictions arising from it.
| | appearance | aroma | palate | rating | taste | abv | ibu | reviewdist |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| appearance | 1.0000000 | 0.4253924 | 0.4980460 | 0.5807414 | 0.4327424 | 0.2360711 | 0.1802814 | 0.0374421 |
| aroma | 0.4253924 | 1.0000000 | 0.5285586 | 0.8734295 | 0.7961256 | 0.3843898 | 0.2625306 | 0.1162562 |
| palate | 0.4980460 | 0.5285586 | 1.0000000 | 0.7307274 | 0.6166851 | 0.2967273 | 0.2090170 | 0.0690214 |
| rating | 0.5807414 | 0.8734295 | 0.7307274 | 1.0000000 | 0.9210155 | 0.4071846 | 0.2904858 | 0.1131280 |
| taste | 0.4327424 | 0.7961256 | 0.6166851 | 0.9210155 | 1.0000000 | 0.3648189 | 0.2465901 | 0.1115072 |
| abv | 0.2360711 | 0.3843898 | 0.2967273 | 0.4071846 | 0.3648189 | 1.0000000 | 0.4318264 | 0.2293455 |
| ibu | 0.1802814 | 0.2625306 | 0.2090170 | 0.2904858 | 0.2465901 | 0.4318264 | 1.0000000 | 0.0199245 |
| reviewdist | 0.0374421 | 0.1162562 | 0.0690214 | 0.1131280 | 0.1115072 | 0.2293455 | 0.0199245 | 1.0000000 |
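Each cell in the matrix above is an ordinary Pearson correlation; for reference, a dependency-free sketch of the computation:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Diagonal cells are 1.0 because every column correlates perfectly with
# itself; off-diagonal cells fall between -1 and 1.
```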
For the purposes of this exercise, I really wanted there to be a ‘home field advantage’ for beers reviewed near home. Unfortunately, there is exactly zero evidence supporting my theory. On the other hand, there is a clear relationship between taste and aroma and the beer’s eventual rating. While we noted that ABV and IBU were correlated with rating, it’s easier to see in this broader context that the relationship is relatively weak. ABV appears to be correlated with high ratings only up to a point: high ABV doesn’t necessarily continue to mean high ratings.
Instead, what principally arose (from that earlier plot with color corresponding to Beer Type) was something I did not expect: people don’t like lagers as much as they do ales. I’m going to take a deeper look at the lager/ale question in the multivariate section, as I’d like to see whether the problem is that many beer connoisseurs do not care for commercial lagers (such as Bud Light), or whether lagers are just less favored.
Let’s take a closer look at ratings in light of specific aspects of the beer:
The commercial/client/contract breweries fall short of the microbreweries and brewpubs: their median ratings are generally lower than the smaller breweries’ 25th percentiles.
Let’s consider ratings in terms of where the beers are from:
There is wide variation amongst the different countries. For kicks (yes, this is a multivariate plot, but it followed my train of thought while I was bivariat-ing), I’ve overlaid one point per review, color-coded by the type of beer reviewed. We get British ales, German/European lagers, and a blend of American ales and lagers.
We came into this exercise wondering what influenced reviews of beers. On the whole, findings confirmed our assumptions going in (which is kind of disappointing and has clobbered the potential for this paper to grow into anything career-revolutionizing).
I hadn’t expected there to be such variation among the countries’ reviews. Why, for instance, do reviewers love Norwegian beer but not Swedish beer?
And I had not come into this expecting such clear separation between ales and lagers. I want to look into that more.
The strongest relationship was most definitely between the sub-attributes “taste” and “aroma” and the overall rating.
Let’s explore more about what people do or do not like about lagers: (Appearance and Palate are marked on a 1-5 scale, Aroma and Taste on a 1-10 scale. The scores add into an overall score that digests down into the 1-5 Rating.)
Let’s have a closer look at just ratings and how each factor impacts them:
I’ve normalized each of the attributes (Appearance and Palate) that make up an overall rating by doubling the values that are scored on a scale of 1-5 rather than 1-10.
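The doubling puts Appearance and Palate (1-5) on the same 1-10 range as Aroma and Taste. A sketch, using the column names from the data set:

```python
def normalize_subscores(review):
    """Put all four sub-scores on a common 1-10 range by doubling the 1-5 ones."""
    out = dict(review)
    for col in ("appearance", "palate"):  # scored on a 1-5 scale
        out[col] = review[col] * 2
    return out  # aroma and taste are already on a 1-10 scale

sample = {"appearance": 3, "aroma": 6, "palate": 3, "taste": 5}
normalized = normalize_subscores(sample)  # appearance -> 6, palate -> 6
```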
Taste and Aroma are pretty much wired in a tight correlation to Rating. For lower-rated beers, people seem willing to give somewhat higher scores to Appearance and Palate; those scores then fall in line with the final rating for better-rated beers. It’s possible that, since these attributes are scored on a 1-5 scale, the minimum score is simply higher relative to the range of possible scores. I don’t think we’re looking at anything significant.
| brewery_type | beer_type | median_rating | number_of_ratings | sd | lager_ale_rating_ratio |
| --- | --- | --- | --- | --- | --- |
| Brew Pub | Ale | 3.340000 | 2821 | 0.4345558 | 0.8525234 |
| Brew Pub | Lager | 3.036667 | 488 | 0.4431040 | 0.1474766 |
| Brew Pub/Brewery | Ale | 3.400000 | 2578 | 0.4395995 | 0.8270773 |
| Brew Pub/Brewery | Lager | 3.050000 | 539 | 0.4241495 | 0.1729227 |
| Client Brewer | Ale | 3.100000 | 528 | 0.5671294 | 0.6633166 |
| Client Brewer | Lager | 2.376667 | 268 | 0.7237597 | 0.3366834 |
| Commercial Brewery | Ale | 3.028571 | 2399 | 0.5346603 | 0.5305175 |
| Commercial Brewery | Lager | 2.737500 | 2123 | 0.5655992 | 0.4694825 |
| Contract Brewer | Ale | 3.085000 | 42 | 0.4815728 | 0.9130435 |
| Contract Brewer | Lager | 2.350000 | 4 | 0.7192299 | 0.0869565 |
| Microbrewery | Ale | 3.385714 | 11298 | 0.3986643 | 0.8943952 |
| Microbrewery | Lager | 3.100000 | 1334 | 0.4791834 | 0.1056048 |
From this table and plot we can conclude that the client/commercial breweries produce a large number of the reviewed lagers and that their ratings are lower, by a margin that exceeds the standard deviation from the mean.
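The lager_ale_rating_ratio column in the table above is simply each beer type’s share of that brewery category’s review count; for the Brew Pub rows:

```python
def type_shares(counts_by_type):
    """Each beer type's share of a brewery category's total review count."""
    total = sum(counts_by_type.values())
    return {t: n / total for t, n in counts_by_type.items()}

shares = type_shares({"Ale": 2821, "Lager": 488})
# Ale ~ 0.8525 and Lager ~ 0.1475, matching the Brew Pub rows above.
```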
Let’s simplify a bit.
This information helps us see that the big commercial breweries and client brewers (also high-volume operations) are making the majority of the lagers reviewed, and their ratio of lagers to ales is much higher than for the microbreweries. Let’s refine some more and see whether it’s the lagers themselves pulling the commercial breweries down or some other factor.
| brewery_type | beer_type | mean_rating | median_rating | number_of_ratings | ratio |
| --- | --- | --- | --- | --- | --- |
| Brew Pub | Lager | 3.082077 | 3.082077 | 488 | 1 |
| Brew Pub/Brewery | Lager | 3.098543 | 3.098543 | 539 | 1 |
| Client Brewer | Lager | 2.545358 | 2.545358 | 268 | 1 |
| Commercial Brewery | Lager | 2.884549 | 2.884549 | 2123 | 1 |
| Contract Brewer | Lager | 2.456156 | 2.456156 | 4 | 1 |
| Microbrewery | Lager | 3.159099 | 3.159099 | 1334 | 1 |
The commercial breweries certainly get a large number of reviews for lagers, and those reviews just aren’t very good. Do commercial breweries’ beers get more reviews? It stands to reason that their products are available across a larger market than the microbreweries’, and so are more likely to be reviewed.
I’ve removed one 400-beer outlier (a microbrewery) to help this plot be more readable:
So, the commercial brewers still stand out with their lower ratings in general - and it appears that producing more beers is loosely tied to better ratings. This plot turned out to be a big disappointment, though. Although it has lots of interesting dimensions and pretty colours, it fails to tell much of a compelling story and I think I’ll leave it in a draft state.
As I dug into this, I was able to tease out more information about the breweries and the ratings associated with their beers. On the whole, commercial brewers’ products don’t fare well.
Patterns arose that initially show that commercial breweries are not rated as highly as the microbreweries. That relationship, however, is clouded by the accompanying fact that lagers are generally much lower-rated than ales; and commercial breweries’ portfolios are much more biased toward lagers.
We don’t have access to sales statistics, so we’re missing a key ingredient for teasing this analysis apart. I cannot tell whether certain beers get a relatively high number of reviews in light of how many are actually on the market. It’s impossible to tell whether the American commercial lagers (like Bud) are pulling brewery ratings down in proportion to their place in the market.
I got more and more interested in the lager/ale imbalance and, as noted above, would like to dig more into the state of the market. I think the plot that shows only lager ratings helps us see that micro-lager is generally better regarded than commercial lager, which makes me think the Budweiser factor is indeed at work.
Here is a look at the various styles of beer in a way that helps you see how they are positioned in terms of alcohol content and bitterness. (A beer connoisseur would say the bitterness scale is actually a continuum between ‘malty’ (low IBU) and ‘hoppy’ (high IBU).)
I’ve shown the ales as circles and lagers as triangles. The colour is coded to the style’s mean rating, on a continuum from yellow (terrible) to dark blue (highly rated). I think this is an interesting look at the overall world of beer styles and where your favourite brew might fit in the picture.
| type | review_count | rating_mean | rating_median | reviewdist_mean | reviewdist_median | Corr_Dist_Rating |
| --- | --- | --- | --- | --- | --- | --- |
| Brew Pub | 17583 | 3.309828 | 3.4 | 3313.818 | 1315.8946 | 0.1024090 |
| Brew Pub/Brewery | 27908 | 3.400369 | 3.4 | 3819.123 | 2205.4558 | 0.1923701 |
| Client Brewer | 13102 | 2.921661 | 3.0 | 2306.509 | 881.5625 | 0.0386581 |
| Commercial Brewery | 99035 | 2.996840 | 3.1 | 2356.545 | 925.6238 | 0.0446273 |
| Contract Brewer | 734 | 2.944142 | 3.0 | 2027.733 | 991.3055 | -0.0172197 |
| Microbrewery | 193237 | 3.391272 | 3.4 | 2691.654 | 945.5683 | 0.1235815 |
I wanted to take a closer look at the hypothesis that a reviewer would favor a brewery that was closer to him or her. We’d expect a negative correlation between distance and the rating. As we saw in the initial analysis, the correlation is, in fact, positive and almost nonexistent (0.113). This plot and the accompanying table break down reviewer distance and ratings into a few more parts and some interesting things arose.
All these relationships are very, very weak, but it appears that the farther a reviewer gets from the micros and the brew pubs, the better the ratings are, especially for Brew Pubs/Breweries. This makes no sense to me, as I’d have assumed that a brew pub would be most likely to have a rabid local following. The median review distance for Brew Pubs/Breweries is almost double that of either Brew Pubs or Micros, and so is the positive correlation between distance and rating. I cannot explain why the “Brew Pub/Brewery” class stands out, but this would bear more exploration. It’s possible that this is a flaw arising from the incomplete data set I collected.
Conversely, although we have shown that people aren’t too keen on the big commercial breweries (especially their lagers), reviewers are faintly more likely to give them a higher score if they are far, far away from the brewery. A visual glance suggests a small bump in ratings once a reviewer is on another continent, and another once they are on the opposite side of the planet from a commercial brewery.
This plot finally stitches together the lager/ale story in a cleaner way. Although it was tempting (as with the final plot in the exploratory plots) to try to add in as many factors as possible, the picture just got too cluttered and didn’t tell much of a story. This plot helps us see that the differences between brewery types almost vanish when we just look at ales - but that the commercial lagers really pull down the ratings. The second quartile of the commercial brews is noticeably wide, indicating the number of low-performing beers in their portfolios.
This exercise was complicated by both too much and too little data.
It bears repeating that this analysis is of only about half the total data set, thanks to the ratebeer.com site reformat that stopped my data gathering cold. I would like to believe that the patterns arising would hold true across the entire world of beer, but it’s impossible to know.
I then found that I had many different directions to analyze - beers, breweries, geographic, quantitative, qualitative, etc., and got caught in a position where I didn’t take a really deep dive into any one area but spread my analysis across the whole data set. The biggest surprise came with the lager/ale split. With more time, I’d dig into the various subtypes of beer (IPA vs. Scotch Ale vs. American Light Lager etc.) and how they are rated and reviewed.
With more data, specifically data that would help me understand sales volumes for the various beers, I’d start normalizing the numbers so that I could look at a more consistent rating scale - I’d very much like to know whether the ratio of reviews::sales for Bud Light is the same as for some of the well-regarded microbreweries.
This would affect how we weight the reviews and cast more light on whether specific beers create disproportionate pressure on a brewery’s or a style’s rating.
Finally, I’m still tempted to go back and fix the screen-scraping Python code, because it bugs me that I haven’t been able to get the whole data set - but the effort-to-reward ratio may be pretty poor.