# Statistically Accurate Ratings

In a previous post on ratings I noted some issues with using the mean vs the median for the rating. A few days ago Jeff Atwood posted on user ratings, specifically how to sort a set of items that are rated by users. Jeff’s post included material from Evan Miller’s article on the same subject, How Not To Sort By Average Rating.

Below is the relevant portion of Evan Miller’s article:

CORRECT SOLUTION: Score = Lower bound of Wilson score confidence interval for a Bernoulli parameter

Say what: We need to balance the proportion of positive ratings with the uncertainty of a small number of observations. Fortunately, the math for this was worked out in 1927 by Edwin B. Wilson. What we want to ask is: Given the ratings I have, there is a 95% chance that the “real” fraction of positive ratings is at least what? Wilson gives the answer. Considering only positive and negative ratings (i.e. not a 5-star scale), the lower bound on the proportion of positive ratings is given by:

The formula (shown as an image in the original article) is

$\textrm{score} = \left( \hat p + \frac{z_{\alpha/2}^2}{2n} \pm z_{\alpha/2} \sqrt{\frac{\hat p (1 - \hat p)}{n} + \frac{z_{\alpha/2}^2}{4n^2}} \right) \Big/ \left( 1 + \frac{z_{\alpha/2}^2}{n} \right).$

(For a lower bound use minus where it says plus/minus.) Here $\hat p$ is the observed fraction of positive ratings, $z_{\alpha/2}$ is the $(1-\alpha/2)$ quantile of the standard normal distribution, and $n$ is the total number of ratings. The same formula implemented in Ruby:

require 'statistics2'

def ci_lower_bound(pos, n, power)
  return 0 if n == 0
  z = Statistics2.pnormaldist(1 - power / 2)
  phat = 1.0 * pos / n
  (phat + z*z/(2*n) - z * Math.sqrt((phat*(1 - phat) + z*z/(4*n)) / n)) / (1 + z*z/n)
end

pos is the number of positive ratings, n is the total number of ratings, and power refers to the statistical power: pick 0.10 to have a 95% chance that your lower bound is correct, 0.05 to have a 97.5% chance, etc.

Now for any item that has a bunch of positive and negative ratings, use that function to arrive at a score appropriate for sorting on, and be confident that you are using a good algorithm for doing so.
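To see the effect, here is a dependency-free sketch of the same bound with $z$ hard-coded to 1.96 (the 95% value), so it runs without the statistics2 gem; the rating counts are made up for illustration:

```ruby
# Wilson lower bound with z fixed at 1.96 (95% confidence),
# avoiding the statistics2 dependency. Same formula as above.
def wilson_lower_bound(pos, n, z = 1.96)
  return 0.0 if n == 0
  phat = 1.0 * pos / n
  (phat + z*z/(2*n) - z * Math.sqrt((phat*(1 - phat) + z*z/(4*n)) / n)) /
    (1 + z*z/n)
end

# 90/100 positive outranks 9/10 positive, even though both are 90%:
puts wilson_lower_bound(90, 100)  # ~0.826
puts wilson_lower_bound(9, 10)    # ~0.596
```

The small-sample item gets pulled down much harder, which is exactly the sorting behavior we want.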

Sadly, everybody has simply quoted the mathematical formula without even linking to material on how it is derived. I was able to find an article by Keith Dunnigan that gives an outline of the derivation along with several other confidence intervals. Hopefully later I can take a look at my textbooks and do a full derivation.

# Arbitrage

I’m going to use a simple example, forward agreements on stocks, to demonstrate what we mean when we say there is no arbitrage.

First assume that there is a risk-free investment that yields continuous interest at rate $r$; typically treasury bonds are used as the risk-free investment. Now assume we have a stock $S$ that has price $S_0$ at time 0. Suppose you enter into a forward agreement to pay $F_{0,T}$ at time $T$ for the stock $S$.

Now using the risk free investment we can say that $F_{0,T}$ is worth $F_{0,T}e^{-rT}$ at time 0 (this is because if we invested $F_{0,T}e^{-rT}$ in the risk free investment at time 0 we would have $F_{0,T}$ money at time T).

The assumption of no arbitrage means that $F_{0,T}e^{-rT} = S_0$: investing $F_{0,T}e^{-rT}$ in treasury bonds and entering the forward agreement leaves us owning the stock at time $T$, and simply buying the stock today also leaves us owning it at time $T$, so the two strategies must cost the same at time 0.

Thus by assuming no arbitrage the forward price is $F_{0,T} = S_0e^{rT}.$

This is one example of using the assumption of no arbitrage. The Wikipedia article on arbitrage shows the other types of arbitrage.
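As a quick numeric check, with made-up numbers ($S_0 = 100$, $r = 5\%$ continuous, $T = 1$ year):

```ruby
# No-arbitrage forward price: F = S0 * e^(rT).
def forward_price(s0, r, t)
  s0 * Math.exp(r * t)
end

f = forward_price(100.0, 0.05, 1.0)
puts f                    # ~105.127
# Discounting F back to time 0 recovers S0: the two ways of
# owning the stock at time T cost the same today.
puts f * Math.exp(-0.05)  # ~100.0
```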

# Audio Analysis, Tools of the Trade

I’ve been reading a little bit about audio analysis and doing a lot of searching for audio libraries. There are two audio analysis measures I want to compute.

1. Compute the loudness of an audio file. That is, add up the PCM output and divide by the length in seconds of the audio file. This gives the average loudness of the file. It’s an easy measure to compute and might be useful for categorizing music (I haven’t yet tried sorting my music by this “loudness” measure).

2. Do a spectral analysis of the audio file. That is, compute the discrete (fast) Fourier transform of the file and go from there. As the Wikipedia article shows, the discrete Fourier transform requires significant analysis to get good spectral components for an audio file (see the Spectral Analysis section). For audio analysis there is also the discrete-time Fourier transform, which is a variant of the discrete Fourier transform.
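Measure 1 can be sketched as follows. I assume the PCM samples have already been decoded into an array of floats in [-1, 1] (decoding depends on the audio library), and I sum absolute values, since a signed sum would mostly cancel to zero:

```ruby
# "Loudness" per measure 1: total (absolute) PCM amplitude divided by
# the length of the file in seconds.
def average_loudness(samples, sample_rate)
  return 0.0 if samples.empty?
  seconds = samples.length.to_f / sample_rate
  samples.sum { |s| s.abs } / seconds
end

# One second of a 440 Hz sine at 8000 samples/s:
sr = 8000
tone = Array.new(sr) { |i| Math.sin(2 * Math::PI * 440 * i / sr) }
puts average_loudness(tone, sr)  # ~5093 (8000 samples x mean |sin| = 2/pi)
```

Note that with this definition the result scales with the sample rate, so it is only comparable across files decoded at the same rate.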

As of right now I’ve written the code to do 1, but I don’t have anything for 2. Before I get too excited about doing the spectral analysis of audio files I need to read Example Applications DFT, and also brush up on my Fourier analysis.

For implementing the spectral analysis I’m going to use the FFTW library developed at MIT.
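For intuition about what FFTW computes, here is a naive $O(n^2)$ DFT sketch in Ruby; FFTW computes exactly these sums, but in $O(n \log n)$:

```ruby
# Naive DFT: X[k] = sum over t of x[t] * e^(-2*pi*i*k*t/n), returned
# as [real, imaginary] pairs, one per frequency bin k.
def dft(x)
  n = x.length
  (0...n).map do |k|
    re = 0.0
    im = 0.0
    x.each_with_index do |v, t|
      angle = -2 * Math::PI * k * t / n
      re += v * Math.cos(angle)
      im += v * Math.sin(angle)
    end
    [re, im]
  end
end

# A pure cosine at 1 cycle per 8 samples shows up in bin 1:
wave = Array.new(8) { |t| Math.cos(2 * Math::PI * t / 8) }
puts dft(wave)[1].inspect  # approximately [4.0, 0.0]
```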

I’ve listed below the packages and libraries for audio analysis and Fourier analysis that I came across in my search:

- SciPy (the main Python scientific platform; its Fourier analysis uses the FFTW library)
- CLAM (large audio analysis framework, lots of dependencies)
- Baudline Spectrum Analyzer
- Matlab (no explanation needed)
- Octave (Matlab clone)
- FFTW

I’m going to use the FFTW library for my program because the other libraries, such as SciPy, require tens of dependencies, and I want my program to rely on only a few. If writing C code gets too onerous I’ll switch to SciPy.

# Musical Distance

Like most of you, I play my music through the computer. One of the popular settings for music players is the “shuffle” or “random” mode: the next song is randomly (or semi-randomly) selected from the playlist. This was a great innovation for listening to overlooked songs in your music collection.

The only problem is that the “shuffle” mode mixes clashing genres, that is, different genres are played next to each other. This is avoidable by filtering your playlist to include only a specific genre (say rock) and then using the “shuffle” mode to play only rock songs. But for me that’s only a partial solution, because when the music player lists only rock songs it includes songs by Rammstein (heavy metal) and songs by Jonathan Coulton (folk rock). In particular I consider Rammstein “hard” music and Jonathan Coulton “soft” music.

I think Rammstein and Jonathan Coulton are musically far apart: not as far apart as classical music and rap, but not as close as classical music and jazz. The idea of music genres quantifies musical closeness to some degree, but I want a quantitative measurement of how close two songs are musically, similar to how we say that a song is “soft” or “loud/hard”.

The idea of musical distance is that if two songs are musically close together then their musical distance is a small number. Conversely if two songs are musically far apart then their musical distance is large. That is I want a metric (in the mathematical sense) for songs.

The only trouble is I know nothing about audio analysis. I know what a spectrum is but that’s about it. Julius O. Smith III at Stanford has been kind enough to post his textbooks on music/audio processing and analysis online: Mathematics of the Discrete Fourier Transform with Audio Applications, Introduction to Digital Filters with Audio Applications, and Physical Audio Signal Processing for Virtual Musical Instruments and Audio Effects. When I have a couple of weeks to spare (hah) I’ll try giving the books a read to see what I can come up with.

# Packing Calculations

Today’s problem is from Pugh’s great book “Real Mathematical Analysis”; it’s problem 23 in chapter 1.

Problem Statement:

Let b(R) and s(R) be the number of integer unit cubes in $\textbf{R}^m$ that intersect the ball and the sphere of radius R respectively, centered at the origin. By integer unit cube I mean a unit cube which has its vertices on the integer lattice.

(a) Let m = 2 and calculate the limits

$\textrm{lim}_{R \rightarrow \infty} \frac{s(R)}{b(R)}$ and $\textrm{lim}_{R \rightarrow \infty} \frac{s(R)^2}{b(R)}.$

(b) Take $m \ge 3.$ What exponent k makes the limit

$\textrm{lim}_{R \rightarrow \infty} \frac{s(R)^k}{b(R)}$

interesting?

(c) Let c(R) be the number of integer unit cubes that are contained in the ball of radius R, centered at the origin. Calculate

$\textrm{lim}_{R \rightarrow \infty} \frac{c(R)}{b(R)}.$

(d) Shift the ball to a new, arbitrary center (not on the integer lattice) and re-calculate the limits.

End Problem Statement

This problem doesn’t involve finding a packing but instead involves counting the number of unit cubes after the packing has been chosen (in this case we’re packing a ball with integer unit cubes). So I find it to be a very interesting problem.

The first simplification I make is to only consider the first quadrant of the ball with radius R. By symmetry $s(R) = 4 s_{\textrm{quad 1}}(R)$ and similarly for $b(R) = 4 b_{\textrm{quad 1}}(R),$ where $s_{\textrm{quad 1}}$ and $b_{\textrm{quad 1}}$ are the functions $s$ and $b$ restricted to the first quadrant. For the rest of this solution I only work with the first quadrant. Let $B_R = \{(x, y) \in \textbf{R}^2 : x^2 + y^2 \le R^2, x \ge 0, y \ge 0\}.$ The area of $B_R$ is $\pi R^2/4$ so as a first approximation $b(R) \approx \pi R^2/4.$

After some thought, the best way to count the unit cubes for $s(R)$ is to imagine tracing out the curve $y = R\sin(\theta), \textrm{ } x = R\cos(\theta)$ as $\theta$ goes from 0 to $\pi/2.$ Let k be the number of unit cubes the curve has been traced through; initially k = 0. As we trace out the curve, every time $y \in \textbf{N}$ we add 1 to k, since the curve has entered a new unit square; similarly if $x \in \textbf{N}$, 1 is added to k. If y and x are both integers then only 1 should be added to k (this happens very rarely so I mostly ignore this case).

For a small example let R = 2. Then x goes from 2 to 0, so $x \in \textbf{N}$ 3 times; similarly y goes from 0 to 2, so $y \in \textbf{N}$ 3 times.

This gives us an initial count of 6 for $s(2)$ but we need to handle the corner cases.

The corner cases are (x = 0, y = 2) and (x = 2, y = 0). At (0, 2) both x and y are integers, so that square was counted twice instead of once; also don’t forget that the square was already counted when the curve entered it. So we need to subtract two from 3 + 3, giving 3 + 3 - 2. At (x = 2, y = 0) the starting square is again counted twice, so we subtract one more, giving 3 + 3 - 3 = 3. Thus s(2) = 3.

Example 2: R = 4. Now x goes from 4 to 0 and y goes from 0 to 4, so s(4) = 5 + 5 - 3 (remember the corner cases) = 7.

From the above derivations $s(R) = 2R - 1$ (the examples give $s(2) = 2 \cdot 2 - 1 = 3$ and $s(4) = 2 \cdot 4 - 1 = 7$). To a first approximation $s(R) \approx 2R$; in $m$ dimensions $s(R)$ grows like $R^{m-1}$.
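As a sanity check, the count can be brute-forced: a unit square $[i, i+1] \times [j, j+1]$ intersects the arc exactly when its nearest corner lies strictly inside the circle and its farthest corner strictly outside. A small sketch:

```ruby
# Count the first-quadrant integer unit squares that the arc of radius r
# passes through: nearest corner (i, j) inside the circle, farthest
# corner (i+1, j+1) outside.
def quadrant_sphere_cubes(r)
  count = 0
  (0...r).each do |i|
    (0...r).each do |j|
      near = i * i + j * j
      far  = (i + 1)**2 + (j + 1)**2
      count += 1 if near < r * r && r * r < far
    end
  end
  count
end

puts quadrant_sphere_cubes(2)  # 3, matching the hand count above
puts quadrant_sphere_cubes(4)  # 7, matching example 2
```

(This counting ignores squares that merely touch the circle at a lattice point, matching the tracing argument above.)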

To calculate $b(R)$ first note that if the packing of $B_R$ were perfect then b(R) would be $\pi R^2/4.$ It’s not a perfect packing, so b(R) is greater than $\pi R^2/4.$ To calculate how much more, first note that the unit cubes counted by s(R) have on average half their area in $B_R$ and half their area outside $B_R$ (remember we’re taking the limit of large R, so the result follows from symmetry).

There are $2R - 1$ of these unit cubes that are only half in $B_R,$ so they cover an area of $R - 1/2$ inside $B_R.$ Thus b(R) must be $\pi R^2/4 - (R - 1/2)$ (the part of $B_R$ covered by unit cubes that are completely contained in $B_R$) plus $2R - 1$ (the unit cubes that are part-in and part-out), which equals $\pi R^2/4 + R - 1/2.$

To summarize:

$s(R) = 2R - 1, \textrm{ } b(R) = \pi R^2/4 + R - 1/2,$

$c(R) = b(R) - s(R) = \pi R^2/4 - R + 1/2.$

Now that we’ve worked out b(R), s(R), and c(R) the answers are:

(a) 0 and $16/\pi$ (since $s(R)^2/b(R) \rightarrow (2R)^2/(\pi R^2/4) = 16/\pi$).

(b) $k = m/(m-1)$: since $s(R)$ grows like $R^{m-1}$ and $b(R)$ like $R^m,$ the limit is finite and nonzero exactly when $k(m-1) = m.$ Higher dimensions follow the same method of calculation.

(c) 1.

(d) Hehe, I haven’t tried this part; maybe in a future post. My instinct is that the limits are identical.

# Solving the Quadratic Equation

The quadratic equation of the form $x^2 + ax + b = 0$ is one of the simplest nonlinear problems in one variable. The method I use for solving it can also be applied to the cubic and quartic equations, but of course not the quintic.

Let $e_1 \textrm{ and } e_2$ be two (not necessarily different) solutions to $x^2 + ax + b = 0$. By the fundamental theorem of algebra $(x - e_1)(x - e_2) = x^2 + ax + b$. Multiplying out $x^2 -(e_1 + e_2) x + e_1e_2=x^2 + ax + b$. So $e_1 + e_2 = -a \textrm{ and } e_1e_2 = b.$ This is a non-linear system of equations in two variables.

Note that ${(e_1 - e_2)}^2 = e_1^2 + e_2^2 - 2e_1e_2,$ which is equal to ${(e_1 + e_2)}^2 - 4e_1e_2.$ Thus $e_1 - e_2 = {(a^2 - 4b)}^{1/2},$ which together with $e_1 + e_2 = -a$ easily yields the solutions $e_1, e_2 = -a/2 \pm \tfrac{1}{2}{(a^2 - 4b)}^{1/2}.$

It’s a fun exercise. The trick is writing the square of $e_1 - e_2$ and also assuming you already have two solutions.
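The derivation translates directly into code; a small sketch, assuming real roots (i.e. $a^2 \ge 4b$):

```ruby
# Roots of x^2 + a*x + b = 0 via e1 + e2 = -a and e1*e2 = b:
# e1 - e2 = sqrt(a^2 - 4b), so e1, e2 = -a/2 +- sqrt(a^2 - 4b)/2.
def quadratic_roots(a, b)
  d = Math.sqrt(a * a - 4 * b)  # assumes a^2 >= 4b (real roots)
  [(-a + d) / 2, (-a - d) / 2]
end

puts quadratic_roots(-5, 6).inspect  # x^2 - 5x + 6 = 0 has roots [3.0, 2.0]
```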

# Listing All Subsets

A common problem in discrete math is listing all subsets of a given set, say $S.$ There’s a trivial recursive algorithm for listing all subsets, but I want to give a different algorithm that works well when $S$ is small.

Let’s try out the algorithm for the set $S=\{a,b\}.$

The subsets are $\{\}, \{a\}, \{b\}, \{a,b\}.$

Now look at the numbers 0, 1, 2, 3 in binary: $00, 01, 10, 11.$

We can say that 00 represents the set $\{\}$, 01 the set $\{a\}$, 10 the set $\{b\}$, and 11 the set $\{a,b\}$. Remember to read the binary numbers from right to left. Given a binary number, the set it represents corresponds to the ones in the binary number: if the binary number has a one in the 10th bit (where the 10th bit is the 10th character from the right end of the number), then the 10th element of the set $S$ is included in the subset.

This does mean we need an ordering for the set $S.$ But the ordering is only used when listing out the subsets, so it doesn’t violate the condition that sets aren’t ordered (if we wanted to list the subsets of an infinite set this could be a problem).

Now it’s obvious how to list the subsets of a set $S.$

1. Create an array, A, of all the numbers from 0 to $2^{|S|} - 1.$

2. Iterate over the array A and for each element, n, of A list out the corresponding subset of $S.$ If n = 1 then the corresponding subset is $\{a\},$ where $S = \{a,b\}.$

Use bitwise arithmetic to retrieve the individual bits from a number. To get the nth bit from a number x in Java, C, etc., compute “x & pow(2,n)” (in practice written “x & (1 << n)”). If x has a 1 in the nth bit then “x & pow(2,n)” equals $2^n$; otherwise it equals 0.
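Putting the pieces together, a compact Ruby sketch of the whole algorithm:

```ruby
# List all subsets of `set` by treating each number from 0 to 2^n - 1
# as a bitmask: bit i set means set[i] is in the subset.
def subsets(set)
  (0...(1 << set.length)).map do |mask|
    set.each_index.select { |i| (mask >> i) & 1 == 1 }.map { |i| set[i] }
  end
end

puts subsets([:a, :b]).inspect  # [[], [:a], [:b], [:a, :b]]
```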

That’s it. Happy coding.

# Mean vs Median Ratings

When viewing most YouTube videos you can rate the video on a scale of 0 to 5.

The overall video rating, I assume, is the mean of the ratings, that is
$\textrm{video rating} = \frac{\sum_i \textrm{rating}_i}{\textrm{number of ratings}}.$

A problem with this rating technique is sensitivity to outliers, e.g. people voting 0 or 5. One solution is for YouTube to use the median instead of the mean. The median can be computed in linear time, but the computation is significant; see selection algorithms.
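For illustration, here is a quickselect sketch (expected linear time; the worst case is quadratic unless you use the more involved median-of-medians pivot), with made-up ratings:

```ruby
# Quickselect: k-th smallest element (0-indexed) in expected linear time.
def quickselect(arr, k)
  a = arr.dup
  pivot = a.delete_at(rand(a.length))
  smaller = a.select { |x| x < pivot }
  if k < smaller.length
    quickselect(smaller, k)
  elsif k == smaller.length
    pivot
  else
    larger = a.select { |x| x >= pivot }
    quickselect(larger, k - smaller.length - 1)
  end
end

def median(ratings)
  quickselect(ratings, ratings.length / 2)
end

# Two low outliers drag the mean down to ~3.6, but the median stays at 5:
puts median([5, 5, 0, 4, 5, 1, 5])  # 5
```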

But would using the median actually change the overall video rating? We need to analyze how people rate videos. Is someone’s rating of a video independent of the previous ratings?

It is possible that people will give a high rating if they believe the video has been rated too low and vice versa. In this scenario the mean is the same as the median!

It is also possible that people rate a video independent of the previous ratings in which case using the median is desirable.

To distinguish between which of these scenarios is more correct we can give the following two tests.

Use a common video in both tests and a large random set of users, so we can compare the tests against each other.

Test1: The video rating is given by the mean of the ratings and a user can see the video rating before they rate it.

Test2: The user cannot see the video rating before they rate the video and again the video rating is given by the mean of the ratings.

If the overall video rating is the same in both tests, then it strongly suggests that users rate videos independent of the previous ratings. Of course, if the distribution of user ratings is normal then the mean and median will be the same in Test1 and Test2. One way to modify the experiment if this happens is to add a few outlier ratings to Test1 and then see if the users correct the rating to what it would be without the outliers.

Now if the video rating in Test1 agrees with the median of the video ratings in Test2, then we know that using the mean is a good substitute for the median.

# More spoilage

The question of minimizing spoilage still interests me. Today I want to consider the consequences of relaxing one of the constraints on the problem. In particular I do not assume that the two different types of fruit cost the same per unit, and I want to minimize the total amount of wasted money (i.e. the amount of spoiled fruit times the unit cost).

Definitions:

1. There are two types of fruit $A$ and $B$. The unit costs are $c_A$ and $c_B$ respectively.
2. We can consume $z$ units of fruit per day.
3. The goal is to minimize the cost of the total spoiled fruit.
4. The percent of fruit that spoils at the end of each day is $m_A d$ and $m_B d$ respectively for A and B, where $m_A$, $m_B$ are constants and $d$ is the number of days since buying the fruit (i.e. $d = 0$ for the first day.)

Calculations:
First, note that by minimizing the cost of spoiled fruit at the end of each day, the total cost of spoiled fruit is minimized (I should prove this and might in a future post; I think it comes from the fact that we’re solving a convex optimization problem).

Suppose we have $x$ and $y$ units of fruit A and B respectively.
We can consume $z_A$ and $z_B$ units of $A$ and $B$ respectively, where $z_A + z_B = z.$

The amount of spoiled fruit at the end of the day is
$m_Ad(x-z_A) + m_Bd(y-z_B),$ the cost of the spoiled fruit is
$m_Ad(x-z_A)c_A+m_Bd(y-z_B)c_B.$

Using $z_B = z - z_A,$ the cost of the spoiled fruit is $m_Ad(x-z_A)c_A+m_Bd(y-z+z_A)c_B.$ Expanding we get

$m_Ac_Adx-m_Ac_Adz_A+m_Bc_Bdy-m_Bc_Bdz+m_Bc_Bdz_A,$
which reduces to $z_A(m_Bc_B - m_Ac_A)d + m_Ac_Adx + m_Bc_Bdy - m_Bc_Bdz.$

Taking the derivative with respect to $z_A$ we obtain $(m_Bc_B - m_Ac_A)d.$
If $m_Bc_B - m_Ac_A$ is greater or equal to zero, then $z_A = 0, z_B = z$ minimizes the cost of spoiled fruit; otherwise $z_A = z, z_B = 0$ minimizes the cost of spoiled fruit.
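The endpoint rule is trivial to encode; a sketch (the function and variable names are mine):

```ruby
# Split the day's consumption z between fruits A and B. The daily
# spoilage cost is linear in z_A with slope (m_B*c_B - m_A*c_A)*d,
# so an endpoint is optimal: eat only the fruit with the larger
# spoilage-cost rate m*c.
def daily_consumption(m_a, c_a, m_b, c_b, z)
  if m_a * c_a >= m_b * c_b
    [z, 0]  # z_A = z, z_B = 0
  else
    [0, z]  # z_A = 0, z_B = z
  end
end

puts daily_consumption(0.33, 2.0, 0.25, 1.0, 20).inspect  # [20, 0]
```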

From the above it’s clear that introducing the factor of fruit cost does not fundamentally alter the solution. But it does introduce some interesting factors that warrant further consideration.

# Minimizing spoilage

Yesterday we bought some strawberries, cherries, apricots, and plums at the local farmers market. In fact we bought more than we can possibly consume before some of the fruit spoils. The problem under consideration is how to minimize the total amount of spoiled fruit.

This problem is interesting since different fruits spoil at different rates. I’ll consider the simplest case of just two types of fruits. There are different possible models for the rate of fruit spoilage. I think the simplest one is to assume that the percent of good fruit that spoils in a day increases as a linear function.

As an example we assume the following table for the percent of strawberries that spoil on a particular day:

| Day | Percent of fruit that spoils |
|-----|------------------------------|
| 0   | 0%                           |
| 1   | 33%                          |
| 2   | 66%                          |
| 3   | 100%                         |

By the above table, on day two 66% of the strawberries that weren’t spoiled at the beginning of day two are spoiled by the end of day two. For convenience’s sake I assume that we consume fruit at the end of a day.

Table for cherry spoilage:

| Day | Percent spoiled |
|-----|-----------------|
| 0   | 0%              |
| 1   | 25%             |
| 2   | 50%             |
| 3   | 75%             |
| 4   | 100%            |

Now, given 100 strawberries, 100 cherries, and the ability to consume 20 pieces of fruit per day, how should we schedule our fruit consumption to minimize the amount of fruit that spoils (i.e. goes unconsumed)?

My solution is to use a greedy algorithm. I’m not positive that it minimizes total spoilage, but I believe it does. (Another possibility is to use linear programming, but I like the greedy algorithm more.)

The algorithm is as follows

Given two goods A and B: percent X of A will go bad the next day, percent Y of B will go bad the next day, we have n pieces of fruit A and m pieces of fruit B, and we can consume p pieces of fruit per day (where p < n + m). Assume X > Y.

The amount of fruit that spoils is S = X*(n - n1) + Y*(m - m1), where n1 and m1 are the amounts of fruit A and B we consume respectively, and n1 + m1 = p.

Now note that n1 = p - m1, obtaining S = X*(n - p + m1) + Y*(m - m1) = X*n - X*p + X*m1 + Y*m - Y*m1 = X*(n - p) + Y*m + m1*(X - Y).

Looking at S, it’s apparent that S is minimized when m1 is minimized, that is, when n1 is maximized. So we should consume the strawberries first and then consume the cherries.
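The single-day rule above can be sketched directly (the names are mine; rates are written as fractions rather than percents):

```ruby
# One day of the greedy rule: with spoilage rates x_rate > y_rate, eat
# as much of the faster-spoiling fruit A as possible, then spend any
# leftover capacity on fruit B. Returns the amount that spoils.
def spoiled_today(x_rate, y_rate, n, m, p)
  n1 = [n, p].min        # pieces of A eaten
  m1 = [m, p - n1].min   # pieces of B eaten
  x_rate * (n - n1) + y_rate * (m - m1)
end

# 100 strawberries (33%/day), 100 cherries (25%/day), 20 pieces/day:
puts spoiled_today(0.33, 0.25, 100, 100, 20)  # 0.33*80 + 0.25*100, ~51.4
# versus eating cherries first: 0.33*100 + 0.25*80 = 53.0
```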

That’s all for today.