
Some clarification for Chapter 8 Observer Bias model formulation #13

Open
alexklibisz opened this issue Jan 8, 2018 · 2 comments

@alexklibisz

Chapter 8 makes an interesting point about Observer Bias on the Red Line, but it took me a while to understand why the distribution of wait times observed by passengers sits above the true distribution of times between trains. After some thought, it turned out I was assuming a more complicated model than the text does. I don't think either model is unreasonable; my intuition just wasn't on the same page, and I didn't find an explicit statement in the text that ruled my model out. The intended model may be obvious to most readers, but perhaps the clarification below will help someone in the future:

The text reads:

The average time between trains, as seen by a random passenger, is substantially higher than the true average.
Why? Because a passenger is more like (sic) to arrive during a large interval than a small one. Consider a simple example: suppose that the time between trains is either 5 minutes or 10 minutes with equal probability. In that case the average time between trains is 7.5 minutes.
But a passenger is more likely to arrive during a 10 minute gap than a 5 minute gap; in fact, twice as likely. If we surveyed arriving passengers, we would find that 2/3 of them arrived during a 10 minute gap, and only 1/3 during a 5 minute gap. So the average time between trains, as seen by an arriving passenger, is 8.33 minutes.
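The arithmetic in the quoted passage can be checked directly. This is just a sketch of the length-biased weighting the book describes, not code from the book:

```python
import numpy as np

gaps = np.array([5.0, 10.0])    # the two possible times between trains
p_true = np.array([0.5, 0.5])   # each gap occurs with equal probability

# True average time between trains: 0.5*5 + 0.5*10 = 7.5 minutes.
true_mean = np.dot(gaps, p_true)

# A passenger's chance of landing in a gap is proportional to its length,
# so the observed probabilities are length-biased: [1/3, 2/3].
p_seen = gaps * p_true / np.dot(gaps, p_true)

# Average gap as seen by an arriving passenger: 25/3, about 8.33 minutes.
seen_mean = np.dot(gaps, p_seen)
print(true_mean, seen_mean)
```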

For this to be true, I believe we have to assume that a passenger arriving 0 minutes after the previous train has the same observed wait time as a passenger arriving any n > 0 minutes into the gap: every passenger is assigned the full length of the gap. In other words, a passenger who just missed the previous train and waited the full gap is treated the same as a passenger who just barely made the train.

My intuition was as follows: In reality, a passenger can arrive at the 9th minute of a 10 minute gap or the 4th minute of a 5 minute gap. Both passengers wait 1 minute. If you model it this way, the biased distribution actually shifts to the left. Why? Let's say there are two passengers arriving per minute (lam = 2). For a 2 minute gap, you might have the following wait times for 4 passengers: [0, 0, 1, 1]. For a 3 minute gap, you might have the following wait times for 6 passengers: [0, 0, 1, 1, 2, 2]. A passenger who waits 0 has arrived just before the train departs. For an n minute gap, a wait time of n-1 indicates the passenger arrived within the first minute after the previous train departed. From the 2-minute and 3-minute gaps above, you can deduce that across all trains P(wait n) < P(wait n-1). I.e., there is always a chance for a passenger to wait 0 minutes, but in a 5 minute gap, for example, it's impossible to wait 6 minutes.
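The two-gap counting argument above can be checked with a tiny enumeration (using the same hypothetical numbers: one 2-minute and one 3-minute gap, two passengers per minute):

```python
from collections import Counter

# Integer wait times under my formulation: in an n-minute gap with
# two passengers arriving per minute, waits 0..n-1 each occur twice.
waits_2min = [0, 0, 1, 1]
waits_3min = [0, 0, 1, 1, 2, 2]

# Pooled across both gaps: wait 0 and wait 1 occur 4 times each,
# wait 2 only twice, so longer waits are never more common.
counts = Counter(waits_2min + waits_3min)
print(sorted(counts.items()))
```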

Here is some code to simulate the process and plot the resulting histograms.

from math import floor
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)

n = 50000  # Number of trains.
l = 2     # Passengers arriving per minute.
T = np.random.normal(10, 2, n) # True time between trains.
W1 = []   # Passengers' observed waiting time (my initial formulation).
W2 = []   # Passengers' observed waiting time (Think Bayes Formulation).

for t in T:
    size = int(floor(t * l)) # This many passengers will end up on the next train.
    W1 += list(np.random.uniform(0, floor(t), size))
    W2 += list(np.ones(size) * t)

bins = int(T.max() - T.min())
plt.hist(T, color='red', bins=bins, alpha=0.3, density=True, label=r'True wait $\mu=%.3f$' % T.mean())
plt.hist(W1, color='blue', bins=bins, alpha=0.3, density=True, label=r'Observed wait $\mu=%.3f$' % np.mean(W1))
plt.hist(W2, color='green', bins=bins, alpha=0.3, density=True, label=r'Observed wait simplified $\mu=%.3f$' % np.mean(W2))
plt.legend(fontsize=8)
plt.show()

[figure_1: overlaid histograms of the true gap distribution and the two observed-wait formulations]

@AllenDowney
Owner

AllenDowney commented Jan 9, 2018 via email

@alexklibisz
Author

@AllenDowney It's nothing urgent and not explicitly a problem. I just figured I'd post it in case someone else overcomplicates the problem like I did and gets confused in Chapter 8. Thanks!
