What's the difference between probability and likelihood?
Discrete Random Variables
Suppose that you have a stochastic process that takes discrete values (e.g., outcomes of tossing a coin 10 times, number of customers who arrive at a store in 10 minutes, etc.). In such cases, we can calculate the probability of observing a particular set of outcomes by making suitable assumptions about the underlying stochastic process (e.g., that the probability of the coin landing heads is p and that coin tosses are independent).
Denote the observed outcomes by O and the set of parameters that describe the stochastic process as θ. Thus, when we speak of probability we want to calculate P(O|θ). In other words, given specific values for θ, P(O|θ) is the probability that we would observe the outcomes represented by O.
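As a minimal sketch of the probability side of this, assume the coin-toss example above with θ being the single parameter p. The function names `prob_sequence` and `prob_num_heads` are illustrative, not from any library. With p held fixed, the probabilities of all possible outcomes add up to 1:

```python
from math import comb

def prob_sequence(outcomes, p):
    """P(O | theta) for one specific sequence of independent tosses.

    outcomes is a string of 'H' and 'T'; p is the probability of heads.
    """
    heads = outcomes.count("H")
    tails = len(outcomes) - heads
    return p ** heads * (1 - p) ** tails

def prob_num_heads(k, n, p):
    """Binomial probability of exactly k heads in n independent tosses."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# theta is fixed (p = 0.5); the outcome O varies over 0..10 heads.
total = sum(prob_num_heads(k, 10, 0.5) for k in range(11))
```

Here `total` comes out as 1 up to floating-point error, which is what makes P(O|θ) a probability distribution over outcomes for a given θ.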
However, when we model a real-life stochastic process, we often do not know θ. We simply observe O, and the goal then is to arrive at an estimate for θ that would be a plausible choice given the observed outcomes O. We know that given a value of θ the probability of observing O is P(O|θ). Thus, a 'natural' estimation process is to choose the value of θ that would maximize the probability that we would actually observe O. In other words, we find the parameter values θ that maximize the following function:
L(θ|O) = P(O|θ)
L(θ|O) is called the likelihood function. Notice that by definition the likelihood function is conditioned on the observed O and that it is a function of the unknown parameters θ.
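The estimation step can be sketched the same way. This example assumes a hypothetical observation of 7 heads in 10 tosses and uses a plain grid search over p, chosen only for simplicity rather than as the recommended optimizer. The observed O stays fixed and only θ varies:

```python
from math import comb

def likelihood(p, k=7, n=10):
    """L(theta | O) = P(O | theta), viewed as a function of p with O fixed."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Candidate parameter values 0.000, 0.001, ..., 1.000.
step = 1 / 1000
grid = [i * step for i in range(1001)]

# The value of p under which the observed outcome is most probable.
p_hat = max(grid, key=likelihood)

# Riemann-sum approximation of the area under the likelihood curve over p.
area = sum(likelihood(p) for p in grid) * step
```

The maximizer `p_hat` lands at the observed proportion of heads, 0.7. The area under the curve is about 1/11 rather than 1, which illustrates that the likelihood is not a probability distribution over θ.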
Continuous Random Variables
In the continuous case the situation is similar, with one important difference. We can no longer talk about the probability that we observed O given θ, because in the continuous case P(O|θ) = 0: any single exact outcome has probability zero. Without getting into technicalities, the basic idea is as follows:
Denote the probability density function (pdf) associated with the outcomes O by f(O|θ). Thus, in the continuous case we estimate θ given observed outcomes O by maximizing the following function:
L(θ|O) = f(O|θ)
In this situation, we cannot technically assert that we are finding the parameter value that maximizes the probability that we observe O, because what we maximize is the pdf associated with the observed outcomes O, and a density is not a probability.
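A rough sketch of the continuous case follows, under assumptions that are not in the original text: the data are independent draws from a normal distribution with known standard deviation 1, the unknown θ is the mean, and the five observations are made up for illustration. The likelihood is the product of the pdf values at the observed points:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def likelihood(mu, data, sigma=1.0):
    """L(theta | O) = f(O | theta) for independent observations."""
    result = 1.0
    for x in data:
        result *= normal_pdf(x, mu, sigma)
    return result

data = [4.8, 5.1, 5.3, 4.9, 5.4]  # hypothetical observations

# Grid search over candidate means 0.00, 0.01, ..., 10.00.
grid = [i / 100 for i in range(1001)]
mu_hat = max(grid, key=lambda mu: likelihood(mu, data))
sample_mean = sum(data) / len(data)

# A density value can exceed 1, so it cannot be read as a probability.
peak_density = normal_pdf(0.0, 0.0, 0.1)
```

In this setup `mu_hat` coincides with the sample mean, and `peak_density` is roughly 3.99, a value no probability could take.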