Algorithms Q&A - Recent questions and answers in MDP

Answered: Jumpy Car - Estimate/Solve the V* values for the following MDP

Sun, 26 Apr 2026 13:09:30 +0000

Hello Professor,

Since S5 has a terminal value of 100, the optimal strategy is to move toward S5. So for the states on the left side, S2, S3, and S4, the best action is R. For the states on the right side, S6, S7, and S8, the best action is L.

The terminal state values are:

V(S1) = 1.732

V(S5) = 100

V(S9) = 1.732

Because the MDP is symmetric around S5:

V(S2) = V(S8)

V(S3) = V(S7)

V(S4) = V(S6)

Let:

x = V(S2) = V(S8)

y = V(S3) = V(S7)

z = V(S4) = V(S6)

The discount factor is 0.9. There is no living reward, so the value comes only from discounted future values.

For S4, moving right gives:

z = 0.9(0.4(100) + 0.5z + 0.1z)

z = 0.9(40 + 0.6z)

z = 36 + 0.54z

0.46z = 36

z = 78.26

So:

V(S4) = V(S6) = 78.26

For S3, moving right gives:

y = 0.9(0.4z + 0.5(100) + 0.1y)

Substituting z = 78.26:

y = 0.9(0.4(78.26) + 50 + 0.1y)

y = 73.17 + 0.09y

0.91y = 73.17

y = 80.41

So:

V(S3) = V(S7) = 80.41

For S2, moving right gives:

x = 0.9(0.4y + 0.5z + 0.1x)

Substituting y = 80.41 and z = 78.26:

x = 0.9(0.4(80.41) + 0.5(78.26) + 0.1x)

x = 64.16 + 0.09x

0.91x = 64.16

x = 70.51

So:

V(S2) = V(S8) = 70.51

Therefore, the final values are:

V(S1) = 1.732

V(S2) = 70.51

V(S3) = 80.41

V(S4) = 78.26

V(S5) = 100

V(S6) = 78.26

V(S7) = 80.41

V(S8) = 70.51

V(S9) = 1.732

Final answer:

[1.732, 70.51, 80.41, 78.26, 100, 78.26, 80.41, 70.51, 1.732]

The values do not increase monotonically toward S5 because there is a chance of jumping two spaces. Under these transition probabilities, being one step away from S5 (V(S4) = 78.26) is actually worse than being two steps away (V(S3) = 80.41).
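The substitutions above can be checked numerically. This is a minimal Python sketch that assumes the transition probabilities exactly as they appear in the three equations of this answer (they are copied from those equations, not re-derived from the problem statement). The variable names are illustrative.

```python
GAMMA = 0.9      # discount factor stated in the answer
V_GOAL = 100.0   # terminal value of S5
V_EDGE = 1.732   # terminal value of S1 and S9

# S4: z = GAMMA * (0.4*V_GOAL + 0.5*z + 0.1*z), solved for z
z = GAMMA * 0.4 * V_GOAL / (1 - GAMMA * 0.6)

# S3: y = GAMMA * (0.4*z + 0.5*V_GOAL + 0.1*y), solved for y
y = GAMMA * (0.4 * z + 0.5 * V_GOAL) / (1 - GAMMA * 0.1)

# S2: x = GAMMA * (0.4*y + 0.5*z + 0.1*x), solved for x
x = GAMMA * (0.4 * y + 0.5 * z) / (1 - GAMMA * 0.1)

# The MDP is symmetric around S5, so S6..S8 mirror S4..S2.
values = [V_EDGE, x, y, z, V_GOAL, z, y, x, V_EDGE]
print([round(v, 2) for v in values])
```

Because each equation depends only on values that are already known, the three unknowns can be solved one after another without a general linear solver.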

Answered: Why is it useful to compute the marginal distribution instead of working with the full joint distribution?

Thu, 16 Apr 2026 07:44:10 +0000

Hello,

Here is my solution based on the given table:

We are interested only in Y = umbrella usage, not in X = weather condition.

So instead of using the full joint distribution P(X,Y), it is useful to compute the marginal distribution:

P(Y) = sum over all weather conditions of P(X,Y)

From the table:

P(Umbrella = Yes) = 0.05 + 0.15 + 0.25 = 0.45

P(Umbrella = No) = 0.35 + 0.15 + 0.05 = 0.55

So the marginal distribution of umbrella usage is:

P(Y) =

Yes -> 0.45

No -> 0.55
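As a rough illustration, the marginalization can be written as a short Python sketch. The mapping of table cells to weather conditions (sunny, cloudy, rainy, in that order) is an assumption inferred from the numbers used in this answer, and the dictionary layout is only one possible representation.

```python
# Joint table P(X, Y). ASSUMPTION: cells are ordered sunny, cloudy, rainy,
# which matches "low umbrella use when sunny, high when rainy".
joint = {
    ("sunny", "yes"): 0.05, ("sunny", "no"): 0.35,
    ("cloudy", "yes"): 0.15, ("cloudy", "no"): 0.15,
    ("rainy", "yes"): 0.25, ("rainy", "no"): 0.05,
}

# Marginalize out the weather: P(Y = y) = sum over x of P(X = x, Y = y).
marginal_y = {}
for (weather, umbrella), p in joint.items():
    marginal_y[umbrella] = marginal_y.get(umbrella, 0.0) + p

print({k: round(v, 2) for k, v in marginal_y.items()})
```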

Why this is useful:

If the only goal is to predict whether people carry umbrellas overall, then P(Y) is enough.

It gives a direct answer to the question of interest without keeping extra information about weather.

So the gain is:

Simpler model

Less computation

Easier prediction if weather is unknown or irrelevant

Direct estimate of total umbrella usage in the population

This is useful in AI when the system only cares about the final behavior, not the reason behind it.

For example:

Estimating total umbrella demand in shops

Planning inventory

Predicting how many umbrellas may be seen in public places

Any case where only the final action matters

What we lose by marginalizing:

When we marginalize out weather, we remove the connection between weather and umbrella usage.

So we lose the causal or explanatory relationship.

From the joint table we can see:

On sunny days, umbrella use is very low

On rainy days, umbrella use is very high

But after marginalization, we only know 45% carry umbrellas overall

We no longer know why

So marginalization is not exactly the same as saying causes do not exist, but it means we choose not to represent them in the model.

The cause may still be there, but it becomes hidden from our final distribution.

If weather is unobserved:

Marginalization helps because we can still model umbrella behavior even when X is missing.

That is useful when weather is a hidden variable.

In this sense, it is related to hidden state models such as HMMs.

In HMM-like thinking:

Weather can be treated as hidden state

Umbrella usage can be treated as observed behavior

Then even if we do not directly observe weather, we can still reason about observed actions through probabilities

Decision-making part:

If a city planner wants only the total average umbrella demand, then P(Y) may be sufficient, because it tells the planner that about 45% of people carry umbrellas.

But if the planner wants better forecasting under different conditions, then P(Y) is not enough.

They would need conditional probabilities such as P(Y|X), because umbrella demand clearly changes depending on whether it is sunny, cloudy, or rainy.
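To make the contrast concrete, here is a small Python sketch that computes P(Y = yes | X) from the joint table. As before, the assignment of cells to sunny, cloudy, and rainy is an assumption inferred from the numbers in this answer.

```python
# Joint table P(X, Y). ASSUMPTION: cells are ordered sunny, cloudy, rainy.
joint = {
    ("sunny", "yes"): 0.05, ("sunny", "no"): 0.35,
    ("cloudy", "yes"): 0.15, ("cloudy", "no"): 0.15,
    ("rainy", "yes"): 0.25, ("rainy", "no"): 0.05,
}

# Weather marginal: P(X = x) = sum over y of P(X = x, Y = y).
p_weather = {}
for (weather, _), p in joint.items():
    p_weather[weather] = p_weather.get(weather, 0.0) + p

# Conditional: P(Y = yes | X = x) = P(X = x, Y = yes) / P(X = x).
p_yes_given_weather = {
    w: joint[(w, "yes")] / p_weather[w] for w in p_weather
}

for w, p in p_yes_given_weather.items():
    print(f"P(yes | {w}) = {p:.3f}")
```

Under that assumed table, the conditional probabilities differ a lot between weather conditions, which is exactly the information the single number 0.45 hides.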

Important information loss:

The main thing that disappears is the dependence between X and Y.

In other words, we lose the structure of how weather affects umbrella usage.

This is important because two different environments could have the same marginal P(Y), even if their weather patterns are very different.

So yes, two very different weather distributions could produce the same overall umbrella usage.

That means P(Y) alone cannot tell us what is causing the behavior.
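A quick way to see this is to compare two joint tables. In the Python sketch below, env_a is the table from this answer (with the assumed sunny/cloudy/rainy ordering) and env_b is a purely hypothetical table invented for illustration. It is not data from the question.

```python
import math

def marginal(joint, axis):
    """Sum a joint table over the other variable (axis 0 = weather, 1 = umbrella)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

# Table from the answer (cell-to-weather mapping is an assumption).
env_a = {
    ("sunny", "yes"): 0.05, ("sunny", "no"): 0.35,
    ("cloudy", "yes"): 0.15, ("cloudy", "no"): 0.15,
    ("rainy", "yes"): 0.25, ("rainy", "no"): 0.05,
}
# HYPOTHETICAL second environment with much less sun and more rain.
env_b = {
    ("sunny", "yes"): 0.00, ("sunny", "no"): 0.10,
    ("cloudy", "yes"): 0.15, ("cloudy", "no"): 0.35,
    ("rainy", "yes"): 0.30, ("rainy", "no"): 0.10,
}

umbrella_a, umbrella_b = marginal(env_a, 1), marginal(env_b, 1)
weather_a, weather_b = marginal(env_a, 0), marginal(env_b, 0)

# Same P(Y) in both environments, but different P(X).
same_umbrella = all(math.isclose(umbrella_a[k], umbrella_b[k]) for k in umbrella_a)
same_weather = all(math.isclose(weather_a[k], weather_b[k]) for k in weather_a)
print(same_umbrella, same_weather)
```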

So as a final conclusion:

Marginalizing out weather is useful when we only care about predicting umbrella usage overall, because it gives a simpler and more direct distribution:

P(Umbrella = Yes) = 0.45

P(Umbrella = No) = 0.55

What we gain:

Simplicity

Lower complexity

Usable model even when weather is hidden or irrelevant

What we lose:

The relationship between weather and umbrella behavior

Explanatory and causal information

Ability to make condition-specific decisions

So P(Y) is enough for overall prediction, but not enough for deeper understanding or better weather-dependent decision making.

Answered: Evaluate an MDP given several observed episodes

Thu, 11 May 2023 12:58:28 +0000

Each state's value depends on the paths that continue from that state.

For A:
Episode 4: A -> x (sum = -10)
A = average( path sums ) = -10

For B:
Episode 1: B -> C -> D -> x (sum = +8)
Episode 2: B -> C -> D -> x (sum = +8)
B = average( path sums ) = (8 + 8)/2 = 8

For C:
Episode 1: C -> D -> x (sum = +9)
Episode 2: C -> D -> x (sum = +9)
Episode 3: C -> D -> x (sum = +9)
Episode 4: C -> A -> x (sum = -11)
C = average( path sums ) = (9 + 9 + 9 - 11)/4 = 4

For D:
Episode 1: D -> x (sum = +10)
Episode 2: D -> x (sum = +10)
Episode 3: D -> x (sum = +10)
D = average( path sums ) = +10

For E:
Episode 3: E -> C -> D -> x (sum = 8)
Episode 4: E -> C -> A -> x (sum = -12)
E = average( path sums ) = (8 - 12)/2 = -2
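The averaging above is direct (Monte Carlo) evaluation, and it can be sketched in Python as follows. The per-step rewards (-1 for each move, +10 on exiting from D, -10 on exiting from A) and the discount of 1 are assumptions inferred from the path sums listed in this answer. The episode list is reconstructed from those same paths.

```python
from collections import defaultdict

# Each episode is a list of (state, reward received on leaving that state).
# ASSUMPTION: rewards are inferred from the path sums, with no discounting.
episodes = [
    [("B", -1), ("C", -1), ("D", 10)],
    [("B", -1), ("C", -1), ("D", 10)],
    [("E", -1), ("C", -1), ("D", 10)],
    [("E", -1), ("C", -1), ("A", -10)],
]

# Collect the return that follows every visit to every state.
returns = defaultdict(list)
for episode in episodes:
    rewards = [r for _, r in episode]
    for i, (state, _) in enumerate(episode):
        returns[state].append(sum(rewards[i:]))

# The estimated value of a state is the average of its observed returns.
values = {s: sum(g) / len(g) for s, g in sorted(returns.items())}
print(values)
```

No state repeats within an episode here, so first-visit and every-visit averaging give the same estimates.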

Answered: Solve the V* values for this MDP

Mon, 08 May 2023 06:08:38 +0000

x4 = 0 + 0.9(0.9 * 1 + 0.1 * x4) => 0.91 * x4 = 0.81 => x4 = 81/91 = 0.890110

x3 = 0 + 0.9(0.9 * x4 + 0.1 * x3) => 0.91 * x3 = 0.81 * x4 => x3 = (81/91)^2 = 0.792296

x2 = 0 + 0.9(0.9 * x3 + 0.1 * x2) => 0.91 * x2 = 0.81 * x3 => x2 = (81/91)^3 = 0.705230
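The same pattern repeats for every state, so the chain can be checked with exact fractions. This is a minimal Python sketch under the assumptions read off the equations above: a discount of 0.9, a 0.9 chance of moving to the next state, a 0.1 chance of staying, and a terminal value of 1. The helper name solve_state is hypothetical.

```python
from fractions import Fraction

GAMMA = Fraction(9, 10)   # discount factor
P_MOVE = Fraction(9, 10)  # probability of reaching the next state
P_STAY = Fraction(1, 10)  # probability of staying in place

def solve_state(next_value):
    """Solve x = GAMMA * (P_MOVE * next_value + P_STAY * x) for x."""
    return GAMMA * P_MOVE * next_value / (1 - GAMMA * P_STAY)

x4 = solve_state(Fraction(1))  # next to the terminal state with value 1
x3 = solve_state(x4)
x2 = solve_state(x3)

for name, v in (("x4", x4), ("x3", x3), ("x2", x2)):
    print(name, v, round(float(v), 6))
```

Each step multiplies the next state's value by 81/91, which is why the values form a geometric sequence.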

Answered: Solve the V* values for this MDP - 5x5

Thu, 04 May 2023 02:50:52 +0000

At first glance, with its 17 empty spaces, this problem looks long. However, we can take advantage of the symmetry to reduce the number of unknowns to four.

01 x1 00 x1 01
x1 x2 x3 x2 x1
00 x3 x4 x3 00
x1 x2 x3 x2 x1
01 x1 00 x1 01

We can also read off the optimal action for each cell by inspection, which avoids evaluating every action and taking a maximum.

- Spaces with x1 should aim toward the terminal states with 1
- Spaces with x2 should move toward the x1 spaces
- Spaces with x3 should move toward x2 (since x4 in the middle is the farthest away from the terminals)
- Spaces with x4 should move toward x3 (no real choice)

Now, the value given a policy pi and the state s is

V(s) = sum (over all successor states s') of T(s,pi(s),s')*[R(s,pi(s),s') + gamma*V(s')]

x1 = 0.8(0 + 0.9*1) + 0.1(0 + 0.9*x1) + 0.1(0+ 0.9*x2)
x1 = 0.72 + 0.09*x1 + 0.09*x2

x2 = 0.8(0 + 0.9*x1) + 0.1(0 + 0.9*x1) + 0.1(0 + 0.9*x3)
x2 = 0.81*x1 + 0.09*x3

x3 = 0.8(0 + 0.9*x2) + 0.1(0 + 0.9*0) + 0.1(0 + 0.9*x4)
x3 = 0.72*x2 + 0.09*x4

x4 = 0.8(0 + 0.9*x3) + 0.1(0 + 0.9*x3) + 0.1(0 + 0.9*x3)
x4 = 0.90*x3
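The four simplified equations form a small linear system. One possible way to solve it, sketched below in Python, is to apply the equations repeatedly as update rules until the values stop changing (iterative policy evaluation). This is only an illustration that takes the four equations above as given. It is not necessarily how the original answer continued.

```python
# Start from zero and repeatedly apply the four equations as updates.
x1 = x2 = x3 = x4 = 0.0
for _ in range(500):
    # The right-hand sides use the previous iterate (simultaneous update).
    x1, x2, x3, x4 = (
        0.72 + 0.09 * x1 + 0.09 * x2,
        0.81 * x1 + 0.09 * x3,
        0.72 * x2 + 0.09 * x4,
        0.90 * x3,
    )

for name, v in (("x1", x1), ("x2", x2), ("x3", x3), ("x4", x4)):
    print(f"{name} = {v:.4f}")
```

The coefficients on the right-hand side of each equation sum to at most 0.9, so every sweep shrinks the error by at least that factor and the iteration converges to the unique solution.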