Wow. Its amazing how different it feels working in a lab
environment as opposed to a classroom. The schedule is much more
flexible, as are the deadlines, but for such a relaxed environment, it
is remarkably intimidating for a new-comer. While I felt very lucky to
be spending my summer with such an intellectual group, I also felt a
little inadequate. As well, everyone seemed to know exactly why they
were here and what they were trying to accomplish...I was still a
little shaky on the latter. I knew I would be working on knowledge
transfer for Markov Decision Processes (MDPs). Knowledge transfer is
the idea of learning a policy for one MDP, and using it on another
similar. A (very) long-term application for this concept is in
treatments for epilepsy...as it is impossoble to train an agent on an
actual human brain, it would be necessary to train it in a different
environment, then transfer it to the patient. However, this could only
be done if we could be assured that (assuming the training enviroment
was "close" enough to the actual one) the values of the policy would be
"similar" in both environments. So I knew why I was researching
knowledge transfer, but I wasn't quite sure how yet.
Shortly after I arrived in the lab, I was introduced to Norm Ferns, the PhD student whose research so far was what I was to build on for my project. Norm's work is centred around the idea of developing a metric to compare the "closeness" of states within an MDP. The metric was based on bisimulation, the idea that two states are equivalent if they have the same rewards, and the same probabilities of transitioning to equivalent states. This definition, besides being recursive, is quite a mouthful (and a mindful for that matter). I spent the rest of the week going through Norm's papers trying to get my head around the bisimulation metric and how it worked. In the big picture, the metric could be used to take a large MDP, and group states together in equivalence classes, based on the distance between as calculated by the metric. Then, this smaller MDP, where each state is a class of original states, could be solved in less time. Then using the policy form the smaller MDP on the original, one is guaranteed that the value functions will only differ by a bounded amount. So essentially, if you get satisfactory results in the simplified case, you can be assured that the same policy applied to the original will be comprable.
Monday morning began with a meeting with Norm in which he explained in
more depth, the metric described in his paper and how it had been
developed. The most interesting part of this explanation was how many
seemingly unrelated fields of math it tied together. Indeed, the
derivation of the metric involved everything from equivalence
relations (algebra), to lattices (logic and set theory), to Knaster-Tarski fixed point theorem (numerical
analysis), to empirical distributions (probability), to linear
programming, and finding the minimum cost of a flow (graph theory and
algorithms).
After the meeting, I began culling through the code used to compute the metric, and thinking about how I would need to adapt it to calculate the distance between two MDP's as opposed to two states in the same MDP. There were several stumbling blocks:
It is now Friday afternoon of my second week in the lab, and I am beginning to plan out how I am going to accomplish my research. It is exciting to finally understand the material well enough to know what questions to ask, even if I might not know how to answer them.
A meeting with profs Joelle Pineau and Doina Precup confirmed that I was on the right track with my "hack" of putting two MDP's in one file and running MDPSolve on it. However, it was indeed a hack, and it gave us much more information than we needed. Specifically, we didn't care about how close two states in the same MDP were to one another, or even about knowing the distance between every possible pair of states when one is taken from each MDP. The only thing we wanted to know was, given a mapping f:{S}->{S'} that mapped each state in MDP1 (S, A, P, R) to a unique state in MDP2 (S', A, P', R') (assuming |S| = |S'|), how "good" was this mapping?
Now "good" is open to interpretation. In theoretical terms, it means using some logical metric (like the one Norm and company had developed) to determine d(s, f(s)) for all states s in MDP1. From a more application-oriented viewpoint, it means "If I take a policy I learned on MDP1, and use it on MDP2, how much will my value function differ?". In fact, it turns out the first interepretation can be used to find that bound.
We began talking about ways to optimize the number of comparisons made by the algorithm, and ways to exploit the assumptions we'd made about the two MDPs: Same size state space (this could be removed pretty easily), same actions (this one is pretty much required in order to use the metric), the mapping, f is known, etc. Because of the nature of the Kantorovich metric, and its use of trajectories, it became evident that we could not reduce the number of comparisons done as much as we'd hoped. In fact the best we could do was modify the code so that it only compared two states if they were from different MDPs, but in fact, all of these such comparisons were needed to compute the metric.
Next we began trying to utilize the assumptions we'd made. We considered fixing a policy to get a markov chain and calculate the difference between the two environments under this policy. The advantage of this idea is that it reduces the number of actions per state to 1 (the one used by the policy) and thus, fewer samples, and trajectories are needed.
By the end of this week I was quite overwhelmed. The number of possibilities for how to approach this problem had suddenly become huge, and I wasn't quite sure if I understood any of them in enough depth to implement them. In desperation, I asked Norm if we could meet and discuss some of these ideas.
Meeting with Norm and Pablo (who is in charge of the implementation part of the bisimulation metric) was very helpful. For starters, it convinced me that I wasn't the only one who was a little confused about this stuff. I also began to understand why it was so much harder to compute the distance between states in different MDPs as opposed to when the states are in the same MDP. (It is because we cannot be guaranteed a bound on the value functions, and thus we do not have the right constraints to solve the linear program). Upon discovering this, we began to wonder if the metric was the correct way to measure these distances, after all, we were really concerned more about policy transfer than comparing states. To illustrate this, consider two MDP's: M1, M2, where M2 is identical to M1, except its rewards are all multiplied by 10. Clearly, a policy that is optimal on M1 is optimal on M2 as well, but the distance between states, since it is based on rewards and probabilities, might not reflect this.
To overcome this issue, we began to think about continuity rather than distance. Specifically, if the difference in values of two states in MDP1 is bounded by some delta, is it true that the difference of the values of their corresponding states in MDP2 is bounded by some epsilon dependant on delta only?
In math terms: ||v(s)-v(t)|| < delta == > ||v(f(s))-v(f(t))|| < epsilon?
I spent the rest of the week working on some proofs Norm had suggested I look at, including the concept of continuity as discussed above. The proofs helped me get a better idea of how the metric worked, and why, but I could not figure out how to apply the concepts to prove continuity. After what felt like a very unproductive week, I went to talk to Doina about some baby step I could take to start actually implementing something. This turned out to be a very productive meeting. First, Doina gave me a great description of the sampling algorithm in layman's terms (well, compared to what I'd read in papers), and then explained how it would be possible to modify it so that it took two MDPs as input, and only compared states if they were from different MDPs. It would be my job to modify the code so that it did this, and I was very much looking forward to a chance to dig into something a little more tangible than the abstract concepts I had been working with that week.
For once I entered the week with a clear cut idea of what I needed to get done. After looking over the existing code several times, I began to plan out how to modify it. At present, it loaded one MDP, initialized the necessary data structures, and ran the algorithm. I needed to load two MDPs, and for certain data structures (rewards matrices, transition probabilities, etc.) I needed two copies, and for others, (the Kantorovich matrix) I needed only one copy which both MDPs would share. The program was written in C++. and there was an MDP object already implemented. My first decision was whether to create two instances of the object, or to modify it to hold the data from two MDPs. Because I needed to access data from each MDP simultaneously, and they needed to share some data structures, I decided the best option was to modify the existing MDP object, to store the data for both MDPs.
In fact, this task was not as simple as I'd expected it to be. Aside from a lot of copying and pasting (if mdpNum==1 do this else if mdpNum==2, do that), there was some strategy in initializing data structures, as certain ones could not be initialized until all the data from each MDP was loaded. But this was easily overcome by breaking the initialization function into two separate ones.
After a couple days work, I had the modified code in some sort of working condition and began testing it. The first test I gave my code was a very simple 2-state MDP versus itself. I was hoping that the algorithm would determine that state 1 matched state 1 and state 2 matched state 2. Unfortunately, it didn't. The algorithm claimed that no states matched, and thus the debugging began. Finally, after printing out all the data inputted, I discovered my error. In fact, I had not been loading the second probability matrix correctly, and as a result the two MDPs were, well, different, so it was no wonder the algorithm wasn't finding them to be the same. At any rate, the bug turned out to be my forgetting to switch an index from 1 (for the first MDP) to 2 (for the second one) duh. Oh well, I suppose I needed to write those debugging print functions at some point...
It is nice to end the week on a good note, I suppose, even if I wasted a chunk of it looking for a rather trivial bug. Next week, in addition to actually testing the code on some more meaningful inputs than an MDP versus itself, there is a tutorial on bisimulation and MDPs, being given on Monday by another prof in the lab which I am looking forward to attending. It will be interesting and informative to get another look at the concepts, especially now that I understand the basic idea behind it.
This was not a very productive week in terms of making progress on my project, it was however, a very educational week. In fact, I spent most of the week attending talks on the theory behind bisimulation, which seems to extend infinitely in all directions. Indeed, it seemed the whirlwind tour Norm gave me barely scratched the surface of all the mathematical forces which manifest themselves through bisimulation. I will not attempt to summarize them here, as for some, the most I can recall is the whooshing sound they made as they flew over my head. However, some of the slides from the lectures are available here a word of warning though...these are the ones that got me very, very confused...intrigued, but confused.
Perhaps the best thing that I got out of the lectures was a renewed interest in math. I also realized that measure theory and set theory are pretty neat, and decided that taking more analysis courses in the fall might not be such a bad idea after all.
For now, though, I had to get back to work on my own project, which had reached a bit of a roadblock, in that my results seemed to vary greatly depending on which version (java or C++) of supposedly identical code I used. As a result, I was a bit wary of the results, and even more wary when I realized they failed to converge even after testing on ~1000 samples. This was a problem, as I was trying to get an estimate on a lower bound on how many samples were needed for a reasonable estimate.
Since summer had finally arrived in Montreal, I decided to abandon both math and computers for the weekend, and delve into the arts. Unfortunately, it poured on Saturday, so my plans of photographing the eclectic architecture of Montreal's Plateau neighborhood were postponed, but I did manage to get out on Sunday, and had a great time. It may take a while to get the prints up on my homepage though, as they are not digital, and therefor need to be tediously scanned in one by one.
I should try to have completely unproductive weeks more often, as it seems they are followed by very productive ones. This week, of course, fits into the latter category. It actually began the week before when I met with Joelle and Doina to discuss some approaches I could take with my project. They were very helpful, and gave me several ideas to work with:
The productivity continued on Tuesday monring when I met with Norm and Prakash to discuss from a mathematical point of view what we needed to prove to give some clout to the empirical results we were hoping to achieve from the experimments. Specifically, given two MDPs and a state mapping from the first to the second, we needed a bound on the value function of the induced policy, showing that if the two MDPs are behaviourally close in terms of bisimulation distance and the original policy is close to optimal then the induced policy is guaranteed to be close to optimal as well. This is quite a mouthful, especially, in paragraph form... I will try to decipher LaTeX and post a less verbose and more mathematical version at some point. We also discussed the idea of scaling rewards, to ensure that bisimulation distances were not magnified due to differences in rewards.
As an example of this, consider two very simple 2-state MDPs with 2 actions, flip and stay, each of which succeeds with some probability p. state 0 has a reward of 0 in both, however state 1 has a reward of 1 in the first MDP and 10 in the second. Clearly, the optimal policies on these are identical, but their bisimulation distances (based on value function) are quite different. We decided for the purpose of policy transfer, there was no harm in scaling rewards so that the smallest was always 0 and the largest was always 1.
This week was a breakthough week in terms of progress. To begin with, I finally found the bug in my code which was causing the odd results. It was very easy to fix, and I was able to get started on my benchmarking. However, waiting for the code to run on n samples, saving the results in a text file and running it on n+5 samples got very boring very quickly (unfortunately, the fix while improving the accuracy, had slowed down my runtime quite a lot, and even 100 samples now took a while to run). Laziness is the mother of invention, and, as such, I decided to take the time now and write a script which basically executed a for loop on whatever start, stop and increment parameters were specified, running the code with a different number of samples each time (ie, specifying 10, 40 10 would give you the results for 10 20 30 and 40 samples). These results were all saved in files, so the script could be left to run overnight, and the data could be analysed the next day. Of course if there was some dumb bug you forgot about, and the script didn't termiate properly, all your data was lost into the bowels of cyberspace. Overall, though I saved a lot of time, and managed to put together a few graphs showing how the accuracy of the metric increased as the number of samples did.
Norm also had a pretty productive week, and managed to prove the bound we were working on. Again, I fear this one is too mathematical to be outlined in a paragraph, so I will work on getting it into LaTeX, and post it that way. I also got my first taste of writing a scientific paper this week, as we have decided to submit this proof to an upcoming conference. I was given the job of writing an abstract, which was remarkably difficult (namely because you can't chicken out as I have done atleast twice in this journal and write things in math-ese rather than english...).
Another interesting event this week was Prakash's talk on Category Theory. The concept appeared very mathematical at first (it seemed very similar to group theory at first glance), but I am now beginning to see how it is an integral concept behind programming languages, a field which has always interested me.
This week was crunch week. With the paper deadline looming over our head (on June 30th), and Doina editing from long distance, we had our work cut out for us. I was given the job of designing an "illustration" to demonstrate the weaknesses/strengths of the bounds we had designed. There was also still some debate over whether or not reward scaling was a good idea. In a meeting with Norm and Prakash, Norm brought up a case where scaling rewards did in fact alter the dynamics of the environment. We decided that this might be a good example to include in the paper. At first glance it appears that I got off pretty easy: All I had to do was design a gridworld in which the dynamics were similar to those in the example, and Prakash had volunteered to write up an explanation for it.
However, it soon became evident that drawing such a gridworld free-hand was not the easiest task, unless you were aiming for something that looked like it belonged on a middle-school level poster-board. So I did what every good nerd does when they need to make something look professional: I turned to LaTeX. Except I didn't know about xfig. xfig is a program which lets you essentially draw things free hand, and then export them to a .tex file, giving you the best of both worlds. But like I said, I didn't know about this. Instead, I googled "pictures in LaTeX" and discovered that the options were somewhat rudimentary: Boxes and lines. Thus, I drew my gridworlds line by line. Needless to say, it was rather tedious, but the grids came out very regular (one would hope, as I drew them by plotting co-ordinates). When I showed my work to Prakash, he got a good laugh out of it, and compared it to writing an entire program in assembler. Especially when he saw my code peppered with comments such as "first vertical line". Shortly there-after I learned about two things: 1) Snap-to-grid: which makes it remarkably easier to draw perfect horizontal and vertical lines by hand, as opposed to plotting them. and 2) export to .tex file: which could apparently be done from my diagram editor too. Oh well, I guess I'll know for next time.
Towards the end of the week, we concluded that while our last minute efforts had been a good way to kick-start our research, we did not have a meaty enough paper to submit anytime soon. As a group, we decided it would be better to wait and do more work and submit a more thorough paper later, than trying to scrape one together for Friday. In the process, I got my project for the next few weeks.
We had proved a very nice theoretical bound on the difference in value functions between two MDP's using their Kantorovich distance, however, we were beginning to have our doubts about the tightness of the bound. Thus I had two tasks: First to develop a few similar gridworld with slight variations in goal location, determine the state-to-state Kantorovich distance and the value functions for each state and compare the two. And Secondly, to find an example where the bounds were obtained.
July began with a lovely fireworks show down at Old Port on Saturday, and a wonderful meal at Prakash's place on Sunday night. He invited the undergrads of our lab, and their other halves over not only to ejnoy a wonderful Indian dinner followed by some wonderful cheese, port and BBC sit-coms, but also for a lesson in Indian cuisine. It was a very educational experience, so much so in fact that I was inspired to look beyond my Patak's curry paste, and attempt curried eggplant from scratch, by mixing my own spices. Aside from having to grind up my own cardamom pods (I later learned that if I roasted them first, the grinding became much easier), the eggplant seemed to be a success, but then, as far as I'm concerned, eggplant=>tasty.
Monday saw the return of the work week. This was a bit of a recovery
week for two reasons: 1) I had spent the week before cramming for a paper
that didn't happen, and 2) I was counting down the days to the cottage, and
my brain was jumping the gun a bit. That is to say, I'd be working away,
when suddenly I'd notice I'd stopped typing, in fact, I'd stopped thinking
all-together and my mind was off on a dock somewhere in Northern Ontario. As
the week wore on, I became less persistent about reclaiming it. I did
however get some work done. I put together a small display of the data I'd
acquired while benchmarking how many samples were needed to estimate
distances accurately and presented it to Doina. Since she was going to away
at conferences for the some time, we came up with some projects I could work
on somewhat independently until her return. They included scaling the
results of my benchmarks, to dampen the effects of varied discount factors,
a project I have not yet tackled. The other (more initriguing) task, was
testing my results on larger gridworlds. I designed a simple 4-room
gridworld in which I could vary the goal location easily, and created three
mdps with slight variations in goal location. Then I approximated the
Kantorovich metric by sampling and saved the values I needed in a text file
(for my purposes, I only cared about how far state #1 in MDP1 was from state
#1 in MDP2, not how far it was from states 2, 3, 4... though the program
gave me all this data anyways). With some help from matlab. I plotted my
"world", and showed which states were the most "distant" when the goal
location changed. Not surprisingly, they were the ones closest to the
goal. What this meant was that, essentially, if we wanted to transfer data
from one MDP to another, we could reuse the original poicy, until right near
the goal, when it would need to be altered.
This is one of the images rendered in MatLab. the dark blue squares
denote walls, the lighter blue ones denote the closest states, and the dark
red ones denote the furthest states. In this example, the start state was
the bottom left corner, and the two goal states were the botton right state,
and the one directly above it. Testing the tightness of these bounds against
the actual values of the states would have to wait until after vacation.
Oh to be a cat,
Then the summer vacation
Would last forever.
I began comparing the actual difference in the value functions of my gridworlds to the distances estimated by the Kantorovich. I was not expecting the estimates to be exact, but I was quite surprised when they seemed to be almost consistently a factor of 10 larger than the actual difference in value functions. Part of the cause of this is that the Kantorovich is a very pessimistic metric. In fact, it is calculating something stronger than we need. Since the Kantorovich is not policy-specific, it assumes the "worst-case scenario" and chooses actions which will maximize the distance between states. This is not entirely surprising, as the Kantorovich metric is designed to determine bisimilar states, as opposed to states which behave the same under a specific policy. With this in mind, we began thinking of ways to modify the Kantorovich metric to get a more accurate estimate of how well a policy might transfer, as opposed to an upper-bound.
Despite being pessimistic, there were some aspects of the Kantorovich which are highly desiriable when trying to measure distance. Namely, it is a metric. This means that properties such as transitivity, and the triangle inequality hold, and we can use information from several different MDPs to learn how a policy might transfer from one to another. As well, it is continuous and stable, meaning, if an MDP is altered slightly, its distance from a second MDP will change proportionally. Once we begin to turn to policy specific measures, we lose many of the above-mentioned qualities. Despite this, we developed a measure of similarity, which, rather than taking the maximum distance between states over all actions, takes the weighted average of the distance between states according to some predescribed policy. This change cuts down the inflated distance between states, but loses much of the stability of the original metric. For starters, the new measure is not a metric, meaning knowing the distance between two MDPs tells us nothing about how they will compare to others (no transitivity, triangle inequality etc.). And just as we can design policies which minimize the differences between MDPs, we can also design ones that exploit them. This means that any similarity bounds we had figured out using the Kantorovich no longer hold with the policy-specific measure. It seems that while the Kantorovich metric is to rigid for our purposes, the policy-specific measure is too loose.
This week we began to look more closely at why the Kantorovich metric
inflated distances between MDPs so much. Specifically, even when the actions
were completely deterministic, the state-to-state distance still seemed much
larger than expected. The cause turned out to be the tendency for minor
differences in rewards to accumulate, in MDPs where there was a possibility
to "loop" back to the same state over and over, especially when a discount
factor of 0.9 was used. In order to get around this we began thinking about
ways we could modify the metric to put less emphasis on minor differences like
this, and more emphasis on general behavioural trends. To this end, we came up
with two possible courses of action. One idea was to use the original metric,
but with larger time-steps. This served the dual purpose of decreasing the
discount factor (as it was raised to a power depending on the size of the
timesteps), and essentially "zooming out" on the MDPs. In other words, suppose
we have two MDPs which cycle between a state with a reward of 1 and 0, at t=0
the first MDP is in the state with reward 1, while the second is in the state
with no reward. In determining the distance in states, at every iteration,
since one MDP has a reward of 1 and the other a reward of 0, there will be a
difference of 1 in the rewards. Using a discount factor of 0.9, this compounds
to an expected difference in value functions of about 10. However, using a
time step of length 2, we see that after 1 timestep (2 actions) the first MDP
has accumulated a reward of 1+c*0 (where c is the discount) while the second
one has accumulated a reward of 0+c*1. Using a discount factor of 0.9, this
amounts to a difference of about 0.1 per timestep, and 1 in total.
The second approach we discussed was a method called "diffusion". Diffusion
does not use bisimulation. Instead it assumes the start-state to be a source
and the goal to be a sink. Points scattered randomly on the MDP will be drawn
towards the goal, from the source. Thus if the system is left to stablize,
the value of a state can be inferred from the density of points in that
state. Both of these ideas are promising and we intend to investigate them
further in the coming months.
I spent the remaining weeks of summer once again brushing up my LaTeX skills by creating both a slide show, to present my work to other undergraduates in the school of Computer Science who had been doing research over the summer, as well as a final paper. Both of which can be found from the main page of this website. In fact, it was very interesting to see what everyone else had been working on over the summer, and there were some very intriguing projects going on. Details on the McGill Undergraduate Summer Research Symposium can be found here