<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://ianq.ai/feed.xml" rel="self" type="application/atom+xml" /><link href="https://ianq.ai/" rel="alternate" type="text/html" /><updated>2026-03-03T07:40:51-08:00</updated><id>https://ianq.ai/feed.xml</id><title type="html">Ian Quah</title><subtitle>Ph.D. Student in Neuro ∩ ML</subtitle><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><entry><title type="html">The Uncertainty Of Shaving the Yak</title><link href="https://ianq.ai/hard-problems/" rel="alternate" type="text/html" title="The Uncertainty Of Shaving the Yak" /><published>2025-09-15T00:00:00-07:00</published><updated>2025-09-15T00:00:00-07:00</updated><id>https://ianq.ai/hard-problems</id><content type="html" xml:base="https://ianq.ai/hard-problems/"><![CDATA[<p><strong>Alt Title</strong>: The Uncomfortable Truth About Hard Problems</p>

<h1 id="the-discomfort-of-hard-problems">The Discomfort of Hard Problems</h1>

<p>At the beginning of my second year in my Ph.D., I remember having a discussion with my PI where he told me to run a validation for my experiment. Instead, I dragged my feet and chose to clean up my code (adding logging, modularizing functions and general refactoring). I told myself that my cleanup was worthwhile: when the validation showed positive results, I’d need scalable, correct code for larger simulations. But I was really avoiding the validation because I knew that if my experiment failed the validation step, I’d be back to the drawing board after “wasting” a year. Instead of embracing the possibility of failure and doing the emotionally hard work, I spent time on busy-but-productive tasks. As my PI suspected, the validation killed that experiment. Even though I knew experiments fail for countless reasons and that failure wasn’t a reflection of my abilities, I couldn’t help but feel like I had failed.</p>

<p>Having said that, that’s not the story I’m trying to tell here - what I described happens all the time in research and is often talked about. What I’m here to discuss is what happened over a year later, when I found a new substrate to test the idea and discovered that the code cleanup had actually saved me time; cleaning up code (especially messy code) more than a year after writing it is nightmarish and bug-prone.</p>

<p>This experience reinforced something I’d been thinking about for a while: hard problems aren’t just difficult because they require specialized knowledge, or involve many complex steps; they’re hard because they exist in a liminal space where you genuinely can’t tell if you’re making progress or just spinning your wheels. We’re trained to break down problems into manageable steps, but what happens when you can’t tell if those steps are worthwhile? Is the work you’re putting in going to pay off down the line, or is it all wasted effort?</p>

<h2 id="the-yak-shaving-paradox">The Yak Shaving Paradox</h2>

<p>In the tech industry, there’s a concept called “shaving the yak” that perfectly captures this dilemma; it has two almost contradictory definitions:</p>

<ol>
  <li>
    <p>Any apparently useless activity which, by allowing one to overcome intermediate difficulties, allows one to solve a larger problem.</p>
  </li>
  <li>
    <p>A less useful activity done consciously or subconsciously to procrastinate about a larger but more useful task.</p>
  </li>
</ol>

<p>This blog post has been the longest (time-wise) I’ve ever worked on, because it was so hard to put into words what I was experiencing. It wasn’t until I was talking to one of my friends that the phrase about shaving the yak popped into my head, and everything flowed. That conversation made me realize that part of what makes research so difficult is that the same work can be either essential groundwork or elaborate procrastination (the busy-but-productive work), and you often can’t tell which you were doing until after the fact.</p>

<p>In my original example, I was clearly shaving the yak. If you had asked me after I killed that experiment, I would have lamented the time I had wasted and said that I clearly fell into the second camp. Yet even then, the improvements I made to my code have made revisiting the problem far easier. This ambiguity is what makes hard problems so uniquely uncomfortable.</p>

<p>Consider what happened immediately after I killed that experiment. I had to write a report on my research from that first year, and I now faced another dilemma: the experiment was already “dead”, so do I put a lot of time into that report by generating and polishing figures, rerunning simulations that would take a while, and doing a more thorough literature review -OR- do I throw in the towel, say that the results weren’t what I expected, and move on to the next problem? Both choices felt like they could be yak shaving. Polishing a failed experiment might be pointless busy work to avoid the difficult task of identifying a new research problem, or it might be essential documentation of what I’d learned. Cutting my losses and moving on to the next project might be productive progress, or it might be my way of avoiding the hard work of understanding why the first approach failed beyond just throwing my hands up and saying “biology is messy”. The reason this ambiguity persists is that hard problems unfold over such long timescales that you’re making decisions with incomplete information about what will actually matter.</p>

<p>I can’t help but think of reinforcement learning and the sparse reward problem, where we only get feedback after a long time, as well as the credit assignment problem, where it is difficult to determine which of the steps we took led to our current outcome.</p>

<h2 id="when-our-ego-gets-involved">When Our Ego Gets Involved</h2>

<p>The uncertainty becomes even more uncomfortable when we get emotionally invested in our approach. As I mentioned in my previous post, <a href="./2025-09-06-back-to-school.md">I can’t imagine going back</a>:</p>

<blockquote>
  <p>What I mean is this: you need a sort of “balance” where you have to care about your research, and I mean <strong>deeply</strong> care about your research - it has to fill you with wonder and make you really contemplate the hard questions - but at the same time you have to be able to take your failures (and there will be many) on the chin and try the next idea</p>
</blockquote>

<p>Many of us do what <a href="https://calnewport.com/">Cal Newport</a> dubs “knowledge work”: work that benefits from thinking deeply about problems and solving them. Unsurprisingly, this kind of work naturally ties our self-worth to our ideas, methods, and solutions to the challenges we face; not unlike how a carpenter might tie their self-worth to the furniture they craft and the design decisions they make given a client’s requirements. When you can’t tell whether your work is productive or just elaborate avoidance, questioning your process and methods feels like questioning your judgment and intelligence. Most of us took some form of science class during our schooling years, where we learned how to construct hypotheses by searching through the literature, forming a testable question, and, most importantly, defending our position.</p>

<p>My advisor asking me to carry out that validation scared me because, in a sense, that project was a part of who I was - it not only symbolized a year’s worth of time and effort, but was the product of everything I was: a <em>scientist</em>. Thankfully, because I had already established a sense of self-worth outside of academia, I shrugged it off quickly, but the experience stuck with me.
Learning to separate your self-worth from individual experiments or projects is imperative to navigating this ambiguity. Instead of being precious with your ideas, you accept that failure is multi-faceted and multi-factored: failure can happen because the biology is just too complicated, the recordings weren’t of high enough resolution, or there was human error in the data collection.</p>

<h2 id="going-back-to-school">Going back to School</h2>

<p>That conversation I mentioned at the start was a bitter pill to swallow - it was emotionally painful to kill an experiment I’d spent a year working on. But it taught me something important about not being precious with ideas and learning to reflect on which form of yak shaving I’m doing (even if it’s an uncertain activity). In my previous blog post, I mentioned going back to academia because the research I had my eye on was progressing quickly, but that’s not the whole truth. I went back because I wanted to develop the skill of navigating both forms of yak shaving in a “sink or swim” environment.</p>

<p>I don’t think the discomfort I mentioned at the start, the push and pull between productive and unproductive work, ever fully goes away. What’s worse is how easy it is to forget this dynamic and get wrapped up in busy work - it happens all the time, and even to the best of us. But I <strong>do</strong> think you can develop better judgment about when to push through and when to step back, though it’s not something you can easily learn alone. You need honest mentors and a strong support network who will give you fair but direct feedback about your work. The skill isn’t eliminating ambiguity but learning to navigate it with guidance and practice.</p>

<p>This challenge isn’t exclusive to academia, and I’m sure it applies across fields where long-term, uncertain work is the norm. To tie things back to the tech industry, there’s a reason junior engineers don’t work on “bigger picture” projects: it’s hard to know where a field might turn, and to pivot seamlessly, when you can’t yet distinguish necessary preparation from elaborate avoidance, such as jumping to the shiny new tech stack.</p>]]></content><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><category term="grad school" /><summary type="html"><![CDATA[Alt Title: The Uncomfortable Truth About Hard Problems]]></summary></entry><entry><title type="html">I can’t imagine going back</title><link href="https://ianq.ai/back-to-school/" rel="alternate" type="text/html" title="I can’t imagine going back" /><published>2025-08-26T00:00:00-07:00</published><updated>2025-08-26T00:00:00-07:00</updated><id>https://ianq.ai/back-to-school</id><content type="html" xml:base="https://ianq.ai/back-to-school/"><![CDATA[<p><strong>Spoiler</strong>: I could :)</p>

<p>A bit about me: I left the tech industry after 6 years, arguably at the peak of my career, to go back to grad school. When I shared this decision with my friends, after the initial congratulations and “I’m so happy for you”s, I often got a few questions: 1) why do you want to go back? And 2) what are you researching? I assume they were asking either out of politeness or to keep the conversation going, but a few years into my Ph.D. I thought I should finally sit down and answer the (arguably) easier one: the first.</p>

<p>If you’re here reading this post, you’re in one of a few camps: you’re contemplating grad school yourself after a few years away working somewhere, to which I say: know that my experience is that of someone who worked in tech and does computational neuroscience (highly related to my undergrad and to what I worked on in industry), so YMMV; or you’re a friend who knew that I waffled through my in-person response and you’re hoping that I’ve fleshed out my thoughts; or I just shared this on LinkedIn or one of the grad school subreddits, in which case hi!</p>

<!--toc:start-->

<ul>
  <li><a href="#context">Context</a>
    <ul>
      <li><a href="#its-all-about-timing">It’s All About Timing</a></li>
      <li><a href="#you-have-to-be-ready">You Have to be “Ready”</a></li>
    </ul>
  </li>
  <li><a href="#returning-to-academia">Returning to Academia</a></li>
  <li><a href="#the-opportunity-cost-of-graduate-school">The Opportunity Cost of Graduate School</a></li>
  <li><a href="#closing-words">Closing Words</a>
<!--toc:end--></li>
</ul>

<h1 id="context">Context</h1>

<p>To properly answer why 2022 felt like the right time, we need to go back another 10 years, to 2012, when I first applied to CMU. At the time, a standard college application essay prompt was “Why do you want to join $COLLEGE_NAME?”, and for my CMU application I wrote about wanting to build brains from computers. The exact details I no longer remember, but it was something about researching how we could hook up enough CPUs together to simulate every neuron in the human brain simultaneously. Lofty ambitions for a kid who hadn’t even written his first line of code, but the admissions office liked it enough to admit me into the Cognitive Science program. Fast forward to 2017, when I graduated, and I felt no closer to accomplishing my original goal. The only thing that had changed was that I realized how lofty and underspecified my research question was. I was at a fork in the road: I could cut my losses and go into the tech industry, or I could go to graduate school. As I was mulling over taking the GREs and applying to master’s programs (my undergraduate GPA was abysmal after floundering around for my first two years), I ended up having a long chat with one of my research mentors at the time, which changed my trajectory through life. He sat me down and very kindly shared his own journey from his undergraduate degree to his Ph.D. His was slightly less meandering than mine: he worked for a few years after graduating and then applied to graduate school once he was sure it was what he wanted to do. He then gave me two insights to reflect on: the things I was interested in researching just weren’t in line with what the field was working on, and I wasn’t “ready” yet; even if I were accepted, I would probably drop out of the program. <strong>Ouch.</strong> Rough, but in hindsight, true.</p>

<h2 id="its-all-about-timing">It’s All About Timing</h2>

<p>Claude Shannon was a genius, and his seminal work on information theory seeded an entire new field. I am not Claude Shannon, and there was no way I would be able to spawn an entirely new field around neuroscience-driven deep learning. So I was left with a choice: apply to a professor who was working on the topics that interested me and pray that the subfield takes off, or put off my Ph.D. ambitions until I was confident the field was stable enough to sustain itself, but still early enough that I might be able to make a contribution to it. That’s a lot to hedge against, and I don’t think it would have worked out. Perhaps more importantly, the questions I found interesting when I graduated, I no longer find interesting (or even answerable); I no longer care about creating an artificial brain, and I’ve instead become much more interested in studying structures in the brain and how they influence the kinds of computations we can do. It’s fair to say that my interests have changed since joining my current lab, the <a href="ahmedlab.science">Ahmed Lab</a>, but even then, all my work is possible only because of pure dumb luck and good timing: as I was gaining faith that my field of research interest wouldn’t disappear overnight and applying to labs, the <em>D. melanogaster</em> community was having its own “upheaval” with the release of the <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC8903166/">FlyWire</a> dataset. The dataset was a marvel of modern ML and intensive neuroscience work: the community worked together to map out the entire connectome of the fruit fly via image segmentation methods and crowdsourced neuron annotation. If it weren’t for the release of this dataset (which opened up far more possibilities than my original reasons for joining the program), my research would look very different and I don’t think I would find it as interesting.</p>

<p><em>Pure dumb luck.</em></p>

<h2 id="you-have-to-be-ready">You Have to be “Ready”</h2>

<p>There’s something to be said about timing, but there’s also something to be said about who you are as a person and where you are emotionally. Looking back, I know that I just didn’t have what it would take to make it. Too much of my self-worth was tied to my research output and I just took failures far too hard, which is ironic to say coming from a Ph.D. trainee. What I mean is this: you need a sort of “balance” where you have to care about your research, and I mean <strong>deeply</strong> care about your research - it has to fill you with wonder and make you really <a href="./2025-09-12-work-on-hard-problems.md">contemplate the hard questions</a> - but at the same time you have to be able to take your failures (and there will be many) on the chin and try the next idea. In college I was a solid B- student (a solid C- student up to my junior year, when I started research), so my sense of purpose was intimately tied to my research and I couldn’t imagine life outside academia because so much of my “upturn” in confidence was tied to research. By taking a step away for a few years, I learned that I could very much exist outside of the research sphere. In fact, I could thrive, and this has honestly carried me through graduate school.</p>

<h1 id="returning-to-academia">Returning to Academia</h1>

<p>Before I begin, I have to state that I unequivocally believe there are fields and industries where you can get by without a Ph.D. I knew extremely talented people who went on to work as researchers or research engineers at DeepMind or Google Brain after their undergraduate degree or their masters. I consider this the “best case” if you’re in that pool: you have a large moat of resources, fantastic pay, and access to world-class researchers all working on similar things. And then there are industries where the work doesn’t tie as directly to academia, and here you’re left with the unfortunate decision of whether you should return. I think neuroscience is one of those fields. When I first joined my lab, I was thinking about how we could use curiosity-based RL methods or inverse RL methods to model behavior and use that behavior to better understand neural systems. Over time, my work and research interests evolved into studying the relation between structure and function (see my lab’s webpage); without access to the lab materials (reagents, gases, etc.), my work would be impossible. My work fundamentally could not be done in an industry setting, not because of a lack of resources, but because this sort of work isn’t yet marketable and there’s no clear path to monetization. Moreover, a lot of my work is testing hypotheses about how manipulating some gene will influence behavior, which isn’t in line with a standard business model.</p>

<p>So, why did I return to graduate school? I returned because the kinds of questions I was interested in, I could not pursue alone and was very unlikely to pursue at a company. Why did I choose that year to return? Around the time I applied back in 2022, there were quite a few papers that guided my thinking and reasoning: <a href="https://www.sciencedirect.com/science/article/pii/S1364661319300610">Reinforcement Learning, Fast and Slow</a>, <a href="https://arxiv.org/pdf/2006.04439">Liquid Time-constant Networks</a>, and <a href="https://www.sciencedirect.com/science/article/pii/S0896627322008066">In vitro neurons learn and exhibit sentience when embodied in a simulated game-world</a>. It wasn’t so much that any one of those papers signaled to me “now’s the time”, but more that research into what I was interested in no longer seemed like niche one-off papers, which gave me confidence that interest in these topics wouldn’t just dry up. These papers signaled a resurging interest in bringing biology, neuroscience, and machine learning together, whether to study behavior that’s extremely difficult to capture in the wild, to train more efficient models, or to do the crazy sci-fi stuff it felt like the field had lost while chasing benchmarks. From attending <a href="https://www.neuroaiseattle.com/">NeuroAI Seattle</a> in both 2024 and 2025, I think my hunch was right - hearing the speakers talk about the interesting work they did made me confident that the field has a strong future.</p>

<h1 id="the-opportunity-cost-of-graduate-school">The Opportunity Cost of Graduate School</h1>

<p>Let’s get one thing clear: the opportunity cost of grad school is immense, and it only gets worse the further out you are from your undergrad. From a career progression standpoint, the natural next step after a Ph.D. is a post-doc position, but that pipeline doesn’t look much better: excellent researchers struggle to find positions all the time. From an earnings standpoint, simple back-of-the-envelope math puts the loss of earnings at over a million dollars in <em>raw salary</em>, not accounting for promotions, benefits, and compound interest.</p>

<p>That segues into lifestyle changes: unless you come from money, you need to adjust your spending habits. I like to think that I live rather simply, but even then, it’s still depressing how slowly my bank account numbers go up. You’ll also face an increased amount of stress and general pressure to publish and do research; at the Ph.D. trainee level, every moment you’re not researching is time that another researcher could be spending on a similar idea (which might cost you that post-doc position), and at the PI level, you’re constantly worrying about funding for your lab and endlessly writing grants. I’d say it’s most similar to being a founder at a company, which might explain why so many freshly-minted Ph.D. holders migrate into tech. From a social standpoint, you will undoubtedly experience FOMO - your friends will be getting married, buying houses (fingers crossed), and hitting all the “milestones” of being an adult while you’re “still in school”. I understand that academia has a reputation for sticking to itself, but I <strong>get it</strong> - the cadence your life follows is just fundamentally different from that of your friends. To people with no insight into the lifestyle, you’ve entered a phase of “delayed adulthood” (a comment I actually received), but in reality you’re just trading certain things off for others.</p>

<h1 id="closing-words">Closing Words</h1>

<p>Despite all these drawbacks of returning after a few years away, I will say that I think everyone should work for a bit outside academia before going back. Leaving academia undoubtedly changed my perspective on time management, productivity, and <strong>self-worth</strong>. If you’re working and have been away from academia for a bit, I’d say reflect on what I mentioned, particularly about doing research while working at a company. If you truly feel that you cannot accomplish it in that context, then I’d say take the leap. If you’re a friend reading this, do let me know what you think - hopefully this answered any questions you might have had!</p>]]></content><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><category term="grad school" /><summary type="html"><![CDATA[Spoiler: I could :)]]></summary></entry><entry><title type="html">On Monads, Monoids and Endofunctors 1: The monoid</title><link href="https://ianq.ai/cats-big-data-monoid/" rel="alternate" type="text/html" title="On Monads, Monoids and Endofunctors 1: The monoid" /><published>2022-07-19T00:00:00-07:00</published><updated>2022-07-19T00:00:00-07:00</updated><id>https://ianq.ai/cats-big-data-monoid</id><content type="html" xml:base="https://ianq.ai/cats-big-data-monoid/"><![CDATA[<p><strong>Spoiler</strong>: Category theory has applications in machine learning</p>

<p>I’m a fan of code abstraction; I like how clean code looks and “feels”. I think that clean and good code is like art. And just like art can be categorized into styles such as Impressionism, Neo-Impressionism, and Post-Impressionism (all of which I like), we can also organize code.</p>

<p>In this post, I’m not talking about functional vs. imperative vs. object-oriented programming, but about the mathematical structure in code. You might have heard of concepts such as monads, monoids, and functors. At an abstract level, these concepts lay out specific properties that describe how data can flow between various classes (in the programming sense, e.g., Python, C++, Java). The benefit is that if your code fulfills the requirements laid out by these categories, you get certain guarantees about your program’s results and about how you can compose computations together.</p>

<p>This is the first in a series of blog posts discussing categories in programming languages, which will hopefully help you notice patterns and write cleaner code. This series will not be mathematical and assumes no prior knowledge other than <code class="language-plaintext highlighter-rouge">python</code> (which you don’t even really need - it just provides concrete examples of what we’re doing).</p>

<p>We will continually expand on the following scenario throughout the series as we go from “ugly” unabstracted code to clean abstractions. It’s important to note that you (and I) have probably written code that fits these concepts without even realizing it! The concepts introduced here are meant to make you more aware of what you are writing and help you notice these patterns, allowing you to reuse lots of code you have already written.</p>

<h1 id="1-initial-project">1) Initial Project</h1>

<p>You are working on a project involving “parallel” computation, e.g., you have multiple computers or processes on the same system. Concretely, you have 100 machines with identical datasets on them. You want to do a hyperparameter search, e.g., ten searches over each of the 100 machines, totaling 1K runs. For each run, you want to track some validation loss before returning the model with the lowest validation loss.</p>

<p><strong>Note</strong>: Throughout this post we assume that you have some <code class="language-plaintext highlighter-rouge">train</code> and <code class="language-plaintext highlighter-rouge">validate</code> method implemented.</p>
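<p>If you want to run the snippets in this post end-to-end, minimal stand-ins for <code class="language-plaintext highlighter-rouge">train</code> and <code class="language-plaintext highlighter-rouge">validate</code> like the ones below will do. These are illustrative stubs, not the assumed real implementations:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random

def train(conf, train_data):
    """'Train' a model: here, just bundle the config with the data."""
    return {"conf": conf, "data": train_data}

def validate(trained_model, validation_data):
    """'Validate' a model: return a pseudo-loss, deterministic per config."""
    rng = random.Random(str(trained_model["conf"]))
    return rng.uniform(0.0, 1.0)

loss = validate(train({"lr": 0.01}, [1, 2, 3]), [4, 5])
</code></pre></div></div>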

<h2 id="11-simple-scenario">1.1) Simple Scenario</h2>

<p>If you were to find the best model, you might have something like the following:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="kn">import</span> <span class="nn">math</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Any</span><span class="p">,</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">List</span><span class="p">,</span> <span class="n">Tuple</span>

<span class="n">Dataset</span> <span class="o">=</span> <span class="n">Tuple</span><span class="p">[</span><span class="n">NumericalArray</span><span class="p">,</span> <span class="n">NumericalArray</span><span class="p">]</span>
<span class="n">ValidationResults</span> <span class="o">=</span> <span class="n">List</span><span class="p">[</span><span class="nb">float</span><span class="p">]</span>

<span class="k">class</span> <span class="nc">Node</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""
    A compute node on a single machine
    """</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">Dataset</span><span class="p">,</span> <span class="n">hyperparameters</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]]):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">conf</span> <span class="o">=</span> <span class="n">hyperparameters</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>  <span class="c1"># A Map 
</span>        <span class="s">"""Run the training and validation"""</span>
        <span class="k">for</span> <span class="n">conf</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">conf</span><span class="p">:</span>
            <span class="n">trained_model</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">conf</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">validate</span><span class="p">(</span><span class="n">trained_model</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span><span class="p">))</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">validation_lists</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">ValidationResults</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>  <span class="c1"># A reduce
</span>        <span class="s">"""
        validation_lists = [
            [1, ..., 10]  # Node 1
            [0.1, ..., 1.0] # Node 2
            ....
        ]
        """</span>
        <span class="n">minimum</span> <span class="o">=</span> <span class="n">math</span><span class="p">.</span><span class="n">inf</span>
        <span class="k">for</span> <span class="n">arr</span> <span class="ow">in</span> <span class="n">validation_lists</span><span class="p">:</span>
            <span class="n">minimum</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">minimum</span><span class="p">,</span> <span class="nb">min</span><span class="p">(</span><span class="n">arr</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">minimum</span>
</code></pre></div></div>

<p>You would distribute this to all your nodes. After completing its computation, each node sends its report (a list of 10 floating-point numbers describing the validation losses) to a “reducer” node. The reducer node accumulates all 1K results before reducing them to find the minimum value.</p>
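<p>What makes this reduction trustworthy is that <code class="language-plaintext highlighter-rouge">min</code> is associative and <code class="language-plaintext highlighter-rouge">math.inf</code> acts as its identity, so it doesn’t matter how the 1K results are grouped before reducing. This is exactly the structure the series is building toward, and we can check it directly (a standalone sketch with illustrative names, not code from the project above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
from functools import reduce

def combine(a, b):
    # Associative: combine(combine(a, b), c) == combine(a, combine(b, c))
    return min(a, b)

IDENTITY = math.inf  # combine(IDENTITY, x) == x for any float x

losses = [0.9, 0.3, 1.2, 0.7]

# One flat reduce, as a single reducer node would perform
single = reduce(combine, losses, IDENTITY)

# Split across two "reducer nodes", then combine their partial results
left = reduce(combine, losses[:2], IDENTITY)
right = reduce(combine, losses[2:], IDENTITY)
split = combine(left, right)

assert single == split == 0.3
</code></pre></div></div>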

<h2 id="12-complicated-scenario">1.2) Complicated Scenario</h2>

<p>The situation above is straightforward: the final node in the graph accumulates all the <code class="language-plaintext highlighter-rouge">report</code> results and finds the minimum, which doesn’t take up too much memory since floats are cheap to store.</p>

<p>However, what happens if we want to compute more than just the minimum loss, and our data takes up much more memory? In that case, we can apply multiple layers of reduction: Nodes 1-10 send their results to Reducer1, Nodes 11-20 send to Reducer2, and so forth. At the end, a final reducer takes the results from all the intermediate reducers to produce the final result.</p>
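<p>Before looking at the code, it helps to convince ourselves that layering reducers is safe: because <code class="language-plaintext highlighter-rouge">min</code> is associative, grouping nodes under intermediate reducers cannot change the final answer. A small sketch (hypothetical data, not from the project):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Results from four nodes (each a list of validation losses)
node_results = [[1.0, 0.8], [0.5, 0.9], [0.4, 1.1], [0.7, 0.6]]

# Two intermediate reducers, each handling half of the nodes
reducer1 = min(min(arr) for arr in node_results[:2])
reducer2 = min(min(arr) for arr in node_results[2:])

# The final reducer combines only the intermediate results
final = min(reducer1, reducer2)

# Identical to a single flat reduction over all node results
flat = min(min(arr) for arr in node_results)
assert final == flat == 0.4
</code></pre></div></div>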

<p>Our code now looks like the following:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Any</span><span class="p">,</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">List</span><span class="p">,</span> <span class="n">Optional</span><span class="p">,</span> <span class="n">Tuple</span><span class="p">,</span> <span class="n">Union</span>

<span class="n">Dataset</span> <span class="o">=</span> <span class="n">Tuple</span><span class="p">[</span><span class="n">NumericalArray</span><span class="p">,</span> <span class="n">NumericalArray</span><span class="p">]</span>
<span class="n">ValidationResults</span> <span class="o">=</span> <span class="n">List</span><span class="p">[</span><span class="nb">float</span><span class="p">]</span>
<span class="n">Data</span> <span class="o">=</span> <span class="n">Union</span><span class="p">[</span><span class="n">Dataset</span><span class="p">,</span> <span class="n">ValidationResults</span><span class="p">]</span>

<span class="k">class</span> <span class="nc">Node</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""
    A compute node on a single machine which either:
        - runs the hyperparameter search
        - runs a reduction on the data
    """</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">Data</span><span class="p">,</span> <span class="n">hyperparameters</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span>
        <span class="s">"""
        In the case of our data being of instance `ValidationResults`, hyperparameters is an empty dictionary
        """</span>

        <span class="c1"># On our "reducer" nodes
</span>        <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">List</span><span class="p">):</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">conf</span> <span class="o">=</span> <span class="n">hyperparameters</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Run the training and validation"""</span>

        <span class="c1"># Our reduce step
</span>        <span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">data</span>
            <span class="k">return</span>
        <span class="c1"># Our map-and-run step
</span>        <span class="k">for</span> <span class="n">conf</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">conf</span><span class="p">:</span>
            <span class="n">trained_model</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">conf</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">validate</span><span class="p">(</span><span class="n">trained_model</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span><span class="p">))</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
        <span class="s">"""
        All of the results here get collected and saved
        """</span>
        <span class="k">return</span> <span class="nb">min</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="121-messiness">1.2.1) Messiness</h3>

<p>As we can see above, the code is quite messy. The messiness comes from having to care about the <em>underlying</em> data and what to do with it. We want to squint our eyes and abstract away all the conditionals and checks.</p>

<p>Concretely, we would like to abstract away the data handling and make the code cleaner, which we can do as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Dataset</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">Tuple</span><span class="p">[</span><span class="n">NumericalArray</span><span class="p">,</span> <span class="n">NumericalArray</span><span class="p">],</span> <span class="n">hyperparameters</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">conf</span> <span class="o">=</span> <span class="n">hyperparameters</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">conf</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">conf</span><span class="p">:</span>
            <span class="n">trained_model</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">conf</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">validate</span><span class="p">(</span><span class="n">trained_model</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span><span class="p">))</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
        <span class="k">return</span> <span class="nb">min</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span><span class="p">)</span>

<span class="k">class</span> <span class="nc">ValidationResults</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">float</span><span class="p">],</span> <span class="n">_</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">Any</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
        <span class="k">return</span> <span class="nb">min</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">data</span><span class="p">)</span>

<span class="n">Container</span> <span class="o">=</span> <span class="n">Union</span><span class="p">[</span><span class="n">Dataset</span><span class="p">,</span> <span class="n">ValidationResults</span><span class="p">]</span>

<span class="k">class</span> <span class="nc">Node</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""
    A compute node on a single machine which either:
        - runs the hyperparameter search
        - runs a reduction on the data
    """</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">container</span><span class="p">:</span> <span class="n">Container</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">container</span> <span class="o">=</span> <span class="n">container</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="c1"># The individual types handle their own run
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">container</span><span class="p">.</span><span class="n">run</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
        <span class="c1"># The individual types handle their own reduction
</span>        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">container</span><span class="p">.</span><span class="n">report</span><span class="p">()</span>
</code></pre></div></div>

<p>As we can see, we defined two classes above, which will handle the <code class="language-plaintext highlighter-rouge">run</code> and <code class="language-plaintext highlighter-rouge">report</code> as necessary. By delegating the calls, we, as the programmer, do not have to care what the underlying <code class="language-plaintext highlighter-rouge">Container</code> is.</p>

<p>In my opinion, this is much cleaner! This way, we have decoupled the run logic from the underlying data type. All we need to do is call the appropriate methods.</p>

<p>At a higher level, this is freeing because we can treat these class instances as abstract containers: as long as something follows the type signatures of <code class="language-plaintext highlighter-rouge">run</code> and <code class="language-plaintext highlighter-rouge">report</code> expected by <code class="language-plaintext highlighter-rouge">Node</code>, it should, in theory, work out exactly as we expect.</p>
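<p>In modern Python, that informal “follows the type signatures” contract can be written down explicitly with <code class="language-plaintext highlighter-rouge">typing.Protocol</code>. This is only a sketch of the idea; the <code class="language-plaintext highlighter-rouge">Reportable</code> and <code class="language-plaintext highlighter-rouge">Constant</code> names are made up for illustration, not part of the project:</p>

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Reportable(Protocol):
    """Structural interface: anything with run() and report() qualifies."""
    def run(self) -> None: ...
    def report(self) -> float: ...

class Constant:
    """Toy container: no work to do, always reports the same loss."""
    def __init__(self, value: float) -> None:
        self.value = value

    def run(self) -> None:
        pass

    def report(self) -> float:
        return self.value

# Constant never mentions Reportable, yet satisfies it structurally, so a
# static type checker accepts it wherever a Reportable is expected.
# (A runtime_checkable isinstance only checks that the methods exist.)
assert isinstance(Constant(0.5), Reportable)
```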

<h3 id="122-so-what">1.2.2) So what?</h3>

<p>However, none of this should be new to you. Creating an abstract interface to make code clean isn’t anything “interesting” in and of itself. Let’s go deeper.</p>

<h1 id="2-the-second-phase">2) The Second Phase</h1>

<p>In the second phase of the project, you decide that you want to add in things like:</p>

<ul>
  <li>running average</li>
  <li>standard deviation</li>
  <li>tracking the 100 best models in terms of validation losses</li>
</ul>

<p>This would ultimately derail the structure we’ve got above… or would it? Let’s take a look at the custom types we have defined so far:</p>

<p><code class="language-plaintext highlighter-rouge">Dataset</code></p>
<ul>
  <li>run</li>
  <li>report</li>
</ul>

<p><code class="language-plaintext highlighter-rouge">ValidationResults</code></p>
<ul>
  <li>run</li>
  <li>report</li>
</ul>

<p>We notice that our <code class="language-plaintext highlighter-rouge">Dataset</code> doesn’t change much, other than the <code class="language-plaintext highlighter-rouge">Dataset.run</code>. Our <code class="language-plaintext highlighter-rouge">ValidationResults</code> will change, but that’s understandable.</p>

<p><strong>Note</strong>: In the following, I assume you’ll be keeping track of the top 100 best models in your own way. I’ll be “using” a heap, but I won’t include any logic for it because that’s not the point of this work.</p>

<h2 id="21-naive-approach">2.1) Naive Approach</h2>

<p>The naive approach (which would probably come to mind first) would be the following</p>

<p><strong>P.S.</strong>: at the end of our reduce step, we have a dictionary of values which we must process to extract whatever statistics we want.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Dataset</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">Tuple</span><span class="p">[</span><span class="n">NumericalArray</span><span class="p">,</span> <span class="n">NumericalArray</span><span class="p">],</span> <span class="n">hyperparameters</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">conf</span> <span class="o">=</span> <span class="n">hyperparameters</span>
    
        <span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span> <span class="o">=</span> <span class="p">[]</span>


    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">validation_loss_min_heap</span> <span class="o">=</span> <span class="n">heapify</span><span class="p">([])</span>
        <span class="k">for</span> <span class="n">conf</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">conf</span><span class="p">:</span>
            <span class="n">trained_model</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">conf</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span><span class="p">)</span>
            <span class="n">validation_losses</span> <span class="o">=</span> <span class="n">validate</span><span class="p">(</span><span class="n">trained_model</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">validation_losses</span><span class="p">)</span>

            <span class="c1"># you do the checks and logic
</span>            <span class="bp">self</span><span class="p">.</span><span class="n">validation_loss_min_heap</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">validation_losses</span><span class="p">)</span>


    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">float</span><span class="p">]:</span>
        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"min"</span><span class="p">:</span> <span class="nb">min</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span><span class="p">),</span>
            <span class="s">"sum"</span><span class="p">:</span> <span class="nb">sum</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span><span class="p">),</span>
            <span class="s">"count"</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span><span class="p">),</span>
            <span class="s">"best_100"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">validation_loss_min_heap</span>
        <span class="p">}</span>

<span class="k">class</span> <span class="nc">ValidationResultDict</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Dict</span><span class="p">],</span> <span class="n">_</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">min_so_far</span> <span class="o">=</span> <span class="n">math</span><span class="p">.</span><span class="n">inf</span>
        <span class="n">sum_so_far</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="n">count_so_far</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="n">validation_loss_min_heap</span> <span class="o">=</span> <span class="n">heapify</span><span class="p">([])</span>
        <span class="k">for</span> <span class="n">data_dict</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">data</span><span class="p">:</span>
            <span class="n">min_so_far</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">min_so_far</span><span class="p">,</span> <span class="n">data_dict</span><span class="p">[</span><span class="s">"min"</span><span class="p">])</span>
            <span class="n">sum_so_far</span> <span class="o">=</span> <span class="n">sum_so_far</span> <span class="o">+</span> <span class="n">data_dict</span><span class="p">[</span><span class="s">"sum"</span><span class="p">]</span>
            <span class="n">count_so_far</span> <span class="o">=</span> <span class="n">count_so_far</span> <span class="o">+</span> <span class="n">data_dict</span><span class="p">[</span><span class="s">"count"</span><span class="p">]</span>

            <span class="c1"># you do the checks and logic
</span>            <span class="n">validation_loss_min_heap</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">data_dict</span><span class="p">[</span><span class="s">"best_100"</span><span class="p">])</span>

        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"min"</span><span class="p">:</span> <span class="n">min_so_far</span><span class="p">,</span>
            <span class="s">"sum"</span><span class="p">:</span> <span class="n">sum_so_far</span><span class="p">,</span>
            <span class="s">"count"</span><span class="p">:</span> <span class="n">count_so_far</span><span class="p">,</span>
            <span class="s">"best_100"</span><span class="p">:</span> <span class="n">validation_loss_min_heap</span>
        <span class="p">}</span>

<span class="n">Container</span> <span class="o">=</span> <span class="n">Union</span><span class="p">[</span><span class="n">Dataset</span><span class="p">,</span> <span class="n">ValidationResultDict</span><span class="p">]</span>

<span class="k">class</span> <span class="nc">Node</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""
    A compute node on a single machine which either:
        - runs the hyperparameter search
        - runs a reduction on the data
    """</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">Container</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">container</span> <span class="o">=</span> <span class="n">data</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="c1"># The individual types handle their own run
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">container</span><span class="p">.</span><span class="n">run</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">:</span>
        <span class="c1"># The individual types handle their own reduction
</span>        <span class="c1"># Also, you now have to process the returned dictionary
</span>        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">container</span><span class="p">.</span><span class="n">report</span><span class="p">()</span>
</code></pre></div></div>

<p>where we added custom code to track the state and update our dictionary container. However, as we can see, there is a LOT of similarity between <code class="language-plaintext highlighter-rouge">Dataset.report</code> and <code class="language-plaintext highlighter-rouge">ValidationResultDict.report</code>. Can we make this cleaner?</p>

<p>To do so, we first introduce the concept of a <a href="https://en.wikipedia.org/wiki/Monoid">monoid</a>; I wouldn’t bother reading that link until after you’ve finished this article.</p>

<h2 id="22-a-monoid">2.2) A monoid?</h2>

<p>How does a monoid help us? Well, what <strong>is</strong> a monoid? A monoid is a mathematical structure that has the following properties:</p>

<ul>
  <li>a binary operation that is associative, i.e. operation(operation(a, b), c) == operation(a, operation(b, c))</li>
  <li>closed, i.e. applying the binary operation to two instances of <code class="language-plaintext highlighter-rouge">BLABLABLA</code> always yields another instance of <code class="language-plaintext highlighter-rouge">BLABLABLA</code></li>
  <li>an identity element, e.g. 1 + 0 == 1 and 10 * 1 == 10 (0 and 1 being the identities for addition and multiplication, respectively)</li>
</ul>
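<p>For a concrete instance of these laws: floats under <code class="language-plaintext highlighter-rouge">min</code> form a monoid with <code class="language-plaintext highlighter-rouge">math.inf</code> as the identity, which is exactly why our minimum-validation-loss reduction distributes so cleanly. A quick check in plain Python:</p>

```python
import math

op = min             # the binary operation
identity = math.inf  # combining with it changes nothing

a, b, c = 0.7, 0.3, 0.9

# Associative: grouping doesn't matter.
assert op(op(a, b), c) == op(a, op(b, c))

# Closed: combining two floats yields another float.
assert isinstance(op(a, b), float)

# Identity: min(x, inf) == x for any float x.
assert op(a, identity) == a
```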

<p>Knowing this, could we abstract out our code? We’re making a bit of a jump below, but I promise I’ll add comments to the code. Let’s add a new class, <code class="language-plaintext highlighter-rouge">Summary</code>, which we define as the following:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Summary</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">validation_loss</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">float</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
        <span class="s">"""
        We define an identity and non-identity instantiation

        There are 2 cases:
            - validation_loss is None:       where our compute node had an empty configuration file, or errored out
            - validation_loss is not None:   our computation node worked!

        """</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">if</span> <span class="n">validation_loss</span> <span class="ow">is</span> <span class="bp">None</span> <span class="k">else</span> <span class="mi">1</span>
        <span class="bp">self</span><span class="p">.</span><span class="nb">min</span> <span class="o">=</span> <span class="n">math</span><span class="p">.</span><span class="n">inf</span> <span class="k">if</span> <span class="n">validation_loss</span> <span class="ow">is</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">validation_loss</span>
        <span class="bp">self</span><span class="p">.</span><span class="nb">sum</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">if</span> <span class="n">validation_loss</span> <span class="ow">is</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">validation_loss</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">best_N</span> <span class="o">=</span> <span class="n">heapify</span><span class="p">([])</span> <span class="k">if</span> <span class="n">validation_loss</span> <span class="ow">is</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">heapify</span><span class="p">([</span><span class="n">validation_loss</span><span class="p">])</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">inplace</span> <span class="o">=</span> <span class="n">inplace</span>


    <span class="k">def</span> <span class="nf">reduce</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other</span><span class="p">:</span> <span class="s">"Summary"</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="s">"Summary"</span><span class="p">:</span>
        <span class="s">"""
        We've defined a binary operation that is associative (and, here, also commutative):
            reduce(reduce(a, b), c) == reduce(a, reduce(b, c))

        and the output is always a summary! 
        """</span>
        <span class="n">to_assign</span> <span class="o">=</span> <span class="bp">self</span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">inplace</span> <span class="k">else</span> <span class="n">Summary</span><span class="p">()</span>

        <span class="n">to_assign</span><span class="p">.</span><span class="n">count</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">count</span> <span class="o">+</span> <span class="n">other</span><span class="p">.</span><span class="n">count</span>
        <span class="n">to_assign</span><span class="p">.</span><span class="nb">min</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="nb">min</span><span class="p">,</span> <span class="n">other</span><span class="p">.</span><span class="nb">min</span><span class="p">)</span>
        <span class="n">to_assign</span><span class="p">.</span><span class="nb">sum</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="nb">sum</span> <span class="o">+</span> <span class="n">other</span><span class="p">.</span><span class="nb">sum</span>
        <span class="n">to_assign</span><span class="p">.</span><span class="n">best_N</span> <span class="o">=</span> <span class="n">merge_heaps</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">best_N</span><span class="p">,</span> <span class="n">other</span><span class="p">.</span><span class="n">best_N</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">to_assign</span>
</code></pre></div></div>

<p>We’ve done three things above:</p>

<ul>
  <li>defined an “identity” <code class="language-plaintext highlighter-rouge">Summary</code> to handle the case where we’ve errored out or our configuration was empty (for various reasons)</li>
  <li>defined a binary operation that is associative (regrouping chained reductions gives the same result) and, here, also commutative</li>
  <li>ensured that we always output a <code class="language-plaintext highlighter-rouge">Summary</code> type!</li>
</ul>
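<p>To see the payoff, here is a stripped-down <code class="language-plaintext highlighter-rouge">Summary</code> folded with <code class="language-plaintext highlighter-rouge">functools.reduce</code>. It drops the <code class="language-plaintext highlighter-rouge">best_N</code> heap (whose helpers are deliberately left unspecified) and the <code class="language-plaintext highlighter-rouge">inplace</code> flag, so treat it as a sketch rather than the exact class above:</p>

```python
import functools
import math

class Summary:
    """Monoid-style summary: the no-argument form is the identity element."""
    def __init__(self, validation_loss=None):
        self.count = 0 if validation_loss is None else 1
        self.min = math.inf if validation_loss is None else validation_loss
        self.sum = 0.0 if validation_loss is None else validation_loss

    def reduce(self, other):
        # Always build a fresh Summary; every field combines BOTH operands.
        combined = Summary()
        combined.count = self.count + other.count
        combined.min = min(self.min, other.min)
        combined.sum = self.sum + other.sum
        return combined

losses = [0.5, 0.25, 0.75]
# Fold every singleton Summary into one, starting from the identity.
total = functools.reduce(Summary.reduce, (Summary(v) for v in losses), Summary())

# Associativity in action: either grouping yields the same result, which is
# what lets us split the fold across reducer nodes however we like.
left = Summary(0.5).reduce(Summary(0.25)).reduce(Summary(0.75))
right = Summary(0.5).reduce(Summary(0.25).reduce(Summary(0.75)))
assert (left.count, left.min, left.sum) == (right.count, right.min, right.sum)
```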

<h2 id="23-using-our-monoid">2.3) Using our monoid</h2>

<p>We can then restructure our code by noting a few things:</p>

<ul>
  <li>our <code class="language-plaintext highlighter-rouge">Dataset.report</code> will now always return a singleton <code class="language-plaintext highlighter-rouge">Summary</code></li>
  <li>our <code class="language-plaintext highlighter-rouge">ValidationResultDict</code> now accepts a <code class="language-plaintext highlighter-rouge">List[Summary]</code> on <code class="language-plaintext highlighter-rouge">__init__</code> as opposed to a <code class="language-plaintext highlighter-rouge">List[Dict]</code>, and it now outputs a <code class="language-plaintext highlighter-rouge">Summary</code></li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">class</span> <span class="nc">Dataset</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">Tuple</span><span class="p">[</span><span class="n">NumericalArray</span><span class="p">,</span> <span class="n">NumericalArray</span><span class="p">],</span> <span class="n">hyperparameters</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">conf</span> <span class="o">=</span> <span class="n">hyperparameters</span>


        <span class="c1"># Create one just to ensure we always have something when the `report` is called
</span>        <span class="c1"># This way even if we do a `report` we can be sure that the code won't error out
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">summary</span> <span class="o">=</span> <span class="p">[</span><span class="n">Summary</span><span class="p">()]</span>  

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">conf</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">conf</span><span class="p">:</span>
            <span class="n">trained_model</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">conf</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span><span class="p">)</span>
            <span class="n">v</span> <span class="o">=</span> <span class="n">validate</span><span class="p">(</span><span class="n">trained_model</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">summary</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">v</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="n">Summary</span><span class="p">]:</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">summary</span>
        
<span class="k">class</span> <span class="nc">ValidationResult</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">summary_list_of_lists</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">List</span><span class="p">[</span><span class="n">Summary</span><span class="p">]],</span> <span class="n">_</span><span class="p">,</span> <span class="n">reduce_immediately</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
        <span class="c1"># Reduce the LoL into a single list
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">summaries</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">summary_list_of_lists</span><span class="p">,</span> <span class="p">[])</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">reduce_immediately</span> <span class="o">=</span> <span class="n">reduce_immediately</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="n">Summary</span><span class="p">]:</span>
        <span class="c1"># Option 1
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">reduce_immediately</span><span class="p">:</span>
            <span class="n">running_summary</span> <span class="o">=</span> <span class="n">Summary</span><span class="p">()</span>
            <span class="k">for</span> <span class="n">summary</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">summaries</span><span class="p">:</span>
                <span class="n">running_summary</span><span class="p">.</span><span class="nb">reduce</span><span class="p">(</span><span class="n">summary</span><span class="p">)</span>

            <span class="c1"># Insert into a list to keep the types nice and tidy
</span>            <span class="n">running_summary</span> <span class="o">=</span> <span class="p">[</span><span class="n">running_summary</span><span class="p">]</span>

        <span class="c1"># Option 2: reduce it all and then transmit, which saves bandwidth
</span>        <span class="k">else</span><span class="p">:</span>
            <span class="n">running_summary</span> <span class="o">=</span> <span class="p">[]</span> 
            <span class="k">for</span> <span class="n">summary</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">summaries</span><span class="p">:</span>
                <span class="n">running_summary</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">summary</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">running_summary</span>

<span class="n">Container</span> <span class="o">=</span> <span class="n">Union</span><span class="p">[</span><span class="n">Dataset</span><span class="p">,</span> <span class="n">ValidationResult</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">Node</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""
    A compute node on a single machine which either:
        - runs the hyperparameter search
        - runs a reduction on the data
    """</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">Data</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">container</span> <span class="o">=</span> <span class="n">data</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="c1"># The individual types handle their own run
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">container</span><span class="p">.</span><span class="n">run</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="n">Summary</span><span class="p">]:</span>
        <span class="c1"># The individual types handle their own reduction
</span>        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">container</span><span class="p">.</span><span class="n">report</span><span class="p">()</span>
</code></pre></div></div>

<p><em>chef’s kiss</em></p>

<p><strong>P.S.</strong> Again, you would still need to do the final processing on the reduced <code class="language-plaintext highlighter-rouge">Summary</code>, but that’s easy.</p>

<h2 id="23-a-retrospective">2.3) A retrospective</h2>

<p>Notice how, by modifying our logic, we made our code look extremely simple. If we decide to add another feature, e.g., a max, a standard deviation, etc., all we would have to change is our <code class="language-plaintext highlighter-rouge">Summary</code> class to encapsulate the change.</p>
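<p>As a concrete sketch, here is a hypothetical minimal <code class="language-plaintext highlighter-rouge">Summary</code> (not the full class from earlier, and returning a new object rather than mutating in place). Note how adding the <code class="language-plaintext highlighter-rouge">maximum</code> field touches only this one class:</p>

```python
from dataclasses import dataclass


@dataclass
class Summary:
    # A default-constructed Summary is the identity element
    count: int = 0
    total: float = 0.0
    maximum: float = float("-inf")

    def reduce(self, other: "Summary") -> "Summary":
        # Associative combine: any grouping yields the same result
        return Summary(
            count=self.count + other.count,
            total=self.total + other.total,
            maximum=max(self.maximum, other.maximum),
        )


a = Summary(2, 3.0, 2.0)
b = Summary(1, 5.0, 5.0)
c = Summary(4, 2.0, 1.5)

assert a.reduce(b).reduce(c) == a.reduce(b.reduce(c))  # associativity
assert Summary().reduce(a) == a                        # identity
```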

<h1 id="3-monoids-and-abstractions">3) Monoids and abstractions</h1>

<p><strong>QUICK</strong>: Before your eyes gloss over the following diagram, listen to what I’ve got to say. You already know all of the things in the diagram, which is from <a href="https://en.wikipedia.org/wiki/Monoid_%28category_theory%29">Wikipedia: monoids</a></p>

<p><img src="../blog_images/monoids/Monoid_mult.png" alt="monoid pentagon diagram" /></p>

<p>In this case, <code class="language-plaintext highlighter-rouge">M</code> is an object in a monoidal category; think of it as a fixed but arbitrary class, e.g., <code class="language-plaintext highlighter-rouge">ValidationResult</code> or <code class="language-plaintext highlighter-rouge">Node</code>. As programmers, we operate on <strong>instances</strong> of those classes, but ignore that for now.</p>

<hr />

<p>On the first line, we have three terms; let’s index them 1, 2, and 3. On the bottom line, we have two terms; index them 4 and 5. In between these terms, we have arrows, which are transformations.</p>

<p><code class="language-plaintext highlighter-rouge">1-&gt;2</code>: we see that \(\alpha\) is <strong>association</strong> where we move the parenthesis around. We introduced associativity as a property of a monoid earlier.</p>

<p><code class="language-plaintext highlighter-rouge">2-&gt;3</code> we see that we have “reduced” the equation \(M \bigotimes (M \bigotimes M)\) into \(M \bigotimes M\) by applying \(1 \bigotimes \mu\), which is equivalent to saying that the first term (the M not in the parens) is the identity. We can do this because monoids must have an identity.</p>

<p><code class="language-plaintext highlighter-rouge">2-&gt;4</code> is the same as the above, but with the parens in a different location</p>

<p><code class="language-plaintext highlighter-rouge">4-&gt;5</code> &amp;&amp; <code class="language-plaintext highlighter-rouge">3-&gt;5</code>: is the result of just evaluation the <code class="language-plaintext highlighter-rouge">x</code>, the \(\mu\).</p>

<p>And there you go!</p>
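<p>To make the diagram concrete, take the integers under addition as \(M\), with addition playing the role of \(\mu\) and 0 as the identity; both paths through the diagram evaluate to the same result:</p>

```python
from functools import reduce


def mu(x, y):
    # The monoid multiplication: here, plain addition with identity 0
    return x + y


xs = [3, 4, 5]

# Reduce the right-hand pair first: M (x) (M (x) M) -> M (x) M -> M
right_first = mu(xs[0], mu(xs[1], xs[2]))
# Reduce the left-hand pair first: (M (x) M) (x) M -> M (x) M -> M
left_first = mu(mu(xs[0], xs[1]), xs[2])

assert right_first == left_first == reduce(mu, xs, 0)  # the diagram commutes
```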

<h1 id="closing-thoughts">Closing Thoughts</h1>

<p>This post came about after a discussion with one of my mentees. That mentee was facing something similar, and as someone who has gone through this EXACT problem, I thought I’d write about it and share what I’ve learned.</p>

<p>Also, I firmly believe that one way to ensure you know something is to explain it. And so, to finally understand what</p>

<blockquote>
  <p>A monad is a monoid in the category of endofunctors, what’s the problem?</p>
</blockquote>

<p>I’ve decided to write a 3-part series on “What is a monoid?”, “What is an endofunctor” and “What is a monad”. All those posts will build off one another so stick around!</p>]]></content><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><category term="machine learning" /><category term="category theory" /><summary type="html"><![CDATA[Spoiler: Category theory has applications in machine learning]]></summary></entry><entry><title type="html">PyTorch Gradient Manipulation 1</title><link href="https://ianq.ai/pytorch-gradients-pt1/" rel="alternate" type="text/html" title="PyTorch Gradient Manipulation 1" /><published>2022-01-06T00:00:00-08:00</published><updated>2022-01-06T00:00:00-08:00</updated><id>https://ianq.ai/pytorch-gradients-pt1</id><content type="html" xml:base="https://ianq.ai/pytorch-gradients-pt1/"><![CDATA[<p><strong>Spoiler</strong>: PyTorch offers about five ways to manipulate gradients.</p>

<p>This notebook is part 1 in a series of tutorials discussing gradients (manipulation, stopping, etc.) in <code class="language-plaintext highlighter-rouge">PyTorch</code>. The series covers the following network architectures:</p>

<p>1) <strong>Single-headed simple architecture</strong><br />
2) Single-headed complex architecture<br />
3) Multi-headed architecture</p>

<p>but by the end of this post you will know all that you need to know to tackle the other architectures on your own.</p>

<p>The notebook for this tutorial can be found on <a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG?usp=sharing">Google Colab gradient_flow_1</a>.</p>

<p><strong>Note</strong>: For the purpose of this discussion, we define a module as either a single layer or a collection of layers in a neural network.</p>

<h1 id="1-motivation">1) Motivation</h1>

<p>The motivation behind this post is threefold:</p>

<h2 id="i-familiarizing-myself-with-pytorch">i) Familiarizing Myself with <code class="language-plaintext highlighter-rouge">PyTorch</code></h2>

<p>Although <code class="language-plaintext highlighter-rouge">PyTorch</code> is easy to prototype with, I don’t fully understand its computation graph and how it applies gradients via the <code class="language-plaintext highlighter-rouge">optim</code> package.</p>

<h2 id="ii-playing-with-gradient-stopping-and-propagating">ii) Playing with Gradient Stopping and Propagating</h2>

<p>Understanding how to stop the propagation of gradients is essential, especially nowadays, when we often start from off-the-shelf weights that we then fine-tune; fine-tuning is a straightforward problem if we have a simple module, as shown below:</p>

<p><img src="../blog_images/pt_gradients/frozen_layer.png" alt="Simple module" /></p>

<p>But what happens if we want to skip applying gradients to a specific intermediate layer?</p>

<p><img src="../blog_images/pt_gradients/frozen_intermediate.png" alt="Complicated Module" /></p>

<p>Or when we have two networks that interact only occasionally, or two networks that are otherwise related? Consider the following topology with two primary modules, the actor and the critic, as used in the Deep Deterministic Policy Gradient (DDPG) architecture:</p>

<p><img src="../blog_images/pt_gradients/ddpg.png" alt="DDPG" /></p>

<p><strong>NOTE</strong>: Image sourced from <a href="https://intellabs.github.io/coach/components/agents/policy_optimization/ddpg.html">IntelLabs: DDPG</a></p>

<p>We see that the critic (the bottom module) accepts the actor’s output. However, unless we stop the gradient flow, the computation graph will inadvertently backpropagate critic updates through the actor, which is undesirable.</p>
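<p>A minimal sketch of this fix using <code class="language-plaintext highlighter-rouge">detach</code> (the layer shapes and module names below are made up for illustration):</p>

```python
import torch as T
import torch.nn as nn

actor = nn.Linear(4, 2)   # hypothetical actor: state -> action
critic = nn.Linear(6, 1)  # hypothetical critic: (state, action) -> Q-value

state = T.randn(8, 4)
action = actor(state)

# Detach the action so the critic's loss cannot backpropagate
# through the actor's parameters
q = critic(T.cat([state, action.detach()], dim=1))
critic_loss = q.pow(2).mean()
critic_loss.backward()

assert critic.weight.grad is not None  # the critic received gradients
assert actor.weight.grad is None       # the actor was left untouched
```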

<h1 id="2-contents">2) Contents</h1>

<p>We explore five methods categorized into <strong>High-Level</strong>, which utilize built-in methods, and <strong>Low-Level</strong>, where we manually access the gradients.</p>

<h2 id="21-high-level">2.1) High-Level</h2>

<p>The following methods are pertinent only to <strong>stopping</strong> gradients:</p>

<ul>
  <li>
    <p><a href="https://pytorch.org/docs/1.9.1/generated/torch.Tensor.detach.html"><code class="language-plaintext highlighter-rouge">detach</code></a>, which returns a copied tensor with the same values and properties but detached from the computation graph. The original tensor is preserved.</p>
  </li>
  <li>
    <p><a href="https://pytorch.org/docs/stable/generated/torch.no_grad.html"><code class="language-plaintext highlighter-rouge">no_grad</code></a>, which is a context manager that disables gradient calculation, setting <code class="language-plaintext highlighter-rouge">requires_grad</code> to <code class="language-plaintext highlighter-rouge">False</code> for all variables created within its scope.</p>
  </li>
  <li>
    <p><a href="https://pytorch.org/docs/stable/generated/torch.inference_mode.html"><code class="language-plaintext highlighter-rouge">inference</code></a>, which ompletely halts gradient calculations both downstream and upstream. This is a relatively new method, introduced on September 14, 2021, and warrants discussion.</p>
  </li>
</ul>
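<p>A quick sketch of how the three behave on a toy tensor:</p>

```python
import torch as T

x = T.ones(3, requires_grad=True)

# detach: a new tensor sharing the same data, cut out of the graph
d = x.detach()
assert d.requires_grad is False
assert d.data_ptr() == x.data_ptr()  # same underlying storage

# no_grad: operations inside the context are simply not tracked
with T.no_grad():
    y = x * 2
assert y.requires_grad is False

# inference_mode: like no_grad, but the resulting tensors also cannot
# be used in autograd-tracked computations later
with T.inference_mode():
    z = x * 2
assert z.requires_grad is False
```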

<h2 id="22-low-level">2.2) Low-Level</h2>

<p>With direct access to the gradients, we can not only stop gradients but also manipulate them based on our specific needs:</p>

<ul>
  <li>
    <p>Via the <code class="language-plaintext highlighter-rouge">optimizer</code>, where we exclude the optimizer from receiving the parameters of certain modules.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">Manual Manipulation</code>, where we extract the gradients and then choose whether to modify or manipulate them before application.</p>
  </li>
</ul>

<h2 id="23-eval-misconception">2.3) <code class="language-plaintext highlighter-rouge">eval</code> Misconception</h2>

<p>When I first started using <code class="language-plaintext highlighter-rouge">PyTorch</code>, I mistakenly assumed that <code class="language-plaintext highlighter-rouge">eval</code> mode would:</p>

<ul>
  <li>Put the model into inference mode (turning off dropout and making batchnorm run in eval mode),</li>
  <li>Turn off the computation graph construction.</li>
</ul>

<p>However, it does not affect the computation graph construction as I had thought.</p>

<h2 id="24-making-the-right-choice">2.4) Making the Right Choice</h2>

<p>Ultimately, each method comes with various trade-offs. We will discuss these below, allowing you to make an informed decision best suited for your application.</p>

<h1 id="3-problem-setup">3) Problem Setup</h1>

<p>We have the following graph:</p>

<p><img src="../blog_images/pt_gradients/simple.png" alt="Simple Graph" /></p>

<p>In this setup, we aim to update only the network’s output head (L2). What are the various ways we can accomplish this?</p>

<p>I highly recommend having the <a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG#scrollTo=f32bcf98">colab notebook</a> open as you work through this. I made it a point to plot the resulting computation graph for each setting, making it easier to understand what is happening.</p>
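<p>A minimal stand-in for this setup (layer sizes are arbitrary), showing the control behaviour we want to change: by default, <strong>both</strong> layers receive gradients.</p>

```python
import torch as T
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(4, 8)  # the layer we want to freeze
        self.l2 = nn.Linear(8, 1)  # the output head we want to update

    def forward(self, x):
        return self.l2(T.relu(self.l1(x)))


net = Net()
net(T.randn(16, 4)).mean().backward()

# Control: without any intervention, both layers receive gradients
assert net.l1.weight.grad is not None
assert net.l2.weight.grad is not None
```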

<h1 id="4-high-level">4) High-Level</h1>

<h2 id="41-detach">4.1) <a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG#scrollTo=_5d-dmI0q6RY"><code class="language-plaintext highlighter-rouge">detach</code></a></h2>

<p><code class="language-plaintext highlighter-rouge">detach</code> detaches upstream values from the graph, so we only calculate the gradient backward up to the first <code class="language-plaintext highlighter-rouge">detach</code>. Our current graph setup is too simple to illustrate this phenomenon, but the computation graph in the follow-up post will work well.</p>

<h3 id="411-observations">4.1.1) Observations</h3>

<p>Notice two things from the cells:</p>

<ul>
  <li>The output of the <code class="language-plaintext highlighter-rouge">print</code> statements shows that the <code class="language-plaintext highlighter-rouge">grad</code> of L1 is <code class="language-plaintext highlighter-rouge">None</code>.</li>
  <li>L1 does not exist in the computation graph (contrast this with the <a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG#scrollTo=WfL3WKpo7Lgt">Control</a>).</li>
</ul>

<h3 id="412-usecase">4.1.2) Usecase</h3>

<ul>
  <li>Stopping gradient flow.</li>
  <li>Saving memory.</li>
</ul>

<p>Torch tensors keep track of data such as the computation graph. By detaching these tensors, we drop the computation graph of all upstream operations up to the current variable.</p>

<ul>
  <li>Converting the tensor to <code class="language-plaintext highlighter-rouge">numpy</code>.</li>
</ul>

<p>Attempting to directly convert to <code class="language-plaintext highlighter-rouge">numpy</code> will result in an error because <code class="language-plaintext highlighter-rouge">numpy</code> does not track the computation graph. It is safer to have a clear distinction between <code class="language-plaintext highlighter-rouge">numpy</code> arrays and torch tensors.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span> <span class="k">as</span> <span class="n">T</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">a</span> <span class="o">+</span> <span class="n">a</span>
<span class="n">b</span><span class="p">.</span><span class="n">numpy</span><span class="p">()</span>
</code></pre></div></div>
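<p>The usual fix, assuming you really do want the NumPy array, is to detach first and then convert:</p>

```python
import torch as T

a = T.tensor(1.0, requires_grad=True)
b = a + a
arr = b.detach().numpy()  # detach drops the graph, then convert safely
```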

<h2 id="42-no_grad">4.2) <a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG#scrollTo=BiWLTq7ChbJv"><code class="language-plaintext highlighter-rouge">no_grad</code></a></h2>

<h3 id="421-no_grad-in-action">4.2.1) <code class="language-plaintext highlighter-rouge">no_grad</code> in action</h3>

<p>It can be used as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!pip install -q torchviz
</span><span class="kn">import</span> <span class="nn">torch</span> <span class="k">as</span> <span class="n">T</span>
<span class="kn">from</span> <span class="nn">torchviz</span> <span class="kn">import</span> <span class="n">make_dot</span>

<span class="c1"># Requires grad = True to construct graph
</span><span class="n">x</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>  
<span class="k">with</span> <span class="n">T</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
	<span class="k">pass</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">x</span> <span class="o">**</span> <span class="mi">2</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">x</span> <span class="o">**</span> <span class="mi">3</span>
<span class="n">r</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="n">z</span><span class="p">).</span><span class="nb">sum</span><span class="p">()</span>

<span class="n">make_dot</span><span class="p">(</span>
    <span class="n">r</span><span class="p">,</span> 
    <span class="n">params</span><span class="o">=</span><span class="p">{</span><span class="s">"y"</span><span class="p">:</span> <span class="n">y</span><span class="p">,</span> <span class="s">"z"</span><span class="p">:</span> <span class="n">z</span><span class="p">,</span> <span class="s">"r"</span><span class="p">:</span> <span class="n">r</span><span class="p">,</span> <span class="s">"x"</span><span class="p">:</span> <span class="n">x</span><span class="p">},</span>
    <span class="n">show_attrs</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Uncomment the first line if you do not already have <a href="https://github.com/szagoruyko/pytorchviz">torchviz</a>. Then, play around with moving <code class="language-plaintext highlighter-rouge">y</code> or <code class="language-plaintext highlighter-rouge">z</code> into the <code class="language-plaintext highlighter-rouge">T.no_grad()</code> context.</p>

<h3 id="422-observations">4.2.2) Observations</h3>

<ul>
  <li>
    <p>The graph of <code class="language-plaintext highlighter-rouge">no_grad</code> is the same as the graph of <code class="language-plaintext highlighter-rouge">detach</code>.</p>
  </li>
  <li>
    <p>The printed information shows that <code class="language-plaintext highlighter-rouge">L1</code> has <code class="language-plaintext highlighter-rouge">None</code> gradients, similar to the previous method.</p>
  </li>
</ul>

<h3 id="423-usecase">4.2.3) Usecase</h3>

<ul>
  <li>
    <p>Stopping gradients.</p>
  </li>
  <li>
    <p>Improving computational speed and memory consumption.</p>

    <p><code class="language-plaintext highlighter-rouge">no_grad</code> tells PyTorch to not track operations within the context, which means that the computation graph is not created.</p>

    <p>Furthermore, <code class="language-plaintext highlighter-rouge">no_grad</code> is faster than <code class="language-plaintext highlighter-rouge">detach</code> because <code class="language-plaintext highlighter-rouge">detach</code> creates an additional tensor object (sharing the same data, just without the computation graph), whereas <code class="language-plaintext highlighter-rouge">no_grad</code> simply never records the operations within its scope.</p>
  </li>
  <li>
    <p>Less room for mistakes.</p>

    <p>With <code class="language-plaintext highlighter-rouge">detach</code>, keeping both the attached and detached tensors around might not be your intention, and you might accidentally operate on the wrong variable; a <code class="language-plaintext highlighter-rouge">no_grad</code> block avoids that.</p>
  </li>
</ul>

<h2 id="43-inference">4.3) <a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG#scrollTo=HXtMs3CjhbUz"><code class="language-plaintext highlighter-rouge">inference</code></a></h2>

<h3 id="431-observations">4.3.1) Observations</h3>

<p>We discuss two observations for this code section:</p>

<p><a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG#scrollTo=JzpsF116hnNQ&amp;line=1&amp;uniqifier=1"><strong>Cell 1: without_grad</strong></a></p>

<p>Viewing the computation graph, we see that no values are tracked (hence the single empty block).</p>

<p><strong>Solution</strong>:
If we want to allow downstream calculations that themselves are not in <code class="language-plaintext highlighter-rouge">inference</code> mode, we must make a <code class="language-plaintext highlighter-rouge">clone</code> of the tensor. We show the relevant code in <strong>section 4.3.2) Relevant Code</strong>.</p>

<p><a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG#scrollTo=JzpsF116hnNQ&amp;line=1&amp;uniqifier=1"><strong>Cell 2: with_grad</strong></a></p>

<p>We see this method produced the same computation graph as in the <code class="language-plaintext highlighter-rouge">detach</code> and <code class="language-plaintext highlighter-rouge">no_grad</code> settings. Like <code class="language-plaintext highlighter-rouge">no_grad</code>, <code class="language-plaintext highlighter-rouge">inference()</code> is a context manager. In <code class="language-plaintext highlighter-rouge">no_grad</code> and <code class="language-plaintext highlighter-rouge">detach</code>, upstream values were not tracked in the computation graph; in <code class="language-plaintext highlighter-rouge">inference</code>, even downstream values are not tracked.</p>

<p>*<a href="https://pytorch.org/cppdocs/notes/inference_mode.html">Pytorch CPP Inference mode docs</a></p>

<h3 id="432-relevant-code">4.3.2) Relevant Code</h3>

<p>We generated the two graphs by following the setup from this <a href="https://twitter.com/pytorch/status/1437838242418671620?lang=en">official Twitter post</a>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_inference_forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span>
  <span class="c1"># First var is a inferenced-var
</span>  <span class="k">with</span> <span class="n">T</span><span class="p">.</span><span class="n">inference_mode</span><span class="p">():</span>
    <span class="n">tmp</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">l1</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
  <span class="k">try</span><span class="p">:</span>
    <span class="c1"># Try to do a non-inference forward pass
</span>    <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">l2</span><span class="p">(</span><span class="n">tmp</span><span class="p">)</span>
  <span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Trying to use intermediate inference_mode tensor outside inference_mode context manager"</span><span class="p">)</span>
    
    <span class="c1"># Getting pure-inference
</span>    <span class="k">with</span> <span class="n">T</span><span class="p">.</span><span class="n">inference_mode</span><span class="p">():</span>
      <span class="n">grad_disabled</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">l2</span><span class="p">(</span><span class="n">tmp</span><span class="p">)</span>
    <span class="c1"># Convert inferenced-var and allow us to
</span>    <span class="c1"># do a normal forward pass
</span>    <span class="n">new_tmp</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">clone</span><span class="p">(</span><span class="n">tmp</span><span class="p">)</span>
    <span class="n">grad_enabled</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">l2</span><span class="p">(</span><span class="n">new_tmp</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">grad_disabled</span><span class="p">,</span> <span class="n">grad_enabled</span>
</code></pre></div></div>

<h3 id="433-usecase">4.3.3) Usecase</h3>

<p><strong>Gradient Stopping</strong> It is possible to use this method to stop gradients, but there are easier ways to accomplish this.</p>

<p><strong>Inference Speed</strong> While <code class="language-plaintext highlighter-rouge">no_grad</code> stops operation tracking, <code class="language-plaintext highlighter-rouge">inference</code> disables two other autograd features: version counting and metadata tracking.</p>

<h1 id="5-low-level">5) Low-Level</h1>

<p>In the following methods, we work directly with the computed gradients instead of detaching variables or telling PyTorch to ignore blocks. This low-level access is useful for making complex modifications to our gradients; while such modifications won’t be needed in our simple setup, they are worth mentioning ahead of time.</p>

<p>Furthermore, whereas the methods in the <strong>High-Level</strong> section stop <strong>all</strong> gradients from flowing upstream, the <strong>Low-Level</strong> methods allow us to selectively skip modules.</p>

<h2 id="things-to-note">Things to Note:</h2>

<ul>
  <li>
    <p>Gradients are stored in the model parameters when <code class="language-plaintext highlighter-rouge">loss.backward</code> is called. The <code class="language-plaintext highlighter-rouge">optimizer.step</code> call simply applies these gradients. Thus, using the optimizer method is more or less equivalent to the manual manipulation method.</p>
  </li>
  <li>
    <p>Unlike the resulting computation graphs in the <strong>High-Level</strong> section, where no <code class="language-plaintext highlighter-rouge">L1</code> information is kept, in both <strong>Low-Level</strong> solutions <code class="language-plaintext highlighter-rouge">L1</code> is still tracked even if unused (as verified by quick tests in the corresponding cells).</p>
  </li>
</ul>

<h2 id="52-colab-optimoptimizer">5.2) <a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG#scrollTo=LY-dCdjsqeA7&amp;uniqifier=1"><code class="language-plaintext highlighter-rouge">Colab: optim.Optimizer</code></a></h2>

<p>Rather than using <code class="language-plaintext highlighter-rouge">optim.SomeOptimizer(model.parameters())</code>, we use <code class="language-plaintext highlighter-rouge">optim.SomeOptimizer(model.l2.parameters())</code>, which instructs our optimizer to apply gradients only to the <code class="language-plaintext highlighter-rouge">L2</code> parameters.</p>
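<p>A sketch of the idea on a toy two-layer model (the names are made up): the optimizer only ever sees <code class="language-plaintext highlighter-rouge">l2</code>, so only <code class="language-plaintext highlighter-rouge">l2</code> moves.</p>

```python
import torch as T
import torch.nn as nn

l1, l2 = nn.Linear(4, 8), nn.Linear(8, 1)
net = nn.Sequential(l1, l2)

# Hand the optimizer only the output head's parameters
opt = T.optim.SGD(l2.parameters(), lr=0.1)

before = l1.weight.clone()
net(T.randn(8, 4)).pow(2).mean().backward()  # l1 still accumulates grads...
opt.step()                                   # ...but only l2's weights move

assert T.equal(l1.weight, before)  # l1 is unchanged
assert l1.weight.grad is not None  # even though its gradients exist
```

<p>This also illustrates the note above: <code class="language-plaintext highlighter-rouge">L1</code> is still tracked and still accumulates gradients; the optimizer simply never applies them.</p>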

<h3 id="521-usecase">5.2.1) Usecase</h3>

<ul>
  <li><strong>Gradient Stopping</strong>: As with the above methods, this approach can “freeze” a layer.</li>
  <li><strong>Gradient Manipulation</strong>: This allows specification of per-module hyperparameters, though it does not provide fine-grained control.</li>
</ul>

<h2 id="53-colab-manual-manipulation">5.3) <a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG#scrollTo=qFf7C93Rp0yU&amp;uniqifier=1"><code class="language-plaintext highlighter-rouge">Colab: Manual Manipulation</code></a></h2>

<p>Here, unlike the above section where the optimizer applies our gradients for us, we apply the gradients manually.</p>

<h3 id="531-usecase">5.3.1) Usecase</h3>

<p>The primary use-case for this method over all others is custom gradient application: for instance, zeroing out gradients every other step, or scaling the gradients under specific conditions.</p>
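<p>A minimal numpy sketch of such a custom rule, with a hypothetical step counter and scale factor (in PyTorch you would modify each parameter&#8217;s <code class="language-plaintext highlighter-rouge">.grad</code> in place before applying it):</p>

```python
import numpy as np

def apply_custom(param, grad, step, scale=0.5):
    """Zero out gradients on every other step; otherwise scale them."""
    if step % 2 == 1:
        grad = np.zeros_like(grad)   # skip this update entirely
    else:
        grad = grad * scale          # dampen the update
    return param - grad              # learning rate folded into scale

p = np.array([1.0, 1.0])
g = np.array([0.2, -0.2])
p = apply_custom(p, g, step=0)  # scaled update
p = apply_custom(p, g, step=1)  # zeroed update: p is unchanged
```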

<h1 id="6-closing-thoughts">6) Closing Thoughts</h1>

<h2 id="61-gradient-stopping">6.1) Gradient Stopping</h2>

<p>The “simple” methods such as <code class="language-plaintext highlighter-rouge">no_grad</code> are generally easier to implement and should be preferred if your goal is merely to stop gradients from flowing upstream. My recommendation is to use <code class="language-plaintext highlighter-rouge">no_grad</code> wherever possible, as it is faster than <code class="language-plaintext highlighter-rouge">detach</code>. This preference is somewhat subjective, but I find <code class="language-plaintext highlighter-rouge">no_grad</code> also clearer because it explicitly marks out a block of computations that will not be differentiated further down. When you <code class="language-plaintext highlighter-rouge">detach</code> a variable, you end up with both the original tensor and the detached copy, which could lead to confusion.</p>

<p>I recommend avoiding <code class="language-plaintext highlighter-rouge">inference</code> for gradient manipulation unless you’re absolutely certain you have a compelling reason. I do not see a scenario where <code class="language-plaintext highlighter-rouge">inference</code> would be preferred over <code class="language-plaintext highlighter-rouge">no_grad</code>, especially when considering that using <code class="language-plaintext highlighter-rouge">no_grad</code> allows you to avoid unnecessary copying of variables.</p>

<h2 id="62-gradient-manipulation">6.2) Gradient Manipulation</h2>

<p>If feasible, use the optimizer approach as it leaves less room for error. However, the <strong>Manual Manipulation</strong> method is ideal if you need to apply custom operations to your gradients. This is particularly useful for scenarios where you might want to scale gradients for specific layers under certain conditions or zero out gradients intermittently.</p>]]></content><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><category term="machine learning" /><category term="gradients" /><category term="optimization" /><category term="pytorch" /><summary type="html"><![CDATA[Spoiler: PyTorch offers about five ways to manipulate gradients.]]></summary></entry><entry><title type="html">Stumbling backwards into np.random.seed through jax.</title><link href="https://ianq.ai/randomness_in_jax/" rel="alternate" type="text/html" title="Stumbling backwards into np.random.seed through jax." /><published>2022-01-06T00:00:00-08:00</published><updated>2022-01-06T00:00:00-08:00</updated><id>https://ianq.ai/randomness_in_jax</id><content type="html" xml:base="https://ianq.ai/randomness_in_jax/"><![CDATA[<p><strong>Spoiler</strong>: We’ve all been using randomness wrong</p>

<p>You can find the associated <a href="https://github.com/IanQS/blogpostcode/blob/master/src/jax/randomness.ipynb">notebook</a> for this post, but it’s relatively minimal. Feel free to open the link and play with the notebook, but know that running it isn’t strictly necessary.</p>

<h1 id="1-intro">1) Intro</h1>
<p>Given my current needs, I think that <a href="https://github.com/google/jax"><code class="language-plaintext highlighter-rouge">jax</code></a> is the best computational tool out there. I hope to write more about <code class="language-plaintext highlighter-rouge">jax</code> in the coming months, and show you why you should consider trying it out. One important thing to realize is that <code class="language-plaintext highlighter-rouge">jax</code> is not a deep learning framework (although it does have autograd built-in). First and foremost, <code class="language-plaintext highlighter-rouge">jax</code> is a numerical computation library, like <code class="language-plaintext highlighter-rouge">numpy</code>.</p>

<p>Over the weekend, I was working on porting some code from <code class="language-plaintext highlighter-rouge">pytorch</code> to <code class="language-plaintext highlighter-rouge">jax</code>. In the process, I stumbled onto some code that dealt with randomness, and I decided to read more about randomness in the context of <code class="language-plaintext highlighter-rouge">numpy</code>. The material I had read over the weekend ended up being the motivation behind this blog post. To begin, let’s look at how we would deal with randomness in <code class="language-plaintext highlighter-rouge">jax</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">key</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">PRNGKey</span><span class="p">(</span><span class="n">SEED</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>

<span class="c1"># which outputs the following on my run:
#   DeviceArray([1076515368, 3893328283], dtype=uint32)
</span></code></pre></div></div>

<p>Ironically, I felt like I understood <code class="language-plaintext highlighter-rouge">numpy</code>’s randomness better after using <code class="language-plaintext highlighter-rouge">jax</code>. This blog post aims to lay out what I learned in the process.</p>

<h2 id="i-a-little-about-jax">i) A little about jax</h2>

<p>As mentioned earlier, <code class="language-plaintext highlighter-rouge">jax</code> is a computational framework akin to <code class="language-plaintext highlighter-rouge">numpy</code>. I’d say the main difference between <code class="language-plaintext highlighter-rouge">jax</code> and <code class="language-plaintext highlighter-rouge">numpy</code> is that <code class="language-plaintext highlighter-rouge">jax</code> was designed to be accelerator agnostic: it runs fast regardless of whether you’re on a CPU, GPU, or TPU. I particularly like it because of:</p>

<ul>
  <li>
    <p>how <a href="https://jax.readthedocs.io/en/latest/faq.html#benchmarking-jax-code">fast</a> it is when compared to other frameworks (I got a 10X speed boost compared to raw vectorized numpy in a function with lots of dot products).</p>
  </li>
  <li>
    <p>how easy it is to peek into its internals (admittedly, this is subjective).</p>
  </li>
  <li>
    <p>how it allows you to implement the equations you see in papers directly. You can implement the line of code then call <a href="https://jax.readthedocs.io/en/latest/_autosummary/jax.vmap.html#jax.vmap"><code class="language-plaintext highlighter-rouge">vmap</code></a> to apply it to all rows in your array. You don’t need to futz around with vectorizing your equations any longer.</p>
  </li>
</ul>
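<p>The last point can be sketched without jax: write the equation for a single example, then batch it. The explicit loop below stands in for what <code class="language-plaintext highlighter-rouge">jax.vmap</code> automates (and compiles):</p>

```python
import numpy as np

def predict_one(x, w):
    # the "paper" version: a single example, no batch dimension
    return x @ w

X = np.arange(12.0).reshape(4, 3)   # 4 examples, 3 features
w = np.array([1.0, 0.0, -1.0])

# roughly what jax.vmap(predict_one, in_axes=(0, None)) would do for us
batched = np.stack([predict_one(row, w) for row in X])

assert np.allclose(batched, X @ w)  # matches the hand-vectorized form
```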

<h2 id="ii-could-it-be-the-future">ii) Could it be the future?</h2>

<p>I feel like <code class="language-plaintext highlighter-rouge">jax</code> and <code class="language-plaintext highlighter-rouge">XLA</code> are the future of computation in python. Granted, this isn’t exactly a hot take - lots of people and companies have begun to move to <code class="language-plaintext highlighter-rouge">jax</code>:</p>

<ul>
  <li>
    <p>DeepMind’s <a href="https://github.com/deepmind/alphafold">alphafold</a> model is built in <a href="https://github.com/deepmind/dm-haiku">haiku</a>, which is a deep-learning oriented library built on top of <code class="language-plaintext highlighter-rouge">jax</code></p>
  </li>
  <li>
<p>Google Brain has also released a deep-learning library called <a href="https://www.youtube.com/watch?v=fuAyUQcVzTY">flax</a>. From what I can tell, teams at Google Brain have begun transitioning over to it.</p>
  </li>
  <li>
    <p>Huggingface has also begun releasing models in <code class="language-plaintext highlighter-rouge">flax</code></p>
  </li>
</ul>

<p><strong>Note</strong>: Leaving PyTorch behind</p>

<p>In my last blog post <a href="./2022-01-06-pytorch-gradients-pt1.md">PyTorch Gradients</a>, I mentioned publishing a series of posts covering gradients in PyTorch. I fully intend to finish that series, but I’ve more or less abandoned PyTorch.</p>

<h1 id="2-randomness">2) Randomness:</h1>

<p>Anyways, on to the meat of this post: over the weekend, I was playing with the idea of porting over <a href="https://github.com/jeshraghian/snntorch">snnTorch</a> to <code class="language-plaintext highlighter-rouge">jax</code>. I first began by scanning through the tutorials where I read some material about <a href="https://github.com/jeshraghian/snntorch/blob/master/docs/tutorials/tutorial_1.rst#3-spike-generation-optional">creating random spike trains</a>. The contents of the tutorial and what spike trains are aren’t crucial for this post. Still, it did remind me that <code class="language-plaintext highlighter-rouge">jax</code> handles randomness differently from other frameworks. So, I thought I should do some deep(er) reading before naively moving code over.</p>

<p>If you look up randomness in <code class="language-plaintext highlighter-rouge">jax</code>, one of the first things you’ll stumble on is how to generate a random key and continually split it. To make a long story short, <code class="language-plaintext highlighter-rouge">jax</code> is <a href="https://en.wikipedia.org/wiki/Functional_programming">functional</a> in nature, which means that it is stateless. Being stateless means (among other things) that <code class="language-plaintext highlighter-rouge">jax</code> handles randomness explicitly; we have to pass in a key every time we invoke randomness in our code. On the one hand, this makes our code more verbose, but on the other hand, it makes reproducibility far easier.</p>
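<p>The contrast can be illustrated without jax by treating a numpy <code class="language-plaintext highlighter-rouge">Generator</code> seed as a stand-in for a key: the same key always yields the same draws, and no hidden state advances between calls (jax behaves the same way, e.g. <code class="language-plaintext highlighter-rouge">jax.random.normal(key, shape)</code>):</p>

```python
import numpy as np

def draw(key, n):
    # "stateless" randomness: the key fully determines the output
    return np.random.default_rng(key).standard_normal(n)

a = draw(0, 3)
b = draw(0, 3)   # same key in, same numbers out: no hidden state
c = draw(1, 3)   # a different key gives an independent stream

assert np.array_equal(a, b)
assert not np.array_equal(a, c)
```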

<hr />

<h2 id="i-statefulness">i) Statefulness</h2>

<p>The following is merely a working example of what “statefulness” means. It is by no means a rigorous definition. Think of being stateful as the following:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class StatefulAdd:
    def __init__(self):
        self.count = 0

    def __call__(self, x):
        # x plus the number of times this object has been called before
        result = x + self.count
        self.count += 1
        return result

foo = StatefulAdd()
first = foo(1)   # first := 1
second = foo(1)  # second := 2
</code></pre></div></div>

<p>i.e., we can plug the same value in but obtain a different result each time. There’s nothing inherently wrong with coding this way (regardless of what the func-ies will say); it can just be harder to reason about.</p>

<hr />

<p>Anyways, going back to <code class="language-plaintext highlighter-rouge">jax</code>: by enforcing statelessness, we have to be explicit about our random key every time we make a call. By enforcing statelessness, <code class="language-plaintext highlighter-rouge">jax</code> sidesteps the reproducibility issues that plagued TensorFlow 1.X (and probably PyTorch too). Although <code class="language-plaintext highlighter-rouge">jax</code> <a href="https://github.com/google/jax/issues/565">isn’t perfect</a> on the reproducibility front, I believe it is going in the right direction.</p>

<h2 id="ii-reproducibility-in-tf1x">ii) Reproducibility in TF1.X</h2>

<ul>
  <li>
    <p><a href="https://stackoverflow.com/questions/36288235/how-to-get-stable-results-with-tensorflow-setting-random-seed">How to get stable results with TensorFlow, setting random seed</a> although, to be fair, there seems to be an official answer for Tensorflow 2 as of <a href="https://stackoverflow.com/a/60088810/3532564">2020</a></p>
  </li>
  <li>
    <p><a href="https://stackoverflow.com/questions/32419510/how-to-get-reproducible-results-in-keras/52897216#52897216">Why can’t I get reproducible results in Keras even though I set the random seeds? (asked in 2018)</a> which contains my favorite answer I’ve seen so far. The answer states the following and has the following caveat:</p>
  </li>
</ul>

<blockquote>
  <p>In short, to be absolutely sure that you will get reproducible results with your python script on one computer’s/laptop’s CPU then you will have to do the following:</p>
</blockquote>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Seed value
# Apparently you may use different seed values at each stage
</span><span class="n">seed_value</span><span class="o">=</span> <span class="mi">0</span>

<span class="c1"># 1. Set the `PYTHONHASHSEED` environment variable at a fixed value
</span><span class="kn">import</span> <span class="nn">os</span>
<span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'PYTHONHASHSEED'</span><span class="p">]</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">seed_value</span><span class="p">)</span>

<span class="c1"># 2. Set the `python` built-in pseudo-random generator at a fixed value
</span><span class="kn">import</span> <span class="nn">random</span>
<span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed_value</span><span class="p">)</span>

<span class="c1"># 3. Set the `numpy` pseudo-random generator at a fixed value
</span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed_value</span><span class="p">)</span>

<span class="c1"># 4. Set the `tensorflow` pseudo-random generator at a fixed value
</span><span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="n">tf</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">set_seed</span><span class="p">(</span><span class="n">seed_value</span><span class="p">)</span>
<span class="c1"># for later versions: 
# tf.compat.v1.set_random_seed(seed_value)
</span>
<span class="c1"># 5. Configure a new global `tensorflow` session
</span><span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">backend</span> <span class="k">as</span> <span class="n">K</span>
<span class="n">session_conf</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">ConfigProto</span><span class="p">(</span><span class="n">intra_op_parallelism_threads</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">inter_op_parallelism_threads</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">sess</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">Session</span><span class="p">(</span><span class="n">graph</span><span class="o">=</span><span class="n">tf</span><span class="p">.</span><span class="n">get_default_graph</span><span class="p">(),</span> <span class="n">config</span><span class="o">=</span><span class="n">session_conf</span><span class="p">)</span>
<span class="n">K</span><span class="p">.</span><span class="n">set_session</span><span class="p">(</span><span class="n">sess</span><span class="p">)</span>
<span class="c1"># for later versions:
# session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
# sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
# tf.compat.v1.keras.backend.set_session(sess)
</span></code></pre></div></div>

<p>Indeed, a thing of beauty.</p>

<h1 id="3-reproducibility-in-numpy">3) Reproducibility in <code class="language-plaintext highlighter-rouge">numpy</code></h1>

<p>First and foremost, I’d recommend opening the <a href="https://github.com/IanQS/blogpostcode/blob/master/src/jax/randomness.ipynb">accompanying notebook</a>, specifically the <code class="language-plaintext highlighter-rouge">numpy</code> portion and playing with the code there. NB: the <code class="language-plaintext highlighter-rouge">jax</code> portion is trivial and works as you might expect; I included the <code class="language-plaintext highlighter-rouge">jax</code> portion primarily for completeness.</p>

<p>As you play with the <code class="language-plaintext highlighter-rouge">numpy</code> portion, you’ll notice that you get new random values every time you call into the <code class="language-plaintext highlighter-rouge">random</code> module, even though you never explicitly pass in a key. This tells us something is happening under the hood.</p>

<p>This “something” looks a lot like we are generating a new random key on every call. Note that this is not what happens under the hood, but it helps tie what we see to <code class="language-plaintext highlighter-rouge">jax</code> and how it handles random state.</p>

<h2 id="i-example-scenario">i) Example Scenario</h2>

<p>You have a program that only crashes once in a while, and you’ve identified the exact function that it crashes on! You’ve even managed to find a specific random seed on which that function works fine, so you’d like to set the state only inside that function and avoid the problem altogether.</p>

<p>Yes, this is a contrived example; sue me.</p>

<h3 id="statefullness-issue-illustrated">Statefulness issue illustrated</h3>

<p>Note here how we have reset the random seed within <code class="language-plaintext highlighter-rouge">new_generate_np_weights</code>. If the randomness were only local to the context we are in, we would expect to “continue” the original randomness once we exit the function. Said differently, we would have two “sources” of randomness, the second of which would get garbage collected once <code class="language-plaintext highlighter-rouge">new_generate_np_weights</code> returns. However, as we can see at the call labeled “# 3rd call”, we receive the same random value as our “# 2nd call”.</p>
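<p>The behaviour can be reproduced in a few lines; <code class="language-plaintext highlighter-rouge">new_generate_np_weights</code> below is a minimal stand-in for the notebook’s function. Seeding inside the function silently resets the one global stream, so the draw after the call repeats values instead of continuing the original stream:</p>

```python
import numpy as np

np.random.seed(0)
first = np.random.standard_normal()    # 1st call

def new_generate_np_weights():
    np.random.seed(0)                  # resets the *global* state
    return np.random.standard_normal()

inner = new_generate_np_weights()      # 2nd call: repeats `first`
after = np.random.standard_normal()    # 3rd call: continues the *reset* stream

assert inner == first                  # the reset is not local to the function
```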

<h3 id="the-global-state">The global state</h3>

<p>Clearly, something “unexpected” is happening. At its core, <code class="language-plaintext highlighter-rouge">np.random.seed</code> operates on what is known as a <a href="https://numpy.org/doc/stable/reference/random/legacy.html#numpy.random.RandomState"><code class="language-plaintext highlighter-rouge">RandomState</code></a>, which, as we’ve discussed, is a stateful object. In fact, as we saw in our code example, calling <code class="language-plaintext highlighter-rouge">seed</code> resets the single global object rather than creating a fresh, local source of randomness.</p>

<p>Obviously, this is the source of our issues.</p>

<h2 id="ii-how-do-we-address-reproducibility-in-numpy">ii) How do we address reproducibility in numpy?</h2>

<p>In all honesty, I had previously stumbled on the <a href="https://numpy.org/doc/stable/reference/random/generator.html">new best practices</a> for generating random numbers in <code class="language-plaintext highlighter-rouge">numpy</code>, but I never bothered to read them. I don’t think the reasoning behind the recommendation ever clicked with me, so I never felt a need to change how I was doing things.</p>

<p>However, now that we are clear on the limitations of the existing <code class="language-plaintext highlighter-rouge">np.random.seed</code>, we can discuss the recommended way of doing things: the <a href="https://numpy.org/doc/stable/reference/random/generator.html"><code class="language-plaintext highlighter-rouge">Generator</code></a> API. To make a long story short, you create an object which contains all of your randomness, and you “extract” whatever you need from this random object. For example, see <a href="https://numpy.org/doc/stable/reference/random/index.html">random sampling</a>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from numpy.random import default_rng

rng = default_rng(seed=42)  # all of our randomness lives in this one object
vals = rng.standard_normal(10)
more_vals = rng.standard_normal(10)
</code></pre></div></div>

<p>as opposed to an older method</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">numpy</span> <span class="kn">import</span> <span class="n">random</span>
<span class="n">vals</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">standard_normal</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
<span class="n">more_vals</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">standard_normal</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div></div>

<p>Here, we presumably mutate a global object under the hood.</p>
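<p>Under the new API you can even mimic jax’s key splitting: numpy’s <code class="language-plaintext highlighter-rouge">SeedSequence</code> can spawn independent child seeds, one per consumer. A sketch (not code from the notebook):</p>

```python
from numpy.random import SeedSequence, default_rng

# one root seed, split into independent child streams: the numpy
# analogue of jax.random.split(key, num=3)
children = SeedSequence(42).spawn(3)
rngs = [default_rng(s) for s in children]

draws = [rng.standard_normal(4) for rng in rngs]
```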

<h1 id="closing-thoughts">Closing Thoughts:</h1>

<p>This was an enlightening topic for me to dive into, and I hope you found reading this useful. I feel like I better understand what <code class="language-plaintext highlighter-rouge">numpy</code> does under the hood when we use randomness. I also feel like I better understand the motivation behind <code class="language-plaintext highlighter-rouge">numpy</code>’s API change recommendation when viewed through the lens of <code class="language-plaintext highlighter-rouge">jax</code>.</p>

<p>tl;dr</p>

<p>1) <code class="language-plaintext highlighter-rouge">jax</code> handles randomness very well, even if it may be more verbose.
2) Use the new <a href="https://numpy.org/doc/stable/reference/random/index.html">best practices</a> if you are dealing with random numbers in <code class="language-plaintext highlighter-rouge">numpy</code></p>

<h2 id="ps">P.S.</h2>

<p>You can generate multiple keys to consume later with <a href="https://jax.readthedocs.io/en/latest/_autosummary/jax.random.split.html">jax.random.split</a>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">key_array</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="n">X</span><span class="p">)</span>
</code></pre></div></div>]]></content><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><category term="machine learning" /><category term="jax" /><summary type="html"><![CDATA[Spoiler: We’ve all been using randomness wrong]]></summary></entry><entry><title type="html">A Machine Learning oriented introduction to PALISADE, CKKS and pTensor.</title><link href="https://ianq.ai/pTensor-and-palisade/" rel="alternate" type="text/html" title="A Machine Learning oriented introduction to PALISADE, CKKS and pTensor." /><published>2021-02-01T00:00:00-08:00</published><updated>2021-02-01T00:00:00-08:00</updated><id>https://ianq.ai/pTensor-and-palisade</id><content type="html" xml:base="https://ianq.ai/pTensor-and-palisade/"><![CDATA[<p><strong>Spoiler</strong>: You can do math on encrypted numbers</p>

<p>Note: “we” means “I”</p>

<p><strong>Overview</strong>:</p>

<p>1) We introduce the <a href="https://gitlab.com/palisade/palisade-development">PALISADE</a> library and the cryptographic parameters that we need to specify. We then explain what the cryptographic parameters mean for our application.</p>

<p>2) We use the <a href="https://github.com/IanQS/pTensor">pTensor</a> library and train a housing price predictor on the <a href="https://github.com/melindaleung/Ames-Iowa-Housing-Dataset/tree/master/data">Ames</a> dataset, a modern house price dataset.</p>

<p>3) We set up the discussion for the next post in the series.</p>

<p>Note: check the link at the very bottom for the complete source code. Sections have been omitted in this page to reduce clutter.</p>

<h2 id="1-palisade">1) PALISADE</h2>

<p>Instructions to install PALISADE can be found here: <a href="https://gitlab.com/palisade/palisade-development#build-instructions">PALISADE-Dev build instructions</a>. For users new to PALISADE and C++, we highly recommend bookmarking the <a href="https://palisade.gitlab.io/palisade-development/files.html">PALISADE Doxygen page</a> containing the library’s documentation.</p>

<h3 id="i-what-is-palisade">i) What is PALISADE</h3>

<p>From the <code class="language-plaintext highlighter-rouge">README.md</code> on the PALISADE page:</p>

<p>PALISADE is a general lattice cryptography library that currently includes efficient implementations of the following lattice cryptography capabilities:</p>

<ul>
  <li>Fully Homomorphic Encryption (FHE)
    <ul>
      <li>Brakerski/Fan-Vercauteren (BFV) scheme for integer arithmetic</li>
      <li>Brakerski-Gentry-Vaikuntanathan (BGV) scheme for integer arithmetic</li>
      <li>Cheon-Kim-Kim-Song (CKKS) scheme for real-number arithmetic</li>
      <li>Ducas-Micciancio (FHEW) and Chillotti-Gama-Georgieva-Izabachene (TFHE) schemes for Boolean circuit evaluation</li>
    </ul>
  </li>
  <li>Multi-Party Extensions of FHE (to support multi-key FHE)
    <ul>
      <li>Threshold FHE for BGV, BFV, and CKKS schemes</li>
      <li>Proxy Re-Encryption for BGV, BFV, and CKKS schemes</li>
    </ul>
  </li>
</ul>

<h3 id="ii-machine-learning-application">ii) Machine Learning Application</h3>

<p>The takeaway for us machine learning practitioners is that we can train machine learning models on encrypted data and have them produce encrypted predictions.</p>

<h2 id="2-palisades-cryptographic-parameters">2) PALISADE’s Cryptographic Parameters</h2>

<p>We as machine learners(?) need to have a rough idea of the following parameters:</p>

<h3 id="i-multdepth">i) multDepth</h3>

<p>This describes the depth of multiplication supported. Informally, when we encrypt data, we add some noise to increase the scheme’s security. When doing mathematical operations on these data, our noise increases (linearly in addition and subtraction but squared in multiplication).</p>

<p>There is no single “best” value to set the multDepth to; it is highly dependent on your problem. The following example equations show their corresponding multiplication depths:</p>

<ul>
  <li>
    <p>\((a * b) + (c * d)\) has a multiplication depth of 1</p>
  </li>
  <li>
    <p>\(a * b * c\) has a multiplication depth of 2</p>
  </li>
</ul>
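<p>The counting rule generalizes: the multiplicative depth is the length of the longest chain of multiplications through the expression. A toy checker over tuple-encoded expressions (an illustration only, not PALISADE code):</p>

```python
def mult_depth(expr):
    """Depth of an expression encoded as a leaf, or ('+'|'*', left, right)."""
    if not isinstance(expr, tuple):
        return 0                      # a fresh ciphertext has depth 0
    op, left, right = expr
    depth = max(mult_depth(left), mult_depth(right))
    return depth + 1 if op == '*' else depth

# (a * b) + (c * d) -> depth 1;  a * b * c -> depth 2
assert mult_depth(('+', ('*', 'a', 'b'), ('*', 'c', 'd'))) == 1
assert mult_depth(('*', ('*', 'a', 'b'), 'c')) == 2
```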

<h3 id="ii-scalingfactorbits">ii) scalingFactorBits</h3>

<p>In the original <a href="https://eprint.iacr.org/2016/421.pdf">CKKS paper,</a> the authors discuss a scaling factor they multiply values with. The scaling factor prevents rounding errors from destroying the significant figures during encoding. Unfortunately, it is difficult to discuss this parameter without discussing the paper’s core ideas, so we leave this for the next post. Thankfully, PALISADE is reliable in informing us if the <code class="language-plaintext highlighter-rouge">scalingFactorBits</code> is set too low.</p>

<p>We tend to use values between 30 and 50 for most applications.</p>

<h3 id="iii-batchsize">iii) batchSize</h3>

<p>The batchSize is a tricky parameter to set correctly. The issue is that the batch size must be equal to</p>

\[\frac{\text{Ring size}}{2}\]

<p>Unfortunately, one needs to set multDepth first, inspect the resulting ring size, and then regenerate the context with batchSize set to half the ring size. It’s a little hairy, yes, but this is the price we pay for privacy.</p>
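<p>The resulting loop reduces to the following sketch, where the ring dimension is a hypothetical value you would read off the generated crypto context:</p>

```python
# 1. generate the context with your chosen multDepth / scalingFactorBits
# 2. read the ring dimension the library picked (hypothetical value here)
ring_dimension = 8192

# 3. regenerate the context with batchSize fixed to half the ring dimension
batch_size = ring_dimension // 2
assert batch_size == 4096
```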

<hr />

<h2 id="3-ptensor-library">3) pTensor library:</h2>

<p>For this discussion we encourage readers to refer to <a href="https://github.com/IanQS/pTensor/blob/main/linear_regression_ames.cpp">linear_regression_ames.cpp</a> but we also highlight the critical sections in our discussion.</p>

<h3 id="i-ptensor">i) pTensor</h3>

<p>The pTensor library’s motivation is to provide those with a machine learning or data science background the ability to train encrypted machine learning models in a framework that looks and feels familiar. Where possible, we aimed to mimic the numpy library’s behavior (e.g., it allows broadcasting, and <code class="language-plaintext highlighter-rouge">*</code> corresponds to the Hadamard product).</p>

<p>In line with the library’s motivation, there are many aspects hidden from the user, but we briefly discuss important concepts that the inquisitive user may stumble upon while perusing the source code.</p>

<h3 id="ii-complex-numbers">ii) Complex numbers</h3>

<p>CKKS operates on complex numbers for various reasons that we will discuss in the follow-up; for now, know that we only use the real part of these complex numbers.</p>

<h3 id="iii-packing">iii) Packing</h3>

<p>To pack the data essentially means that we encode multiple data points into a single ciphertext. Homomorphic encryption is a slow process, but by leveraging SIMD, we can carry out our operations faster. An analogy would be doing a <code class="language-plaintext highlighter-rouge">for-loop</code> vs. vectorized operation in numpy. Because the size of our ciphertexts is already very large, it is advantageous to store the data in transpose form to reduce the number of encryptions we need to do and to allow for faster element-wise operations.</p>
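<p>The SIMD analogy in numpy terms, with each array element standing in for one slot of a packed ciphertext:</p>

```python
import numpy as np

X = np.arange(6.0).reshape(3, 2)     # 3 samples, 2 features

# looping: one operation per sample, like one ciphertext per sample
looped = np.array([row * 2.0 for row in X])

# packed / SIMD: one operation touches every slot at once
packed = X * 2.0
assert np.array_equal(looped, packed)

# storing the transpose: each row now holds one feature across all
# samples, so a per-feature operation becomes a single "ciphertext" op
Xt = X.T                             # 2 ciphertexts instead of 3
```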

<h3 id="iv-ptensorm_cc">iv) pTensor::m_cc</h3>

<p>The <code class="language-plaintext highlighter-rouge">m_cc</code> object is the cryptographic context which we use to carry out PALISADE’s operations.</p>

<h2 id="4-using-ptensor-on-the-ames-dataset">4) Using pTensor on the Ames dataset</h2>

<h3 id="i-setting-up-the-cryptographic-contexts">i) Setting up the cryptographic contexts</h3>

<p>We show</p>

<ul>
  <li>how to create a cryptocontext, which configures PALISADE to perform encrypted computation within a specific encryption scheme</li>
  <li>code for training on the Ames dataset</li>
</ul>

<p>Should you attempt to reproduce this process in <a href="https://numpy.org/"><code class="language-plaintext highlighter-rouge">Numpy</code></a> or in <a href="https://eigen.tuxfamily.org/index.php?title=Main_Page"><code class="language-plaintext highlighter-rouge">Eigen</code></a>, know that because of the noise and the way our encryption scheme operates, you may observe slightly different results between those plaintext versions and this encrypted version.</p>

<p>We briefly introduce the parameters used below but defer further discussion to later.</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">auto</span> <span class="n">cc</span> <span class="o">=</span> <span class="n">lbcrypto</span><span class="o">::</span><span class="n">CryptoContextFactory</span><span class="o">&lt;</span><span class="n">lbcrypto</span><span class="o">::</span><span class="n">DCRTPoly</span><span class="o">&gt;::</span><span class="n">genCryptoContextCKKS</span><span class="p">(</span>
    <span class="n">multDepth</span><span class="p">,</span> <span class="n">scalingFactorBits</span><span class="p">,</span> <span class="n">batchSize</span>
<span class="p">);</span>

<span class="n">cc</span><span class="o">-&gt;</span><span class="n">Enable</span><span class="p">(</span><span class="n">ENCRYPTION</span><span class="p">);</span>
<span class="n">cc</span><span class="o">-&gt;</span><span class="n">Enable</span><span class="p">(</span><span class="n">SHE</span><span class="p">);</span>
<span class="n">cc</span><span class="o">-&gt;</span><span class="n">Enable</span><span class="p">(</span><span class="n">LEVELEDSHE</span><span class="p">);</span>  <span class="c1">// @NOTE: we discuss SHE and LeveledSHE in the follow up</span>
<span class="k">auto</span> <span class="n">keys</span> <span class="o">=</span> <span class="n">cc</span><span class="o">-&gt;</span><span class="n">KeyGen</span><span class="p">();</span>
<span class="n">cc</span><span class="o">-&gt;</span><span class="n">EvalMultKeyGen</span><span class="p">(</span><span class="n">keys</span><span class="p">.</span><span class="n">secretKey</span><span class="p">);</span>
<span class="n">cc</span><span class="o">-&gt;</span><span class="n">EvalSumKeyGen</span><span class="p">(</span><span class="n">keys</span><span class="p">.</span><span class="n">secretKey</span><span class="p">);</span>

<span class="kt">int</span> <span class="n">ringDim</span> <span class="o">=</span> <span class="n">cc</span><span class="o">-&gt;</span><span class="n">GetRingDimension</span><span class="p">();</span>
<span class="kt">int</span> <span class="n">rot</span> <span class="o">=</span> <span class="kt">int</span><span class="p">(</span><span class="o">-</span><span class="n">ringDim</span> <span class="o">/</span> <span class="mi">4</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="c1">// @NOTE: we discuss EvalAtIndex in the followup</span>
<span class="n">cc</span><span class="o">-&gt;</span><span class="n">EvalAtIndexKeyGen</span><span class="p">(</span><span class="n">keys</span><span class="p">.</span><span class="n">secretKey</span><span class="p">,</span> <span class="p">{</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">rot</span><span class="p">});</span>  
</code></pre></div></div>
<p>We create a cryptocontext object which takes our chosen parameters:</p>

<p><code class="language-plaintext highlighter-rouge">multDepth</code> - The maximum number of sequential multiplications we can do before our data becomes too noisy and the decryption becomes meaningless.</p>

<p><code class="language-plaintext highlighter-rouge">scalingFactorBits</code> - the scaling factor mentioned above and to be discussed later.</p>

<p><code class="language-plaintext highlighter-rouge">batchSize</code> - how many data points (think vector of data) we pack into a ciphertext. Homomorphic encryption is slow but can be sped up by conducting operations over batches of data (via SIMD)</p>

<h3 id="ii-training-setup">ii) Training setup</h3>

<h3 id="iii-constructdataset">iii) constructDataset</h3>

<p>Notice that the function takes the plaintext X and y as parameters. The reason for passing in plaintext X’s and y’s is to allow easy indexing into the data for shuffling. Shuffling the data in encrypted form is possible but prohibitively slow, and an easier alternative exists: to simulate shuffling the data every epoch, we let the user specify some number of shuffles, and the data owner creates that many shuffled copies of the data, which are then encrypted.</p>

<p>While training, we can simulate this randomness by randomly indexing into any of the shuffles.</p>
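<p>A minimal plaintext sketch of that scheme (the function name <code class="language-plaintext highlighter-rouge">construct_dataset</code> and its signature here are illustrative, not pTensor’s exact API):</p>

```python
import numpy as np

def construct_dataset(X, y, n_shuffles, seed=0):
    """Create n_shuffles pre-shuffled copies of (X, y).

    In the encrypted setting the data owner would encrypt each copy
    after shuffling; here we just return the plaintext permutations.
    """
    rng = np.random.default_rng(seed)
    shuffles = []
    for _ in range(n_shuffles):
        idx = rng.permutation(len(y))
        shuffles.append((X[idx], y[idx]))
    return shuffles

X = np.arange(8.0).reshape(4, 2)
y = np.arange(4.0)
dataset = construct_dataset(X, y, n_shuffles=3)

# During training, per-epoch shuffling is simulated by sampling a copy:
rng = np.random.default_rng(42)
X_epoch, y_epoch = dataset[rng.integers(len(dataset))]
assert len(dataset) == 3
assert sorted(y_epoch.tolist()) == y.tolist()
```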

<h3 id="iv-training">iv) Training</h3>

<p>The following loop should look familiar to anyone with a machine learning background:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">epoch</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">epoch</span> <span class="o">&lt;</span> <span class="n">epochs</span><span class="p">;</span> <span class="o">++</span><span class="n">epoch</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">auto</span> <span class="n">index</span> <span class="o">=</span> <span class="n">distr</span><span class="p">(</span><span class="n">generator</span><span class="p">);</span>
  <span class="k">auto</span> <span class="n">curr_dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">[</span><span class="n">index</span><span class="p">];</span>
  <span class="k">auto</span> <span class="n">X</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">get</span><span class="o">&lt;</span><span class="mi">0</span><span class="o">&gt;</span><span class="p">(</span><span class="n">curr_dataset</span><span class="p">);</span>
  <span class="k">auto</span> <span class="n">y</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">get</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span><span class="p">(</span><span class="n">curr_dataset</span><span class="p">);</span>

  <span class="k">auto</span> <span class="n">prediction</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">encryptedDot</span><span class="p">(</span><span class="n">w</span><span class="p">);</span>
  <span class="k">auto</span> <span class="n">residual</span> <span class="o">=</span> <span class="n">prediction</span> <span class="o">-</span> <span class="n">y</span><span class="p">;</span><span class="c1">// Remember, our X is already a transpose</span>
  <span class="k">auto</span> <span class="n">_gradient</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">encryptedDot</span><span class="p">(</span><span class="n">residual</span><span class="p">);</span>
  <span class="n">pTensor</span> <span class="n">gradient</span><span class="p">;</span>
  <span class="n">gradient</span> <span class="o">=</span> <span class="n">_gradient</span><span class="p">;</span>
  <span class="k">auto</span> <span class="n">scaledGradient</span> <span class="o">=</span> <span class="n">gradient</span> <span class="o">*</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">scaleByNumSamples</span><span class="p">;</span>

  <span class="n">w</span> <span class="o">=</span> <span class="n">pTensor</span><span class="o">::</span><span class="n">applyGradient</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">scaledGradient</span><span class="p">);</span>
  <span class="n">w</span> <span class="o">=</span> <span class="n">w</span><span class="p">.</span><span class="n">decrypt</span><span class="p">().</span><span class="n">encrypt</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, there are a few things to note:</p>

<p>1) <code class="language-plaintext highlighter-rouge">encryptedDot</code> instead of <code class="language-plaintext highlighter-rouge">dot</code> (which is also supported)</p>

<p>In the first <code class="language-plaintext highlighter-rouge">encryptedDot</code>, in the matrix-matrix case, we do a Hadamard product followed by a summation along the 0th axis. Again, our X is encrypted in transpose form, of shape (#features, #observations), and our weight matrix is also of shape (#features, #observations). We leave it to the reader to work out the details of why this works.</p>

<p>In the other case (not matrix-matrix), we default to the standard dot product.</p>

<p>2) <code class="language-plaintext highlighter-rouge">applyGradient</code></p>

<p>To understand the motivation here, we must first discuss the shape of the incoming values</p>

<p><code class="language-plaintext highlighter-rouge">w: (#features, #observations)</code></p>

<p><code class="language-plaintext highlighter-rouge">scaledGradient: (1, #features)</code></p>

<p>So, we must broadcast the <code class="language-plaintext highlighter-rouge">scaledGradient</code> into a repeated-matrix form before applying it to the weights.</p>
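<p>In plaintext numpy terms (a sketch with made-up shapes, not pTensor’s actual implementation), the repeated-matrix form amounts to tiling the (1, #features) gradient so every column of <code class="language-plaintext highlighter-rouge">w</code> sees the same per-feature step:</p>

```python
import numpy as np

n_features, n_observations = 3, 5
w = np.ones((n_features, n_observations))
scaled_gradient = np.array([[0.1, 0.2, 0.3]])  # (1, n_features)

# Tile the gradient into shape (n_features, n_observations) so it can
# be subtracted from w element-wise; numpy could broadcast this, but in
# the encrypted setting the repetition has to be materialized.
repeated = np.repeat(scaled_gradient.T, n_observations, axis=1)
assert repeated.shape == (n_features, n_observations)

w_new = w - repeated  # the plaintext analogue of applyGradient
assert np.allclose(w_new[:, 0], [0.9, 0.8, 0.7])
```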

<p>3) <code class="language-plaintext highlighter-rouge">w.decrypt().encrypt()</code></p>

<p>The reason for our decrypt-encrypt round trip has to do with the <code class="language-plaintext highlighter-rouge">multDepth</code> parameter that we briefly discussed earlier. As mentioned, as we operate on our ciphertexts, we accumulate noise. If this noise grows too large, decryption begins to fail, and random bits get interpreted as (usually huge) random numbers. By decrypting and re-encrypting our results, we refresh this noise (reduce it to 0).</p>

<p>However, there is a caveat: only the party with the secret key can do the re-encrypting. Consider a data-enclave setup where the enclave holds the key and the client does all the computation. There is a limit to the maximum <code class="language-plaintext highlighter-rouge">multDepth</code> one can set before CKKS becomes too unwieldy. Computations that exceed that <code class="language-plaintext highlighter-rouge">multDepth</code> need either server re-encryption (as shown here) or bootstrapping (which we will address in the next post). Bootstrapping resets the noise and thus the multiplicative depth; however, bootstrapping for CKKS is not yet available in PALISADE as of Feb 2021. The server re-encryption process is considered less secure than a fully homomorphic setup, but we defer further discussion to the next post.</p>
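<p>The bookkeeping can be pictured with a toy counter (purely illustrative; PALISADE tracks levels internally and the class below is not its API): each sequential ciphertext multiplication consumes one level of the depth budget, and a decrypt-encrypt round trip restores it:</p>

```python
class DepthBudget:
    """Toy model of a ciphertext's multiplicative-depth budget."""

    def __init__(self, mult_depth):
        self.mult_depth = mult_depth
        self.remaining = mult_depth

    def multiply(self):
        # each sequential multiplication adds noise / consumes a level
        if self.remaining == 0:
            raise RuntimeError("too noisy: decryption would fail")
        self.remaining -= 1

    def refresh(self):
        # decrypt-then-encrypt (or bootstrapping) resets the noise
        self.remaining = self.mult_depth

ct = DepthBudget(mult_depth=2)
ct.multiply()
ct.multiply()
ct.refresh()   # without this, the next multiply would raise
ct.multiply()
assert ct.remaining == 1
```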

<h1 id="5-closing-thoughts">5) Closing Thoughts</h1>

<p>P.s: visit  <a href="https://gitlab.com/palisade/palisade-development/-/tree/master/src/pke/examples">PALISADE - PKE</a> for further examples of how to use PALISADE (one of which I contributed to!).</p>]]></content><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><category term="CKKS" /><category term="FHE" /><category term="encrypted ml" /><category term="C++" /><summary type="html"><![CDATA[Spoiler: You can do math on encrypted numbers]]></summary></entry><entry><title type="html">MAPE Madness</title><link href="https://ianq.ai/The-Curious-MAPE/" rel="alternate" type="text/html" title="MAPE Madness" /><published>2019-12-25T00:00:00-08:00</published><updated>2019-12-25T00:00:00-08:00</updated><id>https://ianq.ai/The-Curious-MAPE</id><content type="html" xml:base="https://ianq.ai/The-Curious-MAPE/"><![CDATA[<p><strong>Spoiler</strong>: RTFM</p>

<p><strong>Problem setup</strong>: You want to use the Mean Absolute Percentage Error (<strong>MAPE</strong>) as your loss function for training <strong>Linear Regression</strong> on some forecast data. <a href="https://link.springer.com/referenceworkentry/10.1007%2F1-4020-0612-8_580">Springer: Mean Absolute Percentage Error (MAPE)</a> has found success in forecasting because it has desirable properties:</p>

<ul>
  <li>
    <p>robust to outliers</p>
  </li>
  <li>
    <p>scale invariance (returns a percentage) and is intuitive to compare across datasets.</p>
  </li>
</ul>

<h1 id="0-setup">0) Setup</h1>

<p>You have forecasting data where a significant difference may exist between contiguous samples.</p>

\[T_1 = 5, T_2 = 5000\]

<p>For example, you want to predict the price of Bitcoin, or ensure that your power plants can support demand when <a href="https://www.reuters.com/article/uk-soccer-world-england-electricity/england-brews-up-sufficient-power-for-world-cup-tea-time-surge-idUKKBN0E92G220140529">England brews up sufficient power for World Cup tea-time surge</a>.</p>

<p>We reproduce the equation below:</p>

\[\text{MAPE} = \frac{1}{N} \sum_{t=1}^{N} \left|\frac{y_t - \hat{y}_t}{y_t}\right|\]
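<p>In code, the definition above is a one-liner (plain numpy, assuming no zero labels):</p>

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error; assumes y_true has no zeros."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))

# A 10% miss on a tiny value and on a huge one contribute equally:
assert np.isclose(mape([5.0, 5000.0], [4.5, 4500.0]), 0.1)
```

<p>This scale invariance is exactly why MAPE suits series like \(T_1 = 5, T_2 = 5000\) above.</p>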

<h1 id="1-failed-attempts">1) Failed Attempts</h1>

<p>Here’s hoping you learn from my mistakes and can avoid the time I wasted trying to solve this problem</p>

<h2 id="11-sklearn">1.1) Sklearn</h2>

<p>A quick look at the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html">Sklearn Linear Model - Linear Regression</a> page tells you that it only supports <a href="https://en.wikipedia.org/wiki/Ordinary_least_squares">OLS</a>. This is unfortunate because <code class="language-plaintext highlighter-rouge">sklearn</code> is, in general, heavily optimized and well tested.</p>

<h2 id="12-autograd">1.2) <a href="https://github.com/HIPS/autograd">Autograd</a></h2>

<p>Having worked through the <a href="https://github.com/HIPS/autograd/tree/master/examples">examples</a>, it was not clear to me how to handle the enormous datasets I was modeling at the time. What I was after was a way to generate indices to be passed in for minibatch training. After much searching, I eventually found what I was looking for in the <a href="https://github.com/HIPS/autograd/blob/master/examples/convnet.py#L198">Convnet Example</a>, which shows how to pass minibatches in.</p>

<p><strong>Note:</strong> you want to be sure that none of your <code class="language-plaintext highlighter-rouge">y_true</code> values are 0, as this can lead to division-by-zero errors during optimization. I suggest doing</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def objective(params, X, y):
    pred = np.dot(X, params)
    non_zero_mask = y &gt; 0
    # mean absolute percentage error over the non-zero labels
    return np.mean(np.abs((y[non_zero_mask] - pred[non_zero_mask]) / y[non_zero_mask]))
</code></pre></div></div>

<p>Another option would be to add weights to the <code class="language-plaintext highlighter-rouge">objective</code> function: if you are extremely unlucky, all the labels, <code class="language-plaintext highlighter-rouge">y</code>, in a batch could be 0, leaving the masked objective empty. Additionally, you may want to weigh different samples more or less heavily.</p>

<p>Unfortunately, although I managed to get it to work, this solution was unbearably slow. Furthermore, for maintainability reasons, it would just be easier if you could use the <code class="language-plaintext highlighter-rouge">sklearn</code> API (not to say that you couldn’t wrap your <code class="language-plaintext highlighter-rouge">autograd</code> training into the <code class="language-plaintext highlighter-rouge">sklearn</code> format).</p>

<p>It was time to head back to the drawing board.</p>

<h1 id="2-solution">2) Solution</h1>

<p>I got lucky, and things lined up perfectly.</p>

<h2 id="21-getting-lucky-with-sklearn">2.1) Getting lucky with sklearn</h2>

<p>While researching ways to use <code class="language-plaintext highlighter-rouge">sklearn</code> packages to solve my issue, I also came across <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html">sklearn.SGDRegressor</a>, but that only allows the following loss functions:</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">squared_error</code>: OLS</p>
  </li>
  <li>
<p><code class="language-plaintext highlighter-rouge">huber</code>: wherein errors below some \(\epsilon\) use the squared loss, while errors above that \(\epsilon\) are treated as a linear loss.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">epsilon_insensitive</code>: ignores errors less than \(\epsilon\) and is linear when greater than that</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">squared_epsilon_insensitive</code>: is <code class="language-plaintext highlighter-rouge">epsilon_insensitive</code> but quadratic instead of linear.</p>
  </li>
</ul>

<h2 id="22-getting-lucky-with-the-equations">2.2) Getting lucky with the equations</h2>

<p>Looking at the <a href="https://en.wikipedia.org/wiki/Mean_absolute_percentage_error">Wikipedia page</a> for MAPE, one might notice that it resembles the formula for <a href="https://en.wikipedia.org/wiki/Mean_absolute_error">MAE</a></p>

\[MAPE = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{Y_i - \hat{Y}_i}{Y_i}\right|\]

\[MAE = \frac{1}{n}\sum_{i=1}^{n}|Y_i - \hat{Y}_i|\]

<h3 id="algebraic-manipulation">Algebraic Manipulation</h3>

\[\begin{align*}
MAPE &amp;= \frac{1}{n}\sum_{i=1}^{n}\left|\frac{Y_i - \hat{Y}_i}{Y_i}\right| &amp; \text{In my problem, each } Y_i \text{ is always positive} \\
&amp;= \frac{1}{n}\sum_{i=1}^{n}\frac{1}{Y_i}|Y_i - \hat{Y}_i| &amp; \text{Looks like a weighted MAE}\\
&amp;= \text{MAE after scaling each sample by } \frac{1}{Y_i}\\
\end{align*}\]

<p>so this means that I just need to find an <code class="language-plaintext highlighter-rouge">MAE</code> implementation.</p>

<h2 id="23-lady-luck-is-smiling">2.3) Lady Luck is Smiling</h2>

<p>By pure chance, I found <a href="https://scikit-learn.org/stable/modules/sgd.html#mathematical-formulation">Sklearn-mathematical formulation of SGD losses</a>, and I decided to read it.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">epsilon_insensitive</code> loss ignores errors less than $\epsilon$ and is linear when greater than that</p>
</blockquote>

<p>was the description for one of the losses. However, it wasn’t apparent to me that they would also take the absolute error. Only after reading the contents of the link above did I realize what it meant:</p>

\[L(Y, \hat{Y}) = \max(0, |Y - \hat{Y}| - \epsilon)\]

<p>This means that if we set \(\epsilon\) to 0, we get the form we want!</p>
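<p>A quick numeric check of that claim (a standalone sketch, not sklearn’s internals): with \(\epsilon = 0\), the epsilon-insensitive loss reduces to the per-sample absolute error:</p>

```python
import numpy as np

def epsilon_insensitive(y, y_hat, eps):
    # L(y, y_hat) = max(0, |y - y_hat| - eps), element-wise
    return np.maximum(0.0, np.abs(y - y_hat) - eps)

y = np.array([5.0, 5000.0, -3.0])
y_hat = np.array([4.0, 5100.0, -1.0])

# With eps = 0, max(0, |e| - 0) is just |e|: the absolute error.
assert np.allclose(epsilon_insensitive(y, y_hat, eps=0.0),
                   np.abs(y - y_hat))
```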

<h2 id="24-for-completeness">2.4) For completeness</h2>

<p>For completeness, I list out the equation as I used it.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Y = ...  # Our labels, shape (n,); always positive in my problem
X = ...  # My forecast data, shape (n, n_features)
denominator = 1 / Y  # we can do this because Y is never 0

# Scaling: divide each sample's features and label by its label
scaled_Y = Y * denominator
scaled_X = X * denominator[:, None]

model = SGDRegressor(loss="epsilon_insensitive", epsilon=0)
model.fit(scaled_X, scaled_Y)
</code></pre></div></div>

<h1 id="closing-words">Closing words</h1>

<p>Although I managed to make <code class="language-plaintext highlighter-rouge">autograd</code> and <code class="language-plaintext highlighter-rouge">sklearn</code> work for my problem, the results were still not good. I suppose the takeaway is that you can do everything “right” and still not have things turn out your way.</p>

<p>In hindsight, this was a simple problem, but it was a good reminder of what it takes to be a good machine learning engineer: good software and math skills. I needed to set up minor infrastructure, massage data via a pipeline, and work through the <code class="language-plaintext highlighter-rouge">autograd</code> package, so being able to code was imperative. In addition, I needed to understand the math to arrive at the solution I did.</p>

<p>Please know that I am not blowing my own horn; in fact, I’m embarrassed about how long I took to find the solution. And even then, I stumbled backward into the solution.</p>

<p>Thank you for taking the time to read this, and happy holidays!</p>]]></content><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><category term="scikit-learn" /><category term="regression" /><category term="machine learning" /><summary type="html"><![CDATA[Spoiler: RTFM]]></summary></entry><entry><title type="html">Fundamentals Part 2: Hessians and Jacobians</title><link href="https://ianq.ai/Hessian-Jacobian/" rel="alternate" type="text/html" title="Fundamentals Part 2: Hessians and Jacobians" /><published>2018-01-25T00:00:00-08:00</published><updated>2018-01-25T00:00:00-08:00</updated><id>https://ianq.ai/Hessian-Jacobian</id><content type="html" xml:base="https://ianq.ai/Hessian-Jacobian/"><![CDATA[<p><strong>Spoiler</strong>: “H” is before “J”, which means that it’s the second-derivative. Obviously</p>

<p>This section builds off the last post, <a href="2018-01-20-Quick-and-dirty-calc-linalg.md">Fundamentals Part 1: An intuitive introduction to Calculus and Linear Algebra</a>; if you’re not familiar with calculus or linear algebra, I highly recommend starting there. If this is your first time seeing all of this, know that this section is more involved than the first fundamentals post. Be prepared to feel a little lost, but if you keep at it, I know you’ll get there (it took me a while to wrap my head around it all).</p>

<p>For each of the topics covered, Jacobian and Hessian, I try to provide 3 levels of information: a high level, a mid-level, and a low level for you to review, depending on your level of interest.</p>

<h1 id="1-a-quick-glossary">1) A quick glossary:</h1>

<p>0) x describes a scalar value, \(\vec{x}\) describes a vector, and <strong>X</strong> describes a matrix.</p>

<p>1) Vector-valued function is a function that returns a vector.</p>

<p>2) Matrix-valued function is a function that returns a matrix.</p>

<p>3) Tensor: a scalar value is a 0-order tensor, a vector is a 1-order tensor, and a matrix is a 2-order tensor. For the purpose of most Machine Learning applications, a tensor is just the n-th order generalization of these objects. We’ll come back to this idea later when considering the not-yet-defined Jacobian and Hessian.</p>

<p>4) \(\mathbb{R}^n\): basically means a point in n-dimensional space. For example, if you drew a Cartesian map, any point you pick has an (x,y) coordinate that describes it. Thus, we can say that the point exists in \(\mathbb{R}^2\). If you restrict the points to taking on “whole numbers” (aka Natural numbers, or counting numbers), you can say that it exists in \(\mathbb{N}^2\).</p>

<hr />

<h1 id="2-partial-derivatives">2) Partial Derivatives</h1>

<p>This section is a little awkward as it’s not covered in Calculus 101; however, discussing it is extremely important before broaching the rest of the blog post. A partial derivative is basically a derivative of a “part” of a multivariable function, i.e., we take the derivative along a single dimension while keeping all others constant.</p>

<p>The Jacobian and the Hessian are just first- and second-order derivatives of multivariate functions, i.e., partial differentiation applied once and twice, respectively.</p>
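<p>A small numeric illustration of a partial derivative (a standalone sketch): for \(f(x, y) = x^2 y\), differentiating with respect to \(x\) holds \(y\) fixed, and a finite difference recovers the analytic answer:</p>

```python
def f(x, y):
    return x ** 2 * y

def partial_x(f, x, y, h=1e-6):
    # central difference: vary x only, holding y constant
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

# Analytically, df/dx = 2xy, so at (x=3, y=2) we expect 12.
assert abs(partial_x(f, 3.0, 2.0) - 12.0) < 1e-5
```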

<hr />

<h1 id="3-jacobian">3) Jacobian</h1>

<p>The Jacobian is, in essence, the first derivative of some tensor. We begin with the following example: given a single point of data about you (age, height, favorite food), we want to find out how likely it is that you’re in certain clubs (reading, sleeping). We’ll reference this problem while discussing the Hessian as well.</p>

<h2 id="31-high-level">3.1) High-level</h2>

<p>The Jacobian describes how changing each of the input dimensions affects each of the output dimensions. Looking at our example, if we change our age, height, or favorite food, we can observe the (locally) linear effect on our club memberships.</p>

<p>Given our 3 input dimensions and our 2 output dimensions, we’d have 6 pairs to look at (3 possible things to manipulate for each of those 2 outputs). This intuition will come in handy if you read on.</p>

<h2 id="32-middle-level">3.2) Middle-level</h2>

<p>Consider our example from earlier:</p>

<p>1) Your input data, <strong>X</strong> \(\in \mathbb{R}^{1 \times 3}\).</p>

<p>2) You have some weight matrix, <strong>W</strong> \(\in \mathbb{R}^{3 \times 2}\)</p>

<p>3) Your output, \(\vec{y} \in \mathbb{R}^2\)</p>

<p>4) If we were to put some classifier algorithm, defining it by some function, \(f\), it would look like this:</p>

\[\vec{y} = f(\vec{x}) = W^{\top}\vec{x}\]

<p>5) If we calculated the Jacobian of this, it would look along the lines of</p>

\[\textbf{J}(f) =
\begin{pmatrix}
        \frac{\partial y_1}{\partial x_1}   &amp; \frac{\partial y_1}{\partial x_2} &amp; \frac{\partial y_1}{\partial x_3}\\
        \frac{\partial y_2}{\partial x_1}   &amp; \frac{\partial y_2}{\partial x_2} &amp; \frac{\partial y_2}{\partial x_3}\\
\end{pmatrix}\]

<p>where we’re iterating through each dimension of <strong>X</strong> (3 of them) and \(\vec{y}\) (2 of them). This can be expressed more compactly as:</p>

\[\textbf{J}(f) =
\begin{pmatrix}
        \frac{\partial \vec{y}}{\partial x_1}   &amp; \frac{\partial \vec{y}}{\partial x_2} &amp; \frac{\partial \vec{y}}{\partial x_3}\\
\end{pmatrix}\]

<p>where each column \(\frac{\partial \vec{y}}{\partial x_i}\) collects the partial derivatives of the output vector with respect to one input dimension.</p>
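<p>We can check this numerically. For a linear map the Jacobian is just the matrix of the map itself, everywhere; below, <code class="language-plaintext highlighter-rouge">A</code> is a made-up 2×3 matrix standing in for the weights that take 3 input features to 2 club scores:</p>

```python
import numpy as np

A = np.array([[1.0, -2.0, 0.5],
              [0.0,  3.0, 1.0]])  # hypothetical weights, (2, 3)

def f(x):
    return A @ x

def jacobian(f, x, h=1e-6):
    """Finite-difference Jacobian: perturb each input dimension."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x0 = np.array([0.3, -1.2, 2.0])
# For a linear map, the Jacobian equals A at every point.
assert np.allclose(jacobian(f, x0), A, atol=1e-5)
```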

<h2 id="33-low-level">3.3) Low-level</h2>

<p>If we think of our inputs as a point lying on some n-dimensional plane, we can think of our weights as some linear transformation, \(f\), that takes us from our current point, \(\vec{x} \rightarrow \vec{y}\). What the Jacobian then gives us is the best \(\textbf{local linear approximation}\) of how the points are warped in that small area.</p>

<p>Remember, our n-dimensional vector of features, \(\in \mathbb{R}^{1 \times 3}\), is just some point in n-dimensional space. If we take the ‘rate of change’ of the map that transforms it (<strong>W</strong>, in our concrete case) into an m-dimensional space, we get how much that map stretches and warps the small region around the point.</p>

<h2 id="34-where-might-you-see-it">3.4) Where might you see it?</h2>

<p>1) Loss function:</p>

<p>By imposing some restrictions on the neighborhood around a point, we can do some interesting work on making the values invariant (or nearly invariant) to small changes in that area. If the explanation sounds a little hand-wavy and you’d like a concrete example, check out <a href="https://www.youtube.com/watch?v=79sYlJ8Cvlc">Hugo Larochelle’s Contractive Autoencoder video</a>.</p>

<p>2) Discussions about local linearity in non-linear settings:</p>

<p>Neural Networks are known to be non-convex, but analyzing them from a linear standpoint can still be useful. I’d suggest watching the video above for an example of how it can be beneficial to analyze in this way.</p>

<p><strong>Note</strong>: I’m hoping to talk about convexity down the line as it is a fascinating topic.</p>

<hr />

<h1 id="4-hessian">4) Hessian</h1>

<h2 id="41-high-level">4.1) High-level</h2>

<p>The Hessian is essentially the derivative of the Jacobian.</p>

<p>In calculus 1, you might have learned that the derivative describes the rate of change, while the second derivative describes curvature and helps classify maxima/minima. The Hessian is the equivalent of that concept, applied to N-dimensional tensors in an abstract sense.</p>

<h2 id="42-middle-level">4.2) Middle-level</h2>

<p>Recall our Jacobian function from earlier:</p>

\[\textbf{J} (f) = \nabla f =
\begin{pmatrix}
        \frac{\partial \vec{y}}{\partial x_1}   &amp; \frac{\partial \vec{y}}{\partial x_2} &amp; \frac{\partial \vec{y}}{\partial x_3}\\
\end{pmatrix}\]

<p>If we then take the Jacobian of THAT, we end up with the following:</p>

\[\textbf{J}(\textbf{J} (f)) = \nabla (\nabla f) =
\begin{pmatrix}
        \frac{\partial^2 \vec{y}}{\partial x_1^2}   &amp; \frac{\partial^2 \vec{y}}{\partial x_2 \partial  x_1 } &amp; \frac{\partial^2 \vec{y}}{\partial x_3 \partial  x_1}\\
        \frac{\partial^2 \vec{y}}{\partial x_1 \partial  x_2 }   &amp; \frac{\partial^2 \vec{y}}{\partial x_2^2} &amp; \frac{\partial^2 \vec{y}}{\partial x_3 \partial  x_2 }\\
        \frac{\partial^2 \vec{y}}{\partial x_1 \partial  x_3}   &amp; \frac{\partial^2 \vec{y}}{\partial x_2 \partial  x_3 } &amp; \frac{\partial^2 \vec{y}}{\partial x_3^2}\\
\end{pmatrix}\]

<p>An interesting tidbit that the eagle-eyed among you may have noticed is that we went up in dimensions, from a compact vector representation to a compact matrix representation. Intuitively this makes sense, as we are now varying each variable with respect to every variable (hence denominators like \(\partial x_1 \partial x_2\)).</p>
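<p>For a scalar-valued function the Hessian stays an ordinary matrix, and we can compute it by finite differences (a standalone sketch with a made-up function); note that the mixed partials agree, so the matrix is symmetric:</p>

```python
import numpy as np

def g(x):
    # scalar-valued function of two variables: g = x0^2 * x1 + x1^3
    return x[0] ** 2 * x[1] + x[1] ** 3

def hessian(g, x, h=1e-4):
    """Finite-difference Hessian of a scalar function."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = h, h
            H[i, j] = (g(x + e_i + e_j) - g(x + e_i - e_j)
                       - g(x - e_i + e_j) + g(x - e_i - e_j)) / (4 * h * h)
    return H

x0 = np.array([1.0, 2.0])
H = hessian(g, x0)
# Analytic Hessian at (1, 2): [[2*x1, 2*x0], [2*x0, 6*x1]] = [[4, 2], [2, 12]]
assert np.allclose(H, [[4.0, 2.0], [2.0, 12.0]], atol=1e-3)
assert np.allclose(H, H.T)  # Schwarz's theorem: mixed partials agree
```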

<h3 id="43-low-level">4.3) Low-level</h3>

<p>Bear with me for a bit. If we were to expand out our \(\vec{y}\) into its components (\(y_1, y_2\)), we’d need another axis to put them on. So, our Hessian from above would need to “expand” into another dimension to store them. Still with me? I hope so because if you are, you’ll understand why:</p>

<p>1) I’m not going to actually list out the ‘tensor.’</p>

<p>2) I’ll call the ‘expanded’ version a 3-order tensor</p>

<p>3) When we differentiate a vector with regards to a vector, we increase dimensionality. See <a href="https://tminka.github.io/papers/matrix/minka-matrix.pdf">Old and New Matrix Algebra Useful for Statistics</a> for a summary of the different forms of differentiation. Also, <a href="https://en.wikipedia.org/wiki/Matrix_calculus#Layout_conventions">Wikipedia: Matrix Calculus Layout Conventions</a> has some interesting notes.</p>

<h2 id="44-where-might-you-see-it">4.4) Where might you see it?</h2>

<h3 id="441-convexity-of-the-loss-function">4.4.1) Convexity of the loss function:</h3>

<p><strong>Positive Semi-definite</strong>: if A is your matrix, then for any non-zero \(\vec{x}\), \(\vec{x}^T A \vec{x} \geq 0\)</p>

<p>If we calculate the Hessian of a loss function such as least squares (I’d suggest going online and working through one of the proofs), we see that it is positive semidefinite, which means the loss is convex. This brings the property of a guaranteed global minimum; actually reaching it with a method like gradient descent is another matter.</p>

<p>One example of a convex loss function is the logistic loss: because its surface is bowl-shaped (convex), any minimum we find is guaranteed to be the global minimum.</p>
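<p>As a quick empirical check (toy data of my own, not a proof): the Hessian of the logistic loss has the form \(X^T S X\) with \(S\) diagonal and non-negative, so its eigenvalues are non-negative no matter which weights we evaluate it at:</p>

```python
import numpy as np

# Toy data of my own; any X and any w will do, which is the point:
# H = X^T S X with S diagonal and non-negative, so H is PSD everywhere.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))      # 20 samples, 3 features
w = rng.normal(size=3)            # an arbitrary weight vector

p = 1.0 / (1.0 + np.exp(-X @ w))  # sigmoid of the scores
S = np.diag(p * (1.0 - p))        # entries in (0, 1/4]; the labels drop out
H = X.T @ S @ X                   # Hessian of the logistic loss at w

print(np.linalg.eigvalsh(H) >= -1e-10)  # all True: PSD, hence convex
```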

<h3 id="442-an-evaluation-metric">4.4.2) An evaluation metric</h3>

<p>The <strong>observed information matrix</strong> is the negative Hessian of the log-likelihood function, typically evaluated at the estimated parameters.</p>

<p>I’m not going deep into the details, but if we have some estimated parameters \(\theta\) (also called the weights in ML), one way of evaluating how well \(\theta\) fits our data is by first taking the <strong>log-likelihood</strong>:</p>

\[\mathcal{L} (X_1, X_2, ..., X_n \mid \theta) = \sum_{i=1}^{n} \log f(X_{i} \mid \theta)\]

<p>Taking the negative Hessian of our log-likelihood tells us how sharply the likelihood surface curves as we perturb the different parameters: sharper curvature means the data pin those parameters down more precisely.</p>
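<p>A worked example (my own, with made-up numbers): for \(n\) Gaussian samples with known \(\sigma\), the observed information about the mean works out to \(n / \sigma^2\), and a finite-difference check agrees — more data or less noise means sharper curvature:</p>

```python
import numpy as np

# n Gaussian samples with known sigma; the observed information about the
# mean is n / sigma^2 (sigma and n below are arbitrary choices of mine).
rng = np.random.default_rng(1)
sigma, n = 2.0, 50
x = rng.normal(loc=3.0, scale=sigma, size=n)

def log_lik(mu):
    return (-0.5 * n * np.log(2 * np.pi * sigma**2)
            - np.sum((x - mu)**2) / (2 * sigma**2))

# Negative second derivative of the log-likelihood, by central differences
mu, h = x.mean(), 1e-4
obs_info = -(log_lik(mu + h) - 2 * log_lik(mu) + log_lik(mu - h)) / h**2

print(obs_info, n / sigma**2)  # both ~12.5
```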

<h3 id="443-optimization">4.4.3) Optimization</h3>

<p>If we know the curvature of the surface, it can guide our gradient descent. Some papers talk about using the diagonals of the Hessian to estimate the optimal learning rate, as mentioned in <a href="https://arxiv.org/pdf/1703.00788.pdf">A Robust Adaptive Stochastic Gradient Method for Deep Learning</a>.</p>

<p><strong>Intuition</strong></p>

<p>Admittedly, I’ve not read the paper above in depth, and I’ve not read the papers it references at all. Still, I’d wager that using the diagonals of the Hessian lets them weigh the importance of the different parameters during gradient descent. I say this because not all parameters are equally informative, so it doesn’t make sense to treat them equally (especially since your error is typically just a scalar value that you propagate backward). I may be completely wrong, but this example stresses two things:</p>

<p>i) Read the paper.</p>

<p>ii) Intuition is only helpful so long as it’s right, so it falls to you to verify that you’re correct.</p>
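<p>To make the curvature intuition tangible (this is a generic illustration, <em>not</em> the algorithm from the paper above): on a separable quadratic, dividing each gradient entry by the corresponding Hessian diagonal reaches the minimum in a single step, whereas one global learning rate has to be small enough for the steepest direction:</p>

```python
import numpy as np

# Minimize 0.5 * (100*x0^2 + 1*x1^2): curvature differs 100x between axes.
diag_H = np.array([100.0, 1.0])   # the Hessian's diagonal (it is diagonal here)

def grad(x):
    return diag_H * x             # gradient of the quadratic above

x = np.array([1.0, 1.0])
x = x - grad(x) / diag_H          # Newton-like step, scaled per parameter
print(x)                          # [0. 0.]: the minimum, reached in one step
```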

<hr />

<h1 id="5-further-readings--references">5) Further Readings / References</h1>

<p>1) <a href="https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf">Matrix Cookbook - Page 8-16</a> - I’m personally not a fan of recommending this off the bat, as I think a collection of facts isn’t useful except as a reference.</p>

<p>2) <a href="http://www.cs.cmu.edu/~zkolter/course/15-884/linalg-review.pdf">Zico Kolter’s Linear Algebra Review and Reference</a> - great professor at CMU, and I found this guide to be very useful.</p>

<p>3) <a href="https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw">3Blue1Brown’s channel</a></p>

<p>4) <a href="https://tminka.github.io/papers/matrix/minka-matrix.pdf">Old and New Matrix Algebra Useful for Statistics</a></p>

<hr />]]></content><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><category term="fundamentals" /><category term="calculus" /><category term="optimization" /><category term="machine learning" /><summary type="html"><![CDATA[Spoiler: “H” is before “J”, which means that it’s the second-derivative. Obviously]]></summary></entry><entry><title type="html">Fundamentals Part 1: An intuitive introduction to Calculus and Linear Algebra</title><link href="https://ianq.ai/Quick-and-dirty-calc-linalg/" rel="alternate" type="text/html" title="Fundamentals Part 1: An intuitive introduction to Calculus and Linear Algebra" /><published>2018-01-20T00:00:00-08:00</published><updated>2018-01-20T00:00:00-08:00</updated><id>https://ianq.ai/Quick-and-dirty-calc-linalg</id><content type="html" xml:base="https://ianq.ai/Quick-and-dirty-calc-linalg/"><![CDATA[<p><strong>Spoiler</strong>: The pre-calc of ML</p>

<p>As you’ve probably heard, calculus is imperative for Machine Learning. However, there is a definite emphasis on differentiation compared to integration, so this series of posts will build from simple derivatives to Jacobians and Hessians. Ideally, at the end of this series, if you read a paper that mentions one of the topics above, you’ll have a rough idea of why the authors chose to do what they did and what their choice means for the results.</p>

<h1 id="background">Background</h1>

<p>If you’ve already taken Calculus or Linear Algebra, feel free to skip ahead to the <a href="2018-01-25-Hessian-Jacobian.md">next tutorial, Hessians and Jacobians</a></p>

<h1 id="1-derivatives-101">1) Derivatives 101</h1>

<p>The equation below describes both the equation of a straight line as well as what happens if you take the derivative of that straight line with respect to some input value:</p>

\[\begin{align*}
y &amp;= mx + c\\
\frac{d y}{dx} &amp;= m
\end{align*}\]

<p>Typically in a calculus class, we’d talk about the rate of change of \(y\) with regards to \(x\). In other words, how much does \(y\) change as \(x\) changes? In this case, we see that \(y\) changes by a factor of m for every unit that \(x\) changes. For the moment, we are focused on scalar values, but this concept will generalize to vectors and matrices (which segues us into….)</p>
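<p>A quick numerical sanity check (my own snippet): the slope of \(y = mx + c\) is \(m\) at every \(x\), no matter what \(c\) is:</p>

```python
# Numeric check that dy/dx = m for a straight line, at several x values
m, c = 3.0, 7.0

def y(x):
    return m * x + c

h = 1e-6
slopes = [(y(x + h) - y(x)) / h for x in (0.0, 1.0, -5.0)]
print(slopes)  # each entry is (approximately) 3.0, independent of x and c
```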

<hr />

<h1 id="2-linear-algebra-101">2) Linear Algebra 101</h1>

<p>Math often deals with the concept of abstraction. For example, we often deal with numbers, e.g., 5 or 100. In Linear Algebra, we are concerned with collections of numbers (vectors), e.g., a collection of (5, 10), or a collection of those collections (matrices), and further abstractions. To make this notion concrete, consider the following example:</p>

<h2 id="21-scalars">2.1) Scalars</h2>
<p>Edit: I have no idea if the following examples describe actual streets and avenues, so I’d like to apologize beforehand.</p>

<p>Say that we were somewhere in New York City, which works on a <a href="https://thegreatestgrid.mcny.org/">grid system</a>. If I were on 4th and 5th, while you were on 10th and 7th, our (x, y) coordinates could be described as (4, 5) and (10, 7), respectively. Equivalently, our coordinates could be described as the following:</p>

\[\text{My location:=}
\begin{pmatrix}
4\\
5\\
\end{pmatrix}\]

<p>and</p>

\[\text{Your location:=}
\begin{pmatrix}
10\\
7\\
\end{pmatrix}\]

<p>We decide to meet for coffee, but since neither of us drives, we agree to meet in the middle as that is easiest. So, we would meet at:</p>

\[x := \frac{4 + 10}{2} = 7\]

\[y := \frac{5 + 7}{2} = 6\]

<p>which corresponds to 7th and 6th (7, 6).</p>

<p>We saw in the computation above that it can be tedious to write out a separate equation for each of our (x, y) coordinates. This complexity only grows as we add more information (e.g., which shop to meet at). What if we had a compact way of representing my location, your location, and the averaging operation that determines where we should meet? Here, I want you to keep two concepts in the back of your mind:</p>

<p>1) The concept of abstraction on scalars.</p>

<p>2) The concept of a coordinate system and what it means for something to be in the coordinate system.</p>

<h2 id="22-abstractions-on-scalars-vectors">2.2) Abstractions on Scalars: Vectors</h2>

<p>At the start of this Linear Algebra review, I said that Linear Algebra is concerned with numbers or collections. So far, we have already discussed one such collection: a coordinate system. In that case, my location is described as the collection of (4, 5), and yours is represented by (10, 7). The top element (4 and 10) represents the street, and the bottom represents the avenue.</p>

<p>Congratulations! We’ve just worked through the concept of a vector, albeit in a particular setting: New York streets and avenues. Let’s take a step back and see our locations for what they are: specific instances of an abstract concept. We could just as well write:</p>

\[X_1:= 
\begin{pmatrix}
a\\
b\\
\end{pmatrix}\]

\[X_2 := 
\begin{pmatrix}
c\\
d\\
\end{pmatrix}\]

<p>where \(X_1\) <strong>CAN</strong> represent my street-avenue, but it could just as well describe my latitude-longitude or my age-height. Whatever the case, if we are then looking for the average of these two containers, \(X_1\) and \(X_2\), we can represent them as the following:</p>

\[\text{the middle := } \frac{X_1 + X_2}{2}\]

<p>This equation holds for both the street number and the avenue (our x and y coordinates).</p>

<p>Note that we can add more information, e.g., a <code class="language-plaintext highlighter-rouge">Z</code> coordinate representing the shop number to meet at, or the corner I’m on, without changing anything: our “middle” is still represented by the same general equation above.</p>
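<p>Here is the coffee-meeting computation written compactly (the third coordinate for “you” is made up for illustration); the midpoint formula is identical whether the vectors hold two numbers or three:</p>

```python
import numpy as np

# The same midpoint formula works for any number of coordinates
me = np.array([4.0, 5.0])
you = np.array([10.0, 7.0])
print((me + you) / 2)            # [7. 6.] -> 7th and 6th, as computed by hand

# Adding a Z coordinate (which shop) changes nothing about the formula
me3 = np.array([4.0, 5.0, 6.0])
you3 = np.array([10.0, 7.0, 2.0])
print((me3 + you3) / 2)          # [7. 6. 4.]
```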

<h2 id="23-abstractions-on-scalars-and-vectors-matrices">2.3) Abstractions on Scalars and Vectors: Matrices</h2>

<p>We can then expand on our scalars and vectors to a collection of collections. Say we had two other friends; then all of our locations could be described as</p>

\[\text{Us := }
\begin{pmatrix}
4 &amp; 6 &amp; 10 &amp; 12\\
5 &amp; 7 &amp; 7 &amp; 15\\
\end{pmatrix}\]

<p>which would be a matrix. Phew, that was a mouthful.</p>

<h2 id="24-the-abstracted-coordinate-system">2.4) The abstracted coordinate system</h2>

<p>When we first introduced the idea of vectors, we discussed it in the sense of streets and avenues on New York’s grid system. In that case, our locations would be described by whole numbers (we can’t be at avenue 10.5).</p>

\[\text{My location: }
\begin{pmatrix}
4\\
5\\
\end{pmatrix}\]

<p>However, if we consider latitude and longitude, it makes sense that we can describe those numbers as numbers with some decimal point. For example, this random location I picked in New York has a latitude-longitude of (40.712776, -74.005974).</p>

<h3 id="241-counting-numbers">2.4.1) Counting Numbers</h3>

<p>The first example, street-avenue, pertains to the natural numbers: we say that the street and the avenue, the individual elements of our collection, exist \(\in \mathbb{N}\), the set of natural numbers (also known as the counting numbers).</p>

<h3 id="242-decimal-point-numbers">2.4.2) Decimal point numbers</h3>

<p>In the case of latitude-longitude, the individual elements of our collection exist \(\in \mathbb{R}\), the real numbers (numbers that may have a decimal part). We denote these scalar values as elements of the sets \(\mathbb{N}\) and \(\mathbb{R}\), respectively.</p>

<h3 id="243-collections-of-scalars-vectors">2.4.3) Collections of Scalars: Vectors</h3>

<p>If we talked about the collection, as opposed to elements within the collection, my street-avenue would then be:</p>

\[\text{My location: }
\begin{pmatrix}
4\\
5\\
\end{pmatrix}\]

<p>such that my location can be described as being in the naturals, \(\in \mathbb{N}^2\), a vector of natural numbers. My latitude-longitude can be described as \(\in \mathbb{R}^2\), a vector of real numbers. If we then added another number, e.g., the shop that I’m in, we would have</p>

\[\text{My location: }
\begin{pmatrix}
4\\
5\\
6 \\
\end{pmatrix}\]

<p>and my location can thus be represented as \(\text{my location } \in \mathbb{N}^3\). This same concept extends to matrices. Consider our group of friends from earlier:</p>

\[\text{Us: }
\begin{pmatrix}
4 &amp; 6 &amp; 10 &amp; 12\\
5 &amp; 7 &amp; 7 &amp; 15\\
\end{pmatrix}\]

<p>Our locations can then be described as \(\text{Us} \in \mathbb{N}^{2 \times 4}\). And that’s it for the linear algebra you’ll need for the rest of this series!</p>
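<p>If you want to poke at these sets programmatically, array shapes give a loose analogue (a sketch of my own): a vector in \(\mathbb{N}^2\) has shape (2,), and our matrix of friends in \(\mathbb{N}^{2 \times 4}\) has shape (2, 4):</p>

```python
import numpy as np

my_location = np.array([4, 5])   # a vector "in N^2": shape (2,)
us = np.array([[4, 6, 10, 12],
               [5, 7, 7, 15]])   # a matrix "in N^{2x4}": shape (2, 4)

print(my_location.shape)  # (2,)
print(us.shape)           # (2, 4)
print(us[:, 0])           # the first column is my location: [4 5]
```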
<h1 id="3-further-readings--references">3) Further Readings / References</h1>

<p>1) <a href="http://www.cs.cmu.edu/~zkolter/course/15-884/linalg-review.pdf">Zico Kolter’s Linear Algebra Review and Reference</a> - great professor at CMU, and I found this guide to be handy.</p>

<p>2) <a href="https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw">3Blue1Brown’s channel</a></p>]]></content><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><category term="fundamentals" /><category term="calculus" /><category term="optimization" /><category term="machine learning" /><category term="math" /><summary type="html"><![CDATA[Spoiler: The pre-calc of ML]]></summary></entry><entry><title type="html">Pusheen The Limit</title><link href="https://ianq.ai/Pusheen-The-Limit/" rel="alternate" type="text/html" title="Pusheen The Limit" /><published>2018-01-19T00:00:00-08:00</published><updated>2018-01-19T00:00:00-08:00</updated><id>https://ianq.ai/Pusheen-The-Limit</id><content type="html" xml:base="https://ianq.ai/Pusheen-The-Limit/"><![CDATA[<p>Note: The code can be found <a href="https://github.com/IanQS/quitPusheenMeAround">here: quitPusheenMeAround</a></p>

<p>I <strong>love</strong> Pusheen, and I’m also a fan of playing around in my terminal. After talking to someone the other day, I was inspired to work on this; she mentioned how an officemate commented on the Pusheen that popped up whenever she opened her shell.</p>

<p>I didn’t use any statistics beyond the standard deviation, and only for a small portion of the image segmentation (cat vs. background). Having said that, I think this was a fun exercise to occupy my time.</p>

<h2 id="initial-problem"><strong>Initial Problem</strong></h2>

<p>A quick Google search revealed about 3 Pusheen ASCII art images online, which is disappointing given how many Pusheen images and GIFs there are. After a long week at work and some climbing earlier today, I’m ready to spend this Friday night in. So, it looks like I’m making a Pusheen ASCII art converter and some shell scripts. Also, Pusheen sounds like pushin’, which opens up several cute GitHub project names.</p>

<hr />

<h2 id="process"><strong>Process</strong></h2>

<p>1) Create a folder wherein we will store many Pusheen images.</p>

<p>2) Load, resize, and convert those images to ASCII art.</p>

<p>3) Make some shell scripts.</p>

<p>4) Push the code.</p>
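<p>Step 2 is the interesting one. A minimal sketch of the idea (this is <em>not</em> the code in the repo; the character ramp and the striding choices are mine) maps pixel brightness onto a dark-to-light string:</p>

```python
import numpy as np

CHARS = "@%#*+=-:. "  # dark-to-light "gradient"; the repo's ramp may differ

def to_ascii(img, width=40):
    """img: 2-D float array with values in [0, 255]. Downsample by striding
    (rows twice as sparsely, since terminal characters are tall), then map
    each pixel to a character by its brightness."""
    step = max(1, img.shape[1] // width)
    small = img[::2 * step, ::step]
    idx = (small / 255 * (len(CHARS) - 1)).astype(int)
    return "\n".join("".join(CHARS[i] for i in row) for row in idx)

# A synthetic left-to-right gradient stands in for a real Pusheen image
img = np.tile(np.linspace(0, 255, 80), (40, 1))
print(to_ascii(img))  # rows fade from '@' on the left toward ' ' on the right
```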

<h2 id="immediate-problems"><strong>Immediate problems</strong></h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                  }}
               }|))|)           ))   |
              )      )         )  xX   }
             | uhMMoQ )}     }| Q#WWWk  |}}))||||||||||||||)}}
            / O&amp;8oaW%h         d%Whbo%Mc                      )|)}
           / w%Wdpdpo%MY|/)/jxo%*pdbpb8&amp;0XQZwdbkhaaaaaaakbpZCj    |)
    }}}}} ) m%Mpbbbbpk8&amp;WWWWW&amp;8apbbbbddW8W&amp;WWWM8888888W#88888W#hZ/  |)
         } J8Wpbbbbbbpa8#o8*o8opbbbbbbddkbdpqqwhMWWWW#dp#WMWW&amp;8W&amp;&amp;oQ  |
   vCLCUzrtW8bdbbbddbbbabbhbbakbbbdbbbbbddbhao**M#ooabbbk#WWWMapdaW8#Q  )        }||||}
 ) b&amp;WMMMMW%adbbbdbbdbdpppahppdbbdbddbbbd#&amp;WWMM##hpddbbbddkhkbdbbdpkW%a/ )      )      }
     rUOmd%Wpbbbd#88*ddo&amp;o%&amp;*&amp;hddM88adbbdoM****oohbbbbbbbbddddbbbbbdpa%WJ )    | vdoadv }
 ) wWWWWM&amp;%hdbbbd*88*dbkMB&amp;8B#bbdM8&amp;adbbdoWMMMMMW#dbbbbbbbbbbbbbbbbbbdd&amp;8C ) }/ m8%88%&amp;U})
 } CqLc)r&amp;Wpbbbbbdbbdbbdd*WMopbbbdbddbbbbdpppppppdbbbbbbbbbbbbbbbbbbbbdd8&amp;u|/  p%W#WWWBZ |
        qBadbbbbbbbbbbbbbpdddbbbbbdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbdkBa  ra%Wddko%Mt)}
   }))t #%bdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbpM%wh&amp;%&amp;WWok&amp;&amp;U )
     } x8Wpbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbda888ob*WW&amp;%WY }
     | Z%*pbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbdbMWWhqdM%8h} |
     / bBhdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbd#W&amp;&amp;W&amp;WhY })
     / h%kdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbp#%MobO)  |
     / o%kdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbpM8r    |}
     / o%kdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbd&amp;&amp;t/|)
     / k%hdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb8# |
     ) Z%*pbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbdaBb /
     }}j&amp;&amp;dbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbpM%Y )
      / kBhdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbdk%o |
      })rW8ddbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbp&amp;8z}}
       ) C8&amp;ddbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbpM%Z )
        ) Q&amp;8hpdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbpdW%w |
         | u*%Whppdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbdpd*8WQ |
          )  0*8&amp;obbdpppppppppdbdpppppppppppppppppdbdppppppppppdbb*&amp;&amp;b  |
           }|  YkB#waWWWWWWWWMopk#MMMMMMMMWWWWWWMMophMMWWWWWWWW*qaBa|  )
             )/  &amp;8*%Wwqqqppp&amp;%kMBhdbdddddddddddd88bW%bdppqqqwMBoMBO j
               ))Jh*ku       LMWWk               mMWWp        ch##p|)
                      |)|||)|  j  )||||||||||||||  x  /|||||)|
                 }|||}       })  |               )}  )        }|)||
                               }                   }
</code></pre></div></div>

<p>looks FAR better than</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$$$$$$$$$$$$$$$$$$$$$$$$$$$$$@$$$@@M*#oa@@@@@@@$@@$$$@@$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$@$BB@@$$@@$W*#q**o@$$$$$$$$@8MMW%$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$@@8b*#&amp;B@$$WaWppw##W8B&amp;B@@@B&amp;M*aMa8$@$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$@B&amp;###*%%aMbqppq#&amp;*MMWM&amp;M&amp;#hpmh#*@@@@@$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$@%8BBBW##MWkqdppqk*d*MkW#aapqpppM#B$$$$$@$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$@@@Wo**#W8hMMdqpwqpppqpdpqbpqqppppwooM@%%B$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$@$B8%&amp;W#*W*qppb#opppqqpppqppqpppqpk&amp;*W#o*#%@$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$@a*#qppqa%&amp;pqwh*qppppbkppqh#**&amp;oWW&amp;W%$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$@@@@o*#wppppqpqpo#MWppppp&amp;Boqpdpqm*oW$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$@@@**#qppqwqbao*aqc*Mqppphodpppo###&amp;#W8@$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$@$$WaMwqqpko**kkbZuud#wpppqwpppppdddoW*M#8$@$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$@Wo&amp;pdao*od0uh#aWhjZMppppppppppppppwMoW%@$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$@@@*o#oM#hbZufUX#MoMpcCMdqppppppppppppqk#a@$@$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$@$$M*&amp;hoMd*o*#wXUYwqCzCu*opqqpppppppppppw#oW$@$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$B8#*dddYh&amp;hkWonJUrOwUXz#M*#aqppppppppppqdMh%@@$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$@@&amp;##&amp;kQUnvuqaokLUJXo#oMmvMoqkopppppppppppko&amp;#%$@$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$@$@WMwqa***akwJ/ uXUUM#a&amp;q/d8#bwppppppppppdMW&amp;M#@$@$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$@@@##hqZOOwdko#ohdZLcXpbQzcpMqdppppppppppppo#MWoB$@$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$@$B&amp;##W*okqwZZpba**oabqZ0zbMwpppppppppppppqpq#MB$@$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$W#oph*###*hbwOZqdka**oMWppppppppppppppppmo##@@$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$@B8hwqwqppka*##obpqZ0Om0M*wpppppppppppppk*MWM@$$$$@@$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$&amp;Whqpppppqqwqba**##*oo*#bqppppppppppppp#WW&amp;8@$@@$$$@$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$&amp;Wkqpppppppppqqwqqdkhkkqqppppppppppppppk**Wo%$$$@BB$@$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$8Wkqpppppppppppppppqqqqppppppppppppppppqqq#*%$@&amp;WW#&amp;$@$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$8WkqppppppppppppppppppppppppppppppppppppppM*%$Mh&amp;W&amp;*%@@$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$&amp;WaqppppppppppppppppppppppppppppppppppppqbWW@@#Wka&amp;*%$@$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$#**wppppppppppppppppppppppppppppppppppppwo##@#MWdooo$@$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$@$%aMqpppppppppppppppppppppppppppppppppppppMM&amp;##M#&amp;Wh%@@$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$@@**omppppppppppppppppppppppppppppppppppqo&amp;MMW#maWb%$@$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$@$@oMawppppppppppppppppppppppppppppppppwkWWah&amp;M*#o8$@$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$@$B*#odqqpppppppppppppppppppppppppppqqa#MMoo##&amp;8@$@$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$@$@WW#okqpwqqqqqqqqqqqqqqqqqqqqqpqph*#MW&amp;8&amp;&amp;%@$$@$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$@$$B8MM*p#aaaaaaaaaaaahhhahhhokp**#8%$$$$$$$$@@$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$@@$$@####o**##M*M###MW&amp;WM&amp;&amp;W*M**o%@$@@$$@@@$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$@@$%W*MB@@@@@@@@@@@@@@@@@@M*##B$@@$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$@$$@$$$$$$$$$$$$$$$$$$$$$@@$$@$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$@@$@@@@@@@@@@@@@@$@@$$@@$$@@$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$@$$$$$$$$$$$$$$$$$$$$$@$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
</code></pre></div></div>

<p>this.</p>

<h2 id="solution"><strong>Solution</strong></h2>

<p>We can apply some heuristics to clear out the background. One heuristic is that Pusheen is typically at the center of the image, which means we can probably use the corner pixels as a threshold for removing the background.</p>
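<p>A sketch of that heuristic (the exact statistics and the white fill value are my assumptions, not necessarily what the repo does): sample the four corner pixels, and treat anything that looks like them as background:</p>

```python
import numpy as np

def remove_background(img, k=2.0):
    """img: 2-D grayscale array. Returns a copy where pixels within k standard
    deviations of the mean corner value are treated as background (set to
    white here; the fill value is a choice, not the repo's)."""
    corners = np.array([img[0, 0], img[0, -1], img[-1, 0], img[-1, -1]],
                       dtype=float)
    mu, sd = corners.mean(), corners.std() + 1e-9
    mask = np.abs(img - mu) <= k * sd   # "looks like a corner" = background
    out = img.copy()
    out[mask] = 255.0
    return out

# Toy image: a dark "cat" on a mid-gray background
img = np.full((6, 6), 128.0)
img[2:4, 2:4] = 10.0
cleaned = remove_background(img)
print(cleaned[0, 0], cleaned[2, 2])     # 255.0 10.0: background gone, cat kept
```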

<hr />

<h2 id="more-problems--solutions"><strong>More Problems + Solutions</strong></h2>

<p>In the real world, you’d probably want to zero out everything that isn’t Pusheen, but since this image will be piped to the terminal, a non-empty background actually helps Pusheen contrast with the characters around the image.</p>

<p>1) We need to add a background (and after we went through all that trouble to get rid of it….)</p>

<p>We are using <code class="language-plaintext highlighter-rouge">img.max()</code> to scale our image, so one hacky solution is to use the max value and scale it by some percentage.</p>

<p>2) Because we chose to scale the image before changing the background, our chosen parameters are all wonky. We can simply swap the order of operations: change the background first, then scale the image.</p>

<p>3) However, we now have to contend with scenarios where the background is black or white. We simplify the problem by checking if the image is below some “sensible” threshold, and if it is, we set it to some percentage of the max.</p>
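<p>Putting fixes 1–3 together as a sketch (the fraction and the “sensible” floor below are my guesses at reasonable values, not the repo’s): derive the background fill from the image’s own maximum, guard against mostly-dark images, apply it, and only then downsample:</p>

```python
import numpy as np

def fill_value(img, frac=0.9, floor=50.0):
    v = img.max() * frac              # "some percentage of the max"
    return floor if v < floor else v  # guard for nearly-black images

def prepare(img, mask, width=40):
    out = img.astype(float)
    out[mask] = fill_value(out)             # 1) set background BEFORE scaling
    step = max(1, out.shape[1] // width)    # 2) then downsample
    return out[::step, ::step]

img = np.full((8, 8), 200.0)
img[3:5, 3:5] = 10.0                  # the "cat"
mask = img == 200.0                   # background, by construction here
small = prepare(img, mask, width=4)
print(small.shape, small[0, 0])       # (4, 4) 180.0
```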

<p><strong>Special Thanks</strong></p>

<p>* <sub><sup> <a href="https://gist.github.com/cdiener/10491632">ASCII converter 1</a> for providing me with a starting point for code, and <a href="https://www.geeksforgeeks.org/converting-image-ascii-image-python/">ASCII converter 2</a> for providing a more detailed ‘gradient’ of colors for Pusheen to exist in. Both were extremely useful in providing a starting point for the ASCII art converter </sup></sub></p>

<p>* <sub><sup> <a href="http://flothesof.github.io/removing-background-scikit-image.html">Frolian - flothesof</a> for making me realize that OpenCV is for lazy people (lazy people who happen to be able to figure out how to install it ¯\_(ツ)_/¯) </sup></sub></p>