<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://ianq.ai/feed.xml" rel="self" type="application/atom+xml" /><link href="https://ianq.ai/" rel="alternate" type="text/html" /><updated>2026-03-03T07:40:51-08:00</updated><id>https://ianq.ai/feed.xml</id><title type="html">Ian Quah</title><subtitle>Ph.D. Student in Neuro ∩ ML</subtitle><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><entry><title type="html">The Uncertainty Of Shaving the Yak</title><link href="https://ianq.ai/hard-problems/" rel="alternate" type="text/html" title="The Uncertainty Of Shaving the Yak" /><published>2025-09-15T00:00:00-07:00</published><updated>2025-09-15T00:00:00-07:00</updated><id>https://ianq.ai/hard-problems</id><content type="html" xml:base="https://ianq.ai/hard-problems/"><![CDATA[<p><strong>Alt Title</strong>: The Uncomfortable Truth About Hard Problems</p>

<h1 id="the-discomfort-of-hard-problems">The Discomfort of Hard Problems</h1>

<p>At the beginning of my second year in my Ph.D., I remember having a discussion with my PI where he told me to run a validation for my experiment. Instead, I dragged my feet and chose to clean up my code (adding logging, modularizing functions and general refactoring). I told myself that my cleanup was worthwhile: when the validation showed positive results, I’d need scalable, correct code for larger simulations. But I was really avoiding the validation because I knew that if my experiment failed the validation step, I’d be back to the drawing board after “wasting” a year. Instead of embracing the possibility of failure and doing the emotionally hard work, I spent time on busy-but-productive tasks. As my PI suspected, the validation killed that experiment. Even though I knew experiments fail for countless reasons and that failure wasn’t a reflection of my abilities, I couldn’t help but feel like I had failed.</p>

<p>Having said that, that’s not the story I’m trying to tell here - what I described happens all the time in research and is often talked about. What I’m here to discuss is what happened over a year later, when I found a new substrate to test the idea and discovered that the code cleanup had actually saved me time; cleaning up code (especially messy code) more than a year after writing it is nightmarish and bug-prone.</p>

<p>This experience reinforced something I’d been thinking about for a while: hard problems aren’t just difficult because they require specialized knowledge, or involve many complex steps; they’re hard because they exist in a liminal space where you genuinely can’t tell if you’re making progress or just spinning your wheels. We’re trained to break down problems into manageable steps, but what happens when you can’t tell if those steps are worthwhile? Is the work you’re putting in going to pay off down the line, or is it all wasted effort?</p>

<h2 id="the-yak-shaving-paradox">The Yak Shaving Paradox</h2>

<p>In the tech industry, there’s a concept called “shaving the yak” that perfectly captures this dilemma; it has two almost contradictory definitions:</p>

<ol>
  <li>
    <p>Any apparently useless activity which, by allowing one to overcome intermediate difficulties, allows one to solve a larger problem.</p>
  </li>
  <li>
    <p>A less useful activity done consciously or subconsciously to procrastinate about a larger but more useful task.</p>
  </li>
</ol>

<p>This blog post has been the longest (time-wise) I’ve ever worked on, because it was so hard to put into words what I was experiencing. It wasn’t until I was talking to one of my friends that the phrase about shaving the yak popped into my head, and everything flowed. That conversation made me realize that part of what makes research so difficult is that the same work can be either essential groundwork or elaborate procrastination (the busy-but-productive work), and you often can’t tell which you were doing until after the fact.</p>

<p>In my original example, I was clearly shaving the yak. If you had asked me after I killed that experiment, I would have lamented the time I had wasted and said that I clearly fell into the second camp. Yet even then, the improvements I made to my code have made revisiting the problem far easier. This ambiguity is what makes hard problems so uniquely uncomfortable.</p>

<p>Consider what happened immediately after I killed that experiment. I had to write a report on my research from that first year, and I now faced another dilemma: the experiment was already “dead”, so do I put a lot of time into that report by generating and polishing figures, rerunning simulations that would take a while, and doing a more thorough literature review -OR- do I throw in the towel, say that the results weren’t what I expected, and move on to the next problem? Both choices felt like they could be yak shaving. Polishing a failed experiment might be pointless busy work to avoid the difficult task of identifying a new research problem, or it might be essential documentation of what I’d learned. Cutting my losses and moving on to the next project might be productive progress, or it might be my way of avoiding the hard work of understanding why the first approach failed beyond just throwing my hands up and saying “biology is messy”. The reason this ambiguity persists is that hard problems unfold over such long timescales that you’re making decisions with incomplete information about what will actually matter.</p>

<p>I can’t help but think of reinforcement learning and the sparse reward problem, where we only get feedback after a long time, as well as the credit assignment problem, where it is difficult to determine which of the steps we took led to our current outcome.</p>

<h2 id="when-our-ego-gets-involved">When Our Ego Gets Involved</h2>

<p>The uncertainty becomes even more uncomfortable when we get emotionally invested in our approach. As I mentioned in my previous post, <a href="./2025-09-06-back-to-school.md">I can’t imagine going back</a>:</p>

<blockquote>
  <p>What I mean is this: you need a sort of “balance” where you have to care about your research, and I mean <strong>deeply</strong> care about your research - it has to fill you with wonder and make you really contemplate the hard questions - but at the same time you have to be able to take your failures (and there will be many) on the chin and try the next idea</p>
</blockquote>

<p>Many of us do what <a href="https://calnewport.com/">Cal Newport</a> dubs “knowledge work”: work that benefits from thinking deeply about problems and solving them. Unsurprisingly, this kind of work naturally ties our self-worth to our ideas, methods, and solutions to the challenges we face; not unlike how a carpenter might tie their self-worth to the furniture they craft and the design decisions they make given a client’s requirements. When you can’t tell whether your work is productive or just elaborate avoidance, questioning your process and methods feels like questioning your judgment and intelligence. Most of us took some form of science class during our schooling years, where we learned how to construct hypotheses by searching through the literature, forming a testable question, and, most importantly, defending our position.</p>

<p>My advisor asking me to carry out that validation scared me because, in a sense, that project was a part of who I was - it not only symbolized a year’s worth of time and effort, but was the product of everything I was: a <em>scientist</em>. Thankfully, because I had already established a sense of self-worth outside of academia, I shrugged it off quickly, but the experience stuck with me.
Learning to separate your self-worth from individual experiments or projects is imperative to navigating this ambiguity. Instead of being precious with your ideas, you accept that failure is multi-faceted and multi-factored: failure can happen because the biology is just too complicated, the recordings weren’t of high enough resolution, or there was human error in the data collection.</p>

<h2 id="going-back-to-school">Going back to School</h2>

<p>That conversation I mentioned at the start was a bitter pill to swallow - it was emotionally painful to kill an experiment I’d spent a year working on. But it taught me something important about not being precious with ideas and learning to reflect on which form of yak shaving I’m doing (even if it’s an uncertain activity). In my previous blog post, I mentioned going back to academia because the research I had my eye on was progressing quickly, but that’s not the whole truth. I went back because I wanted to develop the skill of navigating both forms of yak shaving in a “sink or swim” environment.</p>

<p>I don’t think the discomfort I mentioned at the start, the push and pull between productive and unproductive work, ever fully goes away. What’s worse is how easy it is to forget this dynamic and get wrapped up in busy work - it happens all the time, and even to the best of us. But I <strong>do</strong> think you can develop better judgment about when to push through and when to step back, though it’s not something you can easily learn alone. You need honest mentors and a strong support network who will give you fair but direct feedback about your work. The skill isn’t eliminating ambiguity but learning to navigate it with guidance and practice.</p>

<p>This challenge isn’t exclusive to academia, and I’m sure it applies across fields where long-term, uncertain work is the norm. To tie things back to the tech industry, there’s a reason junior engineers don’t work on “bigger picture” projects: it’s hard to know where a field might turn, and to pivot seamlessly, when you can’t yet distinguish necessary preparation from elaborate avoidance, such as jumping to the shiny new tech stack.</p>]]></content><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><category term="grad school" /><summary type="html"><![CDATA[Alt Title: The Uncomfortable Truth About Hard Problems]]></summary></entry><entry><title type="html">I can’t imagine going back</title><link href="https://ianq.ai/back-to-school/" rel="alternate" type="text/html" title="I can’t imagine going back" /><published>2025-08-26T00:00:00-07:00</published><updated>2025-08-26T00:00:00-07:00</updated><id>https://ianq.ai/back-to-school</id><content type="html" xml:base="https://ianq.ai/back-to-school/"><![CDATA[<p><strong>Spoiler</strong>: I could :)</p>

<p>A bit about me: I left the tech industry after 6 years, arguably at the peak of my career, to go back to grad school. When I shared this decision with my friends, after the initial congratulations and “I’m so happy for you”s, I often got a few questions: 1) why do you want to go back? And 2) what are you researching? I assume they were asking either out of politeness or to keep the conversation going, but a few years into my Ph.D. I thought I should finally sit down and answer the (arguably) easier one: the first.</p>

<p>If you’re here reading this post, you’re in one of a few camps: you’re contemplating grad school yourself after a few years away working somewhere, to which I say: know that my experience is that of someone who worked in tech and does computational neuroscience (highly related to my undergrad and to what I worked on in industry), so YMMV; or you’re a friend who knew that I waffled through my in-person response and you’re hoping that I’ve fleshed out my thoughts; or I just shared this on LinkedIn or one of the grad school subreddits, in which case hi!</p>

<!--toc:start-->

<ul>
  <li><a href="#context">Context</a>
    <ul>
      <li><a href="#its-all-about-timing">It’s All About Timing</a></li>
      <li><a href="#you-have-to-be-ready">You Have to be “Ready”</a></li>
    </ul>
  </li>
  <li><a href="#returning-to-academia">Returning to Academia</a></li>
  <li><a href="#the-opportunity-cost-of-graduate-school">The Opportunity Cost of Graduate School</a></li>
  <li><a href="#closing-words">Closing Words</a>
<!--toc:end--></li>
</ul>

<h1 id="context">Context</h1>

<p>To properly answer why 2022 felt like the right time, we need to go back another 10 years, to 2012, when I first applied to CMU. At the time, a standard college application essay prompt was “Why do you want to join $COLLEGE_NAME?”, and for my CMU application I wrote about wanting to build brains from computers. The exact details I no longer remember, but it was something about researching how we could hook up enough CPUs together to simulate every neuron in the human brain simultaneously. Lofty ambitions for a kid who hadn’t even written his first line of code, but the admissions office liked it enough to admit me into the Cognitive Science program. Fast forward to 2017, when I graduated, and I felt no closer to accomplishing my original goal. The only thing that had changed was that I realized how lofty and underspecified my research question was. I was at a fork in the road: I could cut my losses and go into the tech industry, or I could go to graduate school. As I was mulling over taking the GREs and applying to master’s programs (my undergraduate GPA was abysmal after floundering around for my first two years), I ended up having a long chat with one of my research mentors at the time, which changed my trajectory through life. He sat me down and very kindly shared his own journey from his undergraduate degree to his Ph.D. His was slightly less meandering than mine: he worked for a few years after graduating and then applied to graduate school once he was sure it was what he wanted to do. He then gave me two insights to reflect on: the things I was interested in researching just weren’t in line with what the field was working on, and I wasn’t “ready” yet; even if I were accepted, I would probably drop out of the program. <strong>Ouch.</strong> Rough, but in hindsight, true.</p>

<h2 id="its-all-about-timing">It’s All About Timing</h2>

<p>Claude Shannon was a genius, and his seminal work on information theory seeded an entire new field. I am not Claude Shannon, and there was no way I would be able to spawn an entirely new field around neuroscience-driven deep learning. So I was left with a choice: apply to a professor who was working on the topics that interested me and pray that the subfield takes off, or put off my Ph.D. ambitions until I was confident the field was stable enough to sustain itself, but still early enough that I might be able to make a contribution to it. That’s a lot to hedge against, and I don’t think it would have worked out. Perhaps more importantly, the questions I found interesting when I graduated, I no longer find interesting (or even answerable); I no longer care about creating an artificial brain, and I’ve instead become much more interested in studying structures in the brain and how they influence the kinds of computations we can do. It’s fair to say that my interests have changed since joining my current lab, the <a href="ahmedlab.science">Ahmed Lab</a>, but even then, all my work is possible only because of pure dumb luck and good timing: as I was gaining faith that my field of research interest wouldn’t disappear overnight and applying to labs, the <em>D. melanogaster</em> community was having its own “upheaval” with the release of the <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC8903166/">FlyWire</a> dataset. The dataset was a marvel of modern ML and intensive neuroscience work: the community worked together to map out the entire connectome of the fruit fly via image segmentation methods and crowdsourced neuron annotation. If it weren’t for the release of this dataset (which opened up far more possibilities than my original reasons for joining the program), my research would look very different and I don’t think I would find it as interesting.</p>

<p><em>Pure dumb luck.</em></p>

<h2 id="you-have-to-be-ready">You Have to be “Ready”</h2>

<p>There’s something to be said about timing, but there’s also something to be said about who you are as a person and where you are emotionally. Looking back, I know that I just didn’t have what it would take to make it. Too much of my self-worth was tied to my research output and I just took failures far too hard, which is ironic to say coming from a Ph.D. trainee. What I mean is this: you need a sort of “balance” where you have to care about your research, and I mean <strong>deeply</strong> care about your research - it has to fill you with wonder and make you really <a href="./2025-09-12-work-on-hard-problems.md">contemplate the hard questions</a> - but at the same time you have to be able to take your failures (and there will be many) on the chin and try the next idea. In college I was a solid B- student (a solid C- student up to my junior year, when I started research), so my sense of purpose was intimately tied to my research and I couldn’t imagine life outside academia because so much of my “upturn” in confidence was tied to research. By taking a step away for a few years, I learned that I could very much exist outside of the research sphere. In fact, I could thrive, and this has honestly carried me through graduate school.</p>

<h1 id="returning-to-academia">Returning to Academia</h1>

<p>Before I begin, I have to state that I unequivocally believe there are fields and industries where you can get by without a Ph.D. I knew extremely talented people who went on to work as researchers or research engineers at DeepMind or Google Brain after their undergraduate degree or their masters. I consider this the “best case” if you’re in that pool: you have a large moat of resources, fantastic pay, and access to world-class researchers all working on similar things. And then there are industries where the work doesn’t tie as directly to academia, and here you’re left with the unfortunate decision of whether you should return. I think neuroscience is one of those fields. When I first joined my lab, I was thinking about how we could use curiosity-based RL methods or inverse RL methods to model behavior and use that behavior to better understand neural systems. Over time, my work and research interests evolved into studying the relation between structure and function (see my lab’s webpage); without access to the lab materials (reagents, gases, etc.), my work would be impossible. My work fundamentally could not be done in an industry setting, not because of a lack of resources, but because this sort of work isn’t yet marketable and there’s no clear path to monetization. Moreover, a lot of my work is testing hypotheses about how manipulating some gene will influence behavior, which isn’t in line with a standard business model.</p>

<p>So, why did I return to graduate school? I returned because the kinds of questions I was interested in, I could not pursue alone and was very unlikely to pursue at a company. Why did I choose that year to return? Around the time I applied back in 2022, there were quite a few papers that guided my thinking and reasoning: <a href="https://www.sciencedirect.com/science/article/pii/S1364661319300610">Reinforcement Learning, Fast and Slow</a>, <a href="https://arxiv.org/pdf/2006.04439">Liquid Time-constant Networks</a>, and <a href="https://www.sciencedirect.com/science/article/pii/S0896627322008066">In vitro neurons learn and exhibit sentience when embodied in a simulated game-world</a>. It wasn’t so much that any one of those papers signaled to me “now’s the time”, but more that research into what I was interested in no longer seemed like niche one-off papers, which gave me confidence that interest in these topics wouldn’t just dry up. These papers signaled a resurging interest in bringing biology, neuroscience, and machine learning together, whether to study behavior that’s extremely difficult to capture in the wild, to train more efficient models, or to do the crazy sci-fi stuff it felt like the field had lost while chasing benchmarks. From attending <a href="https://www.neuroaiseattle.com/">NeuroAI Seattle</a> in both 2024 and 2025, I think my hunch was right - hearing the speakers talk about the interesting work they did made me confident that the field has a strong future.</p>

<h1 id="the-opportunity-cost-of-graduate-school">The Opportunity Cost of Graduate School</h1>

<p>Let’s get one thing clear: the opportunity cost of grad school is immense, and it only gets worse the further out you are from your undergrad. From a career progression standpoint, the natural next step after a Ph.D. is a post-doc position, but that pipeline doesn’t look much better: excellent researchers struggle to find positions all the time. From an earnings standpoint, simple back-of-the-envelope math puts the loss of earnings at over a million dollars in <em>raw salary</em>, not accounting for promotions, benefits, and compound interest.</p>

<p>That segues into lifestyle changes: unless you come from money, you need to adjust your spending habits. I like to think that I live rather simply, but even then, it’s still depressing how slowly my bank account numbers go up. You’ll also face an increased amount of stress and general pressure to publish and do research; at the Ph.D. trainee level, every moment you’re not researching is time that another researcher could be spending on a similar idea (which might cost you that post-doc position), and at the PI level, you’re constantly worrying about funding for your lab and endlessly writing grants. I’d say it’s most similar to being a founder at a company, which might explain why so many freshly-minted Ph.D. holders migrate into tech. From a social standpoint, you will undoubtedly experience FOMO - your friends will be getting married, buying houses (fingers crossed), and hitting all the “milestones” of being an adult while you’re “still in school”. I understand that academia has a reputation for sticking to itself, but I <strong>get it</strong> - the cadence your life follows is just fundamentally different from that of your friends. To people with no insight into the lifestyle, you’ve entered a phase of “delayed adulthood” (a comment I actually received), but in reality you’re just trading certain things off for others.</p>

<h1 id="closing-words">Closing Words</h1>

<p>Despite all these drawbacks of returning after a few years away, I will say that I think everyone should work for a bit outside academia before going back. Leaving academia undoubtedly changed my perspective on time management, productivity, and <strong>self-worth</strong>. If you’re working and have been away from academia for a bit, I’d say reflect on what I mentioned, particularly about doing research while working at a company. If you truly feel that you cannot accomplish it in that context, then I’d say take the leap. If you’re a friend reading this, do let me know what you think - hopefully this answered any questions you might have had!</p>]]></content><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><category term="grad school" /><summary type="html"><![CDATA[Spoiler: I could :)]]></summary></entry><entry><title type="html">On Monads, Monoids and Endofunctors 1: The monoid</title><link href="https://ianq.ai/cats-big-data-monoid/" rel="alternate" type="text/html" title="On Monads, Monoids and Endofunctors 1: The monoid" /><published>2022-07-19T00:00:00-07:00</published><updated>2022-07-19T00:00:00-07:00</updated><id>https://ianq.ai/cats-big-data-monoid</id><content type="html" xml:base="https://ianq.ai/cats-big-data-monoid/"><![CDATA[<p><strong>Spoiler</strong>: Category theory has applications in machine learning</p>

<p>I’m a fan of code abstraction; I like how clean code looks and “feels”. I think that clean and good code is like art. And just like art can be categorized into styles such as Impressionism, Neo-Impressionism, and Post-Impressionism (all of which I like), we can also organize code.</p>

<p>In this post, I’m not talking about functional vs. imperative vs. object-oriented programming, but about the mathematical structure in code. You might have heard of concepts such as monads, monoids, and functors. At an abstract level, these concepts lay out specific properties that describe how data can flow between various classes (in the programming sense, e.g., Python, C++, Java). The benefit is that if your code fulfills the requirements laid out by these categories, you get certain guarantees about your program’s results and about how you can compose computations together.</p>

<p>This is the first in a series of blog posts discussing categories in programming languages, which will hopefully help you notice patterns and write cleaner code. This series will not be mathematical and assumes no prior knowledge other than <code class="language-plaintext highlighter-rouge">python</code> (which you don’t even really need - it just provides concrete examples of what we’re doing).</p>

<p>We will continually expand on the following scenario throughout the series as we go from “ugly” unabstracted code to clean abstractions. It’s important to note that you (and I) have probably written code that fits these concepts without even realizing it! The concepts introduced here are meant to make you more aware of what you are writing and help you notice these patterns, allowing you to reuse lots of code you have already written.</p>

<h1 id="1-initial-project">1) Initial Project</h1>

<p>You are working on a project involving “parallel” computation, e.g., you have multiple computers or processes on the same system. Concretely, you have 100 machines with identical datasets on them. You want to do a hyperparameter search, e.g., ten searches over each of the 100 machines, totaling 1K runs. For each run, you want to track some validation loss before returning the model with the lowest validation loss.</p>

<p><strong>Note</strong>: Throughout this post we assume that you have some <code class="language-plaintext highlighter-rouge">train</code> and <code class="language-plaintext highlighter-rouge">validate</code> method implemented.</p>
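<p>If you want to run the snippets in this post end-to-end, minimal stand-ins for <code class="language-plaintext highlighter-rouge">train</code> and <code class="language-plaintext highlighter-rouge">validate</code> like the ones below will do. These are illustrative stubs, not the assumed real implementations:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random

def train(conf, train_data):
    """'Train' a model: here, just bundle the config with the data."""
    return {"conf": conf, "data": train_data}

def validate(trained_model, validation_data):
    """'Validate' a model: return a pseudo-loss, deterministic per config."""
    rng = random.Random(str(trained_model["conf"]))
    return rng.uniform(0.0, 1.0)

loss = validate(train({"lr": 0.01}, [1, 2, 3]), [4, 5])
</code></pre></div></div>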

<h2 id="11-simple-scenario">1.1) Simple Scenario</h2>

<p>If you were to find the best model, you might have something like the following:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="kn">import</span> <span class="nn">math</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Any</span><span class="p">,</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">List</span><span class="p">,</span> <span class="n">Tuple</span>

<span class="n">Dataset</span> <span class="o">=</span> <span class="n">Tuple</span><span class="p">[</span><span class="n">NumericalArray</span><span class="p">,</span> <span class="n">NumericalArray</span><span class="p">]</span>
<span class="n">ValidationResults</span> <span class="o">=</span> <span class="n">List</span><span class="p">[</span><span class="nb">float</span><span class="p">]</span>

<span class="k">class</span> <span class="nc">Node</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""
    A compute node on a single machine
    """</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">Dataset</span><span class="p">,</span> <span class="n">hyperparameters</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]]):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">conf</span> <span class="o">=</span> <span class="n">hyperparameters</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>  <span class="c1"># A Map 
</span>        <span class="s">"""Run the training and validation"""</span>
        <span class="k">for</span> <span class="n">conf</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">conf</span><span class="p">:</span>
            <span class="n">trained_model</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">conf</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">validate</span><span class="p">(</span><span class="n">trained_model</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span><span class="p">))</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">validation_lists</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">ValidationResults</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>  <span class="c1"># A reduce
</span>        <span class="s">"""
        validation_lists = [
            [1, ..., 10]  # Node 1
            [0.1, ..., 1.0] # Node 2
            ....
        ]
        """</span>
        <span class="n">minimum</span> <span class="o">=</span> <span class="n">math</span><span class="p">.</span><span class="n">inf</span>
        <span class="k">for</span> <span class="n">arr</span> <span class="ow">in</span> <span class="n">validation_lists</span><span class="p">:</span>
            <span class="n">minimum</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">minimum</span><span class="p">,</span> <span class="nb">min</span><span class="p">(</span><span class="n">arr</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">minimum</span>
</code></pre></div></div>

<p>You would distribute this to all your nodes. After completing its computation, each node sends its report (a list of 10 floating-point numbers describing the validation losses) to a “reducer” node. The reducer node accumulates all 1K results before reducing them to find the minimum value.</p>
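<p>What makes this reduction trustworthy is that <code class="language-plaintext highlighter-rouge">min</code> is associative and <code class="language-plaintext highlighter-rouge">math.inf</code> acts as its identity, so it doesn’t matter how the 1K results are grouped before reducing. This is exactly the structure the series is building toward, and we can check it directly (a standalone sketch with illustrative names, not code from the project above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
from functools import reduce

def combine(a, b):
    # Associative: combine(combine(a, b), c) == combine(a, combine(b, c))
    return min(a, b)

IDENTITY = math.inf  # combine(IDENTITY, x) == x for any float x

losses = [0.9, 0.3, 1.2, 0.7]

# One flat reduce, as a single reducer node would perform
single = reduce(combine, losses, IDENTITY)

# Split across two "reducer nodes", then combine their partial results
left = reduce(combine, losses[:2], IDENTITY)
right = reduce(combine, losses[2:], IDENTITY)
split = combine(left, right)

assert single == split == 0.3
</code></pre></div></div>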

<h2 id="12-complicated-scenario">1.2) Complicated Scenario</h2>

<p>The situation above is straightforward: the final node in the graph accumulates all the <code class="language-plaintext highlighter-rouge">report</code> results and finds the minimum, which doesn’t take up too much memory since floats are cheap to store.</p>

<p>However, what happens if we want to compute more than just the minimum loss, and our data takes up much more memory? In that case, we can apply multiple layers of reduction: Nodes 1-10 send their results to Reducer1, Nodes 11-20 send to Reducer2, and so forth. At the end, a final reducer takes the results from all the intermediate reducers to produce the final result.</p>
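<p>Before looking at the code, it helps to convince ourselves that layering reducers is safe: because <code class="language-plaintext highlighter-rouge">min</code> is associative, grouping nodes under intermediate reducers cannot change the final answer. A small sketch (hypothetical data, not from the project):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Results from four nodes (each a list of validation losses)
node_results = [[1.0, 0.8], [0.5, 0.9], [0.4, 1.1], [0.7, 0.6]]

# Two intermediate reducers, each handling half of the nodes
reducer1 = min(min(arr) for arr in node_results[:2])
reducer2 = min(min(arr) for arr in node_results[2:])

# The final reducer combines only the intermediate results
final = min(reducer1, reducer2)

# Identical to a single flat reduction over all node results
flat = min(min(arr) for arr in node_results)
assert final == flat == 0.4
</code></pre></div></div>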

<p>Our code now looks like the following:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Any</span><span class="p">,</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">List</span><span class="p">,</span> <span class="n">Optional</span><span class="p">,</span> <span class="n">Tuple</span><span class="p">,</span> <span class="n">Union</span>

<span class="n">Dataset</span> <span class="o">=</span> <span class="n">Tuple</span><span class="p">[</span><span class="n">NumericalArray</span><span class="p">,</span> <span class="n">NumericalArray</span><span class="p">]</span>
<span class="n">ValidationResults</span> <span class="o">=</span> <span class="n">List</span><span class="p">[</span><span class="nb">float</span><span class="p">]</span>
<span class="n">Data</span> <span class="o">=</span> <span class="n">Union</span><span class="p">[</span><span class="n">Dataset</span><span class="p">,</span> <span class="n">ValidationResults</span><span class="p">]</span>

<span class="k">class</span> <span class="nc">Node</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""
    A compute node on a single machine which either:
        - runs the hyperparameter search
        - runs a reduction on the data
    """</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">Data</span><span class="p">,</span> <span class="n">hyperparameters</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span>
        <span class="s">"""
        In the case of our data being of instance `ValidationResults`, hyperparameters is an empty dictionary
        """</span>

        <span class="c1"># On our "reducer" nodes
</span>        <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">List</span><span class="p">):</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">conf</span> <span class="o">=</span> <span class="n">hyperparameters</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Run the training and validation"""</span>

        <span class="c1"># Our reduce step
</span>        <span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">data</span>
            <span class="k">return</span>
        <span class="c1"># Our map-and-run step
</span>        <span class="k">for</span> <span class="n">conf</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">conf</span><span class="p">:</span>
            <span class="n">trained_model</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">conf</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">validate</span><span class="p">(</span><span class="n">trained_model</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span><span class="p">))</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
        <span class="s">"""
        All of the results here get collected and saved
        """</span>
        <span class="k">return</span> <span class="nb">min</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="121-messiness">1.2.1) Messiness</h3>

<p>As we can see above, the code is quite messy. The messiness comes from having to care about the <em>underlying</em> data and what to do with it. We want to squint our eyes and abstract away all the conditionals and checks.</p>

<p>Concretely, we would like to abstract away the data handling and make the code cleaner, which we can do as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Dataset</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">Tuple</span><span class="p">[</span><span class="n">NumericalArray</span><span class="p">,</span> <span class="n">NumericalArray</span><span class="p">],</span> <span class="n">hyperparameters</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">conf</span> <span class="o">=</span> <span class="n">hyperparameters</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">conf</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">conf</span><span class="p">:</span>
            <span class="n">trained_model</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">conf</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">validate</span><span class="p">(</span><span class="n">trained_model</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span><span class="p">))</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
        <span class="k">return</span> <span class="nb">min</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span><span class="p">)</span>

<span class="k">class</span> <span class="nc">ValidationResults</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">float</span><span class="p">],</span> <span class="n">_</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">Any</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
        <span class="k">return</span> <span class="nb">min</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">data</span><span class="p">)</span>

<span class="n">Container</span> <span class="o">=</span> <span class="n">Union</span><span class="p">[</span><span class="n">Dataset</span><span class="p">,</span> <span class="n">ValidationResults</span><span class="p">]</span>

<span class="k">class</span> <span class="nc">Node</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""
    A compute node on a single machine which either:
        - runs the hyperparameter search
        - runs a reduction on the data
    """</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">container</span><span class="p">:</span> <span class="n">Container</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">container</span> <span class="o">=</span> <span class="n">container</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="c1"># The individual types handle their own run
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">container</span><span class="p">.</span><span class="n">run</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
        <span class="c1"># The individual types handle their own reduction
</span>        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">container</span><span class="p">.</span><span class="n">report</span><span class="p">()</span>
</code></pre></div></div>

<p>As we can see, we defined two classes above, which will handle the <code class="language-plaintext highlighter-rouge">run</code> and <code class="language-plaintext highlighter-rouge">report</code> as necessary. By delegating the calls, we, as the programmer, do not have to care what the underlying <code class="language-plaintext highlighter-rouge">Container</code> is.</p>

<p>In my opinion, this is much cleaner! This way, we have decoupled the run logic from the underlying data type. All we need to do is call the appropriate methods.</p>

<p>At a higher level, this is freeing because we can treat these class instances as abstract containers: as long as something follows the type signatures of <code class="language-plaintext highlighter-rouge">run</code> and <code class="language-plaintext highlighter-rouge">report</code> expected by <code class="language-plaintext highlighter-rouge">Node</code>, it should, in theory, work out exactly as we expect.</p>
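<p>In modern Python, that informal “follows the type signatures” contract can be written down explicitly with <code class="language-plaintext highlighter-rouge">typing.Protocol</code>. This is only a sketch of the idea; the <code class="language-plaintext highlighter-rouge">Reportable</code> and <code class="language-plaintext highlighter-rouge">Constant</code> names are made up for illustration, not part of the project:</p>

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Reportable(Protocol):
    """Structural interface: anything with run() and report() qualifies."""
    def run(self) -> None: ...
    def report(self) -> float: ...

class Constant:
    """Toy container: no work to do, always reports the same loss."""
    def __init__(self, value: float) -> None:
        self.value = value

    def run(self) -> None:
        pass

    def report(self) -> float:
        return self.value

# Constant never mentions Reportable, yet satisfies it structurally, so a
# static type checker accepts it wherever a Reportable is expected.
# (A runtime_checkable isinstance only checks that the methods exist.)
assert isinstance(Constant(0.5), Reportable)
```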

<h3 id="122-so-what">1.2.2) So what?</h3>

<p>However, none of this should be new to you. Creating an abstract interface to make code clean isn’t anything “interesting” in and of itself. Let’s go deeper.</p>

<h1 id="2-the-second-phase">2) The Second Phase</h1>

<p>In the second phase of the project, you decide that you want to add in things like:</p>

<ul>
  <li>running average</li>
  <li>standard deviation</li>
  <li>tracking the 100 best models in terms of validation losses</li>
</ul>

<p>This would ultimately derail the structure we’ve got above… or would it? Let’s take a look at the custom types we have defined so far:</p>

<p><code class="language-plaintext highlighter-rouge">Dataset</code></p>
<ul>
  <li>run</li>
  <li>report</li>
</ul>

<p><code class="language-plaintext highlighter-rouge">ValidationResults</code></p>
<ul>
  <li>run</li>
  <li>report</li>
</ul>

<p>We notice that our <code class="language-plaintext highlighter-rouge">Dataset</code> doesn’t change much, other than the <code class="language-plaintext highlighter-rouge">Dataset.run</code>. Our <code class="language-plaintext highlighter-rouge">ValidationResults</code> will change, but that’s understandable.</p>

<p><strong>Note</strong>: In the following, I assume you’ll be keeping track of the top 100 best models in your own way. I’ll be “using” a heap, but I won’t include any logic for it because that’s not the point of this work.</p>

<h2 id="21-naive-approach">2.1) Naive Approach</h2>

<p>The naive approach (which would probably come to mind first) would be the following</p>

<p><strong>P.S.</strong>: at the end of our reduce step, we have a dictionary of values which we must process to extract whatever statistics we want.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Dataset</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">Tuple</span><span class="p">[</span><span class="n">NumericalArray</span><span class="p">,</span> <span class="n">NumericalArray</span><span class="p">],</span> <span class="n">hyperparameters</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">conf</span> <span class="o">=</span> <span class="n">hyperparameters</span>
    
        <span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span> <span class="o">=</span> <span class="p">[]</span>


    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">validation_loss_min_heap</span> <span class="o">=</span> <span class="n">heapify</span><span class="p">([])</span>
        <span class="k">for</span> <span class="n">conf</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">conf</span><span class="p">:</span>
            <span class="n">trained_model</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">conf</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span><span class="p">)</span>
            <span class="n">validation_losses</span> <span class="o">=</span> <span class="n">validate</span><span class="p">(</span><span class="n">trained_model</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">validation_losses</span><span class="p">)</span>

            <span class="c1"># you do the checks and logic
</span>            <span class="bp">self</span><span class="p">.</span><span class="n">validation_loss_min_heap</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">validation_losses</span><span class="p">)</span>


    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">float</span><span class="p">]:</span>
        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"min"</span><span class="p">:</span> <span class="nb">min</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span><span class="p">),</span>
            <span class="s">"sum"</span><span class="p">:</span> <span class="nb">sum</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span><span class="p">),</span>
            <span class="s">"count"</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">validation_losses</span><span class="p">),</span>
            <span class="s">"best_100"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">validation_loss_min_heap</span>
        <span class="p">}</span>

<span class="k">class</span> <span class="nc">ValidationResultDict</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">Dict</span><span class="p">],</span> <span class="n">_</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">data</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">min_so_far</span> <span class="o">=</span> <span class="n">math</span><span class="p">.</span><span class="n">inf</span>
        <span class="n">sum_so_far</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="n">count_so_far</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="n">validation_loss_min_heap</span> <span class="o">=</span> <span class="n">heapify</span><span class="p">([])</span>
        <span class="k">for</span> <span class="n">data_dict</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">data</span><span class="p">:</span>
            <span class="n">min_so_far</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">min_so_far</span><span class="p">,</span> <span class="n">data_dict</span><span class="p">[</span><span class="s">"min"</span><span class="p">])</span>
            <span class="n">sum_so_far</span> <span class="o">=</span> <span class="n">sum_so_far</span> <span class="o">+</span> <span class="n">data_dict</span><span class="p">[</span><span class="s">"sum"</span><span class="p">]</span>
            <span class="n">count_so_far</span> <span class="o">=</span> <span class="n">count_so_far</span> <span class="o">+</span> <span class="n">data_dict</span><span class="p">[</span><span class="s">"count"</span><span class="p">]</span>

            <span class="c1"># you do the checks and logic
</span>            <span class="n">validation_loss_min_heap</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="n">data_dict</span><span class="p">[</span><span class="s">"best_100"</span><span class="p">])</span>

        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"min"</span><span class="p">:</span> <span class="n">min_so_far</span><span class="p">,</span>
            <span class="s">"sum"</span><span class="p">:</span> <span class="n">sum_so_far</span><span class="p">,</span>
            <span class="s">"count"</span><span class="p">:</span> <span class="n">count_so_far</span><span class="p">,</span>
            <span class="s">"best_100"</span><span class="p">:</span> <span class="n">validation_loss_min_heap</span>
        <span class="p">}</span>

<span class="n">Container</span> <span class="o">=</span> <span class="n">Union</span><span class="p">[</span><span class="n">Dataset</span><span class="p">,</span> <span class="n">ValidationResultDict</span><span class="p">]</span>

<span class="k">class</span> <span class="nc">Node</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""
    A compute node on a single machine which either:
        - runs the hyperparameter search
        - runs a reduction on the data
    """</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">Container</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">container</span> <span class="o">=</span> <span class="n">data</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="c1"># The individual types handle their own run
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">container</span><span class="p">.</span><span class="n">run</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">:</span>
        <span class="c1"># The individual types handle their own reduction
</span>        <span class="c1"># Also, you now have to process the returned dictionary
</span>        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">container</span><span class="p">.</span><span class="n">report</span><span class="p">()</span>
</code></pre></div></div>

<p>where we added custom code to track the state and update our dictionary container. However, as we can see, there is a LOT of similarity between <code class="language-plaintext highlighter-rouge">Dataset.report</code> and <code class="language-plaintext highlighter-rouge">ValidationResultDict.report</code>. Can we make this cleaner?</p>

<p>To do so, we first introduce the concept of a <a href="https://en.wikipedia.org/wiki/Monoid">monoid</a>; I wouldn’t bother reading that link until after you’ve finished this article.</p>

<h2 id="22-a-monoid">2.2) A monoid?</h2>

<p>How does a monoid help us? Well, what <strong>is</strong> a monoid? A monoid is a mathematical structure that has the following properties:</p>

<ul>
  <li>a binary operation that is associative, i.e. operation(operation(a, b), c) == operation(a, operation(b, c))</li>
  <li>closed, i.e. applying the binary operation to two instances of <code class="language-plaintext highlighter-rouge">BLABLABLA</code> always yields another instance of <code class="language-plaintext highlighter-rouge">BLABLABLA</code></li>
  <li>an identity element, e.g. 1 + 0 == 1 and 10 * 1 == 10 (0 and 1 being the identities for addition and multiplication, respectively)</li>
</ul>
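<p>For a concrete instance of these laws: floats under <code class="language-plaintext highlighter-rouge">min</code> form a monoid with <code class="language-plaintext highlighter-rouge">math.inf</code> as the identity, which is exactly why our minimum-validation-loss reduction distributes so cleanly. A quick check in plain Python:</p>

```python
import math

op = min             # the binary operation
identity = math.inf  # combining with it changes nothing

a, b, c = 0.7, 0.3, 0.9

# Associative: grouping doesn't matter.
assert op(op(a, b), c) == op(a, op(b, c))

# Closed: combining two floats yields another float.
assert isinstance(op(a, b), float)

# Identity: min(x, inf) == x for any float x.
assert op(a, identity) == a
```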

<p>Knowing this, could we abstract out our code? We’re making a bit of a jump below, but I promise I’ll add comments to the code. Let’s add a new class, <code class="language-plaintext highlighter-rouge">Summary</code>, which we define as the following:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Summary</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">validation_loss</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">float</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
        <span class="s">"""
        We define an identity and non-identity instantiation

        There are 2 cases:
            - validation_loss is None:       where our compute node had an empty configuration file, or errored out
            - validation_loss is not None:   our computation node worked!

        """</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">count</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">if</span> <span class="n">validation_loss</span> <span class="ow">is</span> <span class="bp">None</span> <span class="k">else</span> <span class="mi">1</span>
        <span class="bp">self</span><span class="p">.</span><span class="nb">min</span> <span class="o">=</span> <span class="n">math</span><span class="p">.</span><span class="n">inf</span> <span class="k">if</span> <span class="n">validation_loss</span> <span class="ow">is</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">validation_loss</span>
        <span class="bp">self</span><span class="p">.</span><span class="nb">sum</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">if</span> <span class="n">validation_loss</span> <span class="ow">is</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">validation_loss</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">best_N</span> <span class="o">=</span> <span class="n">heapify</span><span class="p">([])</span> <span class="k">if</span> <span class="n">validation_loss</span> <span class="ow">is</span> <span class="bp">None</span> <span class="k">else</span> <span class="n">heapify</span><span class="p">([</span><span class="n">validation_loss</span><span class="p">])</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">inplace</span> <span class="o">=</span> <span class="n">inplace</span>


    <span class="k">def</span> <span class="nf">reduce</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other</span><span class="p">:</span> <span class="s">"Summary"</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="s">"Summary"</span><span class="p">:</span>
        <span class="s">"""
        We've defined a binary operation that is associative (and, here, also commutative):
            reduce(reduce(a, b), c) == reduce(a, reduce(b, c))

        and the output is always a summary! 
        """</span>
        <span class="n">to_assign</span> <span class="o">=</span> <span class="bp">self</span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">inplace</span> <span class="k">else</span> <span class="n">Summary</span><span class="p">()</span>

        <span class="n">to_assign</span><span class="p">.</span><span class="n">count</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">count</span> <span class="o">+</span> <span class="n">other</span><span class="p">.</span><span class="n">count</span>
        <span class="n">to_assign</span><span class="p">.</span><span class="nb">min</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="nb">min</span><span class="p">,</span> <span class="n">other</span><span class="p">.</span><span class="nb">min</span><span class="p">)</span>
        <span class="n">to_assign</span><span class="p">.</span><span class="nb">sum</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="nb">sum</span> <span class="o">+</span> <span class="n">other</span><span class="p">.</span><span class="nb">sum</span>
        <span class="n">to_assign</span><span class="p">.</span><span class="n">best_N</span> <span class="o">=</span> <span class="n">merge_heaps</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">best_N</span><span class="p">,</span> <span class="n">other</span><span class="p">.</span><span class="n">best_N</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">to_assign</span>
</code></pre></div></div>

<p>We’ve done three things above:</p>

<ul>
  <li>defined an “identity” <code class="language-plaintext highlighter-rouge">Summary</code> to handle the case where we’ve errored out or our configuration was empty (for various reasons)</li>
  <li>defined a binary operation that is associative (regrouping chained reductions gives the same result) and, here, also commutative</li>
  <li>ensured that we always output a <code class="language-plaintext highlighter-rouge">Summary</code> type!</li>
</ul>
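<p>To see the payoff, here is a stripped-down <code class="language-plaintext highlighter-rouge">Summary</code> folded with <code class="language-plaintext highlighter-rouge">functools.reduce</code>. It drops the <code class="language-plaintext highlighter-rouge">best_N</code> heap (whose helpers are deliberately left unspecified) and the <code class="language-plaintext highlighter-rouge">inplace</code> flag, so treat it as a sketch rather than the exact class above:</p>

```python
import functools
import math

class Summary:
    """Monoid-style summary: the no-argument form is the identity element."""
    def __init__(self, validation_loss=None):
        self.count = 0 if validation_loss is None else 1
        self.min = math.inf if validation_loss is None else validation_loss
        self.sum = 0.0 if validation_loss is None else validation_loss

    def reduce(self, other):
        # Always build a fresh Summary; every field combines BOTH operands.
        combined = Summary()
        combined.count = self.count + other.count
        combined.min = min(self.min, other.min)
        combined.sum = self.sum + other.sum
        return combined

losses = [0.5, 0.25, 0.75]
# Fold every singleton Summary into one, starting from the identity.
total = functools.reduce(Summary.reduce, (Summary(v) for v in losses), Summary())

# Associativity in action: either grouping yields the same result, which is
# what lets us split the fold across reducer nodes however we like.
left = Summary(0.5).reduce(Summary(0.25)).reduce(Summary(0.75))
right = Summary(0.5).reduce(Summary(0.25).reduce(Summary(0.75)))
assert (left.count, left.min, left.sum) == (right.count, right.min, right.sum)
```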

<h2 id="23-using-our-monoid">2.3) Using our monoid</h2>

<p>We can then restructure our code by noting a few things:</p>

<ul>
  <li>our <code class="language-plaintext highlighter-rouge">Dataset.report</code> will now always return a singleton <code class="language-plaintext highlighter-rouge">Summary</code></li>
  <li>our <code class="language-plaintext highlighter-rouge">ValidationResultDict</code> now accepts a <code class="language-plaintext highlighter-rouge">List[Summary]</code> on <code class="language-plaintext highlighter-rouge">__init__</code> as opposed to a <code class="language-plaintext highlighter-rouge">List[Dict]</code>, and it now outputs a <code class="language-plaintext highlighter-rouge">Summary</code></li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">class</span> <span class="nc">Dataset</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">Tuple</span><span class="p">[</span><span class="n">NumericalArray</span><span class="p">,</span> <span class="n">NumericalArray</span><span class="p">],</span> <span class="n">hyperparameters</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">conf</span> <span class="o">=</span> <span class="n">hyperparameters</span>


        <span class="c1"># Create one just to ensure we always have something when the `report` is called
</span>        <span class="c1"># This way even if we do a `report` we can be sure that the code won't error out
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">summary</span> <span class="o">=</span> <span class="p">[</span><span class="n">Summary</span><span class="p">()]</span>  

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">conf</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">conf</span><span class="p">:</span>
            <span class="n">trained_model</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">conf</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">train_data</span><span class="p">)</span>
            <span class="n">v</span> <span class="o">=</span> <span class="n">validate</span><span class="p">(</span><span class="n">trained_model</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">validation_data</span><span class="p">)</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">summary</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">v</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="n">Summary</span><span class="p">]:</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">summary</span>
        
<span class="k">class</span> <span class="nc">ValidationResult</span><span class="p">():</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">summary_list_of_lists</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">List</span><span class="p">[</span><span class="n">Summary</span><span class="p">]],</span> <span class="n">_</span><span class="p">,</span> <span class="n">reduce_immediately</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
        <span class="c1"># Reduce the LoL into a single list
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">summaries</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">summary_list_of_lists</span><span class="p">,</span> <span class="p">[])</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">reduce_immediately</span> <span class="o">=</span> <span class="n">reduce_immediately</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="n">Summary</span><span class="p">]:</span>
        <span class="c1"># Option 1
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">reduce_immediately</span><span class="p">:</span>
            <span class="n">running_summary</span> <span class="o">=</span> <span class="n">Summary</span><span class="p">()</span>
            <span class="k">for</span> <span class="n">summary</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">summaries</span><span class="p">:</span>
                <span class="n">running_summary</span><span class="p">.</span><span class="nb">reduce</span><span class="p">(</span><span class="n">summary</span><span class="p">)</span>

            <span class="c1"># Insert into a list to keep the types nice and tidy
</span>            <span class="n">running_summary</span> <span class="o">=</span> <span class="p">[</span><span class="n">running_summary</span><span class="p">]</span>

        <span class="c1"># Option 2: reduce it all and then transmit, which saves bandwidth
</span>        <span class="k">else</span><span class="p">:</span>
            <span class="n">running_summary</span> <span class="o">=</span> <span class="p">[]</span> 
            <span class="k">for</span> <span class="n">summary</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">summaries</span><span class="p">:</span>
                <span class="n">running_summary</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">summary</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">running_summary</span>

<span class="n">Container</span> <span class="o">=</span> <span class="n">Union</span><span class="p">[</span><span class="n">Dataset</span><span class="p">,</span> <span class="n">ValidationResult</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">Node</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""
    A compute node on a single machine which either:
        - runs the hyperparameter search
        - runs a reduction on the data
    """</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">:</span> <span class="n">Data</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">container</span> <span class="o">=</span> <span class="n">data</span>

    <span class="k">def</span> <span class="nf">run</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="c1"># The individual types handle their own run
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">container</span><span class="p">.</span><span class="n">run</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">report</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="n">Summary</span><span class="p">]:</span>
        <span class="c1"># The individual types handle their own reduction
</span>        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">container</span><span class="p">.</span><span class="n">report</span><span class="p">()</span>
</code></pre></div></div>

<p><em>chef’s kiss</em></p>

<p><strong>P.S.</strong> Again, you would still need to do the final processing on the reduced <code class="language-plaintext highlighter-rouge">Summary</code>, but that’s easy.</p>

<h2 id="23-a-retrospective">2.3) A retrospective</h2>

<p>Notice how, by modifying our logic, we made our code look extremely simple. If we decide to add another feature, e.g., a max, a standard deviation, etc., all we would have to change is our <code class="language-plaintext highlighter-rouge">Summary</code> class to encapsulate the change.</p>
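<p>As a concrete sketch, here is a hypothetical minimal <code class="language-plaintext highlighter-rouge">Summary</code> (not the full class from earlier, and returning a new object rather than mutating in place). Note how adding the <code class="language-plaintext highlighter-rouge">maximum</code> field touches only this one class:</p>

```python
from dataclasses import dataclass


@dataclass
class Summary:
    # A default-constructed Summary is the identity element
    count: int = 0
    total: float = 0.0
    maximum: float = float("-inf")

    def reduce(self, other: "Summary") -> "Summary":
        # Associative combine: any grouping yields the same result
        return Summary(
            count=self.count + other.count,
            total=self.total + other.total,
            maximum=max(self.maximum, other.maximum),
        )


a = Summary(2, 3.0, 2.0)
b = Summary(1, 5.0, 5.0)
c = Summary(4, 2.0, 1.5)

assert a.reduce(b).reduce(c) == a.reduce(b.reduce(c))  # associativity
assert Summary().reduce(a) == a                        # identity
```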

<h1 id="3-monoids-and-abstractions">3) Monoids and abstractions</h1>

<p><strong>QUICK</strong>: Before your eyes gloss over the following diagram, listen to what I’ve got to say. You already know all of the things in the diagram, which is from <a href="https://en.wikipedia.org/wiki/Monoid_%28category_theory%29">Wikipedia: monoids</a></p>

<p><img src="../blog_images/monoids/Monoid_mult.png" alt="monoid pentagon diagram" /></p>

<p>In this case, <code class="language-plaintext highlighter-rouge">M</code> is an object in a monoidal category; think of it as a fixed but arbitrary class, e.g., <code class="language-plaintext highlighter-rouge">ValidationResult</code> or <code class="language-plaintext highlighter-rouge">Node</code>. As programmers, we operate on <strong>instances</strong> of those classes, but ignore that for now.</p>

<hr />

<p>On the first line, we have three terms; let’s index them 1, 2, and 3. On the bottom line, we have two terms; index them 4 and 5. In between these terms, we have arrows, which are transformations.</p>

<p><code class="language-plaintext highlighter-rouge">1-&gt;2</code>: we see that \(\alpha\) is <strong>association</strong> where we move the parenthesis around. We introduced associativity as a property of a monoid earlier.</p>

<p><code class="language-plaintext highlighter-rouge">2-&gt;3</code> we see that we have “reduced” the equation \(M \bigotimes (M \bigotimes M)\) into \(M \bigotimes M\) by applying \(1 \bigotimes \mu\), which is equivalent to saying that the first term (the M not in the parens) is the identity. We can do this because monoids must have an identity.</p>

<p><code class="language-plaintext highlighter-rouge">2-&gt;4</code> is the same as the above, but with the parens in a different location</p>

<p><code class="language-plaintext highlighter-rouge">4-&gt;5</code> &amp;&amp; <code class="language-plaintext highlighter-rouge">3-&gt;5</code>: is the result of just evaluation the <code class="language-plaintext highlighter-rouge">x</code>, the \(\mu\).</p>

<p>And there you go!</p>
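<p>To make the diagram concrete, take the integers under addition as \(M\), with addition playing the role of \(\mu\) and 0 as the identity; both paths through the diagram evaluate to the same result:</p>

```python
from functools import reduce


def mu(x, y):
    # The monoid multiplication: here, plain addition with identity 0
    return x + y


xs = [3, 4, 5]

# Reduce the right-hand pair first: M (x) (M (x) M) -> M (x) M -> M
right_first = mu(xs[0], mu(xs[1], xs[2]))
# Reduce the left-hand pair first: (M (x) M) (x) M -> M (x) M -> M
left_first = mu(mu(xs[0], xs[1]), xs[2])

assert right_first == left_first == reduce(mu, xs, 0)  # the diagram commutes
```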

<h1 id="closing-thoughts">Closing Thoughts</h1>

<p>This post came about after a discussion with one of my mentees. That mentee was facing something similar, and as someone who has gone through this EXACT problem, I thought I’d write about it and share what I’ve learned.</p>

<p>Also, I firmly believe that one way to ensure you know something is to explain it. And so, to finally understand what</p>

<blockquote>
  <p>A monad is a monoid in the category of endofunctors, what’s the problem?</p>
</blockquote>

<p>I’ve decided to write a 3-part series on “What is a monoid?”, “What is an endofunctor” and “What is a monad”. All those posts will build off one another so stick around!</p>]]></content><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><category term="machine learning" /><category term="category theory" /><summary type="html"><![CDATA[Spoiler: Category theory has applications in machine learning]]></summary></entry><entry><title type="html">PyTorch Gradient Manipulation 1</title><link href="https://ianq.ai/pytorch-gradients-pt1/" rel="alternate" type="text/html" title="PyTorch Gradient Manipulation 1" /><published>2022-01-06T00:00:00-08:00</published><updated>2022-01-06T00:00:00-08:00</updated><id>https://ianq.ai/pytorch-gradients-pt1</id><content type="html" xml:base="https://ianq.ai/pytorch-gradients-pt1/"><![CDATA[<p><strong>Spoiler</strong>: PyTorch offers about five ways to manipulate gradients.</p>

<p>This notebook is part 1 in a series of tutorials discussing gradients (manipulation, stopping, etc.) in <code class="language-plaintext highlighter-rouge">PyTorch</code>. The series covers the following network architectures:</p>

<p>1) <strong>Single-headed simple architecture</strong><br />
2) Single-headed complex architecture<br />
3) Multi-headed architecture</p>

<p>but by the end of this post you will know all that you need to know to tackle the other architectures on your own.</p>

<p>The notebook for this tutorial can be found on <a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG?usp=sharing">Google Colab gradient_flow_1</a>.</p>

<p><strong>Note</strong>: For the purpose of this discussion, we define a module as either a single layer or a collection of layers in a neural network.</p>

<h1 id="1-motivation">1) Motivation</h1>

<p>The motivation behind this post is threefold:</p>

<h2 id="i-familiarizing-myself-with-pytorch">i) Familiarizing Myself with <code class="language-plaintext highlighter-rouge">PyTorch</code></h2>

<p>Although <code class="language-plaintext highlighter-rouge">PyTorch</code> is easy to prototype with, I don’t fully understand its computation graph and how it applies gradients via the <code class="language-plaintext highlighter-rouge">optim</code> package.</p>

<h2 id="ii-playing-with-gradient-stopping-and-propagating">ii) Playing with Gradient Stopping and Propagating</h2>

<p>Understanding how to stop the propagation of gradients is essential, especially nowadays, when we often start from off-the-shelf weights that we then fine-tune; fine-tuning is a straightforward problem if we have a simple module, as shown below:</p>

<p><img src="../blog_images/pt_gradients/frozen_layer.png" alt="Simple module" /></p>

<p>But what happens if we want to skip applying gradients to a specific intermediate layer?</p>

<p><img src="../blog_images/pt_gradients/frozen_intermediate.png" alt="Complicated Module" /></p>

<p>Or when we have two networks that interact only occasionally, or two networks that are otherwise related? Consider the following topology with two primary modules, the actor and the critic, as used in the Deep Deterministic Policy Gradient (DDPG) architecture:</p>

<p><img src="../blog_images/pt_gradients/ddpg.png" alt="DDPG" /></p>

<p><strong>NOTE</strong>: Image sourced from <a href="https://intellabs.github.io/coach/components/agents/policy_optimization/ddpg.html">IntelLabs: DDPG</a></p>

<p>We see that the critic (the bottom module) accepts the actor’s output. However, unless we stop the gradient flow, the computation graph will inadvertently backpropagate critic updates through the actor, which is undesirable.</p>
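<p>A minimal sketch of this fix using <code class="language-plaintext highlighter-rouge">detach</code> (the layer shapes and module names below are made up for illustration):</p>

```python
import torch as T
import torch.nn as nn

actor = nn.Linear(4, 2)   # hypothetical actor: state -> action
critic = nn.Linear(6, 1)  # hypothetical critic: (state, action) -> Q-value

state = T.randn(8, 4)
action = actor(state)

# Detach the action so the critic's loss cannot backpropagate
# through the actor's parameters
q = critic(T.cat([state, action.detach()], dim=1))
critic_loss = q.pow(2).mean()
critic_loss.backward()

assert critic.weight.grad is not None  # the critic received gradients
assert actor.weight.grad is None       # the actor was left untouched
```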

<h1 id="2-contents">2) Contents</h1>

<p>We explore five methods categorized into <strong>High-Level</strong>, which utilize built-in methods, and <strong>Low-Level</strong>, where we manually access the gradients.</p>

<h2 id="21-high-level">2.1) High-Level</h2>

<p>The following methods are pertinent only to <strong>stopping</strong> gradients:</p>

<ul>
  <li>
    <p><a href="https://pytorch.org/docs/1.9.1/generated/torch.Tensor.detach.html"><code class="language-plaintext highlighter-rouge">detach</code></a>, which returns a copied tensor with the same values and properties but detached from the computation graph. The original tensor is preserved.</p>
  </li>
  <li>
    <p><a href="https://pytorch.org/docs/stable/generated/torch.no_grad.html"><code class="language-plaintext highlighter-rouge">no_grad</code></a>, which is a context manager that disables gradient calculation, setting <code class="language-plaintext highlighter-rouge">requires_grad</code> to <code class="language-plaintext highlighter-rouge">False</code> for all variables created within its scope.</p>
  </li>
  <li>
    <p><a href="https://pytorch.org/docs/stable/generated/torch.inference_mode.html"><code class="language-plaintext highlighter-rouge">inference</code></a>, which ompletely halts gradient calculations both downstream and upstream. This is a relatively new method, introduced on September 14, 2021, and warrants discussion.</p>
  </li>
</ul>
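<p>A quick sketch of how the three behave on a toy tensor:</p>

```python
import torch as T

x = T.ones(3, requires_grad=True)

# detach: a new tensor sharing the same data, cut out of the graph
d = x.detach()
assert d.requires_grad is False
assert d.data_ptr() == x.data_ptr()  # same underlying storage

# no_grad: operations inside the context are simply not tracked
with T.no_grad():
    y = x * 2
assert y.requires_grad is False

# inference_mode: like no_grad, but the resulting tensors also cannot
# be used in autograd-tracked computations later
with T.inference_mode():
    z = x * 2
assert z.requires_grad is False
```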

<h2 id="22-low-level">2.2) Low-Level</h2>

<p>With direct access to the gradients, we can not only stop gradients but also manipulate them based on our specific needs:</p>

<ul>
  <li>
    <p>Via the <code class="language-plaintext highlighter-rouge">optimizer</code>, where we exclude the optimizer from receiving the parameters of certain modules.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">Manual Manipulation</code>, where we extract the gradients and then choose whether to modify or manipulate them before application.</p>
  </li>
</ul>

<h2 id="23-eval-misconception">2.3) <code class="language-plaintext highlighter-rouge">eval</code> Misconception</h2>

<p>When I first started using <code class="language-plaintext highlighter-rouge">PyTorch</code>, I mistakenly assumed that <code class="language-plaintext highlighter-rouge">eval</code> mode would:</p>

<ul>
  <li>Put the model into inference mode (turning off dropout and making batchnorm run in eval mode),</li>
  <li>Turn off the computation graph construction.</li>
</ul>

<p>However, it does not affect the computation graph construction as I had thought.</p>

<h2 id="24-making-the-right-choice">2.4) Making the Right Choice</h2>

<p>Ultimately, each method comes with various trade-offs. We will discuss these below, allowing you to make an informed decision best suited for your application.</p>

<h1 id="3-problem-setup">3) Problem Setup</h1>

<p>We have the following graph:</p>

<p><img src="../blog_images/pt_gradients/simple.png" alt="Simple Graph" /></p>

<p>In this setup, we aim to update only the network’s output head (L2). What are the various ways we can accomplish this?</p>

<p>I highly recommend having the <a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG#scrollTo=f32bcf98">colab notebook</a> open as you work through this. I made it a point to plot the resulting computation graph for each setting, making it easier to understand what is happening.</p>
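<p>A minimal stand-in for this setup (layer sizes are arbitrary), showing the control behaviour we want to change: by default, <strong>both</strong> layers receive gradients.</p>

```python
import torch as T
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(4, 8)  # the layer we want to freeze
        self.l2 = nn.Linear(8, 1)  # the output head we want to update

    def forward(self, x):
        return self.l2(T.relu(self.l1(x)))


net = Net()
net(T.randn(16, 4)).mean().backward()

# Control: without any intervention, both layers receive gradients
assert net.l1.weight.grad is not None
assert net.l2.weight.grad is not None
```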

<h1 id="4-high-level">4) High-Level</h1>

<h2 id="41-detach">4.1) <a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG#scrollTo=_5d-dmI0q6RY"><code class="language-plaintext highlighter-rouge">detach</code></a></h2>

<p><code class="language-plaintext highlighter-rouge">detach</code> detaches upstream values from the graph, so we only calculate the gradient backward up to the first <code class="language-plaintext highlighter-rouge">detach</code>. Our current graph setup is too simple to illustrate this phenomenon, but the computation graph in the follow-up post will work well.</p>

<h3 id="411-observations">4.1.1) Observations</h3>

<p>Notice two things from the cells:</p>

<ul>
  <li>The output of the <code class="language-plaintext highlighter-rouge">print</code> statements shows that the <code class="language-plaintext highlighter-rouge">grad</code> of L1 is <code class="language-plaintext highlighter-rouge">None</code>.</li>
  <li>L1 does not exist in the computation graph (contrast this with the <a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG#scrollTo=WfL3WKpo7Lgt">Control</a>).</li>
</ul>

<h3 id="412-usecase">4.1.2) Usecase</h3>

<ul>
  <li>Stopping gradient flow.</li>
  <li>Saving memory.</li>
</ul>

<p>Torch tensors keep track of data such as the computation graph. By detaching these tensors, we drop the computation graph of all upstream operations up to the current variable.</p>

<ul>
  <li>Converting the tensor to <code class="language-plaintext highlighter-rouge">numpy</code>.</li>
</ul>

<p>Attempting to directly convert to <code class="language-plaintext highlighter-rouge">numpy</code> will result in an error because <code class="language-plaintext highlighter-rouge">numpy</code> does not track the computation graph. It is safer to have a clear distinction between <code class="language-plaintext highlighter-rouge">numpy</code> arrays and torch tensors.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span> <span class="k">as</span> <span class="n">T</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">a</span> <span class="o">+</span> <span class="n">a</span>
<span class="n">b</span><span class="p">.</span><span class="n">numpy</span><span class="p">()</span>
</code></pre></div></div>
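<p>The usual fix, assuming you really do want the NumPy array, is to detach first and then convert:</p>

```python
import torch as T

a = T.tensor(1.0, requires_grad=True)
b = a + a
arr = b.detach().numpy()  # detach drops the graph, then convert safely
```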

<h2 id="42-no_grad">4.2) <a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG#scrollTo=BiWLTq7ChbJv"><code class="language-plaintext highlighter-rouge">no_grad</code></a></h2>

<h3 id="421-no_grad-in-action">4.2.1) <code class="language-plaintext highlighter-rouge">no_grad</code> in action</h3>

<p>It can be used as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!pip install -q torchviz
</span><span class="kn">import</span> <span class="nn">torch</span> <span class="k">as</span> <span class="n">T</span>
<span class="kn">from</span> <span class="nn">torchviz</span> <span class="kn">import</span> <span class="n">make_dot</span>

<span class="c1"># Requires grad = True to construct graph
</span><span class="n">x</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>  
<span class="k">with</span> <span class="n">T</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
	<span class="k">pass</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">x</span> <span class="o">**</span> <span class="mi">2</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">x</span> <span class="o">**</span> <span class="mi">3</span>
<span class="n">r</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="n">z</span><span class="p">).</span><span class="nb">sum</span><span class="p">()</span>

<span class="n">make_dot</span><span class="p">(</span>
    <span class="n">r</span><span class="p">,</span> 
    <span class="n">params</span><span class="o">=</span><span class="p">{</span><span class="s">"y"</span><span class="p">:</span> <span class="n">y</span><span class="p">,</span> <span class="s">"z"</span><span class="p">:</span> <span class="n">z</span><span class="p">,</span> <span class="s">"r"</span><span class="p">:</span> <span class="n">r</span><span class="p">,</span> <span class="s">"x"</span><span class="p">:</span> <span class="n">x</span><span class="p">},</span>
    <span class="n">show_attrs</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Uncomment the first line if you do not already have <a href="https://github.com/szagoruyko/pytorchviz">torchviz</a>. Then, play around with moving <code class="language-plaintext highlighter-rouge">y</code> or <code class="language-plaintext highlighter-rouge">z</code> into the <code class="language-plaintext highlighter-rouge">T.no_grad()</code> context.</p>

<h3 id="422-observations">4.2.2) Observations</h3>

<ul>
  <li>
    <p>The graph of <code class="language-plaintext highlighter-rouge">no_grad</code> is the same as the graph of <code class="language-plaintext highlighter-rouge">detach</code>.</p>
  </li>
  <li>
    <p>The printed information shows that <code class="language-plaintext highlighter-rouge">L1</code> has <code class="language-plaintext highlighter-rouge">None</code> gradients, similar to the previous method.</p>
  </li>
</ul>

<h3 id="423-usecase">4.2.3) Usecase</h3>

<ul>
  <li>
    <p>Stopping gradients.</p>
  </li>
  <li>
    <p>Improving computational speed and memory consumption.</p>

    <p><code class="language-plaintext highlighter-rouge">no_grad</code> tells PyTorch to not track operations within the context, which means that the computation graph is not created.</p>

    <p>Furthermore, <code class="language-plaintext highlighter-rouge">no_grad</code> is faster than <code class="language-plaintext highlighter-rouge">detach</code> because <code class="language-plaintext highlighter-rouge">detach</code> creates an additional tensor object (sharing the same data, just without the computation graph), whereas <code class="language-plaintext highlighter-rouge">no_grad</code> simply never records the operations within its scope.</p>
  </li>
  <li>
    <p>Less room for mistakes.</p>

    <p>With <code class="language-plaintext highlighter-rouge">detach</code>, keeping both the attached and detached tensors around might not be your intention, and you might accidentally operate on the wrong variable; a <code class="language-plaintext highlighter-rouge">no_grad</code> block avoids that.</p>
  </li>
</ul>

<h2 id="43-inference">4.3) <a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG#scrollTo=HXtMs3CjhbUz"><code class="language-plaintext highlighter-rouge">inference</code></a></h2>

<h3 id="431-observations">4.3.1) Observations</h3>

<p>We discuss two observations for this code section:</p>

<p><a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG#scrollTo=JzpsF116hnNQ&amp;line=1&amp;uniqifier=1"><strong>Cell 1: without_grad</strong></a></p>

<p>Viewing the computation graph, we see that no values are tracked (hence the single empty block).</p>

<p><strong>Solution</strong>:
If we want to allow downstream calculations that themselves are not in <code class="language-plaintext highlighter-rouge">inference</code> mode, we must make a <code class="language-plaintext highlighter-rouge">clone</code> of the tensor. We show the relevant code in <strong>section 4.3.2) Relevant Code</strong>.</p>

<p><a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG#scrollTo=JzpsF116hnNQ&amp;line=1&amp;uniqifier=1"><strong>Cell 2: with_grad</strong></a></p>

<p>We see this method produced the same computation graph as in the <code class="language-plaintext highlighter-rouge">detach</code> and <code class="language-plaintext highlighter-rouge">no_grad</code> settings. Like <code class="language-plaintext highlighter-rouge">no_grad</code>, <code class="language-plaintext highlighter-rouge">inference()</code> is a context manager. In <code class="language-plaintext highlighter-rouge">no_grad</code> and <code class="language-plaintext highlighter-rouge">detach</code>, upstream values were not tracked in the computation graph; in <code class="language-plaintext highlighter-rouge">inference</code>, even downstream values are not tracked.</p>

<p>*<a href="https://pytorch.org/cppdocs/notes/inference_mode.html">Pytorch CPP Inference mode docs</a></p>

<h3 id="432-relevant-code">4.3.2) Relevant Code</h3>

<p>We generated the two graphs by following the setup from this <a href="https://twitter.com/pytorch/status/1437838242418671620?lang=en">official Twitter post</a>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_inference_forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">):</span>
  <span class="c1"># First var is a inferenced-var
</span>  <span class="k">with</span> <span class="n">T</span><span class="p">.</span><span class="n">inference_mode</span><span class="p">():</span>
    <span class="n">tmp</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">l1</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
  <span class="k">try</span><span class="p">:</span>
    <span class="c1"># Try to do a non-inference forward pass
</span>    <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">l2</span><span class="p">(</span><span class="n">tmp</span><span class="p">)</span>
  <span class="k">except</span> <span class="nb">Exception</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Trying to use intermediate inference_mode tensor outside inference_mode context manager"</span><span class="p">)</span>
    
    <span class="c1"># Getting pure-inference
</span>    <span class="k">with</span> <span class="n">T</span><span class="p">.</span><span class="n">inference_mode</span><span class="p">():</span>
      <span class="n">grad_disabled</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">l2</span><span class="p">(</span><span class="n">tmp</span><span class="p">)</span>
    <span class="c1"># Convert inferenced-var and allow us to
</span>    <span class="c1"># do a normal forward pass
</span>    <span class="n">new_tmp</span> <span class="o">=</span> <span class="n">T</span><span class="p">.</span><span class="n">clone</span><span class="p">(</span><span class="n">tmp</span><span class="p">)</span>
    <span class="n">grad_enabled</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">l2</span><span class="p">(</span><span class="n">new_tmp</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">grad_disabled</span><span class="p">,</span> <span class="n">grad_enabled</span>
</code></pre></div></div>

<h3 id="433-usecase">4.3.3) Usecase</h3>

<p><strong>Gradient Stopping</strong> It is possible to use this method to stop gradients, but there are easier ways to accomplish this.</p>

<p><strong>Inference Speed</strong> While <code class="language-plaintext highlighter-rouge">no_grad</code> stops operation tracking, <code class="language-plaintext highlighter-rouge">inference</code> disables two other autograd features: version counting and metadata tracking.</p>

<h1 id="5-low-level">5) Low-Level</h1>

<p>In the following methods, we work directly with the computed gradients instead of detaching variables or telling PyTorch to ignore blocks. This low-level access is useful for making complex modifications to our gradients; while such modifications won’t be needed in our simple setup, they are worth mentioning ahead of time.</p>

<p>Furthermore, whereas the methods in the <strong>High-Level</strong> section stop <strong>all</strong> gradients from flowing upstream, the <strong>Low-Level</strong> methods allow us to selectively skip modules.</p>

<h2 id="things-to-note">Things to Note:</h2>

<ul>
  <li>
    <p>Gradients are stored in the model parameters when <code class="language-plaintext highlighter-rouge">loss.backward</code> is called. The <code class="language-plaintext highlighter-rouge">optimizer.step</code> call simply applies these gradients. Thus, using the optimizer method is more or less equivalent to the manual manipulation method.</p>
  </li>
  <li>
    <p>Unlike the resulting computation graphs in the <strong>High-Level</strong> section, where no <code class="language-plaintext highlighter-rouge">L1</code> information is kept, in both <strong>Low-Level</strong> solutions <code class="language-plaintext highlighter-rouge">L1</code> is still tracked even if unused (as verified by quick tests in the corresponding cells).</p>
  </li>
</ul>

<h2 id="52-colab-optimoptimizer">5.2) <a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG#scrollTo=LY-dCdjsqeA7&amp;uniqifier=1"><code class="language-plaintext highlighter-rouge">Colab: optim.Optimizer</code></a></h2>

<p>Rather than using <code class="language-plaintext highlighter-rouge">optim.SomeOptimizer(model.parameters())</code>, we use <code class="language-plaintext highlighter-rouge">optim.SomeOptimizer(model.l2.parameters())</code>, which instructs our optimizer to apply gradients only to the <code class="language-plaintext highlighter-rouge">L2</code> parameters.</p>
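<p>A sketch of the idea on a toy two-layer model (the names are made up): the optimizer only ever sees <code class="language-plaintext highlighter-rouge">l2</code>, so only <code class="language-plaintext highlighter-rouge">l2</code> moves.</p>

```python
import torch as T
import torch.nn as nn

l1, l2 = nn.Linear(4, 8), nn.Linear(8, 1)
net = nn.Sequential(l1, l2)

# Hand the optimizer only the output head's parameters
opt = T.optim.SGD(l2.parameters(), lr=0.1)

before = l1.weight.clone()
net(T.randn(8, 4)).pow(2).mean().backward()  # l1 still accumulates grads...
opt.step()                                   # ...but only l2's weights move

assert T.equal(l1.weight, before)  # l1 is unchanged
assert l1.weight.grad is not None  # even though its gradients exist
```

<p>This also illustrates the note above: <code class="language-plaintext highlighter-rouge">L1</code> is still tracked and still accumulates gradients; the optimizer simply never applies them.</p>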

<h3 id="521-usecase">5.2.1) Usecase</h3>

<ul>
  <li><strong>Gradient Stopping</strong>: As with the above methods, this approach can “freeze” a layer.</li>
  <li><strong>Gradient Manipulation</strong>: This allows specification of per-module hyperparameters, though it does not provide fine-grained control.</li>
</ul>

<h2 id="53-colab-manual-manipulation">5.3) <a href="https://colab.research.google.com/drive/1IfFTSCZkjQWKDLLZ3u4aLDGoD_ptALgG#scrollTo=qFf7C93Rp0yU&amp;uniqifier=1"><code class="language-plaintext highlighter-rouge">Colab: Manual Manipulation</code></a></h2>

<p>Here, unlike the above section where the optimizer applies our gradients for us, we apply the gradients manually.</p>

<h3 id="531-usecase">5.3.1) Usecase</h3>

<p>The primary use-case for this method over all others is custom gradient application: for instance, zeroing out gradients every other step, or scaling the gradients under specific conditions.</p>
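<p>A minimal numpy sketch of such a custom rule, with a hypothetical step counter and scale factor (in PyTorch you would modify each parameter&#8217;s <code class="language-plaintext highlighter-rouge">.grad</code> in place before applying it):</p>

```python
import numpy as np

def apply_custom(param, grad, step, scale=0.5):
    """Zero out gradients on every other step; otherwise scale them."""
    if step % 2 == 1:
        grad = np.zeros_like(grad)   # skip this update entirely
    else:
        grad = grad * scale          # dampen the update
    return param - grad              # learning rate folded into scale

p = np.array([1.0, 1.0])
g = np.array([0.2, -0.2])
p = apply_custom(p, g, step=0)  # scaled update
p = apply_custom(p, g, step=1)  # zeroed update: p is unchanged
```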

<h1 id="6-closing-thoughts">6) Closing Thoughts</h1>

<h2 id="61-gradient-stopping">6.1) Gradient Stopping</h2>

<p>The “simple” methods such as <code class="language-plaintext highlighter-rouge">no_grad</code> are generally easier to implement and should be preferred if your goal is merely to stop gradients from flowing upstream. My recommendation is to use <code class="language-plaintext highlighter-rouge">no_grad</code> wherever possible, as it is faster than <code class="language-plaintext highlighter-rouge">detach</code>. This preference is somewhat subjective, but I find <code class="language-plaintext highlighter-rouge">no_grad</code> also clearer because it explicitly marks out a block of computations that will not be differentiated further down. When you <code class="language-plaintext highlighter-rouge">detach</code> a variable, you end up with both the original tensor and the detached copy, which could lead to confusion.</p>

<p>I recommend avoiding <code class="language-plaintext highlighter-rouge">inference</code> for gradient manipulation unless you’re absolutely certain you have a compelling reason. I do not see a scenario where <code class="language-plaintext highlighter-rouge">inference</code> would be preferred over <code class="language-plaintext highlighter-rouge">no_grad</code>, especially when considering that using <code class="language-plaintext highlighter-rouge">no_grad</code> allows you to avoid unnecessary copying of variables.</p>

<h2 id="62-gradient-manipulation">6.2) Gradient Manipulation</h2>

<p>If feasible, use the optimizer approach as it leaves less room for error. However, the <strong>Manual Manipulation</strong> method is ideal if you need to apply custom operations to your gradients. This is particularly useful for scenarios where you might want to scale gradients for specific layers under certain conditions or zero out gradients intermittently.</p>]]></content><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><category term="machine learning" /><category term="gradients" /><category term="optimization" /><category term="pytorch" /><summary type="html"><![CDATA[Spoiler: PyTorch offers about five ways to manipulate gradients.]]></summary></entry><entry><title type="html">Stumbling backwards into np.random.seed through jax.</title><link href="https://ianq.ai/randomness_in_jax/" rel="alternate" type="text/html" title="Stumbling backwards into np.random.seed through jax." /><published>2022-01-06T00:00:00-08:00</published><updated>2022-01-06T00:00:00-08:00</updated><id>https://ianq.ai/randomness_in_jax</id><content type="html" xml:base="https://ianq.ai/randomness_in_jax/"><![CDATA[<p><strong>Spoiler</strong>: We’ve all been using randomness wrong</p>

<p>You can find the associated <a href="https://github.com/IanQS/blogpostcode/blob/master/src/jax/randomness.ipynb">notebook</a> for this post, but it’s relatively minimal. Feel free to open the link and play with the notebook, but know that running it isn’t strictly necessary.</p>

<h1 id="1-intro">1) Intro</h1>
<p>Given my current needs, I think that <a href="https://github.com/google/jax"><code class="language-plaintext highlighter-rouge">jax</code></a> is the best computational tool out there. I hope to write more about <code class="language-plaintext highlighter-rouge">jax</code> in the coming months, and show you why you should consider trying it out. One important thing to realize is that <code class="language-plaintext highlighter-rouge">jax</code> is not a deep learning framework (although it does have autograd built-in). First and foremost, <code class="language-plaintext highlighter-rouge">jax</code> is a numerical computation library, like <code class="language-plaintext highlighter-rouge">numpy</code>.</p>

<p>Over the weekend, I was working on porting some code from <code class="language-plaintext highlighter-rouge">pytorch</code> to <code class="language-plaintext highlighter-rouge">jax</code>. In the process, I stumbled onto some code that dealt with randomness, and I decided to read more about randomness in the context of <code class="language-plaintext highlighter-rouge">numpy</code>. The material I had read over the weekend ended up being the motivation behind this blog post. To begin, let’s look at how we would deal with randomness in <code class="language-plaintext highlighter-rouge">jax</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">key</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">PRNGKey</span><span class="p">(</span><span class="n">SEED</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>

<span class="c1"># which outputs the following on my run:
#   DeviceArray([1076515368, 3893328283], dtype=uint32)
</span></code></pre></div></div>

<p>Ironically, I felt like I understood <code class="language-plaintext highlighter-rouge">numpy</code>’s randomness better after using <code class="language-plaintext highlighter-rouge">jax</code>. This blog post aims to lay out what I learned in the process.</p>

<h2 id="i-a-little-about-jax">i) A little about jax</h2>

<p>As mentioned earlier, <code class="language-plaintext highlighter-rouge">jax</code> is a computational framework akin to <code class="language-plaintext highlighter-rouge">numpy</code>. I’d say the main difference between <code class="language-plaintext highlighter-rouge">jax</code> and <code class="language-plaintext highlighter-rouge">numpy</code> is that <code class="language-plaintext highlighter-rouge">jax</code> was designed to be accelerator agnostic: it runs fast regardless of whether you’re on a CPU, GPU, or TPU. I particularly like it because of:</p>

<ul>
  <li>
    <p>how <a href="https://jax.readthedocs.io/en/latest/faq.html#benchmarking-jax-code">fast</a> it is when compared to other frameworks (I got a 10X speed boost compared to raw vectorized numpy in a function with lots of dot products).</p>
  </li>
  <li>
    <p>how easy it is to peek into its internals (admittedly, this is subjective).</p>
  </li>
  <li>
    <p>how it allows you to implement the equations you see in papers directly. You can implement the line of code then call <a href="https://jax.readthedocs.io/en/latest/_autosummary/jax.vmap.html#jax.vmap"><code class="language-plaintext highlighter-rouge">vmap</code></a> to apply it to all rows in your array. You don’t need to futz around with vectorizing your equations any longer.</p>
  </li>
</ul>
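<p>The last point can be sketched without jax: write the equation for a single example, then batch it. The explicit loop below stands in for what <code class="language-plaintext highlighter-rouge">jax.vmap</code> automates (and compiles):</p>

```python
import numpy as np

def predict_one(x, w):
    # the "paper" version: a single example, no batch dimension
    return x @ w

X = np.arange(12.0).reshape(4, 3)   # 4 examples, 3 features
w = np.array([1.0, 0.0, -1.0])

# roughly what jax.vmap(predict_one, in_axes=(0, None)) would do for us
batched = np.stack([predict_one(row, w) for row in X])

assert np.allclose(batched, X @ w)  # matches the hand-vectorized form
```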

<h2 id="ii-could-it-be-the-future">ii) Could it be the future?</h2>

<p>I feel like <code class="language-plaintext highlighter-rouge">jax</code> and <code class="language-plaintext highlighter-rouge">XLA</code> are the future of computation in python. Granted, this isn’t exactly a hot take - lots of people and companies have begun to move to <code class="language-plaintext highlighter-rouge">jax</code>:</p>

<ul>
  <li>
    <p>DeepMind’s <a href="https://github.com/deepmind/alphafold">alphafold</a> model is built in <a href="https://github.com/deepmind/dm-haiku">haiku</a>, which is a deep-learning oriented library built on top of <code class="language-plaintext highlighter-rouge">jax</code></p>
  </li>
  <li>
<p>Google Brain has also released a deep-learning library called <a href="https://www.youtube.com/watch?v=fuAyUQcVzTY">flax</a>. From what I can tell, teams at Google Brain have begun transitioning over to it.</p>
  </li>
  <li>
    <p>Huggingface has also begun releasing models in <code class="language-plaintext highlighter-rouge">flax</code></p>
  </li>
</ul>

<p><strong>Note</strong>: Leaving PyTorch behind</p>

<p>In my last blog post <a href="./2022-01-06-pytorch-gradients-pt1.md">PyTorch Gradients</a>, I mentioned publishing a series of posts covering gradients in PyTorch. I fully intend to finish that series, but I’ve more or less abandoned PyTorch.</p>

<h1 id="2-randomness">2) Randomness:</h1>

<p>Anyways, on to the meat of this post: over the weekend, I was playing with the idea of porting over <a href="https://github.com/jeshraghian/snntorch">snnTorch</a> to <code class="language-plaintext highlighter-rouge">jax</code>. I first began by scanning through the tutorials where I read some material about <a href="https://github.com/jeshraghian/snntorch/blob/master/docs/tutorials/tutorial_1.rst#3-spike-generation-optional">creating random spike trains</a>. The contents of the tutorial and what spike trains are aren’t crucial for this post. Still, it did remind me that <code class="language-plaintext highlighter-rouge">jax</code> handles randomness differently from other frameworks. So, I thought I should do some deep(er) reading before naively moving code over.</p>

<p>If you look up randomness in <code class="language-plaintext highlighter-rouge">jax</code>, one of the first things you’ll stumble on is how to generate a random key and continually split it. To make a long story short, <code class="language-plaintext highlighter-rouge">jax</code> is <a href="https://en.wikipedia.org/wiki/Functional_programming">functional</a> in nature, which means that it is stateless. Being stateless means (among other things) that <code class="language-plaintext highlighter-rouge">jax</code> handles randomness explicitly; we have to pass in a key every time we invoke randomness in our code. On the one hand, this makes our code more verbose, but on the other hand, it makes reproducibility far easier.</p>
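<p>The contrast can be illustrated without jax by treating a numpy <code class="language-plaintext highlighter-rouge">Generator</code> seed as a stand-in for a key: the same key always yields the same draws, and no hidden state advances between calls (jax behaves the same way, e.g. <code class="language-plaintext highlighter-rouge">jax.random.normal(key, shape)</code>):</p>

```python
import numpy as np

def draw(key, n):
    # "stateless" randomness: the key fully determines the output
    return np.random.default_rng(key).standard_normal(n)

a = draw(0, 3)
b = draw(0, 3)   # same key in, same numbers out: no hidden state
c = draw(1, 3)   # a different key gives an independent stream

assert np.array_equal(a, b)
assert not np.array_equal(a, c)
```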

<hr />

<h2 id="i-statefulness">i) Statefulness</h2>

<p>The following is merely a working example of what “statefulness” means. It is by no means a rigorous definition. Think of being stateful as the following:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class StatefulAdd:
    def __init__(self):
        self.count = 0

    def __call__(self, x):
        # x plus the number of times this object has been called before
        result = x + self.count
        self.count += 1
        return result

foo = StatefulAdd()
first = foo(1)   # first := 1
second = foo(1)  # second := 2
</code></pre></div></div>

<p>i.e., we can plug the same value in but obtain a different result each time. There’s nothing inherently wrong with coding this way (regardless of what the func-ies will say); it can just be harder to reason about.</p>

<hr />

<p>Anyways, going back to <code class="language-plaintext highlighter-rouge">jax</code>: by enforcing statelessness, we have to be explicit about our random key every time we make a call. By enforcing statelessness, <code class="language-plaintext highlighter-rouge">jax</code> sidesteps the reproducibility issues that plagued TensorFlow 1.X (and probably PyTorch too). Although <code class="language-plaintext highlighter-rouge">jax</code> <a href="https://github.com/google/jax/issues/565">isn’t perfect</a> on the reproducibility front, I believe it is going in the right direction.</p>

<h2 id="ii-reproducibility-in-tf1x">ii) Reproducibility in TF1.X</h2>

<ul>
  <li>
    <p><a href="https://stackoverflow.com/questions/36288235/how-to-get-stable-results-with-tensorflow-setting-random-seed">How to get stable results with TensorFlow, setting random seed</a> although, to be fair, there seems to be an official answer for Tensorflow 2 as of <a href="https://stackoverflow.com/a/60088810/3532564">2020</a></p>
  </li>
  <li>
    <p><a href="https://stackoverflow.com/questions/32419510/how-to-get-reproducible-results-in-keras/52897216#52897216">Why can’t I get reproducible results in Keras even though I set the random seeds? (asked in 2018)</a> which contains my favorite answer I’ve seen so far. The answer states the following and has the following caveat:</p>
  </li>
</ul>

<blockquote>
  <p>In short, to be absolutely sure that you will get reproducible results with your python script on one computer’s/laptop’s CPU then you will have to do the following:</p>
</blockquote>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Seed value
# Apparently you may use different seed values at each stage
</span><span class="n">seed_value</span><span class="o">=</span> <span class="mi">0</span>

<span class="c1"># 1. Set the `PYTHONHASHSEED` environment variable at a fixed value
</span><span class="kn">import</span> <span class="nn">os</span>
<span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'PYTHONHASHSEED'</span><span class="p">]</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">seed_value</span><span class="p">)</span>

<span class="c1"># 2. Set the `python` built-in pseudo-random generator at a fixed value
</span><span class="kn">import</span> <span class="nn">random</span>
<span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed_value</span><span class="p">)</span>

<span class="c1"># 3. Set the `numpy` pseudo-random generator at a fixed value
</span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed_value</span><span class="p">)</span>

<span class="c1"># 4. Set the `tensorflow` pseudo-random generator at a fixed value
</span><span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="n">tf</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">set_seed</span><span class="p">(</span><span class="n">seed_value</span><span class="p">)</span>
<span class="c1"># for later versions: 
# tf.compat.v1.set_random_seed(seed_value)
</span>
<span class="c1"># 5. Configure a new global `tensorflow` session
</span><span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">backend</span> <span class="k">as</span> <span class="n">K</span>
<span class="n">session_conf</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">ConfigProto</span><span class="p">(</span><span class="n">intra_op_parallelism_threads</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">inter_op_parallelism_threads</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">sess</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">Session</span><span class="p">(</span><span class="n">graph</span><span class="o">=</span><span class="n">tf</span><span class="p">.</span><span class="n">get_default_graph</span><span class="p">(),</span> <span class="n">config</span><span class="o">=</span><span class="n">session_conf</span><span class="p">)</span>
<span class="n">K</span><span class="p">.</span><span class="n">set_session</span><span class="p">(</span><span class="n">sess</span><span class="p">)</span>
<span class="c1"># for later versions:
# session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
# sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
# tf.compat.v1.keras.backend.set_session(sess)
</span></code></pre></div></div>

<p>Indeed, a thing of beauty.</p>

<h1 id="3-reproducibility-in-numpy">3) Reproducibility in <code class="language-plaintext highlighter-rouge">numpy</code></h1>

<p>First and foremost, I’d recommend opening the <a href="https://github.com/IanQS/blogpostcode/blob/master/src/jax/randomness.ipynb">accompanying notebook</a>, specifically the <code class="language-plaintext highlighter-rouge">numpy</code> portion and playing with the code there. NB: the <code class="language-plaintext highlighter-rouge">jax</code> portion is trivial and works as you might expect; I included the <code class="language-plaintext highlighter-rouge">jax</code> portion primarily for completeness.</p>

<p>As you play with the <code class="language-plaintext highlighter-rouge">numpy</code> portion, you’ll notice that you get new random values every time you call into the <code class="language-plaintext highlighter-rouge">random</code> module, even though you never explicitly pass in a key. This tells us something is happening under the hood.</p>

<p>This “something” looks a lot like we are generating a new random key on every call. Note that this is not what happens under the hood, but it helps tie what we see to <code class="language-plaintext highlighter-rouge">jax</code> and how it handles random state.</p>

<h2 id="i-example-scenario">i) Example Scenario</h2>

<p>You have a program that only crashes once in a while, and you’ve identified the exact function that it crashes on! You’ve even managed to find a specific random seed on which that function works fine, so you’d like to set the state only inside that function and avoid the problem altogether.</p>

<p>Yes, this is a contrived example; sue me.</p>

<h3 id="statefullness-issue-illustrated">Statefulness issue illustrated</h3>

<p>Note here how we have reset the random seed within <code class="language-plaintext highlighter-rouge">new_generate_np_weights</code>. If the randomness were only local to the context we are in, we would expect to “continue” the original randomness once we exit the function. Said differently, we would have two “sources” of randomness, the second of which would get garbage collected once <code class="language-plaintext highlighter-rouge">new_generate_np_weights</code> returns. However, as we can see at the call labeled “# 3rd call”, we receive the same random value as our “# 2nd call”.</p>
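<p>The behaviour can be reproduced in a few lines; <code class="language-plaintext highlighter-rouge">new_generate_np_weights</code> below is a minimal stand-in for the notebook’s function. Seeding inside the function silently resets the one global stream, so the draw after the call repeats values instead of continuing the original stream:</p>

```python
import numpy as np

np.random.seed(0)
first = np.random.standard_normal()    # 1st call

def new_generate_np_weights():
    np.random.seed(0)                  # resets the *global* state
    return np.random.standard_normal()

inner = new_generate_np_weights()      # 2nd call: repeats `first`
after = np.random.standard_normal()    # 3rd call: continues the *reset* stream

assert inner == first                  # the reset is not local to the function
```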

<h3 id="the-global-state">The global state</h3>

<p>Clearly, something “unexpected” is happening. At its core, <code class="language-plaintext highlighter-rouge">np.random.seed</code> operates on what is known as a <a href="https://numpy.org/doc/stable/reference/random/legacy.html#numpy.random.RandomState"><code class="language-plaintext highlighter-rouge">RandomState</code></a>, which, as we’ve discussed, is a stateful object. In fact, as we saw in our code example, calling <code class="language-plaintext highlighter-rouge">seed</code> resets the single global object rather than creating a fresh, local source of randomness.</p>

<p>Obviously, this is the source of our issues.</p>

<h2 id="ii-how-do-we-address-reproducibility-in-numpy">ii) How do we address reproducibility in numpy?</h2>

<p>In all honesty, I had previously stumbled on the <a href="https://numpy.org/doc/stable/reference/random/generator.html">new best practices</a> for generating random numbers in <code class="language-plaintext highlighter-rouge">numpy</code>, but I never bothered to read them. I don’t think the reasoning behind the recommendation ever clicked with me, so I never felt a need to change how I was doing things.</p>

<p>However, now that we are clear on the limitations of the existing <code class="language-plaintext highlighter-rouge">np.random.seed</code>, we can discuss the recommended way of doing things: the <a href="https://numpy.org/doc/stable/reference/random/generator.html"><code class="language-plaintext highlighter-rouge">Generator</code></a> API. To make a long story short, you create an object which contains all of your randomness, and you “extract” whatever you need from this random object. For example, see <a href="https://numpy.org/doc/stable/reference/random/index.html">random sampling</a>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from numpy.random import default_rng

rng = default_rng(seed=42)  # all of our randomness lives in this one object
vals = rng.standard_normal(10)
more_vals = rng.standard_normal(10)
</code></pre></div></div>

<p>as opposed to an older method</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">numpy</span> <span class="kn">import</span> <span class="n">random</span>
<span class="n">vals</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">standard_normal</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
<span class="n">more_vals</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">standard_normal</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
</code></pre></div></div>

<p>Here, we presumably mutate a global object under the hood.</p>
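<p>Under the new API you can even mimic jax’s key splitting: numpy’s <code class="language-plaintext highlighter-rouge">SeedSequence</code> can spawn independent child seeds, one per consumer. A sketch (not code from the notebook):</p>

```python
from numpy.random import SeedSequence, default_rng

# one root seed, split into independent child streams: the numpy
# analogue of jax.random.split(key, num=3)
children = SeedSequence(42).spawn(3)
rngs = [default_rng(s) for s in children]

draws = [rng.standard_normal(4) for rng in rngs]
```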

<h1 id="closing-thoughts">Closing Thoughts:</h1>

<p>This was an enlightening topic for me to dive into, and I hope you found reading this useful. I feel like I better understand what <code class="language-plaintext highlighter-rouge">numpy</code> does under the hood when we use randomness. I also feel like I better understand the motivation behind <code class="language-plaintext highlighter-rouge">numpy</code>’s API change recommendation when viewed through the lens of <code class="language-plaintext highlighter-rouge">jax</code>.</p>

<p>tl;dr</p>

<p>1) <code class="language-plaintext highlighter-rouge">jax</code> handles randomness very well, even if it may be more verbose.
2) Use the new <a href="https://numpy.org/doc/stable/reference/random/index.html">best practices</a> if you are dealing with random numbers in <code class="language-plaintext highlighter-rouge">numpy</code></p>

<h2 id="ps">P.S.</h2>

<p>You can generate multiple keys to consume later with <a href="https://jax.readthedocs.io/en/latest/_autosummary/jax.random.split.html">jax.random.split</a>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">key_array</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="n">X</span><span class="p">)</span>
</code></pre></div></div>]]></content><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><category term="machine learning" /><category term="jax" /><summary type="html"><![CDATA[Spoiler: We’ve all been using randomness wrong]]></summary></entry><entry><title type="html">A Machine Learning oriented introduction to PALISADE, CKKS and pTensor.</title><link href="https://ianq.ai/pTensor-and-palisade/" rel="alternate" type="text/html" title="A Machine Learning oriented introduction to PALISADE, CKKS and pTensor." /><published>2021-02-01T00:00:00-08:00</published><updated>2021-02-01T00:00:00-08:00</updated><id>https://ianq.ai/pTensor-and-palisade</id><content type="html" xml:base="https://ianq.ai/pTensor-and-palisade/"><![CDATA[<p><strong>Spoiler</strong>: You can do math on encrypted numbers</p>

<p>Note: “we” means “I”</p>

<p><strong>Overview</strong>:</p>

<p>1) We introduce the <a href="https://gitlab.com/palisade/palisade-development">PALISADE</a> library and the cryptographic parameters that we need to specify. We then explain what the cryptographic parameters mean for our application.</p>

<p>2) We use the <a href="https://github.com/IanQS/pTensor">pTensor</a> library and train a housing price predictor on the <a href="https://github.com/melindaleung/Ames-Iowa-Housing-Dataset/tree/master/data">Ames</a> dataset, a modern house price dataset.</p>

<p>3) We set up the discussion for the next post in the series.</p>

<p>Note: check the link at the very bottom for the complete source code. Sections have been omitted in this page to reduce clutter.</p>

<h2 id="1-palisade">1) PALISADE</h2>

<p>Instructions to install PALISADE can be found here: <a href="https://gitlab.com/palisade/palisade-development#build-instructions">PALISADE-Dev build instructions</a>. For users new to PALISADE and C++, we highly recommend bookmarking the <a href="https://palisade.gitlab.io/palisade-development/files.html">PALISADE Doxygen page</a> containing the library’s documentation.</p>

<h3 id="i-what-is-palisade">i) What is PALISADE</h3>

<p>From the <code class="language-plaintext highlighter-rouge">README.md</code> on the PALISADE page:</p>

<p>PALISADE is a general lattice cryptography library that currently includes efficient implementations of the following lattice cryptography capabilities:</p>

<ul>
  <li>Fully Homomorphic Encryption (FHE)
    <ul>
      <li>Brakerski/Fan-Vercauteren (BFV) scheme for integer arithmetic</li>
      <li>Brakerski-Gentry-Vaikuntanathan (BGV) scheme for integer arithmetic</li>
      <li>Cheon-Kim-Kim-Song (CKKS) scheme for real-number arithmetic</li>
      <li>Ducas-Micciancio (FHEW) and Chillotti-Gama-Georgieva-Izabachene (TFHE) schemes for Boolean circuit evaluation</li>
    </ul>
  </li>
  <li>Multi-Party Extensions of FHE (to support multi-key FHE)
    <ul>
      <li>Threshold FHE for BGV, BFV, and CKKS schemes</li>
      <li>Proxy Re-Encryption for BGV, BFV, and CKKS schemes</li>
    </ul>
  </li>
</ul>

<h3 id="ii-machine-learning-application">ii) Machine Learning Application</h3>

<p>The takeaway for us machine learning practitioners is that we can train machine learning models on encrypted data and have them produce encrypted predictions.</p>

<h2 id="2-palisades-cryptographic-parameters">2) PALISADE’s Cryptographic Parameters</h2>

<p>We as machine learners(?) need to have a rough idea of the following parameters:</p>

<h3 id="i-multdepth">i) multDepth</h3>

<p>This describes the depth of multiplication supported. Informally, when we encrypt data, we add some noise to increase the scheme’s security. When doing mathematical operations on these data, our noise increases (linearly in addition and subtraction but squared in multiplication).</p>

<p>There is no single “best” value to set the multDepth to; it is highly dependent on your problem. The following example equations show their corresponding multiplication depths:</p>

<ul>
  <li>
    <p>\((a * b) + (c * d)\) has a multiplication depth of 1</p>
  </li>
  <li>
    <p>\(a * b * c\) has a multiplication depth of 2</p>
  </li>
</ul>
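<p>The counting rule generalizes: the multiplicative depth is the length of the longest chain of multiplications through the expression. A toy checker over tuple-encoded expressions (an illustration only, not PALISADE code):</p>

```python
def mult_depth(expr):
    """Depth of an expression encoded as a leaf, or ('+'|'*', left, right)."""
    if not isinstance(expr, tuple):
        return 0                      # a fresh ciphertext has depth 0
    op, left, right = expr
    depth = max(mult_depth(left), mult_depth(right))
    return depth + 1 if op == '*' else depth

# (a * b) + (c * d) -> depth 1;  a * b * c -> depth 2
assert mult_depth(('+', ('*', 'a', 'b'), ('*', 'c', 'd'))) == 1
assert mult_depth(('*', ('*', 'a', 'b'), 'c')) == 2
```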

<h3 id="ii-scalingfactorbits">ii) scalingFactorBits</h3>

<p>In the original <a href="https://eprint.iacr.org/2016/421.pdf">CKKS paper,</a> the authors discuss a scaling factor they multiply values with. The scaling factor prevents rounding errors from destroying the significant figures during encoding. Unfortunately, it is difficult to discuss this parameter without discussing the paper’s core ideas, so we leave this for the next post. Thankfully, PALISADE is reliable in informing us if the <code class="language-plaintext highlighter-rouge">scalingFactorBits</code> is set too low.</p>

<p>We tend to use values between 30 and 50 for most applications.</p>

<h3 id="iii-batchsize">iii) batchSize</h3>

<p>The batchSize is a tricky parameter to set correctly. The issue is that the batch size must be equal to</p>

\[\frac{\text{Ring size}}{2}\]

<p>Unfortunately, one needs to set multDepth first, inspect the resulting ring size, and then regenerate the context with batchSize set to half the ring size. It’s a little hairy, yes, but this is the price we pay for privacy.</p>
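<p>The resulting loop reduces to the following sketch, where the ring dimension is a hypothetical value you would read off the generated crypto context:</p>

```python
# 1. generate the context with your chosen multDepth / scalingFactorBits
# 2. read the ring dimension the library picked (hypothetical value here)
ring_dimension = 8192

# 3. regenerate the context with batchSize fixed to half the ring dimension
batch_size = ring_dimension // 2
assert batch_size == 4096
```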

<hr />

<h2 id="3-ptensor-library">3) pTensor library:</h2>

<p>For this discussion we encourage readers to refer to <a href="https://github.com/IanQS/pTensor/blob/main/linear_regression_ames.cpp">linear_regression_ames.cpp</a> but we also highlight the critical sections in our discussion.</p>

<h3 id="i-ptensor">i) pTensor</h3>

<p>The pTensor library’s motivation is to provide those with a machine learning or data science background the ability to train encrypted machine learning models in a framework that looks and feels familiar. Where possible, we aimed to mimic the numpy library’s behavior (e.g., it allows broadcasting, and <code class="language-plaintext highlighter-rouge">*</code> corresponds to the Hadamard product).</p>

<p>In line with the library’s motivation, there are many aspects hidden from the user, but we briefly discuss important concepts that the inquisitive user may stumble upon while perusing the source code.</p>

<h3 id="ii-complex-numbers">ii) Complex numbers</h3>

<p>CKKS operates on complex numbers for various reasons that we will discuss in the follow-up; for now, know that we only use the real part of these complex numbers.</p>

<h3 id="iii-packing">iii) Packing</h3>

<p>To pack the data essentially means that we encode multiple data points into a single ciphertext. Homomorphic encryption is a slow process, but by leveraging SIMD, we can carry out our operations faster. An analogy would be doing a <code class="language-plaintext highlighter-rouge">for-loop</code> vs. vectorized operation in numpy. Because the size of our ciphertexts is already very large, it is advantageous to store the data in transpose form to reduce the number of encryptions we need to do and to allow for faster element-wise operations.</p>
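<p>The SIMD analogy in numpy terms, with each array element standing in for one slot of a packed ciphertext:</p>

```python
import numpy as np

X = np.arange(6.0).reshape(3, 2)     # 3 samples, 2 features

# looping: one operation per sample, like one ciphertext per sample
looped = np.array([row * 2.0 for row in X])

# packed / SIMD: one operation touches every slot at once
packed = X * 2.0
assert np.array_equal(looped, packed)

# storing the transpose: each row now holds one feature across all
# samples, so a per-feature operation becomes a single "ciphertext" op
Xt = X.T                             # 2 ciphertexts instead of 3
```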

<h3 id="iv-ptensorm_cc">iv) pTensor::m_cc</h3>

<p>The <code class="language-plaintext highlighter-rouge">m_cc</code> object is the cryptographic context which we use to carry out PALISADE’s operations.</p>

<h2 id="4-using-ptensor-on-the-ames-dataset">4) Using pTensor on the Ames dataset</h2>

<h3 id="i-setting-up-the-cryptographic-contexts">i) Setting up the cryptographic contexts</h3>

<p>We show</p>

<ul>
  <li>how to create a cryptocontext, which configures PALISADE to perform encrypted computation within a specific encryption scheme</li>
  <li>code for training on the Ames dataset</li>
</ul>

<p>Should you attempt to reproduce this process in <a href="https://numpy.org/"><code class="language-plaintext highlighter-rouge">Numpy</code></a> or in <a href="https://eigen.tuxfamily.org/index.php?title=Main_Page"><code class="language-plaintext highlighter-rouge">Eigen</code></a>, know that because of the noise and the way our encryption scheme operates, you may observe slightly different results between those plaintext versions and this encrypted version.</p>

<p>We briefly introduce the parameters used below but defer further discussion to later.</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">auto</span> <span class="n">cc</span> <span class="o">=</span> <span class="n">lbcrypto</span><span class="o">::</span><span class="n">CryptoContextFactory</span><span class="o">&lt;</span><span class="n">lbcrypto</span><span class="o">::</span><span class="n">DCRTPoly</span><span class="o">&gt;::</span><span class="n">genCryptoContextCKKS</span><span class="p">(</span>
    <span class="n">multDepth</span><span class="p">,</span> <span class="n">scalingFactorBits</span><span class="p">,</span> <span class="n">batchSize</span>
<span class="p">);</span>

<span class="n">cc</span><span class="o">-&gt;</span><span class="n">Enable</span><span class="p">(</span><span class="n">ENCRYPTION</span><span class="p">);</span>
<span class="n">cc</span><span class="o">-&gt;</span><span class="n">Enable</span><span class="p">(</span><span class="n">SHE</span><span class="p">);</span>
<span class="n">cc</span><span class="o">-&gt;</span><span class="n">Enable</span><span class="p">(</span><span class="n">LEVELEDSHE</span><span class="p">);</span>  <span class="c1">// @NOTE: we discuss SHE and LeveledSHE in the follow up</span>
<span class="k">auto</span> <span class="n">keys</span> <span class="o">=</span> <span class="n">cc</span><span class="o">-&gt;</span><span class="n">KeyGen</span><span class="p">();</span>
<span class="n">cc</span><span class="o">-&gt;</span><span class="n">EvalMultKeyGen</span><span class="p">(</span><span class="n">keys</span><span class="p">.</span><span class="n">secretKey</span><span class="p">);</span>
<span class="n">cc</span><span class="o">-&gt;</span><span class="n">EvalSumKeyGen</span><span class="p">(</span><span class="n">keys</span><span class="p">.</span><span class="n">secretKey</span><span class="p">);</span>

<span class="kt">int</span> <span class="n">ringDim</span> <span class="o">=</span> <span class="n">cc</span><span class="o">-&gt;</span><span class="n">GetRingDimension</span><span class="p">();</span>
<span class="kt">int</span> <span class="n">rot</span> <span class="o">=</span> <span class="kt">int</span><span class="p">(</span><span class="o">-</span><span class="n">ringDim</span> <span class="o">/</span> <span class="mi">4</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span>
<span class="c1">// @NOTE: we discuss EvalAtIndex in the followup</span>
<span class="n">cc</span><span class="o">-&gt;</span><span class="n">EvalAtIndexKeyGen</span><span class="p">(</span><span class="n">keys</span><span class="p">.</span><span class="n">secretKey</span><span class="p">,</span> <span class="p">{</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">rot</span><span class="p">});</span>  
</code></pre></div></div>
<p>We create a cryptocontext object which takes our chosen parameters:</p>

<p><code class="language-plaintext highlighter-rouge">multDepth</code> - The maximum number of sequential multiplications we can do before our data becomes too noisy and the decryption becomes meaningless.</p>

<p><code class="language-plaintext highlighter-rouge">scalingFactorBits</code> - the scaling factor mentioned above and to be discussed later.</p>

<p><code class="language-plaintext highlighter-rouge">batchSize</code> - how many data points (think vector of data) we pack into a ciphertext. Homomorphic encryption is slow but can be sped up by conducting operations over batches of data (via SIMD)</p>

<h3 id="ii-training-setup">ii) Training setup</h3>

<h3 id="iii-constructdataset">iii) constructDataset</h3>

<p>Notice that the function takes the plaintext X and y as parameters. The reason for passing in plaintext X’s and y’s is to allow easy indexing into the data for shuffling. Shuffling the data in encrypted form is possible but prohibitively slow, and an easier alternative exists: to simulate shuffling the data every epoch, we let the user specify some number of shuffles, and the data owner creates that many shuffled copies of the data, which are then encrypted.</p>

<p>While training, we can simulate this randomness by randomly indexing into any of the shuffles.</p>
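<p>A minimal plaintext sketch of that scheme (the function name <code class="language-plaintext highlighter-rouge">construct_dataset</code> and its signature here are illustrative, not pTensor’s exact API):</p>

```python
import numpy as np

def construct_dataset(X, y, n_shuffles, seed=0):
    """Create n_shuffles pre-shuffled copies of (X, y).

    In the encrypted setting the data owner would encrypt each copy
    after shuffling; here we just return the plaintext permutations.
    """
    rng = np.random.default_rng(seed)
    shuffles = []
    for _ in range(n_shuffles):
        idx = rng.permutation(len(y))
        shuffles.append((X[idx], y[idx]))
    return shuffles

X = np.arange(8.0).reshape(4, 2)
y = np.arange(4.0)
dataset = construct_dataset(X, y, n_shuffles=3)

# During training, per-epoch shuffling is simulated by sampling a copy:
rng = np.random.default_rng(42)
X_epoch, y_epoch = dataset[rng.integers(len(dataset))]
assert len(dataset) == 3
assert sorted(y_epoch.tolist()) == y.tolist()
```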

<h3 id="iv-training">iv) Training</h3>

<p>The following loop should look familiar to anyone with a machine learning background:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">epoch</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">epoch</span> <span class="o">&lt;</span> <span class="n">epochs</span><span class="p">;</span> <span class="o">++</span><span class="n">epoch</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">auto</span> <span class="n">index</span> <span class="o">=</span> <span class="n">distr</span><span class="p">(</span><span class="n">generator</span><span class="p">);</span>
  <span class="k">auto</span> <span class="n">curr_dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">[</span><span class="n">index</span><span class="p">];</span>
  <span class="k">auto</span> <span class="n">X</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">get</span><span class="o">&lt;</span><span class="mi">0</span><span class="o">&gt;</span><span class="p">(</span><span class="n">curr_dataset</span><span class="p">);</span>
  <span class="k">auto</span> <span class="n">y</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">get</span><span class="o">&lt;</span><span class="mi">1</span><span class="o">&gt;</span><span class="p">(</span><span class="n">curr_dataset</span><span class="p">);</span>

  <span class="k">auto</span> <span class="n">prediction</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">encryptedDot</span><span class="p">(</span><span class="n">w</span><span class="p">);</span>
  <span class="k">auto</span> <span class="n">residual</span> <span class="o">=</span> <span class="n">prediction</span> <span class="o">-</span> <span class="n">y</span><span class="p">;</span><span class="c1">// Remember, our X is already a transpose</span>
  <span class="k">auto</span> <span class="n">_gradient</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">encryptedDot</span><span class="p">(</span><span class="n">residual</span><span class="p">);</span>
  <span class="n">pTensor</span> <span class="n">gradient</span><span class="p">;</span>
  <span class="n">gradient</span> <span class="o">=</span> <span class="n">_gradient</span><span class="p">;</span>
  <span class="k">auto</span> <span class="n">scaledGradient</span> <span class="o">=</span> <span class="n">gradient</span> <span class="o">*</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">scaleByNumSamples</span><span class="p">;</span>

  <span class="n">w</span> <span class="o">=</span> <span class="n">pTensor</span><span class="o">::</span><span class="n">applyGradient</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">scaledGradient</span><span class="p">);</span>
  <span class="n">w</span> <span class="o">=</span> <span class="n">w</span><span class="p">.</span><span class="n">decrypt</span><span class="p">().</span><span class="n">encrypt</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>However, there are a few things to note:</p>

<p>1) <code class="language-plaintext highlighter-rouge">encryptedDot</code> instead of <code class="language-plaintext highlighter-rouge">dot</code> (which is also supported)</p>

<p>In the first <code class="language-plaintext highlighter-rouge">encryptedDot</code>, in the matrix-matrix case, we do a Hadamard product followed by a summation along the 0th axis. Again, our X is encrypted in transpose form, of shape (#features, #observations), and our weight matrix is also of shape (#features, #observations). We leave it to the reader to work out the details of why this works.</p>

<p>In the other case (not matrix-matrix), we default to the standard dot product.</p>

<p>2) <code class="language-plaintext highlighter-rouge">applyGradient</code></p>

<p>To understand the motivation here, we must first discuss the shape of the incoming values</p>

<p><code class="language-plaintext highlighter-rouge">w: (#features, #observations)</code></p>

<p><code class="language-plaintext highlighter-rouge">scaledGradient: (1, #features)</code></p>

<p>So, we must broadcast the <code class="language-plaintext highlighter-rouge">scaledGradient</code> into a repeated-matrix form before applying it to the weights.</p>
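<p>In plaintext numpy terms (a sketch with made-up shapes, not pTensor’s actual implementation), the repeated-matrix form amounts to tiling the (1, #features) gradient so every column of <code class="language-plaintext highlighter-rouge">w</code> sees the same per-feature step:</p>

```python
import numpy as np

n_features, n_observations = 3, 5
w = np.ones((n_features, n_observations))
scaled_gradient = np.array([[0.1, 0.2, 0.3]])  # (1, n_features)

# Tile the gradient into shape (n_features, n_observations) so it can
# be subtracted from w element-wise; numpy could broadcast this, but in
# the encrypted setting the repetition has to be materialized.
repeated = np.repeat(scaled_gradient.T, n_observations, axis=1)
assert repeated.shape == (n_features, n_observations)

w_new = w - repeated  # the plaintext analogue of applyGradient
assert np.allclose(w_new[:, 0], [0.9, 0.8, 0.7])
```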

<p>3) <code class="language-plaintext highlighter-rouge">w.decrypt().encrypt()</code></p>

<p>The reason for our decrypt-encrypt round trip has to do with the <code class="language-plaintext highlighter-rouge">multDepth</code> parameter that we briefly discussed earlier. As mentioned, as we operate on our ciphertexts, we accumulate noise. If this noise grows too large, decryption begins to fail, and random bits get interpreted as (usually huge) random numbers. By decrypting and re-encrypting our results, we refresh this noise (reduce it to 0).</p>

<p>However, there is a caveat: only the party with the secret key can do the re-encrypting. Consider a data-enclave setup where the enclave holds the key and the client does all the computation. There is a limit to the maximum <code class="language-plaintext highlighter-rouge">multDepth</code> one can set before CKKS becomes too unwieldy. Computations that exceed that <code class="language-plaintext highlighter-rouge">multDepth</code> need either server re-encryption (as shown here) or bootstrapping (which we will address in the next post). Bootstrapping resets the noise and thus the multiplicative depth; however, bootstrapping for CKKS is not yet available in PALISADE as of Feb 2021. The server re-encryption process is considered less secure than a fully homomorphic setup, but we defer further discussion to the next post.</p>
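<p>The bookkeeping can be pictured with a toy counter (purely illustrative; PALISADE tracks levels internally and the class below is not its API): each sequential ciphertext multiplication consumes one level of the depth budget, and a decrypt-encrypt round trip restores it:</p>

```python
class DepthBudget:
    """Toy model of a ciphertext's multiplicative-depth budget."""

    def __init__(self, mult_depth):
        self.mult_depth = mult_depth
        self.remaining = mult_depth

    def multiply(self):
        # each sequential multiplication adds noise / consumes a level
        if self.remaining == 0:
            raise RuntimeError("too noisy: decryption would fail")
        self.remaining -= 1

    def refresh(self):
        # decrypt-then-encrypt (or bootstrapping) resets the noise
        self.remaining = self.mult_depth

ct = DepthBudget(mult_depth=2)
ct.multiply()
ct.multiply()
ct.refresh()   # without this, the next multiply would raise
ct.multiply()
assert ct.remaining == 1
```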

<h1 id="5-closing-thoughts">5) Closing Thoughts</h1>

<p>P.s: visit  <a href="https://gitlab.com/palisade/palisade-development/-/tree/master/src/pke/examples">PALISADE - PKE</a> for further examples of how to use PALISADE (one of which I contributed to!).</p>]]></content><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><category term="CKKS" /><category term="FHE" /><category term="encrypted ml" /><category term="C++" /><summary type="html"><![CDATA[Spoiler: You can do math on encrypted numbers]]></summary></entry><entry><title type="html">MAPE Madness</title><link href="https://ianq.ai/The-Curious-MAPE/" rel="alternate" type="text/html" title="MAPE Madness" /><published>2019-12-25T00:00:00-08:00</published><updated>2019-12-25T00:00:00-08:00</updated><id>https://ianq.ai/The-Curious-MAPE</id><content type="html" xml:base="https://ianq.ai/The-Curious-MAPE/"><![CDATA[<p><strong>Spoiler</strong>: RTFM</p>

<p><strong>Problem setup</strong>: You want to use the Mean Absolute Percentage Error (<strong>MAPE</strong>) as your loss function for training <strong>Linear Regression</strong> on some forecast data. <a href="https://link.springer.com/referenceworkentry/10.1007%2F1-4020-0612-8_580">Springer: Mean Absolute Percentage Error (MAPE)</a> has found success in forecasting because it has desirable properties:</p>

<ul>
  <li>
    <p>robust to outliers</p>
  </li>
  <li>
    <p>scale invariance (returns a percentage) and is intuitive to compare across datasets.</p>
  </li>
</ul>

<h1 id="0-setup">0) Setup</h1>

<p>You have forecasting data where a significant difference may exist between contiguous samples.</p>

\[T_1 = 5, T_2 = 5000\]

<p>For example, you want to predict the price of Bitcoin, or ensure that your power plants can support demand when <a href="https://www.reuters.com/article/uk-soccer-world-england-electricity/england-brews-up-sufficient-power-for-world-cup-tea-time-surge-idUKKBN0E92G220140529">England brews up sufficient power for World Cup tea-time surge</a>.</p>

<p>We reproduce the equation below:</p>

\[\text{MAPE} = \frac{1}{N} \sum_{t=1}^{N} \left|\frac{y_t - \hat{y}_t}{y_t}\right|\]
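<p>In code, the definition above is a one-liner (plain numpy, assuming no zero labels):</p>

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error; assumes y_true has no zeros."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))

# A 10% miss on a tiny value and on a huge one contribute equally:
assert np.isclose(mape([5.0, 5000.0], [4.5, 4500.0]), 0.1)
```

<p>This scale invariance is exactly why MAPE suits series like \(T_1 = 5, T_2 = 5000\) above.</p>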

<h1 id="1-failed-attempts">1) Failed Attempts</h1>

<p>Here’s hoping you learn from my mistakes and can avoid the time I wasted trying to solve this problem</p>

<h2 id="11-sklearn">1.1) Sklearn</h2>

<p>A quick look at the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html">Sklearn Linear Model - Linear Regression</a> page tells you that it only supports <a href="https://en.wikipedia.org/wiki/Ordinary_least_squares">OLS</a>. This is unfortunate because <code class="language-plaintext highlighter-rouge">sklearn</code> is, in general, heavily optimized and well tested.</p>

<h2 id="12-autograd">1.2) <a href="https://github.com/HIPS/autograd">Autograd</a></h2>

<p>Having worked through the <a href="https://github.com/HIPS/autograd/tree/master/examples">examples</a>, it was not clear to me how to handle the enormous datasets I was modeling at the time. What I was after was a way to generate indices to be passed in for minibatch training. After much searching, I eventually found what I was looking for in the <a href="https://github.com/HIPS/autograd/blob/master/examples/convnet.py#L198">Convnet Example</a>, which shows how to pass minibatches in.</p>

<p><strong>Note:</strong> you want to be sure that none of your <code class="language-plaintext highlighter-rouge">y_true</code> values are 0, as this can lead to division-by-zero errors during optimization. I suggest doing</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def objective(params, X, y):
    pred = np.dot(X, params)
    non_zero_mask = y &gt; 0
    # mean absolute percentage error over the non-zero labels
    return np.mean(np.abs((y[non_zero_mask] - pred[non_zero_mask]) / y[non_zero_mask]))
</code></pre></div></div>

<p>Another option would be to add weights to the <code class="language-plaintext highlighter-rouge">objective</code> function: if you are extremely unlucky, all the labels, <code class="language-plaintext highlighter-rouge">y</code>, in a batch could be 0, leaving the masked objective empty. Additionally, you may want to weigh different samples more or less heavily.</p>

<p>Unfortunately, although I managed to get it to work, this solution was unbearably slow. Furthermore, for maintainability reasons, it would just be easier if you could use the <code class="language-plaintext highlighter-rouge">sklearn</code> API (not to say that you couldn’t wrap your <code class="language-plaintext highlighter-rouge">autograd</code> training into the <code class="language-plaintext highlighter-rouge">sklearn</code> format).</p>

<p>It was time to head back to the drawing board.</p>

<h1 id="2-solution">2) Solution</h1>

<p>I got lucky, and things lined up perfectly.</p>

<h2 id="21-getting-lucky-with-sklearn">2.1) Getting lucky with sklearn</h2>

<p>While researching ways to use <code class="language-plaintext highlighter-rouge">sklearn</code> packages to solve my issue, I also came across <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html">sklearn.SGDRegressor</a>, but that only allows the following loss functions:</p>

<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">squared_error</code>: OLS</p>
  </li>
  <li>
<p><code class="language-plaintext highlighter-rouge">huber</code>: wherein errors below some \(\epsilon\) use the squared loss, while errors above that \(\epsilon\) are treated as a linear loss.</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">epsilon_insensitive</code>: ignores errors less than \(\epsilon\) and is linear when greater than that</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">squared_epsilon_insensitive</code>: is <code class="language-plaintext highlighter-rouge">epsilon_insensitive</code> but quadratic instead of linear.</p>
  </li>
</ul>

<h2 id="22-getting-lucky-with-the-equations">2.2) Getting lucky with the equations</h2>

<p>Looking at the <a href="https://en.wikipedia.org/wiki/Mean_absolute_percentage_error">Wikipedia page</a> for MAPE, one might notice that it resembles the formula for <a href="https://en.wikipedia.org/wiki/Mean_absolute_error">MAE</a></p>

\[MAPE = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{Y_i - \hat{Y}_i}{Y_i}\right|\]

\[MAE = \frac{1}{n}\sum_{i=1}^{n}|Y_i - \hat{Y}_i|\]

<h3 id="algebraic-manipulation">Algebraic Manipulation</h3>

\[\begin{align*}
MAPE &amp;= \frac{1}{n}\sum_{i=1}^{n}\left|\frac{Y_i - \hat{Y}_i}{Y_i}\right| &amp; \text{In my problem, each } Y_i \text{ is always positive} \\
&amp;= \frac{1}{n}\sum_{i=1}^{n}\frac{1}{Y_i}|Y_i - \hat{Y}_i| &amp; \text{Looks like a weighted MAE}\\
&amp;= \text{MAE after scaling each sample by } \frac{1}{Y_i}\\
\end{align*}\]

<p>so this means that I just need to find an <code class="language-plaintext highlighter-rouge">MAE</code> implementation.</p>

<h2 id="23-lady-luck-is-smiling">2.3) Lady Luck is Smiling</h2>

<p>By pure chance, I found <a href="https://scikit-learn.org/stable/modules/sgd.html#mathematical-formulation">Sklearn-mathematical formulation of SGD losses</a>, and I decided to read it.</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">epsilon_insensitive</code> loss ignores errors less than $\epsilon$ and is linear when greater than that</p>
</blockquote>

<p>was the description for one of the losses. However, it wasn’t apparent to me that they would also take the absolute error. Only after reading the contents of the link above did I realize what it meant:</p>

\[L(Y, \hat{Y}) = \max(0, |Y - \hat{Y}| - \epsilon)\]

<p>This means that if we set \(\epsilon\) to 0, we get the form we want!</p>
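<p>A quick numeric check of that claim (a standalone sketch, not sklearn’s internals): with \(\epsilon = 0\), the epsilon-insensitive loss reduces to the per-sample absolute error:</p>

```python
import numpy as np

def epsilon_insensitive(y, y_hat, eps):
    # L(y, y_hat) = max(0, |y - y_hat| - eps), element-wise
    return np.maximum(0.0, np.abs(y - y_hat) - eps)

y = np.array([5.0, 5000.0, -3.0])
y_hat = np.array([4.0, 5100.0, -1.0])

# With eps = 0, max(0, |e| - 0) is just |e|: the absolute error.
assert np.allclose(epsilon_insensitive(y, y_hat, eps=0.0),
                   np.abs(y - y_hat))
```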

<h2 id="24-for-completeness">2.4) For completeness</h2>

<p>For completeness, I list out the equation as I used it.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Y = ...  # Our labels, shape (n,); always positive in my problem
X = ...  # My forecast data, shape (n, n_features)
denominator = 1 / Y  # we can do this because Y is never 0

# Scaling: divide each sample's features and label by its label
scaled_Y = Y * denominator
scaled_X = X * denominator[:, None]

model = SGDRegressor(loss="epsilon_insensitive", epsilon=0)
model.fit(scaled_X, scaled_Y)
</code></pre></div></div>

<h1 id="closing-words">Closing words</h1>

<p>Although I managed to make <code class="language-plaintext highlighter-rouge">autograd</code> and <code class="language-plaintext highlighter-rouge">sklearn</code> work for my problem, the results were still not good. I suppose the takeaway is that you can do everything “right” and still not have things turn out your way.</p>

<p>In hindsight, this was a simple problem, but it was a good reminder of what it takes to be a good machine learning engineer: good software and math skills. I needed to set up minor infrastructure, massage data via a pipeline, and work through the <code class="language-plaintext highlighter-rouge">autograd</code> package, so being able to code was imperative. In addition, I needed to understand the math to arrive at the solution I did.</p>

<p>Please know that I am not blowing my own horn; in fact, I’m embarrassed about how long I took to find the solution. And even then, I stumbled backward into the solution.</p>

<p>Thank you for taking the time to read this, and happy holidays!</p>]]></content><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><category term="scikit-learn" /><category term="regression" /><category term="machine learning" /><summary type="html"><![CDATA[Spoiler: RTFM]]></summary></entry><entry><title type="html">Fundamentals Part 2: Hessians and Jacobians</title><link href="https://ianq.ai/Hessian-Jacobian/" rel="alternate" type="text/html" title="Fundamentals Part 2: Hessians and Jacobians" /><published>2018-01-25T00:00:00-08:00</published><updated>2018-01-25T00:00:00-08:00</updated><id>https://ianq.ai/Hessian-Jacobian</id><content type="html" xml:base="https://ianq.ai/Hessian-Jacobian/"><![CDATA[<p><strong>Spoiler</strong>: “H” is before “J”, which means that it’s the second-derivative. Obviously</p>

<p>This section builds off the last post, <a href="2018-01-20-Quick-and-dirty-calc-linalg.md">Fundamentals Part 1: An intuitive introduction to Calculus and Linear Algebra</a>; if you’re not familiar with calculus or linear algebra, I highly recommend starting there. If this is your first time seeing all of this, know that this section is more involved than the first fundamentals post. Be prepared to feel a little lost, but if you keep at it, I know you’ll get there (it took me a while to wrap my head around it all).</p>

<p>For each of the topics covered, Jacobian and Hessian, I try to provide 3 levels of information: a high level, a mid-level, and a low level for you to review, depending on your level of interest.</p>

<h1 id="1-a-quick-glossary">1) A quick glossary:</h1>

<p>0) x describes a scalar value, \(\vec{x}\) describes a vector, and <strong>X</strong> describes a matrix.</p>

<p>1) Vector-valued function is a function that returns a vector.</p>

<p>2) Matrix-valued function is a function that returns a matrix.</p>

<p>3) Tensor: a scalar value is a 0-order tensor, a vector is a 1-order tensor, and a matrix is a 2-order tensor. For the purpose of most Machine Learning applications, a tensor is just the n-th order generalization of these objects. We’ll come back to this idea later when considering the not-yet-defined Jacobian and Hessian.</p>

<p>4) \(\mathbb{R}^n\): basically means a point in n-dimensional space. For example, if you drew a Cartesian map, any point you pick has an (x,y) coordinate that describes it. Thus, we can say that the point exists in \(\mathbb{R}^2\). If you restrict the points to taking on “whole numbers” (aka Natural numbers, or counting numbers), you can say that it exists in \(\mathbb{N}^2\).</p>

<hr />

<h1 id="2-partial-derivatives">2) Partial Derivatives</h1>

<p>This section is a little awkward as it’s not covered in Calculus 101; however, discussing it is extremely important before broaching the rest of the blog post. A partial derivative is basically a derivative of a “part” of a multivariable function, i.e., we take the derivative along a single dimension while keeping all others constant.</p>

<p>The Jacobian and the Hessian are just first- and second-order derivatives of multivariate functions, i.e., partial differentiation applied once and twice, respectively.</p>
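<p>A small numeric illustration of a partial derivative (a standalone sketch): for \(f(x, y) = x^2 y\), differentiating with respect to \(x\) holds \(y\) fixed, and a finite difference recovers the analytic answer:</p>

```python
def f(x, y):
    return x ** 2 * y

def partial_x(f, x, y, h=1e-6):
    # central difference: vary x only, holding y constant
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

# Analytically, df/dx = 2xy, so at (x=3, y=2) we expect 12.
assert abs(partial_x(f, 3.0, 2.0) - 12.0) < 1e-5
```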

<hr />

<h1 id="3-jacobian">3) Jacobian</h1>

<p>The Jacobian is, in essence, the first derivative of some tensor. We begin with the following example: given a single point of data about you (age, height, favorite food), we want to find out how likely it is that you’re in certain clubs (reading, sleeping). We’ll reference this problem while discussing the Hessian as well.</p>

<h2 id="31-high-level">3.1) High-level</h2>

<p>The Jacobian describes how changing each of the input dimensions affects each of the output dimensions. Looking at our example, if we change our age, height, or favorite food, we can observe the (locally) linear effect on our club memberships.</p>

<p>Given our 3 input dimensions and our 2 output dimensions, we’d have 6 pairs to look at (3 possible things to manipulate for each of those 2 outputs). This intuition will come in handy if you read on.</p>

<h2 id="32-middle-level">3.2) Middle-level</h2>

<p>Consider our example from earlier:</p>

<p>1) Your input data, <strong>X</strong> \(\in \mathbb{R}^{1 \times 3}\).</p>

<p>2) You have some weight matrix, <strong>W</strong> \(\in \mathbb{R}^{3 \times 2}\)</p>

<p>3) Your output, \(\vec{y} \in \mathbb{R}^2\)</p>

<p>4) If we were to put some classifier algorithm, defining it by some function, \(f\), it would look like this:</p>

\[\vec{y} = f(\vec{x}) = W^{\top}\vec{x}\]

<p>5) If we calculated the Jacobian of this, it would look along the lines of</p>

\[\textbf{J}(f) =
\begin{pmatrix}
        \frac{\partial y_1}{\partial x_1}   &amp; \frac{\partial y_1}{\partial x_2} &amp; \frac{\partial y_1}{\partial x_3}\\
        \frac{\partial y_2}{\partial x_1}   &amp; \frac{\partial y_2}{\partial x_2} &amp; \frac{\partial y_2}{\partial x_3}\\
\end{pmatrix}\]

<p>where we’re iterating through each dimension of <strong>X</strong> (3 of them) and \(\vec{y}\) (2 of them). This can be expressed more compactly as:</p>

\[\textbf{J}(f) =
\begin{pmatrix}
        \frac{\partial \vec{y}}{\partial x_1}   &amp; \frac{\partial \vec{y}}{\partial x_2} &amp; \frac{\partial \vec{y}}{\partial x_3}\\
\end{pmatrix}\]

<p>where each column \(\frac{\partial \vec{y}}{\partial x_i}\) collects the partial derivatives of the output vector with respect to one input dimension.</p>
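<p>We can check this numerically. For a linear map the Jacobian is just the matrix of the map itself, everywhere; below, <code class="language-plaintext highlighter-rouge">A</code> is a made-up 2×3 matrix standing in for the weights that take 3 input features to 2 club scores:</p>

```python
import numpy as np

A = np.array([[1.0, -2.0, 0.5],
              [0.0,  3.0, 1.0]])  # hypothetical weights, (2, 3)

def f(x):
    return A @ x

def jacobian(f, x, h=1e-6):
    """Finite-difference Jacobian: perturb each input dimension."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x0 = np.array([0.3, -1.2, 2.0])
# For a linear map, the Jacobian equals A at every point.
assert np.allclose(jacobian(f, x0), A, atol=1e-5)
```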

<h2 id="33-low-level">3.3) Low-level</h2>

<p>If we think of our inputs as a point lying on some n-dimensional plane, we can think of our weights as some linear transformation, \(f\), that takes us from our current point, \(\vec{x} \rightarrow \vec{y}\). What the Jacobian then gives us is the best \(\textbf{local linear approximation}\) of how the points are warped in that small area.</p>

<p>Remember, our n-dimensional vector of features, \(\in \mathbb{R}^{1 \times 3}\), is just some point in n-dimensional space. If we take the ‘rate of change’ of the map that transforms it (<strong>W</strong>, in our concrete case) into an m-dimensional space, we get how much that map stretches and warps the small region around the point.</p>

<h2 id="34-where-might-you-see-it">3.4) Where might you see it?</h2>

<p>1) Loss function:</p>

<p>By imposing some restrictions on the neighborhood around a point, we can do some interesting work on making the values invariant (or nearly invariant) to small changes in that area. If the explanation sounds a little hand-wavy and you’d like a concrete example, check out <a href="https://www.youtube.com/watch?v=79sYlJ8Cvlc">Hugo Larochelle’s Contractive Autoencoder video</a>.</p>

<p>2) Discussions about local linearity in non-linear settings:</p>

<p>Neural Networks are known to be non-convex, but analyzing them from a linear standpoint can still be useful. I’d suggest watching the video above for an example of how it can be beneficial to analyze in this way.</p>

<p><strong>Note</strong>: I’m hoping to talk about convexity down the line as it is a fascinating topic.</p>

<hr />

<h1 id="4-hessian">4) Hessian</h1>

<h2 id="41-high-level">4.1) High-level</h2>

<p>The Hessian is essentially the derivative of the Jacobian.</p>

<p>In calculus 1, you might have learned that the derivative describes the rate of change, while the second derivative describes curvature and helps classify maxima/minima. The Hessian is the equivalent of that concept, applied to N-dimensional tensors in an abstract sense.</p>

<h2 id="42-middle-level">4.2) Middle-level</h2>

<p>Recall our Jacobian function from earlier:</p>

\[\textbf{J} (f) = \nabla f =
\begin{pmatrix}
        \frac{\partial \vec{y}}{\partial x_1}   &amp; \frac{\partial \vec{y}}{\partial x_2} &amp; \frac{\partial \vec{y}}{\partial x_3}\\
\end{pmatrix}\]

<p>If we then take the Jacobian of THAT, we end up with the following:</p>

\[\textbf{J}(\textbf{J} (f)) = \nabla (\nabla f) =
\begin{pmatrix}
        \frac{\partial^2 \vec{y}}{\partial x_1^2}   &amp; \frac{\partial^2 \vec{y}}{\partial x_2 \partial  x_1 } &amp; \frac{\partial^2 \vec{y}}{\partial x_3 \partial  x_1}\\
        \frac{\partial^2 \vec{y}}{\partial x_1 \partial  x_2 }   &amp; \frac{\partial^2 \vec{y}}{\partial x_2^2} &amp; \frac{\partial^2 \vec{y}}{\partial x_3 \partial  x_2 }\\
        \frac{\partial^2 \vec{y}}{\partial x_1 \partial  x_3}   &amp; \frac{\partial^2 \vec{y}}{\partial x_2 \partial  x_3 } &amp; \frac{\partial^2 \vec{y}}{\partial x_3^2}\\
\end{pmatrix}\]

<p>An interesting tidbit that the eagle-eyed among you may have noticed is that we went up in dimensions, from a compact vector representation to a compact matrix representation. Intuitively this makes sense, as we are now varying each variable with respect to every variable (hence denominators like \(\partial x_1 \partial x_2\)).</p>
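<p>For a scalar-valued function the Hessian stays an ordinary matrix, and we can compute it by finite differences (a standalone sketch with a made-up function); note that the mixed partials agree, so the matrix is symmetric:</p>

```python
import numpy as np

def g(x):
    # scalar-valued function of two variables: g = x0^2 * x1 + x1^3
    return x[0] ** 2 * x[1] + x[1] ** 3

def hessian(g, x, h=1e-4):
    """Finite-difference Hessian of a scalar function."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = h, h
            H[i, j] = (g(x + e_i + e_j) - g(x + e_i - e_j)
                       - g(x - e_i + e_j) + g(x - e_i - e_j)) / (4 * h * h)
    return H

x0 = np.array([1.0, 2.0])
H = hessian(g, x0)
# Analytic Hessian at (1, 2): [[2*x1, 2*x0], [2*x0, 6*x1]] = [[4, 2], [2, 12]]
assert np.allclose(H, [[4.0, 2.0], [2.0, 12.0]], atol=1e-3)
assert np.allclose(H, H.T)  # Schwarz's theorem: mixed partials agree
```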

<h3 id="43-low-level">4.3) Low-level</h3>

<p>Bear with me for a bit. If we were to expand out our \(\vec{y}\) into its components (\(y_1, y_2\)), we’d need another axis to put them on. So, our Hessian from above would need to “expand” into another dimension to store them. Still with me? I hope so because if you are, you’ll understand why:</p>

<p>1) I’m not going to actually list out the ‘tensor.’</p>

<p>2) I’ll call the ‘expanded’ version a 3-order tensor</p>

<p>3) When we differentiate a vector with regards to a vector, we increase dimensionality. See <a href="https://tminka.github.io/papers/matrix/minka-matrix.pdf">Old and New Matrix Algebra Useful for Statistics</a> for a summary of the different forms of differentiation. Also, <a href="https://en.wikipedia.org/wiki/Matrix_calculus#Layout_conventions">Wikipedia: Matrix Calculus Layout Conventions</a> has some interesting notes.</p>

<h2 id="44-where-might-you-see-it">4.4) Where might you see it?</h2>

<h3 id="441-convexity-of-the-loss-function">4.4.1) Convexity of the loss function:</h3>

<p><strong>Positive Semi-definite</strong>: if A is your matrix, then for any non-zero \(\vec{x}\), \(\vec{x}^T A \vec{x} \geq 0\)</p>

<p>If we calculate the Hessian of a loss function such as least squares (I’d suggest going online and working through one of the proofs), we see that it is positive semidefinite, which means the loss is convex. This brings the property of a guaranteed global minimum; actually reaching it with a method like gradient descent is another matter.</p>

<p>One example of a convex loss function is the logistic loss: because its surface is bowl-shaped (convex), any minimum we find is guaranteed to be the global minimum.</p>
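<p>As a quick empirical check (toy data of my own, not a proof): the Hessian of the logistic loss has the form \(X^T S X\) with \(S\) diagonal and non-negative, so its eigenvalues are non-negative no matter which weights we evaluate it at:</p>

```python
import numpy as np

# Toy data of my own; any X and any w will do, which is the point:
# H = X^T S X with S diagonal and non-negative, so H is PSD everywhere.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))      # 20 samples, 3 features
w = rng.normal(size=3)            # an arbitrary weight vector

p = 1.0 / (1.0 + np.exp(-X @ w))  # sigmoid of the scores
S = np.diag(p * (1.0 - p))        # entries in (0, 1/4]; the labels drop out
H = X.T @ S @ X                   # Hessian of the logistic loss at w

print(np.linalg.eigvalsh(H) >= -1e-10)  # all True: PSD, hence convex
```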

<h3 id="442-an-evaluation-metric">4.4.2) An evaluation metric</h3>

<p>The <strong>observed information matrix</strong> is the negative Hessian of the log-likelihood function, typically evaluated at the estimated parameters.</p>

<p>I’m not going deep into the details, but if we have some estimated parameters \(\theta\) (also called the weights in ML), one way of evaluating how well \(\theta\) fits our data is by first taking the <strong>log-likelihood</strong>:</p>

\[\mathcal{L} (X_1, X_2, ..., X_n \mid \theta) = \sum_{i=1}^{n} \log f(X_{i} \mid \theta)\]

<p>Taking the negative Hessian of our log-likelihood tells us how sharply the likelihood surface curves as we perturb the different parameters: sharper curvature means the data pin those parameters down more precisely.</p>
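<p>A worked example (my own, with made-up numbers): for \(n\) Gaussian samples with known \(\sigma\), the observed information about the mean works out to \(n / \sigma^2\), and a finite-difference check agrees — more data or less noise means sharper curvature:</p>

```python
import numpy as np

# n Gaussian samples with known sigma; the observed information about the
# mean is n / sigma^2 (sigma and n below are arbitrary choices of mine).
rng = np.random.default_rng(1)
sigma, n = 2.0, 50
x = rng.normal(loc=3.0, scale=sigma, size=n)

def log_lik(mu):
    return (-0.5 * n * np.log(2 * np.pi * sigma**2)
            - np.sum((x - mu)**2) / (2 * sigma**2))

# Negative second derivative of the log-likelihood, by central differences
mu, h = x.mean(), 1e-4
obs_info = -(log_lik(mu + h) - 2 * log_lik(mu) + log_lik(mu - h)) / h**2

print(obs_info, n / sigma**2)  # both ~12.5
```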

<h3 id="443-optimization">4.4.3) Optimization</h3>

<p>If we know the curvature of the surface, it can guide our gradient descent. Some papers talk about using the diagonals of the Hessian to estimate the optimal learning rate, as mentioned in <a href="https://arxiv.org/pdf/1703.00788.pdf">A Robust Adaptive Stochastic Gradient Method for Deep Learning</a>.</p>

<p><strong>Intuition</strong></p>

<p>Admittedly, I’ve not read the paper above in depth, and I’ve not read the papers it references at all. Still, I’d wager that using the diagonals of the Hessian lets them weigh the importance of the different parameters during gradient descent. I say this because not all parameters are equally informative, so it doesn’t make sense to treat them equally (especially since your error is typically just a scalar value that you propagate backward). I may be completely wrong, but this example stresses two things:</p>

<p>i) Read the paper.</p>

<p>ii) Intuition is only helpful so long as it’s right, so it falls to you to verify that you’re correct.</p>
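<p>To make the curvature intuition tangible (this is a generic illustration, <em>not</em> the algorithm from the paper above): on a separable quadratic, dividing each gradient entry by the corresponding Hessian diagonal reaches the minimum in a single step, whereas one global learning rate has to be small enough for the steepest direction:</p>

```python
import numpy as np

# Minimize 0.5 * (100*x0^2 + 1*x1^2): curvature differs 100x between axes.
diag_H = np.array([100.0, 1.0])   # the Hessian's diagonal (it is diagonal here)

def grad(x):
    return diag_H * x             # gradient of the quadratic above

x = np.array([1.0, 1.0])
x = x - grad(x) / diag_H          # Newton-like step, scaled per parameter
print(x)                          # [0. 0.]: the minimum, reached in one step
```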

<hr />

<h1 id="5-further-readings--references">5) Further Readings / References</h1>

<p>1) <a href="https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf">Matrix Cookbook - Page 8-16</a> - I’m personally not a fan of recommending this off the bat, as I think a collection of facts isn’t useful except as a reference.</p>

<p>2) <a href="http://www.cs.cmu.edu/~zkolter/course/15-884/linalg-review.pdf">Zico Kolter’s Linear Algebra Review and Reference</a> - great professor at CMU, and I found this guide to be very useful.</p>

<p>3) <a href="https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw">3Blue1Brown’s channel</a></p>

<p>4) <a href="https://tminka.github.io/papers/matrix/minka-matrix.pdf">Old and New Matrix Algebra Useful for Statistics</a></p>

<hr />]]></content><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><category term="fundamentals" /><category term="calculus" /><category term="optimization" /><category term="machine learning" /><summary type="html"><![CDATA[Spoiler: “H” is before “J”, which means that it’s the second-derivative. Obviously]]></summary></entry><entry><title type="html">Fundamentals Part 1: An intuitive introduction to Calculus and Linear Algebra</title><link href="https://ianq.ai/Quick-and-dirty-calc-linalg/" rel="alternate" type="text/html" title="Fundamentals Part 1: An intuitive introduction to Calculus and Linear Algebra" /><published>2018-01-20T00:00:00-08:00</published><updated>2018-01-20T00:00:00-08:00</updated><id>https://ianq.ai/Quick-and-dirty-calc-linalg</id><content type="html" xml:base="https://ianq.ai/Quick-and-dirty-calc-linalg/"><![CDATA[<p><strong>Spoiler</strong>: The pre-calc of ML</p>

<p>As you’ve probably heard, calculus is imperative for Machine Learning. However, there is a definite emphasis on differentiation compared to integration, so this series of posts will build from simple derivatives to Jacobians and Hessians. Ideally, at the end of this series, if you read a paper that mentions one of the topics above, you’ll have a rough idea of why the authors chose to do what they did and what their choice means for the results.</p>

<h1 id="background">Background</h1>

<p>If you’ve already taken Calculus or Linear Algebra, feel free to skip ahead to the <a href="2018-01-25-Hessian-Jacobian.md">next tutorial, Hessians and Jacobians</a></p>

<h1 id="1-derivatives-101">1) Derivatives 101</h1>

<p>The equation below describes both the equation of a straight line as well as what happens if you take the derivative of that straight line with respect to some input value:</p>

\[\begin{align*}
y &amp;= mx + c\\
\frac{d y}{dx} &amp;= m
\end{align*}\]

<p>Typically in a calculus class, we’d talk about the rate of change of \(y\) with regards to \(x\). In other words, how much does \(y\) change as \(x\) changes? In this case, we see that \(y\) changes by a factor of m for every unit that \(x\) changes. For the moment, we are focused on scalar values, but this concept will generalize to vectors and matrices (which segues us into….)</p>
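<p>A quick numerical sanity check (my own snippet): the slope of \(y = mx + c\) is \(m\) at every \(x\), no matter what \(c\) is:</p>

```python
# Numeric check that dy/dx = m for a straight line, at several x values
m, c = 3.0, 7.0

def y(x):
    return m * x + c

h = 1e-6
slopes = [(y(x + h) - y(x)) / h for x in (0.0, 1.0, -5.0)]
print(slopes)  # each entry is (approximately) 3.0, independent of x and c
```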

<hr />

<h1 id="2-linear-algebra-101">2) Linear Algebra 101</h1>

<p>Math often deals with the concept of abstraction. For example, we often deal with numbers, e.g., 5 or 100. In Linear Algebra, we are concerned with collections of numbers (vectors), e.g., a collection of (5, 10), or a collection of those collections (matrices), and further abstractions. To make this notion concrete, consider the following example:</p>

<h2 id="21-scalars">2.1) Scalars</h2>
<p>Edit: I have no idea if the following examples describe actual streets and avenues, so I’d like to apologize beforehand.</p>

<p>Say that we were somewhere in New York City, which works on a <a href="https://thegreatestgrid.mcny.org/">grid system</a>. If I were on 4th and 5th, while you were on 10th and 7th, our (x, y) coordinates could be described as (4, 5) and (10, 7), respectively. Equivalently, our coordinates could be described as the following:</p>

\[\text{My location:=}
\begin{pmatrix}
4\\
5\\
\end{pmatrix}\]

<p>and</p>

\[\text{Your location:=}
\begin{pmatrix}
10\\
7\\
\end{pmatrix}\]

<p>We decide to meet for coffee, but since neither of us drives, we agree to meet in the middle as that is easiest. So, we would meet at:</p>

\[x := \frac{4 + 10}{2} = 7\]

\[y := \frac{5 + 7}{2} = 6\]

<p>which corresponds to 7th and 6th (7, 6).</p>

<p>We saw in the computation above that it can be tedious to write out a separate equation for each of our (x, y) coordinates. This complexity only grows as we add more information (e.g., which shop to meet at). What if we had a compact way of representing my location, your location, and the averaging operation that determines where we should meet? Here, I want you to keep two concepts in the back of your mind:</p>

<p>1) The concept of abstraction on scalars.</p>

<p>2) The concept of a coordinate system and what it means for something to be in the coordinate system.</p>

<h2 id="22-abstractions-on-scalars-vectors">2.2) Abstractions on Scalars: Vectors</h2>

<p>At the start of this Linear Algebra review, I said that Linear Algebra is concerned with numbers or collections. So far, we have already discussed one such collection: a coordinate system. In that case, my location is described as the collection of (4, 5), and yours is represented by (10, 7). The top element (4 and 10) represents the street, and the bottom represents the avenue.</p>

<p>Congratulations! We’ve just worked through the concept of a vector, albeit in a particular setting: New York streets and avenues. Let’s take a step back and see our locations for what they are: specific instances of an abstract concept. We could just as well write:</p>

\[X_1:= 
\begin{pmatrix}
a\\
b\\
\end{pmatrix}\]

\[X_2 := 
\begin{pmatrix}
c\\
d\\
\end{pmatrix}\]

<p>where \(X_1\) <strong>CAN</strong> represent my street-avenue, but it could just as well describe my latitude-longitude or my age-height. Whatever the case, if we are then looking for the average of these two containers, \(X_1\) and \(X_2\), we can represent them as the following:</p>

\[\text{the middle := } \frac{X_1 + X_2}{2}\]

<p>This equation holds for both the street number and the avenue (our x and y coordinates).</p>

<p>Note that we can add more information, e.g., a <code class="language-plaintext highlighter-rouge">Z</code> coordinate representing the shop number to meet at, or the corner I’m on, without changing anything: our “middle” is still represented by the same general equation above.</p>
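<p>Here is the coffee-meeting computation written compactly (the third coordinate for “you” is made up for illustration); the midpoint formula is identical whether the vectors hold two numbers or three:</p>

```python
import numpy as np

# The same midpoint formula works for any number of coordinates
me = np.array([4.0, 5.0])
you = np.array([10.0, 7.0])
print((me + you) / 2)            # [7. 6.] -> 7th and 6th, as computed by hand

# Adding a Z coordinate (which shop) changes nothing about the formula
me3 = np.array([4.0, 5.0, 6.0])
you3 = np.array([10.0, 7.0, 2.0])
print((me3 + you3) / 2)          # [7. 6. 4.]
```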

<h2 id="23-abstractions-on-scalars-and-vectors-matrices">2.3) Abstractions on Scalars and Vectors: Matrices</h2>

<p>We can then expand on our scalars and vectors to a collection of collections. Say we had two other friends; then all of our locations could be described as</p>

\[\text{Us := }
\begin{pmatrix}
4 &amp; 6 &amp; 10 &amp; 12\\
5 &amp; 7 &amp; 7 &amp; 15\\
\end{pmatrix}\]

<p>which would be a matrix. Phew, that was a mouthful.</p>

<h2 id="24-the-abstracted-coordinate-system">2.4) The abstracted coordinate system</h2>

<p>When we first introduced the idea of vectors, we discussed it in the sense of streets and avenues on New York’s grid system. In that case, our locations would be described by whole numbers (we can’t be at avenue 10.5).</p>

\[\text{My location: }
\begin{pmatrix}
4\\
5\\
\end{pmatrix}\]

<p>However, if we consider latitude and longitude, it makes sense that we can describe those numbers as numbers with some decimal point. For example, this random location I picked in New York has a latitude-longitude of (40.712776, -74.005974).</p>

<h3 id="241-counting-numbers">2.4.1) Counting Numbers</h3>

<p>The first example, street-avenue, pertains to the natural numbers: we say that the street and the avenue, the individual elements of our collection, exist \(\in \mathbb{N}\), the set of natural numbers (also known as the counting numbers).</p>

<h3 id="242-decimal-point-numbers">2.4.2) Decimal point numbers</h3>

<p>In the case of latitude-longitude, the individual elements of our collection exist \(\in \mathbb{R}\), the real numbers (numbers that may have a decimal part). We denote these scalar values as elements of the sets \(\mathbb{N}\) and \(\mathbb{R}\), respectively.</p>

<h3 id="243-collections-of-scalars-vectors">2.4.3) Collections of Scalars: Vectors</h3>

<p>If we talked about the collection, as opposed to elements within the collection, my street-avenue would then be:</p>

\[\text{My location: }
\begin{pmatrix}
4\\
5\\
\end{pmatrix}\]

<p>such that my location can be described as being in the naturals, \(\in \mathbb{N}^2\), a vector of natural numbers. My latitude-longitude can be described as \(\in \mathbb{R}^2\), a vector of real numbers. If we then added another number, e.g., the shop that I’m in, we would have</p>

\[\text{My location: }
\begin{pmatrix}
4\\
5\\
6 \\
\end{pmatrix}\]

<p>and my location can thus be represented as \(\text{my location } \in \mathbb{N}^3\). This same concept extends to matrices. Consider our group of friends from earlier:</p>

\[\text{Us: }
\begin{pmatrix}
4 &amp; 6 &amp; 10 &amp; 12\\
5 &amp; 7 &amp; 7 &amp; 15\\
\end{pmatrix}\]

<p>Our locations can then be described as \(\text{Us} \in \mathbb{N}^{2 \times 4}\). And that’s it for the linear algebra you’ll need for the rest of this series!</p>
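<p>If you want to poke at these sets programmatically, array shapes give a loose analogue (a sketch of my own): a vector in \(\mathbb{N}^2\) has shape (2,), and our matrix of friends in \(\mathbb{N}^{2 \times 4}\) has shape (2, 4):</p>

```python
import numpy as np

my_location = np.array([4, 5])   # a vector "in N^2": shape (2,)
us = np.array([[4, 6, 10, 12],
               [5, 7, 7, 15]])   # a matrix "in N^{2x4}": shape (2, 4)

print(my_location.shape)  # (2,)
print(us.shape)           # (2, 4)
print(us[:, 0])           # the first column is my location: [4 5]
```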
<h1 id="3-further-readings--references">3) Further Readings / References</h1>

<p>1) <a href="http://www.cs.cmu.edu/~zkolter/course/15-884/linalg-review.pdf">Zico Kolter’s Linear Algebra Review and Reference</a> - great professor at CMU, and I found this guide to be handy.</p>

<p>2) <a href="https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw">3Blue1Brown’s channel</a></p>]]></content><author><name>Ian Quah</name><email>ian@ianq.ai</email></author><category term="fundamentals" /><category term="calculus" /><category term="optimization" /><category term="machine learning" /><category term="math" /><summary type="html"><![CDATA[Spoiler: The pre-calc of ML]]></summary></entry><entry><title type="html">Pusheen The Limit</title><link href="https://ianq.ai/Pusheen-The-Limit/" rel="alternate" type="text/html" title="Pusheen The Limit" /><published>2018-01-19T00:00:00-08:00</published><updated>2018-01-19T00:00:00-08:00</updated><id>https://ianq.ai/Pusheen-The-Limit</id><content type="html" xml:base="https://ianq.ai/Pusheen-The-Limit/"><![CDATA[<p>Note: The code can be found <a href="https://github.com/IanQS/quitPusheenMeAround">here: quitPusheenMeAround</a></p>

<p>I <strong>love</strong> Pusheen, and I’m also a fan of playing around in my terminal. After talking to someone the other day, I was inspired to work on this; she mentioned how an officemate commented on the Pusheen that popped up whenever she opened her shell.</p>

<p>I didn’t use any statistics beyond the standard deviation, and only for a small portion of the image segmentation (cat vs. background). Having said that, I think this was a fun exercise to occupy my time.</p>

<h2 id="initial-problem"><strong>Initial Problem</strong></h2>

<p>A quick Google search revealed about 3 Pusheen ASCII art images online, which is disappointing given how many Pusheen images and GIFs there are. After a long week at work and some climbing earlier today, I’m ready to spend this Friday night in. So, it looks like I’m making a Pusheen ASCII art converter and some shell scripts. Also, Pusheen sounds like pushin’, which opens up several cute GitHub project names.</p>

<hr />

<h2 id="process"><strong>Process</strong></h2>

<p>1) Create a folder wherein we will store many Pusheen images.</p>

<p>2) Load, resize, and convert those images to ASCII art.</p>

<p>3) Make some shell scripts.</p>

<p>4) Push the code.</p>
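<p>Step 2 is the interesting one. A minimal sketch of the idea (this is <em>not</em> the code in the repo; the character ramp and the striding choices are mine) maps pixel brightness onto a dark-to-light string:</p>

```python
import numpy as np

CHARS = "@%#*+=-:. "  # dark-to-light "gradient"; the repo's ramp may differ

def to_ascii(img, width=40):
    """img: 2-D float array with values in [0, 255]. Downsample by striding
    (rows twice as sparsely, since terminal characters are tall), then map
    each pixel to a character by its brightness."""
    step = max(1, img.shape[1] // width)
    small = img[::2 * step, ::step]
    idx = (small / 255 * (len(CHARS) - 1)).astype(int)
    return "\n".join("".join(CHARS[i] for i in row) for row in idx)

# A synthetic left-to-right gradient stands in for a real Pusheen image
img = np.tile(np.linspace(0, 255, 80), (40, 1))
print(to_ascii(img))  # rows fade from '@' on the left toward ' ' on the right
```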

<h2 id="immediate-problems"><strong>Immediate problems</strong></h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                                  }}
               }|))|)           ))   |
              )      )         )  xX   }
             | uhMMoQ )}     }| Q#WWWk  |}}))||||||||||||||)}}
            / O&amp;8oaW%h         d%Whbo%Mc                      )|)}
           / w%Wdpdpo%MY|/)/jxo%*pdbpb8&amp;0XQZwdbkhaaaaaaakbpZCj    |)
    }}}}} ) m%Mpbbbbpk8&amp;WWWWW&amp;8apbbbbddW8W&amp;WWWM8888888W#88888W#hZ/  |)
         } J8Wpbbbbbbpa8#o8*o8opbbbbbbddkbdpqqwhMWWWW#dp#WMWW&amp;8W&amp;&amp;oQ  |
   vCLCUzrtW8bdbbbddbbbabbhbbakbbbdbbbbbddbhao**M#ooabbbk#WWWMapdaW8#Q  )        }||||}
 ) b&amp;WMMMMW%adbbbdbbdbdpppahppdbbdbddbbbd#&amp;WWMM##hpddbbbddkhkbdbbdpkW%a/ )      )      }
     rUOmd%Wpbbbd#88*ddo&amp;o%&amp;*&amp;hddM88adbbdoM****oohbbbbbbbbddddbbbbbdpa%WJ )    | vdoadv }
 ) wWWWWM&amp;%hdbbbd*88*dbkMB&amp;8B#bbdM8&amp;adbbdoWMMMMMW#dbbbbbbbbbbbbbbbbbbdd&amp;8C ) }/ m8%88%&amp;U})
 } CqLc)r&amp;Wpbbbbbdbbdbbdd*WMopbbbdbddbbbbdpppppppdbbbbbbbbbbbbbbbbbbbbdd8&amp;u|/  p%W#WWWBZ |
        qBadbbbbbbbbbbbbbpdddbbbbbdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbdkBa  ra%Wddko%Mt)}
   }))t #%bdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbpM%wh&amp;%&amp;WWok&amp;&amp;U )
     } x8Wpbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbda888ob*WW&amp;%WY }
     | Z%*pbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbdbMWWhqdM%8h} |
     / bBhdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbd#W&amp;&amp;W&amp;WhY })
     / h%kdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbp#%MobO)  |
     / o%kdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbpM8r    |}
     / o%kdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbd&amp;&amp;t/|)
     / k%hdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb8# |
     ) Z%*pbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbdaBb /
     }}j&amp;&amp;dbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbpM%Y )
      / kBhdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbdk%o |
      })rW8ddbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbp&amp;8z}}
       ) C8&amp;ddbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbpM%Z )
        ) Q&amp;8hpdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbpdW%w |
         | u*%Whppdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbdpd*8WQ |
          )  0*8&amp;obbdpppppppppdbdpppppppppppppppppdbdppppppppppdbb*&amp;&amp;b  |
           }|  YkB#waWWWWWWWWMopk#MMMMMMMMWWWWWWMMophMMWWWWWWWW*qaBa|  )
             )/  &amp;8*%Wwqqqppp&amp;%kMBhdbdddddddddddd88bW%bdppqqqwMBoMBO j
               ))Jh*ku       LMWWk               mMWWp        ch##p|)
                      |)|||)|  j  )||||||||||||||  x  /|||||)|
                 }|||}       })  |               )}  )        }|)||
                               }                   }
</code></pre></div></div>

<p>looks FAR better than</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$$$$$$$$$$$$$$$$$$$$$$$$$$$$$@$$$@@M*#oa@@@@@@@$@@$$$@@$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$@$BB@@$$@@$W*#q**o@$$$$$$$$@8MMW%$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$@@8b*#&amp;B@$$WaWppw##W8B&amp;B@@@B&amp;M*aMa8$@$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$@B&amp;###*%%aMbqppq#&amp;*MMWM&amp;M&amp;#hpmh#*@@@@@$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$@%8BBBW##MWkqdppqk*d*MkW#aapqpppM#B$$$$$@$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$@@@Wo**#W8hMMdqpwqpppqpdpqbpqqppppwooM@%%B$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$@$B8%&amp;W#*W*qppb#opppqqpppqppqpppqpk&amp;*W#o*#%@$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$@a*#qppqa%&amp;pqwh*qppppbkppqh#**&amp;oWW&amp;W%$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$@@@@o*#wppppqpqpo#MWppppp&amp;Boqpdpqm*oW$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$@@@**#qppqwqbao*aqc*Mqppphodpppo###&amp;#W8@$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$@$$WaMwqqpko**kkbZuud#wpppqwpppppdddoW*M#8$@$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$@Wo&amp;pdao*od0uh#aWhjZMppppppppppppppwMoW%@$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$@@@*o#oM#hbZufUX#MoMpcCMdqppppppppppppqk#a@$@$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$@$$M*&amp;hoMd*o*#wXUYwqCzCu*opqqpppppppppppw#oW$@$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$B8#*dddYh&amp;hkWonJUrOwUXz#M*#aqppppppppppqdMh%@@$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$@@&amp;##&amp;kQUnvuqaokLUJXo#oMmvMoqkopppppppppppko&amp;#%$@$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$@$@WMwqa***akwJ/ uXUUM#a&amp;q/d8#bwppppppppppdMW&amp;M#@$@$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$@@@##hqZOOwdko#ohdZLcXpbQzcpMqdppppppppppppo#MWoB$@$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$@$B&amp;##W*okqwZZpba**oabqZ0zbMwpppppppppppppqpq#MB$@$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$W#oph*###*hbwOZqdka**oMWppppppppppppppppmo##@@$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$@B8hwqwqppka*##obpqZ0Om0M*wpppppppppppppk*MWM@$$$$@@$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$&amp;Whqpppppqqwqba**##*oo*#bqppppppppppppp#WW&amp;8@$@@$$$@$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$&amp;Wkqpppppppppqqwqqdkhkkqqppppppppppppppk**Wo%$$$@BB$@$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$8Wkqpppppppppppppppqqqqppppppppppppppppqqq#*%$@&amp;WW#&amp;$@$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$8WkqppppppppppppppppppppppppppppppppppppppM*%$Mh&amp;W&amp;*%@@$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$&amp;WaqppppppppppppppppppppppppppppppppppppqbWW@@#Wka&amp;*%$@$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$#**wppppppppppppppppppppppppppppppppppppwo##@#MWdooo$@$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$@$%aMqpppppppppppppppppppppppppppppppppppppMM&amp;##M#&amp;Wh%@@$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$@@**omppppppppppppppppppppppppppppppppppqo&amp;MMW#maWb%$@$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$@$@oMawppppppppppppppppppppppppppppppppwkWWah&amp;M*#o8$@$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$@$B*#odqqpppppppppppppppppppppppppppqqa#MMoo##&amp;8@$@$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$@$@WW#okqpwqqqqqqqqqqqqqqqqqqqqqpqph*#MW&amp;8&amp;&amp;%@$$@$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$@$$B8MM*p#aaaaaaaaaaaahhhahhhokp**#8%$$$$$$$$@@$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$@@$$@####o**##M*M###MW&amp;WM&amp;&amp;W*M**o%@$@@$$@@@$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$@@$%W*MB@@@@@@@@@@@@@@@@@@M*##B$@@$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$@$$@$$$$$$$$$$$$$$$$$$$$$@@$$@$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$@@$@@@@@@@@@@@@@@$@@$$@@$$@@$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$@$$$$$$$$$$$$$$$$$$$$$@$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
</code></pre></div></div>

<p>this.</p>

<h2 id="solution"><strong>Solution</strong></h2>

<p>We can apply some heuristics to clear out the background. One heuristic is that Pusheen is typically at the center of the image, which means we can probably use the corner pixels as a threshold for removing the background.</p>
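<p>A sketch of that heuristic (the exact statistics and the white fill value are my assumptions, not necessarily what the repo does): sample the four corner pixels, and treat anything that looks like them as background:</p>

```python
import numpy as np

def remove_background(img, k=2.0):
    """img: 2-D grayscale array. Returns a copy where pixels within k standard
    deviations of the mean corner value are treated as background (set to
    white here; the fill value is a choice, not the repo's)."""
    corners = np.array([img[0, 0], img[0, -1], img[-1, 0], img[-1, -1]],
                       dtype=float)
    mu, sd = corners.mean(), corners.std() + 1e-9
    mask = np.abs(img - mu) <= k * sd   # "looks like a corner" = background
    out = img.copy()
    out[mask] = 255.0
    return out

# Toy image: a dark "cat" on a mid-gray background
img = np.full((6, 6), 128.0)
img[2:4, 2:4] = 10.0
cleaned = remove_background(img)
print(cleaned[0, 0], cleaned[2, 2])     # 255.0 10.0: background gone, cat kept
```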

<hr />

<h2 id="more-problems--solutions"><strong>More Problems + Solutions</strong></h2>

<p>In the real world, you’d probably want to zero out everything that isn’t Pusheen, but since this image will be piped to the terminal, a non-empty background actually helps Pusheen contrast with the characters around the image.</p>

<p>1) We need to add a background (and after we went through all that trouble to get rid of it….)</p>

<p>We are using <code class="language-plaintext highlighter-rouge">img.max()</code> to scale our image, so one hacky solution is to use the max value and scale it by some percentage.</p>

<p>2) Because we chose to scale the image before changing the background, our chosen parameters are all wonky. We can simply swap the order of operations: change the background first, then scale the image.</p>

<p>3) However, we now have to contend with scenarios where the background is black or white. We simplify the problem by checking if the image is below some “sensible” threshold, and if it is, we set it to some percentage of the max.</p>
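<p>Putting fixes 1–3 together as a sketch (the fraction and the “sensible” floor below are my guesses at reasonable values, not the repo’s): derive the background fill from the image’s own maximum, guard against mostly-dark images, apply it, and only then downsample:</p>

```python
import numpy as np

def fill_value(img, frac=0.9, floor=50.0):
    v = img.max() * frac              # "some percentage of the max"
    return floor if v < floor else v  # guard for nearly-black images

def prepare(img, mask, width=40):
    out = img.astype(float)
    out[mask] = fill_value(out)             # 1) set background BEFORE scaling
    step = max(1, out.shape[1] // width)    # 2) then downsample
    return out[::step, ::step]

img = np.full((8, 8), 200.0)
img[3:5, 3:5] = 10.0                  # the "cat"
mask = img == 200.0                   # background, by construction here
small = prepare(img, mask, width=4)
print(small.shape, small[0, 0])       # (4, 4) 180.0
```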

<p><strong>Special Thanks</strong></p>

<p>* <sub><sup> <a href="https://gist.github.com/cdiener/10491632">ASCII converter 1</a> for providing me with a starting point for code, and <a href="https://www.geeksforgeeks.org/converting-image-ascii-image-python/">ASCII converter 2</a> for providing a more detailed ‘gradient’ of colors for Pusheen to exist in. Both were extremely useful in providing a starting point for the ASCII art converter </sup></sub></p>

<p>* <sub><sup> <a href="http://flothesof.github.io/removing-background-scikit-image.html">Frolian - flothesof</a> for making me realize that OpenCV is for lazy people (lazy people who happen to be able to figure out how to install it ¯\_(ツ)_/¯) </sup></sub></p>