I’m a fan of code abstraction; I like how clean code looks and “feels”. I think that clean and good code is like art. And just like art can be categorized into styles such as Impressionism, Neo-Impressionism, and Post-Impressionism (all of which I like), we can also organize code.
In this post, I do not talk about functional vs. imperative vs. object-oriented programming but the mathematical structure in code. You might have heard of concepts such as monads, monoids, functors, etc. At an abstract level, these concepts lay out specific properties that we can use to describe how data can flow between various classes (in the programming sense, e.g., python, c++, java, etc.). The benefit here is that if your code fulfills the requirements laid out by these categories, you get certain guarantees about your program regarding results and how you can compose them together.
This is the first in a series of blog posts discussing categories in programming languages that will hopefully help you notice patterns and write cleaner code. This series will not be mathematical and assumes no prior knowledge other than Python (which you don't even really need; it just provides a concrete example of what we're doing).
We will continually expand on the following scenario throughout the series as we go from “ugly” unabstracted code to clean abstractions. It’s important to note that you (and I) have probably written code that fits into these concepts without even realizing it! The concepts introduced here are to make you more aware of what you are writing and make you notice these patterns, allowing you to reuse lots of code you have already written.
You are working on a project involving “parallel” computation, e.g., you have multiple computers or processes on the same system. Concretely, you have 100 machines with identical datasets on them. You want to do a hyperparameter search, e.g., ten searches over each of the 100 machines, totaling 1K runs. For each run, you want to track some validation loss before returning the model with the lowest validation loss.
Note: Throughout this post, we assume that you have some `train` and `validate` method implemented.
If you were to find the best model, you might have something like the following:
Dataset = Tuple[NumericalArray, NumericalArray]
ValidationResults = List[float]
class Node(object):
"""
A compute node on a single machine
"""
def __init__(self, data: Dataset, hyperparameters: List[Dict[str, Any]]):
self.train_data = data[0]
self.validation_data = data[1]
self.conf = hyperparameters
self.validation_losses = []
def run(self): # A Map
"""Run the training and validation"""
for conf in self.conf:
trained_model = train(conf, self.train_data)
self.validation_losses.append(validate(trained_model, self.validation_data))
def report(self, validation_lists: List[ValidationResults]) -> float: # A reduce
"""
validation_lists = [
[1, ..., 10] # Node 1
[0.1, ..., 1.0] # Node 2
....
]
"""
minimum = math.inf
for arr in validation_lists:
minimum = min(minimum, min(arr))
return minimum
And you would distribute this to all your nodes. After completing its computation, each node will send its report (a list of 10 floating-point numbers describing the validation losses) to a "reducer" node. The reducer node will accumulate all 1K results before reducing them to find the minimum value.
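In miniature, the whole flow might be sketched as follows; `train`, `validate`, and the networking layer are stubbed out with hypothetical stand-ins, so this only illustrates the map/reduce shape of the computation:

```python
from typing import Any, Dict, List

# Hypothetical stand-ins for the real train/validate methods from the post
def train(conf: Dict[str, Any], train_data: Any) -> Dict[str, Any]:
    return {"conf": conf}  # pretend this is a trained model

def validate(model: Dict[str, Any], validation_data: Any) -> float:
    return model["conf"]["lr"]  # pretend the loss depends on the config

def run_node(confs: List[Dict[str, Any]], data: Any) -> List[float]:
    """One machine: train/validate each config, report the losses."""
    return [validate(train(c, data), data) for c in confs]

# "Distribute" to 3 toy machines with 2 configs each...
reports = [run_node([{"lr": 0.1 * (i + j)} for j in range(1, 3)], None)
           for i in range(3)]
# ...and the reducer node accumulates every report before taking the minimum
best = min(min(report) for report in reports)
assert best == 0.1
```

In the real setup, each `run_node` call lives on a different machine and `reports` arrives over the network; the reduction step is unchanged.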
The situation above is simple; the final node in the graph accumulates all the report
results and finds the minimum, which is simple and doesn’t take up too much memory since floats are cheap to store.
However, what happens if we want to find more than just the minimum losses, and our data takes up much more memory? In this case, we would like to apply multiple reductions; Nodes 1-10 send their results to Reducer1, Nodes 11-20 send to Reducer2, and so forth. At the end, we have a final reducer which takes results from all the reducers for our final result.
Our code now looks like the following:
Dataset = Tuple[NumericalArray, NumericalArray]
ValidationResults = List[float]
Data = Union[Dataset, ValidationResults]
class Node(object):
"""
A compute node on a single machine which either:
- runs the hyperparameter search
- runs a reduction on the data
"""
def __init__(self, data: Data, hyperparameters: Optional[List[Dict[str, Any]]] = None):
"""
In the case of our data being of instance `ValidationResults`, hyperparameters is an empty dictionary
"""
# On our "reducer" nodes
if isinstance(data, list):
self.data = data
else:
self.train_data = data[0]
self.validation_data = data[1]
self.conf = hyperparameters
self.validation_losses = []
def run(self):
"""Run the training and validation"""
# Our reduce step
if hasattr(self, "data"):
self.validation_losses = self.data
return
# Our map-and-run step
for conf in self.conf:
trained_model = train(conf, self.train_data)
self.validation_losses.append(validate(trained_model, self.validation_data))
def report(self) -> float:
"""
All of the results here get collected and saved
"""
return min(self.validation_losses)
As we can see above, the code is quite messy. The messiness is because we have to care about the underlying data and what to do with it. We want to squint our eyes and abstract all the conditionals and checks.
Concretely, we would like to abstract away the values and make it cleaner, which we can do by the following:
class Dataset():
def __init__(self, data: Tuple[NumericalArray, NumericalArray], hyperparameters):
self.train_data = data[0]
self.validation_data = data[1]
self.conf = hyperparameters
self.validation_losses = []
def run(self):
for conf in self.conf:
trained_model = train(conf, self.train_data)
self.validation_losses.append(validate(trained_model, self.validation_data))
def report(self) -> float:
return min(self.validation_losses)
class ValidationResults():
def __init__(self, data: List[float], _: Optional[Any] = None):
self.data = data
def run(self):
return
def report(self) -> float:
return min(self.data)
Container = Union[Dataset, ValidationResults]
class Node(object):
"""
A compute node on a single machine which either:
- runs the hyperparameter search
- runs a reduction on the data
"""
def __init__(self, container: Container):
self.container = container
def run(self):
# The individual types handle their own run
self.container.run()
def report(self) -> float:
# The individual types handle their own reduction
return self.container.report()
As we can see, we defined two classes above, which will handle the `run` and `report` as necessary. By delegating the calls, we, as the programmer, do not have to care what the underlying `Container` is.
In my opinion, this is much cleaner! This way, we have decoupled the run logic from the underlying data type. All we need to do is call the appropriate values.
At a higher level, this is freeing because we can treat these class instances as abstract containers: as long as something follows the type signatures of `run` and `report` from `Node`, it should, in theory, work out exactly as we expect.
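One way to make that contract explicit is a structural type; this is a sketch using `typing.Protocol`, which is not part of the original code, and `Constant` is a hypothetical container used purely for illustration:

```python
from typing import Protocol

class Runnable(Protocol):
    """Anything a Node can wrap: it must expose run() and report()."""
    def run(self) -> None: ...
    def report(self) -> float: ...

class Node:
    def __init__(self, container: Runnable):
        self.container = container

    def run(self) -> None:
        self.container.run()

    def report(self) -> float:
        return self.container.report()

# Any class with the right methods works; no inheritance is required
class Constant:
    def run(self) -> None:
        pass
    def report(self) -> float:
        return 42.0

assert Node(Constant()).report() == 42.0
```

A static type checker can now flag a container that is missing `run` or `report` without us maintaining a `Union` of every concrete type.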
However, none of this should be new to you. Creating an abstract interface to make code clean isn’t anything “interesting” in and of itself. Let’s go deeper.
In the second phase of the project, you decide that you want to add in things like the running sum and count of the validation losses (so you can compute a mean) and a record of the 100 best models. This would ultimately derail the structure we've got above... or would it? Let's take a look at the custom types we have defined so far:
- `Dataset`
- `ValidationResults`
We notice that our `Dataset` doesn't change much, other than `Dataset.run`. Our `ValidationResults` will change, but that's understandable.
Note: In the following, I assume you'll be keeping track of the top 100 best models in your own way. I'll be "using" a heap, but I won't include any logic for it because that's not the point of this post.
The naive approach (which would probably come to mind first) would be the following. P.S.: at the end of our reduce step, we have a dictionary of values, which you must process to get whatever values you want.
class Dataset():
def __init__(self, data: Tuple[NumericalArray, NumericalArray], hyperparameters: List[Dict[str, Any]]):
self.train_data = data[0]
self.validation_data = data[1]
self.conf = hyperparameters
self.validation_losses = []
def run(self):
self.validation_loss_min_heap = heapify([])
for conf in self.conf:
trained_model = train(conf, self.train_data)
validation_losses = validate(trained_model, self.validation_data)
self.validation_losses.append(validation_losses)
# you do the checks and logic
self.validation_loss_min_heap.insert(validation_losses)
def report(self) -> Dict[str, Any]:
return {
"min": min(self.validation_losses),
"sum": sum(self.validation_losses),
"count": len(self.validation_losses)
"best_100": self.validation_loss_min_heap
}
class ValidationResultDict():
def __init__(self, data: List[Dict], _):
self.data = data
def run(self):
return
def report(self):
min_so_far = math.inf
sum_so_far = 0
count_so_far = 0
validation_loss_min_heap = heapify([])
for data_dict in self.data:
min_so_far = min(min_so_far, data_dict["min"])
sum_so_far = sum_so_far + data_dict["sum"]
count_so_far = count_so_far + data_dict["count"]
# you do the checks and logic
validation_loss_min_heap.insert(data_dict["best_100"])
return {
"min": min_so_far,
"sum": sum_so_far,
"count": count_so_far,
"best_100": validation_loss_min_heap
}
Container = Union[Dataset, ValidationResultDict]
class Node(object):
"""
A compute node on a single machine which either:
- runs the hyperparameter search
- runs a reduction on the data
"""
def __init__(self, data: Container):
self.container = data
def run(self):
# The individual types handle their own run
self.container.run()
def report(self) -> Dict:
# The individual types handle their own reduction
# Also, you now have to process the returned dictionary
return self.container.report()
where we added custom code to track the state and update our dictionary container. However, as we can see, there is a LOT of similarity between `Dataset.report` and `ValidationResultDict.report`. Can we make this cleaner?
To do so, we first introduce the concept of a monoid (though I wouldn't bother reading up on it until after you've finished this article).
How does a monoid help us? Well, what is a monoid? A monoid is a mathematical structure with the following properties:
- an associative binary operation that combines two elements,
- an identity element, and
- closure: an instance of the structure will always output an instance of the structure when you apply the binary operation above.
Knowing this, could we abstract out our code? We're making a bit of a jump below, but I promise I'll add comments to the code. Let's add a new class, `Summary`, which we define as the following:
class Summary():
def __init__(self, validation_loss: Optional[float] = None, inplace=False):
"""
We define an identity and non-identity instantiation
There are 2 cases:
- validation_loss is None: where our compute node had an empty configuration file, or errored out
- validation_loss is not None: our computation node worked!
"""
self.count = 0 if validation_loss is None else 1
self.min = math.inf if validation_loss is None else validation_loss
self.sum = 0 if validation_loss is None else validation_loss
self.best_N = heapify([]) if validation_loss is None else heapify([validation_loss])
self.inplace = inplace
def reduce(self, other: "Summary") -> "Summary":
"""
We've defined an associative binary operation where
reduce(a, b) == reduce(b, a)
and the output is always a summary!
"""
to_assign = self if self.inplace else Summary()
to_assign.count += other.count
to_assign.min = min(self.min, other.min)
to_assign.sum += other.sum
to_assign.best_N = merge_heaps(self.best_N, other.best_N)
return to_assign
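Since `reduce` is associative and the empty `Summary()` acts as an identity, we can sanity-check the monoid laws directly. The following is a minimal sketch: the `best_N` heap is omitted, because `heapify`/`merge_heaps` are deliberately left unspecified in this post:

```python
import math

class Summary:
    """Simplified version of the post's Summary (no best-N heap)."""
    def __init__(self, validation_loss=None):
        self.count = 0 if validation_loss is None else 1
        self.min = math.inf if validation_loss is None else validation_loss
        self.sum = 0.0 if validation_loss is None else validation_loss

    def reduce(self, other: "Summary") -> "Summary":
        out = Summary()
        out.count = self.count + other.count
        out.min = min(self.min, other.min)
        out.sum = self.sum + other.sum
        return out

    def __eq__(self, other):
        return (self.count, self.min, self.sum) == (other.count, other.min, other.sum)

a, b, c = Summary(1.0), Summary(2.0), Summary(3.0)
e = Summary()  # the identity element

# Identity law: combining with the empty Summary changes nothing
assert a.reduce(e) == a and e.reduce(a) == a
# Associativity law: grouping does not matter
assert a.reduce(b).reduce(c) == a.reduce(b.reduce(c))
```

Associativity is exactly what lets us chain reducers in any tree shape (10 reducers feeding a final reducer) and still get the same answer.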
We've done three things above:
1) defined an identity instantiation of `Summary` to handle the case where we've errored out or our configuration was empty (for various reasons),
2) defined an associative binary operation (`reduce`), and
3) ensured closure: reducing two `Summary`s always outputs a `Summary` type!
We can then restructure our code by noting a few things:
- `Dataset.report` will now always return a singleton `Summary`
- `ValidationResultDict` becomes `ValidationResult`: it now accepts a `List[Summary]` on `__init__`, as opposed to a `List[Dict]`, and it now outputs `Summary`s
class Dataset():
def __init__(self, data: Tuple[NumericalArray, NumericalArray], hyperparameters):
self.train_data = data[0]
self.validation_data = data[1]
self.conf = hyperparameters
# Create one just to ensure we always have something when the `report` is called
# This way even if we do a `report` we can be sure that the code won't error out
self.summary = [Summary()]
def run(self):
for conf in self.conf:
trained_model = train(conf, self.train_data)
v = validate(trained_model, self.validation_data)
self.summary.append(Summary(v))
def report(self) -> List[Summary]:
return self.summary
class ValidationResult():
def __init__(self, summary_list_of_lists: List[List[Summary]], _, reduce_immediately=False):
# Reduce the LoL into a single list
self.summaries = sum(summary_list_of_lists, [])
self.reduce_immediately = reduce_immediately
def run(self):
return
def report(self) -> List[Summary]:
# Option 1: reduce everything first and transmit a single Summary,
# which saves bandwidth
if self.reduce_immediately:
running_summary = Summary()
for summary in self.summaries:
running_summary = running_summary.reduce(summary)
# Insert into a list to keep the types nice and tidy
running_summary = [running_summary]
# Option 2: pass the list through and let a downstream node reduce it
else:
running_summary = []
for summary in self.summaries:
running_summary.append(summary)
return running_summary
Container = Union[Dataset, ValidationResult]
class Node(object):
"""
A compute node on a single machine which either:
- runs the hyperparameter search
- runs a reduction on the data
"""
def __init__(self, data: Container):
self.container = data
def run(self):
# The individual types handle their own run
self.container.run()
def report(self) -> List[Summary]:
# The individual types handle their own reduction
return self.container.report()
Chef's kiss.
P.S.: Again, you would need to do the final processing on the `Summary`, but that's easy.
Notice how, by modifying our logic, we made our code look extremely simple. If we decide to add another feature, e.g., a max, a standard deviation, etc., all we would have to change is our `Summary` class to encapsulate the change.
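For instance, adding a max only touches `Summary`. A heap-free sketch (hypothetical, mirroring the class above; `Node`, `Dataset`, and `ValidationResult` need no changes at all):

```python
import math

class Summary:
    """Heap-free sketch; adding `max` only touches this class."""
    def __init__(self, validation_loss=None):
        self.count = 0 if validation_loss is None else 1
        self.min = math.inf if validation_loss is None else validation_loss
        self.max = -math.inf if validation_loss is None else validation_loss  # new field
        self.sum = 0.0 if validation_loss is None else validation_loss

    def reduce(self, other: "Summary") -> "Summary":
        out = Summary()
        out.count = self.count + other.count
        out.min = min(self.min, other.min)
        out.max = max(self.max, other.max)  # new line: max is also a monoid
        out.sum = self.sum + other.sum
        return out

total = Summary()
for loss in [0.3, 0.1, 0.7]:
    total = total.reduce(Summary(loss))
assert (total.min, total.max, total.count) == (0.1, 0.7, 3)
```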
QUICK: Before your eyes gloss over the following diagram, listen to what I've got to say. You already know all of the things in the diagram, which is from Wikipedia: monoids.
In this case, `M` is an object in a category; think of it as a fixed but arbitrary class, e.g., `ValidationResult` or `Node`. (As programmers, we operate on instances of those classes, but ignore that for now.)
On the first line, we have three terms; let's index them 1, 2, and 3. On the bottom line, we have two terms; call them 4 and 5. In between these terms, we have arrows, which are transformations.
- `1->2`: $\alpha$ is the associator, which moves the parentheses around. We introduced associativity as a property of a monoid earlier.
- `2->3`: we have "reduced" $M \otimes (M \otimes M)$ into $M \otimes M$ by applying $1 \otimes \mu$, i.e., leaving the first term (the $M$ not in the parens) untouched via the identity while multiplying the pair inside the parens. We can do this because monoids must have an identity.
- `1->4`: the same as the above, but with the parens in a different location ($\mu \otimes 1$).
- `3->5` and `4->5`: the result of just evaluating the product, the $\mu$.
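In equation form, the diagram says that the two paths from $(M \otimes M) \otimes M$ down to $M$ agree:

```latex
% Associativity: both routes through the diagram produce the same result
\mu \circ (1 \otimes \mu) \circ \alpha \;=\; \mu \circ (\mu \otimes 1)
```

In our running example, this is precisely the guarantee that chaining `Summary.reduce` in any grouping yields the same final `Summary`.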
And there you go!
This post came about after a discussion with one of my mentees. That mentee was facing something similar, and as someone who has gone through this EXACT problem, I thought I’d write about it and share what I’ve learned.
Also, I firmly believe that one way to ensure you know something is by explaining it. And so, to finally understand "A monad is a monoid in the category of endofunctors, what's the problem?", I've decided to write a 3-part series on "What is a monoid?", "What is an endofunctor?", and "What is a monad?". All those posts will build off one another, so stick around!
This series of tutorials covers gradient flow in `PyTorch`. The series covers the following network architectures:
1) Single-headed simple architecture
2) Single-headed complex architecture
3) Multi-headed architecture
The notebook for this tutorial can be found on Google Colab: gradient_flow_1.
Note: For the purpose of this discussion, we define a module to be a single layer or a collection of layers in a neural network.
The motivation behind this post was 3-fold; chief among the reasons: `PyTorch` is easy to prototype in, but I didn't fully understand the PyTorch computation graph.
Stopping gradients is useful if we have a frozen layer that we want to avoid training. This problem is simple if we have a standalone module, but what happens if the shared module is an intermediate component of the model? What happens if we have a network that looks like the following?
NOTE: Image sourced from IntelLabs: DDPG
where we have two primary modules: the actor and the critic. We see that the critic (the bottom module) accepts the actor’s output, but unless we stop the gradient flow, the computation graph will backpropagate critic updates through the actor, which we do not want.
We focus on 5 methods that we categorize into High-Level, where we use built-in methods, and Low-Level, where we manually access the gradients.
All the methods listed below are only pertinent to stopping gradients:
- `detach`, which returns a copied tensor of the same values and properties but detached from the graph. The original is persisted.
- `no_grad`, which is a context manager that disables gradient calculation. This method sets all variables created inside its scope to have `requires_grad` set to `False`.
- `inference_mode`, which stops gradients entirely downstream, as well as upstream. This is a relatively new method (Sept 14, 2021), so it is worth discussing.
Since we have direct access to the gradients, we can not only stop gradients but also manipulate them based on our needs:
- Via the `optimizer`, where we do not pass the optimizer the parameters of the module.
- Manual manipulation, where we extract the gradients and then choose to modify or manipulate them before applying.
The `eval` misconception: when I first started using `PyTorch`, I incorrectly assumed that `eval` would:
1) put the model into inference mode (turning off dropout and making batchnorm run in eval mode)
2) turn off computation-graph construction
but this is not the case regarding turning off the computation graph.
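A quick sanity check of that point, as a minimal sketch separate from the notebook: even in `eval` mode, a forward pass still builds the graph.

```python
import torch as T

model = T.nn.Linear(2, 1)
model.eval()  # affects dropout/batchnorm behaviour only

out = model(T.ones(1, 2))
# eval() did NOT stop graph construction: the output still tracks gradients
assert out.requires_grad
assert out.grad_fn is not None
```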
At the end of the day, each of the methods above comes with various tradeoffs. We will discuss those tradeoffs below, but ultimately you will have to decide what is best for your application.
We have the following graph:
where we want to only update the network’s output head (L2). What are the various ways we can accomplish this?
I highly recommend having the colab notebook open as you work through this. I made it a point to plot the resulting computation graph for each setting, making it easier to understand what is happening.
`detach` detaches upstream values from the graph, so we only calculate the gradient backward up to the first `detach`. Our current graph setup is too simple to illustrate this phenomenon, but the computation graph in the follow-up post will show it well.
Notice 2 things from the cells:
1) the output of the `print` statements shows that the `grad` of `L1` is `None`.
2) `L1` does not exist in the computation graph (contrast this with the Control).
`detach` is useful for:
- stopping gradient flow.
- saving memory: torch tensors keep track of data such as the computation graph. We drop the computation graph of all upstream operations up to the current variable by detaching these tensors.
- converting to `numpy`: trying to convert directly to `numpy` errors out (rightfully so) because `numpy` does not keep track of the computation graph. It is safer to have a clear distinction between `numpy` arrays and torch tensors.
import torch as T
a = T.tensor(1.0, requires_grad=True)
b = a + a
b.numpy()  # RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
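To see `detach`'s gradient-stopping behaviour itself concretely (a minimal sketch, separate from the notebook):

```python
import torch as T

x = T.ones(3, requires_grad=True)
y = (x * 2).detach()   # same values as x * 2, but cut out of the graph
z = (y * x).sum()
z.backward()

# Gradient flows through the y * x product, but not back through x * 2:
# dz/dx = y = [2, 2, 2], with no extra contribution from the detached branch
assert y.requires_grad is False
assert T.equal(x.grad, T.full((3,), 2.0))
```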
`no_grad` in action. It can be used as such:
#!pip install -q torchviz
import torch as T
from torchviz import make_dot
# Requires grad = True to construct graph
x = T.ones(10, requires_grad=True)
with T.no_grad():
pass
y = x ** 2
z = x ** 3
r = (y + z).sum()
make_dot(
r,
params={"y": y, "z": z, "r": r, "x": x},
show_attrs=True
)
Uncomment the first line if you do not already have torchviz. Then, play around with moving `y` or `z` into the `T.no_grad()` context.
The graph of `no_grad` is the same as the graph of `detach`. The printed information shows that `L1` has `None` gradients, similar to the previous method.
`no_grad` is useful for:
- stopping gradients.
- improving computational speed and memory consumption: `no_grad` tells PyTorch not to track any operations within the context, which means that the computation graph is not created.
Furthermore, `no_grad` is faster than `detach`, as `detach` returns a copy of the input tensor (just without the computation graph). By comparison, `no_grad` does not persist the computation graph of variables within its scope.
Keeping both the torch tensor and numpy array around might not be your intention, and you might accidentally operate on the wrong variable.
`inference`: we discuss two observations for this code section:
1) Viewing the computation graph, we see that no values are tracked (hence an empty, singular block).
2) Solution: if we want to allow downstream calculations that are themselves not in `inference` mode, we must make a `clone` of the tensor. We display the relevant sections of this in section 4.3.2 (Relevant code).
We see this method produces the same computation graph as in the `detach` and `no_grad` settings. Like `no_grad`, `inference_mode()` is a context manager. In `no_grad` and `detach`, upstream values were not tracked in the computation graph; in `inference_mode`, even downstream values are not tracked.
*PyTorch C++ inference mode docs.
We generated the two graphs by following the setup from this official Twitter post.
def _inference_forward(self, X):
# First var is a inferenced-var
with T.inference_mode():
tmp = self.l1(X)
try:
# Try to do a non-inference forward pass
return self.l2(tmp)
except Exception:
print(f"Trying to use intermediate inference_mode tensor outside inference_mode context manager")
# Getting pure-inference
with T.inference_mode():
grad_disabled = self.l2(tmp)
# Convert inferenced-var and allow us to
# do a normal forward pass
new_tmp = T.clone(tmp)
grad_enabled = self.l2(new_tmp)
return grad_disabled, grad_enabled
Gradient propagation: it is possible to use this method to stop gradients, but there are easier ways to accomplish this.
Inference speed: while `no_grad` stops operation tracking, `inference_mode` disables two other autograd features: version counting and metadata tracking.
In the following methods, we work directly with the computed gradients instead of detaching variables or telling PyTorch to ignore blocks. This low-level manipulation is helpful if we want to make complex modifications to our gradients (it won’t be relevant here, but it is worth mentioning ahead of time).
Furthermore, whereas the methods in the High-Level section stopped all gradients from flowing upstream, both of the Low-Level methods allow us to skip modules.
Note: the gradients are stored in the model parameters when we call `loss.backward`. The only thing our `optimizer.step` call does is apply the gradients. This means that using the optimizer method is more or less equivalent to the manual manipulation method.
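A minimal sketch of that note, using a hypothetical toy model and plain SGD so the applied update is easy to verify by hand:

```python
import torch as T

model = T.nn.Linear(2, 1, bias=False)
opt = T.optim.SGD(model.parameters(), lr=0.1)

loss = model(T.ones(1, 2)).sum()
loss.backward()                  # backward() stores gradients in p.grad...
assert model.weight.grad is not None

w_before = model.weight.detach().clone()
opt.step()                       # ...and step() merely applies them: w -= lr * grad
assert T.allclose(model.weight, w_before - 0.1 * model.weight.grad)
```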
Unlike the resulting computation graphs in the High-Level section, we see that all variables here are tracked:
In the High-Level section, we saw that no `L1` information is kept around, but in both Low-Level solutions, `L1` is still tracked even if it is unused (verified by quick tests in the corresponding cells). We also see that the gradients are non-zero, which means that it is just a matter of applying them, either via the optimizer or manually.
These methods can consume far more memory as the entire computation graph has to be computed.
`optim.Optimizer`: we modify our optimizer such that instead of doing something like `optim.SomeOptimizer(model.parameters())`, we instead do `optim.SomeOptimizer(model.l2.parameters())`, which tells our optimizer to only apply gradients to the `L2` parameters.
- As in the above methods, we can "freeze" a layer by using this method.
- We can specify per-module hyperparameters.
- However, we do not have fine-grained control over the gradients themselves.
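A sketch of the optimizer approach on a hypothetical two-layer model; note that `l1` still receives gradients, the optimizer simply never applies them:

```python
import torch as T

class Model(T.nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = T.nn.Linear(4, 4)
        self.l2 = T.nn.Linear(4, 1)
    def forward(self, x):
        return self.l2(self.l1(x))

model = Model()
# Only l2's parameters are handed to the optimizer; l1 is effectively frozen
opt = T.optim.SGD(model.l2.parameters(), lr=0.1)

l1_before = model.l1.weight.detach().clone()
loss = model(T.ones(1, 4)).sum()
loss.backward()  # l1 STILL gets gradients computed...
opt.step()       # ...but step() never touches l1's parameters
assert T.equal(model.l1.weight, l1_before)
assert model.l1.weight.grad is not None
```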
Manual manipulation: while the above section had the optimizer apply our gradients, here we apply the gradients manually.
The only use-case I see for this method over every other method listed above is custom gradient applications. For example, if you wanted to zero out gradients every other step or scale the gradients if certain conditions are met.
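A sketch of such a custom application; the `step` counter and the scaling rule are hypothetical stand-ins for whatever condition you care about:

```python
import torch as T

model = T.nn.Linear(4, 1)
opt = T.optim.SGD(model.parameters(), lr=0.1)

loss = model(T.ones(1, 4)).sum()
loss.backward()  # for this input, weight.grad is ones(1, 4)

step = 1  # hypothetical global step counter
with T.no_grad():
    for p in model.parameters():
        if p.grad is None:
            continue
        if step % 2 == 0:
            p.grad.zero_()    # e.g. zero out gradients every other step
        else:
            p.grad.mul_(0.5)  # or scale them when some condition is met

assert T.allclose(model.weight.grad, T.full((1, 4), 0.5))
opt.step()  # the optimizer applies whatever gradients we left in place
```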
The “simple” methods are a lot easier to pull off and should be preferred if all you need to do is stop gradients from flowing upstream.
My recommendation is to use `no_grad` wherever possible, as it is faster than `detach`. As with most style preferences, this is subjective, but I feel that `no_grad` is also better because it is clear that you are excluding a block of computations that will be used further down. When you `detach` a variable, you now have the torch tensor version as well as the numpy array.
I recommend avoiding `inference_mode` for gradient manipulation unless you're absolutely sure that you have a good reason. I do not see a scenario where you might prefer doing `inference_mode` and then copying the variable when you can use `no_grad` directly.
If possible, use the optimizer approach as there’s less room for error. However, the Manual manipulation method is ideal if you need to apply custom operations.
One such use-case for manual manipulation is to scale only particular layers if specific conditions are met or if you want to zero out gradients every other step.
Thank you for taking the time to read this! If you ever want to contact me feel free to email me at firstname@website URL. You can also reach me on Linkedin: ianq, but if we don’t know each other, either attach a note to your invitation or send me an email along with the invitation. I tend to ignore requests otherwise.
This post aims to (briefly) discuss why I like `jax` and then compare `jax` and `numpy` vis-à-vis randomness, in particular `(j)np.random.seed`.
You can find the associated notebook for this post, but it's relatively minimal. Feel free to open the link and play with the notebook, but know that running it is not strictly necessary.
Given my current needs, I think that `jax` is the best computational tool out there. I hope to write more about `jax` in the coming months and show you why you should consider trying it out. One important thing to realize is that `jax` is not a deep learning framework (although it does have autograd built in). First and foremost, `jax` is a numerical computation library, like `numpy`.
Over the weekend, I was working on porting some code from `pytorch` to `jax`. In the process, I stumbled onto some code that dealt with randomness, and I decided to read more about randomness in the context of `numpy`. The material I read over the weekend ended up being the motivation behind this blog post. To begin, let's look at how we would deal with randomness in `jax`:
key = jax.random.PRNGKey(SEED)
print(key)
# which outputs the following on my run:
# DeviceArray([1076515368, 3893328283], dtype=uint32)
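From there, the usual pattern (a minimal sketch) is to `split` the key and pass an explicit subkey to every random call; the same subkey always reproduces the same draw:

```python
import jax

key = jax.random.PRNGKey(0)          # all randomness flows from this key
key, subkey = jax.random.split(key)  # derive a fresh key for each random call

x = jax.random.normal(subkey, (3,))
y = jax.random.normal(subkey, (3,))  # same key in -> same numbers out
assert bool((x == y).all())
```

There is no hidden global state: if you want a different draw, you must split again and use a new subkey.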
Ironically, I felt like I understood `numpy`'s randomness better after using `jax`. This blog post hopes to exposit what I learned in the process.
As mentioned earlier, `jax` is a computational framework akin to `numpy`. I'd say the main difference between `jax` and `numpy` is that `jax` was designed to be accelerator agnostic: `jax` runs fast regardless of whether you're on a CPU, GPU, or TPU. I particularly like it because of:
how fast it is when compared to other frameworks (I got a 10X speed boost compared to raw vectorized numpy in a function with lots of dot products).
how easy it is to peek into its internals (admittedly, this is subjective).
how it allows you to implement the equations you see in papers directly. You can implement one line of the equation and then call `vmap` to apply it to all rows in your array. You don't need to futz around with vectorizing your equations any longer.
I feel like `jax` and `XLA` are the future of computation in Python. Granted, this isn't exactly a hot take; lots of people and companies have begun to move to `jax`:
- DeepMind's alphafold model is built in haiku, a deep-learning oriented library built on top of `jax`.
- Google Brain has also released a deep-learning library called flax. From what I can tell, teams at Google Brain have begun transitioning over to it.
- Huggingface has also begun releasing models in flax.
Anyways, on to the meat of this post: over the weekend, I was playing with the idea of porting over snnTorch to `jax`. I first began by scanning through the tutorials, where I read some material about creating random spike trains. The contents of the tutorial and what spike trains are aren't crucial for this post. Still, it did remind me that `jax` handles randomness differently from other frameworks. So, I thought I should do some deep(er) reading before naively moving code over.
If you look up randomness in `jax`, one of the first things you'll stumble on is how to generate a random key and continually split it. To make a long story short, `jax` is functional in nature, which means that it is stateless. Being stateless means (among other things) that `jax` handles randomness explicitly; we have to explicitly seed a value every time we invoke randomness in our code. On the one hand, this makes our code more verbose, but on the other hand, it makes reproducibility far easier.
The following is merely a working example of what “statefulness” means. It is by no means a rigorous definition. Think of being stateful as the following:
class StatefulAdd():
def __init__(self):
self.count = 0
def __call__(self, x):
# The input plus the number of times it has been called
self.count += 1
return x + self.count
foo = StatefulAdd()
first = foo(1) # first := 2
second = foo(1) # second := 3
i.e., we can plug the same value in but obtain different values each time. There's nothing inherently wrong about coding this way (regardless of what the func-ies will say); it can just be harder to reason about.
Anyways, going back to `jax`: by enforcing statelessness, we have to be explicit about our random key every time we make a call. By enforcing statelessness, `jax` sidesteps the reproducibility issue that plagued Tensorflow 1.X (and probably pytorch too). Although `jax` isn't perfect in the reproducibility aspect, I believe it is going in the right direction.
How to get stable results with TensorFlow, setting random seed (although, to be fair, there seems to be an official answer for Tensorflow 2 as of 2020).
Why can't I get reproducible results in Keras even though I set the random seeds? (asked in 2018), which contains my favorite answer I've seen so far. The answer states the following:
In short, to be absolutely sure that you will get reproducible results with your python script on one computer’s/laptop’s CPU then you will have to do the following:
# Seed value
# Apparently you may use different seed values at each stage
seed_value= 0
# 1. Set the `PYTHONHASHSEED` environment variable at a fixed value
import os
os.environ['PYTHONHASHSEED']=str(seed_value)
# 2. Set the `python` built-in pseudo-random generator at a fixed value
import random
random.seed(seed_value)
# 3. Set the `numpy` pseudo-random generator at a fixed value
import numpy as np
np.random.seed(seed_value)
# 4. Set the `tensorflow` pseudo-random generator at a fixed value
import tensorflow as tf
tf.random.set_seed(seed_value)
# for later versions:
# tf.compat.v1.set_random_seed(seed_value)
# 5. Configure a new global `tensorflow` session
from keras import backend as K
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)
# for later versions:
# session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
# sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_conf)
# tf.compat.v1.keras.backend.set_session(sess)
Indeed, a thing of beauty.
numpy
First and foremost, I'd recommend opening the accompanying notebook, specifically the numpy portion, and playing with the code there. NB: the jax portion is trivial and works as you might expect; I included it primarily for completeness.
As you play with the numpy portion, you'll notice that you get new random values every time you call the random module, without ever explicitly passing in a key, which tells us something is happening under the hood.
This "something" looks a lot like a new random key being generated on every call. Note that this is not what actually happens under the hood, but it helps tie what we see to jax and how it handles random state.
You have a program that only crashes once in a while, and you’ve identified the exact function that it crashes on! You’ve even managed to find a specific random seed on which that function works fine, so you’d like to set the state only inside that function and avoid the problem altogether.
Yes, this is a contrived example; sue me.
Note here how we have reset the random seed within new_generate_np_weights. If the randomness were local to the context we are in, we would expect to "continue" the original randomness once we exit the function. Said differently, we would have two "sources" of randomness, the second of which would get garbage collected once new_generate_np_weights returns; however, as we can see on the call labeled "# 3rd call", we have received the same random value as our "# 2nd call".
Clearly, something "unexpected" is happening. At its core, np.random.seed creates what is known as a RandomState which, as we've discussed, is a stateful object. In fact, as we saw in our code example, calling seed recreates the object instead of reseeding it.
Obviously, this is the source of our issues.
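The behavior described above can be reproduced with a short sketch (the seeds and the function name here are illustrative, not the notebook's exact values): reseeding inside a function mutates the single global RandomState, so the outer stream does not "continue" after the function returns.

```python
import numpy as np

def new_generate_np_weights():
    np.random.seed(42)         # the hoped-for "local" reseed
    return np.random.random()  # 2nd call

np.random.seed(0)
first = np.random.random()          # 1st call: seed-0 stream
second = new_generate_np_weights()  # 2nd call: reseeds the GLOBAL state
third = np.random.random()          # 3rd call: continues the seed-42 stream

# If the reseed had been local, `third` would be the second draw of the
# seed-0 stream. Instead it is the second draw of the seed-42 stream:
np.random.seed(42)
np.random.random()
assert third == np.random.random()
```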
In all honesty, I had previously stumbled on the new best practices for generating random numbers in numpy, but I never bothered to read it. The reasoning behind the recommendation never clicked with me, so I never felt a need to change how I was doing things.
However, now that we are clear on the limitations of the existing np.random.seed, we can discuss the recommended way of doing things: the Generator API, created via default_rng. To make a long story short, you create an object which contains all your randomness, and you "extract" whatever you need from this random object. For example, see random sampling:
from numpy.random import default_rng
rng = default_rng()
vals = rng.standard_normal(10)
more_vals = rng.standard_normal(10)
as opposed to an older method
from numpy import random
vals = random.standard_normal(10)
more_vals = random.standard_normal(10)
where we presumably mutate a global object.
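To see why this matters for reproducibility, here is a small sketch: each Generator owns its state, so seeding is local to the object, and two identically seeded generators yield identical streams without touching any global state.

```python
from numpy.random import default_rng

# Two generators with the same seed produce the same stream,
# independently of each other and of any global RandomState.
rng_a = default_rng(0)
rng_b = default_rng(0)
draws_a = rng_a.standard_normal(3)
draws_b = rng_b.standard_normal(3)
assert (draws_a == draws_b).all()
```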
This was an enlightening topic for me to dive into, and I hope you found reading this useful. I feel like I better understand what numpy
does under the hood when we use randomness. I also feel like I better understand the motivation behind numpy
’s API change recommendation when viewed through the lens of jax
. Finally, I got a cheap plug to advertise SNNax.
tl;dr
1) jax
handles randomness very well, even if it may be more verbose.
2) Join in on SNNax.
3) Use the new best practices if you are dealing with random numbers in numpy.
Thanks for reading! Feel free to add me on LinkedIn but message me first saying why you’d like to connect. If LinkedIn doesn’t let you message me, you can also email me ian-AT-this_website_url with your profile URL and mention why you’d like to connect. I get quite a lot of spam, and I use this as a filter.
1) You can generate multiple keys with jax.random.split that you can consume
key_array = jax.random.split(key, num=X)
Overview
1) We introduce the PALISADE library and the cryptographic parameters that we need to specify. We then explain what the cryptographic parameters mean for our application.
2) We use the pTensor library and train a housing price predictor on the Ames dataset, a modern house price dataset.
3) We set up the discussion for the next post in the series.
Note: check the link at the very bottom for the complete source code. Sections have been omitted in this page to reduce clutter.
Instructions to install PALISADE can be found here: PALISADE-Dev build instructions. For users new to PALISADE and C++, we highly recommend bookmarking the PALISADE Doxygen page containing the library’s documentation.
From the README.md
on the PALISADE page
PALISADE is a general lattice cryptography library that currently includes efficient implementations of the following lattice cryptography capabilities:
The takeaway for us machine learning practitioners is that we can train encrypted machine learning models to output encrypted predictions after training said model on encrypted data.
We as machine learners(?) need to have a rough idea of the following parameters:
This describes the depth of multiplication supported. Informally, when we encrypt data, we add some noise to increase the scheme’s security. When doing mathematical operations on these data, our noise increases (linearly in addition and subtraction but squared in multiplication).
There is no single "best" value to set the multDepth to, and it is highly dependent on your problem. The following are some example equations and their corresponding multiplication depths:
(a * b) + (c * d) has a multiplication depth of 1
a * b * c has a multiplication depth of 2
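The two examples above can be checked with a toy depth counter (this is illustrative Python, not PALISADE code): leaves stand for ciphertexts, internal nodes are (op, left, right) tuples, and only multiplications consume a level of depth.

```python
# Toy sketch: multiplicative depth of an expression tree.
def mult_depth(node):
    if not isinstance(node, tuple):
        return 0  # a leaf ciphertext has depth 0
    op, left, right = node
    depth = max(mult_depth(left), mult_depth(right))
    # addition/subtraction grow noise only linearly, so they
    # do not consume a multiplication level
    return depth + 1 if op == '*' else depth

# (a * b) + (c * d): one level of multiplication
assert mult_depth(('+', ('*', 'a', 'b'), ('*', 'c', 'd'))) == 1
# a * b * c, i.e. (a * b) * c: two sequential multiplications
assert mult_depth(('*', ('*', 'a', 'b'), 'c')) == 2
```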
In the original CKKS paper, the authors discuss a scaling factor they multiply values with. The scaling factor prevents rounding errors from destroying the significant figures during encoding. Unfortunately, it is difficult to discuss this parameter without discussing the paper’s core ideas, so we leave this for the next post. Thankfully, PALISADE is reliable in informing us if the scalingFactorBits
is set too low.
We tend to use values between 30 and 50 for most of the applications.
The batchSize is a tricky parameter to set correctly. The issue is that the batch size must be equal to
\[\frac{\text{Ring size}}{2}\]
Unfortunately, one needs to set multDepth, look at the resulting ring size, and then do it all over again with batchSize set to half the ring size. It's a little hairy, yes, but this is the price we pay for privacy.
For this discussion we encourage readers to refer to linear_regression_ames.cpp but we also highlight the critical sections in our discussion.
The pTensor library's motivation is to give those with a machine learning or data science background the ability to train encrypted machine learning models in a framework that looks and feels familiar. Where possible, we aimed to mimic the numpy library in terms of behavior (e.g., it allows broadcasting, * corresponds to the Hadamard product, etc.).
In line with the library’s motivation, there are many aspects hidden from the user, but we briefly discuss important concepts that the inquisitive user may stumble upon while perusing the source code.
CKKS operates on complex numbers for various reasons that we will discuss in the follow-up but know that we only focus on the real-number portion from these complex numbers.
To pack the data essentially means that we encode multiple data points into a single ciphertext. Homomorphic encryption is a slow process, but by leveraging SIMD, we can carry out our operations faster. An analogy would be doing a for-loop
vs. vectorized operation in numpy. Because the size of our ciphertexts is already very large, it is advantageous to store the data in transpose form to reduce the number of encryptions we need to do and to allow for faster element-wise operations.
The m_cc
object is the cryptographic context which we use to carry PALISADE’s operations.
Should one attempt to follow the process in numpy or in Eigen, know that, because of the noise and the way our encryption scheme operates, one may see slightly different results between those plaintext versions and this encrypted version.
We briefly introduce the parameters used below but defer further discussion to later.
auto cc = lbcrypto::CryptoContextFactory<lbcrypto::DCRTPoly>::genCryptoContextCKKS(
multDepth, scalingFactorBits, batchSize
);
cc->Enable(ENCRYPTION);
cc->Enable(SHE);
cc->Enable(LEVELEDSHE); // @NOTE: we discuss SHE and LeveledSHE in the follow up
auto keys = cc->KeyGen();
cc->EvalMultKeyGen(keys.secretKey);
cc->EvalSumKeyGen(keys.secretKey);
int ringDim = cc->GetRingDimension();
int rot = int(-ringDim / 4) + 1;
// @NOTE: we discuss EvalAtIndex in the followup
cc->EvalAtIndexKeyGen(keys.secretKey, {-1, 1, rot});
We create a cryptocontext object which takes our chosen parameters:
multDepth
- The maximum number of sequential multiplications we can do before our data becomes too noisy and the decryption becomes meaningless.
scalingFactorBits
- the scaling factor mentioned above and to be discussed later.
batchSize
- how many data points (think vector of data) we pack into a ciphertext. Homomorphic encryption is slow but can be sped up by conducting operations over batches of data (via SIMD)
Notice how the parameters that the function takes in are the plaintext X and y. The reason for passing in plaintext X’s and y’s is to allow for easy indexing into the data for shuffling. It would be possible to shuffle the data in encrypted form but it is prohibitively slow and an easier alternative already exists. Thus, to simulate shuffling the data every epoch, we allow the user to specify some number of shuffles, and the data owner creates n-shuffles of the data that is then encrypted.
While training, we can simulate this randomness by randomly indexing into any of the shuffles.
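A plaintext Python sketch of this pre-shuffling scheme (the function name and seed are illustrative; in pTensor the data owner would encrypt each copy after shuffling):

```python
import random

def make_shuffles(X, y, n_shuffles, seed=0):
    # The data owner creates n pre-shuffled copies of (X, y); each copy
    # would then be encrypted. Training indexes randomly into the copies.
    rng = random.Random(seed)
    shuffles = []
    for _ in range(n_shuffles):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        shuffles.append(([X[i] for i in idx], [y[i] for i in idx]))
    return shuffles

shuffles = make_shuffles([[1], [2], [3]], [10, 20, 30], n_shuffles=5)
# every copy keeps each X row aligned with its label
assert all(all(x[0] * 10 == y for x, y in zip(Xs, ys)) for Xs, ys in shuffles)
```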
The following should look familiar to anyone familiar with machine learning
for (unsigned int epoch = 0; epoch < epochs; ++epoch) {
  auto index = distr(generator);
  auto curr_dataset = dataset[index];
  auto X = std::get<0>(curr_dataset);
  auto y = std::get<1>(curr_dataset);
  auto prediction = X.encryptedDot(w);
  auto residual = prediction - y;  // Remember, our X is already a transpose
  auto _gradient = X.encryptedDot(residual);
  pTensor gradient;
  gradient = _gradient;
  auto scaledGradient = gradient * alpha * scaleByNumSamples;
  w = pTensor::applyGradient(w, scaledGradient);
  w = w.decrypt().encrypt();
}
However, there are a few things to note:
1) encryptedDot
instead of dot
(which is also supported)
In the first encryptedDot, in the matrix-matrix case, we do a Hadamard product before summing along the 0th axis. Again, our X is encrypted in transpose form, of shape (#features, #observations), and our weight matrix is also of shape (#features, #observations). We leave it to the reader to work out the details of why this works.
In the other case (not matrix-matrix), we default to the standard dot product.
2) applyGradient
To understand the motivation here, we must first discuss the shape of the incoming values
w: (#features, #observations)
scaledGradient: (1, #features)
So, we must modify the scaledGradient
into a repeated Matrix form to apply it to the weights
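A plaintext numpy analogue of that reshaping (the shapes are taken from the discussion above; the exact mechanics inside pTensor may differ):

```python
import numpy as np

# Turn the (1, #features) gradient into the repeated-matrix form that
# matches the (#features, #observations) weights.
n_features, n_observations = 3, 4
scaled_gradient = np.array([[0.1, 0.2, 0.3]])          # (1, #features)
repeated = np.repeat(scaled_gradient.T, n_observations, axis=1)

assert repeated.shape == (n_features, n_observations)
# every column carries the same per-feature gradient
assert (repeated[:, 0] == scaled_gradient[0]).all()
```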
3) w.decrypt().encrypt()
The reason for our decrypt-encrypt has to do with the multDepth
parameter that we briefly discussed earlier. As mentioned, as we do operations on our ciphertexts, we accumulate noise. If this noise gets too large, our decryption will begin to fail. This failing results in random bits interpreted as (usually huge) random numbers. By decrypting and encrypting our results again, we can refresh this noise (reduce the noise to 0).
However, there is a caveat here: only the party with the secret key can perform the decryption needed for this refresh. Consider a scenario where we have a data enclave-client setup in which the client does all the computations. There is a limit to the maximum multDepth one can set before CKKS becomes too unwieldy. Computations that exceed that multDepth need either server re-encryption (as shown here) or bootstrapping (which we will address in the next post) to securely refresh the data. Bootstrapping resets the noise and thus the multiplicative depth; however, bootstrapping for CKKS is not yet available in PALISADE as of Feb 2021. The server re-encryption process is considered less secure than a fully homomorphic setup, but we defer further discussion to the next post.
Thank you for taking the time to read this! We hope that this post has given the reader a rough understanding of how to use PALISADE and pTensor for real-number applications. We plan to continue developing pTensor and create more tutorials around it, so follow or add me on LinkedIn - ianquahtc where I will periodically share updates. The full source code can be found at linear_regression_ames.cpp.
P.s: visit PALISADE - PKE for further examples of how to use PALISADE (one of which I contributed to!).
robust to outliers
scale invariant (it returns a percentage) and intuitive to compare across datasets.
You have forecasting data where a significant difference may exist between contiguous samples:
\[T_1 = 5, T_2 = 5000\]For example, you want to predict the price of Bitcoin, or ensure that your power plants can support the surge when England brews up for World Cup tea-time.
We reproduce the equation below:
\[\text{MAPE} = \frac{1}{N} \sum_{t=1}^N \left|\frac{y_t - \hat{y}_t}{y_t}\right|\]Here's hoping you learn from my mistakes and can avoid the time I wasted trying to solve this problem.
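The formula above is a one-liner in numpy (this transcription assumes no label is 0, a point we return to below):

```python
import numpy as np

def mape(y_true, y_pred):
    # mean absolute percentage error; assumes no y_true entry is 0
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))

# 10% error on both samples -> MAPE of 0.1
assert abs(mape([100, 200], [110, 180]) - 0.1) < 1e-12
```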
A quick look at the Sklearn Linear Model - Linear Regression page tells you that it only supports OLS. This is unfortunate because sklearn
is, in general, heavily optimized and well tested.
Having worked through the examples, I was not clear how to handle enormous datasets, which I was modeling at the time. The solution I was after was how to generate indices to be passed in for minibatch training. After much searching, I eventually found what I was looking for in Convnet Example, which shows you how to pass minibatches in.
Note: you want to be sure that none of your y_true values are 0, as this can lead to division-by-zero errors in the optimization. I suggest doing
def objective(params, X, y):
pred = np.dot(X, params)
non_zero_mask = y != 0  # guard against division by zero
return np.mean(np.abs((y[non_zero_mask] - pred[non_zero_mask]) / y[non_zero_mask]))
Other options would be to add weights to the objective
function as it is possible that you are extremely unlucky, and the objective function returns 0 as all your labels, `y’, are 0. Additionally, you may want to weigh different samples more or less heavily.
Unfortunately, although I managed to get it to work, this solution was unbearably slow. Furthermore, for maintainability reasons, it would just be easier if you could use the sklearn
API (not to say that you couldn’t wrap your autograd
training into the sklearn
format).
It was time to head back to the drawing board.
I got lucky, and things lined up perfectly.
While researching for ways to use sklearn
packages to solve my issue, I also came across sklearn.SGDRegressor, but that only allows the following loss functions:
squared_error
: OLS
huber
: wherein errors below some $\epsilon$ are treated as a linear loss, while errors above that $\epsilon$ use the squared loss.
epsilon_insensitive
: ignores errors less than $\epsilon$ and is linear when greater than that
squared_epsilon_insensitive
: is epsilon_insensitive
but quadratic instead of linear.
Looking at the Wikipedia page for MAPE, one might notice that it resembles the formula for MAE
\[\text{MAPE} = \frac{1}{n}\sum_i \left|\frac{Y_i - \hat{Y}_i}{Y_i}\right|\] \[\text{MAE} = \frac{1}{n}\sum_i |Y_i - \hat{Y}_i|\]so this means that I just need to find an MAE
implementation.
By pure chance, I found Sklearn-mathematical formulation of SGD losses, and I decided to read it.
epsilon_insensitive
loss ignores errors less than $\epsilon$ and is linear when greater than that
was the description for one of the losses. However, it wasn’t apparent to me that they would also take the absolute error. Only after reading the contents in the link above, did I realize what it meant:
\[L(Y, \hat{Y}) = \max(0, |Y - \hat{Y}| - \epsilon)\]This means that if we set $\epsilon$ to 0, we get the form we want!
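We can verify the claim numerically: with $\epsilon = 0$, the per-sample epsilon-insensitive loss reduces exactly to the absolute error.

```python
import numpy as np

def eps_insensitive(y, y_hat, eps):
    # per-sample epsilon-insensitive loss: max(0, |y - y_hat| - eps)
    return np.maximum(0.0, np.abs(y - y_hat) - eps)

y = np.array([1.0, 2.0, 5.0])
y_hat = np.array([1.5, 1.0, 7.0])
# with eps = 0 this is exactly the absolute error
assert np.allclose(eps_insensitive(y, y_hat, 0.0), np.abs(y - y_hat))
```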
For completeness, I list out the equation as I used it.
from sklearn.linear_model import SGDRegressor
Y = ...  # Our labels
X = ...  # My forecast data
denominator = 1 / Y  # safe: we ensured no label is 0
# Scaling
scaled_Y = Y * denominator  # all ones
scaled_X = X * denominator
model = SGDRegressor(loss="epsilon_insensitive", epsilon=0)
model.fit(scaled_X, scaled_Y)
Although we managed to make autograd
and sklearn
work for my problem, the results were still not good. I suppose that the takeaway from this is that you can do everything “right” and still not have things turn out your way.
In hindsight, this was a simple problem, but it was a good reminder of what it takes to be a good machine learning engineer: good software and math skills. I needed to set up minor infrastructure, massage data via a pipeline, and work out the autograd
package, so being able to code was imperative. In addition, I needed to understand the math to come to the solution I did.
Please know that I am not blowing my own horn; in fact, I’m embarrassed about how long I took to find the solution. And even then, I stumbled backward into the solution.
Thank you for taking the time to read this, and happy holidays!
If you've already taken Calculus or Linear Algebra, feel free to skip ahead to the next tutorial, Hessians and Jacobians
The equation below describes both the equation of a straight line as well as what happens if you take the derivative of that straight line with respect to some input value:
\[\begin{align*} y &= mx + c\\ \frac{d y}{dx} &= m \end{align*}\]Typically in a calculus class, we'd talk about the rate of change of $y$ with regards to $x$. In other words, how much does $y$ change as $x$ changes? In this case, we see that $y$ changes by a factor of $m$ for every unit that $x$ changes. For the moment, we are focused on scalar values, but this concept will generalize to vectors and matrices (which segues us into….)
Math often deals with the concept of abstraction. For example, we often deal with numbers, e.g., 5 or 100. In Linear Algebra, we are concerned with collections of numbers (vectors), e.g., a collection of (5, 10), or a collection of those collections (matrices), and further abstractions. To make this notion concrete, consider the following example:
Edit: I have no idea if the following examples describe actual streets and avenues, so I’d like to apologize beforehand.
Say that we were somewhere in New York City, which works on a grid system. If I were on 4th and 5th, while you were on 10th and 7th, our (x, y) coordinates could be described as (4, 5) and (10, 7), respectively. Equivalently, our coordinates could be described as the following:
\[\text{My location:=} \begin{pmatrix} 4\\ 5\\ \end{pmatrix}\]and
\[\text{Your location:=} \begin{pmatrix} 10\\ 7\\ \end{pmatrix}\]We decide to meet for coffee, but since neither of us drives, we agree to meet in the middle as that is easiest. So, we would meet at:
\[x := \frac{4 + 10}{2} = 7\] \[y := \frac{5 + 7}{2} = 6\]which corresponds to 7th and 6th (7, 6).
We saw in the computation above that it can be tedious to write out both equations to describe our (x, y) position. This complexity only grows as we add more locations, e.g., what shop; what if we had a compact way of representing my location, your location, and the operation of averaging to determine where we should meet? Here I want to keep two concepts in the back of your mind:
1) The concept of abstraction on scalars.
2) The concept of a coordinate system and what it means for something to be in the coordinate system.
At the start of this Linear Algebra review, I said that Linear Algebra is concerned with numbers or collections. So far, we have already discussed one such collection: a coordinate system. In that case, my location is described as the collection of (4, 5), and yours is represented by (10, 7). The top element (4 and 10) represents the street, and the bottom represents the avenue.
Congratulations! We've just worked through the concept of a vector, albeit in a particular setting: New York streets and avenues. Let's take a step back and see our locations for what they are: specific instances of an abstract concept. We could just as well write:
\[X_1:= \begin{pmatrix} a\\ b\\ \end{pmatrix}\] \[X_2 := \begin{pmatrix} c\\ d\\ \end{pmatrix}\]where $X_1$ CAN represent my street-avenue, but it could just as well describe my latitude-longitude or my age-height. Whatever the case, if we are then looking for the average of these two containers, $X_1$ and $X_2$, we can represent them as the following:
\[\text{the middle} := \frac{X_1 + X_2}{2}\]This equation holds for both the street number and the avenue (our x and y coordinates).
Note, we can add more information, e.g., a Z
coordinate, which represents the shop number to meet at, or the corner I’m on, but we do not need to change anything. Our “middle” can still be represented by the same general equation above.
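The point generalizes directly in code: the same "middle" expression works unchanged whether each location has two coordinates or three (the third coordinate below is a made-up shop number, for illustration).

```python
import numpy as np

# two coordinates: (street, avenue)
me_2d, you_2d = np.array([4, 5]), np.array([10, 7])
assert ((me_2d + you_2d) / 2).tolist() == [7.0, 6.0]  # 7th and 6th

# three coordinates: (street, avenue, shop) -- same equation, no changes
me_3d, you_3d = np.array([4, 5, 6]), np.array([10, 7, 2])
assert ((me_3d + you_3d) / 2).tolist() == [7.0, 6.0, 4.0]
```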
We can then expand on our scalars and vectors to a collection of collections. Say we had two other friends, all our locations could be described as
\[\text{Us := } \begin{pmatrix} 4 & 6 & 10 & 12\\ 5 & 7 & 7 & 15\\ \end{pmatrix}\]which would be a matrix. Phew, that was a mouthful.
When we first introduced the idea of vectors, we discussed it in the sense of streets and avenues on New York’s grid system. In that case, our locations would be described by whole numbers (we can’t be at avenue 10.5).
\[\text{My location: } \begin{pmatrix} 4\\ 5\\ \end{pmatrix}\]However, if we consider latitude and longitude, it makes sense that we can describe those numbers as numbers with some decimal point. For example, this random location I picked in New York has a latitude-longitude of (40.712776, -74.005974).
The first example, street-avenue, pertains to the Natural numbers. We say that the street and the avenue, individual elements of our collection, exist in $\mathbb{N}$, the natural numbers (also known as the counting numbers).
In the case of latitude-longitude, the individual elements of our collection exist in $\mathbb{R}$, the real numbers (have a decimal space). We denote these scalar values as elements of the sets of $\in \mathbb{N}$ and $\in \mathbb{R}$ respectively.
If we talked about the collection, as opposed to elements within the collection, my street-avenue would then be:
\[\text{My location: } \begin{pmatrix} 4\\ 5\\ \end{pmatrix}\]such that my location can be described as being in the naturals, $\in \mathbb{N}^2$, a vector of natural numbers. My latitude, longitude can be described as $\in \mathbb{R}^2$, a vector of real numbers. If we then added another number, e.g., the shop that I’m in, we would then have
\[\text{My location: } \begin{pmatrix} 4\\ 5\\ 6 \\ \end{pmatrix}\]and my location can thus be represented as $\text{my location } \in \mathbb{N}^3$. This same concept extends to matrices. Consider our group of friends from earlier:
\[\text{Us: } \begin{pmatrix} 4 & 6 & 10 & 12\\ 5 & 7 & 7 & 15\\ \end{pmatrix}\]Our locations can then be described as $\text{Us} \in \mathbb{N}^{2 \times 4}$. And that's it for the linear algebra you'll need for the rest of this series!
1) Zico Kolter's Linear Algebra Review and Reference - a great professor at CMU, and I found this guide to be handy.
I love Pusheen, and I’m also a fan of playing around in my terminal. After talking to someone the other day, I was inspired to work on this; she mentioned how an officemate commented on the Pusheen that popped up whenever she opened her shell.
I didn’t use any statistics other than the standard deviation for a small portion of the image segmentation (cat v. background). Having said that, I think that this is a fun exercise to occupy my time.
A quick Google search revealed about 3 Pusheen ASCII art images online, which is disappointing given how many Pusheen images and GIFs there are. After a long week at work and some climbing earlier today, I’m ready to spend this Friday night in. So, it looks like I’m making a Pusheen ASCII art converter and some shell scripts. Also, Pusheen sounds like pushin’, which opens up several cute GitHub project names.
1) Create a folder wherein we will store many Pusheen images.
2) Load, resize, and convert those images to ASCII art.
3) Make some shell scripts
}}
}|))|) )) |
) ) ) xX }
| uhMMoQ )} }| Q#WWWk |}}))||||||||||||||)}}
/ O&8oaW%h d%Whbo%Mc )|)}
/ w%Wdpdpo%MY|/)/jxo%*pdbpb8&0XQZwdbkhaaaaaaakbpZCj |)
}}}}} ) m%Mpbbbbpk8&WWWWW&8apbbbbddW8W&WWWM8888888W#88888W#hZ/ |)
} J8Wpbbbbbbpa8#o8*o8opbbbbbbddkbdpqqwhMWWWW#dp#WMWW&8W&&oQ |
vCLCUzrtW8bdbbbddbbbabbhbbakbbbdbbbbbddbhao**M#ooabbbk#WWWMapdaW8#Q ) }||||}
) b&WMMMMW%adbbbdbbdbdpppahppdbbdbddbbbd#&WWMM##hpddbbbddkhkbdbbdpkW%a/ ) ) }
rUOmd%Wpbbbd#88*ddo&o%&*&hddM88adbbdoM****oohbbbbbbbbddddbbbbbdpa%WJ ) | vdoadv }
) wWWWWM&%hdbbbd*88*dbkMB&8B#bbdM8&adbbdoWMMMMMW#dbbbbbbbbbbbbbbbbbbdd&8C ) }/ m8%88%&U})
} CqLc)r&Wpbbbbbdbbdbbdd*WMopbbbdbddbbbbdpppppppdbbbbbbbbbbbbbbbbbbbbdd8&u|/ p%W#WWWBZ |
qBadbbbbbbbbbbbbbpdddbbbbbdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbdkBa ra%Wddko%Mt)}
}))t #%bdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbpM%wh&%&WWok&&U )
} x8Wpbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbda888ob*WW&%WY }
| Z%*pbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbdbMWWhqdM%8h} |
/ bBhdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbd#W&&W&WhY })
/ h%kdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbp#%MobO) |
/ o%kdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbpM8r |}
/ o%kdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbd&&t/|)
/ k%hdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb8# |
) Z%*pbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbdaBb /
}}j&&dbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbpM%Y )
/ kBhdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbdk%o |
})rW8ddbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbp&8z}}
) C8&ddbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbpM%Z )
) Q&8hpdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbpdW%w |
| u*%Whppdbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbdpd*8WQ |
) 0*8&obbdpppppppppdbdpppppppppppppppppdbdppppppppppdbb*&&b |
}| YkB#waWWWWWWWWMopk#MMMMMMMMWWWWWWMMophMMWWWWWWWW*qaBa| )
)/ &8*%Wwqqqppp&%kMBhdbdddddddddddd88bW%bdppqqqwMBoMBO j
))Jh*ku LMWWk mMWWp ch##p|)
|)|||)| j )|||||||||||||| x /|||||)|
}|||} }) | )} ) }|)||
} }
looks FAR better than
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$@$$$@@M*#oa@@@@@@@$@@$$$@@$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$@$BB@@$$@@$W*#q**o@$$$$$$$$@8MMW%$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$@@8b*#&B@$$WaWppw##W8B&B@@@B&M*aMa8$@$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$@B&###*%%aMbqppq#&*MMWM&M&#hpmh#*@@@@@$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$@%8BBBW##MWkqdppqk*d*MkW#aapqpppM#B$$$$$@$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$@@@Wo**#W8hMMdqpwqpppqpdpqbpqqppppwooM@%%B$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$@$B8%&W#*W*qppb#opppqqpppqppqpppqpk&*W#o*#%@$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$@a*#qppqa%&pqwh*qppppbkppqh#**&oWW&W%$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$@@@@o*#wppppqpqpo#MWppppp&Boqpdpqm*oW$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$@@@**#qppqwqbao*aqc*Mqppphodpppo###&#W8@$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$@$$WaMwqqpko**kkbZuud#wpppqwpppppdddoW*M#8$@$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$@Wo&pdao*od0uh#aWhjZMppppppppppppppwMoW%@$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$@@@*o#oM#hbZufUX#MoMpcCMdqppppppppppppqk#a@$@$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$@$$M*&hoMd*o*#wXUYwqCzCu*opqqpppppppppppw#oW$@$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$B8#*dddYh&hkWonJUrOwUXz#M*#aqppppppppppqdMh%@@$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$@@&##&kQUnvuqaokLUJXo#oMmvMoqkopppppppppppko&#%$@$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$@$@WMwqa***akwJ/ uXUUM#a&q/d8#bwppppppppppdMW&M#@$@$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$@@@##hqZOOwdko#ohdZLcXpbQzcpMqdppppppppppppo#MWoB$@$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$@$B&##W*okqwZZpba**oabqZ0zbMwpppppppppppppqpq#MB$@$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$W#oph*###*hbwOZqdka**oMWppppppppppppppppmo##@@$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$@B8hwqwqppka*##obpqZ0Om0M*wpppppppppppppk*MWM@$$$$@@$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$&Whqpppppqqwqba**##*oo*#bqppppppppppppp#WW&8@$@@$$$@$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$&Wkqpppppppppqqwqqdkhkkqqppppppppppppppk**Wo%$$$@BB$@$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$8Wkqpppppppppppppppqqqqppppppppppppppppqqq#*%$@&WW#&$@$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$8WkqppppppppppppppppppppppppppppppppppppppM*%$Mh&W&*%@@$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$&WaqppppppppppppppppppppppppppppppppppppqbWW@@#Wka&*%$@$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$#**wppppppppppppppppppppppppppppppppppppwo##@#MWdooo$@$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$@$%aMqpppppppppppppppppppppppppppppppppppppMM&##M#&Wh%@@$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$@@**omppppppppppppppppppppppppppppppppppqo&MMW#maWb%$@$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$@$@oMawppppppppppppppppppppppppppppppppwkWWah&M*#o8$@$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$@$B*#odqqpppppppppppppppppppppppppppqqa#MMoo##&8@$@$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$@$@WW#okqpwqqqqqqqqqqqqqqqqqqqqqpqph*#MW&8&&%@$$@$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$@$$B8MM*p#aaaaaaaaaaaahhhahhhokp**#8%$$$$$$$$@@$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$@@$$@####o**##M*M###MW&WM&&W*M**o%@$@@$$@@@$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$@@$%W*MB@@@@@@@@@@@@@@@@@@M*##B$@@$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$@$$@$$$$$$$$$$$$$$$$$$$$$@@$$@$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$@@$@@@@@@@@@@@@@@$@@$$@@$$@@$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$@$$$$$$$$$$$$$$$$$$$$$@$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
this.
We can apply some heuristics to clear out the background. One heuristic is that Pusheen is typically at the center of the image, which means that we can probably use the corners to act as a threshold to remove the background.
In the real world, you’d probably want to 0 out everything non-Pusheen, but since this image will be piped to the terminal, it helps to contrast with the non-empty characters around the image.
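A sketch of the corner-threshold heuristic (the margin and tolerance values here are made up for illustration): sample the four corners, treat pixels near the corner mean as background, and push them to white so the cat contrasts with its surroundings in the terminal.

```python
import numpy as np

def remove_background(img, margin=5, tol=10):
    # use the four corners of the image as a background estimate
    corners = np.concatenate([
        img[:margin, :margin].ravel(), img[:margin, -margin:].ravel(),
        img[-margin:, :margin].ravel(), img[-margin:, -margin:].ravel(),
    ])
    threshold = corners.mean()
    out = img.copy()
    # pixels close to the corner mean are treated as background -> white
    out[np.abs(img - threshold) < tol] = 255
    return out

img = np.full((50, 50), 200)   # uniform grey background
img[20:30, 20:30] = 50         # "Pusheen" in the center
cleaned = remove_background(img)
assert cleaned[0, 0] == 255 and cleaned[25, 25] == 50
```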
1) We need to add a background (and after we went through all that trouble to get rid of it….)
We are using img.max()
to scale our image, so one hacky solution is to use the max value and scale it by some percentage.
2) Because we chose to scale the image before changing the background, our chosen parameters are all wonky. We can simply swap the order of operations: change the background first, then scale the image.
3) However, we now have to contend with scenarios where the background is black or white. We simplify the problem by checking if the image is below some “sensible” threshold, and if it is, we set it to some percentage of the max.
Special Thanks
* _{ ASCII converter 1 for providing me with a starting point for code, and ASCII converter 2 for providing a more detailed ‘gradient’ of colors for Pusheen to exist in. Both were extremely useful in providing a starting point for the ASCII art converter }
* _{ Frolian - flothesof for making me realize that OpenCV is for lazy people (lazy people who happen to be able to figure out how to install it ¯\(ツ)/¯) }
For each of the topics covered, Jacobian and Hessian, I try to provide 3 levels of information: a high level, a mid-level, and a low level for you to review, depending on your level of interest.
0) $x$ describes a scalar value, $\vec{x}$ describes a vector, and $X$ describes a matrix.
1) Vector-valued function is a function that returns a vector.
2) Matrix-valued function is a function that returns a matrix.
3) Tensor: a scalar value is a 0-order tensor, a vector is a 1-order tensor, and a matrix is a 2-order tensor. For the purpose of most Machine Learning applications, a tensor is just this same idea extended to n-th order (more abstraction). We’ll come back to this idea later when considering the not-yet-defined Jacobian and Hessian.
4) $\mathbb{R}^n$: basically means a point in n-dimensional space. For example, if you drew a Cartesian map, any point you pick has an (x,y) coordinate that describes it. Thus, we can say that the point exists in $\mathbb{R}^2$. If you restrict the points to taking on “whole numbers” (aka Natural numbers, or counting numbers), you can say that it exists in $\mathbb{N}^2$.
This section is a little awkward as it’s not covered in Calculus 101; however, discussing it is extremely important before broaching the rest of the blog post. A partial derivative is basically a derivative of a “part” of a multivariable function, i.e., we take the derivative along a single dimension while keeping all others constant.
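As a quick sketch of that idea (the helper name `partial_derivative` is my own, not from the post), a partial derivative can be approximated numerically by nudging one input dimension while holding all the others fixed:

```python
def partial_derivative(f, x, i, eps=1e-6):
    """Approximate df/dx_i at point x by central differences,
    holding every other dimension of x constant."""
    x_plus = list(x)
    x_minus = list(x)
    x_plus[i] += eps
    x_minus[i] -= eps
    return (f(x_plus) - f(x_minus)) / (2 * eps)

# f(x, y) = x^2 + 3y: df/dx at (2, 5) is 4, and df/dy is 3.
f = lambda v: v[0] ** 2 + 3 * v[1]
print(partial_derivative(f, [2.0, 5.0], 0))  # ≈ 4.0
print(partial_derivative(f, [2.0, 5.0], 1))  # ≈ 3.0
```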
A Jacobian and a Hessian are just the first- and second-order derivatives of multivariate functions, i.e., differentiation applied once and twice, respectively.
The Jacobian is, in essence, the first derivative of a vector-valued function. We begin with the following example: given a single point of data about you (age, height, favorite food), we want to find out how likely it is that you’re in certain clubs (reading, sleeping); we’ll reference this problem while discussing the Hessian as well.
The Jacobian describes how changing each of the input dimensions affects each of the output dimensions. Looking at our example, if we change our age, height, or favorite food, we can observe the (locally) linear effect on our club likelihoods.
Given our 3 input dimensions and our 2 output dimensions, we’d have 6 pairs to look at (3 possible things to manipulate for each of those 2 outputs). This intuition will come in handy if you read on.
Consider our example from earlier:
1) Your input data, $\vec{x} \in \mathbb{R}^{3}$.
2) You have some weight matrix, W $\in \mathbb{R}^{2 \times 3}$, so that the product $W\vec{x}$ is defined.
3) Your output, $\vec{y} \in \mathbb{R}^2$.
4) If we were to model this with some classifier algorithm, defined by some function, $f$, it would look like this:
\[\vec{y} = f(\vec{x}) = W\vec{x}\]5) If we calculated the Jacobian of this, it would look along the lines of
\[\textbf{J}(f) = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \frac{\partial y_1}{\partial x_3}\\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_2}{\partial x_3}\\ \end{pmatrix}\]where we’re iterating through each dimension of $\vec{x}$ (3 of them) and $\vec{y}$ (2 of them). This can be expressed more compactly as:
\[\textbf{J}(f) = \begin{pmatrix} \frac{\partial \vec{y}}{\partial x_1} & \frac{\partial \vec{y}}{\partial x_2} & \frac{\partial \vec{y}}{\partial x_3}\\ \end{pmatrix}\]where each column is the partial derivative of the whole output vector $\vec{y}$ with respect to one input dimension.
If we think of our inputs as a point lying in some n-dimensional space, we can think of our weights as some linear transformation, $f$, that takes us from our current point $\vec{x} \rightarrow \vec{y}$. What the Jacobian then gives us is the best $\textbf{local linear approximation}$ of how the points are warped in that small area.
Remember, our vector of features, $\vec{x} \in \mathbb{R}^{3}$, is just some point in n-dimensional space. If we take the ‘rate of change’ of the transformation (W, in our concrete case) that maps it into an m-dimensional space, we get the amount of linear transformation in the small region around the point.
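To make this concrete, here is a small numpy sketch (the helper name `jacobian` and the sample `W` are my own, not from the post) that approximates the Jacobian by finite differences. For a linear map $f(\vec{x}) = W\vec{x}$ with $W \in \mathbb{R}^{2 \times 3}$, the Jacobian is just $W$ itself, everywhere:

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f: R^n -> R^m at point x.
    Column j holds the partial derivative of f with respect to x_j."""
    n = len(x)
    cols = []
    for j in range(n):
        dx = np.zeros(n)
        dx[j] = eps
        cols.append((f(x + dx) - f(x - dx)) / (2 * eps))
    return np.stack(cols, axis=1)

# For the linear classifier f(x) = Wx, the local linear approximation
# is exact: the Jacobian equals W at every point.
W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
f = lambda x: W @ x
J = jacobian(f, np.array([0.3, -1.2, 2.0]))
print(np.allclose(J, W))  # True
```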
1) Loss function:
By imposing some restrictions on the neighborhood around a point, we can make the values invariant (or nearly so) to small changes in that area. If the explanation sounds a little hand-wavy and you’d like a concrete example, check out Hugo Larochelle’s Contractive Autoencoder video.
2) Discussions about local linearity in non-linear settings:
Neural Networks are known to be non-convex, but analyzing them from a linear standpoint can still be useful. I’d suggest watching the video above for an example of how it can be beneficial to analyze in this way.
Note: I’m hoping to talk about convexity down the line as it is a fascinating topic.
The Hessian is essentially the derivative of the Jacobian.
In Calculus 1, you might have learned that the derivative describes the rate of change, and the second derivative describes the curvature, which lets you classify maxima and minima. The Hessian is the equivalent of that concept but applied to N-dimensional tensors in an abstract sense.
Recall our Jacobian function from earlier:
\[\textbf{J} (f) = \nabla f = \begin{pmatrix} \frac{\partial \vec{y}}{\partial x_1} & \frac{\partial \vec{y}}{\partial x_2} & \frac{\partial \vec{y}}{\partial x_3}\\ \end{pmatrix}\]If we then take the Jacobian of THAT, we end up with the following:
\[\textbf{J}(\textbf{J} (f)) = \nabla (\nabla f) = \begin{pmatrix} \frac{\partial^2 \vec{y}}{\partial x_1^2} & \frac{\partial^2 \vec{y}}{\partial x_2 \partial x_1 } & \frac{\partial^2 \vec{y}}{\partial x_3 \partial x_1}\\ \frac{\partial^2 \vec{y}}{\partial x_1 \partial x_2 } & \frac{\partial^2 \vec{y}}{\partial x_2^2} & \frac{\partial^2 \vec{y}}{\partial x_3 \partial x_2 }\\ \frac{\partial^2 \vec{y}}{\partial x_1 \partial x_3} & \frac{\partial^2 \vec{y}}{\partial x_2 \partial x_3 } & \frac{\partial^2 \vec{y}}{\partial x_3^2}\\ \end{pmatrix}\]An interesting tidbit that the eagle-eyed among you may have noticed is that we went up in dimensions, from a compact vector representation to a compact matrix representation. Intuitively, this makes sense, as we are now varying our variables with respect to our variables (hence denominators like $\partial x_1 \partial x_2$).
Bear with me for a bit. If we were to expand out our $\vec{y}$ into its components ($y_1, y_2$), we’d need another axis to put them on. So, our Hessian from above would need to “expand” into another dimension to store them. Still with me? I hope so because if you are, you’ll understand why:
1) I’m not going to actually list out the ‘tensor.’
2) I’ll call the ‘expanded’ version a 3-order tensor
3) When we differentiate a vector with regards to a vector, we increase dimensionality. See Old and New Matrix Algebra Useful for Statistics for a summary of the different forms of differentiation. Also, Wikipedia: Matrix Calculus Layout Conventions has some interesting notes.
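As a sanity check on the “derivative of the Jacobian” idea, here is a hypothetical finite-difference sketch (the helper name and step size are my own) for the Hessian of a scalar-valued function, where the second derivative along each pair of input directions fills one entry of the matrix:

```python
import numpy as np

def hessian(f, x, eps=1e-4):
    """Finite-difference Hessian of a scalar-valued f at x.
    Entry (i, j) approximates d^2 f / (dx_i dx_j)."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = eps, eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps ** 2)
    return H

# f(x) = x1^2 + 3*x1*x2: its Hessian is [[2, 3], [3, 0]] everywhere,
# and it is symmetric, as the mixed partials above suggest.
f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]
print(hessian(f, np.array([1.0, 2.0])).round(3))
```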
Positive Semi-definite: if A is your matrix, then for any $\vec{x}$, \(\vec{x}^T A \vec{x} \geq 0\)
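A quick way to check this property numerically (a sketch using numpy’s `eigvalsh`; the helper name `is_psd` is my own) relies on the fact that a symmetric matrix is positive semi-definite exactly when all of its eigenvalues are non-negative:

```python
import numpy as np

def is_psd(A, tol=1e-10):
    """A symmetric matrix is positive semi-definite iff all of its
    eigenvalues are non-negative (equivalently, x^T A x >= 0 for all x)."""
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

print(is_psd(np.array([[2.0, 0.0], [0.0, 1.0]])))   # True
print(is_psd(np.array([[1.0, 0.0], [0.0, -1.0]])))  # False
```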
If we calculate the Hessian of our loss function (I’d suggest working through one of the proofs online), we see that it is positive semi-definite, which means the loss is convex. Convexity guarantees that any minimum we find is a global minimum; actually reaching it with a method like gradient descent is another matter.
One example of a convex loss function is the logistic loss: because we know it is bowl-shaped (convex), we know that any minimum we find is a global minimum.
Another application is the Observed Information matrix, where we look at the negative Hessian of the log-likelihood function.
I’m not going deep into the details, but if we have some estimated parameters $\theta$ (also called the weights in ML), one way of evaluating how well $\theta$ fits our data is by first taking the log-likelihood:
\[\mathcal{L} (X_1, X_2, \ldots, X_n \mid \theta) = \sum_{i=1}^{n} \log f(X_{i} \mid \theta)\]Taking the negative Hessian of our log-likelihood tells us how our loss varies as we manipulate different parameters.
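As a toy illustration of that formula (assuming a Gaussian density for $f$; the function name and data are mine, not from the post), the log-likelihood is just the sum of per-point log-densities:

```python
import math

def log_likelihood(data, mu, sigma):
    """Sum of per-point Gaussian log-densities log f(X_i | theta),
    with theta = (mu, sigma)."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )

data = [1.0, 2.0, 3.0]
# The likelihood of this sample is higher at the sample mean (mu = 2)
# than at a poorly fitting parameter (mu = 0).
print(log_likelihood(data, 2.0, 1.0) > log_likelihood(data, 0.0, 1.0))  # True
```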
If we know the curvature of the surface, it can guide our gradient descent. Some papers use the diagonal of the Hessian to estimate the optimal learning rate, as mentioned in A Robust Adaptive Stochastic Gradient Method for Deep Learning.
Intuition
Admittedly, I’ve not read the paper above in depth, and I’ve not read the papers it references at all. Still, I’d wager that utilizing the diagonal of the Hessian lets them weight the importance of the different features during gradient descent. I say this because not all features are equally informative, so it doesn’t make sense to treat them equally (especially since your error is typically just a scalar value that you propagate backward). I may be completely wrong, but this example stresses 2 things:
i) read the paper
ii) intuition is only helpful so long as it’s right, so it falls to you to make sure you’re correct.
1) Matrix Cookbook - Page 8-16 - I’m personally not a fan of recommending this off the bat, as I think a collection of facts isn’t useful in itself except as a reference.
2) Zico Kolter’s Linear Algebra Review and Reference - great professor at CMU, and I found this guide to be very useful.
3) Old and New Matrix Algebra Useful for Statistics
A learning resource for myself. I believe that teaching others a new concept is a fantastic way to poke holes in your own understanding of things.
A learning resource for others. I’m not a fan of most Medium articles, and unfortunately, it seems like the ones recommended to me are always the low-quality ones that just clog up my searches. Here’s hoping that this blog one day becomes helpful to you, my dear reader.
A learning resource for myself
Whenever I learn new things, I try to break down ideas into concepts I already know. I’ve always found it enlightening to approach new concepts by looking at them through the lens of what I already know. Often, this fresh reframing forces me to see things from a different point of view or review concepts to see if I had missed anything.
Overall, my goals are:
1) Conveying the motivation behind some topics (likely related to Machine Learning).
2) Sharing my thoughts and simplified ideas as I try to understand the topic(s), tying them to existing concepts I am familiar with.
3) Getting feedback about whether I’m correct in what I’ve said, assuming Cunningham’s Law still holds true for blog posts?
4) Reminding future me that it’s okay to say, “I don’t know.”, and that the most important thing is to never stop learning.
Special Thanks
_{This blog, and honestly this whole website, were kicked off by a discussion with Ericmjl, someone I opened an issue with on GitHub, and I’d like to thank him very much. I’d highly recommend heading to his website to listen to his talks or just to learn from him, as he comes across as an excellent teacher.}
_{Special thanks to http://jmcglone.com/ for posting an amazing tutorial that basically walked me through the process of setting up a website.}