Burning Diffuse from Both Ends

Part one of a series of unknown length

If you stumbled onto this article and expect to learn something of AI in a traditional fashion, you might spend your time better elsewhere (even if that’s another article on this blog).

If I’m going to bother writing and posting a blog item, I usually want it to be worth reading for everyone within a given audience, to provide some (I think) interesting perspective or observation. This post doesn’t do that. It isn’t deeply contemplative, doesn’t teach anything directly practical, and doesn’t even offer a thoughtful opinion. It simply records something I’ve done: one step in a very long process.

In the long process of learning AI, I am trying to “meet myself in the middle,” as it were. Studying the fundamentals of ML and data science is critical, but it is a far cry from the day-to-day growth of the AI industry. And working with practical AI tools may be both fun and productive, but it does little to explain the inner workings. To understand as well, and as practically, as possible, I am hoping to keep working my way up from the fundamentals and chipping my way down from the heights, until I have as complete an understanding as I can manage without obsessing over it to the exclusion of all else, and without being a math whiz.

I’m writing this post mainly for my own edification and reference, but I’m publishing it on the chance it proves useful to someone else. In any case, documenting your journey aids personal accountability, or so they say.

And if you have interest in seeing this thing through, then I apologize in advance for the lack of brevity inherent both to my style and to the medium — but not for the process of learning, which is best when slightly ponderous, so long as the ponderousness does not hide sloppiness.

Onward.

Step 1: Get something running

Since I’m approaching AI learning from the top down and the bottom up, hoping to meet somewhere in the middle, I wanted first to get a “feel” for running one of the major AI tools locally. Stable Diffusion seemed like the obvious choice: it already has some history behind it, and there are a number of readily available approaches for getting started.

I worked mostly through the diffusers Python package using their tutorial: https://github.com/huggingface/diffusers/tree/a28acb5dcc899b363c9dd1c8642cddc9b301cd9d
(It’s since been updated somewhat; you can check that out too.)

I recorded my process as I went along (actually, that’s kind of a lie; the first time I recorded nothing, but then I switched machines, tried it again, and recorded that). The goal was to end up with a kind of “foolproof” plan for getting set up locally, one that avoided the somewhat generic and nebulous explanations often found in these tutorials, which assume, reasonably enough, that you either have a decent background in AI if you’re following along, or are just reading and not trying to implement it yourself.

Well, I’m somewhere in between all that, and I found many supposed guides difficult, at least in getting started. So I “boiled down” a few of those into something digestible and repeatable I could use on my system, and in the process I was able to ruminate a bit more on what it was I was actually doing.

I went through the process a handful of times to make sure I could nail down the exact steps. This walkthrough was recorded on a Windows 10 desktop with 32 GB of RAM, but I completed the same steps on a MacBook with 16 GB of RAM, the only difference being that I used miniconda instead of pip as the package manager.

The process I ended up with proceeds as follows:

  1. Install python (python3)
  2. Install pip
  3. Install pytorch and associated packages: pip install torch torchvision torchaudio
    • You might need to install them directly from source:
      • pip3 install --pre torch -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
      • pip3 install --pre torchvision -f https://download.pytorch.org/whl/nightly/cpu/torchvision.html
      • pip3 install --pre torchaudio -f https://download.pytorch.org/whl/nightly/cpu/torchaudio.html
    • You might also need to install additional dependencies if the output from pip tells you it didn’t resolve these.
  4. Install diffusers and related boilerplate packages: pip install diffusers transformers accelerate
  5. Set up large file support on git if it’s not already: git lfs install
  6. Clone the Stable Diffusion repo: git clone https://huggingface.co/runwayml/stable-diffusion-v1-5 (there are newer models, which may require further steps).
  7. Create a script (a sketch of the kind of file I used follows this list). Mine leveraged some of the examples in the tutorial to run locally, use the CPU instead of the GPU, and limit memory usage.
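
As a concrete sketch of step 7: this is roughly the kind of script I ended up with, reconstructed from the diffusers examples rather than copied verbatim from my own file. The local model path, prompt, and output filename here are placeholders.

```python
from diffusers import StableDiffusionPipeline

# Load the pipeline from the locally cloned repo (step 6); a Hugging Face
# model id like "runwayml/stable-diffusion-v1-5" would also work.
pipe = StableDiffusionPipeline.from_pretrained("./stable-diffusion-v1-5")

pipe = pipe.to("cpu")            # run on the CPU instead of a GPU
pipe.enable_attention_slicing()  # compute attention in slices to limit memory usage

# Generate a single image and write it to disk.
image = pipe("a watercolor painting of a lighthouse at dusk").images[0]
image.save("out.png")
```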

I found that this process was repeatable (I ran through it a couple of times) and that the actual image generation was relatively speedy, since the training had already been done: I was leveraging a pre-trained model.

I created a reference for myself, which only a couple of weeks later is probably already outdated, given the pace of things:

 

Further references and code:

Step 2: Understand diffusers

“Great, you’ve used diffusers to run Stable Diffusion yourself, and it took all of five lines of code, because everyone did the work for you!” — so I hear myself thinking. I am glad I was able to load and run everything on my machine; it’s cool, interesting, useful, and I did learn something — but can I really claim any understanding?

Well, I can dig into the usage of the diffusers library, start tweaking my code and reading the documentation, but I think before I do that, there’s some utility in actually understanding a bit more of how this thing’s built!

So, only a click or two will land you on https://huggingface.co/blog/annotated-diffusion (at least at the time of writing; I hope it sticks around at least as long as this is relevant), and you’ll learn how to implement a diffusion model yourself. Cool, actually building it from scratch! Well, not quite from scratch, since you’ll be leaning on PyTorch, NumPy, Pillow, matplotlib, and others, but if you follow step by step you’ll at least gain something akin to comprehension of the purpose of the papers that form the foundation of this thing, and have a clue what your code is doing.

It’s something I had to struggle to keep in mind while working through this breakdown, and which I think afflicts us easily when we stretch ourselves on a new endeavor — the goal is not total and unerring comprehension. You’re not aiming to be an expert; and even if you are, eventually, you’re not aiming to be one right now, at the conclusion of this single tutorial — that would be foolhardy indeed.

As one of my compsci professors once told me, and it stuck: “If you understand 50% of a paper, that’s good enough.”

Few can hope to be functional experts in more than one niche of an industry, and even if you have a solid comprehension of the breadth of an area of research, that won’t help you in understanding the details of any particular branch which you’ve never spent significant time studying, and which isn’t your main focus.

But that’s not a problem: reading comprehension in research is its own skill. Look at a formula you’ve never seen before, representing a concept you’re encountering for the first time, and try to get what it’s trying to say rather than the exact mathematical construction and the purpose of each variable and function. If you recognize an inequality, pick out a ratio, and note that as it increases, something else scales logarithmically, then you’re following. If you can scan the methodology section of a paper and understand what the main pitfalls are and where those problems come from, then you don’t need to be able to replicate, or even understand, each step yourself in order to glean useful comprehension from the paper as a whole.

The takeaway, for this and other purposes like it, is that you shouldn’t, in most cases, be trying to do the author’s work for them, but rather be trying to replicate the comprehension process that got them there. Just as when you learn math or physics, no one expects you to invent the formula for Newton’s law of cooling on your own; but as you learn it, you become able to derive it, you follow the reasoning, and in doing so you understand the process and the purpose.
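
To make the cooling example concrete: the law is just the statement that an object cools in proportion to how much warmer it is than its surroundings, and the familiar exponential form falls out of solving that one-line differential equation (with $T_0$ the starting temperature and $k$ a cooling constant):

```latex
\frac{dT}{dt} = -k\,\bigl(T - T_{\mathrm{env}}\bigr)
\quad\Longrightarrow\quad
T(t) = T_{\mathrm{env}} + \bigl(T_0 - T_{\mathrm{env}}\bigr)\,e^{-kt}
```

Following that derivation once is a very different thing from being asked to produce the law from nothing, and it’s the former kind of understanding I’m after here.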

I found great challenge in this particular tutorial / code walkthrough because my math skills are just below the level needed to totally grasp the nuances of what makes the denoising methodology they implement work so darn well. I could “get the idea,” but I couldn’t track the details of the Gaussian noising, or even always spot the purpose of a particular line of code (Why is it necessary to rearrange the shape of the matrix here? Why couldn’t we sample it as is?). It’s not only frustrating, it feels like it defeats the purpose itself: how am I really learning anything if I can’t reproduce it myself? I almost stopped twice partway through.
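
For what it’s worth, the piece that eventually clicked for me was the forward “noising” step itself. Here is a toy sketch of it as I understood it from the tutorial; the variable names and the simple linear schedule are my own simplification, not the tutorial’s code verbatim. The shape-rearranging that confused me turns out to be reshaping the per-timestep coefficients so they broadcast across every pixel of each image in the batch.

```python
import torch

timesteps = 300
betas = torch.linspace(1e-4, 0.02, timesteps)        # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # cumulative "alpha bar" terms

def q_sample(x0, t, noise):
    """Return a noisy version of x0 at timestep(s) t.

    x0:    batch of images, shape (B, C, H, W)
    t:     batch of timestep indices, shape (B,)
    noise: Gaussian noise with the same shape as x0
    """
    # Look up the coefficient for each image's timestep, then reshape to
    # (B, 1, 1, 1) so it broadcasts over channels and pixels.
    sqrt_ab = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    sqrt_one_minus_ab = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    return sqrt_ab * x0 + sqrt_one_minus_ab * noise

# Example: noise a random "image" batch at random timesteps.
x0 = torch.randn(2, 3, 64, 64)
t = torch.randint(0, timesteps, (2,))
noisy = q_sample(x0, t, torch.randn_like(x0))
```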

But then I remembered I am learning — a lot, for that matter. I’m learning what is happening even if I can’t personally reconstruct the details of why. It’s not my job to know the why — certainly at least not here and now. My job is to learn, to follow along as best I can, focus on the meaning of the code, absorb the goal of each step and the approximate method by which it’s accomplished and what intrinsically makes it work. I trust that the details that I need to know, I will know, in time. If I push it now, it’ll be that much harder to learn, and I’ll just end up hating it to boot. But if I get the sense of things now, along with the sense of accomplishment, then I’ll have a successful basis from which to learn more.

This is all pretty fundamental stuff when it comes to learning, but it’s easy to forget, especially for me, and especially when I get wrapped up in a lengthy learning process which comes to fruition only through the more hands-on process of “let’s get something working.”

I’m happy to say I did get it working, and do basically understand it — and I have three more tabs open already with further papers and tutorials that I’m looking forward to following along with, encouraged and not dispirited by what I’ve already seen, learned, and accomplished.

* * *

Well, enough of that tangent. Here was my process:

In this case I was mostly able to stick to the script. All my setup and packages were already in place from the last step, so there wasn’t nearly as much fidgeting with my environment.

The only real difference was that I ran in a command-line environment, not in jupyter or similar, so I had to make a few changes to compensate for that:

  1. Comment out %matplotlib inline (wasn’t using inline output anyway).
  2. Install torchvision directly, and add an import for it
    • pip3 install torchvision -f https://download.pytorch.org/whl/torch_stable.html
  3. Add an import for ImageShow from PIL to display images at the steps, then add lines:
    • ImageShow.show(i), or
    • i.save("out_noisy.jpg"), etc.
  4. Add show() to plots where needed
  5. Enforce caching on the dataset load, which did not happen automatically. Interestingly, once I included this flag the first time I ran it, the script would no longer run on later attempts until I turned the flag off. I can speculate about what blocked the caching and made it hang (an OS quirk?), but I don’t know; once I removed the flag, it ran smoothly and was able to draw from the cache.
  6. Lastly, I encountered the strangest difference in behavior: the script never triggered the image save step. I had to modify the code as written to use a different save counter, and also had to add a step to manually concatenate the sublists of tensor-ish images, or torch.cat would fail with an argument exception (a rough sketch of that fix follows this list). I don’t know how the structure differed from what was expected, given that I rewrote / copied most of the code verbatim, but somehow the behavior differed enough that that part needed rewriting. Maybe I did something wrong?
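
The concatenation fix in item 6, reconstructed from memory rather than copied from my actual edit, looked roughly like this; the nested samples structure is a stand-in for whatever the sampling loop actually handed back on my machine.

```python
import torch

# Stand-in for what the sampling loop returned: a list (one entry per saved
# step) of lists (one entry per image) of image tensors.
samples = [[torch.randn(3, 64, 64) for _ in range(4)] for _ in range(10)]

# torch.cat expects a flat sequence of tensors, so flatten the sublists and
# give each image a batch dimension before concatenating.
flat = [img.unsqueeze(0) for step in samples for img in step]
all_images = torch.cat(flat, dim=0)   # shape: (40, 3, 64, 64)
```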

Most of this process was trial and error, which reflects rather poorly on my skills, given that the code was offered up on a platter. A lot of the debugging relied on comparing the intended functionality against the results the code actually produced. All of it was theoretically “pointless”: in the end you’re not really building anything so much as demonstrating an implementation. But it was still a great way to learn! The abstract math made a lot more sense once I had to struggle to figure out how the tensors are transformed at each step.

Further references and code:

Side effects of some of the issues I ran into: I now have a better sense of torch, and even of parts of Python itself (especially pdb, the debugger!), than I otherwise would have if everything had proceeded smoothly. Coming from a background in Ruby and JS, and before that PHP, I’ve always found Python to be a vaguely foreign but eerily familiar tongue: reasonably easy to understand until it breaks out some completely incomprehensible line that might as well be gibberish. But that’s where the learning starts!

* * *

Well, one step farther along in this somewhat ill-defined journey. I can’t say I possess any new, concretely applicable skill, but I feel more informed as a whole, certainly more than before I started these implementations.

Onward!

Soon to come: Step 3, and who knows what else.