From 76723ea3ce1735cdc19e3ce4cece494e5af31d0b Mon Sep 17 00:00:00 2001
From: Alessandro Romeo <a.romeo@cineca.it>
Date: Fri, 23 Feb 2024 10:04:22 +0000
Subject: [PATCH] updating files

---
 2024_notes/Day 2/01_Benchmark.ipynb           |  559 --------
 .../Day 2/02_SerialCodeOptimization.ipynb     | 1155 -----------------
 2 files changed, 1714 deletions(-)
 delete mode 100644 2024_notes/Day 2/01_Benchmark.ipynb
 delete mode 100644 2024_notes/Day 2/02_SerialCodeOptimization.ipynb

diff --git a/2024_notes/Day 2/01_Benchmark.ipynb b/2024_notes/Day 2/01_Benchmark.ipynb
deleted file mode 100644
index bc5a669..0000000
--- a/2024_notes/Day 2/01_Benchmark.ipynb	
+++ /dev/null
@@ -1,559 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "id": "10fb3eb9",
-   "metadata": {},
-   "source": [
-    "# Benchmarking Julia\n",
-    "\n",
-    "1. Define the sum function\n",
-    "2. Implementations & benchmarking of __sum function__ in:  \n",
-    "    * Julia (built-in)  \n",
-    "    * Julia (hand-written)  \n",
-    "    * C (hand-written)  \n",
-    "    * Python (built-in)  \n",
-    "    * Python (NumPy)  \n",
-    "    * Python (hand-written)  \n",
-    "    \n",
-    "Consider the __sum__ function `sum(a)`, which computes\n",
-    "$$\n",
-    "\\mathrm{sum}(a) = \\sum_{i=1}^n a_i,\n",
-    "$$\n",
-    "where $n$ is the length of `a`."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "5812ea46",
-   "metadata": {},
-   "source": [
-    "## 1. Julia built-in `sum` function"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "a17adb59",
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "import Pkg; Pkg.instantiate()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "12536e98",
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "a = rand(10^7)  # 1D vector of random numbers, uniform on [0,1]"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "1c220406",
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@which sum(a)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "8a7ab7a3",
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "sum(a)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "4fcbcdbe",
-   "metadata": {},
-   "source": [
-    "The expected result is approximately $0.5 \\times 10^7$, since the mean of each entry is 0.5.  \n",
-    "Let's measure the execution time of this function with the `@time` macro:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "edd112c1",
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "?@time"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "320efc31",
-   "metadata": {},
-   "source": [
-    "So what is the performance of Julia's built-in sum? "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "3b17f7df",
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@time sum(a)  # try to repeat the execution of this cell!"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "69d7b634",
-   "metadata": {},
-   "source": [
-    "The `@time` macro can yield noisy results, so it's not our best choice for benchmarking!\n",
-    "\n",
-    "Luckily, Julia has a `BenchmarkTools.jl` package to make benchmarking easy and accurate:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "40217f68",
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "import Pkg; Pkg.add(\"BenchmarkTools\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "a916508a",
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "using BenchmarkTools"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "4efb5fcf",
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@benchmark sum(a)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "3bfc362f",
-   "metadata": {},
-   "source": [
-    "If the expression to benchmark depends on external variables, use `$` to \"interpolate\" them into the benchmark expression; this avoids the problems of benchmarking with globals. Any interpolated variable `$x` or expression `$(...)` is \"pre-computed\" before benchmarking begins. The same rule applies to both `@benchmark` and `@btime`."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "bcc80f47",
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "@benchmark sum($a)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "5b8abfde",
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "x = 1\n",
-    "@btime (y = 0; for _ in 1:10^6 y += x end; y)\n",
-    "@btime (y = 0; for _ in 1:10^6 y += $x end; y)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "777708f1",
-   "metadata": {},
-   "source": [
-    "We have now seen the performance of Julia's built-in `sum` function. Let's save the minimum time (in milliseconds) in a dictionary:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "ac55aa78",
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "j_bench = @benchmark sum($a)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "ddf52068",
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "d = Dict()\n",
-    "d[\"Julia built-in\"] = minimum(j_bench.times) / 1e6\n",
-    "d"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "2af9911a",
-   "metadata": {},
-   "source": [
-    "But the built-in could be using any number of tricks to be fast, including not being written in Julia at all! In fact it is written in Julia, but how would a naive implementation of our own perform?  \n",
-    "\n",
-    "\n",
-    "## 2. DIY Julia `sum` function"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "05bdb0a4",
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "FIXME mysum(A)\n",
-    "    s = 0.0\n",
-    "    for FIXME\n",
-    "        s += a\n",
-    "    \n",
-    "    return FIXME"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "7ef98bdf",
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "TO HIDE\n",
-    "\n",
-    "function mysum(A)\n",
-    "    s = 0.0\n",
-    "    for a in A\n",
-    "        s += a\n",
-    "    end\n",
-    "    return s\n",
-    "end"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "22454cc3",
-   "metadata": {
-    "collapsed": true
-   },
-   "outputs": [],
-   "source": [
-    "j_bench_hand = @benchmark mysum($a)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "d269e7d2",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "d[\"Julia hand-written\"] = minimum(j_bench_hand.times) / 1e6\n",
-    "d"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "545ab9e4",
-   "metadata": {},
-   "source": [
-    "So that's about 2x slower than the builtin definition. We'll see why later on.\n",
-    "\n",
-    "But first: is this fast?  How would we know?  Let's compare it to some other languages..."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "7c591bf6",
-   "metadata": {},
-   "source": [
-    "## 3. C `sum` function\n",
-    "\n",
-    "C is often considered the gold standard: difficult on the human, nice for the machine. Getting within a factor of 2 of C is often satisfying. Nonetheless, even within C, there are many kinds of optimizations possible that a naive C writer may or may not take advantage of.\n",
-    "\n",
-    "If you do not speak C, feel free to skip the cell below; the point is that you can put C code in a Julia session, compile it, and run it. Note that the `\"\"\"` delimiters wrap a multi-line string."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "6c1e8f7d",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "using Libdl\n",
-    "C_code = \"\"\"\n",
-    "    #include <stddef.h>\n",
-    "    double c_sum(size_t n, double *X) {\n",
-    "        double s = 0.0;\n",
-    "        for (size_t i = 0; i < n; ++i) {\n",
-    "            s += X[i];\n",
-    "        }\n",
-    "        return s;\n",
-    "    }\n",
-    "\"\"\"\n",
-    "\n",
-    "const Clib = tempname()   # generate a temporary file path\n",
-    "\n",
-    "\n",
-    "# compile to a shared library by piping C_code to gcc\n",
-    "# (works only if you have gcc installed):\n",
-    "\n",
-    "open(`gcc -fPIC -O3 -misel -xc -shared -o $(Clib * \".\" * Libdl.dlext) -`, \"w\") do f\n",
-    "    print(f, C_code)\n",
-    "end\n",
-    "\n",
-    "# define a Julia function that calls the C function:\n",
-    "c_sum(X::Array{Float64}) = ccall((\"c_sum\", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "d3b912c0",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "c_sum(a)\n",
-    "c_sum(a) ≈ sum(a) # type \\approx and then <TAB> to get the ≈ symbol"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "f2659088",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "c_bench = @benchmark c_sum($a)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "b64d7233",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "d[\"C\"] = minimum(c_bench.times) / 1e6  # in milliseconds\n",
-    "d"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "69124ca3",
-   "metadata": {},
-   "source": [
-    "## 4. Python's built-in `sum`\n",
-    "The `PyCall` package provides a Julia interface to Python:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "34a95ee9",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import Pkg; Pkg.add(\"PyCall\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "357dd956",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "using PyCall"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "99f540ab",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# get the Python built-in \"sum\" function:\n",
-    "pysum = pybuiltin(\"sum\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "986968c8",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "pysum(a)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "89ec8102",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "py_list_bench = @benchmark $pysum($a)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "35c95364",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "d[\"Python built-in\"] = minimum(py_list_bench.times) / 1e6\n",
-    "d"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "9084038c",
-   "metadata": {},
-   "source": [
-    "## 5. Python's DIY `sum`\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "eb87f43b",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "py\"\"\"\n",
-    "def py_sum(A):\n",
-    "    s = 0.0\n",
-    "    for a in A:\n",
-    "        s += a\n",
-    "    return s\n",
-    "\"\"\"\n",
-    "\n",
-    "sum_py = py\"py_sum\""
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "d78d2891",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "py_hand = @benchmark $sum_py($a)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "38b153bc",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "sum_py(a)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "a55fc8ec",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "d[\"Python hand-written\"] = minimum(py_hand.times) / 1e6\n",
-    "d"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "84e1a285",
-   "metadata": {},
-   "source": [
-    "## 6. Summary"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "7e01b8ed",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "for (key, value) in sort(collect(d), by=last)\n",
-    "    println(rpad(key, 25, \".\"), lpad(round(value; digits=1), 6, \".\"))\n",
-    "end"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Julia 1.10.0",
-   "language": "julia",
-   "name": "julia-1.10"
-  },
-  "language_info": {
-   "file_extension": ".jl",
-   "mimetype": "application/julia",
-   "name": "julia",
-   "version": "1.10.0"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/2024_notes/Day 2/02_SerialCodeOptimization.ipynb b/2024_notes/Day 2/02_SerialCodeOptimization.ipynb
deleted file mode 100644
index 7466c51..0000000
--- a/2024_notes/Day 2/02_SerialCodeOptimization.ipynb	
+++ /dev/null
@@ -1,1155 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "id": "37a71ff3",
-   "metadata": {},
-   "source": [
-    "# Serial code optimization\n",
-    "\n",
-    "Several design decisions affect performance (e.g. types, type inference and type stability, multiple dispatch, meta-programming, vectorization, ...). Getting them right makes it possible to write efficient programs for high-performance computing.\n",
-    "\n",
-    "Before writing any fast parallel code, one should write fast serial code. Here are some tips that can make your serial code run faster:  \n",
-    "1. __Write your performance-critical code inside a function__  \n",
-    "    Code inside functions tends to run much faster than top-level code, due to how Julia's compiler works. Functions not only improve performance but are also more reusable and testable.  \n",
-    "2. __Make variables local__  \n",
-    "    A global variable can change its type during execution, so it is difficult for the Julia compiler to optimize code that uses it. Using local variables instead, or declaring the global variable as constant (`const`), will greatly improve performance.  \n",
-    "3. __Use `@time` or `@btime`__  \n",
-    "    Use the `@time` or `@btime` macro to measure the execution time of a function. They also report the amount of memory allocated, which can sometimes be significant.  \n",
-    "4. __Declare types when possible__  \n",
-    "    Type stability makes `for` loops faster. Concrete types (for example, for container elements and struct fields) help the compiler generate specialized machine code."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "983794bc",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "function xpow(x)\n",
-    "    return [x x^2 x^3 x^4]\n",
-    "end"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "86c8219c",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "function xpow_loop(n)\n",
-    "    s= 0\n",
-    "    for i = 1:n\n",
-    "        s = s + xpow(i)[2]\n",
-    "    end\n",
-    "    return s \n",
-    "end"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "db90c677",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "using BenchmarkTools; @btime xpow_loop(1000000)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "a9fafbc8",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "x = 0.0\n",
-    "@time for i = 1:10e4\n",
-    "    global x\n",
-    "    x += i\n",
-    "end"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "cb482534",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "function sum!(x)\n",
-    "    x = 0.0\n",
-    "    @time for i = 1:10e4\n",
-    "        global x\n",
-    "        x += i\n",
-    "    end\n",
-    "end"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "a05f1ff7",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "sum!(10)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "1ebd5490",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "function sum!(x)\n",
-    "    x = 0.0\n",
-    "    @time for i = 1:10e4\n",
-    "        x += i\n",
-    "    end\n",
-    "end"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "d35bc8fb",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "sum!(10)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "35df028c",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "function sum!(x::Int64)\n",
-    "    x = 0.0\n",
-    "    @time for i = 1:10e4\n",
-    "        #global x\n",
-    "        x += i\n",
-    "    end\n",
-    "end"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "cf5dece6",
-   "metadata": {},
-   "source": [
-    "sum!(10); x"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "1f9ff500",
-   "metadata": {},
-   "source": [
-    "# Julia Array-type optimization  \n",
-    "Arrays are a fundamental data structure in all programming languages, and using them well can improve performance. Special attention is therefore given to the usage of arrays in numerical programming.\n",
-    "\n",
-    "We will discuss how to use arrays in Julia in the most efficient way:\n",
-    "- Computer memory model and array representation and storage in Julia\n",
-    "- Bounds checks and **@inbounds**\n",
-    "- Specialized array types\n",
-    "- __Broadcasting__\n",
-    "- __SIMD__ parallelization"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "35080185",
-   "metadata": {},
-   "source": [
-    "## Computer memory model\n",
-    "\n",
-    "### Key elements of a CPU \n",
-    "- The control unit (CU), which turns binary into timing and control signals.\n",
-    "- The arithmetic logic unit (ALU), which does \"the math\".\n",
-    "- The cache, which is a place to store data and instructions for quick access. The cache stores quick-to-access versions of what is in main memory (RAM).\n",
-    "- The registers, tiny little caches that the ALU can get to directly.\n",
-    "\n",
-    "**Core**  \n",
-    "CPUs are composed of one or more smaller cores. For example, a dual-core processor has two ALUs, one or more CUs, and may have caches unique to them (though commonly L2 or L3 are shared).\n",
-    "\n",
-    "**Memory model: High Level View**\n",
-    "\n",
-    "![](https://hackernoon.com/hn-images/1*nT3RAGnOAWmKmvOBnizNtw.png)\n",
-    "\n",
-    "1. **L1-cache**\\\n",
-    "    A CPU core directly accesses its L1 cache, which is very close to the processor. It is really fast (typical latency ~ 0.8 ns). As a rule of thumb, the things we need to use soon are kept there. There can be two L1 caches: an I-cache (instructions) and a D-cache (data).   \n",
-    "2. **L2-cache**\\\n",
-    "    L2 is a larger cache that's a bit slower (typical latency ~ 2.4 ns).  \n",
-    "3. **L3-cache**\\\n",
-    "    L3 is a specialized cache designed to make L1 and L2 work better. It is slower than L1 and L2 (typical latency ~ 11.1 ns).  \n",
-    "    \n",
-    "\n",
-    "**Why worry about this?**\n",
-    "- For optimizing purposes, the important parts are the three cache levels (L1, L2, L3).\n",
-    "- We should use the things that are already in a closer cache. The code will run faster since data doesn't have to be queried for and moved up this chain.\n",
-    "\n",
-    "**Cache miss**  \n",
-    "A *cache miss* happens when something needs to be pulled directly from main memory. Caches store instructions and data. When the ALU needs data to do some math (x + y), each cache level is asked in turn whether the data is available. If it is not, the data must be retrieved from RAM, which is a far more expensive operation.  \n",
-    "Cache-aware and cache-oblivious algorithms are methods which change their indexing structure to optimize their use of the cache lines.   "
-   ]
-  },
-  {
-   "attachments": {},
-   "cell_type": "markdown",
-   "id": "b281adc6",
-   "metadata": {},
-   "source": [
-    "### How are arrays stored?\n",
-    "\n",
-    "In the general case (for example, when the element type is abstract), the contiguous storage of an allocated array simply consists of pointers to the actual elements."
-   ]
-  },
-  {
-   "attachments": {
-    "pointers.png": {
-     "image/png": ""
-    }
-   },
-   "cell_type": "markdown",
-   "id": "d2279bda",
-   "metadata": {},
-   "source": [
-    "![pointers.png](attachment:pointers.png)"
-   ]
-  },
-  {
-   "attachments": {
-    "elements.png": {
-     "image/png": ""
-    }
-   },
-   "cell_type": "markdown",
-   "id": "2624e4bf",
-   "metadata": {},
-   "source": [
-    "When an array has only one dimension, its elements can be stored one after the other in a contiguous block of memory.  \n",
-    "\n",
-    "![elements.png](attachment:elements.png)"
-   ]
-  },
-  {
-   "attachments": {
-    "colrow.png": {
-     "image/png": ""
-    }
-   },
-   "cell_type": "markdown",
-   "id": "edb16fc5",
-   "metadata": {},
-   "source": [
-    "But what happens with 2D arrays (or multidimensional arrays)?  \n",
-    "![colrow.png](attachment:colrow.png)"
-   ]
-  },
-  {
-   "attachments": {
-    "rowcol.png": {
-     "image/png": ""
-    }
-   },
-   "cell_type": "markdown",
-   "id": "41dd9764",
-   "metadata": {},
-   "source": [
-    "Two-dimensional (or greater) arrays can be stored in two different ways:\n",
-    "- __row-major order__, i.e. the linear array of memory is formed by stacking the rows one after another\n",
-    "- __column-major order__, i.e. the linear array of memory is formed by stacking the columns one after another.  \n",
-    "![rowcol.png](attachment:rowcol.png)  \n",
-    "Julia uses column-major ordering, just like Fortran, MATLAB, and R. On the other hand, arrays in C/C++ and Python's NumPy are stored in row-major order by default.  \n",
-    "The following code squares and sums the elements of a two-dimensional floating-point array, writing the running result at each step back to the same position. It exercises both the read and write operations for the array:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "654d6615",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "function col_iter(x)\n",
-    "    s=zero(eltype(x))\n",
-    "    for i in 1:size(x, 2)\n",
-    "        for j in 1:size(x, 1)\n",
-    "            s = s + x[j, i] ^ 2\n",
-    "            x[j, i] = s\n",
-    "        end\n",
-    "    end\n",
-    "end"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "511ce4f8",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "function row_iter(x)\n",
-    "    s=zero(eltype(x))\n",
-    "    for i in 1:size(x, 1)\n",
-    "        for j in 1:size(x, 2)\n",
-    "            s = s + x[i, j] ^ 2\n",
-    "            x[i, j] = s\n",
-    "        end\n",
-    "    end\n",
-    "end"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "66ea350d",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "a = rand(1000, 1000)\n",
-    "@btime col_iter($a)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "bda3ff9f",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "@btime row_iter($a)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "b51f3b04",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "#Another way of writing row_iter function\n",
-    "function row_iter(x)\n",
-    "    s=zero(eltype(x))\n",
-    "    for j in 1:size(x, 1)\n",
-    "        for i in 1:size(x, 2)\n",
-    "            s = s + x[j, i] ^ 2\n",
-    "            x[j, i] = s\n",
-    "        end\n",
-    "    end\n",
-    "end"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "e3891759",
-   "metadata": {},
-   "source": [
-    "### Lower Level View: the Stack and the Heap\n",
-    "\n",
-    "**Stack**  \n",
-    "- The stack requires a static allocation. It is ordered and accesses are very quick.\n",
-    "- Because this is static, it requires that the size of the variables is known at compile time (to determine all of the variable locations).\n",
-    "    \n",
-    "**Heap**  \n",
-    "- The heap is essentially a stack of pointers to objects in memory. When heap variables are needed, their values are pulled up the cache chain and accessed.\n",
-    "- Heap allocations are costly because they involve this pointer dereferencing."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "f5fece0a",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "a = Int64[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n",
-    "b = Number[1,2,3,4,5,6,7,8,9,10]  #Number is a supertype of Int64"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "d0ac9708",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "function arr_sumsqr(x::Array{T}) where T <: Number  # T <: Number: T is a subtype of Number\n",
-    "    r = zero(T)\n",
-    "    for i = 1:length(x)\n",
-    "        r = r + x[i] ^ 2\n",
-    "    end\n",
-    "    return r\n",
-    "end"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "2af26fc4",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "@btime arr_sumsqr($a)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "6a5cbd89",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "@btime arr_sumsqr($b)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "37672634",
-   "metadata": {},
-   "source": [
-    "When the array is defined to contain a specific concrete type, the Julia runtime can store the values inline within the allocation of the array, since it knows the exact size of each element. When the array contains an abstract type, the actual value can be of any size. Thus, when the Julia runtime creates the array, it only stores pointers to the actual values within the array. The values are stored elsewhere on the heap. This not only causes an extra memory load when reading the values, but the indirection can also disrupt pipelining and cache locality when executing this code on the CPU."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 2,
-   "id": "350cf13b",
-   "metadata": {},
-   "outputs": [
-    {
-     "ename": "NameError",
-     "evalue": "name 'rand' is not defined",
-     "output_type": "error",
-     "traceback": [
-      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
-      "\u001b[0;31mNameError\u001b[0m                                 Traceback (most recent call last)",
-      "Cell \u001b[0;32mIn[2], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m A \u001b[38;5;241m=\u001b[39m \u001b[43mrand\u001b[49m(\u001b[38;5;241m100\u001b[39m,\u001b[38;5;241m100\u001b[39m)\n\u001b[1;32m      2\u001b[0m B \u001b[38;5;241m=\u001b[39m rand(\u001b[38;5;241m100\u001b[39m,\u001b[38;5;241m100\u001b[39m)\n\u001b[1;32m      3\u001b[0m C \u001b[38;5;241m=\u001b[39m rand(\u001b[38;5;241m100\u001b[39m,\u001b[38;5;241m100\u001b[39m)\n",
-      "\u001b[0;31mNameError\u001b[0m: name 'rand' is not defined"
-     ]
-    }
-   ],
-   "source": [
-    "A = rand(100,100)\n",
-    "B = rand(100,100)\n",
-    "C = rand(100,100)\n",
-    "\n",
-    "function inner_alloc!(C,A,B)\n",
-    "    for j in 1:100, i in 1:100\n",
-    "        val = [A[i,j] + B[i,j]]\n",
-    "        C[i,j] = val[1]\n",
-    "    end\n",
-    "    return C\n",
-    "end\n",
-    "@btime inner_alloc!($C,$A,$B)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "cf21c41c",
-   "metadata": {},
-   "source": [
-    "A new one-element array is allocated at each step of the loop. This allocation and the subsequent garbage collection take a significant amount of time."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "3284a9ae",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "function inner_noalloc!(C,A,B)\n",
-    "    for j in 1:100, i in 1:100\n",
-    "        val = A[i,j] + B[i,j]\n",
-    "        C[i,j] = val\n",
-    "    end\n",
-    "end\n",
-    "@btime inner_noalloc!($C,$A,$B)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "29b01eb3",
-   "metadata": {},
-   "source": [
-    "Consider `StaticArrays.jl` for small fixed-size vector/matrix operations.  \n",
-    "- The StaticArrays.jl library provides statically sized arrays, which can be stack-allocated. \n",
-    "- It is well suited to code that uses many small (< 100 elements) arrays of fixed size."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "6058653d",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import Pkg; Pkg.add(\"StaticArrays\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "ec194cd0",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "using StaticArrays\n",
-    "function static_inner_alloc!(C,A,B)\n",
-    "    for j in 1:100, i in 1:100\n",
-    "        val = @SVector [A[i,j] + B[i,j]]\n",
-    "        C[i,j] = val[1]\n",
-    "    end\n",
-    "end\n",
-    "@btime static_inner_alloc!($C,$A,$B)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "c52bd387",
-   "metadata": {},
-   "source": [
-    "## Bounds checks \n",
-    "Julia performs bounds checks on arrays by default at runtime. This means that the Julia compiler and runtime verify that the arrays are not indexed outside their limits and that all the indexes lie between the actual start and end of an array.\n",
-    "\n",
-    "The `@inbounds` macro eliminates array bounds checking within expressions and is applied in front of a function or loop definition."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "7d10ae75",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "len = (100, 100);\n",
-    "a, b = fill(1.0f0, len), fill(2.0f0, len);"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "b5dc2255",
-   "metadata": {},
-   "source": [
-    "Using `fill` in this way is not only convenient but also safer, since the memory that is returned to the program is filled with known good values. However, the operation to fill the memory is also expensive. For a performance-critical way to create arrays, the constructor can be called with the special `undef` initializer, e.g. `Array{Float32}(undef, 100, 100)`. In this case, memory is allocated, but not filled."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "8cc42a32",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "function prefix_bounds(a,b)\n",
-    "    c = similar(a)\n",
-    "    for j in 1:100, i in 1:100\n",
-    "        c[i,j] = a[i,j] + b[i,j]\n",
-    "    end\n",
-    "end"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "3761b33b",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "@btime prefix_bounds($a, $b)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "ce83d5ab",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "@code_llvm prefix_bounds(a,b)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "b059ee31",
-   "metadata": {},
-   "source": [
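-    "Notice the bounds checks in the generated code: index comparisons that branch to an out-of-bounds error path (the `inbounds` keyword on LLVM's `getelementptr` is an LLVM annotation, not the runtime check itself).\n",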
-    "- bounds checking is enabled by default in order to not allow the user to index outside of an array.\n",
-    "- Indexing outside of an array is dangerous:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "603e50fd",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "a[101,1]"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "ddd4ffe3",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "function prefix_bounds(a,b)\n",
-    "    c = similar(a)\n",
-    "    @inbounds for j in 1:100, i in 1:100\n",
-    "        c[i,j] = a[i,j] + b[i,j]\n",
-    "    end\n",
-    "end"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "deee358b",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "@btime prefix_bounds($a, $b)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "b67a1011",
-   "metadata": {},
-   "source": [
-    "When Julia is started with `--check-bounds=yes`, all `@inbounds` annotations in the code are ignored and bounds checks are always performed. Conversely, when Julia is started with `--check-bounds=no`, no bounds checking is done at all, which is equivalent to annotating every array access with the `@inbounds` macro. This option should be used sparingly, and only for extremely performance-sensitive code that is very well tested and takes minimal user input."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "fc6b62f9",
-   "metadata": {},
-   "source": [
-    "## Broadcasting\n",
-    "\n",
-    "Broadcasting applies an operation to each element of an array, rather than to the array as a whole. In many high-level languages this is simply called vectorization.  \n",
-    "*In Julia, we will call it array vectorization to distinguish it from SIMD vectorization, which is common in lower-level languages like C and Fortran, and is also available in Julia.*"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "35be5ea7",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "a=collect(1:4)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "dc548e01",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "sqrt.(a)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "39c5589d",
-   "metadata": {},
-   "source": [
-    "More generally, it allows operations between arrays of different shapes, such as adding a vector to every column in a matrix:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "db39c829",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "b=reshape(1:8, 4, 2)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "eaab2429",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "b .+ a"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "d2c913e6",
-   "metadata": {},
-   "source": [
-    "Unlike in many other scientific computing languages, writing vectorized code in Julia is not a performance optimization: loops in Julia, unlike in NumPy or Matlab, are already fast.  \n",
-    "Even in vectorized languages, operating on whole vectors has a downside: combined operations do not compose efficiently. For example, for a vector `a`, an expression such as `b = sin(cos(a))` is usually evaluated as `temp = cos(a)` followed by `b = sin(temp)`.  \n",
-    "This has two problems. First, a temporary array `temp` is allocated, which is costly for large input arrays and gets worse as the chain of functions grows. Second, there are two separate loops over the elements of `a`.  \n",
-    "In Julia, however, the dotted form `b = sin.(cos.(a))` compiles down to a single loop, with no temporary array allocated. The compiled code is conceptually similar to the following (__loop fusion__):"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "2f126edc",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "function broadcasting_ex(a)\n",
-    "    b = similar(a)\n",
-    "    for i in 1:length(a)\n",
-    "        b[i] = sin(cos(a[i]))\n",
-    "    end\n",
-    "    return b\n",
-    "end"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "037dfd09",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "a=ones(10);\n",
-    "@btime broadcasting_ex($a)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "e6e91e7f",
-   "metadata": {},
-   "source": [
-    "Broadcasting can also help with preallocated output using the dotted assignment operator, `.=`. This allows the use of preallocated output with loop fusion, avoiding any extra copying."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "ab8f56ba",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "a = collect(1:10);\n",
-    "b = fill(0.0, 10);\n",
-    "@btime sin.(cos.($a))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "76d75064",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# using the .= operator writes into the preallocated b, which reduces allocations\n",
-    "@btime $b .= sin.(cos.($a))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "f8986c4b",
-   "metadata": {},
-   "source": [
-    "## SIMD parallelization (AVX2, AVX512)\n",
-    "__SIMD__ stands for __Single Instruction, Multiple Data__. It is a form of parallel computation on the CPU in which a single operation is performed on many data elements simultaneously. Modern CPU architectures provide instruction sets that can perform these operations on many values at once. For example, Intel provides __AVX2__ (256 bits of data in one instruction) and __AVX512__ (512 bits of data in one instruction)."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "fab4c47f",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "using Test"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "abd8a002",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "len = Int64(1e7);\n",
-    "a, b = fill(1.0f0, len), fill(2.0f0, len);"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "eba8facf",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "function serial_sum_vectors(a::Array, b::Array)\n",
-    "    c = similar(a)\n",
-    "    @inbounds for i in eachindex(a, b)\n",
-    "        c[i] = a[i] + b[i]\n",
-    "    end\n",
-    "    return c\n",
-    "end"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "7289b273",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "res0 = serial_sum_vectors(a, b);"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "0055b1d1",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "@test all(res0 .== 3.0f0)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "c591fc81",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# The returned time is the minimum elapsed time (in seconds) measured during the benchmark.\n",
-    "serial_t = @belapsed serial_sum_vectors($a, $b)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "6b4b9d8d",
-   "metadata": {},
-   "source": [
-    "The function essentially performs $10^7$ sequential additions. A processor with 256-bit AVX2 registers, however, can add eight `Float32` values with a single instruction. Adding the elements one at a time can therefore waste CPU capabilities. On the other hand, rewriting code by hand to operate on parts of the array in parallel gets complex quickly, and doing this for a wide range of algorithms can be an impossible task.  \n",
-    "Julia makes this significantly easier with the `@simd` macro. Placing this macro in front of a `for` loop gives the compiler the freedom to use SIMD instructions for the operations within the loop, if possible: "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "14d275f8",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "function simd_sum_vectors(a::Array, b::Array)\n",
-    "    c = similar(a)\n",
-    "    @inbounds @simd for i in eachindex(a, b)\n",
-    "        c[i] = a[i] + b[i]\n",
-    "    end\n",
-    "    return c\n",
-    "end"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "aef641b8",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "res1 = simd_sum_vectors(a, b);"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "5ff6d6d8",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "@test all(res1 .== 3.0f0)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "5c307d84",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "simd_t = @belapsed simd_sum_vectors($a, $b)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "cb457ae2",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Speedup of each version relative to the slowest one\n",
-    "times = [serial_t, simd_t]\n",
-    "speedup = maximum(times) ./ times"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "349e27f8",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Let's check whether our code uses SIMD: in the output, look for labels prefixed with `vector.` (such as `vector.body`) and for vector types such as <8 x float>\n",
-    "@code_llvm serial_sum_vectors(a, b)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "23c3c08b",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "@code_llvm simd_sum_vectors(a, b)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "e543524c",
-   "metadata": {},
-   "source": [
-    "### Limitations of `@simd`\n",
-    "\n",
-    "Adding `@simd` does not make every loop faster. In particular, note that using SIMD implies that the order of operations within and across the loop might change. The compiler needs to be certain that the reordering will be safe before it attempts to parallelize a loop. So we need to be sure that:\n",
-    "- Each iteration of the loop is independent of the others.\n",
-    "- Arrays being operated upon within the loop do not overlap in memory.\n",
-    "- The loop body is straight-line code without branches or function calls.\n",
-    "- The number of iterations of the loop is obvious.\n",
-    "- Bounds checking is disabled for SIMD loops."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "d9d5bee7",
-   "metadata": {},
-   "source": [
-    "### SIMD.jl\n",
-    "The `@simd` macro we saw previously is only a hint to the compiler: it does not guarantee that SIMD operations are used on the CPU. If, as a programmer, you want to be certain that some operations are implemented as SIMD instructions, the `SIMD.jl` package provides low-level types and functions that allow you to specify this explicitly."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "45f7bd3c",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import Pkg; Pkg.add(\"SIMD\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "703c4df0",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "using SIMD"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "f18cdb48",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "function simd_sum_vectors!(a::Array, b::Array)\n",
-    "    c = similar(a)\n",
-    "    @inbounds @simd for i in eachindex(a, b)\n",
-    "        c[i] = a[i] + b[i]\n",
-    "    end\n",
-    "end"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "4e9e140f",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "res2 = simd_sum_vectors!(a, b);"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "2685f7e6",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "simd_t! = @belapsed simd_sum_vectors!($a, $b)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "d093fd8f",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "times = [serial_t, simd_t, simd_t!]\n",
-    "speedup = maximum(times) ./ times"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "a7491ceb",
-   "metadata": {},
-   "source": [
-    "Many broadcast operations in Julia already use SIMD!"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "726556ea",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "operation(a, b) = a .+ b  # benchmark the broadcast itself, not a precomputed result\n",
-    "broad_t = @belapsed operation($a, $b)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "dbced515",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "times = [serial_t, simd_t, simd_t!, broad_t]\n",
-    "speedup = maximum(times) ./ times"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "114ab3db",
-   "metadata": {},
-   "source": [
-    "## Summary\n",
-    "- SIMD is data-level parallelism.\n",
-    "- When the `@simd` macro is used, Julia may reorder floating-point additions, even if this changes the result.\n",
-    "- Depending on your CPU, this may lead to 2x, 4x or even 8x parallelism.\n",
-    "- The parallelism can (sometimes) happen automatically.\n",
-    "- `@simd` is easy to use, but it is not always necessary and does not always give a significant speedup.\n",
-    "- Best for small, tight, innermost loops."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "4133fe69",
-   "metadata": {},
-   "source": [
-    "# CHECKPOINT\n",
-    "\n",
-    "Using these libraries:\n",
-    "\n",
-    "```julia\n",
-    "using BenchmarkTools, Test, SIMD, InteractiveUtils\n",
-    "```\n",
-    "\n",
-    "### **Q1**  \n",
-    "Consider the following function that sums all the elements of a vector:  \n",
-    "\n",
-    "```julia\n",
-    "A = rand(__put a BIGNUMBER here__)  \n",
-    "\n",
-    "function simplesum(A)  \n",
-    "    result = zero(eltype(A))  \n",
-    "    for i in eachindex(A)  \n",
-    "        result += A[i]  \n",
-    "    end  \n",
-    "    return result  \n",
-    "end  \n",
-    "```   \n",
-    "\n",
-    "- Try to optimize it for bounds checks\n",
-    "- Benchmark against the built-in Julia function `sum`\n",
-    "- Implement SIMD parallelization and compute the obtained speedup\n",
-    "\n",
-    "### **Q2**\n",
-    "\n",
-    "The following function estimates pi with n samples. \n",
-    "\n",
-    "```julia\n",
-    "function estimatepi(n)\n",
-    "    area_circle = 0\n",
-    "    for i = 1:n\n",
-    "        x, y = rand(), rand();\n",
-    "        r = x^2 + y^2\n",
-    "        if r < 1.0\n",
-    "            area_circle +=1\n",
-    "        end\n",
-    "    end\n",
-    "    return 4* area_circle/n\n",
-    "end\n",
-    "```\n",
-    "- Optimize it for bounds checks\n",
-    "- Use the `time()` function to measure the elapsed execution time of the function\n",
-    "- Parallelize the code in order to exploit SIMD\n",
-    "- Compute the relative error with respect to Julia's built-in constant `pi`\n",
-    "\n",
-    "### **Q3**  \n",
-    "Write a function that computes the product of two square matrices $ C = A \\times B$ of order $n$, namely:  \n",
-    "$$\n",
-    "c_{ij} = \\sum_{k=1}^n{a_{ik}b_{kj}}\n",
-    "$$\n",
-    "\n",
-    "- Permute the order of the `for` loops to compare row-major and column-major access patterns, and collect the execution times in a dictionary\n",
-    "- Try to optimize the function while ensuring that the result is correct (using `@test` and Julia's built-in matrix multiplication)\n",
-    "- Is it possible to use SIMD for this problem?"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "70d01b33",
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.10.9"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
-- 
GitLab