4: Current Trends and Applications in AI

4.1. Deep Learning Frameworks

Introduction

Deep learning frameworks are specialized software libraries that simplify the development and deployment of neural networks by providing ready-made building blocks (tensor operations, neural network layers, optimizers, etc.) and tools for automatic differentiation. Over the past decade, numerous frameworks have emerged to help researchers and engineers assemble, train, and scale deep learning models without implementing everything from scratch. Early pioneers like Theano (developed at Université de Montréal in 2007) provided a foundation for Python-based deep learning, introducing the concept of constructing mathematical expressions (and their gradients) as computation graphs. Since then, the field has seen a rapid evolution of frameworks – from static graph approaches to more dynamic, Pythonic ones – each balancing ease of use, flexibility, performance, and production readiness. In this post, we will explore the major deep learning frameworks, explain their underlying principles and features, compare their performance and ecosystems, and discuss how they support hardware accelerators and model deployment. We will also highlight practical tips (with code snippets) and real-world applications across domains (healthcare, finance, autonomous vehicles, and more) to illustrate how these frameworks are used in both research and industry.

Evolution of Deep Learning Frameworks

The development of deep learning frameworks mirrors the maturation of the deep learning field itself. Early frameworks like Theano and Torch7 (a Lua-based framework from 2011) were groundbreaking in their time, allowing researchers to leverage GPUs for training neural networks by expressing computations as graphs. Theano in particular is often considered the “grandfather” of deep learning frameworks – it let users define and optimize symbolic mathematical expressions and was a dominant tool in academic research for years. Building on these foundations, more user-friendly libraries appeared. For example, Caffe (released in 2014) provided an expressive way to define convolutional neural networks (especially popular in computer vision) via configuration files, while Microsoft’s CNTK and Apache MXNet further expanded the landscape. However, two frameworks released in the mid-2010s — TensorFlow and PyTorch — rapidly became the leaders, thanks to their balance of flexibility and performance, large community support, and backing by tech giants.

TensorFlow, developed by the Google Brain team and open-sourced in 2015, popularized the dataflow programming model for deep learning. It introduced the idea of building a static computation graph first and then executing it, which offered advantages in distributed training and deployment optimization. Around the same time, Keras (introduced by François Chollet in 2015) provided a high-level, user-friendly API that could run on top of multiple backends (initially Theano or TensorFlow, and later CNTK), enabling fast experimentation with deep nets. In fact, Keras was integrated into TensorFlow’s core in 2017 (accessible as tf.keras) to combine ease-of-use with TensorFlow’s powerful engine. On the other side, PyTorch, released by Facebook’s AI Research lab in 2016, took a different approach with its dynamic computation graph (also known as define-by-run). PyTorch allows the graph to be built on the fly as you execute code, which made debugging and development more intuitive and “Pythonic,” especially for researchers experimenting with novel models. This flexibility helped PyTorch quickly gain traction in the research community, challenging TensorFlow’s earlier dominance. By the late 2010s, most new deep learning research projects were opting for PyTorch – by 2023, over 75% of new papers used PyTorch, indicating its widespread adoption in academia (see Figure 1).

Figure 1: Trend of deep learning framework usage in research papers (2019–2023). PyTorch (red) has grown to dominate by 2023, while TensorFlow (orange) saw a relative decline. JAX (green) has a smaller but rising share.

The competition and cross-pollination between frameworks have led to rapid improvements. TensorFlow responded to PyTorch’s popularity by releasing TensorFlow 2.0 (2019) with eager execution by default and tighter Keras integration – effectively making TensorFlow more dynamic and user-friendly. Meanwhile, PyTorch introduced tools for deployment (such as TorchScript for graph-based execution and mobile runtimes) to strengthen its production story. New frameworks like JAX (introduced around 2018 by Google) emerged, focusing on high-performance and composability, and leveraging XLA (Accelerated Linear Algebra) compiler techniques to fuse operations and run computations efficiently on GPUs/TPUs. Moreover, higher-level abstractions built on existing libraries came about, such as PyTorch Lightning (2019) which sits atop PyTorch to automate routine training loops and engineering, and Flax (built on JAX) which provides a neural network API for JAX. There was also a push for framework-agnostic model formats for deployment, leading to ONNX (Open Neural Network Exchange) in 2017 – an open format backed by Facebook and Microsoft to let models trained in one framework be used in another. The following sections delve into each of these modern frameworks (TensorFlow, PyTorch, JAX, Keras, PyTorch Lightning, ONNX), comparing their design philosophies, features, and real-world use cases.

TensorFlow and Keras

TensorFlow is an end-to-end open-source deep learning framework developed by Google and first released in November 2015. Its core concept is the execution of dataflow graphs, where nodes represent operations and edges represent multi-dimensional data arrays (tensors) flowing between operations. In TensorFlow 1.x, users typically defined a static graph first (using APIs like tf.Graph, tf.Session, and tf.placeholder for inputs) and then executed the graph. This static graph approach allowed TensorFlow to optimize computations and deploy models efficiently to different environments (e.g. running the same graph in C++ for production or on mobile devices). A key advantage was the ability to do distributed training and deployment easily – the graph could be serialized and run on remote servers or accelerated hardware without the original Python code. Google’s emphasis on production use is evident in TensorFlow’s ecosystem: it offers TensorBoard for visualizing training, TensorFlow Serving for deploying models as scalable services, TensorFlow Lite for mobile/embedded inference, and even TensorFlow.js for in-browser execution. In fact, TensorFlow was used internally at Google for many projects – for example, Google’s RankBrain algorithm for search and portions of Google Maps’ Street View were powered by TensorFlow models running on specialized hardware (TPUs) as early as 2016. TensorFlow’s robustness and versatility made it popular in industry; many companies adopted it for applications ranging from image recognition to language translation. However, TensorFlow (especially 1.x) had a steeper learning curve for beginners – the static graph paradigm felt less intuitive and required more ceremony to debug or change models on the fly.

Keras was created to mitigate exactly those usability pain points. Released in 2015 as a high-level neural network API, Keras emphasizes user-friendliness and quick prototyping. With Keras, one can build a model by just stacking layers and call model.fit() to train – the library handles the graph construction and training loop internally. Keras initially could use either TensorFlow, Theano, or CNTK as a backend compute engine, which also helped it gain wide adoption. In 2017, Keras was integrated into TensorFlow’s core library (as tf.keras), making it the recommended high-level interface for TensorFlow 2.x. This gave users the convenience of Keras’ API with the full power of TensorFlow’s performance and deployment capabilities. A simple example of defining and training a model in Keras vs. pure TensorFlow (or PyTorch) highlights the difference in abstraction level:

# High-level API (TensorFlow Keras)
import tensorflow as tf

# X_train, y_train are assumed to be arrays with 20 input features and binary labels
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X_train, y_train, epochs=10, batch_size=32)

In just a few lines, the above defines a two-layer neural network and trains it for 10 epochs. By contrast, a lower-level approach requires manual training loop management:

# Lower-level API (e.g., PyTorch style)
import torch

model = Net()                            # define model architecture (Net is assumed to be an nn.Module)
loss_fn = torch.nn.BCELoss()             # binary cross-entropy, matching the Keras example above
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for epoch in range(10):
    for X_batch, y_batch in train_loader:   # train_loader is assumed to be a torch DataLoader
        optimizer.zero_grad()           # reset gradients
        outputs = model(X_batch)        # forward pass
        loss = loss_fn(outputs, y_batch)
        loss.backward()                 # backpropagate gradients
        optimizer.step()                # update parameters

Both code snippets accomplish the same task, but Keras abstracts away the boilerplate, letting users focus on architecture. This ease of use made Keras very popular for fast development and teaching. Notably, Keras has recently evolved to become framework agnostic again: Keras 3.0 (2023) introduced support for multiple backends beyond TensorFlow, including JAX and PyTorch, via a “Universal Keras” effort. This means developers can choose Keras as a high-level interface and execute the model with the backend of their choice, reflecting an interesting convergence of the ecosystem.
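
To illustrate, here is a minimal sketch of backend selection in Keras 3, assuming Keras 3 is installed: the backend is chosen via the KERAS_BACKEND environment variable before Keras is first imported, and the modeling code itself stays the same (X_train and y_train are assumed to exist).

import os
os.environ["KERAS_BACKEND"] = "jax"   # or "tensorflow" / "torch"; must be set before importing keras

import keras

# The modeling code is backend-agnostic
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X_train, y_train, epochs=10, batch_size=32)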

TensorFlow’s current incarnation (2.x) effectively offers two levels of usage: the high-level Keras API for productivity, and the lower-level TensorFlow Core (e.g., tf.Tensor operations, tf.GradientTape for manual differentiation in eager mode) for maximum flexibility when needed. It supports a broad range of platforms and languages – models can be deployed to CPUs or GPUs, and Google’s custom TPUs are natively supported for lightning-fast training. TensorFlow also has bindings for languages like C++ and Java, and even a JavaScript library (TensorFlow.js), reflecting its goal of being a universal, production-ready framework. Many real-world systems have been built with TensorFlow: for instance, in healthcare, Google’s research on detecting diabetic retinopathy from retinal images was implemented with deep neural networks in TensorFlow, achieving ophthalmologist-level accuracy in diagnoses. In finance, banks have leveraged TensorFlow to build fraud detection systems and risk models that analyze streaming transaction data in real-time. The framework’s ability to scale and serve models (through TensorFlow Serving or on the cloud) makes it attractive for such industrial applications where reliability and scalability are paramount.
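
To illustrate the lower-level path mentioned above, the following sketch performs one manual training step with tf.GradientTape in eager mode; `model`, `x_batch`, and `y_batch` are assumed to already exist (a tf.keras model and one batch of data).

import tensorflow as tf

loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam()

with tf.GradientTape() as tape:                      # record operations for autodiff
    predictions = model(x_batch, training=True)      # forward pass
    loss = loss_fn(y_batch, predictions)

grads = tape.gradient(loss, model.trainable_variables)            # reverse-mode gradients
optimizer.apply_gradients(zip(grads, model.trainable_variables))  # parameter update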

PyTorch

PyTorch is an open-source deep learning framework released by Facebook’s AI Research (FAIR) in 2016. In contrast to TensorFlow’s early approach, PyTorch introduced a more imperative, eager execution style: models are defined as Python code (e.g., as nn.Module classes), and every forward pass dynamically builds the computation graph on the fly. This dynamic computation graph (or define-by-run) paradigm was a game changer for researchers. It allows using regular Python control flow (loops, conditionals, etc.) within model definitions, making it easy to implement complex models like recursive networks or novel layer types that can change shape or behavior between iterations. Debugging is straightforward – if a model has an error, the stack trace points to the exact line in Python, just like any other code. This flexibility “dethroned TensorFlow’s stable position” in the research community around 2017–2018, as many found PyTorch more intuitive for experimentation. As a result, by 2020, PyTorch had become the de facto choice for most academic and cutting-edge industrial AI research.
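
As a small illustration of the define-by-run style (a hypothetical model, not from any particular paper), ordinary Python control flow can sit directly inside forward(), and the graph is rebuilt on every call:

import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(20, 64)
        self.fc2 = nn.Linear(64, 1)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        if h.mean() > 0.5:          # data-dependent branch, resolved at run time
            h = h * 2
        return torch.sigmoid(self.fc2(h))

model = DynamicNet()
out = model(torch.randn(8, 20))     # the graph for this call is built on the fly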

Under the hood, PyTorch provides an autograd engine for automatic differentiation (tracking operations on tensors and computing gradients via reverse-mode differentiation). It also introduced optimized GPU operations and supports distributed training. Initially, one perceived drawback was that PyTorch lacked some of TensorFlow’s deployment and production features; a PyTorch model was essentially a Python program, which made it harder to deploy in environments where Python isn’t available. The PyTorch team addressed this by developing TorchScript, which can serialize a model (and even just-in-time compile parts of it) into an intermediate representation that a C++ runtime can execute – enabling model serving in C++ applications or on mobile devices. Additionally, Facebook (now Meta) and other contributors built out the production ecosystem for PyTorch: for instance, PyTorch Mobile and Lite Interpreter allow running models on Android and iOS, and there’s integration with NVIDIA’s TensorRT for optimized inference. By now, PyTorch is considered a mature framework suitable not just for research but also for production at scale.
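
A minimal sketch of the TorchScript export path described above, assuming `model` is an nn.Module and `example_input` is a representative input tensor:

import torch

scripted = torch.jit.trace(model, example_input)   # or torch.jit.script(model) for data-dependent control flow
scripted.save("model_scripted.pt")                 # serialized, Python-free representation

# The saved module can be reloaded without the original class definition,
# in Python via torch.jit.load(...) or in a C++ application via torch::jit::load(...).
reloaded = torch.jit.load("model_scripted.pt")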

One of PyTorch’s biggest strengths is its ecosystem and community support. Being open-source, it rapidly accumulated extensions and libraries for various domains. For computer vision, there is TorchVision (with common image models and transforms); for NLP, TorchText; for audio processing, TorchAudio; for probabilistic modeling, PyTorch has Pyro and BoTorch; and so on. The community contributed high-level libraries like fast.ai (which simplifies training loops for practitioners) and Hugging Face Transformers, which originally were built on PyTorch to provide off-the-shelf state-of-the-art models in NLP. Specialized toolkits such as MONAI (Medical Open Network for AI) target healthcare imaging with PyTorch, and PyTorch Geometric addresses graph neural networks. This rich ecosystem makes PyTorch incredibly versatile – whatever the application (vision, text, speech, graphs, reinforcement learning, etc.), one can likely find a PyTorch-based toolkit or pretrained model to start with. For example, autonomous vehicle teams have used PyTorch to train perception models: notably, Tesla’s Autopilot team (led by Andrej Karpathy at the time) chose PyTorch to develop their full self-driving computer vision models. In one talk, Karpathy explained that PyTorch’s flexibility in defining “hydranet” multi-task models (with a shared backbone and multiple output heads) was crucial for iterating on complex tasks like simultaneous object detection, lane prediction, and collision avoidance in real time. PyTorch is also heavily used in the generative AI space – for instance, the popular Stable Diffusion image generation model and many of OpenAI’s projects (GPT models in recent years) utilize PyTorch for training. In the healthcare domain, researchers have used PyTorch to build medical image analysis models (e.g., pneumonia detection on chest X-rays) and benefited from its intuitive debugging when iterating on model improvements.

From a features standpoint, PyTorch has continued to evolve. It now has robust support for distributed data-parallel training (for multi-GPU and multi-node training), mixed-precision training (to leverage tensor-core GPUs for faster training), and an interface to leverage hardware like Google Cloud TPUs (through the PyTorch/XLA library). It’s worth noting that Google’s Cloud TPU service supports PyTorch and JAX in addition to TensorFlow – Cloud TPUs can accelerate all these frameworks via the XLA compiler. PyTorch has also embraced compile-time graph optimizations in the latest version (PyTorch 2.0 introduced torch.compile, which captures graphs with TorchDynamo and optimizes them with the TorchInductor compiler backend, bringing performance closer to static-graph frameworks). In summary, PyTorch’s focus on an excellent developer experience, combined with its growing production capabilities, has made it a well-rounded framework for both research and industrial applications.
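
To make the compile-time optimization mentioned above concrete, opting into PyTorch 2.x compilation is a one-line change (a sketch; `model` is assumed to be an existing nn.Module and the input shape is illustrative):

import torch

compiled_model = torch.compile(model)        # capture with TorchDynamo, optimize with TorchInductor
output = compiled_model(torch.randn(8, 20))  # first call triggers compilation; later calls reuse it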

PyTorch Lightning

While PyTorch’s flexibility is great for research, writing boilerplate code for training loops, checkpointing, logging, etc., can become tedious or error-prone, especially in larger projects. PyTorch Lightning was introduced in 2019 by researchers (later formalized under the company Lightning AI) as a high-level framework on top of PyTorch to abstract away much of this boilerplate. Lightning essentially lets you organize PyTorch code into reusable components: you define a LightningModule (which includes your model, training and validation steps, and optimizers), and then use a Trainer class to handle the rest. Under the hood, it uses regular PyTorch, but it automates things like moving data to GPUs, managing multiple GPUs or even multiple nodes, saving model checkpoints, early stopping, and logging to TensorBoard or other loggers.

The main benefits of PyTorch Lightning include:

  • Clean code design – research code can be written in a structured, object-oriented way (e.g., separating data preparation, model definition, and training logic into different classes), which makes it more maintainable.

  • Automatic handling of training details – you don’t need to manually write training loops, perform .backward() or .step() calls, or worry about scaling to multiple GPUs; Lightning’s Trainer.fit() will do that based on flags you set (for example, Trainer(accelerator="gpu", devices=4) could run your training on 4 GPUs, and it even has support for TPU training).

  • Consistency and reproducibility – by abstracting training boilerplate, Lightning encourages a consistent approach to things like checkpointing models (so you can resume training easily) and evaluating on validation data, which leads to fewer mistakes.

In short, Lightning adheres to the principle of “don’t repeat yourself” for PyTorch code. A training script that might be hundreds of lines in raw PyTorch can often be reduced to tens of lines in Lightning, with the library handling the rest.

For example, in vanilla PyTorch you might write separate loops for training and validation and manually track metrics, whereas in Lightning you could do something like:

import torch
import pytorch_lightning as pl

class MyLitModel(pl.LightningModule):
    def __init__(self):                      # hyperparameters would be passed here
        super().__init__()
        self.model = ...                     # your nn.Module architecture (placeholder)
        self.loss_fn = ...                   # e.g., torch.nn.BCELoss() (placeholder)
    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = self.loss_fn(y_hat, y)
        self.log('train_loss', loss)
        return loss
    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

model = MyLitModel()
trainer = pl.Trainer(max_epochs=10, accelerator="gpu", devices=2)
trainer.fit(model, train_dataloader, val_dataloader)   # DataLoaders assumed to be defined elsewhere

This code will train on 2 GPUs, log training loss, do validation after each epoch, and save checkpoints, all without the user writing those parts. As noted in the official blog, PyTorch Lightning’s design makes it easy to scale up experimentation (to multiple GPUs or nodes) and incorporate best practices (like early stopping or model checkpointing) by flipping a switch. Lightning has been widely adopted in the research community for Kaggle competitions and academic projects where quick iteration is needed, and it serves as a nice bridge to production as well – once a Lightning model is trained, you still have a pure PyTorch model (the LightningModule can be used as a regular PyTorch nn.Module), which you can export or deploy as needed. It’s worth mentioning that Lightning is just one example of high-level frameworks; others exist (for example, FastAI for beginners or Hugging Face Trainer for transformer models), but Lightning has gained a strong following for its flexibility and plugin ecosystem.
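
To make the “flip a switch” point concrete, here is a hedged sketch of enabling early stopping and checkpointing purely through Trainer configuration; it assumes the LightningModule logs a "val_loss" metric during validation.

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

trainer = pl.Trainer(
    max_epochs=50,
    accelerator="gpu",
    devices=2,
    callbacks=[
        EarlyStopping(monitor="val_loss", patience=3),       # stop when val_loss stops improving
        ModelCheckpoint(monitor="val_loss", save_top_k=1),    # keep only the best checkpoint
    ],
)
trainer.fit(model, train_dataloader, val_dataloader)           # model and DataLoaders as above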

In terms of performance, Lightning adds a thin layer of overhead but uses PyTorch under the hood, so training speed is usually comparable to raw PyTorch. One has to be mindful of some differences (for instance, Lightning by default will try to use 32-bit precision unless configured for mixed precision, etc.). The SoftwareMill comparison we’ll discuss later found that Lightning can have a small overhead in certain scenarios (possibly due to additional callbacks/logging), but generally it aims to match PyTorch’s efficiency. Overall, PyTorch Lightning can dramatically improve productivity for researchers by automating the engineering aspects of model training while preserving the full flexibility of PyTorch when you need to dive into customization.

JAX (and Flax)

JAX is a relatively newer machine learning framework (open-sourced by Google in 2018) that has been attracting a lot of attention in the research community for high-performance computing and advanced automatic differentiation capabilities. At its core, JAX is not a deep learning framework in the sense of having a built-in neural network API; rather, it’s a numerical computing library that blends NumPy-like ease of use with accelerator-backed execution and automatic differentiation. JAX’s tagline is essentially "NumPy on GPUs and TPUs, with grad". You write code that looks like Python/Numpy code, and JAX can automatically JIT-compile that code (using the XLA compiler) to run efficiently on CPU, GPU, or TPU. It also has powerful primitives for parallelizing computations across devices (e.g., pmap for SPMD parallel programming) and for vectorizing operations (via vmap).

In practical terms, JAX enables a very functional programming style for machine learning. You can define functions for your model forward pass and loss, and then use jax.grad or jax.jit to get a compiled gradient function. This is different from PyTorch or TensorFlow eager mode, where the gradient calculations are implicit – in JAX you explicitly transform functions. The benefit is that those transformations (like JIT compilation, vectorization, etc.) can lead to significant speed-ups by fusion of operations and parallel execution. Indeed, JAX programs often run extremely fast, sometimes even outperforming equivalent PyTorch or TensorFlow code, especially on TPUs (which JAX was designed to leverage efficiently). Another killer feature is JAX’s support for automatic vectorization – using vmap, one can automatically turn a function that processes a single example into one that processes a batch of examples, without writing loops, and it will be as efficient as if you’d hand-written the batch code.
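
A tiny sketch of vmap in action: write the function for a single example, then ask JAX to map it over a batch axis (the shapes here are just illustrative).

import jax
import jax.numpy as jnp

def predict_single(w, x):                # x has shape (features,)
    return jnp.dot(w, x)

predict_batch = jax.vmap(predict_single, in_axes=(None, 0))   # vectorize over the batch axis of x

w = jnp.ones(20)
X = jnp.ones((32, 20))                   # a batch of 32 examples
y = predict_batch(w, X)                  # shape (32,), with no explicit Python loop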

However, JAX by itself is quite low-level for neural network building. This is where libraries like Flax and Haiku come in. Flax is a high-level neural network library built on JAX (by Google Research) that provides modules (like Dense, Conv, etc.) and a convenient way to manage model parameters, similar to how you’d use layers in Keras or PyTorch. Flax emphasizes a functional core (your model computations are pure functions) with an object-oriented wrapper to manage state (parameters and random number generators). Using Flax, you can define a model class and then use it within JAX’s functional paradigm. Haiku (by DeepMind) is another library with a slightly different approach to managing state in JAX models, but similarly aimed at making neural network building easier.
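
A minimal Flax (flax.linen) sketch of this pattern, with illustrative layer sizes: the module declares the computation, while parameters are created by init() and passed explicitly to apply().

import jax
import jax.numpy as jnp
import flax.linen as nn

class MLP(nn.Module):
    @nn.compact
    def __call__(self, x):
        x = nn.relu(nn.Dense(64)(x))
        return nn.Dense(1)(x)

model = MLP()
params = model.init(jax.random.PRNGKey(0), jnp.ones((1, 20)))   # parameters as an explicit pytree
out = model.apply(params, jnp.ones((8, 20)))                    # pure-function forward pass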

One of the reasons JAX has gained prominence is that it has been the “secret sauce” behind some recent Google and DeepMind breakthroughs. For example, the famous AlphaFold 2 system for protein folding by DeepMind was implemented in JAX/Flax. Google’s large language model efforts (such as parts of the PaLM model) and other cutting-edge research projects have extensively used JAX because of its performance and ability to scale to very large models across many TPU devices. Indeed, combining JAX with Flax has been described as the “primary weapon of choice” for both Google AI and DeepMind teams. This speaks to JAX’s strength in a research setting where you might want to write novel math-heavy model code and still execute it with optimized performance on tens or hundreds of accelerators in parallel.

To give a flavor of JAX, consider that you can write something like:

import jax
import jax.numpy as jnp

def relu(x):
    return jnp.maximum(x, 0)

# A simple two-layer perceptron forward pass
def model_forward(params, x):
    W1, b1, W2, b2 = params
    hidden = relu(x @ W1 + b1)
    logits = hidden @ W2 + b2
    return logits

# Compute loss for a batch
def loss_fn(params, batch):
    x, y = batch
    preds = model_forward(params, x)
    return jnp.mean(jnp.square(preds - y))  # MSE loss

# Get gradient function
grad_loss_fn = jax.grad(loss_fn)

# JIT compile the forward pass for efficiency
model_forward_compiled = jax.jit(model_forward)

This code is very close to pure numpy, but now grad_loss_fn can be used to get gradients of the loss w.r.t. parameters, and jax.jit will compile model_forward to a fused, optimized sequence of operations on whatever device (CPU, GPU, TPU) you run it on. JAX’s approach requires a different way of thinking (one typically passes around immutable parameter dictionaries, for example, rather than having stateful model objects), which has a learning curve for those coming from PyTorch/TensorFlow. But it offers maximal performance and has some of the most advanced autodiff features (for instance, JAX can do forward-mode autodiff and even higher-order differentiation easily).
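
Continuing the sketch above, a training step in this style explicitly initializes a parameter pytree and returns updated parameters rather than mutating state (the shapes and the plain SGD update are illustrative):

import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
params = (
    0.01 * jax.random.normal(k1, (20, 64)), jnp.zeros(64),    # W1, b1
    0.01 * jax.random.normal(k2, (64, 1)),  jnp.zeros(1),     # W2, b2
)

batch = (jnp.ones((32, 20)), jnp.zeros((32, 1)))              # dummy (x, y) batch
grads = grad_loss_fn(params, batch)                           # same pytree structure as params
params = tuple(p - 0.01 * g for p, g in zip(params, grads))   # functional SGD update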

In terms of ecosystem, JAX is still younger and has fewer pre-built models available compared to TensorFlow/PyTorch. There are efforts like the FLAX models repository and DeepMind’s DM-Hub that provide some pretrained models, and the Hugging Face library has started to add JAX/Flax versions of popular models. But as of mid-2020s, the pretrained model zoo in JAX is not as extensive. This is a noted drawback: many models (and their weights) are in PyTorch or TensorFlow, so using them in JAX may require conversion or re-training. Tools like Hugging Face’s Transformers can help by providing conversion utilities, but it’s something to be aware of. Nonetheless, JAX is increasingly being used in scientific computing and large-scale deep learning research where its capabilities shine. It might be less common (so far) in typical industry deployments or small-scale projects due to the lack of out-of-the-box pretrained models and the fact that it’s Python-only (no official multi-language support or mobile deployment story yet).

To sum up, JAX + Flax offers an exciting blend of theoretical elegance and high performance. It is pushing the boundaries in research (especially within Google/DeepMind), and we can expect some of its ideas (like JIT compilation of models) to influence other frameworks as well. In fact, PyTorch’s recent advancements in JIT and TensorFlow’s XLA efforts are converging towards some of the strengths that JAX demonstrated.

ONNX and Interoperability

While the above frameworks (TensorFlow, PyTorch, JAX, etc.) are primarily concerned with training and defining models, ONNX (Open Neural Network Exchange) addresses the problem of portability and deployment of trained models. Introduced in 2017 through a collaboration between Facebook and Microsoft, ONNX is essentially an open standard for representing machine learning models in a common format. The idea is simple but powerful: if you can export your trained model (be it from PyTorch, TensorFlow, or other frameworks) into the ONNX format, you can then deploy or run that model using any ONNX-compatible runtime, decoupling the training environment from the inference environment. This helps avoid “lock-in” to a specific framework for deployment. For example, you might train a model in PyTorch because of its ease of use, but you might want to embed the model into a mobile app or a high-performance C++ application. Rather than loading PyTorch on the mobile device, you could export the model to ONNX and use a lightweight ONNX Runtime on the device to execute it.

ONNX defines a computational graph format with a standardized set of operators (like convolution, Relu, matmul, etc.) that most frameworks support. Exporters exist for all major frameworks (PyTorch has torch.onnx.export, TensorFlow can convert to ONNX via tf2onnx or through Keras, etc.). Once in ONNX format, a variety of runtimes can run the model: ONNX Runtime (an optimized engine by Microsoft), NVIDIA’s TensorRT (which can import ONNX models and compile them for GPUs), Intel’s OpenVINO for CPUs, or even Web frameworks. This means a model trained in one framework can be deployed in many environments without needing that original framework. ONNX is especially useful for deploying to mobile and edge devices. For instance, consider an example use-case: you train a food image classification CNN in PyTorch on your workstation, and you want to deploy it inside an iOS app. iOS expects models in CoreML format for efficient inference. Using ONNX as an intermediary, you can export the PyTorch model to model.onnx, then use a converter (ONNX -> CoreML) to get a .mlmodel file for iOS. This way, you didn't need to reimplement the model in a different framework for deployment – ONNX handled the translation.
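
A hedged sketch of that export step (it assumes `model` is the trained PyTorch CNN and that a single 224×224 RGB image is a representative input):

import torch

model.eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},    # allow a variable batch size at inference time
)
# model.onnx can now be converted further (e.g., ONNX -> CoreML) or loaded by an ONNX runtime.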

Another advantage of ONNX is in optimizing models. The ONNX Runtime and tools can perform graph optimizations (constant folding, operator fusion, etc.) independent of the training framework. There’s also ONNX Model Zoo which provides a collection of pre-trained models in ONNX format ready to use with various runtimes. Major cloud providers have embraced ONNX for production; for example, Microsoft Azure’s machine learning services often use ONNX for deploying models, and Windows even includes ONNX runtime for AI tasks. NVIDIA’s TensorRT uses ONNX as the primary input format to convert models for high-speed inference on GPUs. Essentially, ONNX has become the lingua franca of model interchange.

It’s important to note that ONNX is focused on the inference aspect. It does not cover training a model (although there are extensions like ONNX-ML and ONNX training format, these are not widely used compared to the inference use-case). ONNX supports not just deep learning models but also classical ML models to some extent. The format itself is evolving; new operators are added as needed to support complex models (for example, to support BERT or other transformer models, new operations were introduced). The community keeps ONNX up-to-date with the latest developments in model architectures.

In practice, using ONNX usually comes down to a few steps: export your model, then load it in the target environment. For example, in PyTorch: torch.onnx.export(model, dummy_input, "model.onnx") will save the model. Then, in a C++ app or Python deployment server, one could use ONNX Runtime like:

import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")              # load the exported model
outputs = sess.run(None, {"input": input_data_numpy})  # feed inputs by name; None returns all outputs

This would execute the model (potentially using hardware accelerators if available). ONNX Runtime is highly optimized and can often approach the performance of framework-specific inference engines. Moreover, ONNX makes it easier to deploy to different hardware – for instance, you could take the same ONNX model and run it on an NVIDIA Jetson (with TensorRT), on an Intel CPU (with OpenVINO), or in a browser (with ONNX.js or WebAssembly backend), all without retraining or altering the model code. This flexibility is extremely valuable for companies that need to deploy AI models across various platforms.
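
With ONNX Runtime, for example, the target hardware is chosen through execution providers, with automatic fallback to CPU when an accelerator is unavailable (a sketch; `input_data_numpy` as in the snippet above):

import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)
outputs = sess.run(None, {"input": input_data_numpy})   # same model file, different hardware backends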

To illustrate ONNX conceptually, Figure 2 shows a simple linear regression represented as an ONNX computation graph, where the operations are nodes (Matrix Multiplication and Addition) and the data (inputs, weights, bias) flows along edges in the graph.

Figure 2: An example ONNX computational graph for a linear regression model. The model y = X·W + b is represented with a MatMul node and an Add node. ONNX provides a common “language” of tensor operations so that models can be saved and run independently of the original training framework.

In summary, ONNX does not replace TensorFlow or PyTorch, but rather complements them by facilitating interoperability and efficient deployment. It has become a standard component in the workflow of moving models from the lab to production, especially when the training environment and deployment environment differ. For instance, many teams train large models in PyTorch, use ONNX to convert them, and then deploy with ONNX Runtime or TensorRT for maximum runtime performance. This separation of concerns allows each tool to do what it’s best at – PyTorch/TensorFlow for training, ONNX for inference – thereby streamlining the end-to-end machine learning pipeline.

Feature Comparison of Frameworks

Having discussed each framework individually, we now compare their features and typical use cases side by side. Each framework has its own philosophy and strengths, which often makes it preferable in certain scenarios. Table 1 provides a high-level comparison:

Table 1: High-level comparison of deep learning frameworks.

TensorFlow (2.x) – initial release 2015 (Google Brain)
  • Graph execution model: static graph or eager (dynamic) – supports both modes in TF2.
  • Notable features & ecosystem: rich ecosystem – TensorBoard for visualization, TFX for pipelines, TF Lite for mobile, TF Serving for deployment; multi-language support (Python, C++, Java, JavaScript). High-level API: tf.keras integrated. Distributed training: built-in support (MirroredStrategy, TPU Strategy).
  • Typical use cases & domains: production environments requiring scalability (cloud services, web services), cross-platform deployments (Android/embedded via TF Lite, browser via TF.js). Used in Google products (Search, Photos), healthcare AI (medical imaging diagnostics), finance (fraud detection, risk analysis). Well-suited for large-scale systems and teams that benefit from a full pipeline and tooling.

PyTorch – initial release 2016 (Facebook AI Research)
  • Graph execution model: dynamic graph (eager execution; option to script/trace for a static graph).
  • Notable features & ecosystem: Pythonic design – intuitive imperative style, easy debugging. Strong research community – many academic papers and latest models use PyTorch. Growing ecosystem – TorchVision, TorchText, etc., plus community libraries (Hugging Face, fastai, etc.). Deployment: TorchScript & ONNX export for production, native PyTorch Mobile support. Distributed training: yes (DataParallel, DistributedDataParallel).
  • Typical use cases & domains: research and rapid prototyping (experiments with novel architectures), especially in academia (75%+ of new research uses PyTorch). Also used in production for computer vision and NLP at companies like Meta, Microsoft, and Tesla (e.g., Tesla Autopilot uses PyTorch). Suitable for projects that require flexibility in model development and a Python-friendly workflow.

JAX + Flax – initial release ~2018 (Google Research)
  • Graph execution model: dynamic functional model with XLA JIT compilation (traces and compiles end-to-end computations).
  • Notable features & ecosystem: high-performance computing – optimized via XLA, great for massive parallelism on GPU/TPU. Automatic differentiation and vectorization – grad, vmap, pmap allow advanced usage. Functional style – encourages pure functions, making reasoning about code easier (but with a learning curve). Ecosystem: Flax (official NN library), Haiku (DeepMind’s NN lib), limited pre-trained model availability compared to TF/PyTorch.
  • Typical use cases & domains: cutting-edge research requiring speed and scalability (e.g., large-scale models on TPUs). Used internally by Google/DeepMind for state-of-the-art projects (AlphaFold, large language models). Great for experiments in meta-learning, reinforcement learning, or any scenario where fast compilation and dispatch across CPU/GPU/TPU are needed. Less common in traditional deployment settings (mainly Python-only, with fewer production tools), but excellent for experimenting with new training algorithms and extremely large models.

Keras – initial release 2015 (open-source, François Chollet)
  • Graph execution model: runs on top of other frameworks – TensorFlow, and with Keras 3.0 also JAX or PyTorch as the backend.
  • Notable features & ecosystem: user-friendly API – high-level building blocks (Layers, Models) and methods like model.fit. Rapid prototyping – minimal code to get a model working. Modularity – plug-and-play layers, easy experimentation. Now multi-backend: can utilize TF, JAX, or PyTorch engines. Comes with many pre-built layers and some pre-trained models (via Keras Applications).
  • Typical use cases & domains: education and beginner projects, or fast prototyping in research. Often used in Kaggle competitions and early-stage model development because of its simplicity. With tf.keras, it serves as the interface in many production TensorFlow projects. Suitable when a developer wants to quickly build a model without worrying about low-level details – e.g., a data scientist building a proof-of-concept model for tabular data or a small image classification task.

PyTorch Lightning – initial release 2019 (Lightning AI)
  • Graph execution model: wraps PyTorch’s eager execution and abstracts the training loop.
  • Notable features & ecosystem: training loop abstraction – handles epochs, batching, optimizers, etc. Built-in best practices – automatic checkpointing, logging, early stopping. Scalability – easy multi-GPU and multi-node training flags. Readable code – enforces structure (LightningModule) leading to cleaner, modular code. Integrates with the PyTorch ecosystem (you still write PyTorch models).
  • Typical use cases & domains: research teams that want to reduce boilerplate and standardize experimentation. Useful in academic labs and industry R&D groups for quick iteration on PyTorch models – e.g., training dozens of models with varying hyperparameters or scaling up a prototype to multi-GPU training with minimal code changes. Also gaining traction for production pipelines where consistent training procedures are needed (e.g., retraining models periodically with new data). Not typically used for inference deployment (you would export the underlying PyTorch model).

ONNX – initial release 2017 (Facebook & Microsoft)
  • Graph execution model: static computational graph (an intermediate representation of the model).
  • Notable features & ecosystem: framework-agnostic format – supports models from PyTorch, TensorFlow, Keras, MXNet, etc. Interoperability – ONNX Runtime, TensorRT, OpenVINO, CoreML, and others can execute ONNX models. Optimizations – graph optimizers and hardware-specific acceleration (e.g., NVIDIA uses ONNX as input to the TensorRT compiler). Extensibility – continually updated operator set to accommodate new model types.
  • Typical use cases & domains: model deployment and cross-framework compatibility rather than training. Employed when you need to deploy a trained model on a different platform than it was trained on – e.g., a PyTorch-trained model running in a C++ server or a mobile app, or optimizing a model for latency using TensorRT. Also useful for model interoperability in a multi-framework organization (passing models between teams using different tools). ONNX is used across industries for deploying AI models on diverse hardware – from cloud servers to edge devices – ensuring the model behaves consistently everywhere.

Table 1 compares the frameworks on several axes. A few observations stand out:

  • Performance: In terms of raw speed, all frameworks now utilize accelerated libraries (like cuDNN, cuBLAS) under the hood for common operations. Specific differences do appear: for example, JAX’s XLA compilation can yield very high performance for large batch computations or TPU workloads, sometimes outperforming PyTorch or TensorFlow in similar tasks. TensorFlow’s static graph mode (still accessible via tf.function in TF2; see the short sketch after this list) can also achieve optimizations that dynamic eager execution might miss (though TF2 bridges that gap with AutoGraph). PyTorch’s latest 2.0 release with the TorchInductor compiler is closing the performance gap by introducing more graph-level optimizations. Empirical benchmarks (such as one by SoftwareMill) have shown that performance can depend on the scenario: in an I/O-heavy training job (streaming data from disk), TensorFlow was faster than vanilla PyTorch due to more efficient data pipelining, whereas for in-memory workloads PyTorch was very competitive or faster. The takeaway is that performance differences are often workload-dependent and all major frameworks are within the same ballpark, with each offering ways to optimize (JAX via JIT, PyTorch via TorchScript/Inductor, TensorFlow via XLA/Graph mode).

  • Usability and Flexibility: PyTorch (and by extension, PyTorch Lightning) arguably leads in flexibility and a gentle learning curve – its imperative style and Pythonic design make it easier for newcomers to grasp and for researchers to iterate on complex ideas. TensorFlow 1.x was considered more difficult to learn due to static graphs, but TensorFlow 2.x with eager execution and Keras feels much more accessible; still, some legacy complexity remains (for instance, mixing low-level tf APIs with Keras models can confuse new users). Keras is the easiest of all in terms of API simplicity, but that comes at the cost of some flexibility – if your model doesn’t fit the standard layer stack paradigm, you might drop to the backend framework code. JAX requires comfort with functional programming and is best suited for those with a strong computer science or math background who need full control and performance. In summary: for beginners, Keras (or PyTorch with high-level libraries) is friendly; for researchers, PyTorch is often a go-to for flexibility; for software engineers and production, TensorFlow’s comprehensive tools can be advantageous; and for advanced researchers, JAX offers ultimate control and performance at the cost of a higher learning curve.

  • Ecosystem and Community: Both TensorFlow and PyTorch have vast ecosystems. TensorFlow, being older and backed by Google, has integration with things like TensorFlow Hub (a repository of models), TFX for end-to-end ML pipelines, and a large collection of pre-trained models in the TF/Keras format. PyTorch’s community-driven ecosystem (with contributions from companies like Facebook/Meta, Microsoft, Amazon, Hugging Face, etc.) has resulted in many cutting-edge models and libraries being PyTorch-first. For example, most implementations on paperswithcode or open-source model checkpoints you find on GitHub are in PyTorch format nowadays, which is a big draw for that framework. JAX’s ecosystem is growing; Google has released some libraries (Flax, Optax for optimization, etc.), and DeepMind has released others (Haiku, Chex), but community contributions are fewer simply because the user base is smaller than TF/PyTorch. ONNX’s ecosystem is a bit different – it’s supported by many companies in their deployment tools, but you don’t “develop” models in ONNX; you use it alongside either TF or PyTorch. The ONNX community maintains converters and ensures new ops (operations) are added to keep up with framework advances.

  • Hardware Support: All major frameworks support GPUs (NVIDIA GPUs via CUDA; AMD GPUs are supported by TensorFlow (with ROCm) and by PyTorch to some extent on Linux, though NVIDIA is far more common). For TPUs: TensorFlow and JAX have native support (Google’s TPUs were initially built with TensorFlow in mind, and JAX was designed around TPUs), whereas PyTorch uses the XLA backend to run on TPUs. In terms of mobile and edge: TensorFlow Lite is a dedicated solution for mobile/embedded, converting models into optimized forms (with quantization support) for Android, iOS, even microcontrollers. PyTorch can export models to C++ and has mobile runtimes, but the mobile optimization tooling is slightly less developed than TF Lite (though improving). ONNX shines here by allowing one to use whatever runtime is best for the hardware – e.g., ONNX Runtime with NNAPI on Android, or CoreML on Apple devices. There are also specialized accelerators (like FPGA-based inference chips, or AI accelerators in smartphones) that often provide ONNX support to execute models. This means if maximum deployment portability is needed, training in PyTorch then exporting to ONNX is a common pipeline in industry.
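
As referenced in the performance point above, here is a short sketch of opting into graph execution in TF2: the decorated function is traced into a graph (and optionally compiled with XLA via jit_compile=True) while the surrounding code stays eager. The function and shapes are illustrative.

import tensorflow as tf

@tf.function(jit_compile=True)        # trace once, optionally compile with XLA
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal((32, 20))
w = tf.random.normal((20, 64))
b = tf.zeros((64,))
y = dense_relu(x, w, b)               # first call traces/compiles; later calls reuse the graph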

To concretize these comparisons with an example: imagine a project to build a deep learning model for autonomous driving. One team might choose PyTorch to design and train their models due to its flexibility (perhaps they iterate quickly on new neural network architectures for perception). They utilize libraries like TorchVision for pretrained backbones and maybe Lightning to manage training across multiple GPUs. Another team, focused on deployment in the car (an embedded system), might export these models via ONNX and run them on an optimized runtime on the car’s hardware (where C++ and efficiency matter). Meanwhile, a research group in the same company exploring reinforcement learning for planning might use JAX to take advantage of TPU pods and fast experimentation with novel optimization algorithms. And for a simpler sub-problem, say detecting driver distraction using a webcam, a developer might prototype a solution in Keras because of how quickly they can get something working. This hypothetical scenario highlights that different frameworks can coexist, each suited to different stages or aspects of a complex project.

Practical Tips and Recent Developments

To get the most out of these frameworks, here are some practical considerations and recent trends:

  • Mixing and converting frameworks: It’s increasingly common to train in one framework and deploy in another, as discussed with ONNX. Even without ONNX, there are conversion tools (e.g., a tf.keras model can be saved in TensorFlow’s SavedModel format for serving, and Hugging Face provides conversion scripts between PyTorch and TensorFlow for popular models). If you use PyTorch and need to deploy on TensorFlow Serving, you might convert the PyTorch model to ONNX, then to a TensorFlow SavedModel, or directly use ONNX Runtime. Plan your pipeline early – e.g., if mobile deployment is a goal, design your model in a way that is exportable (avoid ops that are not supported by TFLite or ONNX). Both TensorFlow and PyTorch have lists of supported operations for their mobile/embedded scenarios.

  • Taking advantage of hardware accelerators: If training on TPUs is desired (for instance, using Google’s free TPU pods on Kaggle or Google Colab for research), know that TensorFlow and JAX will have smoother support. PyTorch on TPU (using torch_xla) works but may require more effort and currently lags a bit in features. On GPUs, all frameworks are fine, but for using multiple GPUs, PyTorch Lightning or TensorFlow’s high-level APIs can save you effort compared to writing your own distribution code.

  • Use the ecosystem libraries: Don’t re-invent the wheel. Need an optimizer like RAdam or LAMB? – TensorFlow Addons or PyTorch’s torch.optim (or community extension packages) likely have it. Need a certain layer or model? – Check the model hubs (TensorFlow Hub, PyTorch Hub, Hugging Face, etc.) where you might find ready implementations. This can greatly accelerate development.

  • Monitor resource usage: Different frameworks have different default memory allocation strategies. For instance, TensorFlow by default tries to grab a large chunk of GPU memory upfront (to avoid fragmentation), which is “greedy”, while PyTorch uses a more “incremental” approach, grabbing more GPU memory as needed. JAX also allocates large buffers due to XLA. Knowing this, if you are mixing frameworks (say running a PyTorch script and a TensorFlow script on the same GPU sequentially), be sure to properly release memory or configure limits (TensorFlow allows setting tf.config.experimental.set_memory_growth to avoid grabbing all memory; see the sketch after this list). In multi-tenant environments (like shared GPUs), these differences can cause conflicts – e.g., TensorFlow might pre-allocate memory that PyTorch then cannot use. Tools and flags exist to tweak these behaviors.

  • Latest developments: Both TensorFlow and PyTorch are incorporating ideas from each other and from JAX. PyTorch 2.0 (released 2023) introduced the torch.compile function which uses an internal JIT to speed up models – early tests show significant speedups for many models, bringing PyTorch closer to compiled-graph performance while keeping the eager interface. TensorFlow, for its part, has been working on TensorFlow 2.X where XLA compilation can be applied more seamlessly and where Keras is becoming backend-agnostic (as noted with Keras 3.0 supporting multiple backends). We also see ONNX expanding to cover new domains – e.g., ONNX-ML for traditional ML, and even preliminary support for training graphs (though that’s nascent). There’s also the rise of MLIR (Machine Learning IR) from Google, which is influencing how frameworks compile and optimize computations under the hood (TensorFlow is built on MLIR in part, and PyTorch is also aligning some compiler work with it). For the end user, these developments mean better performance and more flexibility, but the high-level usage of frameworks remains similar.

  • Visualization and debugging: One advantage TensorFlow long had was TensorBoard for visualizing training curves, model graphs, etc. PyTorch now can use TensorBoard as well (via a plugin or torch.utils.tensorboard), and there are other third-party tools (like Weights & Biases) that work with any framework. If you are doing research, these tools are framework-agnostic and very helpful. Also, debugging dynamic models is easiest in PyTorch (just use Python’s pdb or print statements). For TensorFlow 2 eager, similar debugging is possible, but if you use tf.function (which makes parts of your code graph/compiled), it can be trickier to debug inside those functions. JAX is the hardest to debug in the traditional sense, because once you JIT compile a function, you can’t step through it in Python. The common practice there is to test your code without JIT on small inputs to ensure correctness before adding the jax.jit for speed.
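
And, as referenced in the resource-usage tip above, a short sketch of relaxing TensorFlow’s default GPU pre-allocation so the device can be shared with other processes:

import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)   # allocate GPU memory on demand instead of upfront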

Finally, it’s worth highlighting a few real-world application examples across sectors to see which frameworks shine where:

  • Healthcare: Google’s retinal disease detection (TensorFlow) and DeepMind’s protein folding (JAX) are two flagship examples. In hospitals, startups have used Keras/TensorFlow to quickly prototype models for, say, tumor detection in MRI scans (Keras’ ease can be valuable for domain experts who are not full-time engineers). PyTorch, with MONAI, is heavily used in medical imaging research for tasks like segmentation of CT scans, etc., leveraging the latest CNN and transformer models.

  • Finance: Banks often prefer frameworks with strong support and long-term stability. TensorFlow has seen uptake in fraud detection systems (as described, using its anomaly detection capabilities and streaming data pipelines). But PyTorch is also used, for example, by fintech companies building credit scoring models or high-frequency trading models that rely on neural networks. The choice may boil down to team expertise – if a team has more researchers, they might prefer PyTorch; if they have more software engineers or need to integrate with Java, they might lean TensorFlow (since TF can be served in Java or C++).

  • Autonomous Vehicles: As noted, Tesla uses PyTorch for their vision networks. Waymo (Google’s self-driving car project) has historically used TensorFlow for 3D perception and Lidar data processing (and they open-sourced parts of their dataset with TensorFlow code). There is also a lot of ONNX here: models trained in PyTorch might be converted to ONNX for deployment in cars where a C++ engine runs them for latency reasons. Real-time systems like self-driving require optimized inference – TensorRT (via ONNX) or TorchScript can be used to achieve the necessary speed on car-mounted GPUs.

  • Natural Language Processing: This domain saw a shift from TensorFlow to PyTorch around the time of the Transformers revolution. Google’s original BERT was in TensorFlow, but OpenAI’s and Facebook’s language models (GPT, RoBERTa, etc.) were in PyTorch. Today, Hugging Face provides both TF and PyTorch implementations for most models, but PyTorch is slightly more popular. That said, production NLP services (like translation APIs, etc.) inside Google likely still run on TensorFlow. Also, JAX has been used in research for large language models – e.g., the recent Google Brain/DeepMind large models have JAX implementations. An interesting development is that some of these large models are trained with JAX for speed, then converted (via ONNX or direct weight porting) to PyTorch for release to the public (because more users are familiar with PyTorch).

  • Robotics and others: In robotics, reinforcement learning and control, we see a mix: researchers love PyTorch for RL (with libraries like PyTorch RL), but Google’s Dopamine RL was in TensorFlow, and JAX is gaining ground for some simulation tasks. In edge AI (like running on smartphones, drones, IoT devices), TensorFlow Lite and ONNX are heavily used to deploy models that might have been trained in any framework.

Given how fast this field moves, anyone working with these frameworks should stay updated through official release notes and community forums. Both TensorFlow and PyTorch have major conferences and forums (e.g., the “TensorFlow Dev Summit” and “PyTorch Conference”) which announce new features. As of 2025, one can expect further convergence: frameworks becoming more interoperable, easier model portability, and continuous improvements in performance (perhaps leveraging compilers and specialized hardware even more). The good news is that as a practitioner, these frameworks are all open-source and free, so one can try each and use the best tool for the job. Often the “best” framework is determined by the context: the team’s expertise, the problem requirements, and deployment constraints.

Conclusion

Deep learning frameworks have undeniably been a catalyst in the rapid progress of AI in the last decade. From the early days of Theano (hand-coding layer operations) to today’s high-level APIs and model zoos, we have come a long way in making deep learning both accessible and powerful. In this post, we explored Section 4.1, Deep Learning Frameworks, in depth, covering TensorFlow, Keras, PyTorch, PyTorch Lightning, JAX/Flax, and ONNX. We discussed their theoretical underpinnings (static vs. dynamic graphs, eager vs. JIT execution) and provided practical insights into their performance characteristics, usability, and ecosystems. We also highlighted real-world usage across sectors: how these frameworks are employed in healthcare diagnostics, financial fraud detection, autonomous driving, and more. A balanced understanding of both the theoretical foundations (e.g., how autograd works or why XLA compilation matters) and the practical implementation details (like writing a training loop or exporting a model for mobile) is crucial for choosing the right framework for a task.

It’s clear that no single framework is “best” for all purposes. Instead, each has carved out its niche: TensorFlow shines in production environments and end-to-end solutions, PyTorch in research and fast prototyping, JAX in cutting-edge high-performance research, Keras in ease-of-use for simpler projects, Lightning in structuring PyTorch experiments, and ONNX in bridging the gap between training and deployment. The great thing is these tools are increasingly interoperable – demonstrating the community’s push towards flexibility and avoiding siloed ecosystems. For someone starting in deep learning today, the recommendation often is: try out PyTorch and TensorFlow (and maybe Keras) on a toy problem, see which feels more natural, but be aware of what the other offers. As you advance, experiment with JAX to understand the next-generation approach to high-performance ML. And always consider the deployment needs: a brilliant model is only useful if it can be effectively deployed, which is where formats like ONNX or frameworks’ own deployment tools come into play.

In closing, deep learning frameworks are like languages – each with its grammar and quirks – but all expressive enough to translate our ideas into reality. Mastering more than one can only strengthen an AI practitioner’s ability to tackle diverse problems. With continuous innovation in this space (from better compilers to new hardware accelerators), these frameworks will also continue to evolve. The logical structure and chronology we traced (from Theano to modern libraries) shows a clear trajectory toward more user-friendly and efficient tools. Keeping an eye on the latest releases and community trends will ensure you leverage the best of what these frameworks have to offer. Happy modeling!

References (IEEE Style)

  1. R. Pytel, “ML Engineer comparison of PyTorch, TensorFlow, JAX, and Flax,” SoftwareMill Tech Blog, 24 Sep 2024. [Online]. Available: https://softwaremill.com/ml-engineer-comparison-of-pytorch-tensorflow-jax-and-flax/

  2. C. Morris, “Tesla’s Andrej Karpathy Discusses Autopilot, Full Self-Driving, PyTorch,” InsideEVs, Nov. 20, 2019. [Online]. Available: https://insideevs.com/news/383390/video-tesla-andrej-karpathy-autopilot-full-self-driving/

  3. Databricks, “TensorFlow,” Databricks Glossary, (accessed Jan. 10, 2025). [Online]. Available: https://www.databricks.com/glossary/tensorflow-guide

  4. V. Kurama and B. Whitfield, “PyTorch vs. TensorFlow: Key Differences to Know for Deep Learning,” BuiltIn, Updated Oct. 23, 2024. [Online]. Available: https://builtin.com/data-science/pytorch-vs-tensorflow

  5. X. Ji, “How TensorFlow is reshaping banking?” BytePlus Tech Blog, May 10, 2025. [Online]. Available: https://www.byteplus.com/en/topic/509361

  6. L. Peng and V. Gulshan, “Deep Learning for Detection of Diabetic Eye Disease,” Google AI Blog, Nov. 29, 2016. [Online]. Available: https://research.google/blog/deep-learning-for-detection-of-diabetic-eye-disease/

  7. PyTorch/XLA Developers, “Cloud TPUs for PyTorch and JAX,” PyTorch Documentation, version 2.8.0, 2025. [Online]. Available: https://docs.pytorch.org/xla/master/accelerators/tpu.html

  8. Dillon, “What every ML/AI developer should know about ONNX,” DigitalOcean Community, updated Sep. 25, 2024. [Online]. Available: https://www.digitalocean.com/community/tutorials/what-every-ml-ai-developer-should-know-about-onnx

  9. NVIDIA, “Example Deployment Using ONNX – TensorRT-RTX Documentation,” Version 2025.1, Jun. 11, 2025. [Online]. Available: https://docs.nvidia.com/deeplearning/tensorrt-rtx/latest/installing-tensorrt-rtx/example-deployment.html

  10. L. Ngo, “Deep Learning Library Framework (Workshop Notes),” Clemson Univ. (Creative Commons 4.0 Licensed), 2022. [Online]. Available: https://clemsonciti.github.io/rcde_workshops/python_deep_learning/02-Deep-Learning-Framework.html

  11. G. Boesch, “Pytorch vs TensorFlow: A Head-to-Head Comparison,” Viso.ai Blog, Dec. 4, 2023. [Online]. Available: https://viso.ai/deep-learning/pytorch-vs-tensorflow/

  12. R. Pytel, “Which Deep Learning Framework is Best? (Qualitative Analysis),” SoftwareMill Tech Blog, Sep. 2024. [Online]. Available: see [1].
