As artificial intelligence (AI) continues to shape the future of technology, the need for more efficient and scalable AI solutions is greater than ever. The Falcon 2 11B model, developed by the Technology Innovation Institute (TII), is one such solution, offering powerful AI capabilities with 11 billion parameters. In this guide, we walk through optimizing Falcon 2 11B on Amazon EC2 c7i instances, detailing the use of INT8 and INT4 quantization techniques and providing benchmarks to help you tune performance for real-time AI inference applications.
What is the Falcon 2 11B Model?
The Falcon 2 11B is a large-scale, high-performance foundation model developed by TII, part of their next-generation AI model series. This model has been trained on a massive 5.5 trillion token dataset and supports various languages such as English, French, Spanish, German, and Portuguese. With 11 billion parameters, Falcon 2 11B excels in autoregressive tasks and is optimized for both accuracy and efficiency, making it an ideal candidate for diverse AI applications ranging from natural language processing (NLP) to multilingual text generation.
What sets Falcon 2 11B apart is its ability to run efficiently on CPU-based instances. Traditionally, large models like Falcon 2 11B required GPUs to perform at scale. However, with the support of Intel Advanced Matrix Extensions (Intel AMX) on EC2 c7i instances, developers can now deploy this model on cost-effective and widely available hardware, without sacrificing performance.
Why Choose Amazon EC2 c7i Instances?
Amazon EC2 c7i instances are powered by 4th Generation Intel Xeon Scalable processors, which include support for Intel Advanced Matrix Extensions (Intel AMX). Intel AMX is a built-in accelerator for the matrix operations at the core of AI and deep learning workloads. By leveraging Intel AMX, EC2 c7i instances can significantly accelerate AI inference, making them a strong fit for running large models like Falcon 2 11B.
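If you want to confirm that the instance you launched actually exposes Intel AMX, a quick check is to look for the amx_* CPU flags. The following is a minimal sketch assuming a Linux guest, where the kernel lists CPU features in /proc/cpuinfo:

import re

# Read the CPU feature flags advertised by the kernel (Linux only).
with open("/proc/cpuinfo") as f:
    cpuinfo = f.read()

flags = set(re.search(r"^flags\s*:\s*(.*)$", cpuinfo, re.MULTILINE).group(1).split())

# On c7i instances (4th Gen Xeon) you would expect amx_bf16, amx_int8, and amx_tile.
print(sorted(flag for flag in flags if flag.startswith("amx")))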
EC2 c7i instances also provide cost-efficiency, which is crucial for developers looking to scale their AI applications without incurring hefty infrastructure costs. Running Falcon 2 11B on EC2 c7i instances allows you to tap into high-performance computing resources without the complexity and expense of GPUs.
The Role of Model Quantization in Real-Time Use Cases
Quantization plays a crucial role in optimizing AI models, particularly when it comes to reducing latency and memory footprint. Falcon 2 11B is supported by OpenVINO, a toolkit that allows developers to leverage INT8 and INT4 quantization techniques to make the model more efficient for real-time applications such as code generation, retrieval-augmented generation (RAG), and other low-latency tasks.
INT8 Quantization: This technique reduces the precision of model weights from 32-bit floating-point (FP32) to 8-bit integers (INT8), significantly lowering memory consumption and computational cost. It allows for faster execution without compromising too much on model accuracy.
INT4 Quantization: Going one step further, INT4 reduces the precision to 4 bits per weight, further cutting down memory usage and speeding up inference. This is particularly beneficial for real-time applications where responsiveness is critical.
By using these quantization techniques, Falcon 2 11B becomes optimized for scenarios that demand quick, reliable results, such as chatbot interactions, real-time content generation, or search retrieval tasks.
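To make the idea behind weight quantization concrete, here is a toy NumPy sketch of symmetric, weight-only INT8 quantization. It only illustrates the principle; OpenVINO's actual weight compression (per-channel and group-wise schemes, INT4 modes, and so on) is more sophisticated:

import numpy as np

# A random FP32 matrix standing in for one weight tensor of the model.
w = np.random.randn(4096, 4096).astype(np.float32)

# Symmetric INT8: one scale per output row, values mapped into [-127, 127].
scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# At inference time the weights are dequantized (or consumed directly by INT8 kernels).
w_dequant = w_int8.astype(np.float32) * scale

print(f"FP32 size: {w.nbytes / 1e6:.0f} MB, INT8 size: {w_int8.nbytes / 1e6:.0f} MB")
print(f"Mean absolute round-trip error: {np.abs(w - w_dequant).mean():.5f}")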
How to Optimize Falcon 2 11B on Amazon EC2 c7i
Here’s a step-by-step guide to getting Falcon 2 11B up and running on an EC2 c7i instance:
- Set Up an AWS Account: If you haven’t already, create an AWS account. You’ll need it to provision EC2 instances and access other AWS services.
- Choose the Right EC2 Instance: Navigate to the EC2 dashboard and select the c7i instance family. These compute-optimized instances provide strong performance per dollar for CPU-based inference (see the provisioning sketch after this list).
- Deploy Falcon 2 11B Model: Download and configure the Falcon 2 11B model. Use the pre-trained model provided by TII or your own custom-trained version. Ensure the model is optimized for CPU execution by applying quantization techniques to reduce the computational overhead.
- Install Dependencies: Install the necessary machine learning libraries, such as TensorFlow or PyTorch, on the EC2 instance. Make sure the instance has Intel’s oneAPI AI tools to fully leverage Intel AMX.
- Test and Optimize: Run inference tests to validate the model’s performance on EC2 c7i instances. You can fine-tune the model further for specific use cases or deploy it directly for production.
- Monitor and Scale: Once deployed, use AWS CloudWatch to monitor your model’s performance. EC2 auto-scaling features can also help ensure that your application scales based on traffic needs, ensuring maximum uptime and performance.
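As a concrete illustration of the instance-provisioning step, here is a minimal sketch using the AWS SDK for Python (boto3). The AMI ID, key pair, and security group are placeholders you would replace with values from your own account:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a single c7i.24xlarge instance with a 150 GB gp3 root volume.
# ImageId, KeyName, and SecurityGroupIds below are placeholders.
response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",   # e.g. an Ubuntu 22.04 or Amazon Linux 2023 AMI
    InstanceType="c7i.24xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",
    SecurityGroupIds=["sg-xxxxxxxxxxxxxxxxx"],
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda1",     # root device name depends on the AMI
        "Ebs": {"VolumeSize": 150, "VolumeType": "gp3"},
    }],
)
print(response["Instances"][0]["InstanceId"])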
Benchmark Results Summary: Performance Improvements
To evaluate how well Falcon 2 11B performs on EC2 c7i instances with OpenVINO quantization, we conducted benchmarking tests using two quantization techniques, INT8 and INT4. The tests were run on an EC2 c7i.24xlarge instance (96 vCPUs, 192 GB of memory), which is well suited to CPU-based deep learning inference thanks to its Intel Xeon processors and support for Intel AMX.
Case 1: INT8 Quantization with Falcon 2 11B
In this case, we examine the performance of Falcon 2 11B when optimized with INT8 quantization. INT8 is a powerful technique for reducing model size and computational demands without sacrificing too much on model accuracy, making it ideal for real-time applications.
Using the EC2 c7i.24xlarge instance, we tested various input lengths and measured the latency and throughput of the model during inference. The results show fast, consistent response times suitable for applications like language generation, chatbots, and other AI-powered services.
Here’s a summary of the performance metrics when using INT8 quantization:
| Quantization | Batch Size | Input Tokens | Output Tokens | First Token Latency (ms) | Second Token Latency (ms/token) | Throughput (tokens/s) |
| --- | --- | --- | --- | --- | --- | --- |
| INT8 | 1 | 32 | 128 | 148.07 | 82.51 | 12.12 |
| INT8 | 1 | 64 | 128 | 189.98 | 79.74 | 12.54 |
| INT8 | 1 | 128 | 128 | 283.50 | 80.26 | 12.46 |
| INT8 | 1 | 512 | 128 | 1037.25 | 82.03 | 12.19 |
| INT8 | 1 | 1024 | 128 | 1961.91 | 84.76 | 11.80 |
| INT8 | 1 | 2048 | 128 | 4068.90 | 90.40 | 11.06 |
As the table shows, second-token latency with INT8 stays in the range of roughly 80 to 90 ms/token even as the prompt grows from 32 to 2,048 tokens, and throughput holds steady at about 11 to 12.5 tokens/s, making INT8 an excellent choice for applications requiring low-latency responses.
Case 2: INT4 Quantization with Falcon 2 11B
Now let’s look at the performance of Falcon 2 11B when utilizing INT4 quantization, a more aggressive technique that further reduces the model’s memory footprint and speeds up inference times.
While INT4 quantization comes with a slight trade-off in accuracy, it can deliver an additional performance boost, which is especially useful for ultra-low-latency tasks. In our tests, the largest gains appear on short prompts, where second-token latency drops and throughput rises noticeably, while first-token latency is modestly lower than INT8 across most prompt lengths.
Here’s a summary of the performance metrics when using INT4 quantization:
| Quantization | Batch Size | Input Tokens | Output Tokens | First Token Latency (ms) | Second Token Latency (ms/token) | Throughput (tokens/s) |
| --- | --- | --- | --- | --- | --- | --- |
| INT4 | 1 | 32 | 128 | 142.00 | 59.06 | 16.93 |
| INT4 | 1 | 64 | 128 | 195.73 | 82.58 | 12.11 |
| INT4 | 1 | 128 | 128 | 274.67 | 80.40 | 12.44 |
| INT4 | 1 | 512 | 128 | 991.34 | 82.22 | 12.16 |
| INT4 | 1 | 1024 | 128 | 1922.87 | 85.18 | 11.74 |
| INT4 | 1 | 2048 | 128 | 4079.58 | 90.11 | 11.10 |
For the shortest prompts (32 input tokens), INT4 cuts second-token latency to about 59 ms/token and lifts throughput to nearly 17 tokens/s; for longer prompts the results track INT8 closely, with first-token latency slightly lower in most cases. That makes INT4 worth evaluating for use cases with ultra-low-latency requirements, provided the small accuracy trade-off is acceptable.
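If you want to reproduce similar measurements on your own instance, a rough wall-clock timing sketch using the same optimum-intel API as the inference script later in this post could look like the following (the model path and prompt are placeholders, and this is not the exact methodology behind the tables above):

import time
from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM

model_path = "model_weights/int8/pytorch/dldt/compressed_weights/OV_FP32-INT8/"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = OVModelForCausalLM.from_pretrained(model_path)

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")

# First-token latency: prefill plus one generated token.
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=1)
first_token_ms = (time.perf_counter() - start) * 1000

# Throughput: generated tokens divided by wall-clock generation time.
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start
new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]

print(f"First token latency: {first_token_ms:.1f} ms")
print(f"Throughput: {new_tokens / elapsed:.2f} tokens/s")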
Quantize Falcon 2 11B Using OpenVINO and Run Inference
Developers can easily quantize Falcon 2 11B and optimize it for running on CPU instances using the OpenVINO framework. Here’s a breakdown of the process:
Model Pull and Quantization:
First, the model is pulled from the HuggingFace hub through the OpenVINO framework, where it undergoes quantization. Note that model quantization requires large amounts of RAM, peaking at 116 GB for 11 billion parameters at full precision. For convenience, we run experiments on c7i.24xlarge instances equipped with 192 GB of RAM.
Figure 1: Model quantization
Memory Footprint for Inference:
Once quantization is complete, the model weights are stored on the instance disk, ready for inference. Compared to the quantization process, inference has a significantly lower memory requirement. The quantized model requires around 12 GB of memory for inference, taking advantage of the increased number of cores on EC2 c7i instances for faster execution.
Figure 2: Memory footprint of quantized model for inference
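To verify the inference memory footprint on your own instance, one simple option is to read the Python process's peak resident set size after loading the quantized model and running a short generation. This is a minimal sketch assuming Linux, where ru_maxrss is reported in kilobytes:

import resource
from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM

model_path = "model_weights/int8/pytorch/dldt/compressed_weights/OV_FP32-INT8/"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = OVModelForCausalLM.from_pretrained(model_path)

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
model.generate(**inputs, max_new_tokens=32)

# Peak resident set size of this process, converted from KB to GB.
peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / (1024 ** 2)
print(f"Peak resident memory: {peak_gb:.1f} GB")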
How to Quantize Falcon 2 11B
To begin quantizing Falcon 2 11B using OpenVINO, follow these steps:
- Provision EC2 Instance: Create a c7i.24xlarge EC2 instance on your AWS account.
- Allocate Storage: Ensure you have sufficient storage (at least 150 GB of Amazon EBS storage) for model weights and experiments.
- Set Up Virtual Environment: Create a virtual environment for Python dependencies:
python3 -m venv falcon-llm-env
source falcon-llm-env/bin/activate
pip install --upgrade pip
- Clone Repository: Clone the OpenVINO repository for model conversion.
git clone https://github.com/openvinotoolkit/openvino.genai.git
cd openvino.genai/llm_bench/python/
- Install Dependencies: Install all required dependencies.
pip install -r requirements.txt
- Run Model Conversion: Perform the model conversion and quantization.
python convert.py --model_id tiiuae/falcon-11B --output_dir model_weights/int8/ --compress_weights INT8 4BIT_DEFAULT
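As an alternative to the convert.py script, recent optimum-intel releases can export and compress the weights directly from Python. This is a sketch under the assumption that your installed optimum-intel version supports the load_in_8bit flag; the output directory name is arbitrary:

from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM

# Export the HuggingFace model to OpenVINO IR with 8-bit weight compression.
model = OVModelForCausalLM.from_pretrained("tiiuae/falcon-11B", export=True, load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-11B")

# Save the compressed model and tokenizer for later inference (hypothetical path).
model.save_pretrained("model_weights/int8-optimum/")
tokenizer.save_pretrained("model_weights/int8-optimum/")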
Test Falcon 2 11B for Inference
Once the quantization is complete, you can test Falcon 2 11B for inference using the following Python script, which integrates with the HuggingFace transformers library:
from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM

# Path to the INT8-compressed OpenVINO IR produced by the conversion step above.
model_path = "model_weights/int8/pytorch/dldt/compressed_weights/OV_FP32-INT8/"

# Load the tokenizer and the quantized model from the local directory.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = OVModelForCausalLM.from_pretrained(model_path)

# Tokenize a prompt and generate up to 200 tokens in total (prompt included).
inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_length=200)

text = tokenizer.batch_decode(outputs)[0]
print(text)  # Falcon 2 11B inference result
The following screenshot shows a sample generated by the previous code; the output is capped at a maximum of 200 tokens.

Conclusion
Deploying Falcon 2 11B on Amazon EC2 c7i instances with OpenVINO quantization provides a highly efficient solution for real-time AI inference. By utilizing INT8 and INT4 quantization techniques, developers can significantly reduce memory usage and computational demands while maintaining high model performance. Whether you’re building an AI chatbot or creating advanced language generation systems, this setup offers a scalable, cost-effective alternative to GPU-based deployment.
By leveraging Intel AMX-powered EC2 instances, OpenVINO’s optimization features, and Falcon 2 11B’s advanced capabilities, you can unlock the full potential of large language models without sacrificing speed or accuracy.
For seamless deployment and optimized cloud infrastructure, Gigabits Cloud provides high-performance solutions that can support large-scale AI models like Falcon 2 11B.