Hi there!

Welcome to Lay’s blog! 👋

Spoiler: it’s not gonna be about chips, but about data and AI. 😎

Journey into building a custom RAG (Day 2)

Now that I have a basic RAG working, I ran a few experiments to assess how accurately it retrieves relevant documents. In one of them, I took a part of the Hugging Face TRL library that is supposed to be included in my embeddings, on DPO (Direct Preference Optimization). Then I asked my model questions about it, but it wouldn’t get them right, and answered the way ChatGPT would, without any context or knowledge about DPO....

Journey into building a custom RAG (Day 1)

I am currently building a RAG (Retrieval Augmented Generation) LLM application, and recently decided to share my journey on this blog. I’ll skip the intro and jump straight to my current status. Here’s where I am: Starting Point: To get started with all the boilerplate code, I used the Metaflow RAG demo repository, along with this Anyscale blog post for insights and a second point of view. Metaflow’s repo includes a Streamlit chat webapp, utilities for scraping and parsing Markdown documentation files, etc....

Develop a Retrieval Augmented Generation (RAG) LLM Application

This article explains how to customize a large language model for your specific needs. There are many approaches to adapting a Large Language Model (LLM) to your data, with the goal of making its responses more relevant. Prompt engineering: you spend time crafting a good prompt, possibly a templated prompt into which you can insert your data. This is the fastest way, but it quickly becomes limiting, as the context size (the maximum input you can give to a model) is usually in the range of 4-16K tokens (3-11K words), and it is shared between question and response....
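As a rough illustration of the templated-prompt idea, here is a minimal Python sketch. The template text, the `build_prompt` helper, and the 2,000-word budget are all assumptions made for this example, not something taken from the article. The word budget is only a crude stand-in for a real token count.

```python
# Hypothetical template: the wording is an assumption for illustration.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
)


def build_prompt(question: str, context: str, max_context_words: int = 2000) -> str:
    """Fill the template, keeping only the first `max_context_words` words of context.

    Splitting on whitespace also collapses newlines in the context,
    which is acceptable for a sketch but loses formatting.
    """
    words = context.split()
    if len(words) > max_context_words:
        words = words[:max_context_words]
    return PROMPT_TEMPLATE.format(context=" ".join(words), question=question)


if __name__ == "__main__":
    print(build_prompt("What is DPO?", "DPO stands for Direct Preference Optimization."))
```

Because the context window is shared between question and response, the budget here deliberately stays well under the model's limit.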

Detect Deep Fakes in 30 minutes with Computer Vision

In the realm of digital content, the rise of deep fakes has presented a unique set of challenges. I joined the Kaggle Deep Fake Detection Challenge (DFDC) and challenged myself to submit a working solution within two days. I describe my approach below. Code is here. The Challenge at a Glance The dataset size is a challenge in itself: 500GB of video content. Approach and Methodology Frame Extraction: Capture a few frames from each video to immediately reduce the dataset size by 90%, and obtain something small to iterate with....
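The frame extraction step could look something like the sketch below, which only picks which frame indices to keep. The `sample_frame_indices` helper and the evenly spaced strategy are assumptions for illustration, since the excerpt does not say how frames were chosen. Decoding the frames themselves would need a video library and is not shown.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Return up to `num_samples` evenly spaced frame indices.

    Each index is the midpoint of an equal-sized slice of the video,
    so the samples avoid the very first and very last frames.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(i * step + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    # A 300-frame clip reduced to 10 frames.
    print(sample_frame_indices(300, 10))
```

Keeping 10 frames out of 300 discards well over 90% of the frames, which is the kind of reduction that makes a dataset this large practical to iterate on.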

Apache Spark Best Practices Megalist

Navigating through the complexities of Apache Spark job optimization can be challenging. Based on recent consulting experiences with various companies, I have compiled a list of best practices for optimizing Apache Spark applications. These strategies have proven to be effective in enhancing performance and cost-efficiency. Partitioning Partition wisely: It’s important to balance the size of partitions. Avoid too many small partitions (big coordination overhead) and too few large ones (no parallelism)....
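One way to make "partition wisely" concrete is to derive a partition count from the data size. The sketch below is plain Python with a hypothetical `suggest_num_partitions` helper. The 128 MB target is a common rule of thumb and an assumption here, not a figure from the article.

```python
import math

# Assumed target size per partition (128 MiB), a rule of thumb only.
DEFAULT_TARGET_PARTITION_BYTES = 128 * 1024**2


def suggest_num_partitions(
    total_bytes: int,
    target_partition_bytes: int = DEFAULT_TARGET_PARTITION_BYTES,
    min_partitions: int = 1,
) -> int:
    """Suggest a partition count so each partition is at most the target size."""
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    return max(min_partitions, math.ceil(total_bytes / target_partition_bytes))


if __name__ == "__main__":
    # 10 GiB of input with the default 128 MiB target.
    print(suggest_num_partitions(10 * 1024**3))
```

In a PySpark job, the result could be passed to `df.repartition(n)`. The right target still depends on the cluster and the workload, so treat the number as a starting point to measure against.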

Detecting bot traffic on a webserver

TL;DR I did a basic bot detection POC/analysis on the HTTP logs of a website, for an interview with a company in 2017. Here are the results: Among the 68,000 unique IPs, 10% were classified as potential bots by our algorithm. Among the 5 million requests, 80% were classified as bot traffic. According to this article, 40% of the overall web traffic comes from bots. Given the simplicity of our algorithm, it most likely flags a lot of false positives, as we will see very soon....