In this tutorial, we explore the FineWeb dataset through an advanced hands-on workflow. We stream a manageable sample of the dataset without downloading the full multi-terabyte corpus, inspect its schema and metadata, and analyze key fields such as URL, language, language score, and token count. We also reproduce simplified versions of FineWeb’s quality-filtering pipeline, apply MinHash-based near-duplicate detection, verify token counts with the GPT-2 tokenizer, and generate useful analytics on domains, language scores, document lengths, and tokenizer efficiency.
    import subprocess, sys

    def pip(*pkgs):
        subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)

    pip("datasets>=2.19", "datasketch", "tiktoken", "pandas", "matplotlib", "tqdm")

    import re, math, random, collections
    from urllib.parse import urlparse

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from tqdm.auto import tqdm
    from datasets import load_dataset

    random.seed(0); np.random.seed(0)
    pd.set_option("display.max_colwidth", 90)
We begin by installing all required libraries for streaming, analysis, deduplication, tokenization, and visualization. We import the core Python packages needed to process FineWeb documents and work with tabular data. We also set random seeds and display options so that our results remain consistent and easier to inspect.
    N_DOCS = 3000
    print(f"Streaming {N_DOCS} docs from FineWeb sample-10BT ...")

    stream = load_dataset(
        "HuggingFaceFW/fineweb",
        name="sample-10BT",
        split="train",
        streaming=True,
    )

    docs = []
    for i, doc in enumerate(tqdm(stream, total=N_DOCS)):
        docs.append(doc)
        if i + 1 >= N_DOCS:
            break

    df = pd.DataFrame(docs)
    print("\nColumns:", list(df.columns))
    print(df[["url", "language", "language_score", "token_count"]].head(5))

    ex = docs[0]
    print("\n--- Example record (fields) ---")
    for k, v in ex.items():
        preview = (v[:120] + "…") if isinstance(v, str) and len(v) > 120 else v
        print(f"{k:>16}: {preview}")
We stream a fixed number of documents from the FineWeb sample-10BT subset without downloading the full dataset. We convert the streamed records into a DataFrame and inspect key metadata fields, including URL, language, language score, and token count. We also print a complete example record to better understand the dataset’s structure.
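The capped loop used for streaming can also be written with `itertools.islice`, which stops pulling from a lazy iterator after a fixed number of items. The sketch below is a minimal illustration that runs offline: `fake_stream` and `take` are hypothetical names, and the generator is only a stand-in for the streaming dataset, under the assumption that a streaming dataset behaves like any lazy Python iterable.

```python
from itertools import islice

def fake_stream():
    """Hypothetical stand-in for a streaming dataset: yields records lazily, forever."""
    i = 0
    while True:
        yield {"id": i, "text": f"document {i}", "token_count": 2}
        i += 1

def take(stream, n):
    """Materialize only the first n records of a lazy iterable."""
    return list(islice(stream, n))

sample = take(fake_stream(), 5)
print(len(sample), [d["id"] for d in sample])
```

Because `islice` never asks for more than `n` items, the rest of the stream is never fetched, which is the property that lets us sample a multi-terabyte corpus cheaply.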
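    WORD = re.compile(r"\b\w+\b")

    # NOTE: several comparisons in this block were lost when the page was extracted
    # (text between "<" and ">" was dropped). Lines marked "reconstructed" are
    # assumptions based on the published Gopher, C4, and FineWeb rules, not the
    # author's verified code.

    def gopher_quality(text):
        words = WORD.findall(text)
        n = len(words)
        if n < 50 or n > 100_000:  # lower bound reconstructed
            return False, "word_count_out_of_range"
        mean_len = sum(len(w) for w in words) / n
        if mean_len < 3 or mean_len > 10:  # lower bound reconstructed
            return False, "bad_mean_word_length"
        if (text.count("#") + text.count("...")) / n > 0.1:
            return False, "too_many_symbols"
        lines = text.split("\n")
        if lines and sum(l.lstrip().startswith(("•", "-", "*")) for l in lines) / len(lines) > 0.9:
            return False, "mostly_bullets"
        stops = {"the", "be", "to", "of", "and", "that", "have", "with"}
        if len(stops & {w.lower() for w in words}) < 2:  # threshold reconstructed
            return False, "too_few_stopwords"  # label reconstructed
        return True, "ok"

    def c4_quality(text):
        # Reconstructed: everything before the brace check was lost in extraction.
        # The "lorem ipsum" test is an assumed stand-in for the original C4-style checks.
        lines = [l for l in text.split("\n") if l.strip()]
        if "lorem ipsum" in text.lower():
            return False, "lorem_ipsum"
        if len(lines) > 0 and text.count("{") / max(len(lines), 1) > 0.5:
            return False, "too_many_braces"
        return True, "ok"

    def fineweb_custom(text):
        lines = [l.strip() for l in text.split("\n") if l.strip()]
        if not lines:
            return False, "empty"
        dup_frac = 1 - len(set(lines)) / len(lines)
        if dup_frac > 0.3:
            return False, "duplicated_lines"
        # 30-character cutoff reconstructed
        short_frac = sum(len(l) < 30 for l in lines) / len(lines)
        if short_frac > 0.67 and len(lines) > 5:
            return False, "list_like"
        return True, "ok"

    results = []
    for d in docs:
        t = d["text"]
        g_ok, g_r = gopher_quality(t)
        c_ok, c_r = c4_quality(t)
        f_ok, f_r = fineweb_custom(t)
        # Record the first failing rule, checked in Gopher, C4, custom order.
        reason = "kept" if (g_ok and c_ok and f_ok) else (g_r if not g_ok else c_r if not c_ok else f_r)
        results.append(reason)

    filter_summary = pd.Series(results).value_counts()
    print("\n--- Quality-filter outcomes on already-clean FineWeb data ---")
    print("(Most pass: FineWeb is pre-filtered. Rejections show what the rules catch.)")
    print(filter_summary)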
We recreate simplified versions of FineWeb’s quality filters using Gopher-style, C4-style, and custom text-cleaning heuristics. We check each document for issues such as abnormal word counts, poor word statistics, boilerplate text, repeated lines, and list-like structure. We summarize how many documents pass or fail these filters to understand the quality of the already-cleaned FineWeb sample.
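To make the repeated-line and list-like checks concrete, the following sketch computes the two line-level fractions on toy inputs. `line_stats`, the sample strings, and the 30-character cutoff are illustrative assumptions rather than values taken from the FineWeb pipeline itself.

```python
def line_stats(text, short_len=30):
    """Return (duplicate-line fraction, short-line fraction) over non-empty lines."""
    lines = [l.strip() for l in text.split("\n") if l.strip()]
    if not lines:
        return 0.0, 0.0
    # Share of lines that repeat an earlier line.
    dup_frac = 1 - len(set(lines)) / len(lines)
    # Share of lines shorter than short_len characters (assumed cutoff).
    short_frac = sum(len(l) < short_len for l in lines) / len(lines)
    return dup_frac, short_frac

# Toy inputs: a navigation-menu style page and two lines of ordinary prose.
menu = "Home\nAbout\nContact\nHome\nAbout\nContact\nLogin\nSign up"
prose = (
    "FineWeb is a large English web corpus built from Common Crawl snapshots.\n"
    "Each record stores the extracted text together with its URL and language score."
)

menu_stats = line_stats(menu)
prose_stats = line_stats(prose)
print("menu :", menu_stats)
print("prose:", prose_stats)
```

The menu-like text has 8 lines but only 5 distinct ones, so its duplicate fraction is 0.375 and every line counts as short, while the prose scores zero on both measures. A rule that thresholds these fractions therefore separates the two cases.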
    from datasketch import MinHash, MinHashLSH

    def shingles(text, k=5):
        toks = WORD.findall(text.lower())
        return {" ".join(toks[i:i+k]) for i in range(max(len(toks) - k + 1, 1))}

    NUM_PERM = 128
    THRESHOLD = 0.7
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    minhashes = {}

    for idx, d in enumerate(tqdm(docs, desc="MinHashing")):
        m = MinHash(num_perm=NUM_PERM)
        for s in shingles(d["text"]):
            m.update(s.encode("utf8"))
        minhashes[idx] = m
        lsh.insert(str(idx), m)

    dup_pairs = set()
    for idx, m in minhashes.items():
        for cand in lsh.query(m):
            c = int(cand)
            if c != idx:
                dup_pairs.add(tuple(sorted((idx, c))))

    print(f"\nFound {len(dup_pairs)} near-duplicate pairs (Jaccard ≥ {THRESHOLD}).")
    if dup_pairs:
        a, b = next(iter(dup_pairs))
        j = minhashes[a].jaccard(minhashes[b])
        print(f"Example pair (estimated Jaccard ≈ {j:.2f}):")
        print(" DOC A:", docs[a]["text"][:160].replace("\n", " "), "…")
        print(" DOC B:", docs[b]["text"][:160].replace("\n", " "), "…")
    else:
        print("No near-dupes in this slice (expected, since FineWeb is dedup'd per crawl).")
We implement MinHash-based near-duplicate detection to approximate how large web corpora identify repeated or highly similar documents. We convert each document into word shingles, generate MinHash signatures, and index them with Locality Sensitive Hashing. We then search for near-duplicate document pairs and inspect an example if any similar texts are found.
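To show what a MinHash signature estimates, the sketch below implements a small MinHash using only the standard library and compares its estimate with the exact Jaccard similarity of two shingle sets. It is an illustration under stated assumptions, not how datasketch works internally: the salted `blake2b` hashing scheme, the helper names, the shingle size of 3, and the two sample sentences are all chosen for this example.

```python
import hashlib

def shingles(text, k=3):
    """Set of k-word shingles; whitespace tokenization keeps the example simple."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + k]) for i in range(max(len(toks) - k + 1, 1))}

def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def minhash_signature(items, num_perm=128):
    """For each of num_perm salted hash functions, keep the minimum hash over all items."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return sig

def estimate(sig_a, sig_b):
    """Fraction of signature positions that agree; this approximates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the old mill"

set_a, set_b = shingles(doc_a), shingles(doc_b)
sig_a, sig_b = minhash_signature(set_a), minhash_signature(set_b)

exact = jaccard(set_a, set_b)
est = estimate(sig_a, sig_b)
print(f"exact Jaccard = {exact:.3f}, MinHash estimate = {est:.3f}")
```

The two sentences share 9 of 13 distinct shingles, so the exact similarity is 9/13. The estimate should land near that value, with an error that shrinks as `num_perm` grows. Locality Sensitive Hashing then buckets signatures so that only likely matches are compared, instead of every pair.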
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    check = docs[:200]
    recomputed = [len(enc.encode(d["text"])) for d in tqdm(check, desc="Tokenizing")]
    stored = [d["token_count"] for d in check]
    diffs = np.array(recomputed) - np.array(stored)

    print(f"\n--- Verifying token_count field (gpt2) on 200 docs ---")
    print(f"Mean abs diff vs stored token_count: {np.abs(diffs).mean():.2f} tokens")
    print(f"Exact matches: {(diffs == 0).mean()*100:.0f}% (small drift = tokenizer version)")

    df["chars_per_token"] = df["text"].str.len() / df["token_count"].clip(lower=1)
    print(f"Avg characters per token: {df['chars_per_token'].mean():.2f}")
We verify the dataset’s token_count field by recomputing GPT-2 token counts with the tiktoken tokenizer. We compare the recomputed token counts with the stored values and measure the average difference between them. We also calculate characters per token to understand tokenizer efficiency across the sampled documents.
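The comparison reduces to simple arithmetic, which the following sketch works through on made-up numbers. The token counts and helper names are hypothetical and serve only to show how the mean absolute difference, the exact-match rate, and characters per token are computed.

```python
def compare_counts(recomputed, stored):
    """Return (mean absolute difference, fraction of exact matches)."""
    diffs = [r - s for r, s in zip(recomputed, stored)]
    mean_abs = sum(abs(d) for d in diffs) / len(diffs)
    exact = sum(d == 0 for d in diffs) / len(diffs)
    return mean_abs, exact

def chars_per_token(text, n_tokens):
    """Characters per token, guarding against a zero token count."""
    return len(text) / max(n_tokens, 1)

# Hypothetical counts for four documents (not real FineWeb values).
recomputed = [120, 305, 48, 990]
stored = [120, 304, 48, 992]

mean_abs, exact = compare_counts(recomputed, stored)
ratio = chars_per_token("x" * 450, 100)
print(f"mean abs diff = {mean_abs}, exact matches = {exact:.0%}, chars/token = {ratio}")
```

Here the differences are 0, 1, 0, and -2, so the mean absolute difference is 0.75 tokens and half of the documents match exactly. A higher characters-per-token ratio means the tokenizer compresses the text into fewer tokens.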
    df["domain"] = df["url"].apply(lambda u: urlparse(u).netloc.replace("www.", "") if isinstance(u, str) else "?")
    top_domains = df["domain"].value_counts().head(15)
    print("\n--- Top 15 domains in sample ---")
    print(top_domains)

    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    axes[0, 0].hist(df["token_count"].clip(upper=4000), bins=50, color="#7b2d26")
    axes[0, 0].set_title("Token count per document (gpt2)")
    axes[0, 0].set_xlabel("tokens"); axes[0, 0].set_ylabel("docs")

    axes[0, 1].hist(df["language_score"], bins=40, color="#2d5d7b")
    axes[0, 1].axvline(0.65, color="red", ls="--", label="FineWeb cutoff 0.65")
    axes[0, 1].set_title("fastText English language score")
    axes[0, 1].set_xlabel("score"); axes[0, 1].legend()

    axes[1, 0].hist(df["chars_per_token"].clip(upper=8), bins=40, color="#3f7b2d")
    axes[1, 0].set_title("Characters per token (compression)")
    axes[1, 0].set_xlabel("chars / token")

    top_domains.iloc[::-1].plot(kind="barh", ax=axes[1, 1], color="#7b5d2d")
    axes[1, 1].set_title("Top domains")

    plt.tight_layout()
    plt.show()

    print("\n" + "=" * 70)
    print("SUMMARY")
    print("=" * 70)
    print(f"Docs streamed : {len(df):,}")
    print(f"Total gpt2 tokens : {df['token_count'].sum():,}")
    print(f"Median tokens/doc : {int(df['token_count'].median())}")
    print(f"Unique domains : {df['domain'].nunique():,}")
    print(f"Mean language_score : {df['language_score'].mean():.3f}")
    print(f"Near-duplicate pairs : {len(dup_pairs)}")
    print(f"Docs flagged by filters : {(pd.Series(results) != 'kept').sum()} / {len(results)}")

    print("\nNext steps:")
    print(" • Swap name='sample-10BT' for a real crawl, e.g. name='CC-MAIN-2024-10'")
    print(" • Raise N_DOCS for stronger statistics")
    print(" • Use the full datatrove pipeline to reproduce FineWeb end-to-end")
We extract domain names from URLs and identify the most frequent domains present in the FineWeb sample. We create visualizations for token count distribution, language score distribution, characters per token, and top domains. We finish by printing a compact summary of streamed documents, total tokens, median length, unique domains, language quality, duplicate count, and filter results.
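One detail of domain extraction is worth illustrating: `str.replace("www.", "")` removes the substring anywhere in the host, not only at the start. The sketch below is a hypothetical variant that strips only a leading `www.` and counts domains with `collections.Counter`; the helper name and sample URLs are invented for this example.

```python
from collections import Counter
from urllib.parse import urlparse

def domain_of(url):
    """Lower-cased host with a leading 'www.' removed; '?' for missing URLs."""
    if not isinstance(url, str):
        return "?"
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

# Invented sample URLs, including a host that merely contains "www." and a missing value.
urls = [
    "https://www.example.com/a",
    "http://example.com/b",
    "https://blog.example.org/post",
    "https://awww.example.net/x",
    None,
]

domains = [domain_of(u) for u in urls]
counts = Counter(domains)
print(domains)
print(counts.most_common(1))
```

With prefix-only stripping, `awww.example.net` is left intact, whereas a global replace would turn it into `aexample.net`. Note that `netloc` still includes any port or user-info component, so URLs carrying those would need further cleanup.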
In conclusion, we developed a practical understanding of how large-scale web datasets such as FineWeb are explored, filtered, deduplicated, and analyzed for language model training. We worked efficiently with streaming data, tested quality heuristics on real documents, searched for near-duplicate text, and checked token-level metadata against the GPT-2 tokenizer. The same workflow can be scaled to larger FineWeb crawls, extended to deeper corpus analysis, and used as a starting point for preprocessing pipelines in LLM dataset preparation.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
import subprocess, sys def pip(*pkgs): subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True) pip("datasets>=2.19", "datasketch", "tiktoken", "pandas", "matplotlib", "tqdm") import re, math, random, collections from urllib.parse import urlparse import pandas as pd import numpy as np import matplotlib.pyplot as plt from tqdm.auto import tqdm from datasets import load_dataset random.seed(0); np.random.seed(0) pd.set_option("display.max_colwidth", 90)
We begin by installing all required libraries for streaming, analysis, deduplication, tokenization, and visualization. We import the core Python packages needed to process FineWeb documents and work with tabular data. We also set random seeds and display options so that our results remain consistent and easier to inspect.
N_DOCS = 3000
print(f"Streaming {N_DOCS} docs from FineWeb sample-10BT ...")

stream = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# Pull records lazily and stop as soon as we have N_DOCS of them.
docs = []
for i, doc in enumerate(tqdm(stream, total=N_DOCS)):
    docs.append(doc)
    if i + 1 >= N_DOCS:
        break

df = pd.DataFrame(docs)
print("\nColumns:", list(df.columns))
print(df[["url", "language", "language_score", "token_count"]].head(5))

# Show every field of the first record, truncating long strings.
ex = docs[0]
print("\n--- Example record (fields) ---")
for k, v in ex.items():
    preview = (v[:120] + "…") if isinstance(v, str) and len(v) > 120 else v
    print(f"{k:>16}: {preview}")
We stream a fixed number of documents from the FineWeb sample-10BT subset without downloading the full dataset. We convert the streamed records into a DataFrame and inspect key metadata fields, including URL, language, language score, and token count. We also print a complete example record to better understand the dataset’s structure.
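The early-stopping pattern described above can be illustrated without any network access. The following is a minimal sketch, assuming a hypothetical generator `fake_stream` that stands in for the streamed dataset and a hypothetical helper `take`; neither name comes from the tutorial. It shows that `itertools.islice` pulls only the first few records from a lazy, conceptually unbounded source.

```python
from itertools import islice

def fake_stream():
    """Hypothetical stand-in for a streaming dataset: yields records lazily."""
    i = 0
    while True:  # conceptually unbounded, like a multi-terabyte corpus
        yield {"id": i, "text": f"document {i}"}
        i += 1

def take(stream, n):
    """Materialize only the first n records of a lazy iterable."""
    return list(islice(stream, n))

sample = take(fake_stream(), 5)
print(len(sample), sample[-1]["id"])
```

Because `islice` accepts any iterable, the same helper should also work on the streamed dataset object in place of the explicit `enumerate` and `break` loop.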
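WORD = re.compile(r"\b\w+\b")

# NOTE: the "<" comparisons in this block were stripped when the page was extracted.
# Anything marked "reconstructed" is an assumption based on the published Gopher
# and FineWeb filter settings, not text recovered from the original listing.

def gopher_quality(text):
    words = WORD.findall(text)
    n = len(words)
    if n < 50 or n > 100_000:  # lower bound 50 reconstructed
        return False, "word_count_out_of_range"
    mean_len = sum(len(w) for w in words) / n
    if mean_len < 3 or mean_len > 10:  # lower bound 3 reconstructed
        return False, "bad_mean_word_length"
    if (text.count("#") + text.count("...")) / n > 0.1:
        return False, "too_many_symbols"
    lines = text.split("\n")
    if lines and sum(l.lstrip().startswith(("•", "-", "*")) for l in lines) / len(lines) > 0.9:
        return False, "mostly_bullets"
    stops = {"the", "be", "to", "of", "and", "that", "have", "with"}
    if len(stops & {w.lower() for w in words}) < 2:  # threshold and reason name reconstructed
        return False, "too_few_stopwords"
    return True, "ok"

def c4_quality(text):
    # Most of this function was lost in extraction; only the curly-brace check
    # survives. The `lines` definition and the `len(lines) > 0` guard are
    # reconstructed, and any other C4-style rules the original had are not restored.
    lines = text.split("\n")
    if len(lines) > 0 and text.count("{") / max(len(lines), 1) > 0.5:
        return False, "too_many_braces"
    return True, "ok"

def fineweb_custom(text):
    lines = [l.strip() for l in text.split("\n") if l.strip()]
    if not lines:
        return False, "empty"
    dup_frac = 1 - len(set(lines)) / len(lines)
    if dup_frac > 0.3:
        return False, "duplicated_lines"
    short_frac = sum(len(l) < 30 for l in lines) / len(lines)  # 30-character cutoff reconstructed
    if short_frac > 0.67 and len(lines) > 5:
        return False, "list_like"
    return True, "ok"

# Record the first rule family that rejects each document, or "kept" if none does.
results = []
for d in docs:
    t = d["text"]
    g_ok, g_r = gopher_quality(t)
    c_ok, c_r = c4_quality(t)
    f_ok, f_r = fineweb_custom(t)
    reason = "kept" if (g_ok and c_ok and f_ok) else (g_r if not g_ok else c_r if not c_ok else f_r)
    results.append(reason)

filter_summary = pd.Series(results).value_counts()
print("\n--- Quality-filter outcomes on already-clean FineWeb data ---")
print("(Most pass: FineWeb is pre-filtered. Rejections show what the rules catch.)")
print(filter_summary)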
We recreate simplified versions of FineWeb’s quality filters using Gopher-style, C4-style, and custom text-cleaning heuristics. We check each document for issues such as abnormal word counts, poor word statistics, boilerplate text, repeated lines, and list-like structure. We summarize how many documents pass or fail these filters to understand the quality of the already-cleaned FineWeb sample.
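To make the "first failing rule wins" logic easier to follow, here is a small self-contained sketch on toy text. The rule functions `too_short` and `repeated_lines`, the helper `first_rejection`, and the thresholds are illustrative assumptions, not the tutorial's filters; they only mirror the shape of returning an `(ok, reason)` pair.

```python
def too_short(text, min_words=5):
    """Toy rule: reject documents with fewer than min_words words."""
    return len(text.split()) >= min_words, "too_short"

def repeated_lines(text, max_dup_frac=0.3):
    """Toy rule: reject documents where too many non-empty lines repeat."""
    lines = [l.strip() for l in text.split("\n") if l.strip()]
    if not lines:
        return False, "empty"
    dup_frac = 1 - len(set(lines)) / len(lines)
    return dup_frac <= max_dup_frac, "duplicated_lines"

def first_rejection(text, rules):
    """Run rules in order and return the first failing reason, or 'kept'."""
    for rule in rules:
        ok, reason = rule(text)
        if not ok:
            return reason
    return "kept"

RULES = [too_short, repeated_lines]

examples = {
    "tiny": "hi there",
    "spam": "buy now today please friend\n" * 3,
    "fine": "A short but perfectly ordinary sentence.\nAnd a second distinct line.",
}
outcomes = {name: first_rejection(text, RULES) for name, text in examples.items()}
print(outcomes)
```

Keeping the rules in an ordered list makes the rejection reason deterministic and lets new heuristics be added without growing a nested conditional expression.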
from datasketch import MinHash, MinHashLSH

def shingles(text, k=5):
    # Overlapping k-word windows; a short text yields a single shingle.
    toks = WORD.findall(text.lower())
    return {" ".join(toks[i:i+k]) for i in range(max(len(toks) - k + 1, 1))}

NUM_PERM = 128
THRESHOLD = 0.7
lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
minhashes = {}

# Build one MinHash signature per document and index it in the LSH table.
for idx, d in enumerate(tqdm(docs, desc="MinHashing")):
    m = MinHash(num_perm=NUM_PERM)
    for s in shingles(d["text"]):
        m.update(s.encode("utf8"))
    minhashes[idx] = m
    lsh.insert(str(idx), m)

# Query each signature against the index and collect unordered candidate pairs.
dup_pairs = set()
for idx, m in minhashes.items():
    for cand in lsh.query(m):
        c = int(cand)
        if c != idx:
            dup_pairs.add(tuple(sorted((idx, c))))

print(f"\nFound {len(dup_pairs)} near-duplicate pairs (Jaccard ≥ {THRESHOLD}).")
if dup_pairs:
    a, b = next(iter(dup_pairs))
    j = minhashes[a].jaccard(minhashes[b])
    print(f"Example pair (estimated Jaccard ≈ {j:.2f}):")
    print("  DOC A:", docs[a]["text"][:160].replace("\n", " "), "…")
    print("  DOC B:", docs[b]["text"][:160].replace("\n", " "), "…")
else:
    print("No near-dupes in this slice — expected, since FineWeb is dedup'd per crawl.")
We implement MinHash-based near-duplicate detection to approximate how large web corpora identify repeated or highly similar documents. We convert each document into word shingles, generate MinHash signatures, and index them with Locality Sensitive Hashing. We then search for near-duplicate document pairs and inspect an example if any similar texts are found.
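MinHash signatures estimate the Jaccard similarity between two shingle sets, so it can help to see the exact quantity on a pair of toy sentences. The sketch below uses only the standard library; the sentences, the `k=3` shingle size, and the `jaccard` helper are illustrative assumptions rather than part of the tutorial's pipeline.

```python
import re

WORD = re.compile(r"\b\w+\b")

def shingles(text, k=3):
    """Set of overlapping k-word windows over the lowercased text."""
    toks = WORD.findall(text.lower())
    return {" ".join(toks[i:i + k]) for i in range(max(len(toks) - k + 1, 1))}

def jaccard(a, b):
    """Exact Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

doc_a = "The quick brown fox jumps over the lazy dog"
doc_b = "The quick brown fox jumps over the lazy cat"

# Each sentence has 7 three-word shingles; 6 are shared, so 6 / 8 = 0.75.
sim = jaccard(shingles(doc_a), shingles(doc_b))
print(f"{sim:.2f}")  # 0.75
```

Computing this exactly for every pair of documents scales quadratically, which is why the workflow relies on fixed-size signatures and an LSH index to surface only likely matches.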
import tiktoken

enc = tiktoken.get_encoding("gpt2")
check = docs[:200]

# disallowed_special=() makes tiktoken treat strings such as "<|endoftext|>" as
# ordinary text; with the default setting, encode() raises an error if a web
# page happens to contain one.
recomputed = [len(enc.encode(d["text"], disallowed_special=())) for d in tqdm(check, desc="Tokenizing")]
stored = [d["token_count"] for d in check]
diffs = np.array(recomputed) - np.array(stored)

print(f"\n--- Verifying token_count field (gpt2) on {len(check)} docs ---")
print(f"Mean abs diff vs stored token_count: {np.abs(diffs).mean():.2f} tokens")
print(f"Exact matches: {(diffs == 0).mean()*100:.0f}%  (small drift = tokenizer version)")

# Characters per token is a rough measure of how well the tokenizer compresses the text.
df["chars_per_token"] = df["text"].str.len() / df["token_count"].clip(lower=1)
print(f"Avg characters per token: {df['chars_per_token'].mean():.2f}")
We verify the dataset’s token_count field by recomputing GPT-2 token counts with the tiktoken tokenizer. We compare the recomputed token counts with the stored values and measure the average difference between them. We also calculate characters per token to understand tokenizer efficiency across the sampled documents.
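The comparison itself is simple arithmetic, shown here on made-up numbers so it can be checked by hand. The token counts below are invented for illustration and are not FineWeb measurements, and `chars_per_token` is a hypothetical helper that mirrors the clipped division used on the DataFrame.

```python
# Invented token counts for four hypothetical documents.
recomputed = [10, 12, 7, 30]
stored = [10, 11, 7, 33]

diffs = [r - s for r, s in zip(recomputed, stored)]          # [0, 1, 0, -3]
mean_abs_diff = sum(abs(d) for d in diffs) / len(diffs)      # (0 + 1 + 0 + 3) / 4
exact_match_rate = sum(d == 0 for d in diffs) / len(diffs)   # 2 of 4 match

def chars_per_token(text, token_count):
    """Characters per token, guarding against a zero token count."""
    return len(text) / max(token_count, 1)

print(mean_abs_diff, exact_match_rate)   # 1.0 0.5
print(chars_per_token("hello world", 2)) # 5.5
```

Reporting both the mean absolute difference and the exact-match rate separates a few large mismatches from many small ones.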
# Strip only a leading "www." so hosts that merely contain that substring are left intact.
df["domain"] = df["url"].apply(
    lambda u: re.sub(r"^www\.", "", urlparse(u).netloc) if isinstance(u, str) else "?"
)
top_domains = df["domain"].value_counts().head(15)
print("\n--- Top 15 domains in sample ---")
print(top_domains)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Token counts, clipped at 4000 so the long tail does not flatten the histogram.
axes[0, 0].hist(df["token_count"].clip(upper=4000), bins=50, color="#7b2d26")
axes[0, 0].set_title("Token count per document (gpt2)")
axes[0, 0].set_xlabel("tokens"); axes[0, 0].set_ylabel("docs")

# Language-identification confidence, with the cutoff marked.
axes[0, 1].hist(df["language_score"], bins=40, color="#2d5d7b")
axes[0, 1].axvline(0.65, color="red", ls="--", label="FineWeb cutoff 0.65")
axes[0, 1].set_title("fastText English language score")
axes[0, 1].set_xlabel("score"); axes[0, 1].legend()

axes[1, 0].hist(df["chars_per_token"].clip(upper=8), bins=40, color="#3f7b2d")
axes[1, 0].set_title("Characters per token (compression)")
axes[1, 0].set_xlabel("chars / token")

top_domains.iloc[::-1].plot(kind="barh", ax=axes[1, 1], color="#7b5d2d")
axes[1, 1].set_title("Top domains")

plt.tight_layout()
plt.show()

print("\n" + "=" * 70)
print("SUMMARY")
print("=" * 70)
print(f"Docs streamed            : {len(df):,}")
print(f"Total gpt2 tokens        : {df['token_count'].sum():,}")
print(f"Median tokens/doc        : {int(df['token_count'].median())}")
print(f"Unique domains           : {df['domain'].nunique():,}")
print(f"Mean language_score      : {df['language_score'].mean():.3f}")
print(f"Near-duplicate pairs     : {len(dup_pairs)}")
print(f"Docs flagged by filters  : {(pd.Series(results) != 'kept').sum()} / {len(results)}")
print("\nNext steps:")
print("  • Swap name='sample-10BT' for a real crawl, e.g. name='CC-MAIN-2024-10'")
print("  • Raise N_DOCS for stronger statistics")
print("  • Use the full datatrove pipeline to reproduce FineWeb end-to-end")
We extract domain names from URLs and identify the most frequent domains present in the FineWeb sample. We create visualizations for token count distribution, language score distribution, characters per token, and top domains. We finish by printing a compact summary of streamed documents, total tokens, median length, unique domains, language quality, duplicate count, and filter results.
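The domain-counting step can be tried on its own with a handful of URLs. This is a minimal standard-library sketch; the URLs are placeholders and `domain_of` is a hypothetical helper, written to strip only a leading `www.` and to map non-string values to `"?"`.

```python
from collections import Counter
from urllib.parse import urlparse

def domain_of(url):
    """Return the host without a leading 'www.', or '?' for non-string input."""
    if not isinstance(url, str):
        return "?"
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

# Placeholder URLs for illustration only.
urls = [
    "https://www.example.com/a",
    "http://example.com/b",
    "https://blog.example.org/post",
    None,
]

counts = Counter(domain_of(u) for u in urls)
print(counts.most_common(1))  # [('example.com', 2)]
```

Subdomains such as `blog.example.org` are kept distinct here; grouping by registrable domain would need a public-suffix list, which this sketch does not attempt.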
In conclusion, we developed a practical understanding of how large-scale web datasets such as FineWeb are explored, filtered, deduplicated, and analyzed for language model training. We worked efficiently with streaming data, tested quality heuristics on real documents, searched for near-duplicate text, and validated the token-level metadata with the GPT-2 tokenizer. The same workflow can be scaled to larger FineWeb crawls, extended for deeper corpus analysis, and used as a starting point for designing preprocessing pipelines for LLM dataset preparation.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.