Vamsi Thokala

Future Blog Post

2199-01-01T00:00:00+00:00

This post will show up by default. To disable scheduling of future posts, edit _config.yml and set future: false.

Building Production-Grade RAG Systems

2024-08-10T00:00:00+00:00

Retrieval-Augmented Generation (RAG) is becoming the standard way to ground LLMs in private data. Building a prototype is easy; building a production system is hard.

The Retrieval Challenge

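Plain vector search ranked by cosine similarity often misses relevant documents for complex queries, such as those that hinge on exact terms or identifiers. Two additions help: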

Hybrid Search: Combining keyword search (BM25) with vector search often yields better results.
Re-ranking: Using a cross-encoder model to re-rank the top K retrieved documents can significantly improve relevance.
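One common way to combine a keyword ranking with a vector ranking is Reciprocal Rank Fusion (RRF). The sketch below is a minimal illustration, not the method used in any particular system: the function name, the document IDs, and the two hit lists are made up for the example, and k=60 is just a commonly used default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for one query from two retrievers.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top. In a pipeline with re-ranking, the head of this fused list would be the top K candidates handed to the cross-encoder.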

Chunking Strategies

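Fixed-size chunking is a good starting point, but semantic chunking or recursive retrieval (parent-child chunking, where small chunks are matched against the query and their larger parent chunk is passed to the LLM) preserves context better.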
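As a rough sketch of both ideas, the code below splits text into fixed-size chunks with optional overlap, then builds a parent-child mapping on top of it. It works on characters to stay self-contained; a real pipeline would more likely count tokens. The function names and sizes are assumptions chosen for illustration.

```python
def fixed_size_chunks(text, size, overlap=0):
    """Split text into chunks of `size` characters; each overlaps the previous by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop early enough that the last chunk is not fully contained in the one before it.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


def parent_child_chunks(text, parent_size=400, child_size=100):
    """Return (parents, children); each child is a (parent_index, child_text) pair."""
    parents = fixed_size_chunks(text, parent_size)
    children = [
        (parent_index, child)
        for parent_index, parent in enumerate(parents)
        for child in fixed_size_chunks(parent, child_size)
    ]
    return parents, children


sample = " ".join(f"word{i}" for i in range(200))
parents, children = parent_child_chunks(sample)
```

At query time the small child chunks would be embedded and searched, and the parent_index of each hit would be used to fetch the larger parent chunk as context.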

Evaluation

How do you know if your RAG system is working? We use frameworks like Ragas and TruLens to measure:

Faithfulness: Is the answer derived from the context?
Answer Relevance: Does the answer address the query?
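To make the idea of a faithfulness score concrete, here is a deliberately crude lexical stand-in: the fraction of distinct answer tokens that also occur in the retrieved context. This is not how Ragas or TruLens compute their metrics (they typically rely on model-based judgments), and the function name is hypothetical. It only illustrates the shape of a check that could run in a regression suite.

```python
import re


def _tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_support(answer, context):
    """Fraction of distinct answer tokens that also occur in the context (0.0 to 1.0)."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & _tokens(context)) / len(answer_tokens)


context = "The capital of France is Paris."
grounded = lexical_support("Paris is the capital", context)
ungrounded = lexical_support("Berlin", context)
```

A low score flags answers that introduce terms absent from the context; a high score does not prove the answer is correct, which is why purpose-built frameworks are preferable for real evaluation.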

Moving from a demo to production requires robust evaluation pipelines and continuous monitoring.

Spark Performance Optimization: Beyond the Basics

2024-07-01T00:00:00+00:00

Apache Spark is powerful, but it’s easy to write inefficient jobs. Here are some advanced techniques I’ve used to optimize long-running ETL processes.

1. Skew Handling

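Data skew, where a few keys hold most of the rows, can kill your job performance: a handful of tasks do most of the work while the rest of the cluster sits idle.

Broadcasting: When joining a large table with a small one, broadcast the smaller table so the large one does not need to be shuffled. This only works if the small table fits comfortably in driver and executor memory.
Salting: When joining two large tables on skewed keys, add a random salt to the keys on one side and replicate the other side once per salt value, so each hot key is spread across several partitions.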

from pyspark.sql.functions import rand

# Assign each row a random integer salt in [0, 10) to spread hot keys across partitions.
df = df.withColumn("salt", (rand() * 10).cast("int"))
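The snippet above only adds the salt column to one side. For the join to stay correct, the other side has to carry every salt value so that each salted key still finds its match. The following pure-Python sketch mimics that on lists of (key, value) rows rather than DataFrames; the function name, the seed, and NUM_SALTS = 4 are assumptions for illustration.

```python
import random
from collections import Counter

NUM_SALTS = 4  # illustrative number of salt buckets


def salted_join(left, right, num_salts=NUM_SALTS, seed=0):
    """Inner-join two lists of (key, value) rows on key, using salted keys."""
    rng = random.Random(seed)
    # Skewed side: tag every row with one random salt.
    salted_left = [((key, rng.randrange(num_salts)), value) for key, value in left]
    # Other side: replicate every row once per salt value.
    salted_right = {}
    for key, value in right:
        for salt in range(num_salts):
            salted_right.setdefault((key, salt), []).append(value)
    joined = [
        (salted_key[0], left_value, right_value)
        for salted_key, left_value in salted_left
        for right_value in salted_right.get(salted_key, [])
    ]
    # Rows per salted key: a stand-in for how much work each partition would get.
    load = Counter(salted_key for salted_key, _ in salted_left)
    return joined, load


left = [("hot", i) for i in range(1000)] + [("cold", 0)]
right = [("hot", "H"), ("cold", "C")]
joined, load = salted_join(left, right)
```

The join result is the same as an unsalted join, but the 1000 "hot" rows are now split across NUM_SALTS salted keys instead of landing on one. The cost is that the other side grows by a factor of NUM_SALTS, so the salt count is a trade-off.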

2. Memory Management

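Understanding Spark’s memory model is key. spark.executor.memory sets the executor heap size, and spark.memory.fraction sets the share of that heap (after a small fixed reservation) used for execution and storage. Tuning both can prevent OOM errors and reduce garbage collection overhead.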
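A quick back-of-the-envelope calculation shows how these settings interact. The sketch below assumes Spark's unified memory model with its documented defaults (300 MB reserved, spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5). The function name is made up, and the numbers are approximate because the JVM reports slightly less heap than configured.

```python
RESERVED_MB = 300  # fixed amount Spark sets aside on each executor heap


def executor_memory_regions_mb(executor_memory_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Estimate the main regions of one executor heap, in MB."""
    usable = executor_memory_mb - RESERVED_MB
    unified = usable * memory_fraction  # shared by execution and storage
    return {
        "unified": unified,
        # Portion of the unified region where cached blocks are immune to eviction.
        "storage_protected": unified * storage_fraction,
        # Remainder, used for user data structures and internal metadata.
        "user": usable - unified,
    }


sizes = executor_memory_regions_mb(8 * 1024)  # spark.executor.memory=8g
```

With an 8 GB executor this gives roughly 4.7 GB of unified memory, which is the budget that shuffles, joins, and cached data compete for.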

3. File Formats and Compression

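Switching from CSV/JSON to a columnar format like Parquet or Delta is a no-brainer, since queries can then read only the columns they need. Also consider the compression codec: Snappy is fast to compress and decompress, while Gzip gives better compression ratios at a higher CPU cost, which suits archival data.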

Final Thoughts

Performance tuning is an iterative process. Always use the Spark UI to identify bottlenecks before applying fixes.

Databricks Migration Patterns: Strategies for Moving 50TB+ to Delta Lake

2024-06-15T00:00:00+00:00

Migrating legacy data warehouses to modern cloud platforms like Databricks is a complex undertaking. In this post, I discuss the strategies I used to migrate over 50TB of data from an on-premises Hadoop cluster to Delta Lake on AWS.

The Challenge

Our legacy system was suffering from:

High maintenance costs
Lack of scalability during peak loads
Inability to support ACID transactions

The Approach: Lift and Shift vs. Re-architecting

We chose a hybrid approach. For raw data ingestion, we used a “lift and shift” strategy to minimize disruption. However, for the consumption layer, we completely re-architected the ETL pipelines using Spark Structured Streaming and Delta Lake.

Key Learnings

Data Validation is Critical: We built a custom validation framework using PySpark to ensure 100% data parity.
Optimize File Layout: Tuning file sizes, compacting small files with OPTIMIZE, and co-locating related data with ZORDER BY significantly improved query performance.
Cost Management: Leveraging Spot Instances for non-critical batch jobs reduced our compute costs by 40%.
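The validation idea can be illustrated with an order-independent fingerprint: compare the row count and a checksum of the source and target tables. This is a pure-Python sketch of the concept, not the PySpark framework described above; the function name and sample rows are hypothetical.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, checksum) for an iterable of row tuples, independent of row order."""
    count, checksum = 0, 0
    for row in rows:
        digest = hashlib.sha256(repr(tuple(row)).encode("utf-8")).digest()
        # Summing per-row hashes modulo 2**64 ignores row order and,
        # unlike XOR, does not cancel out duplicated rows.
        checksum = (checksum + int.from_bytes(digest[:8], "big")) % 2**64
        count += 1
    return count, checksum


source_rows = [(1, "alice", 10.5), (2, "bob", 7.0)]
target_rows = [(2, "bob", 7.0), (1, "alice", 10.5)]  # same rows, different order
parity_ok = table_fingerprint(source_rows) == table_fingerprint(target_rows)
```

A matching fingerprint is strong evidence of parity for a table or partition; when it differs, a per-partition or per-key comparison is needed to locate the mismatched rows.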

Conclusion

The migration not only modernized our stack but also empowered our data science teams to run ML workloads directly on the data lake, unlocking new insights.

Blog Post number 4

2015-08-14T00:00:00+00:00

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Headings are cool

You can have many headings

Aren’t headings cool?
