<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://themagicalthings.github.io/vthokala.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://themagicalthings.github.io/vthokala.github.io/" rel="alternate" type="text/html" /><updated>2026-05-15T01:03:23+00:00</updated><id>https://themagicalthings.github.io/vthokala.github.io/feed.xml</id><title type="html">Vamsi Thokala</title><subtitle>Cloud Data Engineer &amp; GenAI Practitioner</subtitle><author><name>Vamsi Thokala</name></author><entry><title type="html">Future Blog Post</title><link href="https://themagicalthings.github.io/vthokala.github.io/posts/2012/08/blog-post-4/" rel="alternate" type="text/html" title="Future Blog Post" /><published>2199-01-01T00:00:00+00:00</published><updated>2199-01-01T00:00:00+00:00</updated><id>https://themagicalthings.github.io/vthokala.github.io/posts/2012/08/future-post</id><content type="html" xml:base="https://themagicalthings.github.io/vthokala.github.io/posts/2012/08/blog-post-4/"><![CDATA[<p>This post will show up by default. To disable scheduling of future posts, edit <code class="language-plaintext highlighter-rouge">config.yml</code> and set <code class="language-plaintext highlighter-rouge">future: false</code>.</p>]]></content><author><name>Vamsi Thokala</name></author><category term="cool posts" /><category term="category1" /><category term="category2" /><summary type="html"><![CDATA[This post will show up by default. 
To disable scheduling of future posts, edit config.yml and set future: false.]]></summary></entry><entry><title type="html">Building Production-Grade RAG Systems</title><link href="https://themagicalthings.github.io/vthokala.github.io/posts/2024/08/rag-overview/" rel="alternate" type="text/html" title="Building Production-Grade RAG Systems" /><published>2024-08-10T00:00:00+00:00</published><updated>2024-08-10T00:00:00+00:00</updated><id>https://themagicalthings.github.io/vthokala.github.io/posts/2024/08/rag-overview</id><content type="html" xml:base="https://themagicalthings.github.io/vthokala.github.io/posts/2024/08/rag-overview/"><![CDATA[<p>Retrieval-Augmented Generation (RAG) is becoming the standard for grounding LLMs on private data. But building a prototype is easy; building a production system is hard.</p>

<h2 id="the-retrieval-challenge">The Retrieval Challenge</h2>

<p>Ranking by cosine similarity alone often fails on complex queries, such as those that hinge on exact identifiers or rare keywords.</p>
<ul>
  <li><strong>Hybrid Search</strong>: Combining keyword search (BM25) with vector search often yields better results.</li>
  <li><strong>Re-ranking</strong>: Using a cross-encoder model to re-rank the top K retrieved documents can significantly improve relevance.</li>
</ul>
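
<p>One common way to combine a keyword ranking with a vector ranking is Reciprocal Rank Fusion (RRF). The sketch below is a minimal illustration, not necessarily the fusion method used in any particular system: the function name, the document IDs, and the two input rankings are hypothetical, and <code class="language-plaintext highlighter-rouge">k=60</code> is only the conventional default for the smoothing constant.</p>

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results from a BM25 index and from a vector index
bm25_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

<p>Documents that rank well in both lists rise to the top, and the fused list can then be passed to a cross-encoder for re-ranking.</p>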

<h2 id="chunking-strategies">Chunking Strategies</h2>

<p>Fixed-size chunking is a good starting point, but semantic chunking or recursive retrieval (parent-child chunking) preserves context better.</p>
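
<p>As a rough sketch of parent-child chunking, assuming character-based splitting for simplicity (real pipelines usually split on tokens or sentences): small child chunks are what gets embedded and searched, and each child keeps the ID of the larger parent chunk that is handed to the LLM. All function names and sizes here are illustrative.</p>

```python
def split_fixed(text, size, overlap=0):
    """Split text into windows of `size` characters that share `overlap` characters."""
    if overlap not in range(size):
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def build_parent_child_index(text, parent_size=400, child_size=100):
    """Return (parents, children); each child is a (child_text, parent_id) pair."""
    parents = split_fixed(text, parent_size)
    children = []
    for parent_id, parent in enumerate(parents):
        for child in split_fixed(parent, child_size):
            children.append((child, parent_id))
    return parents, children


# Toy document of 100 characters, split into 40-char parents and 10-char children
document = "abcdefghij" * 10
parents, children = build_parent_child_index(document, parent_size=40, child_size=10)
print(len(parents), len(children))
```

<p>At query time, the retriever matches against the children and returns the parent text for each hit, so the model sees the surrounding context rather than an isolated fragment.</p>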

<h2 id="evaluation">Evaluation</h2>

<p>How do you know if your RAG system is working? We use frameworks like <strong>Ragas</strong> and <strong>TruLens</strong> to measure:</p>
<ul>
  <li><strong>Faithfulness</strong>: Is the answer derived from the context?</li>
  <li><strong>Answer Relevance</strong>: Does the answer address the query?</li>
</ul>
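
<p>To make the faithfulness idea concrete, here is a deliberately crude lexical stand-in: the fraction of answer sentences whose words all appear in the retrieved context. This is not the Ragas or TruLens API, and those frameworks typically rely on an LLM judge rather than word overlap; the function names below are hypothetical and the sketch is only meant to show the shape of the metric.</p>

```python
import re


def _words(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_faithfulness(answer, context):
    """Fraction of answer sentences whose words all occur in the context."""
    context_words = _words(context)
    sentences = [s for s in re.split(r"[.!?]+", answer) if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(1 for s in sentences if _words(s).issubset(context_words))
    return supported / len(sentences)


context = "Delta Lake supports ACID transactions on object storage."
answer = "Delta Lake supports ACID transactions. It was released in 1999."

score = lexical_faithfulness(answer, context)
print(score)
```

<p>The second sentence of the answer has no support in the context, so only half of the answer counts as grounded. A production pipeline would replace the word-overlap check with a model-based judgment of each claim.</p>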

<p>Moving from a demo to production requires robust evaluation pipelines and continuous monitoring.</p>]]></content><author><name>Vamsi Thokala</name></author><category term="GenAI" /><category term="RAG" /><category term="LLM" /><category term="Vector DB" /><summary type="html"><![CDATA[Retrieval-Augmented Generation (RAG) is becoming the standard for grounding LLMs on private data. But building a prototype is easy; building a production system is hard.]]></summary></entry><entry><title type="html">Spark Performance Optimization: Beyond the Basics</title><link href="https://themagicalthings.github.io/vthokala.github.io/posts/2024/07/spark-performance-basics/" rel="alternate" type="text/html" title="Spark Performance Optimization: Beyond the Basics" /><published>2024-07-01T00:00:00+00:00</published><updated>2024-07-01T00:00:00+00:00</updated><id>https://themagicalthings.github.io/vthokala.github.io/posts/2024/07/spark-performance-basics</id><content type="html" xml:base="https://themagicalthings.github.io/vthokala.github.io/posts/2024/07/spark-performance-basics/"><![CDATA[<p>Apache Spark is powerful, but it’s easy to write inefficient jobs. Here are some advanced techniques I’ve used to optimize long-running ETL processes.</p>

<h2 id="1-skew-handling">1. Skew Handling</h2>

<p>Data skew can kill your job performance.</p>
<ul>
  <li><strong>Broadcasting</strong>: If joining a large table with a small one, <code class="language-plaintext highlighter-rouge">broadcast</code> the smaller table, provided it fits comfortably in driver and executor memory.</li>
  <li><strong>Salting</strong>: For joining two large skewed tables, add a random salt to the keys on the skewed side and replicate the other side once per salt value, so the load is distributed evenly.</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="kn">import</span> <span class="n">rand</span>
<span class="c1"># Salt values fall in 0..9, so each hot key is spread over up to 10 sub-keys.</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">withColumn</span><span class="p">(</span><span class="s">"salt"</span><span class="p">,</span> <span class="p">(</span><span class="n">rand</span><span class="p">()</span> <span class="o">*</span> <span class="mi">10</span><span class="p">).</span><span class="n">cast</span><span class="p">(</span><span class="s">"int"</span><span class="p">))</span>
</code></pre></div></div>
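
<p>To see why salting helps, the following plain-Python simulation (no Spark required) compares the heaviest partition before and after salting. It is only an illustration under stated assumptions: the key names, the partition count, and the CRC-based partitioner are hypothetical stand-ins for Spark's hash partitioning, and in a real join the other table must be replicated once per salt value so that salted keys still match.</p>

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8
NUM_SALTS = 10


def partition_of(key):
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def max_partition_load(keys):
    """Number of rows that land in the busiest partition."""
    return max(Counter(partition_of(k) for k in keys).values())


rng = random.Random(42)

# 90% of the rows share a single hot key, the rest are unique keys
keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]

# Append a random salt in 0..9 to every key, as in the PySpark snippet above
salted_keys = [f"{k}_{rng.randrange(NUM_SALTS)}" for k in keys]

unsalted_max = max_partition_load(keys)
salted_max = max_partition_load(salted_keys)
print(unsalted_max, salted_max)
```

<p>Without the salt, every row for the hot key lands in one partition; with it, those rows are divided among several partitions and the slowest task has far less work to do.</p>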

<h2 id="2-memory-management">2. Memory Management</h2>

<p>Understanding Spark’s memory model is key. Tuning <code class="language-plaintext highlighter-rouge">spark.executor.memory</code> and <code class="language-plaintext highlighter-rouge">spark.memory.fraction</code> can prevent OOM errors and reduce garbage collection overhead.</p>
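
<p>As a back-of-the-envelope sketch of the unified memory model: Spark sets aside roughly 300 MiB of the heap, and <code class="language-plaintext highlighter-rouge">spark.memory.fraction</code> (default 0.6) of the remainder is shared by execution and storage, with <code class="language-plaintext highlighter-rouge">spark.memory.storageFraction</code> (default 0.5) of that region protected from eviction. The helper below is hypothetical and approximate; it ignores off-heap memory and memory overhead, and the heap the JVM actually reports is slightly smaller than the configured value.</p>

```python
RESERVED_MIB = 300  # fixed amount Spark reserves from the executor heap


def memory_regions_mib(heap_mib, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate on-heap memory regions (in MiB) for a single executor."""
    available = heap_mib - RESERVED_MIB
    unified = available * memory_fraction   # shared by execution and storage
    storage = unified * storage_fraction    # cached blocks protected from eviction
    return {
        "unified": unified,
        "storage_protected": storage,
        "execution_initial": unified - storage,
        "user": available - unified,         # user data structures and metadata
    }


# Example: spark.executor.memory=8g with default fractions
regions = memory_regions_mib(8 * 1024)
for name, size in regions.items():
    print(f"{name}: {size:.1f} MiB")
```

<p>Raising <code class="language-plaintext highlighter-rouge">spark.memory.fraction</code> gives shuffles and caches more room but shrinks the user region, which can shift the OOM risk to your own data structures rather than remove it.</p>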

<h2 id="3-file-formats-and-compression">3. File Formats and Compression</h2>

<p>Switching from CSV/JSON to Parquet or Delta is a no-brainer. But also consider the compression codec. Snappy is fast, but Gzip offers better compression ratios for archival data.</p>

<h2 id="final-thoughts">Final Thoughts</h2>

<p>Performance tuning is an iterative process. Always use the Spark UI to identify bottlenecks before applying fixes.</p>]]></content><author><name>Vamsi Thokala</name></author><category term="Spark" /><category term="Performance" /><category term="Optimization" /><summary type="html"><![CDATA[Apache Spark is powerful, but it’s easy to write inefficient jobs. Here are some advanced techniques I’ve used to optimize long-running ETL processes.]]></summary></entry><entry><title type="html">Databricks Migration Patterns: Strategies for Moving 50TB+ to Delta Lake</title><link href="https://themagicalthings.github.io/vthokala.github.io/posts/2024/06/databricks-migration-patterns/" rel="alternate" type="text/html" title="Databricks Migration Patterns: Strategies for Moving 50TB+ to Delta Lake" /><published>2024-06-15T00:00:00+00:00</published><updated>2024-06-15T00:00:00+00:00</updated><id>https://themagicalthings.github.io/vthokala.github.io/posts/2024/06/databricks-migration-patterns</id><content type="html" xml:base="https://themagicalthings.github.io/vthokala.github.io/posts/2024/06/databricks-migration-patterns/"><![CDATA[<p>Migrating legacy data warehouses to modern cloud platforms like Databricks is a complex undertaking. In this post, I discuss the strategies I used to migrate over 50TB of data from an on-premise Hadoop cluster to Delta Lake on AWS.</p>

<h2 id="the-challenge">The Challenge</h2>

<p>Our legacy system was suffering from:</p>
<ul>
  <li>High maintenance costs</li>
  <li>Lack of scalability during peak loads</li>
  <li>Inability to support ACID transactions</li>
</ul>

<h2 id="the-approach-lift-and-shift-vs-re-architecting">The Approach: Lift and Shift vs. Re-architecting</h2>

<p>We chose a hybrid approach. For raw data ingestion, we used a “lift and shift” strategy to minimize disruption. However, for the consumption layer, we completely re-architected the ETL pipelines using Spark Structured Streaming and Delta Lake.</p>

<h3 id="key-learnings">Key Learnings</h3>

<ol>
  <li><strong>Data Validation is Critical</strong>: We built a custom validation framework using PySpark to ensure 100% data parity.</li>
  <li><strong>Optimize for Write Throughput</strong>: Tuning the file sizes and running <code class="language-plaintext highlighter-rouge">OPTIMIZE</code> with <code class="language-plaintext highlighter-rouge">ZORDER BY</code> clustering significantly improved query performance.</li>
  <li><strong>Cost Management</strong>: Leveraging Spot Instances for non-critical batch jobs reduced our compute costs by 40%.</li>
</ol>
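
<p>The validation idea can be sketched in plain Python as a row count plus an order-independent checksum per table. This is an illustration only, not the PySpark framework described above: the function name and sample rows are hypothetical, and the string canonicalization is a simplification that a real implementation would need to harden for types, precision, and delimiters.</p>

```python
import hashlib

MODULUS = 2**64


def table_fingerprint(rows):
    """Return (row_count, checksum); the checksum ignores row order."""
    count = 0
    checksum = 0
    for row in rows:
        # Canonical text form of the row; NULLs get an explicit marker
        canonical = "|".join("NULL" if v is None else str(v) for v in row)
        digest = hashlib.sha256(canonical.encode("utf-8")).digest()
        # Summing per-row hashes makes the result independent of row order
        checksum = (checksum + int.from_bytes(digest[:8], "big")) % MODULUS
        count += 1
    return count, checksum


source_rows = [(1, "alice", 10.5), (2, "bob", None)]
target_rows = [(2, "bob", None), (1, "alice", 10.5)]   # same rows, different order
drifted_rows = [(1, "alice", 10.5), (2, "bob", 0.0)]   # NULL silently became 0.0

parity_ok = table_fingerprint(source_rows) == table_fingerprint(target_rows)
drift_detected = table_fingerprint(source_rows) != table_fingerprint(drifted_rows)
print(parity_ok, drift_detected)
```

<p>Comparing fingerprints per table or per partition narrows down where a mismatch lives before you pay for a full row-by-row diff.</p>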

<h2 id="conclusion">Conclusion</h2>

<p>The migration not only modernized our stack but also empowered our data science teams to run ML workloads directly on the data lake, unlocking new insights.</p>]]></content><author><name>Vamsi Thokala</name></author><category term="Databricks" /><category term="Migration" /><category term="Delta Lake" /><category term="Data Engineering" /><summary type="html"><![CDATA[Migrating legacy data warehouses to modern cloud platforms like Databricks is a complex undertaking. In this post, I discuss the strategies I used to migrate over 50TB of data from an on-premise Hadoop cluster to Delta Lake on AWS.]]></summary></entry><entry><title type="html">Blog Post number 4</title><link href="https://themagicalthings.github.io/vthokala.github.io/posts/2012/08/blog-post-4/" rel="alternate" type="text/html" title="Blog Post number 4" /><published>2015-08-14T00:00:00+00:00</published><updated>2015-08-14T00:00:00+00:00</updated><id>https://themagicalthings.github.io/vthokala.github.io/posts/2012/08/blog-post-4</id><content type="html" xml:base="https://themagicalthings.github.io/vthokala.github.io/posts/2012/08/blog-post-4/"><![CDATA[<p>This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.</p>

<h1 id="headings-are-cool">Headings are cool</h1>

<h1 id="you-can-have-many-headings">You can have many headings</h1>

<h2 id="arent-headings-cool">Aren’t headings cool?</h2>]]></content><author><name>Vamsi Thokala</name></author><category term="cool posts" /><category term="category1" /><category term="category2" /><summary type="html"><![CDATA[This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.]]></summary></entry><entry><title type="html">Blog Post number 3</title><link href="https://themagicalthings.github.io/vthokala.github.io/posts/2014/08/blog-post-3/" rel="alternate" type="text/html" title="Blog Post number 3" /><published>2014-08-14T00:00:00+00:00</published><updated>2014-08-14T00:00:00+00:00</updated><id>https://themagicalthings.github.io/vthokala.github.io/posts/2014/08/blog-post-3</id><content type="html" xml:base="https://themagicalthings.github.io/vthokala.github.io/posts/2014/08/blog-post-3/"><![CDATA[<p>This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.</p>

<h1 id="headings-are-cool">Headings are cool</h1>

<h1 id="you-can-have-many-headings">You can have many headings</h1>

<h2 id="arent-headings-cool">Aren’t headings cool?</h2>]]></content><author><name>Vamsi Thokala</name></author><category term="cool posts" /><category term="category1" /><category term="category2" /><summary type="html"><![CDATA[This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.]]></summary></entry><entry><title type="html">Blog Post number 2</title><link href="https://themagicalthings.github.io/vthokala.github.io/posts/2013/08/blog-post-2/" rel="alternate" type="text/html" title="Blog Post number 2" /><published>2013-08-14T00:00:00+00:00</published><updated>2013-08-14T00:00:00+00:00</updated><id>https://themagicalthings.github.io/vthokala.github.io/posts/2013/08/blog-post-2</id><content type="html" xml:base="https://themagicalthings.github.io/vthokala.github.io/posts/2013/08/blog-post-2/"><![CDATA[<p>This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.</p>

<h1 id="headings-are-cool">Headings are cool</h1>

<h1 id="you-can-have-many-headings">You can have many headings</h1>

<h2 id="arent-headings-cool">Aren’t headings cool?</h2>]]></content><author><name>Vamsi Thokala</name></author><category term="cool posts" /><category term="category1" /><category term="category2" /><summary type="html"><![CDATA[This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.]]></summary></entry><entry><title type="html">Blog Post number 1</title><link href="https://themagicalthings.github.io/vthokala.github.io/posts/2012/08/blog-post-1/" rel="alternate" type="text/html" title="Blog Post number 1" /><published>2012-08-14T00:00:00+00:00</published><updated>2012-08-14T00:00:00+00:00</updated><id>https://themagicalthings.github.io/vthokala.github.io/posts/2012/08/blog-post-1</id><content type="html" xml:base="https://themagicalthings.github.io/vthokala.github.io/posts/2012/08/blog-post-1/"><![CDATA[<p>This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.</p>

<h1 id="headings-are-cool">Headings are cool</h1>

<h1 id="you-can-have-many-headings">You can have many headings</h1>

<h2 id="arent-headings-cool">Aren’t headings cool?</h2>]]></content><author><name>Vamsi Thokala</name></author><category term="cool posts" /><category term="category1" /><category term="category2" /><summary type="html"><![CDATA[This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.]]></summary></entry></feed>