
How to Bypass Elasticsearch’s 10,000-Result Limit with the Scroll API

If you’ve ever worked with the Elasticsearch API, you’ve likely run into its infamous 10,000-result limit. It’s a default cap that can feel like a brick wall when you’re dealing with large datasets—think log analysis, report generation, or bulk data exports. Fortunately, there’s a slick workaround: the Scroll API. In this post, I’ll walk you through why this limit exists, how the Scroll API solves it, and share practical examples to get you started.

Why the 10,000-Result Limit Exists

Elasticsearch caps standard search results at 10,000 hits to protect performance. Fetching millions of records in one shot with the `from` and `size` parameters can strain memory and slow things down. But what if you need all that data? That’s where the Scroll API shines—it’s designed for deep pagination, letting you retrieve everything in manageable chunks.
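
For example, with the default settings a paged request that tries to reach past the 10,000th hit is rejected outright (the index name here is just a placeholder):
GET /my_index/_search
{
  "from": 10000,
  "size": 100,
  "query": {
    "match_all": {}
  }
}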

What Is the Scroll API?

Unlike a typical search, the Scroll API maintains a temporary “scroll context” on the server. You grab a batch of results, get a scroll_id, and use it to fetch the next batch—no need to rerun your query. It’s efficient, scalable, and perfect for big data tasks.

How to Use the Scroll API: Step by Step

Let’s break it down with examples you can try yourself.

Step 1: Start the Scroll

Kick things off with a search request. Add the `scroll` parameter (like `1m` for a 1-minute timeout) and set `size` to control your batch size. Here’s a basic example:
GET /my_index/_search?scroll=1m
{
  "size": 1000,
  "query": {
    "match_all": {}
  }
}
This pulls the first 1,000 hits and returns a `scroll_id`—a long, encoded string you’ll need for the next step.
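
The response looks roughly like this (heavily trimmed; the exact shape of `total` varies by version, and a real `_scroll_id` is far longer):
{
  "_scroll_id": "c2NhbjsxMDAwO...",
  "hits": {
    "total": { "value": 250000, "relation": "eq" },
    "hits": [ ...the first 1,000 documents... ]
  }
}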

Step 2: Fetch More Results

Using that `scroll_id`, request the next batch. You don’t need to repeat the query—just send the ID and timeout:
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "c2NhbjsxMDAwO...YOUR_SCROLL_ID_HERE..."
}
Loop this call until you’ve retrieved all your data; you’re done when a response comes back with an empty `hits` array. Each response includes a `_scroll_id` (often the same one, but it can change between calls), so always use the most recent value.

Step 3: Clean Up

When you’re done, delete the scroll context to free up server resources. It’s a small but critical step:
DELETE /_search/scroll/c2NhbjsxMDAwO...YOUR_SCROLL_ID_HERE...

Skip this, and you’ll leave dangling contexts that could bog down your cluster.
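
You can also pass the ID in a request body instead of the URL, or clear every open scroll context in one go (handy in scripts, but use it carefully on shared clusters):
DELETE /_search/scroll
{
  "scroll_id": "c2NhbjsxMDAwO...YOUR_SCROLL_ID_HERE..."
}

DELETE /_search/scroll/_all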

A Real-World Example

Let’s say you’re sifting through millions of logs for a specific error. Here’s a targeted scroll query:
GET /logs/_search?scroll=2m
{
  "size": 500,
  "query": {
    "match": {
      "error_message": "timeout"
    }
  }
}

Then, use the Scroll API to paginate through every matching log entry. It’s way cleaner than hacking around with `from` and `size`.

Tips for Scroll API Success
  • Batch Size: Stick to a `size` like 500–1000. Too large, and you’ll strain memory; too small, and you’ll make too many requests.
  • Timeout Tuning: Set the scroll duration (e.g., `1m`, `5m`) based on how fast your script processes each batch. Too short, and the context expires mid-run.
  • Automation: Use a script to handle the loop. Python’s `elasticsearch` library, for instance, makes this straightforward (it also ships a `helpers.scan` convenience function that wraps the same pattern):
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Open the scroll context and pull the first batch
scroll = es.search(
    index="logs",
    scroll="2m",
    size=500,
    body={"query": {"match": {"error_message": "timeout"}}},
)
scroll_id = scroll["_scroll_id"]

# Keep fetching until a batch comes back empty
while scroll["hits"]["hits"]:
    print(scroll["hits"]["hits"])  # Process this batch
    scroll = es.scroll(scroll_id=scroll_id, scroll="2m")
    scroll_id = scroll["_scroll_id"]  # The ID can change between calls

es.clear_scroll(scroll_id=scroll_id)  # Free the scroll context when done

Why Scroll Beats the Alternatives

You could tweak `index.max_result_window` to raise the limit, but that’s a performance gamble. Export tools or aggregations might work for summaries, but for raw data retrieval, Scroll is king—efficient and built for the job.
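
For reference, raising the window is a simple settings change, but you pay for it in memory and latency on deep pages (the 50,000 below is an arbitrary example):
PUT /my_index/_settings
{
  "index": {
    "max_result_window": 50000
  }
}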

Conclusion

The Scroll API has been a game-changer for my Elasticsearch projects, especially when wrestling with massive indices. It’s simple once you get the hang of it, and the payoff is huge.

Elastic APM: When to Use @CaptureSpan vs. @CaptureTransaction?

If you’re working with Elastic APM in a Java application, you might wonder when to use `@CaptureSpan` versus `@CaptureTransaction`. Both are powerful tools for observability, but they serve different purposes.
🔹 `@CaptureTransaction`:
Use this at the entry point of a request: typically a controller method, a service entry point, or a background job. It defines the start of a transaction and lets you trace how a request propagates through your system.
🔹 `@CaptureSpan`:
Use this to track sub-operations within a transaction, such as database queries, HTTP calls, or specific business logic. It helps break down execution time and pinpoint performance bottlenecks inside a transaction.
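
To make this concrete, here’s a minimal sketch assuming the Elastic APM Java agent is attached and the `apm-agent-api` annotations are on the classpath (the class and method names are illustrative):
import co.elastic.apm.api.CaptureSpan;
import co.elastic.apm.api.CaptureTransaction;

public class ReportJob {

    // Entry point of the background job: recorded as one transaction per run
    @CaptureTransaction("report-generation")
    public void run() {
        loadOrders();
        renderPdf();
    }

    // Sub-operations recorded as spans inside the active transaction
    @CaptureSpan(value = "load-orders", type = "db")
    void loadOrders() {
        // ... query the database ...
    }

    @CaptureSpan("render-pdf")
    void renderPdf() {
        // ... CPU-heavy rendering ...
    }
}
Note that `@CaptureSpan` only records a span when a transaction is already active, so the outer `@CaptureTransaction` (or a transaction created automatically by the agent) has to be in place first.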

📌 Best Practices:

✅ Apply @CaptureTransaction at the highest-level method handling a request.
✅ Use @CaptureSpan for key internal operations you want to monitor.
✅ Avoid excessive spans—instrument only critical code paths to reduce overhead.

By balancing these annotations effectively, you can get detailed insights into your app’s performance while keeping APM overhead minimal.