RAG: Web Search and Analysis with Apify and Haystack
Last Updated: January 21, 2025
Want to give any of your LLM applications the power to search and browse the web? In this cookbook, we’ll show you how to use the RAG Web Browser Actor to search Google and extract content from web pages, and then analyze the results with a large language model - all within the Haystack ecosystem, using the apify-haystack integration.
This cookbook also demonstrates how to leverage the RAG Web Browser Actor with Haystack to create powerful web-aware applications. We’ll explore multiple use cases showing how easy it is to:
- Search interesting topics
- Analyze the results with OpenAIGenerator
- Use the Haystack Pipeline for web search and analysis
We’ll start by using the RAG Web Browser Actor to perform web searches and then use the OpenAIGenerator to analyze and summarize the web content.
Install dependencies
!pip install apify-haystack==0.1.4 haystack-ai
Set up the API keys
You need an Apify account and an APIFY_API_TOKEN.
You also need an OpenAI account and an OPENAI_API_KEY.
import os
from getpass import getpass
os.environ["APIFY_API_TOKEN"] = getpass("Enter YOUR APIFY_API_TOKEN")
os.environ["OPENAI_API_KEY"] = getpass("Enter YOUR OPENAI_API_KEY")
Search interesting topics
The RAG Web Browser Actor is designed to enhance AI and Large Language Model (LLM) applications by providing up-to-date web content. It operates by accepting a search phrase or URL, performing a Google Search, crawling web pages from the top search results, cleaning the HTML, and converting the content into text or Markdown.
Output Format
The output from the RAG Web Browser Actor is a JSON array, where each object contains:
- crawl: Details about the crawling process, including HTTP status code and load time.
- searchResult: Information from the search result, such as the title, description, and URL.
- metadata: Additional metadata like the page title, description, language code, and URL.
- markdown: The main content of the page, converted into Markdown format.
For example, query:
rag web browser
returns:
[
    {
        "crawl": {
            "httpStatusCode": 200,
            "httpStatusMessage": "OK",
            "loadedAt": "2024-11-25T21:23:58.336Z",
            "uniqueKey": "eM0RDxDQ3q",
            "requestStatus": "handled"
        },
        "searchResult": {
            "title": "apify/rag-web-browser",
            "description": "Sep 2, 2024 — The RAG Web Browser is designed for Large Language Model (LLM) applications ...",
            "url": "https://github.com/apify/rag-web-browser"
        },
        "metadata": {
            "title": "GitHub - apify/rag-web-browser: RAG Web Browser is an Apify Actor to feed your LLM applications ...",
            "description": "RAG Web Browser is an Apify Actor to feed your LLM applications ...",
            "languageCode": "en",
            "url": "https://github.com/apify/rag-web-browser"
        },
        "markdown": "# apify/rag-web-browser: RAG Web Browser is an Apify Actor ..."
    }
]
We will convert this JSON to a Haystack Document using the dataset_mapping_function as follows:
from haystack import Document
def dataset_mapping_function(dataset_item: dict) -> Document:
    return Document(
        content=dataset_item.get("markdown"),
        meta={
            "title": dataset_item.get("metadata", {}).get("title"),
            "url": dataset_item.get("metadata", {}).get("url"),
            "language": dataset_item.get("metadata", {}).get("languageCode")
        }
    )
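To see what the mapping produces, you can apply it to a dictionary shaped like the sample Actor output above (the sample_item values below are shortened for illustration):

# A hand-made item mirroring the Actor output above (values shortened)
sample_item = {
    "markdown": "# apify/rag-web-browser: RAG Web Browser is an Apify Actor ...",
    "metadata": {
        "title": "GitHub - apify/rag-web-browser",
        "url": "https://github.com/apify/rag-web-browser",
        "languageCode": "en"
    }
}
doc = dataset_mapping_function(sample_item)
print(doc.meta["title"], doc.meta["url"])
print(doc.content[:60])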
Now set up the ApifyDatasetFromActorCall component:
from apify_haystack import ApifyDatasetFromActorCall
document_loader = ApifyDatasetFromActorCall(
    actor_id="apify/rag-web-browser",
    run_input={
        "maxResults": 2,
        "outputFormats": ["markdown"],
        "requestTimeoutSecs": 30
    },
    dataset_mapping_function=dataset_mapping_function,
)
Check out the other run_input parameters in the RAG Web Browser repository on GitHub.
Note that you can also manually pass your API key as the named parameter apify_api_token in the constructor if it is not set as an environment variable.
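For example, a minimal sketch of passing the token explicitly (the token string is just a placeholder):

document_loader = ApifyDatasetFromActorCall(
    actor_id="apify/rag-web-browser",
    run_input={"maxResults": 2, "outputFormats": ["markdown"]},
    dataset_mapping_function=dataset_mapping_function,
    apify_api_token="<YOUR-APIFY-API-TOKEN>",  # placeholder; prefer the environment variable
)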
Run the Actor and fetch results
Let’s run the Actor with a sample query and fetch the results. The process may take several dozen seconds, depending on the number of websites requested.
query = "Artificial intelligence latest developments"
# Run the Actor and extract the list of documents
result = document_loader.run(run_input={"query": query})
documents = result.get("documents", [])
for doc in documents:
    print(f"Title: {doc.meta['title']}")
    print(f"Truncated content: \n {doc.content[:100]} ...")
    print("---")
Title: 7 Recent AI Developments: Artificial Intelligence News
Truncated content: 7 Recent AI Developments: Artificial Intelligence NewsContact phone +1-888-840-3252Koombea [Skip to Content](#maincontent)
[+1-888-840-3252](tel:+18888403252)
[](https://www.koombea.com)
[get in touch](/contact/)
HiTech
9 minutes read
# 7 Recen ...
---
Title: Artificial Intelligence News -- ScienceDaily
Truncated content: Artificial Intelligence News -- ScienceDaily
[Skip to main content](#main)
[![ScienceDaily](/images/sd-logo.png)](/ "ScienceDaily")
* * *
Your source for the latest research news
[Follow:](#) [_Facebook_](https://www.facebook.com/sciencedaily) [ ...
---
Analyze the results with OpenAIGenerator
Use the OpenAIGenerator to analyze and summarize the web content.
from haystack.components.generators import OpenAIGenerator
generator = OpenAIGenerator(model="gpt-4o-mini")
for doc in documents:
    result = generator.run(prompt=doc.content)
    summary = result["replies"][0]  # Access the generated text
    print(f"Summary for {doc.meta.get('title')} available from {doc.meta.get('url')}: \n{summary}\n ---")
Summary for 7 Recent AI Developments: Artificial Intelligence News available from https://www.koombea.com/blog/7-recent-ai-developments/:
AI is making waves in various industries, from robotics to healthcare to brewing beer. As technology advances, AI is beginning to take on more complex tasks and solve unique problems. The implications for AI in app development are also profound, making apps smarter, more efficient, and user-friendly. Stay tuned for even more groundbreaking developments in the world of AI.
Summary for Artificial Intelligence News -- ScienceDaily available from https://www.sciencedaily.com/news/computers_math/artificial_intelligence/:
[Machine Psychology: A Bridge to General AI?](https://www.sciencedaily.com/releases/2024/12/241219190259.htm)
[Scientists Create AI That 'Watches' Videos by Mimicking the Brain](https://www.sciencedaily.com/releases/2024/12/241209163200.htm)
[Bird-Inspired Drone Can Jump for Take-Off](https://www.sciencedaily.com/releases/2024/12/241206111951.htm)
[Robot That Watched Surgery Videos Performs With Skill of Human Doctor, Researchers Report](https://www.sciencedaily.com/releases/2024/11/241111123037.htm)
Summary for 8 AI and machine learning trends to watch in 2025 | TechTarget available from https://www.techtarget.com/searchenterpriseai/tip/9-top-AI-and-machine-learning-trends:
The article discusses eight AI and machine learning trends to watch in 2025:
1. Pragmatic approaches to generative AI
2. Expansion of generative AI beyond chatbots
3. Rise of AI agents capable of independent action
4. Evolution of generative AI models into commodities
5. Domain-specific AI applications and data sets
6. Importance of AI literacy for everyone
7. Adaptation to an evolving regulatory environment
8. Escalation of AI-related security concerns
These trends reflect the current state and future direction of AI and machine learning technologies as they continue to impact various industries and sectors.
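The loop above passes the raw page text as the prompt and lets the model infer that a summary is wanted. If you want more predictable output, one small variation (a sketch; the prompt wording and the 10,000-character cutoff are illustrative) is to wrap the content in an explicit instruction:

# Wrap the page content in an explicit summarization instruction
summary_prompt = "Summarize the following web page in 3-4 sentences:\n\n{content}"

for doc in documents:
    # Truncate very long pages so the prompt stays within the model's context window
    result = generator.run(prompt=summary_prompt.format(content=doc.content[:10000]))
    print(f"Summary for {doc.meta.get('title')}:\n{result['replies'][0]}\n---")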
Use the Haystack Pipeline for web search and analysis
Now let’s put everything together in a single Haystack Pipeline. We’ll create a pipeline that:
- Searches the web using the RAG Web Browser Actor
- Cleans and truncates the retrieved documents
- Builds an analysis prompt from the cleaned documents
- Generates a structured analysis of the content
from haystack import Pipeline
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.builders import PromptBuilder
# Improved dataset_mapping_function that truncates the content
def dataset_mapping_function(dataset_item: dict) -> Document:
    max_chars = 10000
    content = dataset_item.get("markdown", "")
    return Document(
        content=content[:max_chars],
        meta={
            "title": dataset_item.get("metadata", {}).get("title"),
            "url": dataset_item.get("metadata", {}).get("url"),
            "language": dataset_item.get("metadata", {}).get("languageCode")
        }
    )
def create_pipeline(query: str) -> Pipeline:
    document_loader = ApifyDatasetFromActorCall(
        actor_id="apify/rag-web-browser",
        run_input={
            "query": query,
            "maxResults": 2,
            "outputFormats": ["markdown"]
        },
        dataset_mapping_function=dataset_mapping_function,
    )

    cleaner = DocumentCleaner(
        remove_empty_lines=True,
        remove_extra_whitespaces=True,
        remove_repeated_substrings=True
    )

    prompt_template = """
    Analyze the following content and provide:
    1. Key points and findings
    2. Practical implications
    3. Notable conclusions
    Be concise.

    Context:
    {% for document in documents %}
    {{ document.content }}
    {% endfor %}

    Analysis:
    """

    prompt_builder = PromptBuilder(template=prompt_template)
    generator = OpenAIGenerator(model="gpt-4o-mini")

    pipe = Pipeline()
    pipe.add_component("loader", document_loader)
    pipe.add_component("cleaner", cleaner)
    pipe.add_component("prompt_builder", prompt_builder)
    pipe.add_component("generator", generator)

    pipe.connect("loader", "cleaner")
    pipe.connect("cleaner", "prompt_builder")
    pipe.connect("prompt_builder", "generator")
    return pipe
# Function to run the pipeline
def research_topic(query: str) -> list[str]:
    pipeline = create_pipeline(query)
    result = pipeline.run(data={})
    return result["generator"]["replies"]
query = "latest developments in AI ethics"
analysis = research_topic(query)[0]
print("Analysis Result:")
print(analysis)
Analysis Result:
1. Key Points and Findings:
- The global conversation on AI ethics has gained momentum, with calls for inclusive global governance frameworks.
- Different cultural perspectives shape ethical considerations in AI technologies.
- AI technologies can have varying social impacts across different cultural contexts.
- Initiatives like UNESCO's Recommendation on the Ethics of AI aim to promote responsible AI development.
- Fragmented regulatory approaches in different regions pose challenges for multinational tech companies.
2. Practical Implications:
- Businesses need to consider cultural diversity and social impacts when implementing AI technologies.
- Collaboration between industry, academia, and regulators is essential to address ethical challenges.
- Initiatives promoting responsible AI development, such as global standards like UNESCO's Recommendation, should be supported.
- Harmonized global standards can facilitate innovation while addressing ethical concerns in AI governance.
3. Notable Conclusions:
- Local cultural contexts significantly influence perceptions and understandings of AI, highlighting the need for nuanced AI governance.
- Initiatives like the UNESCO Recommendation on the Ethics of AI and forums like the Global Forum on the Ethics of Artificial Intelligence aim to address ethical challenges in AI governance.
- The push for more harmonized global standards reflects the industry's commitment to responsible AI development in a culturally diverse landscape.
You can customize the pipeline further by:
- Adding more sophisticated routing logic
- Implementing additional preprocessing steps
- Creating specialized generators for different content types
- Adding error handling and retries (see the sketch below)
- Implementing caching for improved performance
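For instance, error handling and retries can be added with a thin wrapper around research_topic. The sketch below is illustrative; the retry count and backoff delays are arbitrary values, not recommendations:

import time

def research_topic_with_retries(query: str, max_retries: int = 3) -> str:
    # Retry the whole pipeline run on transient failures (network issues, rate limits)
    for attempt in range(1, max_retries + 1):
        try:
            return research_topic(query)[0]
        except Exception as exc:
            if attempt == max_retries:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying...")
            time.sleep(2 * attempt)  # simple linear backoff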
This completes our exploration of using Apify’s RAG Web Browser with Haystack for web-aware AI applications. The combination of web search capabilities with sophisticated content processing and analysis creates powerful possibilities for research, analysis and many other tasks.