Real PubMed Search Analysis with searchAnalyzeR

A Comprehensive Example Using COVID-19 Long-term Effects

searchAnalyzeR Development Team

2025-08-23

1 Introduction

This vignette demonstrates how to use the searchAnalyzeR package to conduct a comprehensive analysis of systematic review search strategies using real PubMed data. We’ll walk through a complete workflow for analyzing search performance, from executing searches to generating publication-ready reports.

The searchAnalyzeR package provides tools for:

  • Executing searches and standardizing the retrieved records
  • Detecting duplicates and assessing data quality
  • Calculating performance metrics such as precision, recall, and F1
  • Generating visualizations and PRISMA flow diagram data
  • Exporting results and building reproducible data packages

1.1 Case Study: COVID-19 Long-term Effects

For this demonstration, we’ll analyze a search strategy designed to identify literature on the long-term effects of COVID-19, commonly known as “Long COVID.” This topic represents a rapidly evolving area of research that presents typical challenges faced in systematic reviews.

2 Getting Started

2.1 Required Packages

First, let’s load the required packages for our analysis:

# Load required packages
library(searchAnalyzeR)
library(rentrez)  # For PubMed API access
library(xml2)     # For XML parsing
library(dplyr)
library(ggplot2)
library(lubridate)

cat("=== searchAnalyzeR: Real PubMed Search Example ===\n")
#> === searchAnalyzeR: Real PubMed Search Example ===
cat("Topic: Long-term effects of COVID-19 (Long COVID)\n")
#> Topic: Long-term effects of COVID-19 (Long COVID)
cat("Objective: Demonstrate search strategy analysis with real data\n\n")
#> Objective: Demonstrate search strategy analysis with real data

2.2 Defining the Search Strategy

A well-defined search strategy is crucial for systematic reviews. Here we define our search parameters including terms, databases, date ranges, and filters:

# Define our search strategy
search_strategy <- list(
  terms = c(
    "long covid",
    "post-covid syndrome",
    "covid-19 sequelae",
    "post-acute covid-19",
    "persistent covid symptoms"
  ),
  databases = c("PubMed"),
  date_range = as.Date(c("2020-01-01", "2024-12-31")),
  filters = list(
    language = "English",
    article_types = c("Journal Article", "Review", "Clinical Trial")
  ),
  search_date = Sys.time()
)

cat("Search Strategy:\n")
#> Search Strategy:
cat("Terms:", paste(search_strategy$terms, collapse = " OR "), "\n")
#> Terms: long covid OR post-covid syndrome OR covid-19 sequelae OR post-acute covid-19 OR persistent covid symptoms
cat("Date range:", paste(search_strategy$date_range, collapse = " to "), "\n\n")
#> Date range: 2020-01-01 to 2024-12-31
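Before execution, a term list like this is typically collapsed into a single boolean query string. A minimal sketch in base R is shown below; the "[tiab]" (title/abstract) field tag is an illustrative assumption, not something taken from the package, so adapt it to whatever syntax your search function expects:

```r
# Sketch: collapse the term list into one OR-joined query string.
# The "[tiab]" field tag is an assumption for illustration only.
terms <- c(
  "long covid", "post-covid syndrome", "covid-19 sequelae",
  "post-acute covid-19", "persistent covid symptoms"
)
query <- paste0('"', terms, '"[tiab]', collapse = " OR ")
cat(query, "\n")
#> "long covid"[tiab] OR "post-covid syndrome"[tiab] OR "covid-19 sequelae"[tiab] OR "post-acute covid-19"[tiab] OR "persistent covid symptoms"[tiab]
```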

The search strategy includes:

  • Five Long COVID search terms combined with OR
  • A single database (PubMed)
  • A publication date range of 2020-01-01 to 2024-12-31
  • Filters restricting results to English-language journal articles, reviews, and clinical trials

4 Data Quality Assessment

4.1 Duplicate Detection

Duplicate detection is critical in systematic reviews, especially when searching multiple databases. The package provides sophisticated algorithms for identifying duplicates:

# Detect and remove duplicates
cat("Detecting duplicates...\n")
#> Detecting duplicates...
dedup_results <- detect_dupes(standardized_results, method = "exact")

cat("Duplicate detection complete:\n")
#> Duplicate detection complete:
cat("- Total articles:", nrow(dedup_results), "\n")
#> - Total articles: 150
cat("- Unique articles:", sum(!dedup_results$duplicate), "\n")
#> - Unique articles: 150
cat("- Duplicates found:", sum(dedup_results$duplicate), "\n\n")
#> - Duplicates found: 0

The detect_dupes() function offers several matching methods; this example uses method = "exact", which flags records whose identifying fields match exactly.
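To make the idea concrete, here is a minimal base-R sketch of exact duplicate detection on normalized titles. This is independent of detect_dupes() — the package's own implementation may normalize and compare differently:

```r
# Sketch of exact duplicate detection on normalized titles (not the
# package's implementation; for illustration only).
flag_exact_dupes <- function(df) {
  key <- tolower(trimws(df$title))   # normalize case and outer whitespace
  df$duplicate <- duplicated(key)    # TRUE for 2nd+ occurrence of each key
  df
}

demo <- data.frame(
  id    = c("PMID:1", "PMID:2", "PMID:3"),
  title = c("Long COVID outcomes", "long covid outcomes ", "Another study"),
  stringsAsFactors = FALSE
)
flag_exact_dupes(demo)$duplicate
#> [1] FALSE  TRUE FALSE
```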

4.2 Search Statistics

Basic statistics help assess the overall quality of the search results:

# Calculate search statistics
search_stats <- calc_search_stats(dedup_results)
cat("Search Statistics:\n")
#> Search Statistics:
cat("- Date range:", paste(search_stats$date_range, collapse = " to "), "\n")
#> - Date range: 2024-01-01 to 2025-05-01
cat("- Missing abstracts:", search_stats$missing_abstracts, "\n")
#> - Missing abstracts: 0
cat("- Missing dates:", search_stats$missing_dates, "\n\n")
#> - Missing dates: 0

5 Creating a Gold Standard

5.1 Demonstration Gold Standard

For performance evaluation, we need a “gold standard” of known relevant articles. In a real systematic review, this would be your manually identified relevant articles. For this demonstration, we create a simplified gold standard:

# Create a gold standard for demonstration
# In a real systematic review, this would be your known relevant articles
# For this example, we'll identify articles that contain key terms in titles
cat("Creating demonstration gold standard...\n")
#> Creating demonstration gold standard...
long_covid_terms <- c("long covid", "post-covid", "post-acute covid", "persistent covid", "covid sequelae")
pattern <- paste(long_covid_terms, collapse = "|")

gold_standard_ids <- dedup_results %>%
  filter(!duplicate) %>%
  filter(grepl(pattern, tolower(title))) %>%
  pull(id)

cat("Gold standard created with", length(gold_standard_ids), "highly relevant articles\n\n")
#> Gold standard created with 83 highly relevant articles

Note: In practice, your gold standard would be created through:

  • Manual screening of records by two or more independent reviewers
  • Included studies from a previously published systematic review on the topic
  • Key articles identified in advance by subject-matter experts

6 Performance Analysis

6.1 Initializing the SearchAnalyzer

The SearchAnalyzer class provides comprehensive tools for evaluating search performance:

# Initialize SearchAnalyzer with real data
cat("Initializing SearchAnalyzer...\n")
#> Initializing SearchAnalyzer...
analyzer <- SearchAnalyzer$new(
  search_results = filter(dedup_results, !duplicate),
  gold_standard = gold_standard_ids,
  search_strategy = search_strategy
)

6.2 Calculating Performance Metrics

The analyzer calculates a comprehensive set of performance metrics:

# Calculate comprehensive metrics
cat("Calculating performance metrics...\n")
#> Calculating performance metrics...
metrics <- analyzer$calculate_metrics()

# Display key metrics
cat("\n=== SEARCH PERFORMANCE METRICS ===\n")
#> 
#> === SEARCH PERFORMANCE METRICS ===
if (!is.null(metrics$precision_recall$precision)) {
  cat("Precision:", round(metrics$precision_recall$precision, 3), "\n")
  cat("Recall:", round(metrics$precision_recall$recall, 3), "\n")
  cat("F1 Score:", round(metrics$precision_recall$f1_score, 3), "\n")
  cat("Number Needed to Read:", round(metrics$precision_recall$number_needed_to_read, 1), "\n")
}
#> Precision: 0.553 
#> Recall: 1 
#> F1 Score: 0.712 
#> Number Needed to Read: 1.8

cat("\n=== BASIC METRICS ===\n")
#> 
#> === BASIC METRICS ===
cat("Total Records:", metrics$basic$total_records, "\n")
#> Total Records: 150
cat("Unique Records:", metrics$basic$unique_records, "\n")
#> Unique Records: 150
cat("Duplicates:", metrics$basic$duplicates, "\n")
#> Duplicates: 0
cat("Sources:", metrics$basic$sources, "\n")
#> Sources: 98

6.2.1 Understanding the Metrics

  • Precision: Proportion of retrieved articles that are relevant
  • Recall: Proportion of relevant articles that are retrieved
  • F1 Score: Harmonic mean of precision and recall
  • Number Needed to Read: Average number of articles to screen to find one relevant article
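These four metrics follow directly from the confusion counts. The sketch below reproduces the values reported above from the raw numbers in this example (150 unique records retrieved, an 83-article gold standard, all 83 retrieved); it is not the package's internal code:

```r
# Sketch: the reported metrics from raw counts (not analyzer internals).
retrieved <- 150  # unique records retrieved by the search
relevant  <- 83   # size of the gold standard
tp        <- 83   # gold-standard articles that were retrieved

precision <- tp / retrieved
recall    <- tp / relevant
f1        <- 2 * precision * recall / (precision + recall)
nnr       <- 1 / precision  # articles screened per relevant article found

round(c(precision = precision, recall = recall, f1 = f1, nnr = nnr), 3)
#> precision    recall        f1       nnr
#>     0.553     1.000     0.712     1.807
```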

7 Visualization and Reporting

7.1 Performance Visualizations

The package generates publication-ready visualizations to assess search performance:

# Generate visualizations
cat("\nGenerating visualizations...\n")
#> 
#> Generating visualizations...

# Overview plot
overview_plot <- analyzer$visualize_performance("overview")
print(overview_plot)


# Temporal distribution plot
temporal_plot <- analyzer$visualize_performance("temporal")
print(temporal_plot)


# Precision-recall curve (if gold standard available)
if (length(gold_standard_ids) > 0) {
  pr_plot <- analyzer$visualize_performance("precision_recall")
  print(pr_plot)
}

7.2 PRISMA Flow Diagram

The package can generate data for PRISMA flow diagrams, essential for systematic review reporting:

# Generate PRISMA flow diagram data
cat("\nCreating PRISMA flow data...\n")
#> 
#> Creating PRISMA flow data...
screening_data <- data.frame(
  id = dedup_results$id[!dedup_results$duplicate],
  identified = TRUE,
  duplicate = FALSE,
  title_abstract_screened = TRUE,
  full_text_eligible = runif(sum(!dedup_results$duplicate)) > 0.7,  # Simulate screening
  included = runif(sum(!dedup_results$duplicate)) > 0.85,  # Simulate final inclusion
  excluded_title_abstract = runif(sum(!dedup_results$duplicate)) > 0.3,
  excluded_full_text = runif(sum(!dedup_results$duplicate)) > 0.15
)
# Generate PRISMA diagram
reporter <- PRISMAReporter$new()
prisma_plot <- reporter$generate_prisma_diagram(screening_data)
print(prisma_plot)

8 Data Export and Sharing

8.1 Multiple Export Formats

The package supports exporting results in various formats commonly used in systematic reviews:

# Export results in multiple formats
cat("\nExporting results...\n")
#> 
#> Exporting results...
output_dir <- tempdir()
export_files <- export_results(
  search_results = filter(dedup_results, !duplicate),
  file_path = file.path(output_dir, "covid_long_term_search"),
  formats = c("csv", "xlsx", "ris"),
  include_metadata = TRUE
)

cat("Files exported:\n")
#> Files exported:
for (file in export_files) {
  cat("-", file, "\n")
}
#> - C:\Users\chaoliu\AppData\Local\Temp\RtmpM59Tem/covid_long_term_search.csv 
#> - C:\Users\chaoliu\AppData\Local\Temp\RtmpM59Tem/covid_long_term_search.xlsx 
#> - C:\Users\chaoliu\AppData\Local\Temp\RtmpM59Tem/covid_long_term_search.ris

8.2 Metrics Export

Performance metrics can also be exported for further analysis or reporting:

# Export metrics
metrics_file <- export_metrics(
  metrics = metrics,
  file_path = file.path(output_dir, "search_metrics.xlsx"),
  format = "xlsx"
)
cat("- Metrics exported to:", metrics_file, "\n")
#> - Metrics exported to: C:\Users\chaoliu\AppData\Local\Temp\RtmpM59Tem/search_metrics.xlsx

8.3 Comprehensive Data Package

For reproducibility, create a complete data package containing all analysis components:

# Create a complete data package
cat("\nCreating comprehensive data package...\n")
#> 
#> Creating comprehensive data package...
package_dir <- create_data_package(
  search_results = filter(dedup_results, !duplicate),
  analysis_results = list(
    metrics = metrics,
    search_strategy = search_strategy,
    screening_data = screening_data
  ),
  output_dir = output_dir,
  package_name = "covid_long_term_systematic_review"
)

cat("Data package created at:", package_dir, "\n")
#> Data package created at: C:\Users\chaoliu\AppData\Local\Temp\RtmpM59Tem/covid_long_term_systematic_review

9 Advanced Analysis

9.1 Benchmark Validation

The package includes tools for validating search strategies against established benchmarks:

# Demonstrate benchmark validation (simplified)
cat("\nDemonstrating benchmark validation...\n")
#> 
#> Demonstrating benchmark validation...
validator <- BenchmarkValidator$new()

# Add our search as a custom benchmark
validator$add_benchmark(
  name = "covid_long_term",
  corpus = filter(dedup_results, !duplicate),
  relevant_ids = gold_standard_ids
)

# Validate the strategy
validation_results <- validator$validate_strategy(
  search_strategy = search_strategy,
  benchmark_name = "covid_long_term"
)

cat("Validation Results:\n")
#> Validation Results:
cat("- Precision:", round(validation_results$precision, 3), "\n")
#> - Precision: 0.696
cat("- Recall:", round(validation_results$recall, 3), "\n")
#> - Recall: 0.855
cat("- F1 Score:", round(validation_results$f1_score, 3), "\n")
#> - F1 Score: 0.768

9.2 Text Similarity Analysis

Analyze how well the retrieved abstracts match the search terms:

# Text similarity analysis on abstracts
cat("\nAnalyzing abstract similarity to search terms...\n")
#> 
#> Analyzing abstract similarity to search terms...
search_term_text <- paste(search_strategy$terms, collapse = " ")

similarity_scores <- sapply(dedup_results$abstract[!dedup_results$duplicate], function(abstract) {
  if (is.na(abstract) || abstract == "") return(0)
  calc_text_sim(search_term_text, abstract, method = "jaccard")
})

cat("Average abstract similarity to search terms:", round(mean(similarity_scores, na.rm = TRUE), 3), "\n")
#> Average abstract similarity to search terms: 0.013
cat("Abstracts with high similarity (>0.1):", sum(similarity_scores > 0.1, na.rm = TRUE), "\n")
#> Abstracts with high similarity (>0.1): 0
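The low average (0.013) is expected with Jaccard similarity: each abstract contains many words beyond the short search-term list, so the union of tokens dwarfs the intersection. A minimal sketch of word-set Jaccard similarity makes this visible; calc_text_sim() may tokenize differently, but the principle is the same:

```r
# Sketch of Jaccard similarity on word sets: |A ∩ B| / |A ∪ B|.
# calc_text_sim() may tokenize differently; this is for intuition only.
jaccard_words <- function(a, b) {
  ta <- unique(strsplit(tolower(a), "\\W+")[[1]])
  tb <- unique(strsplit(tolower(b), "\\W+")[[1]])
  length(intersect(ta, tb)) / length(union(ta, tb))
}

jaccard_words("long covid symptoms",
              "persistent long covid symptoms in adults")
#> [1] 0.5
```

A short phrase shares half its vocabulary here and scores only 0.5; a 200-word abstract against a 12-word term list will score far lower, which is why a threshold like 0.1 is rarely exceeded.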

9.3 Term Effectiveness Analysis

Evaluate which search terms are most effective:

# Analyze term effectiveness
cat("\nAnalyzing individual term effectiveness...\n")
#> 
#> Analyzing individual term effectiveness...
term_analysis <- term_effectiveness(
  terms = search_strategy$terms,
  search_results = filter(dedup_results, !duplicate),
  gold_standard = gold_standard_ids
)

print(term_analysis)
#> Term Effectiveness Analysis
#> ==========================
#> Search Results: 150 articles
#> Gold Standard: 83 relevant articles
#> Fields Analyzed: title, abstract 
#> 
#>                       term articles_with_term relevant_with_term precision
#>                 long covid                 86                 60     0.698
#>        post-covid syndrome                  5                  5     1.000
#>          covid-19 sequelae                  5                  1     0.200
#>        post-acute covid-19                 15                 11     0.733
#>  persistent covid symptoms                  0                  0     0.000
#>  coverage
#>     0.723
#>     0.060
#>     0.012
#>     0.133
#>     0.000
# Calculate term effectiveness scores
term_scores <- calc_tes(term_analysis)
cat("\nTerm Effectiveness Scores (TES):\n")
#> 
#> Term Effectiveness Scores (TES):
print(term_scores[order(term_scores$tes, decreasing = TRUE), ])
#> Term Effectiveness Analysis
#> ==========================
#> Search Results: 150 articles
#> Gold Standard: 83 relevant articles
#> Fields Analyzed: title, abstract 
#> 
#>                       term articles_with_term relevant_with_term precision
#>                 long covid                 86                 60     0.698
#>        post-acute covid-19                 15                 11     0.733
#>        post-covid syndrome                  5                  5     1.000
#>          covid-19 sequelae                  5                  1     0.200
#>  persistent covid symptoms                  0                  0     0.000
#>  coverage        tes
#>     0.723 0.71005917
#>     0.133 0.22448980
#>     0.060 0.11363636
#>     0.012 0.02272727
#>     0.000 0.00000000
# Find top performing terms
top_terms <- find_top_terms(term_analysis, n = 3, plot = TRUE, plot_type = "precision_coverage")
cat("\nTop 3 performing terms:", paste(top_terms$terms, collapse = ", "), "\n")
#> 
#> Top 3 performing terms: long covid, post-acute covid-19, post-covid syndrome
if (!is.null(top_terms$plot)) {
  print(top_terms$plot)
}
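The printed TES values are consistent with the harmonic mean of each term's precision and coverage — in effect an F1 score computed against the gold standard. The sketch below reproduces the TES column from the counts in the table above; this is an inferred formula checked against the printed output, not the package's documented internals:

```r
# Sketch: TES reproduced as the harmonic mean of per-term precision and
# coverage, using the counts printed in the term-effectiveness table.
# Formula inferred from the output, not taken from package internals.
articles <- c(86, 15, 5, 5, 0)   # articles_with_term
relevant <- c(60, 11, 5, 1, 0)   # relevant_with_term
gold     <- 83                   # gold-standard size

precision <- ifelse(articles > 0, relevant / articles, 0)
coverage  <- relevant / gold
tes <- ifelse(precision + coverage == 0, 0,
              2 * precision * coverage / (precision + coverage))

round(tes, 3)
#> [1] 0.710 0.224 0.114 0.023 0.000
```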

10 Results Interpretation and Recommendations

10.1 Automated Recommendations

Based on the calculated metrics, the package can provide automated recommendations:

# Final summary and recommendations
cat("\n=== FINAL SUMMARY AND RECOMMENDATIONS ===\n")
#> 
#> === FINAL SUMMARY AND RECOMMENDATIONS ===
cat("Search Topic: Long-term effects of COVID-19\n")
#> Search Topic: Long-term effects of COVID-19
cat("Articles Retrieved:", sum(!dedup_results$duplicate), "\n")
#> Articles Retrieved: 150
cat("Search Date Range:", paste(search_strategy$date_range, collapse = " to "), "\n")
#> Search Date Range: 2020-01-01 to 2024-12-31

if (!is.null(metrics$precision_recall$precision)) {
  cat("Search Precision:", round(metrics$precision_recall$precision, 3), "\n")

  if (metrics$precision_recall$precision < 0.1) {
    cat("RECOMMENDATION: Low precision suggests search may be too broad. Consider:\n")
    cat("- Adding more specific terms\n")
    cat("- Using MeSH terms\n")
    cat("- Adding study type filters\n")
  } else if (metrics$precision_recall$precision > 0.5) {
    cat("RECOMMENDATION: High precision suggests good specificity. Consider:\n")
    cat("- Broadening search if recall needs improvement\n")
    cat("- Adding synonyms or related terms\n")
  }
}
#> Search Precision: 0.553 
#> RECOMMENDATION: High precision suggests good specificity. Consider:
#> - Broadening search if recall needs improvement
#> - Adding synonyms or related terms

10.2 Sample Retrieved Articles

Let’s examine some of the retrieved articles to understand the search results:

# Show some example retrieved articles
cat("\n=== SAMPLE RETRIEVED ARTICLES ===\n")
#> 
#> === SAMPLE RETRIEVED ARTICLES ===
sample_articles <- filter(dedup_results, !duplicate) %>%
  arrange(desc(date)) %>%
  head(3)

for (i in 1:nrow(sample_articles)) {
  article <- sample_articles[i, ]
  cat("\n", i, ". ", article$title, "\n", sep = "")
  cat("   Journal:", article$source, "\n")
  cat("   Date:", as.character(article$date), "\n")
  cat("   PMID:", gsub("PMID:", "", article$id), "\n")
  cat("   Abstract:", substr(article$abstract, 1, 200), "...\n")
}
#> 
#> 1. Rates, Risk Factors and Outcomes of Complications After COVID-19 in Children.
#>    Journal: PubMed: The Pediatric infectious disease journal 
#>    Date: 2025-05-01 
#>    PMID: 40232883 
#>    Abstract: Coronavirus disease 2019 (COVID-19) can lead to various complications, including multisystem inflammatory syndrome in children (MIS-C) and post-COVID-19 conditions (long COVID). This study aimed to de ...
#> 
#> 2. Considerations for Long COVID Rehabilitation in Women.
#>    Journal: PubMed: Physical medicine and rehabilitation clinics of North America 
#>    Date: 2025-05-01 
#>    PMID: 40210368 
#>    Abstract: The coronavirus disease 2019 (COVID-19) pandemic has given rise to long COVID, a prolonged manifestation of severe acute respiratory syndrome coronavirus 2 infection, which presents with varied sympto ...
#> 
#> 3. Self-Assembly of Human Fibrinogen into Microclot-Mimicking Antifibrinolytic Amyloid Fibrinogen Particles.
#>    Journal: PubMed: ACS applied bio materials 
#>    Date: 2025-01-20 
#>    PMID: 39723824 
#>    Abstract: Recent clinical studies have highlighted the presence of microclots in the form of amyloid fibrinogen particles (AFPs) in plasma samples from Long COVID patients. However, the clinical significance of ...

11 Summary and Next Steps

11.1 What We’ve Accomplished

This vignette demonstrated a complete workflow using the searchAnalyzeR package:

cat("\n=== ANALYSIS COMPLETE ===\n")
#> 
#> === ANALYSIS COMPLETE ===
cat("This example demonstrated:\n")
#> This example demonstrated:
cat("1. Real PubMed search execution using search_pubmed()\n")
#> 1. Real PubMed search execution using search_pubmed()
cat("2. Data standardization and deduplication\n")
#> 2. Data standardization and deduplication
cat("3. Performance metric calculation\n")
#> 3. Performance metric calculation
cat("4. Visualization generation\n")
#> 4. Visualization generation
cat("5. Multi-format export capabilities\n")
#> 5. Multi-format export capabilities
cat("6. PRISMA diagram creation\n")
#> 6. PRISMA diagram creation
cat("7. Benchmark validation\n")
#> 7. Benchmark validation
cat("8. Term effectiveness analysis\n")
#> 8. Term effectiveness analysis
cat("9. Comprehensive reporting\n")
#> 9. Comprehensive reporting
cat("\nAll outputs saved to:", output_dir, "\n")
#> 
#> All outputs saved to: C:\Users\chaoliu\AppData\Local\Temp\RtmpM59Tem

11.2 Files Generated

The analysis generates numerous output files for different purposes:

# Clean up and provide final file locations
list.files(output_dir, pattern = "covid", full.names = TRUE, recursive = TRUE)
#> [1] "C:\\Users\\chaoliu\\AppData\\Local\\Temp\\RtmpM59Tem/covid_long_term_search.csv"    
#> [2] "C:\\Users\\chaoliu\\AppData\\Local\\Temp\\RtmpM59Tem/covid_long_term_search.ris"    
#> [3] "C:\\Users\\chaoliu\\AppData\\Local\\Temp\\RtmpM59Tem/covid_long_term_search.xlsx"   
#> [4] "C:\\Users\\chaoliu\\AppData\\Local\\Temp\\RtmpM59Tem/strategy_A_broad_covid.csv"    
#> [5] "C:\\Users\\chaoliu\\AppData\\Local\\Temp\\RtmpM59Tem/strategy_A_broad_covid.xlsx"   
#> [6] "C:\\Users\\chaoliu\\AppData\\Local\\Temp\\RtmpM59Tem/strategy_B_targeted_covid.csv" 
#> [7] "C:\\Users\\chaoliu\\AppData\\Local\\Temp\\RtmpM59Tem/strategy_B_targeted_covid.xlsx"

11.3 Next Steps

After completing this analysis, typical next steps would include:

  1. Refining the search strategy based on performance metrics and term effectiveness analysis
  2. Expanding to multiple databases using similar workflows
  3. Implementing the refined strategy in your systematic review protocol
  4. Using the exported data for screening and data extraction phases
  5. Incorporating visualizations into your systematic review protocol or publication

11.4 Additional Resources

For more advanced features and customization options, consult:

  • The help pages for the functions used here (e.g., ?SearchAnalyzer, ?detect_dupes, ?term_effectiveness)
  • The package-level documentation via help(package = "searchAnalyzeR")
  • The other vignettes distributed with the package

This comprehensive workflow demonstrates how searchAnalyzeR can streamline and enhance the systematic review search process, providing objective metrics and visualizations to support evidence-based search strategy development and optimization.