glossary reorder

Jan Carbonell
2024-02-26 03:08:24 -08:00
parent 9272040160
commit 7a31bb523d
5 changed files with 117 additions and 98 deletions
+4 -4
@@ -1,7 +1,7 @@
# DB Providers
## Local
LOCAL_DB_PATH=your_local_db_path
#LOCAL_DB_PATH=your_local_db_path
## Postgres
## For example setup, see https://supabase.com/dashboard/project/MY_PROJECT/settings/database
@@ -12,9 +12,9 @@ POSTGRES_PORT=your_port
POSTGRES_DBNAME=your_db
## qdrant
#QDRANT_HOST=your_qdrant_host
#QDRANT_PORT=your_qdrant_port
#QDRANT_API_KEY=your_qdrant_api_key
##QDRANT_HOST=your_qdrant_host
##QDRANT_PORT=your_qdrant_port
##QDRANT_API_KEY=your_qdrant_api_key
# LLM Providers
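After this change the template has a mix of active, commented, and doubly commented keys, so it can help to check which provider keys are actually live before copying it to `.env`. The following is a minimal sketch, not part of the commit; `active_provider_keys` is a hypothetical helper name, and the key prefixes are the ones visible in this hunk.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): print the names of database
# provider keys that are active, i.e. not commented out, in an env file.
active_provider_keys() {  # usage: active_provider_keys FILE
    grep -E '^(LOCAL_DB_PATH|POSTGRES_[A-Z_]*|QDRANT_[A-Z_]*)=' "$1" | cut -d= -f1
}

# Demo on a throwaway file that mirrors the state shown in the hunk.
demo=$(mktemp)
cat > "$demo" <<'EOF'
#LOCAL_DB_PATH=your_local_db_path
POSTGRES_PORT=your_port
POSTGRES_DBNAME=your_db
##QDRANT_HOST=your_qdrant_host
EOF
active_provider_keys "$demo"   # prints POSTGRES_PORT and POSTGRES_DBNAME, one per line
rm -f "$demo"
```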
+18 -16
@@ -16,13 +16,14 @@ R2R was conceived to bridge the gap between experimental RAG models and robust,
### Quick Install:
**Install R2R directly using `pip`:**
```bash
# use `'r2r[all]'` to install all required deps
pip install 'r2r[parsing]'
```
```bash
# use `'r2r[all]'` to install all required deps
pip install 'r2r[parsing]'
```
## Links
[Join the Discord server](https://discord.gg/p6KqD2kjtB)
[Read our Docs](https://docs.sciphi.ai/)
@@ -42,15 +43,16 @@ The project includes several basic examples that demonstrate application deploym
```bash
poetry run python -m examples.basic.run_client
```
3. [`run_pdf_chat.py`](examples/pdf_chat/run_demo.py): An example demonstrating upload and chat with a more realistic pdf.
```bash
# Ingest pdf
poetry run python -m examples.pdf_chat.run_demo ingest
# Ask a question
poetry run python -m examples.pdf_chat.run_demo search "What are the key themes of Meditations?"
```
3. [`run_pdf_chat.py`](examples/pdf_chat/run_demo.py): An example demonstrating upload and chat with a more realistic pdf.
```bash
# Ingest pdf
poetry run python -m examples.pdf_chat.run_demo ingest
# Ask a question
poetry run python -m examples.pdf_chat.run_demo search "What are the key themes of Meditations?"
```
```bash
# Ingest pdf
@@ -67,20 +69,20 @@ The project includes several basic examples that demonstrate application deploym
pnpm dev
```
## 60s demo of the examples
[![demo_screenshot](./docs/pages/getting-started/demo_screenshot.png)](https://github.com/SciPhi-AI/R2R/assets/68796651/01fee645-1beb-4096-9e7d-7d0fa01386ea)
### Full Install:
Follow these steps to ensure a smooth setup:
1. **Install Poetry:**
- Before installing the project, make sure you have Poetry on your system. If not, visit the [official Poetry website](https://python-poetry.org/docs/#installation) for installation instructions.
2. **Clone and Install Dependencies:**
- Clone the project repository and navigate to the project directory:
```bash
git clone git@github.com:SciPhi-AI/r2r.git
@@ -95,7 +97,7 @@ Follow these steps to ensure a smooth setup:
3. **Configure Environment Variables:**
- You need to set up cloud provider secrets in your `.env`. At a minimum, you will need an OpenAI key.
- The framework currently supports pgvector and Qdrant with plans to extend coverage.
- The framework currently supports PostgreSQL (locally), pgvector and Qdrant with plans to extend coverage.
- If starting from the example, copy `.env.example` to `.env` to apply your configurations:
```bash
cp .env.example .env
+4 -4
@@ -1,15 +1,15 @@
{
"database": {
"provider": "local",
"database": {
"provider": "pg_vector",
"collection_name": "demo-v1-test"
},
},
"embedding": {
"provider": "openai",
"model": "text-embedding-3-small",
"dimension": 1536,
"batch_size": 32
},
"text_splitter": {
"text_splitter": {
"chunk_size": 512,
"chunk_overlap": 20
},
+11 -12
@@ -2,17 +2,18 @@
This glossary provides concise definitions for key terms related to machine learning (ML), programming in Python, database and vector storage technologies, and concepts central to Retrieval-Augmented Generation (RAG) pipelines.
### Machine Learning (ML) Terms
### Retrieval-Augmented Generation (RAG) Pipeline Terms
- **Adapter Pattern**: Enables interaction between incompatible interfaces by data transformation.
- **Metadata & Session**: Information about data and individual database connections.
- **Adapter Context & Chunking**: Context for data transformation and dividing data for processing.
- **Normalize Embeddings**: Process of scaling vectors to unit norm for consistent similarity metrics.
### Machine Learning Terms
- **Embeddings**: Numerical vectors representing data (text, images) to capture similarity.
- **Sentence Transformers**: Library for dense vector representations of text for semantic search and similarity.
### Programming and Python Concepts
- **Iterable**: Object that can be iterated over (lists, tuples, strings).
- **Generator**: Iterable in Python using `yield` for lazy item generation.
- **Context Manager**: Manages resources (files, database connections) with `with` statement.
### Database and Vector Storage Terms
- **PostgreSQL**: Open-source relational database for extensible SQL operations.
@@ -20,9 +21,7 @@ This glossary provides concise definitions for key terms related to machine lear
- **HNSW & IVFFlat**: Algorithms for efficient nearest neighbor search in vector data.
- **Upsert & Index**: Operations and structures for data insertion/update and speedy retrieval.
### Retrieval-Augmented Generation (RAG) Pipeline Terms
### Programming and Python Concepts
- **Adapter Pattern**: Enables interaction between incompatible interfaces by data transformation.
- **Metadata & Session**: Information about data and individual database connections.
- **Adapter Context & Chunking**: Context for data transformation and dividing data for processing.
- **Normalize Embeddings**: Process of scaling vectors to unit norm for consistent similarity metrics.
- **Generator**: Iterable in Python using `yield` for lazy item generation.
- **Context Manager**: Manages resources (files, database connections) with `with` statement.
+80 -62
@@ -11,27 +11,37 @@ update_env_example() {
local tmp_file=$(mktemp)
# Define patterns to match based on the database choice
local pattern_to_comment=""
local patterns_to_comment=()
local pattern_to_uncomment=""
if [ "$db_choice" = "1" ]; then
# If pg_vector is chosen, comment out QDRANT keys and uncomment PGVECTOR keys
pattern_to_comment="^QDRANT_"
pattern_to_uncomment="^#PGVECTOR_"
elif [ "$db_choice" = "2" ]; then
# If qdrant is chosen, comment out PGVECTOR keys
pattern_to_comment="^POSTGRES_"
fi
# Comment out the lines matching the pattern for the database choice
if [ ! -z "$pattern_to_comment" ]; then
sed "/$pattern_to_comment/s/^/#/" .env.example > "$tmp_file" && mv "$tmp_file" .env.example
fi
case "$db_choice" in
1)
# If Local is chosen, uncomment Local keys and comment out the others
patterns_to_comment=("POSTGRES_" "QDRANT_")
pattern_to_uncomment="LOCAL_DB_PATH"
;;
2)
# If pg_vector is chosen, this option is treated the same as Postgres in this context
patterns_to_comment=("LOCAL_DB_PATH" "QDRANT_")
pattern_to_uncomment="POSTGRES_"
;;
3)
# If qdrant is chosen, comment out Postgres keys and uncomment QDRANT keys
patterns_to_comment=("LOCAL_DB_PATH" "POSTGRES_")
pattern_to_uncomment="QDRANT_"
;;
esac
# Uncomment the lines matching the pattern for the database choice
# Uncomment the lines for the chosen database
if [ ! -z "$pattern_to_uncomment" ]; then
sed "/$pattern_to_uncomment/s/^#//" .env.example > "$tmp_file" && mv "$tmp_file" .env.example
fi
# Comment out the lines for the not chosen databases
for pattern in "${patterns_to_comment[@]}"; do
sed "/$pattern/s/^/#/" .env.example > "$tmp_file" && mv "$tmp_file" .env.example
done
# Handle SERPER_API_KEY based on websearch integration choice
if [ "$integrate_websearch" != "yes" ] && [ "$integrate_websearch" != "y" ] && [ "$integrate_websearch" != "Y" ] && [ "$integrate_websearch" != "1" ]; then
# Comment out SERPER_API_KEY if websearch integration is not chosen
@@ -42,61 +52,64 @@ update_env_example() {
fi
}
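The doubled `##QDRANT_` prefixes in the `.env.example` hunk are consistent with the comment step matching lines that are already commented, since `/$pattern/` is not anchored to the start of the line. A minimal sketch of an anchored toggle that can be run repeatedly is shown below; `comment_keys` and `uncomment_keys` are hypothetical helper names, not functions from this commit.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not in the commit): toggle KEY=... lines by prefix
# without stacking '#' characters when the script is run more than once.

comment_keys() {    # usage: comment_keys PREFIX FILE
    local prefix="$1" file="$2" tmp
    tmp=$(mktemp)
    # Only lines that START with the prefix change, so '#QDRANT_...' is left alone.
    sed "s/^${prefix}/#&/" "$file" > "$tmp" && mv "$tmp" "$file"
}

uncomment_keys() {  # usage: uncomment_keys PREFIX FILE
    local prefix="$1" file="$2" tmp
    tmp=$(mktemp)
    # Remove one or more leading '#' that sit directly in front of the prefix.
    sed "s/^##*\(${prefix}\)/\1/" "$file" > "$tmp" && mv "$tmp" "$file"
}

# Demo on a throwaway file.
demo=$(mktemp)
printf '%s\n' '## qdrant' 'QDRANT_HOST=your_qdrant_host' '#QDRANT_PORT=your_qdrant_port' > "$demo"
comment_keys "QDRANT_" "$demo"
comment_keys "QDRANT_" "$demo"    # the second run changes nothing
cat "$demo"
uncomment_keys "QDRANT_" "$demo"  # both keys become active again
cat "$demo"
rm -f "$demo"
```

The section header `## qdrant` is untouched in both directions because the match is case-sensitive and requires the prefix immediately after the leading `#` characters.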
# Function to update config.json
update_config() {
jq "$1" config.json > config.tmp && mv config.tmp config.json
}
# Function to prompt and read user choice with retry mechanism
# Define the prompt_with_retry function without using name references
prompt_with_retry() {
local prompt_message="$1"
local user_choice_var=$2 # Changed to store the variable name as a string
local attempt=0
local max_attempts=2
while [ $attempt -lt $max_attempts ]; do
local user_choice_var_name=$2 # Store the variable name as a string
while true; do
echo -e "$prompt_message"
read user_input
eval $user_choice_var="'$user_input'" # Indirectly assign the input to the variable
case ${!user_choice_var} in # Use indirect reference to check the value
1|2|y|Y|yes|YES|n|N|no|NO)
return 0
eval $user_choice_var_name="'$user_input'" # Assign the input to the variable indirectly
case $(eval echo \$$user_choice_var_name) in # Use indirect expansion to check the value
1|2|3)
break
;;
*)
let attempt++
echo "Invalid choice. Please try again."
if [ $attempt -eq $max_attempts ]; then
echo "Failed too many times. Exiting."
exit 1
fi
;;
esac
done
}
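The `eval`-based assignment above executes whatever the user types if the input contains a single quote, and the fixed `1|2|3` pattern rejects the free-form answers that later prompts invite ("or type it"). One possible alternative, sketched here under the assumption that bash 3.1 or newer is available, stores the answer with `printf -v` and takes the accepted answers as arguments. `prompt_choice` is a hypothetical name, not a function from this commit.

```shell
#!/usr/bin/env bash
# Hypothetical variant (not in the commit): no eval, caller-supplied choices.
# Local names start with '_' so they are unlikely to shadow the caller's variable.
prompt_choice() {  # usage: prompt_choice MESSAGE VAR_NAME ALLOWED...
    local _message="$1" _var_name="$2" _input _candidate
    shift 2
    while true; do
        printf '%b' "$_message"           # %b expands \n and colour escapes like echo -e
        IFS= read -r _input || return 1   # stop if input ends instead of looping forever
        for _candidate in "$@"; do
            if [ "$_input" = "$_candidate" ]; then
                printf -v "$_var_name" '%s' "$_input"   # assign by name, nothing is executed
                return 0
            fi
        done
        echo "Invalid choice. Please try again."
    done
}

# Non-interactive demo: the first answer is rejected, the second is stored.
prompt_choice "Enter choice [1-3]: " db_choice 1 2 3 <<EOF
7
2
EOF
echo "db_choice=$db_choice"
```

Each call site lists its own valid answers, for example `prompt_choice "$prompt_message" integrate_websearch 1 2`.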
echo "Setting up your R2R configuration..."
echo -e "Default options are displayed in ${GREEN}Green${NC}"
echo -e "\n"
# Define the update_config function to use jq for updating config.json with correct types for numbers
update_config() {
local update_path="$1"
local value="$2"
local is_numeric="$3" # New parameter to check if the value should be treated as a numeric value
# Select vector database provider
prompt_message="Select your vector database provider:\n1) ${GREEN}pg_vector (Supabase)${NC} | 2) qdrant\n\nEnter choice [1-2]: "
prompt_with_retry "$prompt_message" db_choice
if [[ "$is_numeric" == "yes" ]]; then
jq "$update_path = $value" config.json > config.tmp && mv config.tmp config.json
else
jq "$update_path = \"$value\"" config.json > config.tmp && mv config.tmp config.json
fi
}
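The new `update_config` builds the jq filter by interpolating `$value` into the program text, so a value containing a double quote or backslash would produce an invalid or different filter. A hedged alternative is to pass the value through jq's `--arg` (always a string) or `--argjson` (parsed as JSON, so numbers stay numbers). The sketch below assumes `jq` 1.5 or newer on `PATH`; `set_config_value` is a hypothetical name, not a function from this commit.

```shell
#!/usr/bin/env bash
# Hypothetical variant (not in the commit): jq handles the quoting of the value.
set_config_value() {  # usage: set_config_value FILE JQ_PATH VALUE [string|json]
    local file="$1" jq_path="$2" value="$3" kind="${4:-string}" flag="--arg" tmp
    # --argjson parses VALUE as JSON; --arg passes it as an escaped JSON string.
    if [ "$kind" = "json" ]; then
        flag="--argjson"
    fi
    tmp=$(mktemp)
    if jq "$flag" v "$value" "$jq_path = \$v" "$file" > "$tmp"; then
        mv "$tmp" "$file"
    else
        rm -f "$tmp"   # leave the original file untouched when jq rejects the input
        return 1
    fi
}

# Demo on a throwaway file.
cfg=$(mktemp)
echo '{"database": {"provider": "local"}, "embedding": {"dimension": 0}}' > "$cfg"
set_config_value "$cfg" '.database.provider' 'pg_vector'
set_config_value "$cfg" '.embedding.dimension' '1536' json
jq -c . "$cfg"
rm -f "$cfg"
```

With `json`, a non-numeric custom dimension such as `abc` makes jq fail, and the config file is left as it was.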
# Example usage of prompt_with_retry
prompt_message="Select your vector database provider:\n1) ${GREEN}PostgreSQL (Local)${NC} | 2) pg_vector (Supabase) | 3) qdrant\n\nEnter choice [1-3]: "
db_choice=0
prompt_with_retry "$prompt_message" "db_choice"
# Example usage of update_config
# This assumes you have jq installed and config.json is in the current directory
case $db_choice in
1)
update_config '.database.provider = "pg_vector"'
echo "Make sure the vectors extension plugin has been enabled in your PostgreSQL."
#echo -e "Make sure the ${YELLOW}vector${NC} extension plugin has been enabled in ${GREEN}Supabase ${NC} > ${YELLOW}Project > Database > Extensions${NC}."
update_config '.database.provider' 'local' 'no'
echo "Using PostgreSQL (Local) as the default database."
;;
2)
update_config '.database.provider = "qdrant"'
update_config '.database.provider' 'pg_vector' 'no'
echo -e "Make sure the ${YELLOW}vectors${NC} extension plugin has been enabled in ${YELLOW}Supabase > Project > Database > Extensions${NC}."
;;
3)
update_config '.database.provider' 'qdrant' 'no'
;;
esac
# Call update_env_example with the user's database
echo -e "\n"
prompt_message="Do you want to integrate with websearch?\n1) ${GREEN}no${NC} | 2) yes\n\nEnter choice [1-2]: "
prompt_with_retry "$prompt_message" integrate_websearch
integrate_websearch=0
prompt_with_retry "$prompt_message" "integrate_websearch"
case "$integrate_websearch" in
[yY] | [yY][eE][sS] | [1] )
@@ -111,7 +124,7 @@ esac
update_env_example $db_choice $integrate_websearch
# Select embedding provider (OpenAI for now)
update_config '.embedding.provider = "openai"'
update_config '.embedding.provider' 'openai' 'no'
# Select model
echo -e "\n"
@@ -141,17 +154,18 @@ echo -e "\t- Pricing: Approximately 12,500 pages per dollar. Balances cost and p
echo -e "\n"
prompt_message="Enter choice [1-3]: "
prompt_with_retry "$prompt_message" model_choice
model_choice=0
prompt_with_retry "$prompt_message" "model_choice"
case $model_choice in
1)
update_config '.embedding.model = "text-embedding-3-small"'
update_config '.embedding.model' 'text-embedding-3-small' 'no'
;;
2)
update_config '.embedding.model = "text-embedding-3-large"'
update_config '.embedding.model' 'text-embedding-3-large' 'no'
;;
3)
update_config '.embedding.model = "text-embedding-ada-002"'
update_config '.embedding.model' 'text-embedding-ada-002' 'no'
;;
*)
echo "Invalid choice. Exiting."
@@ -163,22 +177,23 @@ echo "Would you like to use the recommended default sizes for the model or speci
echo -e "1) Use ${GREEN}default sizes${NC}"
echo "2) Specify custom values"
prompt_message="Enter choice [1-2]: "
prompt_with_retry "$prompt_message" size_choice
size_choice=0
prompt_with_retry "$prompt_message" "size_choice"
echo -e "\n"
if [ "$size_choice" = "1" ]; then
case $model_choice in
1)
update_config '.embedding.dimension = 1536'
update_config '.embedding.batch_size = 32'
update_config '.embedding.dimension' '1536' 'yes'
update_config '.embedding.batch_size' '32' 'yes'
;;
2)
# text-embedding-3-large returns 3072-dimensional vectors by default
update_config '.embedding.dimension = 3072'
update_config '.embedding.batch_size = 16'
update_config '.embedding.dimension' '3072' 'yes'
update_config '.embedding.batch_size' '16' 'yes'
;;
3)
# text-embedding-ada-002 returns 1536-dimensional vectors
update_config '.embedding.dimension = 1536'
update_config '.embedding.batch_size = 24'
update_config '.embedding.dimension' '1536' 'yes'
update_config '.embedding.batch_size' '24' 'yes'
;;
esac
elif [ "$size_choice" = "2" ]; then
@@ -189,7 +204,8 @@ elif [ "$size_choice" = "2" ]; then
echo "Other) Type custom dimension"
echo -e "\n"
prompt_message="Enter choice [1-3] or type it: "
prompt_with_retry "$prompt_message" dimension_choice
dimension_choice=0
prompt_with_retry "$prompt_message" "dimension_choice"
case $dimension_choice in
1)
@@ -206,7 +222,7 @@ elif [ "$size_choice" = "2" ]; then
exit 1
;;
esac
update_config ".embedding.dimension = $custom_dimension"
update_config ".embedding.dimension" "$custom_dimension" "yes"
echo "Select the batch size (consider processing speed and cost):"
echo "1) 16 - Suitable for high-quality embeddings with slower processing and higher cost."
@@ -215,7 +231,8 @@ elif [ "$size_choice" = "2" ]; then
echo "Other) Type custom batch size"
echo -e "\n"
prompt_message="Enter choice [1-3] or type it: "
prompt_with_retry "$prompt_message" batch_size_choice
batch_size_choice=0
prompt_with_retry "$prompt_message" "batch_size_choice"
case $batch_size_choice in
1)
@@ -232,7 +249,7 @@ elif [ "$size_choice" = "2" ]; then
exit 1
;;
esac
update_config ".embedding.batch_size = $custom_batch_size"
update_config ".embedding.batch_size" "$custom_batch_size" "yes"
fi
echo "Configuration setup is complete."
@@ -250,4 +267,5 @@ for i in {1..100}; do
printf "\r0 %% [%-50s] %3d%%" "$bar" "$i"
done
echo -e "\n"
echo -e "\nFiles updated: ${YELLOW}.env.example${NC} and ${YELLOW}config.json${NC}. For the full list of setup options, modify ${YELLOW}config.json${NC} directly."
echo -e "\nFiles updated: ${YELLOW}.env.example${NC} and ${YELLOW}config.json${NC}."
echo -e "For the full list of setup options, modify ${YELLOW}config.json${NC} directly."