Processing source documents in Docs Agent

This README describes the steps performed by the two Python scripts that process source documents into plain text chunks and generate embeddings for the vector database: files_to_plain_text.py and populate_vector_database.py.


Figure 1. Docs Agent's pre-processing flow from source documents to the vector database.

Note: The markdown_to_plain_text.py script is deprecated in favor of the files_to_plain_text.py script.

Docs Agent chunking technique example

The files_to_plain_text.py script splits documents into smaller chunks based on Markdown headings (#, ##, and ###).

For example, consider the following Markdown page:

# Page title

This is the introduction paragraph of this page.

## Section 1

This is the paragraph of section 1.

### Sub-section 1.1

This is the paragraph of sub-section 1.1.

### Sub-section 1.2

This is the paragraph of sub-section 1.2.

## Section 2

This is the paragraph of section 2.

This example Markdown page is split into the following five chunks:

Chunk 1:

# Page title

This is the introduction paragraph of this page.

Chunk 2:

## Section 1

This is the paragraph of section 1.

Chunk 3:

### Sub-section 1.1

This is the paragraph of sub-section 1.1.

Chunk 4:

### Sub-section 1.2

This is the paragraph of sub-section 1.2.

Chunk 5:

## Section 2

This is the paragraph of section 2.

Additionally, because of the token size limit of embedding models, the script recursively splits these chunks until each chunk is smaller than 5000 bytes (characters).
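
The heading-based split and the recursive size limit can be sketched in Python as follows (a minimal illustration with hypothetical function names, not the actual files_to_plain_text.py implementation):

```python
import re

CHUNK_LIMIT = 5000  # bytes, matching the embedding model's size limit

def split_by_headings(markdown: str) -> list[str]:
    """Split a Markdown document into chunks at #, ##, and ### headings."""
    # A heading line starts with 1-3 '#' characters followed by a space.
    parts = re.split(r"(?m)^(?=#{1,3} )", markdown)
    return [part.strip() for part in parts if part.strip()]

def split_to_size(chunk: str, limit: int = CHUNK_LIMIT) -> list[str]:
    """Recursively split a chunk (on paragraph boundaries) until it fits."""
    if len(chunk.encode("utf-8")) <= limit:
        return [chunk]
    paragraphs = chunk.split("\n\n")
    if len(paragraphs) == 1:
        # No paragraph boundary left; fall back to a hard character split.
        mid = len(chunk) // 2
        return split_to_size(chunk[:mid], limit) + split_to_size(chunk[mid:], limit)
    mid = len(paragraphs) // 2
    first = "\n\n".join(paragraphs[:mid])
    second = "\n\n".join(paragraphs[mid:])
    return split_to_size(first, limit) + split_to_size(second, limit)
```

Applied to the example page above, split_by_headings() returns five chunks, one per heading.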

Steps in the files_to_plain_text.py script

In the default setting, when processing Markdown files to plain text using the files_to_plain_text.py script, the following events take place:

  1. Read the configuration file (config.yaml) to identify input and output directories.
  2. Construct an array of input sources (which are the path entries).
  3. For each input source, do the following:
    1. Extract all input fields (path, url_prefix, and more).
    2. Call the process_files_from_input() method using these input fields.
    3. For each sub-path in the input directory and for each file in these directories:
      1. Check if the file extension is .md (that is, a Markdown file).
      2. Construct an output directory that preserves the path structure.
      3. Read the content of the Markdown file.
      4. Call the process_page_and_section_titles() method to reformat the page and section titles.
        1. Process Front Matter in Markdown.
        2. Detect (or construct) the title of the page.
        3. Detect Markdown headings (#, ##, and ###).
        4. Convert Markdown headings into plain English (to preserve context when generating embeddings).
      5. Call the process_document_into_sections() method to split the content into small text chunks.
        1. Create a new empty array.
        2. Divide the content using Markdown headings (#, ##, and ###).
        3. Insert each chunk into the array and simplify the heading to # (title).
        4. Return the array.
      6. For each text chunk, do the following:
        1. Call the markdown_to_text() method to clean up Markdown and HTML syntax.
          1. Remove <!-- --> lines in Markdown.
          2. Convert Markdown to HTML (which makes plain text extraction easy).
          3. Use BeautifulSoup to extract plain text from the HTML.
          4. Remove [][] in Markdown.
          5. Remove {: } in Markdown.
          6. Remove {. } in Markdown.
          7. Remove a single line sh in Markdown.
          8. Remove code text and blocks.
          9. Return the plain text.
        2. Construct the text chunk’s metadata (including URL) for the file_index.json file.
        3. Write the text chunk into a file in the output directory.
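
The markdown_to_text() clean-up can be sketched with regular expressions alone. This is a simplified stand-in: the real script converts Markdown to HTML and extracts plain text with BeautifulSoup, a step replaced here by plain regexes.

```python
import re

def markdown_to_text(markdown: str) -> str:
    """A regex-only sketch of the Markdown clean-up step."""
    text = markdown
    # Remove <!-- --> comment lines.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    # Remove fenced code blocks.
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    # Remove [][] style reference links, keeping the link text.
    text = re.sub(r"\[([^\]]*)\]\[[^\]]*\]", r"\1", text)
    # Remove {: } and {. } attribute annotations.
    text = re.sub(r"\{[:.][^}]*\}", "", text)
    # Strip heading markers and inline code backticks.
    text = re.sub(r"(?m)^#{1,3} ", "", text)
    text = text.replace("`", "")
    # Collapse runs of blank lines.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```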

Steps in the populate_vector_database.py script

When processing plain text chunks to embeddings using the populate_vector_database.py script, the following events take place:

  1. Read the configuration file (config.yaml) to identify the plain text directory and Chroma settings.
  2. Set up the Gemini API environment.
  3. Select the embeddings model.
  4. Configure the embedding function (including the API call limit).
  5. For each sub-path in the plain text directory and for each file in these directories:
    1. Check if the file extension is .md (that is, a Markdown file).
    2. Read the content of the Markdown file.
    3. Construct the URL of the text chunk’s source.
    4. Read the metadata associated with the text chunk file.
    5. Skip the file if it is larger than 5000 bytes (due to the API limit).
    6. Skip the file if the text chunk is already in the vector database and its checksum hasn’t changed.
    7. Store the text chunk and its metadata in the vector database, which also generates an embedding for the text chunk at the time of insertion.
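
The size and checksum skip checks can be sketched as follows. This is an illustration only: an in-memory dict stands in for the Chroma collection, and the function and field names are hypothetical, not the script's actual API.

```python
import hashlib

SIZE_LIMIT = 5000  # bytes; larger chunks are skipped due to the API limit

def upsert_chunk(store: dict, filename: str, text: str) -> str:
    """Insert a text chunk unless it is too large or unchanged."""
    if len(text.encode("utf-8")) > SIZE_LIMIT:
        return "skipped: too large"
    md_hash = hashlib.md5(text.encode("utf-8")).hexdigest()
    entry = store.get(filename)
    if entry is not None and entry["md_hash"] == md_hash:
        # Same checksum as the stored entry: nothing to re-embed.
        return "skipped: unchanged"
    # In the real script, insertion also generates the embedding.
    store[filename] = {"text": text, "md_hash": md_hash}
    return "added"
```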

Delete chunks process

The process below describes how the delete chunks feature is implemented in the populate_vector_database.py script:

  1. Read all existing entries in the target database.

  2. Read all candidate entries in the file_index.json file (created after running the agent chunk command).

  3. For each entry in the existing entries found in step 1, compare the text_chunk_filename fields (included in the entry's metadata):

    1. If no match is found among the candidate entries from step 2, delete this entry from the database.

    2. If a match is found, compare the md_hash fields: if they are different, delete this entry from the database.
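
The comparison above can be sketched as a small Python function (a minimal illustration with hypothetical names, not the script's actual implementation):

```python
def delete_stale_entries(existing: list[dict], candidates: list[dict]) -> list[str]:
    """Return filenames of database entries that should be deleted."""
    # Index the candidate entries (from file_index.json) by chunk filename.
    by_name = {c["text_chunk_filename"]: c for c in candidates}
    to_delete = []
    for entry in existing:
        candidate = by_name.get(entry["text_chunk_filename"])
        if candidate is None:
            # Not among the candidates: the source chunk no longer exists.
            to_delete.append(entry["text_chunk_filename"])
        elif candidate["md_hash"] != entry["md_hash"]:
            # Content changed: delete so the chunk can be re-added.
            to_delete.append(entry["text_chunk_filename"])
    return to_delete
```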