How it Works
The purpose of this project is to extract relevant sections of an input document. These are then appended to the user’s prompt.
First, text is extracted from the input document. Each sentence is then marked with stop-gaps, drawing on a limited, predefined list of words and the roles those words play in standard English. The word types accounted for are: determiners, prepositions, conjunctions, modifiers, auxiliary words, and compositional words.
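The tagging step can be illustrated with a minimal sketch. It assumes a stop-gap is a function word labeled with its role. The word lists and the name tag_tokens are hypothetical, not taken from the project. Modifiers and compositional words are left out because the text does not define which words belong to them.

```python
import re

# Hypothetical, deliberately small word lists. The project's real lists
# are not shown in the text, so these are illustrative assumptions.
FUNCTION_WORDS = {
    "determiner": {"the", "a", "an", "this", "that", "these", "those"},
    "preposition": {"of", "in", "on", "at", "by", "for", "with", "from", "to"},
    "conjunction": {"and", "or", "but", "nor", "so", "yet"},
    "auxiliary": {"is", "are", "was", "were", "be", "been",
                  "has", "have", "had", "will", "can"},
}


def tag_tokens(sentence):
    """Return (token, role) pairs; role is None for words not in any list."""
    tokens = re.findall(r"[A-Za-z']+", sentence)
    tagged = []
    for token in tokens:
        role = None
        for name, words in FUNCTION_WORDS.items():
            if token.lower() in words:
                role = name
                break
        tagged.append((token, role))
    return tagged


if __name__ == "__main__":
    for token, role in tag_tokens("The pump is in the basement."):
        print(token, role)
```

Words that receive no role (here "pump" and "basement") are the content words; the labeled words around them act as the boundaries that later steps rely on.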
With the stop-gaps in place, the tool isolates the first noun phrase or prepositional phrase of each sentence. This predicts which part of the sentence contains the subject without having to identify the subject with certainty, so the tool can determine what a sentence is about without neural models that must learn domain-specific language. The approach also adds flexibility: it catches cases where a query word modifies the subject of a sentence, but it does not look so deep that it matches sentences where the word is only a minor point.
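One way to picture the phrase isolation is the heuristic below. It is a sketch under assumptions, not the project's actual rule: the leading phrase is taken to end at the first auxiliary word, conjunction, or punctuation mark, or at a preposition that is not the first word. The word sets and the name leading_phrase are hypothetical.

```python
import re

# Hypothetical word sets, chosen only for illustration.
PREPOSITIONS = {"of", "in", "on", "at", "by", "for", "with", "from", "to"}
BOUNDARIES = {"and", "or", "but", "is", "are", "was", "were",
              "has", "have", "had", "will", "can"}
PUNCTUATION = {",", ";", ":"}


def leading_phrase(sentence):
    """Return the words before the first boundary in the sentence."""
    tokens = re.findall(r"[A-Za-z']+|[,;:]", sentence)
    phrase = []
    for index, token in enumerate(tokens):
        word = token.lower()
        # Auxiliaries, conjunctions, and punctuation always end the phrase.
        if token in PUNCTUATION or word in BOUNDARIES:
            break
        # A preposition ends the phrase unless it opens the sentence,
        # in which case it starts a prepositional phrase.
        if word in PREPOSITIONS and index > 0:
            break
        phrase.append(token)
    return " ".join(phrase)


if __name__ == "__main__":
    print(leading_phrase("The hydraulic pump is located in the basement."))
    print(leading_phrase("In the basement, the pump runs daily."))
```

Because the heuristic never decides which word is the subject, a sentence with no listed boundary word is returned whole. That matches the stated goal of predicting where the subject lies rather than defining it.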
Once stop-gaps have been added, only sentences that may be relevant to the query are analyzed. The full text of each relevant sentence is then appended to the query as context for the LLM.
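The final step might look like the sketch below. It assumes that a sentence counts as relevant when its leading phrase shares at least one non-trivial word with the query. That is one reading of the description, not a confirmed rule. The names build_prompt, first_words, and IGNORED are hypothetical, and first_words is only a stand-in for the real phrase-isolation step.

```python
import re

# Hypothetical list of words that should not count as a match.
IGNORED = {"the", "a", "an", "of", "in", "on", "is", "are",
           "what", "how", "does", "do"}


def content_words(text):
    """Lowercased words in text, minus the ignored ones."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in IGNORED}


def first_words(sentence, count=4):
    # Stand-in for phrase isolation: just the first few words.
    return " ".join(sentence.split()[:count])


def build_prompt(query, sentences, leading_phrase=first_words):
    """Append every sentence whose leading phrase overlaps the query."""
    query_words = content_words(query)
    relevant = [s for s in sentences
                if query_words & content_words(leading_phrase(s))]
    if not relevant:
        return query
    return query + "\n\nContext:\n" + "\n".join(relevant)


if __name__ == "__main__":
    document = [
        "The hydraulic pump is located in the basement.",
        "Visitors must sign in at the front desk.",
        "Maintenance of the pump happens monthly.",
    ]
    print(build_prompt("Where is the pump?", document))
```

Matching happens on the leading phrase only, but the whole sentence is appended, so the LLM receives complete context for each match.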