Querying across free text fields such as report bodies using relational methods (e.g. a SQL like clause) is computationally intensive, and performance is very poor on larger data sets (> 100,000 records). Bridge provides a way to search using a free text index built on Apache Lucene and integrated into the SDK.
Basic Use¶
Building a search of the free text index requires a FullTextEntityManager and a query parser. To get a FullTextEntityManager, call Java::HarbingerSdk::DataUtils.getFullTextEntityManager. To create a search query parser, use the Java::HarbingerSdk::Search::search_query method, passing in the name of the index to search (e.g. reportBody) and the search string. This query parser supports common search operators such as boolean syntax.
# A method to search with a given string input and a default limit of 10
def search(string, limit = 10)
  fm = Java::HarbingerSdk::DataUtils.getFullTextEntityManager
  begin
    phrase_query = Java::harbinger.sdk.Search::search_query("reportBody", string)
  rescue Java::OrgApacheLuceneQueryParser::ParseException => e
    # Release the entity manager before bailing out on a bad search string
    fm.close()
    return "Error while parsing search string"
  end
  ftquery = fm.createFullTextQuery(phrase_query)
  # Uncomment if you would like the Lucene score back with the RadReport object.
  # Each result then becomes a [RadReport, score] row instead of a bare RadReport.
  # ftquery.setProjection(Java::org.hibernate.search.jpa.FullTextQuery::THIS, Java::org.hibernate.search.jpa.FullTextQuery::SCORE)
  reports = ftquery.setMaxResults(limit).getResultList()
  fm.close()
  return reports
end
# Usage: search("pneumonia")
#        search("pneumonia AND pneumothorax")
Best Practice - When searching from user input, ensure the application can handle failed parsing. The exception thrown for a parser error is Java::OrgApacheLuceneQueryParser::ParseException.
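A cheap pre-check can catch the most common user mistakes before the string reaches the parser. The sketch below is plain Ruby; balanced_query? is a hypothetical helper name, not part of the SDK, and it only looks for unbalanced double quotes and parentheses. The parser remains the authority, so the rescue for ParseException is still needed.

```ruby
# Hypothetical helper (not part of the SDK): returns true when double quotes
# and parentheses in a search string are balanced.
def balanced_query?(input)
  depth = 0
  in_phrase = false
  escaped = false
  input.each_char do |ch|
    if escaped
      escaped = false            # the character after a backslash is literal
    elsif ch == "\\"
      escaped = true
    elsif ch == '"'
      in_phrase = !in_phrase     # toggle on every unescaped double quote
    elsif !in_phrase && ch == "("
      depth += 1
    elsif !in_phrase && ch == ")"
      depth -= 1
      return false if depth < 0  # closing parenthesis with no opener
    end
  end
  depth.zero? && !in_phrase
end

puts balanced_query?('"lung nodule" AND (chest OR thorax)')
puts balanced_query?('"lung nodule AND chest')
```

A false result is a reason to show the user a friendly message early; a true result does not guarantee that the parser will accept the string.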
The method above returns an ArrayList of RadReport objects. Use the same methods/accessors on these objects as you would on objects loaded from the relational database.
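When the setProjection call with THIS and SCORE is active, each element of the result list is a row holding the RadReport and its Lucene score rather than a bare RadReport. A minimal sketch for unpacking such rows follows; with_scores is a hypothetical helper, and plain Ruby arrays stand in for the Java rows so the snippet runs without the SDK.

```ruby
# Hypothetical helper (not part of the SDK): turn projected rows of
# [report, score] into hashes that are easier to work with in a view.
def with_scores(rows)
  rows.map do |row|
    report, score = row.to_a   # to_a also works on Java arrays under JRuby
    { report: report, score: score.to_f }
  end
end

# Stand-in data: strings instead of RadReport objects.
rows = [["report A", 1.75], ["report B", 0.4]]
with_scores(rows).each { |r| puts "#{r[:report]} scored #{r[:score]}" }
```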
Search syntax¶
A search query is broken up into terms and phrases. A term is a single word such as "lung" or "nodule". A phrase is a group of words surrounded by double quotes such as "lung nodule". Multiple terms can be combined together with operators to form a more complex query.
Tip - The search method created in the example above assumes that the string parameter passed in is already formatted as required by the search query. You may wish to simplify or hide the required syntax in the application user interface to make it more accessible.
Operators¶
Terms can be combined through logic operators. The search query supports AND, +, OR, NOT and - as operators (Note: operators must be ALL CAPS).
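Because lowercase and, or and not are treated as ordinary terms, an application may want to normalize user input first. The following is a sketch in plain Ruby under that assumption; upcase_operators is a hypothetical helper, not an SDK method. It upcases the three operator words only outside quoted phrases.

```ruby
# Hypothetical helper (not part of the SDK): upcase bare and/or/not while
# leaving quoted phrases untouched.
def upcase_operators(input)
  # Splitting on a capture group keeps the quoted phrases in the result.
  input.split(/("[^"]*")/).map do |chunk|
    next chunk if chunk.start_with?('"')
    chunk.gsub(/\b(and|or|not)\b/i) { |op| op.upcase }
  end.join
end

puts upcase_operators('pneumonia and "black and white" or effusion')
# prints: pneumonia AND "black and white" OR effusion
```

The trade-off is that users can no longer search for the literal words "and", "or" or "not" outside a phrase.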
OR Operator¶
The OR operator is the default conjunction operator. This means that if there is no operator between two terms, the OR operator is used. The OR operator links two terms and finds a matching document if either of the terms exists in a document. This is equivalent to a union using sets. The symbol || can be used in place of the word OR.
To search for documents that contain either "lung nodule" or just "lung" use the query "lung nodule" lung or "lung nodule" OR lung.
AND Operator¶
The AND operator matches documents where both terms exist anywhere in the text of a single document. This is equivalent to an intersection using sets. The symbol && can be used in place of the word AND. To search for documents that contain "lung nodule" and "chest" use the query "lung nodule" AND "chest".
+ (plus) Operator¶
The + or required operator ensures that the term after the + symbol exists somewhere in the document. To search for documents that must contain "lung" and may contain "chest" use the query +lung chest.
NOT Operator¶
The NOT operator excludes documents that contain the term after NOT. This is equivalent to a difference using sets. The symbol ! can be used in place of the word NOT. To search for documents that contain "lung nodule" but not "chest" use the query "lung nodule" NOT "chest". Note: The NOT operator cannot be used with a single term. For example, the following search will return no results: NOT "lung nodule"
- (minus) Operator¶
The - or prohibit operator excludes documents that contain the term after the - symbol. To search for documents that contain "lung nodule" but not "chest" use the query "lung nodule" -"chest"
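The + and - operators make it straightforward to assemble a query string from separate form fields, which keeps the syntax out of the user interface. The sketch below is plain Ruby; build_query and its keyword arguments are hypothetical names, not SDK methods. It quotes multi-word entries as phrases and does not escape special characters.

```ruby
# Hypothetical helper (not part of the SDK): build a query string from plain
# lists of words. + marks required entries and - marks excluded entries.
def build_query(required: [], optional: [], excluded: [])
  quote = ->(t) { t.include?(" ") ? "\"#{t}\"" : t }  # phrases need quotes
  parts  = required.map { |t| "+#{quote.call(t)}" }
  parts += optional.map { |t| quote.call(t) }
  parts += excluded.map { |t| "-#{quote.call(t)}" }
  parts.join(" ")
end

puts build_query(required: ["lung nodule"], optional: ["chest"], excluded: ["pneumothorax"])
# prints: +"lung nodule" chest -pneumothorax
```

Since an exclusion on its own returns no results, a caller would normally insist on at least one required or optional entry.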
Grouping¶
The search query supports using parentheses to group clauses into sub queries. This can be very useful to control the boolean logic of a query. To search for either "lung" or "nodule" and "chest" use the query (lung OR nodule) AND chest. This eliminates confusion and ensures that chest must exist and that at least one of the terms lung or nodule must also exist.
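When queries are assembled in code, wrapping every sub query in parentheses avoids relying on operator precedence. A small sketch in plain Ruby follows; any_of and all_of are hypothetical helper names, not SDK methods.

```ruby
# Hypothetical helpers (not part of the SDK): wrap clauses in parentheses so
# the boolean grouping is always explicit.
def any_of(*clauses)
  "(#{clauses.join(' OR ')})"
end

def all_of(*clauses)
  "(#{clauses.join(' AND ')})"
end

puts all_of(any_of("lung", "nodule"), "chest")
# prints: ((lung OR nodule) AND chest)
```

The outer parentheses are redundant here but harmless, and they keep the helpers composable.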
Proximity Searches¶
The search query supports finding words within a specified distance of each other. To perform a proximity search, use the tilde symbol, ~, at the end of a phrase. To search for "nodule" and "lung" within 10 words of each other in a document use the search "lung nodule"~10
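A proximity clause is simply a quoted phrase followed by ~ and a distance, so it can be generated from a word list. The following is a sketch in plain Ruby; proximity is a hypothetical helper name and the argument checks are an illustrative choice.

```ruby
# Hypothetical helper (not part of the SDK): build a proximity clause from a
# list of words and a maximum distance in words.
def proximity(words, distance)
  unless distance.is_a?(Integer) && distance >= 0
    raise ArgumentError, "distance must be a non-negative integer"
  end
  raise ArgumentError, "proximity needs at least two words" if words.size < 2
  "\"#{words.join(' ')}\"~#{distance}"
end

puts proximity(%w[lung nodule], 10)
# prints: "lung nodule"~10
```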
Wildcard Searches¶
The search query supports single and multiple character wildcard searches within single terms (not within phrase queries). To perform a single character wildcard search use the ? symbol. To perform a multiple character wildcard search use the * symbol. Note: You cannot use a * or ? symbol as the first character of a search.
Examples¶
- The single character wildcard search looks for terms that match with the single character replaced. For example, to search for "text" or "test" you can use the search te?t
- The multiple character wildcard search looks for 0 or more characters. To search for test, tests or tester, you can use the search test*
- You can also use wildcards in the middle of a term, such as te*t
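To get a feel for what a wildcard term matches, the pattern can be translated to a Ruby regular expression and tried against a word list. This is a local illustration only, written in plain Ruby with hypothetical helper names; Lucene performs the real matching against the index.

```ruby
# Hypothetical check for the rule that * or ? cannot start a search term.
def leading_wildcard?(term)
  term.start_with?("*", "?")
end

# Illustration only: ? becomes "exactly one character" and * becomes
# "zero or more characters" in an anchored Ruby Regexp.
def wildcard_to_regexp(term)
  pattern = Regexp.escape(term).gsub('\?', '.').gsub('\*', '.*')
  /\A#{pattern}\z/
end

words = %w[text test tests tester toast]
p words.grep(wildcard_to_regexp("te?t"))   # ["text", "test"]
p words.grep(wildcard_to_regexp("test*"))  # ["test", "tests", "tester"]
p words.grep(wildcard_to_regexp("te*t"))   # ["text", "test"]
puts leading_wildcard?("*est")             # true
```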
Rank Boosting¶
The search query provides the relevance level of matching documents based on the terms found. Boosting allows you to control the relevance of a document by boosting its term. The higher the boost factor, the more relevant the term will be. To boost a term use the caret symbol, ^, with a boost factor (a number) at the end of the term you are searching.
For example, if you are searching for lung nodule and you want the term "lung" to be more relevant, boost it using the ^ symbol along with the boost factor next to the term, such as lung^4 nodule. This will make documents with the term "lung" appear more relevant. You can also boost phrases, as in the example "lung nodule"^4 "chest".
By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2).
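The rule that the boost factor must be positive can be enforced when the clause is built. A minimal sketch in plain Ruby; boost is a hypothetical helper name, not an SDK method.

```ruby
# Hypothetical helper (not part of the SDK): append a boost factor to a term
# or phrase, rejecting factors that are not positive numbers.
def boost(term, factor)
  unless factor.is_a?(Numeric) && factor > 0
    raise ArgumentError, "boost factor must be positive"
  end
  quoted = term.include?(" ") ? "\"#{term}\"" : term  # phrases need quotes
  "#{quoted}^#{factor}"
end

puts boost("lung", 4)             # prints: lung^4
puts boost("lung nodule", 0.2)    # prints: "lung nodule"^0.2
```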
Escaping special characters¶
It is possible to escape operators such as ( using a \. However, punctuation is not captured within the index, although it remains part of the document body.
Further details about query parsing are available in the Lucene documentation.
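When user input should be treated as literal text rather than query syntax, the special characters can be escaped before parsing. The sketch below is plain Ruby; escape_query is a hypothetical helper, and the character list is an assumption based on the classic Lucene query syntax, so check it against the Lucene version in use. Because punctuation is not captured in the index, escaping mainly prevents parse errors; it does not make punctuation searchable.

```ruby
# Assumed list of characters with special meaning in classic Lucene query
# syntax: + - ! ( ) { } [ ] ^ " ~ * ? : \ | &
LUCENE_SPECIAL = /([+\-!(){}\[\]^"~*?:\\|&])/

# Hypothetical helper (not part of the SDK): prefix each special character
# with a backslash.
def escape_query(text)
  text.gsub(LUCENE_SPECIAL) { "\\#{$1}" }
end

puts escape_query("T1 (post-contrast)")
# prints: T1 \(post\-contrast\)
```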
Additional Filtering¶
In addition to the reportBody index, there are other fields that can further refine a search. While all fields are represented within Lucene as strings, they are formatted to take advantage of their values. Below is a list of all the fields, their corresponding schema locations, and a description that includes any special formatting considerations.
Index Name | Schema Location | Description |
---|---|---|
reportBody | rad_reports.report_body | |
reportImpression | rad_reports.report_impression | |
diagnosis | rad_exam_details.diagnosis | |
gender | patients.gender | |
radExam.resource.modality.id | modalities.id | |
radExam.resource.modality.modality | modalities.modality | |
radExam.patientMrn.mrn | patient_mrns.mrn | |
radExam.patientMrn.id | patient_mrns.id | |
endExam | rad_exam_times.end_exam | YYYYMMDD |
rad1.id | rad_reports.rad1_id | |
rad1.name | employee.name_id | |
rad2.id | rad_reports.rad2_id | |
rad2.name | employee.name_id | |
rad3.id | rad_reports.rad3_id | |
rad3.name | employee.name_id | |
rad4.id | rad_reports.rad4_id | |
rad4.name | employee.name_id | |
reportEvent | rad_reports.report_event | YYYYMMDD |
patientGender | patients.gender | |
radExam.relativePatientAge | | The age of the patient at the time of the exam (end exam) in years |
radExam.procedure.id | rad_exams.procedure_id | |
radExam.procedure.code | procedures.code | |
radExam.radExamDepartment.id | rad_exams.rad_exam_department_id | |
radExam.radExamDepartment.description | rad_exam_departments.description | |
radExam.resource.id | rad_exams.resource_id | |
radExam.site.id | rad_exams.site_id | |
radExam.site.site | sites.site | |
radExam.siteClass.id | rad_exams.site_class_id | |
radExam.siteClass.name | site_classes.name | |
radExam.siteClass.patientType.id | site_classes.patient_type_id | |
radExam.siteClass.patientType.patientType | patient_types.patient_type | |
reportStatus.id | rad_reports.report_status_id | |
reportStatus.universalEventType.id | universal_event_types.id | |
Within the SDK there are static methods to help build these indexes into a filter for the search query. They are:
- Java::HarbingerSdk::Search::term_range_filter(index_name, start_date_string, stop_date_string): Filtering a date formatted index between two formatted date strings (YYYYMMDD).
- Java::HarbingerSdk::Search::term_values_filter(index_name, [array_of_value_strings]): Filtering an index based on a list of possible values.
- Java::HarbingerSdk::Search::term_filter(index_name, value_string): Filtering an index based on a single string value.
- Java::HarbingerSdk::Search::numeric_range_filter(index_name, start_number, stop_number): Filtering a number formatted index based on two numbers.
Each can be combined into a set of filters with these boolean operators:
- Java::HarbingerSdk::Search::and_filters([array_of_filters]): Join an array of filters with an AND operator.
- Java::HarbingerSdk::Search::or_filters([array_of_filters]): Join an array of filters with an OR operator.
These methods can be found in the API documentation under the Search class.
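Date indexes are stored as YYYYMMDD strings, a format in which string order matches chronological order, which is what allows a range filter over string terms to behave like a date range. A short sketch in plain Ruby of preparing the two date strings follows; index_date and date_range are hypothetical helper names, and the ordering check is an illustrative safeguard rather than SDK behavior.

```ruby
require "date"

# Hypothetical helper (not part of the SDK): format a Date or Time the way
# the date indexes expect it.
def index_date(date)
  date.strftime("%Y%m%d")
end

# Hypothetical helper: return [start, stop] strings, refusing reversed ranges.
# Comparing the strings is enough because YYYYMMDD sorts chronologically.
def date_range(start_date, stop_date)
  start_s = index_date(start_date)
  stop_s = index_date(stop_date)
  raise ArgumentError, "start date is after stop date" if start_s > stop_s
  [start_s, stop_s]
end

start_s, stop_s = date_range(Date.new(2023, 12, 31), Date.new(2024, 1, 1))
puts start_s   # prints: 20231231
puts stop_s    # prints: 20240101
```

The resulting strings are in the form expected for the start_date_string and stop_date_string arguments of term_range_filter.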
An example combining these filters with a search term on reportBody, filtering by the rad_exam_times.end_exam and rad_reports.report_event dates for a specific procedure:
entity_manager ||= Java::HarbingerSdk::DataUtils.getEntityManager
full_text_entity_manager ||= Java::HarbingerSdk::DataUtils.getFullTextEntityManager(entity_manager)

# Standard parsed input terms for the search. Terms should be lowercase and boolean operators ALL CAPS.
# ex: "pneumonia AND pneumothorax"
# bad example: "Pneumonia and pneumothorax"
search_terms = "pneumonia"

begin
  # Swap out reportBody with reportImpression as desired
  phrase_query = Java::harbinger.sdk.Search::search_query("reportBody", search_terms)
rescue Java::OrgApacheLuceneQueryParser::ParseException => e
  return "Error while parsing search string"
end

# Arbitrary times, formatted as YYYYMMDD for the query (1.year.ago requires ActiveSupport)
start_date = 1.year.ago.strftime("%Y%m%d")
stop_date = Time.now.strftime("%Y%m%d")

# Time range filter on reportEvent
timefilter1 = Java::harbinger.sdk.Search::term_range_filter("reportEvent", start_date, stop_date)
# Time range filter on endExam
timefilter2 = Java::harbinger.sdk.Search::term_range_filter("endExam", start_date, stop_date)

# Procedure filter: look up the ids of procedures with an exact description,
# then filter the search on those ids (as strings)
pquery = Java::HarbingerSdkData::Procedure.createQuery(entity_manager)
descriptions = pquery.where(pquery.in(".description", ["CT CHEST W CONT PULMONARY ARTERIES"])).select(".id").list.to_a.collect(&:to_s)
procdescfilter = Java::harbinger.sdk.Search::term_values_filter("radExam.procedure.id", descriptions)

# Build the full text query with the given search query and filters
ftquery = full_text_entity_manager.createFullTextQuery(phrase_query)
# With this projection each result is a [RadReport, score] row
ftquery.setProjection(Java::org.hibernate.search.jpa.FullTextQuery::THIS, Java::org.hibernate.search.jpa.FullTextQuery::SCORE)
ftquery.setFilter(Java::harbinger.sdk.Search::and_filters([timefilter1, timefilter2, procdescfilter]))
results = ftquery.setMaxResults(100).getResultList().to_a
Limitations¶
There is no facility to combine a free text search and a SQL where clause into a single query at this time.