Full Text Search for the enterprise with Oracle Text

You work in software and your stack includes an Oracle database.  One day the business approaches you and says ‘I want a search page for our product/order/customer data.  Make it work like Google’.  You think to yourself, “If I could make a search page work like Google I would work at Google”!!!  Fear not, developer.  This problem has been solved many times in the past.  In this blog post I’m going to show you how to approach this problem, and show you a shortcut in case your environment’s stack includes an Oracle database.

Approaching Full Text Search

The problem you’re solving has a name and that name is Full Text Search.  The problem is that your Relational database, while presumably well normalized, is not good at searching for single words across huge data sets.  You need a different kind of database which is optimized for full text search.  A Search database will physically store the data differently so that it can quickly look up your search terms and return some metadata associated with those terms.  In your RDMBS, records are identified by keys.  In your Search index, they keys are the search terms.

There are several well known full text search solutions.  The bare minimum list you should probably know about is Solr/LuceneSphinx, and ElasticSearch.  These are all great full text search solutions, but they all require a lot of overhead to operate.  New servers, new software to install, new syntaxes to learn, admin consoles, and new interfaces or libraries to build into your front end application.

Oracle Stack Solution: Oracle Text

One drawback of each of the aforementioned search solutions is that you will likely want to run it on a dedicated machine (or VM).  If you work in an Oracle shop it likely means that you work in an enterprise where provisioning hardware (even virtual hardware) can be annoyingly difficult and time-consuming process.   I this environment, Oracle Text jumps out as a really nice solution.  Oracle Text is a full text search solution that is built in to all modern version of Oracle’s database.  This means that you don’t have to request a new machine, and request for new software to be installed on that machine in each of your QA and Production environments (or request for root access to do it yourself).  With Oracle Text you just run some DDL to create the index and start using it!*  The only hardware issue you should consider is the amount of disk in use on your Oracle database.

Here’s a simple example of how to take advantage of an Oracle Text search index.  Let’s assume that I have a database with products and reviews (a product has many reviews) and I want to be able to return search results for both at once.

The most straight-forward way to start is to gather all of the data you want to index into a single VARCHAR2 column named SEARCH_TEXT on our PRODUCT table.  If you need to index more than 4000 characters, use a CLOB.

alter table PRODUCTS add SEARCH_TEXT varchar2(4000);

Now we need to populate that column with the search data we want to index from the PRODUCTS and REVIEWS tables  We are going to fetch the data into the search text column as a big space delimited string.  The below query is called a correlated update, and is specific to Oracle.   You can accomplish the same thing with a procedure but I find this more concise.

update PRODUCTS set SEARCH_TEXT = (
 select P.name ||' '|| P.description ||' '||R.title ||' '||R.review_text
 from PRODUCTS P, REVIEWS R
 where P.ID = R.PRODUCT_ID
 ) where PRODUCTS.id = P.ID;

Next we create Oracle Text index on that column.  The important part is the ctxsys.context at the end of this statement.  Context is one of the three types of text indexes that oracle offers, but the best one for blocks of structured text.

create index PRODUCT_REVIEW_SEARCH_IDX on PRODUCT(SEARCH_TEXT)
 indextype is ctxsys.context;

It is worth noting that you can configure the index to use a separate tablespace so that you can control where on the disk your index lives.  See the docs for more info.

Next we we run a command to ‘sync‘ the index.  This actually indexes the data for the first time.  Run it again after you’ve inserted or updated data to update the index.  In fact, you should  plan on running this command periodically as part of a dbms_scheduler or whatever your enterprise’s favorite scheduler is.

ctx_ddl.sync_index(PRODUCT_REVIEW_SEARCH_IDX);

Now we can run a full text search query and see some results.  A statement like this will return all product records which have the word ‘paper’ in the title, description, or reviews. yay!  It’s pretty awesome that we can run searches on this index in our existing RDBMS and apply whatever filters, sorts, and joins we want without having to call out to another system.

select * from PRODUCTS where contains(SEARCH_TEXT, 'paper') > 0;

Finally, we create a job to periodically ‘optimize‘ the index.  According to the docs your index gets fragmented and slower over time and this will fix it up.  I’ve had luck with running this nightly but YMMV.

ctx_ddl.optimize_index(PRODUCT_REVIEW_SEARCH_IDX, 'FULL');

After you’ve got your index up and running you can get some useful info and stats out of it with the CTX_REPORTS package.  Among other things it will tell you how fragmented your index is, and what words are the most frequently indexed.

I’ve really just scratched the surface to show you how to get a text index up and running fast.  Oracle has a ton of options to tune the index, and search features like fuzzy searching, stemming, and wildcards.

*Ok, maybe you should still consult a DBA first if you have access to one.

Resources:

http://www.oracle.com/us/products/database/enterprise-edition/oracletext12cpresentation-1961514.pdf?ssSourceSiteId=otnen

http://docs.oracle.com/cd/E11882_01/text.112/e24436.pdf

 

Performance Optimization: Doing Science

I couldn’t hold a candle to Brian Green on such topics as Quantum Entaglement, Higgs Boson, or Grand Unified Theory (despite obtaining a B.A. in Physics), however I can apply the scientific method to improving the performance of your software.

The Scientific Method (sciencebuddies.org)

In this article I will explain a basic, but often overlooked foundation for improving the performance of any software application.

Much of software development is an art, but performance tuning is a science.  I’ve seen a lot of good developers waste time significant amounts of time on performance with little to show for it, or just as bad, improve performance without knowing exactly which change had the desired effect.

Do you remember talking about the Scientific Method from your high school science class?  The diagram on the right is a refresher.  The scientific method is the repeatable process on which all scientific exploration is based.  It gives scientists across the world a common language and framework to compare the process and outcomes of experiments.

The scientific process provides a few of important points that can be applied to software performance optimization:

  1. Repeatable process – use the same process for every performance enhancement you make
  2. Only modify one variable at a time – Do not make multiple tweaks at the same time.
  3. Record the results of each optimization.  Track what you did and how much it helped.
Performance Optimization Method (clickonchris.com)

This sounds simple right?  It is.  The tough part for software developers is to never break these rules during a round of optimizations.  To the right I’ve also included a more detailed diagram of what the scientific process looks like when applied to performance optimization.  Let’s call it the Performance Optimization Method.

But I know what I’m doing!  Why shouldn’t I make multiple tweaks at once?

Lets say you do make two changes at once.  You optimize two queries and drop the page load time from 3s to .1s.  Do you know how much relative impact the changes had?  Did each change reduce the cost by the same amount (50%/50%)?  Did one query account for most of the cost (75%/25%)?  Or did one of the changes not even have any impact (100%/0%)?  What if the two changes were somehow interdependent?  For the most part these questions are impossible to answer unless you use a repeatable process and only modify one variable at a time.  There are exceptions *(there are always exceptions.  If you have a good profiling tool that tells you exactly what two different method calls cost and you are absolutely sure they are not somehow related then you could cut a corner and make multiple changes at once.  If the results do not turn out as expected you still need to go back and make the changes one at a time).  By the way, I hope you are testing against a volume of data you expect in production.

Don’t forget to record the result of each optimization.  This way you can throw your results into a table, and with a little explanation about the process and results you turn it into a report and send it to management so they can see how you’re spending their budget (and how good you are at science).  Having these sorts of metrics reports also makes it easy for stakeholders to justify the time spent on performance optimization activities.

The law of diminishing returns applies to performance enhancements.  At some point you will have picked all of the low-hanging fruit and enhancements start to get progressively more expensive.  Stakeholders need insight into how this is progressing on your project so they can make decisions on how much more to spend on performance.  Metrics reports should provide sufficient detail for stakeholders to make those decisions.

Ultimately you will end up with a faster application and a clear story of how you got there.  Isn’t science fun?