Classification of Listings
Classification of listings, or how to tag them, is a central issue in search. A live person has to devise the categories and hope that everyone else thinks along the same lines. I spent some time on this and found that there are only two kinds of search:
- Browsing broad categories
- Specific searches
Any search engine has to satisfy these two styles of search. But how?
An ontology is a hierarchy of relationships between various objects. If you want to search for cars, you can start with a broad category and narrow it down by following the ontological tree until you reach what you are interested in. Mathematically, you are working with “directed graphs”, or digraphs. You can look at green cars, red cars, expensive cars, manual shift, high performance and so on.
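To make the broad-to-narrow idea concrete, here is a minimal sketch of an ontology stored as a directed graph. The category names, the `ONTOLOGY` dictionary and the helper functions are all hypothetical, chosen only for illustration; they are not taken from the actual search engine.

```python
from collections import deque

# Hypothetical ontology: each key is a category, each value lists its narrower categories.
ONTOLOGY = {
    "vehicles": ["cars", "trucks"],
    "cars": ["red cars", "green cars", "manual shift cars"],
    "trucks": [],
    "red cars": [],
    "green cars": [],
    "manual shift cars": [],
}

def narrower(category):
    """Return the immediate children of a category (one step narrower)."""
    return ONTOLOGY.get(category, [])

def broader(category):
    """Return every category that lists this one as a child (one step broader)."""
    return [parent for parent, children in ONTOLOGY.items() if category in children]

def descendants(category):
    """Breadth-first walk collecting everything reachable below a category."""
    seen = []
    queue = deque(narrower(category))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.append(node)
            queue.extend(narrower(node))
    return seen

print(narrower("cars"))
print(broader("red cars"))
print(descendants("vehicles"))
```

Following edges downward narrows the search and following them upward broadens it again. Because this is a general directed graph rather than a strict tree, a category could in principle sit under more than one parent.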
I then had to create this ontology to support the broad-to-narrow and back-to-broad search methods. Getting this to work in Microsoft SQL Server was a challenge, but I had some help from Joe Celko. He is one of the best and most readable authors on SQL Server and SQL in general.
Using Joe Celko’s methods, I devised a way to maintain ontologies in SQL Server, and I was able to create and maintain relationships between various categories. This was a huge achievement and it added a new dimension to my search engine. The question I had not asked yet was, “What am I going to do about language?”
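The text does not say which of Celko's techniques was used. One that he is well known for is the nested set model, in which every category stores a left and right number and a whole subtree can be fetched with a single range comparison. The sketch below assumes that model, uses SQLite through Python's `sqlite3` module as a stand-in for SQL Server, and invents the `category` table, its columns and its rows purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category ("
    "name TEXT PRIMARY KEY, lft INTEGER NOT NULL, rgt INTEGER NOT NULL)"
)

# Hypothetical hierarchy. A child's (lft, rgt) pair lies strictly inside its parent's.
rows = [
    ("vehicles", 1, 10),
    ("cars", 2, 7),
    ("red cars", 3, 4),
    ("green cars", 5, 6),
    ("trucks", 8, 9),
]
conn.executemany("INSERT INTO category VALUES (?, ?, ?)", rows)

def subtree(name):
    """Broad to narrow: the category itself plus everything nested inside it."""
    sql = """
        SELECT child.name
        FROM category AS parent
        JOIN category AS child
          ON child.lft BETWEEN parent.lft AND parent.rgt
        WHERE parent.name = ?
        ORDER BY child.lft
    """
    return [row[0] for row in conn.execute(sql, (name,))]

def ancestors(name):
    """Narrow to broad: every category whose range encloses this one."""
    sql = """
        SELECT parent.name
        FROM category AS child
        JOIN category AS parent
          ON parent.lft < child.lft AND parent.rgt > child.rgt
        WHERE child.name = ?
        ORDER BY parent.lft
    """
    return [row[0] for row in conn.execute(sql, (name,))]

print(subtree("cars"))
print(ancestors("red cars"))
```

Nested sets describe a strict tree, so a category with several parents would need a different representation, such as an adjacency table of parent and child pairs.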
Machine translation concerns the ability of a computer to translate from one natural language to another. I had to think ahead on this topic and backtrack a little on my ontology. I decided I had to make the search engine multilingual, and it had to incorporate new languages easily. How was I going to do it? Use a service like Google Translate? I decided I had to write this piece from scratch, otherwise I would have no control over search terms and their relationships to one another.
Creating morphemes for concepts was how I started this process. A morpheme represented an atomic concept or object. I used English to create the morphemes, but that was purely a convenience for me. Morphemes had to be indivisible enough that there was no smaller unit of something to think of when searching. So the ontology was now built from the morphemes, which again were in English for my own convenience.
Sitting on top of the morpheme tokens was a multilingual dictionary, a table that had standard language codes as an indexed column. I used UTF-16 so I could be sure to encode any language that can be represented on a computer. I started by translating the tokens (and ontology) into an initial 31 languages. This would support searches where each word in a search phrase could be in a different language (one word in Chinese, one in Russian, one in Hebrew, one in French and so on). The search engine would still find what you were looking for, because it was agnostic as to which words you used in which language.
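A minimal sketch of how such a dictionary can make a search language-agnostic: every word, in whatever language, resolves to the same language-neutral token before the search runs. The `DICTIONARY` rows, the token names and the `tokens_for` function are hypothetical, and splitting on whitespace is a simplification, since languages such as Chinese need a proper word segmenter.

```python
# Hypothetical dictionary rows: (language code, word) -> language-neutral token.
DICTIONARY = {
    ("en", "car"): "CAR",
    ("fr", "voiture"): "CAR",
    ("ru", "машина"): "CAR",
    ("zh", "汽车"): "CAR",
    ("en", "red"): "RED",
    ("fr", "rouge"): "RED",
    ("ru", "красный"): "RED",
    ("zh", "红色"): "RED",
}

# Collapse the language dimension so a word can be looked up without knowing
# which language it came from. This assumes no word maps to different tokens
# in different languages; the first mapping seen wins.
WORD_TO_TOKEN = {}
for (_lang, word), token in DICTIONARY.items():
    WORD_TO_TOKEN.setdefault(word.casefold(), token)

def tokens_for(query):
    """Map each known word in the query to its token, ignoring unknown words."""
    found = []
    for word in query.casefold().split():
        token = WORD_TO_TOKEN.get(word)
        if token is not None and token not in found:
            found.append(token)
    return found

print(tokens_for("voiture 红色"))
print(tokens_for("красный car"))
```

Both queries mix two languages, yet each resolves to the same pair of tokens, which is what lets the ontology stay in one working language while the dictionary grows.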
I spent months and months on the ontology, tokens and translations, eventually building up a sizable dictionary for all kinds of terms. Of course there were stop words and other filler parts of speech that I could throw away depending on the target language. I began to work on lexemes for each language, as it is necessary to know whether a word is a verb or a noun, whether it has been declined, what its plural forms are, whether the language is gender neutral or gender specific, and so on. I was beginning to see the limitations of SQL Server, and of any RDBMS (Relational Database Management System), in trying to track these variants and their relationships. The lexeme problem started consuming too much of my time and I left it for later.
Next: what about user interface?