This semester we examined two common techniques for representing, organizing, retrieving, and filtering information:
the deductive Booleanlogic of classes and their combinations, and the inductive Bayesianlogic of probabilities conditioned on past events.
As Bernhard Rieder observed, each of these techniques “implies a particular way of doing things” but can “be applied to a wide array of domains.”
For example, Boolean logic can be applied to classify and filter scientific articles, works of choreography, consumer products, or potential romantic partners, ifone can come up with a way to represent these things as belonging to classes defined by necessary and su?cient properties.
Likewise, Bayesian logic can be applied to these same domains and many others, ifone can represent these things as having quantified properties suitable for training a statistical model. In either case, for any particular domain of application, there are “many decisions to be made concerning the units to take into account, the parameters to specify, the tweaks to apply, the outputs to produce, and so forth” (Rieder).
For this question, you will identify and describe two different applications of information classification and filtering.The first should be an application for which you believe Boolean logic to be the preferable technique, and the second should be an application for which you believe Bayesian logic to be the preferable technique.
For each of the two applicationsthat you identify, you should:
Give two reasons why your chosen technique is preferable for this particular application. Again, do not just identify general strengths of the technique—instead give specific reasons why the technique might be preferable for this particular application.
filtering presented or used? The application may be purely hypothetical, or it may be based on a real application that you are aware of. See the next page for an example.
probabilities produced by the model could be used to classify and filter. Do not just give a
general explanation of how the technique works—instead focus on some of the specific decisions to be made for this particular application.
4.Finally, identify one potential problemwith applying the technique in this way. What limitations might the technique impose, or what negative consequences might result? One more time: do not just identify general limitations or drawbacks of the technique—instead describe a specific negative outcome that could result from using this technique for this particular application.
Exampleof a description of an application, like you might write for part 1 above (do not
use this example in your own answer):
DRAFT! © April 1, 2009 Cambridge University Press. Feedback welcome.
1
INFORMATION
RETRIEVAL
1
Boolean retrieval
The meaning of the term information retrieval can be very broad. Just getting
a credit card out of your wallet so that you can type in the card number
is a form of information retrieval. However, as an academic field of study,
information retrieval might be defined thus:
Information retrieval (IR) is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information need
from within large collections (usually stored on computers).
As defined in this way, information retrieval used to be an activity that only
a few people engaged in: reference librarians, paralegals, and similar professional searchers. Now the world has changed, and hundreds of millions
of people engage in information retrieval every day when they use a web
search engine or search their email.1 Information retrieval is fast becoming
the dominant form of information access, overtaking traditional databasestyle searching (the sort that is going on when a clerk says to you: “I’m sorry,
I can only look up your order if you can give me your Order ID”).
IR can also cover other kinds of data and information problems beyond
that specified in the core definition above. The term “unstructured data”
refers to data which does not have clear, semantically overt, easy-for-a-computer
structure. It is the opposite of structured data, the canonical example of
which is a relational database, of the sort companies usually use to maintain product inventories and personnel records. In reality, almost no data
are truly “unstructured”. This is definitely true of all text data if you count
the latent linguistic structure of human languages. But even accepting that
the intended notion of structure is overt structure, most text has structure,
such as headings and paragraphs and footnotes, which is commonly represented in documents by explicit markup (such as the coding underlying web
1. In modern parlance, the word “search” has tended to replace “(information) retrieval”; the
term “search” is quite ambiguous, but in context we use the two synonymously.
Online edition (c) 2009 Cambridge UP
2
1 Boolean retrieval
pages). IR is also used to facilitate “semistructured” search such as finding a
document where the title contains Java and the body contains threading.
The field of information retrieval also covers supporting users in browsing
or filtering document collections or further processing a set of retrieved documents. Given a set of documents, clustering is the task of coming up with a
good grouping of the documents based on their contents. It is similar to arranging books on a bookshelf according to their topic. Given a set of topics,
standing information needs, or other categories (such as suitability of texts
for different age groups), classification is the task of deciding which class(es),
if any, each of a set of documents belongs to. It is often approached by first
manually classifying some documents and then hoping to be able to classify
new documents automatically.
Information retrieval systems can also be distinguished by the scale at
which they operate, and it is useful to distinguish three prominent scales.
In web search, the system has to provide search over billions of documents
stored on millions of computers. Distinctive issues are needing to gather
documents for indexing, being able to build systems that work efficiently
at this enormous scale, and handling particular aspects of the web, such as
the exploitation of hypertext and not being fooled by site providers manipulating page content in an attempt to boost their search engine rankings,
given the commercial importance of the web. We focus on all these issues
in Chapters 19–21. At the other extreme is personal information retrieval. In
the last few years, consumer operating systems have integrated information
retrieval (such as Apple’s Mac OS X Spotlight or Windows Vista’s Instant
Search). Email programs usually not only provide search but also text classification: they at least provide a spam (junk mail) filter, and commonly also
provide either manual or automatic means for classifying mail so that it can
be placed directly into particular folders. Distinctive issues here include handling the broad range of document types on a typical personal computer,
and making the search system maintenance free and sufficiently lightweight
in terms of startup, processing, and disk space usage that it can run on one
machine without annoying its owner. In between is the space of enterprise,
institutional, and domain-specific search, where retrieval might be provided for
collections such as a corporation’s internal documents, a database of patents,
or research articles on biochemistry. In this case, the documents will typically be stored on centralized file systems and one or a handful of dedicated
machines will provide search over the collection. This book contains techniques of value over this whole spectrum, but our coverage of some aspects
of parallel and distributed search in web-scale search systems is comparatively light owing to the relatively small published literature on the details
of such systems. However, outside of a handful of web search companies, a
software developer is most likely to encounter the personal search and enterprise scenarios.
Online edition (c) 2009 Cambridge UP
1.1 An example information retrieval problem
3
In this chapter we begin with a very simple example of an information
retrieval problem, and introduce the idea of a term-document matrix (Section 1.1) and the central inverted index data structure (Section 1.2). We will
then examine the Boolean retrieval model and how Boolean queries are processed (Sections 1.3 and 1.4).
1.1
GREP
An example information retrieval problem
A fat book which many people own is Shakespeare’s Collected Works. Suppose you wanted to determine which plays of Shakespeare contain the words
Brutus AND Caesar AND NOT Calpurnia. One way to do that is to start at the
beginning and to read through all the text, noting for each play whether
it contains Brutus and Caesar and excluding it from consideration if it contains Calpurnia. The simplest form of document retrieval is for a computer
to do this sort of linear scan through documents. This process is commonly
referred to as grepping through text, after the Unix command grep, which
performs this process. Grepping through text can be a very effective process,
especially given the speed of modern computers, and often allows useful
possibilities for wildcard pattern matching through the use of regular expressions. With modern computers, for simple querying of modest collections
(the size of Shakespeare’s Collected Works is a bit under one million words
of text in total), you really need nothing more.
But for many purposes, you do need more:
1. To process large document collections quickly. The amount of online data
has grown at least as quickly as the speed of computers, and we would
now like to be able to search collections that total in the order of billions
to trillions of words.
2. To allow more flexible matching operations. For example, it is impractical
to perform the query Romans NEAR countrymen with grep, where NEAR
might be defined as “within 5 words” or “within the same sentence”.
3. To allow ranked retrieval: in many cases you want the best answer to an
information need among many documents that contain certain words.
INDEX
INCIDENCE MATRIX
TERM
The way to avoid linearly scanning the texts for each query is to index the
documents in advance. Let us stick with Shakespeare’s Collected Works,
and use it to introduce the basics of the Boolean retrieval model. Suppose
we record for each document – here a play of Shakespeare’s – whether it
contains each word out of all the words Shakespeare used (Shakespeare used
about 32,000 different words). The result is a binary term-document incidence
matrix, as in Figure 1.1. Terms are the indexed units (further discussed in
Section 2.2); they are usually words, and for the moment you can think of
Online edition (c) 2009 Cambridge UP
4
1 Boolean retrieval
Antony
Brutus
Caesar
Calpurnia
Cleopatra
mercy
worser
…
Antony
and
Cleopatra
1
1
1
0
1
1
1
Julius
Caesar
The
Tempest
Hamlet
Othello
Macbeth
1
1
1
1
0
0
0
0
0
0
0
0
1
1
0
1
1
0
0
1
1
0
0
1
0
0
1
1
1
0
1
0
0
1
0
…
◮ Figure 1.1 A term-document incidence matrix. Matrix element (t, d) is 1 if the
play in column d contains the word in row t, and is 0 otherwise.
them as words, but the information retrieval literature normally speaks of
terms because some of them, such as perhaps I-9 or Hong Kong are not usually
thought of as words. Now, depending on whether we look at the matrix rows
or columns, we can have a vector for each term, which shows the documents
it appears in, or a vector for each document, showing the terms that occur in
it.2
To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the
vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a
bitwise AND:
110100 AND 110111 AND 101111 = 100100
B OOLEAN RETRIEVAL
MODEL
DOCUMENT
COLLECTION
CORPUS
The answers for this query are thus Antony and Cleopatra and Hamlet (Figure 1.2).
The Boolean retrieval model is a model for information retrieval in which we
can pose any query which is in the form of a Boolean expression of terms,
that is, in which terms are combined with the operators AND, OR, and NOT.
The model views each document as just a set of words.
Let us now consider a more realistic scenario, simultaneously using the
opportunity to introduce some terminology and notation. Suppose we have
N = 1 million documents. By documents we mean whatever units we have
decided to build a retrieval system over. They might be individual memos
or chapters of a book (see Section 2.1.2 (page 20) for further discussion). We
will refer to the group of documents over which we perform retrieval as the
(document) collection. It is sometimes also referred to as a corpus (a body of
texts). Suppose each document is about 1000 words long (2–3 book pages). If
2. Formally, we take the transpose of the matrix to be able to get the terms as column vectors.
Online edition (c) 2009 Cambridge UP
5
1.1 An example information retrieval problem
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii
Lord Polonius:
I did enact Julius Caesar: I was killed i’ the
Capitol; Brutus killed me.
◮ Figure 1.2 Results from Shakespeare for the query Brutus
AND Caesar AND NOT
Calpurnia.
AD HOC RETRIEVAL
INFORMATION NEED
QUERY
RELEVANCE
EFFECTIVENESS
we assume an average of 6 bytes per word including spaces and punctuation,
then this is a document collection about 6 GB in size. Typically, there might
be about M = 500,000 distinct terms in these documents. There is nothing
special about the numbers we have chosen, and they might vary by an order
of magnitude or more, but they give us some idea of the dimensions of the
kinds of problems we need to handle. We will discuss and model these size
assumptions in Section 5.1 (page 86).
Our goal is to develop a system to address the ad hoc retrieval task. This is
the most standard IR task. In it, a system aims to provide documents from
within the collection that are relevant to an arbitrary user information need,
communicated to the system by means of a one-off, user-initiated query. An
information need is the topic about which the user desires to know more, and
is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need. A document is
relevant if it is one that the user perceives as containing information of value
with respect to their personal information need. Our example above was
rather artificial in that the information need was defined in terms of particular words, whereas usually a user is interested in a topic like “pipeline
leaks” and would like to find relevant documents regardless of whether they
precisely use those words or express the concept with other words such as
pipeline rupture. To assess the effectiveness of an IR system (i.e., the quality of
its search results), a user will usually want to know two key statistics about
the system’s returned results for a query:
PRECISION
Precision: What fraction of the returned results are relevant to the information need?
RECALL
Recall: What fraction of the relevant documents in the collection were returned by the system?
Online edition (c) 2009 Cambridge UP
6
1 Boolean retrieval
INVERTED INDEX
DICTIONARY
VOCABULARY
LEXICON
POSTING
POSTINGS LIST
POSTINGS
1.2
Detailed discussion of relevance and evaluation measures including precision and recall is found in Chapter 8.
We now cannot build a term-document matrix in a naive way. A 500K ×
1M matrix has half-a-trillion 0’s and 1’s – too many to fit in a computer’s
memory. But the crucial observation is that the matrix is extremely sparse,
that is, it has few non-zero entries. Because each document is 1000 words
long, the matrix has no more than one billion 1’s, so a minimum of 99.8% of
the cells are zero. A much better representation is to record only the things
that do occur, that is, the 1 positions.
This idea is central to the first major concept in information retrieval, the
inverted index. The name is actually redundant: an index always maps back
from terms to the parts of a document where they occur. Nevertheless, inverted index, or sometimes inverted file, has become the standard term in information retrieval.3 The basic idea of an inverted index is shown in Figure 1.3.
We keep a dictionary of terms (sometimes also referred to as a vocabulary or
lexicon; in this book, we use dictionary for the data structure and vocabulary
for the set of terms). Then for each term, we have a list that records which
documents the term occurs in. Each item in the list – which records that a
term appeared in a document (and, later, often, the positions in the document) – is conventionally called a posting.4 The list is then called a postings
list (or inverted list), and all the postings lists taken together are referred to as
the postings. The dictionary in Figure 1.3 has been sorted alphabetically and
each postings list is sorted by document ID. We will see why this is useful in
Section 1.3, below, but later we will also consider alternatives to doing this
(Section 7.1.5).
A first take at building an inverted index
To gain the speed benefits of indexing at retrieval time, we have to build the
index in advance. The major steps in this are:
1. Collect the documents to be indexed:
Friends, Romans, countrymen. So let it be with Caesar . . .
2. Tokenize the text, turning each document into a list of tokens:
Friends Romans countrymen So . . .
3. Some information retrieval researchers prefer the term inverted file, but expressions like index construction and index compression are much more common than inverted file construction and
inverted file compression. For consistency, we use (inverted) index throughout this book.
4. In a (non-positional) inverted index, a posting is just a document ID, but it is inherently
associated with a term, via the postings list it is placed on; sometimes we will also talk of a
(term, docID) pair as a posting.
Online edition (c) 2009 Cambridge UP
7
1.2 A first take at building an inverted index
Brutus
−→
1
2
4
11
31
45
173
174
Caesar
−→
1
2
4
5
6
16
57
132
Calpurnia
−→
2
31
54
101
…
..
.
| {z }
Dictionary
|
{z
Postings
}
◮ Figure 1.3 The two parts of an inverted index. The dictionary is commonly kept
in memory, with pointers to each postings list, which is stored on disk.
3. Do linguistic preprocessing, producing a list of normalized tokens, which
are the indexing terms: friend roman countryman so . . .
4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
DOC ID
SORTING
DOCUMENT
FREQUENCY
We will define and discuss the earlier stages of processing, that is, steps 1–3,
in Section 2.2 (page 22). Until then you can think of tokens and normalized
tokens as also loosely equivalent to words. Here, we assume that the first
3 steps have already been done, and we examine building a basic inverted
index by sort-based indexing.
Within a document collection, we assume that each document has a unique
serial number, known as the document identifier (docID). During index construction, we can simply assign successive integers to each new document
when it is first encountered. The input to indexing is a list of normalized
tokens for each document, which we can equally think of as a list of pairs of
term and docID, as in Figure 1.4. The core indexing step is sorting this list
so that the terms are alphabetical, giving us the representation in the middle
column of Figure 1.4. Multiple occurrences of the same term from the same
document are then merged.5 Instances of the same term are then grouped,
and the result is split into a dictionary and postings, as shown in the right
column of Figure 1.4. Since a term generally occurs in a number of documents, this data organization already reduces the storage requirements of
the index. The dictionary also records some statistics, such as the number of
documents which contain each term (the document frequency, which is here
also the length of each postings list). This information is not vital for a basic Boolean search engine, but it allows us to improve the efficiency of the
5. Unix users can note that these steps are similar to use of the sort and then uniq commands.
Online edition (c) 2009 Cambridge UP
8
1 Boolean retrieval
Doc 1
I did enact Julius Caesar: I was killed
i’ the Capitol; Brutus killed me.
Doc 2
So let it be with Caesar. The noble Brutus
hath told you Caesar was ambitious:
term
docID
term
docID
I
1
ambitious
2
term doc. freq.
did
1
be
2
ambitious 1
enact
1
brutus
1
be 1
julius
1
brutus
2
brutus 2
caesar
1
capitol
1
I
1
caesar
1
capitol 1
was
1
caesar
2
caesar 2
killed
1
caesar
2
did 1
i’
1
did
1
enact 1
the
1
enact
1
hath 1
capitol
1
hath
1
brutus
1
I
1
I 1
killed
1
I
1
i’ 1
me
1
i’
1
=⇒
=⇒ it 1
so
2
it
2
julius 1
let
2
julius
1
killed 1
it
2
killed
1
be
2
killed
1
let 1
with
2
let
2
me 1
caesar
2
me
1
noble 1
the
2
noble
2
so 1
noble
2
so
2
the 2
brutus
2
the
1
told 1
hath
2
the
2
told
2
told
2
you 1
you
2
you
2
was 2
caesar
2
was
1
with 1
was
2
was
2
ambitious
2
with
2
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
postings lists
2
2
1 → 2
1
1 → 2
1
1
2
1
1
2
1
1
2
1
2
2
1 → 2
2
2
1 → 2
2
◮ Figure 1.4 Building an index by sorting and grouping. The sequence of terms
in each document, tagged by their documentID (left) is sorted alphabetically (middle). Instances of the same term are then grouped by word and then by documentID.
The terms and documentIDs are then separated out (right). The dictionary stores
the terms, and has a pointer to the postings list for each term. It commonly also
stores other summary information such as, here, the document frequency of each
term. We use this information for improving query time efficiency and, later, for
weighting in ranked retrieval models. Each postings list stores the list of documents
in which a term occurs, and may store other information such as the term frequency
(the frequency of each term in each document) or the position(s) of the term in each
document.
Online edition (c) 2009 Cambridge UP
1.2 A first take at building an inverted index
9
search engine at query time, and it is a statistic later used in many ranked retrieval models. The postings are secondarily sorted by docID. This provides
the basis for efficient query processing. This inverted index structure is essentially without rivals as the most efficient structure for supporting ad hoc
text search.
In the resulting index, we pay for storage of both the dictionary and the
postings lists. The latter are much larger, but the dictionary is commonly
kept in memory, while postings lists are normally kept on disk, so the size
of each is important, and in Chapter 5 we will examine how each can be
optimized for storage and access efficiency. What data structure should be
used for a postings list? A fixed length array would be wasteful as some
words occur in many documents, and others in very few. For an in-memory
postings list, two good alternatives are singly linked lists or variable length
arrays. Singly linked lists allow cheap insertion of documents into postings
lists (following updates, such as when recrawling the web for updated documents), and naturally extend to more advanced indexing strategies such as
skip lists (Section 2.3), which require additional pointers. Variable length arrays win in space requirements by avoiding the overhead for pointers and in
time requirements because their use of contiguous memory increases speed
on modern processors with memory caches. Extra pointers can in practice be
encoded into the lists as offsets. If updates are relatively infrequent, variable
length arrays will be more compact and faster to traverse. We can also use a
hybrid scheme with a linked list of fixed length arrays for each term. When
postings lists are stored on disk, they are stored (perhaps compressed) as a
contiguous run of postings without explicit pointers (as in Figure 1.3), so as
to minimize the size of the postings list and the number of disk seeks to read
a postings list into memory.
?
Exercise 1.1
[ ⋆]
Draw the inverted index that would be built for the following document collection.
(See Figure 1.3 for an example.)
Doc 1
Doc 2
Doc 3
Doc 4
new home sales top forecasts
home sales rise in july
increase in home sales in july
july new home sales rise
Exercise 1.2
Consider these documents:
Doc 1
Doc 2
Doc 3
Doc 4
breakthrough drug for schizophrenia
new schizophrenia drug
new approach for treatment of schizophrenia
new hopes for schizophrenia patients
a. Draw the term-document incidence matrix for this document collection.
Online edition (c) 2009 Cambridge UP
[ ⋆]
10
1 Boolean retrieval
Brutus
−→
1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Calpurnia
−→
2 → 31 → 54 → 101
Intersection
=⇒
2 → 31
◮ Figure 1.5 Intersecting the postings lists for Brutus and Calpurnia from Figure 1.3.
b. Draw the inverted index representation for this collection, as in Figure 1.3 (page 7).
Exercise 1.3
[⋆]
For the document collection shown in Exercise 1.2, what are the returned results for
these queries:
a. schizophrenia AND drug
b. for AND NOT (drug OR approach)
1.3
SIMPLE CONJUNCTIVE
QUERIES
(1.1)
Processing Boolean queries
How do we process a query using an inverted index and the basic Boolean
retrieval model? Consider processing the simple conjunctive query:
Brutus AND Calpurnia
over the inverted index partially shown in Figure 1.3 (page 7). We:
1. Locate Brutus in the Dictionary
2. Retrieve its postings
3. Locate Calpurnia in the Dictionary
4. Retrieve its postings
5. Intersect the two postings lists, as shown in Figure 1.5.
POSTINGS LIST
INTERSECTION
POSTINGS MERGE
The intersection operation is the crucial one: we need to efficiently intersect
postings lists so as to be able to quickly find documents that contain both
terms. (This operation is sometimes referred to as merging postings lists:
this slightly counterintuitive name reflects using the term merge algorithm for
a general family of algorithms that combine multiple sorted lists by interleaved advancing of pointers through each; here we are merging the lists
with a logical AND operation.)
There is a simple and effective method of intersecting postings lists using
the merge algorithm (see Figure 1.6): we maintain pointers into both lists
Online edition (c) 2009 Cambridge UP
1.3 Processing Boolean queries
11
I NTERSECT ( p1, p2 )
1 answer ← h i
2 while p1 6= NIL and p2 6= NIL
3 do if docID ( p1) = docID ( p2)
4
then A DD ( answer, docID ( p1 ))
5
p1 ← next( p1 )
6
p2 ← next( p2 )
7
else if docID ( p1) < docID ( p2)
8
then p1 ← next( p1 )
9
else p2 ← next( p2 )
10 return answer
◮ Figure 1.6 Algorithm for the intersection of two postings lists p1 and p2 .
and walk through the two postings lists simultaneously, in time linear in
the total number of postings entries. At each step, we compare the docID
pointed to by both pointers. If they are the same, we put that docID in the
results list, and advance both pointers. Otherwise we advance the pointer
pointing to the smaller docID. If the lengths of the postings lists are x and
y, the intersection takes O( x + y) operations. Formally, the complexity of
querying is Θ( N ), where N is the number of documents in the collection.6
Our indexing methods gain us just a constant, not a difference in Θ time
complexity compared to a linear scan, but in practice the constant is huge.
To use this algorithm, it is crucial that postings be sorted by a single global
ordering. Using a numeric sort by docID is one simple way to achieve this.
We can extend the intersection operation to process more complicated queries
like:
(1.2)
QUERY OPTIMIZATION
(1.3)
(Brutus OR Caesar) AND NOT Calpurnia
Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by
the system. A major element of this for Boolean queries is the order in which
postings lists are accessed. What is the best order for query processing? Consider a query that is an AND of t terms, for instance:
Brutus AND Caesar AND Calpurnia
For each of the t terms, we need to get its postings, then AND them together.
The standard heuristic is to process terms in order of increasing document
6. The notation Θ(·) is used to express an asymptotically tight bound on the complexity of
an algorithm. Informally, this is often written as O (·), but this notation really expresses an
asymptotic upper bound, which need not be tight (Cormen et al. 1990).
Online edition (c) 2009 Cambridge UP
12
1 Boolean retrieval
I NTERSECT (ht1, . . . , tn i)
1 terms ← S ORT B Y I NCREASING F REQUENCY (ht1 , . . . , tn i)
2 result ← postings( f irst(terms))
3 terms ← rest(terms)
4 while terms 6= NIL and result 6= NIL
5 do result ← I NTERSECT (result, postings( f irst(terms)))
6
terms ← rest(terms)
7 return result
◮ Figure 1.7 Algorithm for conjunctive queries that returns the set of documents
containing each term in the input list of terms.
frequency: if we start by intersecting the two smallest postings lists, then all
intermediate results must be no bigger than the smallest postings list, and we
are therefore likely to do the least amount of total work. So, for the postings
lists in Figure 1.3 (page 7), we execute the above query as:
(1.4)
(Calpurnia AND Brutus) AND Caesar
This is a first justification for keeping the frequency of terms in the dictionary:
it allows us to make this ordering decision based on in-memory data before
accessing any postings list.
Consider now the optimization of more general queries, such as:
(1.5)
(madding OR crowd) AND (ignoble OR strife) AND (killed OR slain)
As before, we will get the frequencies for all terms, and we can then (conservatively) estimate the size of each OR by the sum of the frequencies of its
disjuncts. We can then process the query in increasing order of the size of
each disjunctive term.
For arbitrary Boolean queries, we have to evaluate and temporarily store
the answers for intermediate expressions in a complex expression. However,
in many circumstances, either because of the nature of the query language,
or just because this is the most common type of query that users submit, a
query is purely conjunctive. In this case, rather than viewing merging postings lists as a function with two inputs and a distinct output, it is more efficient to intersect each retrieved postings list with the current intermediate
result in memory, where we initialize the intermediate result by loading the
postings list of the least frequent term. This algorithm is shown in Figure 1.7.
The intersection operation is then asymmetric: the intermediate results list
is in memory while the list it is being intersected with is being read from
disk. Moreover the intermediate results list is always at least as short as the
other list, and in many cases it is orders of magnitude shorter. The postings
Online edition (c) 2009 Cambridge UP
13
1.3 Processing Boolean queries
intersection can still be done by the algorithm in Figure 1.6, but when the
difference between the list lengths is very large, opportunities to use alternative techniques open up. The intersection can be calculated in place by
destructively modifying or marking invalid items in the intermediate results
list. Or the intersection can be done as a sequence of binary searches in the
long postings lists for each posting in the intermediate results list. Another
possibility is to store the long postings list as a hashtable, so that membership
of an intermediate result item can be calculated in constant rather than linear
or log time. However, such alternative techniques are difficult to combine
with postings list compression of the sort discussed in Chapter 5. Moreover,
standard postings list intersection operations remain necessary when both
terms of a query are very common.
?
[ ⋆]
Exercise 1.4
For the queries below, can we still run through the intersection in time O( x + y),
where x and y are the lengths of the postings lists for Brutus and Caesar? If not, what
can we achieve?
a. Brutus AND NOT Caesar
b. Brutus OR NOT Caesar
[ ⋆]
Exercise 1.5
Extend the postings merge algorithm to arbitrary Boolean query formulas. What is
its time complexity? For instance, consider:
c. (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
Can we always merge in linear time? Linear in what? Can we do better than this?
Exercise 1.6
[⋆⋆]
We can use distributive laws for AND and OR to rewrite queries.
a. Show how to rewrite the query in Exercise 1.5 into disjunctive normal form using
the distributive laws.
b. Would the resulting query be more or less efficiently evaluated than the original
form of this query?
c. Is this result true in general or does it depend on the words and the contents of
the document collection?
Exercise 1.7
Recommend a query processing order for
d. (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
given the following postings list sizes:
Online edition (c) 2009 Cambridge UP
[ ⋆]
14
1 Boolean retrieval
Term
eyes
kaleidoscope
marmalade
skies
tangerine
trees
Postings size
213312
87009
107913
271658
46653
316812
Exercise 1.8
[⋆]
If the query is:
e. friends AND romans AND (NOT countrymen)
how could we use the frequency of countrymen in evaluating the best query evaluation
order? In particular, propose a way of handling negation in determining the order of
query processing.
Exercise 1.9
[⋆⋆]
For a conjunctive query, is processing postings lists in order of size guaranteed to be
optimal? Explain why it is, or give an example where it isn’t.
Exercise 1.10
Write out a postings merge algorithm, in the style of Figure 1.6 (page 11), for an x
query.
[⋆⋆]
OR
y
Exercise 1.11
[⋆⋆]
How should the Boolean query x AND NOT y be handled? Why is naive evaluation
of this query normally very expensive? Write out a postings merge algorithm that
evaluates this query efficiently.
1.4
RANKED RETRIEVAL
MODEL
FREE TEXT QUERIES
PROXIMITY OPERATOR
The extended Boolean model versus ranked retrieval
The Boolean retrieval model contrasts with ranked retrieval models such as the
vector space model (Section 6.3), in which users largely use free text queries,
that is, just typing one or more words rather than using a precise language
with operators for building up query expressions, and the system decides
which documents best satisfy the query. Despite decades of academic research on the advantages of ranked retrieval, systems implementing the Boolean retrieval model were the main or only search option provided by large
commercial information providers for three decades until the early 1990s (approximately the date of arrival of the World Wide Web). However, these
systems did not have just the basic Boolean operations (AND, OR, and NOT)
which we have presented so far. A strict Boolean expression over terms with
an unordered results set is too limited for many of the information needs
that people have, and these systems implemented extended Boolean retrieval
models by incorporating additional operators such as term proximity operators. A proximity operator is a way of specifying that two terms in a query
Online edition (c) 2009 Cambridge UP
1.4 The extended Boolean model versus ranked retrieval
15
must occur close to each other in a document, where closeness may be measured by limiting the allowed number of intervening words or by reference
to a structural unit such as a sentence or paragraph.
✎
Example 1.1: Commercial Boolean searching: Westlaw. Westlaw (http://www.westlaw.com/)
is the largest commercial legal search service (in terms of the number of paying subscribers), with over half a million subscribers performing millions of searches a day
over tens of terabytes of text data. The service was started in 1975. In 2005, Boolean
search (called “Terms and Connectors” by Westlaw) was still the default, and used
by a large percentage of users, although ranked free text querying (called “Natural
Language” by Westlaw) was added in 1992. Here are some example Boolean queries
on Westlaw:
Information need: Information on the legal theories involved in preventing the
disclosure of trade secrets by employees formerly employed by a competing
company. Query: "trade secret" /s disclos! /s prevent /s employe!
Information need: Requirements for disabled people to be able to access a workplace.
Query: disab! /p access! /s work-site work-place (employment /3 place)
Information need: Cases about a host’s responsibility for drunk guests.
Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest
Note the long, precise queries and the use of proximity operators, both uncommon
in web search. Submitted queries average about ten words in length. Unlike web
search conventions, a space between words represents disjunction (the tightest binding operator), & is AND and /s, /p, and /k ask for matches in the same sentence,
same paragraph or within k words respectively. Double quotes give a phrase search
(consecutive words); see Section 2.4 (page 39). The exclamation mark (!) gives a trailing wildcard query (see Section 3.2, page 51); thus liab! matches all words starting
with liab. Additionally work-site matches any of worksite, work-site or work site; see
Section 2.2.1 (page 22). Typical expert queries are usually carefully defined and incrementally developed until they obtain what look to be good results to the user.
Many users, particularly professionals, prefer Boolean query models. Boolean
queries are precise: a document either matches the query or it does not. This offers the user greater control and transparency over what is retrieved. And some domains, such as legal materials, allow an effective means of document ranking within a
Boolean model: Westlaw returns documents in reverse chronological order, which is
in practice quite effective. In 2007, the majority of law librarians still seem to recommend terms and connectors for high recall searches, and the majority of legal
users think they are getting greater control by using them. However, this does not
mean that Boolean queries are more effective for professional searchers. Indeed, experimenting on a Westlaw subcollection, Turtle (1994) found that free text queries
produced better results than Boolean queries prepared by Westlaw’s own reference
librarians for the majority of the information needs in his experiments. A general
problem with Boolean search is that using AND operators tends to produce high precision but low recall searches, while using OR operators gives low precision but high
recall searches, and it is difficult or impossible to find a satisfactory middle ground.
In this chapter, we have looked at the structure and construction of a basic
Online edition (c) 2009 Cambridge UP
16
1 Boolean retrieval
inverted index, comprising a dictionary and postings lists. We introduced
the Boolean retrieval model, and examined how to do efficient retrieval via
linear time merges and simple query optimization. In Chapters 2–7 we will
consider in detail richer query models and the sort of augmented index structures that are needed to handle them efficiently. Here we just mention a few
of the main additional things we would like to be able to do:
1. We would like to better determine the set of terms in the dictionary and
to provide retrieval that is tolerant to spelling mistakes and inconsistent
choice of words.
2. It is often useful to search for compounds or phrases that denote a concept
such as “operating system”. As the Westlaw examples show, we might also
wish to do proximity queries such as Gates NEAR Microsoft. To answer
such queries, the index has to be augmented to capture the proximities of
terms in documents.
TERM FREQUENCY
3. A Boolean model only records term presence or absence, but often we
would like to accumulate evidence, giving more weight to documents that
have a term several times as opposed to ones that contain it only once. To
be able to do this we need term frequency information (the number of times
a term occurs in a document) in postings lists.
4. Boolean queries just retrieve a set of matching documents, but commonly
we wish to have an effective method to order (or “rank”) the returned
results. This requires having a mechanism for determining a document
score which encapsulates how good a match a document is for a query.
With these additional ideas, we will have seen most of the basic technology that supports ad hoc searching over unstructured information. Ad hoc
searching over documents has recently conquered the world, powering not
only web search engines but the kind of unstructured search that lies behind
the large eCommerce websites. Although the main web search engines differ
by emphasizing free text querying, most of the basic issues and technologies
of indexing and querying remain the same, as we will see in later chapters.
Moreover, over time, web search engines have added at least partial implementations of some of the most popular operators from extended Boolean
models: phrase search is especially popular and most have a very partial
implementation of Boolean operators. Nevertheless, while these options are
liked by expert searchers, they are little used by most people and are not the
main focus in work on trying to improve web search engine performance.
?
Exercise 1.12
[⋆]
Write a query using Westlaw syntax which would find any of the words professor,
teacher, or lecturer in the same sentence as a form of the verb explain.
Online edition (c) 2009 Cambridge UP
1.5 References and further reading
17
Exercise 1.13
[ ⋆]
Try using the Boolean search features on a couple of major web search engines. For
instance, choose a word, such as burglar, and submit the queries (i) burglar, (ii) burglar
AND burglar, and (iii) burglar OR burglar. Look at the estimated number of results and
top hits. Do they make sense in terms of Boolean logic? Often they haven’t for major
search engines. Can you make sense of what is going on? What about if you try
different words? For example, query for (i) knight, (ii) conquer, and then (iii) knight OR
conquer. What bound should the number of results from the first two queries place
on the third query? Is this bound observed?
1.5
References and further reading
The practical pursuit of computerized information retrieval began in the late
1940s (Cleverdon 1991, Liddy 2005). A great increase in the production of
scientific literature, much in the form of less formal technical reports rather
than traditional journal articles, coupled with the availability of computers,
led to interest in automatic document retrieval. However, in those days, document retrieval was always based on author, title, and keywords; full-text
search came much later.
The article of Bush (1945) provided lasting inspiration for the new field:
“Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and, to coin one at
random, ‘memex’ will do. A memex is a device in which an individual
stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility.
It is an enlarged intimate supplement to his memory.”
The term Information Retrieval was coined by Calvin Mooers in 1948/1950
(Mooers 1950).
In 1958, much newspaper attention was paid to demonstrations at a conference (see Taube and Wooster 1958) of IBM “auto-indexing” machines, based
primarily on the work of H. P. Luhn. Commercial interest quickly gravitated
towards Boolean retrieval systems, but the early years saw a heady debate
over various disparate technologies for retrieval systems. For example Mooers (1961) dissented:
“It is a common fallacy, underwritten at this date by the investment of
several million dollars in a variety of retrieval hardware, that the algebra of George Boole (1847) is the appropriate formalism for retrieval
system design. This view is as widely and uncritically accepted as it is
wrong.”
The observation of AND vs. OR giving you opposite extremes in a precision/
recall tradeoff, but not the middle ground comes from (Lee and Fox 1988).
Online edition (c) 2009 Cambridge UP
18
REGULAR EXPRESSIONS
1 Boolean retrieval
The book (Witten et al. 1999) is the standard reference for an in-depth comparison of the space and time efficiency of the inverted index versus other
possible data structures; a more succinct and up-to-date presentation appears in Zobel and Moffat (2006). We further discuss several approaches in
Chapter 5.
Friedl (2006) covers the practical usage of regular expressions for searching.
The underlying computer science appears in (Hopcroft et al. 2000).
Online edition (c) 2009 Cambridge UP
An “information dynamic”
Communal labor
Semantic
labor
Universal labor
observation of the natural world,
counting, spoken & written language
pursuit of greater
exactness
Syntactic
labor
logical and mathematical
formalisms
signal transmission,
machines to rearrange signs,
mechanical implementation
of logic and math
pursuit of greater
exactness
Intellectual “property”
•
“Congress shall have Power ... To promote the Progress of Science
and useful Arts, by securing for limited Times to Authors and Inventors
the exclusive Right to their respective Writings and Discoveries …”
•
“Facts” are not copyrightable … but compilations of facts may be, if
its facts have been “selected, coordinated, or arranged in such a way
that the resulting work as a whole constitutes an original work of
authorship.”
•
“Rural's selection of listings … lacks the modicum of creativity
necessary to transform mere selection into copyrightable expression
… there is nothing remotely creative about arranging names
alphabetically in a white pages directory. It is an age-old practice,
firmly rooted in tradition and so commonplace that it has come to be
expected as a matter of course.”
Formally
representing text
for Boolean retrieval
Converting one digital code to another
x≥4?
N
Y
x≥2?
N
x≥6?
Y
x≥1?
N
Y
N
x≥3?
N
Y
Y
x≥5?
N
Y
x≥7?
N
Y
x=0
x=1
x=2
x=3
x=4
x=5
x=6
x=7
000
001
010
011
100
101
110
111
Converting one digital code to another
1. Does the letter come after O?
No
2. Does the letter come after G?
Yes
3. Does the letter come after K?
No
4. Does the letter come after I?
Yes
5. Does the letter come after J?
Yes
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0
0
0
0
1
0
0
0
1
0
0
0
0
1
1
0
0
1
0
0
0
0
1
0
1
0
0
1
1
0
0
0
1
1
1
0
1
0
0
0
0
1
0
0
1
0
1
0
1
0
0
1
0
1
1
0
1
1
0
0
0
1
1
0
1
0
1
1
1
0
0
1
1
1
1
1
0
0
0
0
1
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
0
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
1
1
0
0
0
1
1
0
0
1
1
1
0
1
0
“… any function capable of being
described as a precise logical
statement can be implemented
by … [a] system of switches.”
– Danny Hillis,
The Pattern on the Stone p. 3
base
chocolate
marshmallow
fudge
maple walnut
fudge
vanilla cherry
fudge
vanilla cherry
walnut fudge
Exhaustivity and exclusivity
•
A set of classes are exhaustive if their sum (∨) equals the
universe class
•
A set of classes are exclusive if, for each pair of classes,
their product (&) equals the empty class
•
In classifications, the classes at each level should be
exclusive, and exhaustive with respect to their
superordinate (containing) class
•
•
•
c fudge made with a chocolate
base
m fudge made with a maple
walnut base
v fudge made with a vanilla
base
• c&m
⟺ ∅
m&v ⟺ ∅
v&c ⟺ ∅
•
c, m, and v are
(mutually) exclusive
•c
∨ m ∨ v ⟺ U
•
a fudge made with
marshmallows added
•
c, m, and v are
(collectively) exhaustive
•
w fudge made with walnuts
added
•
a and ~a are
exclusive and exhaustive
•
h fudge made with cherries
added
•
same for w and ~w,
and for h and ~h
Analyzing textual meaning:
the “objective” way
“The method of identifying subjects by counting
references is a hopeless one unless we allow
ourselves to count indirect references, to group
items referred to and consider a reference to a
member of the group to be an indirect reference to
the group. But if we allow ourselves to do this, it is
quickly apparent that no unique results can be
expected, for there are many possible ways of
grouping any set of items referred to…”
– Wilson, p. 85
“Unstructured” text
•
What does it mean to call text unstructured?
•
Text has structure:
words, phrases, sentences, headings, paragraphs,
footnotes, introductions, conclusions, etc.
•
The “structure” it lacks is: explicit, systematic
categorization and naming by people who understand it
•
“Unstructured” in this case means: unprocessed by
semantic labor
Applications
•
Making text computable enables
syntactic labor to carry out:
•
“Ad hoc” search
•
Browsing and filtering
•
Clustering
•
Classification
Information retrieval
•
Given a collection of resources and a user's query, find
relevant resources
•
Collection:
repository of retrievable resources
•
Query:
representation of the user’s interest
•
Relevance:
relation between query and resources
grep
•
The simplest way to search
•
But, we need “plain text”
•
Even “plain” text can be rather complicated
•
Digital documents are just sequences of bytes—these
must be properly converted into characters
•
For anything more complex than grep, we need to index
structure & analysis
retrieval
organization & storage
Indexing
Museums
Art
Endowments
Philanthropy
UNC
Rembrandt Harmenszoon van Rijn
Ackland Art Museum
Sheldon Peck
Leena Peck
Bag of words
Call me Ishmael. Some years ago—never mind how
long precisely—having little or no money in my purse,
and nothing particular to interest me on shore, I
thought I would sail about a little and see the watery
part of the world. It is a way I have of driving off the
spleen and regulating the circulation. Whenever I find
myself growing grim about the mouth; whenever it is
a damp, drizzly November in my soul; whenever I find
myself involuntarily pausing before coffin warehouses,
and bringing up the rear of every funeral I meet; and
especially whenever my hypos get such an upper
hand of me, that it requires a strong moral principle to
prevent me from deliberately stepping into the street,
and methodically knocking people's hats off—then, I
account it high time to get to sea as soon as I can. This
is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his
sword; I quietly take to the ship. There is nothing
surprising in this. If they but knew it, almost all men in
their degree, some time or other, cherish very nearly
the same feelings towards the ocean with me.
the
10
all
1
street
1
from
1
i
9
money
1
hypos
1
would
1
and
7
hats
1
sea
1
warehouses
1
to
5
soon
1
substitute
1
there
1
a
5
years
1
driving
1
long
1
me
5
stepping
1
sail
1
cato
1
it
5
before
1
pistol
1
their
1
of
4
1
never
1
call
1
whenever
4
1
degree
1
way
1
my
4
methodical
ly
deliberate
ly
then
1
funeral
1
throws
1
is
4
his
1
involuntarily
1
interest
1
in
4
very
1
people's
1
that
1
find
2
account
1
november
1
purse
1
get
2
every
1
strong
1
but
1
some
2
nearly
1
on
1
part
1
about
2
they
1
cherish
1
particular
1
little
2
world
1
damp
1
circulation 1
with
2
bringing
1
quietly
1
up
1
myself
2
grim
1
prevent
1
can
1
this
2
rear
1
into
1
growing
1
as
2
upper
1
knocking
1
meet
1
nothing
2
towards
1
feelings
1
coffin
1
off
2
drizzly
1
philosophical
1
almost
1
or
2
soul
1
moral
1
mind
1
time
2
see
1
precisely
1
an
1
Inverted Index
https://developer.apple.com/library/mac/documentation/UserExperience/Conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
Boolean retrieval
“car”
1
2
6
“jaguar”
3
4
5
7
“automobile”
all documents
Thursday
•
Assignment #2 will be returned
•
Assignment #3 will be handed out;
due 1 week later, on November 14
•
Hjørland, “Classical Databases and
Knowledge Organization: A Case for
Boolean Retrieval and Human DecisionMaking during Searches.”
•
Why/how has Boolean retrieval been
criticized?
•
What are Hjørland's arguments in favor of
Boolean retrieval? What examples does he
use?
•
What does he identify as differences
between “computer science” and
“information science” traditions?
Boolean retrieval
Boolean retrieval
“car”
1
2
6
“jaguar”
3
4
5
7
“automobile”
all documents
Boolean search strategy
1. Identify different concepts:
global warming, crops
2. List variant terms for each concept:
global warming, greenhouse effect, greenhouse gases
crops, crop yields, crop production, food supply
3. Join variants with OR and concepts with AND: (global
warming OR greenhouse effect OR greenhouse gases)
AND (crops OR crop yields OR crop production OR food
supply)
Tokenization
•
“Call me Ishmael.” → [“Call”, “me”, “Ishmael”]
•
“I spent $15.75 on one cocktail!”
→ [“I”, “spent”, “$15.75”, “on”, “one”, “cocktail”]
Token vs. type vs. term
•
Tokens: Night,
•
Types: Night,
•
Terms: night,
town, Night, town, a, glass, Color, mahogany, Color,
mahogany, center, Rose, is, a, rose, is, a, rose, is, a, rose
town, a, glass, Color, mahogany, center, Rose, is, rose
town, glass, color, mahogany, center, rose
(Depends on how we process tokens)
Downcasing
Tar
Tar
tAr
tar
Heels
heels
HeElS
heels
tar heels
Stemming
abominable
abominably
abominates
abominations
abomin
Stemming
universal
universally
universe
universities
university
univers
Lemmatization
universal
universally
universe
universities
university
universal
universe
university
Stopwords
http://www.ranks.nl/stopwords
https://rybesh.github.io/indexing/
Next week
•
Evaluating information retrieval
•
How do we know when an IR
system is working well?
•
Buckland, Michael.
“Evaluation of Selection
Methods.” In Information and
Society, 153–164.
Classical Databases and Knowledge Organization:
A Case for Boolean Retrieval and Human
Decision-Making During Searches
Birger Hjørland
Royal School of Library and Information Science, University of Copenhagen, 6 Birketinget, DK-2300
Copenhagen S, Denmark. E-mail: pnl617@iva.ku.dk
This paper considers classical bibliographic databases
based on the Boolean retrieval model (such as MEDLINE
and PsycInfo). This model is challenged by modern
search engines and information retrieval (IR) researchers, who often consider Boolean retrieval a less efficient
approach. The paper examines this claim and argues for
the continued value of Boolean systems, and suggests
two further considerations: (a) the important role of
human expertise in searching (expert searchers and
“information literate” users) and (b) the role of library
and information science and knowledge organization
(KO) in the design and use of classical databases. An
underlying issue is the kind of retrieval system for which
one should aim. Warner’s (2010) differentiation between
the computer science traditions and an older libraryoriented tradition seems important; the former aim to
transform queries automatically into (ranked) sets of
relevant documents, whereas the latter aims to increase
the “selection power” of users. The Boolean retrieval
model is valuable in providing users with the power to
make informed searches and have full control over what
is found and what is not. These issues may have significant implications for the maintenance of information
science and KO as research fields as well as for the
information profession as a profession in its own right.
Introduction
Knowledge organization (KO) (also termed information
organization) is a research field within library and information science (LIS) that is concerned with, among other
things, indexing, classification, metadata, and the optimization of bibliographical records (in physical bibliographies/
catalogs as well as in electronic databases). Information
retrieval (IR) is another research field located in both
Received October 29, 2013; revised January 22, 2014; accepted January 22,
2014
© 2014 ASIS&T • Published online 4 June 2014 in Wiley Online Library
(wileyonlinelibrary.com). DOI: 10.1002/asi.23250
LIS and (today overwhelmingly) computer science.
However, these two fields have the same basic objective: to
make documents and “information” retrievable. They should
therefore be seen as competing fields, and their relative
strengths and weaknesses should be examined.
Such an examination is difficult, however, because KO
and IR are influenced by different ideas and techniques and
have different research traditions. The difficulty of achieving
coherence may be caused in part by the fragmented and
conceptually not highly coherent nature of the research traditions described. As Ellis (1996, pp. ix–x) wrote, “. . . in a
field as diverse as information retrieval research, it is impossible to provide any overall conclusion on the state of the
art.” This view is confirmed by the recent debate on whether
progress is being made in the field of IR (Armstrong, Moat,
Webber, & Zobel, 2009a, 2009b; Puurula, 2013; Trotman &
Keeler, 2011). This paper is based on the view that it is
important to uncover the often hidden assumptions in
research fields. As Einstein wrote, “Science without epistemology is — insofar as it is thinkable at all — primitive and
muddled” (1949, p. 684).
KO and IR comprise different approaches and research
traditions. In KO, for example, citation analysis, facet analysis, and user-based approaches have been described (see
Hjørland, 2013a, 2013b, 2013c). The relation between KO
and IR has not yet been thoroughly examined. This paper
begins such an examination but is only concerned with
systems based on the Boolean retrieval model. The most
prestigious IR models (such as vector space models, probabilistic models, and language models) are left to a later paper.1
The Boolean retrieval model has predominated in
classical bibliographical databases such as MEDLINE,
PsycINFO, Science Citation Index, and other databases
available through database-hosts such as Dialog (now ProQuest Dialog) and DataStar. They are typically used by
information professionals (intermediaries between databases and end users) or by advanced users. Therefore, there
JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 66(8):1559–1575, 2015
seems to be a close connection between (a) Boolean search
systems, (b) the need for human searchers (expert searchers
and advanced searchers), and (c) information science (developing knowledge concerning optimal search strategies and
information literacy). These three issues are therefore considered together in this article.
The paper is written from the perspective of KO and the
author has a background as a professional searcher. It is
organized as follows: The next section introduces the field of
IR and discusses its connection to KO. It concludes with a
view of basic different assumptions in the two fields, which
is used as the foundation for this article. The third section
introduces searching as a field of research and practice
within library and information science. The fourth section
looks closer at classical databases and how they involve KO.
The fifth section looks at the Boolean retrieval model and
contains a subsection presenting and discussing the criticism
that has been raised against it. The basic strength of Boolean
retrieval is its well-defined selection criteria and the
advanced user’s ability to fully controll the search process.
It is argued that much of the criticism of Boolean retrieval
seems to be based on problematic assumptions (which
may be overcome by expert searchers in properly indexed
databases). This section also contains a subsection that presents empirical studies on searching for systematic reviews.
In the Conclusion it is claimed that not only searchers
but also the entire field of information science may
suffer if concepts such as “recall devices” and “precision
devices” no longer have well-defined meanings due to the
existence of retrieval models without well-defined matching
criteria.
The Field of Information Retrieval
The 1950s saw the introduction of the computer into the
field of documentation together with the idea of information storage and retrieval. This concept led some library
researchers and documentalists to approach their field
from a new theoretical perspective: Libraries could be
understood as information-processing systems (and so
could scientific and scholarly communication—journals,
conferences, informal communication, reviews, etc.). The
concept of information storage and retrieval gave rise to
the field of information science, considered by some a new
field, by others as just a new and better name for library
science and documentation. The term information retrieval
was introduced by Mooers (1919–1994), who defined it
thus:
Information retrieval is the name for the process or method
whereby a prospective user of information is able to convert his
need for information into an actual list of citations to documents
in storage containing information useful to him. It is the finding
or discovery process with respect to stored information. It is
another, more general, name for the production of a demand
bibliography. Information retrieval embraces the intellectual
aspects of the description of information and its specification
for search, and also whatever systems, technique, or machines
1560
are employed to carry out the operation. Information retrieval is
crucial to documentation and organization of knowledge.
(Mooers, 1951, p. 25)
Van Rijsbergen also stated (in 1979)2 that IR is about finding
documents:
Information retrieval is a wide, often loosely-defined term but in
these pages I shall be concerned only with automatic information retrieval systems. Automatic as opposed to manual and
information as opposed to data or fact. Unfortunately the word
information can be very misleading. In the context of information retrieval (IR), information, in the technical meaning given
in Shannon’s theory of communication, is not readily measured
(Shannon & Weaver, 1964). In fact in many cases, one can
adequately describe the kind of retrieval by simply substituting
“document” for “information.” Nevertheless, “information
retrieval” has become accepted as a description of the kind of
work published by Cleverdon, Salton, Spark Jones, Lancaster
and others. A perfectly straightforward definition along this line
is given by Lancaster (1968): “Information retrieval is the term
conventionally, though somewhat inaccurately, applied to the
type of activity discussed in this volume. An information
retrieval system does not inform (i.e. change the knowledge of)
the user on the subject of his inquiry. It merely informs on the
existence (or non-existence) and whereabouts of documents
relating to his request.” This specifically excludes QuestionAnswering systems as typified by Winograd (1972) and those
described by Minsky (1968). It also excludes data retrieval
systems such as used by, say, the stock exchange for on-line
quotations. (p. 1)
A related view was expressed by Kemp (1988), who,
however, argued that “knowledge retrieval” should substitute for “information retrieval.”3 Another definition is provided by Baeza-Yates and Ribeiro-Neto (1999):
Information retrieval (IR) [is the] part of computer science
which studies the retrieval of information (not data) from a
collection of written documents. The retrieved documents aim
at satisfying a user information need usually expressed in
natural language. (p. 444)
In a second edition the same authors formulated it thus:
Information retrieval (IR) is a broad area of Computer Science
focused primarily on providing the users with easy access to
information of their interest, as follows: Information retrieval
deals with the representation, storage, organization of, and
access to information items such as documents, Web-pages,
online catalogs, structured and semi-structured records, multimedia objects. The representation and organization of the information should be such as to provide the users with easy access
to information of their interest. (Baeza-Yates & Ribeiro-Neto,
2011, p. 1)
It is clear from the quotations above that Mooers (1951)
considered KO to be part of IR: it included the “description
of information,” “specification of information for search,”
and is crucial to “organization of knowledge.” Mooers
JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—August 2015
DOI: 10.1002/asi
(1951) and Kemp (1988) found that IR covers both manual
and automatic retrieval, whereas van Rijsbergen (1979) and
Baeza-Yates and Ribeiro-Neto (2011) limited it to automatic
approaches, and the latter even to computer science.4 By
limiting IR to automatic processes, as van Rijsbergen (1979)
does, the possibility of considering the role of human expertise (as in this paper) seems lost; by limiting the field to
computer science, as Baeza-Yates and Ribeiro-Neto (2011)
do, the contributions from other fields (such as linguistics,
LIS, and science studies) seem to be neglected. It is also
clear that the term information is not generally accepted in
this field (and the difference between KO and IR is therefore
not likely to be based on differences in meaning of information versus knowledge).
The quotations also show that the term IR is not used
consistently: We cannot find univocity in the use of the terms
IR and KO. On the other hand, people in information science
have a fairly good impression of the meaning of these two
expressions. The core traditions of IR include the works of
researchers such as Salton (1991), Spärck Jones (1972), van
Rijsbergen (1979), and Robertson (1975), all of whom seem
today to be more closely connected to computer science than
to information science (as this field is understood here,
equivalent to LIS). In KO, on the other hand, researchers
such as Dahlberg (1974), Ranganathan (1967), Soergel
(1999), Svenonius (2000), and Vickery (1986) are among
the most well-known and cited.
A suggestion could be that IR is about automatic
methods in contrast to “manual” or “intellectual” methods
in KO. This way of understanding the difference is not
without basis in the two fields because most IR researchers
work with automated retrieval and at least some KO
researchers still think of KO as being about physical stores
and manual indexing. However, given the importance of
digital media, it seems problematic to base the scholarly
field of KO upon the technologies of yesterday. If this field
is to earn a place in the world, it should be able to demonstrate findings that are also useful in the digital world.
The distinction between “human-based indexing” and
“computer-based indexing” can also be considered unfruitful because it is not theoretically well supported: Because
human beings sometimes perform like a machine (i.e.,
mechanically following rules of indexing taught in schools
of librarianship), we need a deeper understanding of the
indexing process, whether it is done by humans or by computers. By implication, the distinction between human
indexing versus computer indexing does not represent two
theories of indexing. In Hjørland (2011b), I demonstrated
how different theories of knowledge (rationalism, empiricism, historicism, and pragmatism) represent the basic
theories of IR as well as of KO.
Warner (2010) found that two information and computer paradigms, the cognitive and the physical, have been
distinguished in IR research, but argued that “For the purposes of discussion . . . they can be considered as a single
heterogeneous paradigm, linked but not united” (pp. 4–5).
He found that this information and computer paradigm has
not always been explicit about its own values or examined
its own basic assumptions. In his analysis, the basic
assumption in that tradition may broadly be termed query
transformation, which implies that a user’s verbal query—
formulated before the start of the search—is transformed
by an information system into a set of documents or bibliographical records. According to Warner, these records
have been evaluated according to their relevance (by measures such as “recall” and “precision”) in relation to the
query. Warner (2010) finds that the underlying methodology tends to reify the concept of relevance and that the
underlying indexing philosophy in the searched material is
neglected and taken as given. Finally, he finds that this
approach contains an implicit teleology aimed at the construction of a perfect system. Contrary to the dominant
paradigm of the computer/information-oriented tradition,
Warner presents a tradition that is based more on library
science and the practice of indexing. This tradition is far
older, but less influential today. In his opinion, there are
two especially valuable elements in this tradition. The first
is the explicit priority of “selection power,”5 that is the
user’s ability to make relevant distinctions during a search.
The second is the recognition of the need for human labor
to create this selection power.6
One way to interpret Warner’s (2010) distinction is to
say that IR researchers tend to see IR as an input–output
process, in which the knowledge represented is a “black
box,” whereas KO is to a higher degree concerned with
what is inside the box, that is, how the universe of
recorded knowledge and culture has been represented in
bibliographical records and databases. From this perspective, the users are not primarily interacting with computers,
but with the universe of recorded knowledge, thus making
studies of knowledge and culture important for information
science.
Warner’s (2010) differentiation between IR as query
transformation versus KO as providing “selection power”
seems important and is taken as the point of departure in
my examination of IR in both this and a following article.
There is, however, one major issue to address. Whereas the
former mentioned researchers (Salton, Spärck Jones, van
Rijsbergen, and Robertson) seem to fit well with Warner’s
(2010) characterization of IR, the tradition of online
searching, represented by researchers such as Harter (1986)
and Lancaster (2003), seems less likely to do so. With the
arrival of “classical” bibliographic databases (such as
MEDLINE, Chemical Abstracts, and PsycINFO) and database hosts such as Dialog, a rapid development of search
techniques followed, mainly based on Boolean logic (the
extended Boolean model includes techniques such as proximity operators, truncation or tail wildcard, ranking, and
explosion or expansion).7 Such classical databases (and the
extended Boolean model) do not correspond to Warner’s
criticism of query transformation; on the contrary, they represent the most advanced form of selection power ever
developed in relation to bibliographical searches. In this
respect, they form a contrast to IR techniques as developed
JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—August 2015
DOI: 10.1002/asi
1561
in typical search engines for the Internet. The fact that the
Boolean model is normally discussed in the literature of IR
(and not usually, so far, in the literature of KO) does not
make Warner’s suggestion less relevant, as he has also
indicated,
“. . . the Boolean logic used in many retrieval systems, does,
under certain conditions, have the advantage of relative transparency to the searcher. The objection that it is an ineffective
way of transforming an information need into a set of relevant
records is no longer tenable.” (2000, p. 80)
Warner’s two approaches (query transformation and selection power) may be useful as fundamental precepts on which
to base an understanding of the role of KO in the electronic
world.
Information Searching and Information
Professionals
In schools of LIS, students are trained in searching
databases and using Internet search engines. They are
trained in using systems developed mainly by computer
scientists. Classical bibliographical databases developed
rapidly in the 1960s (see Bourne & Hahn, 2003; Hahn,
1998). Although these systems were developed outside
LIS, their use quickly found a place in many LIS curricula.
Until the development of systems for end users around
1990, searching databases was mainly done by professionals (whether they were termed documentalists, research
librarians, librarians, or information specialists). These
were mediated searches undertaken by professionals on
behalf of the users or with the users. These classical databases were essentially based on the Boolean retrieval
model (explained below).8
In about 1990, Internet search engines came along and
made end user searching easy and this became the dominant
way of searching digital information. Such systems were
typically based on alternatives to the Boolean model
(explained below). Bawden wrote,
. . . at that time [about 1995] there were a number of search
engines beside AltaVista: Lycos, Excite, Yahoo, Infoseek,
Hotbot and more. They all had different searching functions,
and were better for different types of material. People like me
ran courses on how to get the best out of them. There was even
a book, Search Engines for the World Wide Web, written by
Alfred and Emily Glossbrenner in 1998 [3rd ed., 2001], which
covered them all, and advised that we had to learn the strengths
and weaknesses of each one, after first mastering the idiosyncrasies of them all. . . . And then Google came along and
changed everything. Not because it was necessarily a better
system. (2013, electronic source, no page)
Bawden writes in the past tense: that he used to “[run]
courses on how to get the best out of [search engines].” Does
he mean that Google has made the teaching of online
1562
searching (and searching in general) obsolete? Or is he just
saying that there are no real competitors to Google today and
that the focus of teaching has changed accordingly? I do
believe that teaching the strong and weak aspects of different kinds of systems is still common in LIS schools, and I
have difficulty in imagining the purpose of LIS education if
searching information/documents is not among the core
competencies of our students.
The classical databases were not driven out of business
by the Internet and its search engines, although information
searching by professionals was damaged.9 Professional
searching still exists, partly in services offered by research
libraries and most professionally in the fields of medicine,
law, and patenting. In medicine, expert searching is in particular done in relation to systematic reviewing in the
approach known as evidence-based practice (EBP; see, e.g.,
Hjørland, 2011a; Kitchenham & Charters, 2007) according
to which it is mandatory to identify all documents fulfilling
certain criteria. Information professionals today have to
know both classical databases as well as search engines—
among other things—to help users find the information and
documents they need.
At least until recently, Harter’s (1986) textbook on
searching classical databases has been used and was
extremely popular in LIS schools. To my knowledge, no
one seems to have written a good successor yet: There are
many books on information seeking (studies of the actual
conduct of seeking)10 and information retrieval.11 There are
also many practical books on searching (e.g., Bell, 2012;
Hock, 2013), but they seem to lack a research basis and
searching seems to have been neglected by information scientists,12 something that was also found bibliometrically by
Zhao and Strotmann (2008).13 Searching as a field is here
understood as the study of the optimal conduct of searching, both in classical databases and the use of search
engines, but it also includes the classic term literature
searching (and “hand searching”). As a field of research,
searching should provide research-based support for normative decisions during search processes.14 A main point
here is that classical databases and Boolean retrieval are
still taught and are still considered important by LIS
schools,15 but research activities and the development of
research-based teaching materials in the field of searching
seem to have suffered (medical information science seems,
however, to be the exception confirming the rule; see the
section on empirical studies).
Computer science is mostly about making the systems,
whether we speak of classical database systems such as
Dialog (now Dialog-ProQuest) or of modern search engines.
The main role of librarians and information specialists has
never been to design such systems but—as Bawden (2013)
wrote—to know the possibilities and weaknesses in the concrete task of helping users find information.16 As stated
previously, it is not primarily human–computer interaction
that is the main issue for LIS, it is rather the interaction with
the world of recorded knowledge. This difference is important to keep in mind and it provides information science with
JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—August 2015
DOI: 10.1002/asi
a unique research perspective (at least potentially). An
example may explain the difference.
The database host Dialog had, in addition to real databases (e.g., the biological database BIOSIS Previews),
special ONTAP (ONline Training And Practice) databases
(e.g., ONTAP BIOSIS Previews) containing only portions of
the complete databases (which were free or much cheaper to
use). They contained records from a short time period and
may or may not have been current. They were intended for
training only. The point is that they may be adequate for
learning the narrow technical aspects of database searching.
However, for most people it is more interesting to search the
real databases, in which queries can be tested in relation to
what has been recorded on a given subject. It is much more
interesting to interact with the world of recorded knowledge
indexed in real-life databases. This perspective is important
from the point of view of LIS because it goes beyond
human–computer interaction.
From this perspective, Warner’s (2010) distinction
between the computer science16 traditions and an older
library-oriented tradition also seems to be important; the
former aimed to transform queries automatically into
(ranked) sets of relevant documents, whereas the latter
aimed to increase the “selection power” of users. He writes:
The value placed on query transformation [in computer science]
is dissonant with common practice, where users may prefer to
explore an area and may value fully informed exploration.
(2010, p. 5)
Warner’s view is related to what might be termed a hermeneutical approach to searching (cf. Boell & CecezKecmanovic, 2010) as opposed to a positivist approach. The
positivist view implies that searching can be done in a
formal way (algorithmic) that retrieves relevant knowledge
without bias in the search (and this is the assumption behind
evidence-based practice). The hermeneutic approach is
based on the assumption that there is a constant reinterpretation of the relevant literature, implying the need for great
flexibility in search criteria and a great level of iteration in
search processes—and, most important, an understanding of
what is going on during the search. Booth et al. (2013) also
found that Bates’ “berry picking” is in line with this hermeneutic approach: “Rather than sticking to an a priori search
protocol Bates described how a searcher’s concept of a
query is influenced by every new item of information that
they encounter” (p. 2).
This paper argues that Boolean search—although developed in computer science—is fully consonant with Warner’s
goal of increasing the selection power of users and also that
Boolean search is a better fit to the hermeneutical understanding of the search when compared with most alternative
retrieval models.
KO in Classical Bibliographic Databases
A given database, say, the medical database MEDLINE,
may be a comprehensive bibliographical database with a
broad and deep coverage of its domain (MEDLINE contains
more than 20 million references). However, despite the
impression of completeness, any bibliographical database is
always selective. Such selectiveness is often mainly formal
(e.g., limited to journal articles), but it is also in the end
connected to views of what is relevant and what constitutes
the field (e.g., of medicine) even if it includes many references to adjacent fields (often based on implicit views of the
domain). As we see later, even the coverage of a comprehensive database like MEDLINE is, for some important
purposes, not sufficient. The selection of material to the
bibliography constitutes the first and most basic level of KO.
Because the meaning of terms is implicitly understood in
this disciplinary context and to the extent that classification
and indexing is based on the principle of “literary warrant,”
this selection influences the developments of thesauri and
ontologies, which may thus be understood as a higher level
of KO.
Another level of KO is the bibliographical record and its
organization into fields (and the corresponding organization
of data in linear and inverted files). Such records vary from
database to database and from host to host. They are very
complicated structures that professional searchers have to
learn to master.17 It is not possible here to go too deeply into
this field, but just state a few points:
• A given record contains a mixture of fields derived from the
document it represents, as well as information added by the
database producer (it may contain the whole document in
addition to value-added information). It may also contain
information imported from third parties, as well as user-added
information in the case of social tagging and related technologies.
• Some fields contain controlled vocabularies developed by
information professionals (e.g., descriptors and classification
codes). Other fields contain “natural language” (i.e., the
authors’ language for special purposes). Many databases
include citation indexing, that is, the possibility of searching
bibliographical references in each document represented.
• It is important to realize that the efficiency of given fields for
optimal search strategies is relative from domain to domain
(the value of searching document titles, for example, varies
according to how titles are used in different domains). In the
social sciences the use of metaphors may thus limit the value
of title searches (for more, see Hjørland & Kyllesbech
Nielsen, 2001).
• By implication, the experienced searcher should know not just
about the database systems and the bibliographic records (or
full-text records) but also about the concepts and genres of the
primary literature. This aspect connects information science
with fields such as scholarly communication, written composition, genre studies, and language for special purposes.
Whereas KO in the narrow sense is about the design of bibliographical records and systems of controlled vocabularies,
KO in the broader sense is about how knowledge is organized
in different domains and how this can be used for IR. This
broader perspective on KO becomes increasingly important in
the context of full-text databases and the Internet.
• The specific requirement of indexing for Boolean searches is
to represent different facets of the document that are in the
JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—August 2015
DOI: 10.1002/asi
1563
search process to be combined by Boolean “and.” This is
known as “building blocks search strategy.”18 Such indexing
looks like the approach known as facet analysis (see
Hjørland, 2013b), but whereas the tradition of facet analysis is
mainly logical or speculative, it is important to anticipate
what facets need to be combined during a search. In evidencebased practice the methodological facet is important (e.g., a
descriptor for randomized clinical trials; which, by the way,
has only been considered in MEDLINE since about 1994,
after the breakthrough of the EBP paradigm). The construction of facets should therefore be based on studies of researchers’ criteria of relevance (which are most explicate
in EBP).
A third level of KO in classical databases consists of the
information retrieval thesaurus,19 ontologies, and other kinds
of controlled vocabulary constructed by information specialists.20 There is a huge literature on thesauri, ontologies, and
controlled vocabularies and their construction. From the IR
point of view, their value has been questioned (e.g., by
Salton, 1996) or neglected (cf. Warner, 2010, pp. 3, 148). An
important point about controlled vocabularies in general is
that they cannot once and for all provide a neutral and
objective connection between concepts. To quote Tuominen,
Talja, and Savolainen (2003),
While unitary documentary languages ensure a maximum
of mutual understanding [. . .], they do so by legitimizing a
particular ideological and sociopolitical worldview, and by
silencing other meanings, voices, and ways of knowing
[. . .]. Unitary documentary languages embody a belief in
the existence of a unified body of knowledge. They express
a belief in the possibility to capture reality isomorphically in “information,” and presuppose a neutral ground
from which to judge the truth-value of different theories.
(p. 563)
Each thesaurus and other controlled vocabularies reflect in
some way a philosophy of KO (e.g., a user-based view, a
facet-analytical view, or a domain-analytical view). The
value of such a tool is in the end connected to the principles and qualifications on which they are based: It is
meaningless to talk about the usefulness of thesauri while
disregarding their individual qualities. Skepticism concerning concrete systems may thus be justified, owing to the
specific principles and (lack of) qualifications of their constructors. They may be based on the problematic view that
relations in thesauri are “context-free, definitional, and true
in all possible worlds” (see Svenonius, 2000, 2004, and
Hjørland, in press, for a criticism) and related essentialist
philosophy. For the most part, it has not been taken into
account that “a controlled vocabulary is a way to insert an
interpretive layer of semantics between the term entered by
the user and the underlying database to better represent the
original intention of the terms of the user” (Fast, Leise, &
Steckel, 2002; electronic source, no page). To the degree
that existing thesauri and ontologies have been constructed
under the influence of problematic assumptions, there
might be room for improvement.
1564
A fourth level of KO in Boolean systems is generated by
the searcher:
• During search the user organizes the document representations in sets based on Boolean algebra. The final set is
supposed to contain references to the documents that satisfy
a set of criteria defined by the user (it may also contain
noise; the set is meant to be examined afterwards record
by record. The set is therefore not meant to satisfy the
user’s information need in a direct way. It is only meant to
delimit the full database to a set which has a high probability of containing the relevant documents). In other words,
the aim of KO and IR is purely economic: to save the time
of the searcher by reducing the number of records to be
considered.
• Theoretically, no bibliographical record or system allows
100% perfect subject retrieval and the efficiency of the search
is dependent on:
1. The quality of the bibliographical records, their indexing,
metadata, and relevant discriminatory power (relevant
also in relation to the “voices” mentioned by Tuominen
et al. (2003) cited previously; see also Hjørland,
2010). These may be termed objective search
possibilities.
2. The qualifications of the searcher (the subjective search
possibilities). For example, if a searcher does not know
about citation indexing, an objective possibility is not utilized. Studies have found that users are—to some degree—
able to adapt their searching to system performance; by
making an extra effort, they may compensate for shortcomings in the system.21
• Organization of Boolean sets allowing a high level of recall
and an acceptable level of precision thus depends on the
underlying information in either indexing or other available
information. The decisive issue is whether relevant documents can unequivocally be recognized at first hand
(whether criteria for what should be found can be established). This is a question of the philosophy and methodology of science (e.g., to the extent that an interpretation of
EBP results in norms such as “studies should be based on
RCT,” such norms are also the norms of relevance in IR and
the norms that the indexers have to meet; documents based
on RCT should thus be distinguishable from those which are
not). The understanding of requirements for indexing and
searching should therefore be produced backwards, that is,
sets of relevant documents should be identified and
described in a way that clearly differentiates them from all
other records. If searches fail, the reason should be identified and the indexing improved (known as “failure analysis”).22 The real difficulty in indexing is thus in determining
criteria of relevance, which are to be negotiated by epistemological criteria.
In conclusion, KO in the context of bibliographical
databases and full-text databases focuses on the optimal
construction of bibliographical records, subject access
points, metadata, and controlled vocabularies/thesauri/
ontologies. However, this optimal construction has to be
based in the broader study of and involvement with documents, genres, discourses, epistemologies, “voices,” and
domains.
JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—August 2015
DOI: 10.1002/asi
Boolean Retrieval
Boolean retrieval is based on Boolean algebra, named
after the English mathematician Boole (1815–1864).23 This
has had a great influence on computer science and plays an
important role in online searching. However, its usefulness
in relation to other retrieval models has often been disputed,
as discussed below. Boolean retrieval is commonly one of
the most used and taught models for IR, but it is often less
efficient compared to other retrieval models, as discussed in
standard textbooks such as that of Baeza-Yates and
Ribeiro-Neto (2011).
In most Boolean retrieval systems, bibliographical
records are contained in “linear files,” associated with
“inverted files” in which words, strings, or other symbols are
contained with references to record numbers in the linear
files. Usually only types (different terms/symbols), not
tokens (repeated types), are identified, but it is possible to
specify the type in a given field (e.g., in the title or in the
descriptor field), or in a given distance from other types and
thereby provide more precise search criteria. Searching is
done in the inverted files and matching documents (or rather
just their identification numbers) are organized in sets,
which may be combined with the logical “AND,” “OR,” or
“NOT” (the Boolean operators).24 Each operator—and each
complete search profile—can visually be described using
Venn diagrams.
To understand the possibilities of Boolean search when
used in its most advanced ways, it is necessary to know
about bibliographical records in online databases, as briefly
introduced above. Among the subject access points, there
can be many aspects of information such as whether a
journal is peer-reviewed or not, its impact factor, and other
kinds of information. Such kinds of information are available to the professional searcher. Semiotically, it could be
said that advanced bibliographical records consist of many
“voices” (in the sense of Bakhtin, 1986). Advanced searchers learn to utilize these voices and possibilities in creative
ways to optimize the search. During the search, sets of
documents are created, which may be combined by Boolean
operators in complex ways. This is a powerful technique
but, as already stated, it depends on the quality of the
records, the facilities of the database host, and the knowledge and experience of the searcher. Although Boolean
logic is a two-value logic (a given word form or symbol,
representing a property with some degree of certainty, is
either present or not present without possibilities of graduation), many combinations of sets based on two-valued logic
can compensate for this limitation. Information science
is not only about how to improve records and search facilities, but also about how to optimize the use of existing
records and facilities (e.g., about the comparative value
of citation-based retrieval and descriptor-based retrieval
in different domains) and about educating professional
searchers.
Although Boolean searches may be said to transform a
query into sets of potentially relevant records, it is much
more in line with Warner’s (2010) aim of increasing the
“selection power” of users because of the iterative nature of
the search and users’ control over the process. During the
process, users may explore how terms are used, how they
co-occur, and how knowledge is organized. In other words,
the query is not a static entity, but is modified during the
search process. In advanced systems, there is a very wide
range of possibilities to choose from (and searches should
never be performed without access to the documentation of
the system’s basic commands as well as to the structure and
search fields in the single databases used).
Boolean retrieval is a kind of “exact-match retrieval,”
which is opposed to “best-match models.” Turtle and Croft
defined exact-match retrieval as follows:
• The query specifies precise retrieval criteria.
• Every document either matches or fails to match the query.
• The result is a set of documents.25 (Croft, 1998; electronic
source, no page)
The opposite of exact match is termed “best match” (or
“partial match” or “scoring systems”). Turtle and Croft
defined best-match retrieval in the following way:
• The query describes a good or “best” matching document.
• The result is a ranked list of documents.
• The result may include an estimate of quality. (Croft, 1998)
Boolean approaches differ from best-match systems in that
the searchers maintain full control over the search process,
whereas search engines work behind the back of the users,
so to speak. Boolean searching is a kind of “exact-match
retrieval” in that given documents or document representations either fulfill the demands or fail to fulfill the demands
expressed in search logic. It requires that the users obtain
the necessary knowledge to be able to evaluate the consequences of their decisions made during the search. Of
course, it is easier not to involve oneself with such issues,
but this comes at the cost of not knowing what is not found
and why this is the case. In some cases this may be fine, that
is, in cases where overview of a specific part of the information is not necessary. Especially (but not exclusively) for
academic purposes, it seems important that users maintain
control over the search process.
In relation to this, it is noteworthy that informationretrieval specialists seldom discuss the evaluation of systems
in connection with their own research activities and experiences, but tend to consider users’ information needs without
making this move of reflexivity. Reflexivity means that the
principles one suggests must also be valid in relation to
oneself:
In the most general terms, considered as a feature or style of
inquiry, reflexivity is a concept that expresses a concern to turn
‘subject’ into ‘object,’ to turn the activity of inquiry into a topic
for (its own) inquiry. To ‘be reflexive’ is to bring the process of
doing an activity into the purview of that activity as a feature of
it. (Ashmore, 2001, p. 12881)
JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—August 2015
DOI: 10.1002/asi
1565
[The sociology of science] would be reflexive. In principle its
patterns of explanation would have to be applicable to sociology itself. (Bloor, 1976, p. 7)
In relation to information science, reflexivity means that
information scientists should also consider themselves to be
users of information systems and—IR—researchers should
consider information systems and information needs in relation to their own behaviour. In other words, IR researchers
should consider their own systems in relation to their own
research activities and design systems that also suit themselves26 (which also implies that they should be able to
formulate criteria for which documents are relevant to their
own research).
It should be said that different retrieval models may be
combined—and often have been combined. The extended
Boolean model (Salton, Fox, & Wu, 1983)27 makes use of
partial matching and term weights as in the vector space
model, but still uses the Boolean queries. There are many
techniques and the traditional Boolean search systems are
often improved with the inclusion of other features such as
proximity searching and ranking. From the point of view of
this paper, the important thing is whether the user has full
control over the search—if he or she wants to have it. It is
also clear that all kinds of tools that may provide the
searcher with useful information in making decisions (e.g.,
information about term frequencies and citation patterns in
documents and in the whole database) should be provided.
Criticism of Boolean Retrieval
Göker and Davies (2009) wrote, “The Boolean model is
the first model of information retrieval and probably also the
most criticized model” (p. 3). It has been claimed that a
serious limitation of Boolean searches is that they are based
on a two-value logic. Documents are divided into two sets:
relevant documents and nonrelevant documents. However,
in real life most documents have degrees of relevance.28 A
further limitation that has also been mentioned is that
Boolean logic only concerns the relevance of documents,
not their mutual relations.29
Many texts, for example, Belkin and Croft (1987),
Baeza-Yates and Ribeiro-Neto (1999, pp. 27, 281; 2011,
p. 66), Chowdhury (2010, pp. 206–207), Chu (2010, pp.
112–114), Cooper (1988), Korfhage (1997), and Turtle and
Croft (1992) provide related criticism of exact-match
models. Baeza-Yates and Ribeiro-Neto (2011) wrote,
“Today, it is well known that index term weighting can lead
to a substantial improvement in retrieval quality [compared
to Boolean search]” (p. 66). Turtle and Croft (1992) found
that: “For some kinds of searching (e.g., known-item searching) the deterministic and predictable exact-match models
perform very well, while for more general searching or for
untrained users the vector space and probabilistic models
have clear performance advantages” (p. 289).
Van Rijsbergen considers Boolean retrieval a tool “so
‘blunt’ that it is almost useless” (1991, p. 85) and (seemingly
1566
to the contrary): “a very attractive approach indeed, unfortunately most systems will either retrieve a set much too
large or nothing at all. It is often the case that although a
document is not a model of a query, it nevertheless has some
relevance to the user need and indeed is about the query. So
here we have a situation where a two-valued logic just will
not work. The difficulty is how to extend the Boolean logic
without at the same time losing its advantages” (van
Rijsbergen & Lalmas, 1996, p. 387).
Here, one of the critiques will be considered in more
detail. Larson (2010, p. 2557) lists seven problems with
Boolean searches. Each of these will be presented and discussed in turn:
1. “Boolean logic requires that the users be trained in construction of queries, because it is not intuitively obvious,
and differs from common natural language usage. (E.g., a
user seeking books on ‘cats and dogs’ probably wants
books on either, not just those that treat both.)”
Comment: It is correct that training is necessary and that
Boolean logic is somewhat counterintuitive; the NOT
command, in particular, is known to be problematic and
counterintuitive to use; however, despite these difficulties,
Boolean logic is powerful: when the technique has been
learned, the searcher is empowered in ways with which no
other technique can compete. The first objection is only
relevant if commercial criteria are taken into account or if
extreme simplicity is mandatory.
2. “Boolean AND operations tend to be too restrictive. A
search for “A AND B AND C” will eliminate any records
that do not have all these terms. Those that have one or
two of the three terms are rejected along with those that
have none of the terms.”
Comment: If a search for “A AND B AND C” turns out to
be too restrictive, the searcher may just search: (A AND
B) OR (A AND C) OR (B AND C). Many additional
strategies are available to increase recall (dependent on
the content and structure of the database). Larson’s
second point of criticism is therefore wrong.
3. “Boolean OR operations tend to be too inclusive. A search
for “A OR B OR C” will retrieve any record with at least
one of the terms, but no...
Essay Writing Service Features
Our Experience
No matter how complex your assignment is, we can find the right professional for your specific task. Achiever Papers is an essay writing company that hires only the smartest minds to help you with your projects. Our expertise allows us to provide students with high-quality academic writing, editing & proofreading services.Free Features
Free revision policy
$10Free bibliography & reference
$8Free title page
$8Free formatting
$8How Our Dissertation Writing Service Works
First, you will need to complete an order form. It's not difficult but, if anything is unclear, you may always chat with us so that we can guide you through it. On the order form, you will need to include some basic information concerning your order: subject, topic, number of pages, etc. We also encourage our clients to upload any relevant information or sources that will help.
Complete the order formOnce we have all the information and instructions that we need, we select the most suitable writer for your assignment. While everything seems to be clear, the writer, who has complete knowledge of the subject, may need clarification from you. It is at that point that you would receive a call or email from us.
Writer’s assignmentAs soon as the writer has finished, it will be delivered both to the website and to your email address so that you will not miss it. If your deadline is close at hand, we will place a call to you to make sure that you receive the paper on time.
Completing the order and download