Boolean and Bayesian Scenarios with explanation

This semester we examined two common techniques for representing, organizing, retrieving, and filtering information:

the deductive Booleanlogic of classes and their combinations, and the inductive Bayesianlogic of probabilities conditioned on past events.

As Bernhard Rieder observed, each of these techniques “implies a particular way of doing things” but can “be applied to a wide array of domains.”

For example, Boolean logic can be applied to classify and filter scientific articles, works of choreography, consumer products, or potential romantic partners, ifone can come up with a way to represent these things as belonging to classes defined by necessary and su?cient properties.

Likewise, Bayesian logic can be applied to these same domains and many others, ifone can represent these things as having quantified properties suitable for training a statistical model. In either case, for any particular domain of application, there are “many decisions to be made concerning the units to take into account, the parameters to specify, the tweaks to apply, the outputs to produce, and so forth” (Rieder).

For this question, you will identify and describe two different applications of information classification and filtering.The first should be an application for which you believe Boolean logic to be the preferable technique, and the second should be an application for which you believe Bayesian logic to be the preferable technique.

For each of the two applicationsthat you identify, you should:

Describe the application briefly but with su?cient detail to provide context for your argument in favor of the chosen technique. Who uses it, and why? How do they use it? What is the “stu? ” being classified and filtered? How are the outputs of classification and

Explain how the technique you’ve chosen could be applied. If the application is the one for which you believe Boolean logic would be preferable, explain how the “stu? ” could be represented as classes defined by necessary and su?cient properties, and how someone using the application would use these classes to classify and filter. If the application is the one for which you believe Bayesian logic would be preferable, explain how the “stu? ” could be represented as things having quantified properties, and how these representations could be used to train a statistical model, and how the estimated

Give two reasons why your chosen technique is preferable for this particular application. Again, do not just identify general strengths of the technique—instead give specific reasons why the technique might be preferable for this particular application.

filtering presented or used? The application may be purely hypothetical, or it may be based on a real application that you are aware of. See the next page for an example.

probabilities produced by the model could be used to classify and filter. Do not just give a

general explanation of how the technique works—instead focus on some of the specific decisions to be made for this particular application.

4.Finally, identify one potential problemwith applying the technique in this way. What limitations might the technique impose, or what negative consequences might result? One more time: do not just identify general limitations or drawbacks of the technique—instead describe a specific negative outcome that could result from using this technique for this particular application.

Exampleof a description of an application, like you might write for part 1 above (do not

use this example in your own answer):

DRAFT! © April 1, 2009 Cambridge University Press. Feedback welcome.
1
INFORMATION
RETRIEVAL
1
Boolean retrieval
The meaning of the term information retrieval can be very broad. Just getting
a credit card out of your wallet so that you can type in the card number
is a form of information retrieval. However, as an academic field of study,
information retrieval might be defined thus:
Information retrieval (IR) is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information need
from within large collections (usually stored on computers).
As defined in this way, information retrieval used to be an activity that only
a few people engaged in: reference librarians, paralegals, and similar professional searchers. Now the world has changed, and hundreds of millions
of people engage in information retrieval every day when they use a web
search engine or search their email.1 Information retrieval is fast becoming
the dominant form of information access, overtaking traditional databasestyle searching (the sort that is going on when a clerk says to you: “I’m sorry,
I can only look up your order if you can give me your Order ID”).
IR can also cover other kinds of data and information problems beyond
that specified in the core definition above. The term “unstructured data”
refers to data which does not have clear, semantically overt, easy-for-a-computer
structure. It is the opposite of structured data, the canonical example of
which is a relational database, of the sort companies usually use to maintain product inventories and personnel records. In reality, almost no data
are truly “unstructured”. This is definitely true of all text data if you count
the latent linguistic structure of human languages. But even accepting that
the intended notion of structure is overt structure, most text has structure,
such as headings and paragraphs and footnotes, which is commonly represented in documents by explicit markup (such as the coding underlying web
1. In modern parlance, the word “search” has tended to replace “(information) retrieval”; the
term “search” is quite ambiguous, but in context we use the two synonymously.
Online edition (c) 2009 Cambridge UP
2
1 Boolean retrieval
pages). IR is also used to facilitate “semistructured” search such as finding a
document where the title contains Java and the body contains threading.
The field of information retrieval also covers supporting users in browsing
or filtering document collections or further processing a set of retrieved documents. Given a set of documents, clustering is the task of coming up with a
good grouping of the documents based on their contents. It is similar to arranging books on a bookshelf according to their topic. Given a set of topics,
standing information needs, or other categories (such as suitability of texts
for different age groups), classification is the task of deciding which class(es),
if any, each of a set of documents belongs to. It is often approached by first
manually classifying some documents and then hoping to be able to classify
new documents automatically.
Information retrieval systems can also be distinguished by the scale at
which they operate, and it is useful to distinguish three prominent scales.
In web search, the system has to provide search over billions of documents
stored on millions of computers. Distinctive issues are needing to gather
documents for indexing, being able to build systems that work efficiently
at this enormous scale, and handling particular aspects of the web, such as
the exploitation of hypertext and not being fooled by site providers manipulating page content in an attempt to boost their search engine rankings,
given the commercial importance of the web. We focus on all these issues
in Chapters 19–21. At the other extreme is personal information retrieval. In
the last few years, consumer operating systems have integrated information
retrieval (such as Apple’s Mac OS X Spotlight or Windows Vista’s Instant
Search). Email programs usually not only provide search but also text classification: they at least provide a spam (junk mail) filter, and commonly also
provide either manual or automatic means for classifying mail so that it can
be placed directly into particular folders. Distinctive issues here include handling the broad range of document types on a typical personal computer,
and making the search system maintenance free and sufficiently lightweight
in terms of startup, processing, and disk space usage that it can run on one
machine without annoying its owner. In between is the space of enterprise,
institutional, and domain-specific search, where retrieval might be provided for
collections such as a corporation’s internal documents, a database of patents,
or research articles on biochemistry. In this case, the documents will typically be stored on centralized file systems and one or a handful of dedicated
machines will provide search over the collection. This book contains techniques of value over this whole spectrum, but our coverage of some aspects
of parallel and distributed search in web-scale search systems is comparatively light owing to the relatively small published literature on the details
of such systems. However, outside of a handful of web search companies, a
software developer is most likely to encounter the personal search and enterprise scenarios.
Online edition (c) 2009 Cambridge UP
1.1 An example information retrieval problem
3
In this chapter we begin with a very simple example of an information
retrieval problem, and introduce the idea of a term-document matrix (Section 1.1) and the central inverted index data structure (Section 1.2). We will
then examine the Boolean retrieval model and how Boolean queries are processed (Sections 1.3 and 1.4).
1.1
GREP
An example information retrieval problem
A fat book which many people own is Shakespeare’s Collected Works. Suppose you wanted to determine which plays of Shakespeare contain the words
Brutus AND Caesar AND NOT Calpurnia. One way to do that is to start at the
beginning and to read through all the text, noting for each play whether
it contains Brutus and Caesar and excluding it from consideration if it contains Calpurnia. The simplest form of document retrieval is for a computer
to do this sort of linear scan through documents. This process is commonly
referred to as grepping through text, after the Unix command grep, which
performs this process. Grepping through text can be a very effective process,
especially given the speed of modern computers, and often allows useful
possibilities for wildcard pattern matching through the use of regular expressions. With modern computers, for simple querying of modest collections
(the size of Shakespeare’s Collected Works is a bit under one million words
of text in total), you really need nothing more.
But for many purposes, you do need more:
1. To process large document collections quickly. The amount of online data
has grown at least as quickly as the speed of computers, and we would
now like to be able to search collections that total in the order of billions
to trillions of words.
2. To allow more flexible matching operations. For example, it is impractical
to perform the query Romans NEAR countrymen with grep, where NEAR
might be defined as “within 5 words” or “within the same sentence”.
3. To allow ranked retrieval: in many cases you want the best answer to an
information need among many documents that contain certain words.
INDEX
INCIDENCE MATRIX
TERM
The way to avoid linearly scanning the texts for each query is to index the
documents in advance. Let us stick with Shakespeare’s Collected Works,
and use it to introduce the basics of the Boolean retrieval model. Suppose
we record for each document – here a play of Shakespeare’s – whether it
contains each word out of all the words Shakespeare used (Shakespeare used
about 32,000 different words). The result is a binary term-document incidence
matrix, as in Figure 1.1. Terms are the indexed units (further discussed in
Section 2.2); they are usually words, and for the moment you can think of
Online edition (c) 2009 Cambridge UP
4
1 Boolean retrieval
Antony
Brutus
Caesar
Calpurnia
Cleopatra
mercy
worser
…
Antony
and
Cleopatra
1
1
1
0
1
1
1
Julius
Caesar
The
Tempest
Hamlet
Othello
Macbeth
1
1
1
1
0
0
0
0
0
0
0
0
1
1
0
1
1
0
0
1
1
0
0
1
0
0
1
1
1
0
1
0
0
1
0
…
◮ Figure 1.1 A term-document incidence matrix. Matrix element (t, d) is 1 if the
play in column d contains the word in row t, and is 0 otherwise.
them as words, but the information retrieval literature normally speaks of
terms because some of them, such as perhaps I-9 or Hong Kong are not usually
thought of as words. Now, depending on whether we look at the matrix rows
or columns, we can have a vector for each term, which shows the documents
it appears in, or a vector for each document, showing the terms that occur in
it.2
To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the
vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a
bitwise AND:
110100 AND 110111 AND 101111 = 100100
B OOLEAN RETRIEVAL
MODEL
DOCUMENT
COLLECTION
CORPUS
The answers for this query are thus Antony and Cleopatra and Hamlet (Figure 1.2).
The Boolean retrieval model is a model for information retrieval in which we
can pose any query which is in the form of a Boolean expression of terms,
that is, in which terms are combined with the operators AND, OR, and NOT.
The model views each document as just a set of words.
Let us now consider a more realistic scenario, simultaneously using the
opportunity to introduce some terminology and notation. Suppose we have
N = 1 million documents. By documents we mean whatever units we have
decided to build a retrieval system over. They might be individual memos
or chapters of a book (see Section 2.1.2 (page 20) for further discussion). We
will refer to the group of documents over which we perform retrieval as the
(document) collection. It is sometimes also referred to as a corpus (a body of
texts). Suppose each document is about 1000 words long (2–3 book pages). If
2. Formally, we take the transpose of the matrix to be able to get the terms as column vectors.
Online edition (c) 2009 Cambridge UP
5
1.1 An example information retrieval problem
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii
Lord Polonius:
I did enact Julius Caesar: I was killed i’ the
Capitol; Brutus killed me.
◮ Figure 1.2 Results from Shakespeare for the query Brutus
AND Caesar AND NOT
Calpurnia.
AD HOC RETRIEVAL
INFORMATION NEED
QUERY
RELEVANCE
EFFECTIVENESS
we assume an average of 6 bytes per word including spaces and punctuation,
then this is a document collection about 6 GB in size. Typically, there might
be about M = 500,000 distinct terms in these documents. There is nothing
special about the numbers we have chosen, and they might vary by an order
of magnitude or more, but they give us some idea of the dimensions of the
kinds of problems we need to handle. We will discuss and model these size
assumptions in Section 5.1 (page 86).
Our goal is to develop a system to address the ad hoc retrieval task. This is
the most standard IR task. In it, a system aims to provide documents from
within the collection that are relevant to an arbitrary user information need,
communicated to the system by means of a one-off, user-initiated query. An
information need is the topic about which the user desires to know more, and
is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need. A document is
relevant if it is one that the user perceives as containing information of value
with respect to their personal information need. Our example above was
rather artificial in that the information need was defined in terms of particular words, whereas usually a user is interested in a topic like “pipeline
leaks” and would like to find relevant documents regardless of whether they
precisely use those words or express the concept with other words such as
pipeline rupture. To assess the effectiveness of an IR system (i.e., the quality of
its search results), a user will usually want to know two key statistics about
the system’s returned results for a query:
PRECISION
Precision: What fraction of the returned results are relevant to the information need?
RECALL
Recall: What fraction of the relevant documents in the collection were returned by the system?
Online edition (c) 2009 Cambridge UP
6
1 Boolean retrieval
INVERTED INDEX
DICTIONARY
VOCABULARY
LEXICON
POSTING
POSTINGS LIST
POSTINGS
1.2
Detailed discussion of relevance and evaluation measures including precision and recall is found in Chapter 8.
We now cannot build a term-document matrix in a naive way. A 500K ×
1M matrix has half-a-trillion 0’s and 1’s – too many to fit in a computer’s
memory. But the crucial observation is that the matrix is extremely sparse,
that is, it has few non-zero entries. Because each document is 1000 words
long, the matrix has no more than one billion 1’s, so a minimum of 99.8% of
the cells are zero. A much better representation is to record only the things
that do occur, that is, the 1 positions.
This idea is central to the first major concept in information retrieval, the
inverted index. The name is actually redundant: an index always maps back
from terms to the parts of a document where they occur. Nevertheless, inverted index, or sometimes inverted file, has become the standard term in information retrieval.3 The basic idea of an inverted index is shown in Figure 1.3.
We keep a dictionary of terms (sometimes also referred to as a vocabulary or
lexicon; in this book, we use dictionary for the data structure and vocabulary
for the set of terms). Then for each term, we have a list that records which
documents the term occurs in. Each item in the list – which records that a
term appeared in a document (and, later, often, the positions in the document) – is conventionally called a posting.4 The list is then called a postings
list (or inverted list), and all the postings lists taken together are referred to as
the postings. The dictionary in Figure 1.3 has been sorted alphabetically and
each postings list is sorted by document ID. We will see why this is useful in
Section 1.3, below, but later we will also consider alternatives to doing this
(Section 7.1.5).
A first take at building an inverted index
To gain the speed benefits of indexing at retrieval time, we have to build the
index in advance. The major steps in this are:
1. Collect the documents to be indexed:
Friends, Romans, countrymen. So let it be with Caesar . . .
2. Tokenize the text, turning each document into a list of tokens:
Friends Romans countrymen So . . .
3. Some information retrieval researchers prefer the term inverted file, but expressions like index construction and index compression are much more common than inverted file construction and
inverted file compression. For consistency, we use (inverted) index throughout this book.
4. In a (non-positional) inverted index, a posting is just a document ID, but it is inherently
associated with a term, via the postings list it is placed on; sometimes we will also talk of a
(term, docID) pair as a posting.
Online edition (c) 2009 Cambridge UP
7
1.2 A first take at building an inverted index
Brutus
−→
1
2
4
11
31
45
173
174
Caesar
−→
1
2
4
5
6
16
57
132
Calpurnia
−→
2
31
54
101
…
..
.
| {z }
Dictionary
|
{z
Postings
}
◮ Figure 1.3 The two parts of an inverted index. The dictionary is commonly kept
in memory, with pointers to each postings list, which is stored on disk.
3. Do linguistic preprocessing, producing a list of normalized tokens, which
are the indexing terms: friend roman countryman so . . .
4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
DOC ID
SORTING
DOCUMENT
FREQUENCY
We will define and discuss the earlier stages of processing, that is, steps 1–3,
in Section 2.2 (page 22). Until then you can think of tokens and normalized
tokens as also loosely equivalent to words. Here, we assume that the first
3 steps have already been done, and we examine building a basic inverted
index by sort-based indexing.
Within a document collection, we assume that each document has a unique
serial number, known as the document identifier (docID). During index construction, we can simply assign successive integers to each new document
when it is first encountered. The input to indexing is a list of normalized
tokens for each document, which we can equally think of as a list of pairs of
term and docID, as in Figure 1.4. The core indexing step is sorting this list
so that the terms are alphabetical, giving us the representation in the middle
column of Figure 1.4. Multiple occurrences of the same term from the same
document are then merged.5 Instances of the same term are then grouped,
and the result is split into a dictionary and postings, as shown in the right
column of Figure 1.4. Since a term generally occurs in a number of documents, this data organization already reduces the storage requirements of
the index. The dictionary also records some statistics, such as the number of
documents which contain each term (the document frequency, which is here
also the length of each postings list). This information is not vital for a basic Boolean search engine, but it allows us to improve the efficiency of the
5. Unix users can note that these steps are similar to use of the sort and then uniq commands.
Online edition (c) 2009 Cambridge UP
8
1 Boolean retrieval
Doc 1
I did enact Julius Caesar: I was killed
i’ the Capitol; Brutus killed me.
Doc 2
So let it be with Caesar. The noble Brutus
hath told you Caesar was ambitious:
term
docID
term
docID
I
1
ambitious
2
term doc. freq.
did
1
be
2
ambitious 1
enact
1
brutus
1
be 1
julius
1
brutus
2
brutus 2
caesar
1
capitol
1
I
1
caesar
1
capitol 1
was
1
caesar
2
caesar 2
killed
1
caesar
2
did 1
i’
1
did
1
enact 1
the
1
enact
1
hath 1
capitol
1
hath
1
brutus
1
I
1
I 1
killed
1
I
1
i’ 1
me
1
i’
1
=⇒
=⇒ it 1
so
2
it
2
julius 1
let
2
julius
1
killed 1
it
2
killed
1
be
2
killed
1
let 1
with
2
let
2
me 1
caesar
2
me
1
noble 1
the
2
noble
2
so 1
noble
2
so
2
the 2
brutus
2
the
1
told 1
hath
2
the
2
told
2
told
2
you 1
you
2
you
2
was 2
caesar
2
was
1
with 1
was
2
was
2
ambitious
2
with
2
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
→
postings lists
2
2
1 → 2
1
1 → 2
1
1
2
1
1
2
1
1
2
1
2
2
1 → 2
2
2
1 → 2
2
◮ Figure 1.4 Building an index by sorting and grouping. The sequence of terms
in each document, tagged by their documentID (left) is sorted alphabetically (middle). Instances of the same term are then grouped by word and then by documentID.
The terms and documentIDs are then separated out (right). The dictionary stores
the terms, and has a pointer to the postings list for each term. It commonly also
stores other summary information such as, here, the document frequency of each
term. We use this information for improving query time efficiency and, later, for
weighting in ranked retrieval models. Each postings list stores the list of documents
in which a term occurs, and may store other information such as the term frequency
(the frequency of each term in each document) or the position(s) of the term in each
document.
Online edition (c) 2009 Cambridge UP
1.2 A first take at building an inverted index
9
search engine at query time, and it is a statistic later used in many ranked retrieval models. The postings are secondarily sorted by docID. This provides
the basis for efficient query processing. This inverted index structure is essentially without rivals as the most efficient structure for supporting ad hoc
text search.
In the resulting index, we pay for storage of both the dictionary and the
postings lists. The latter are much larger, but the dictionary is commonly
kept in memory, while postings lists are normally kept on disk, so the size
of each is important, and in Chapter 5 we will examine how each can be
optimized for storage and access efficiency. What data structure should be
used for a postings list? A fixed length array would be wasteful as some
words occur in many documents, and others in very few. For an in-memory
postings list, two good alternatives are singly linked lists or variable length
arrays. Singly linked lists allow cheap insertion of documents into postings
lists (following updates, such as when recrawling the web for updated documents), and naturally extend to more advanced indexing strategies such as
skip lists (Section 2.3), which require additional pointers. Variable length arrays win in space requirements by avoiding the overhead for pointers and in
time requirements because their use of contiguous memory increases speed
on modern processors with memory caches. Extra pointers can in practice be
encoded into the lists as offsets. If updates are relatively infrequent, variable
length arrays will be more compact and faster to traverse. We can also use a
hybrid scheme with a linked list of fixed length arrays for each term. When
postings lists are stored on disk, they are stored (perhaps compressed) as a
contiguous run of postings without explicit pointers (as in Figure 1.3), so as
to minimize the size of the postings list and the number of disk seeks to read
a postings list into memory.
?
Exercise 1.1
[ ⋆]
Draw the inverted index that would be built for the following document collection.
(See Figure 1.3 for an example.)
Doc 1
Doc 2
Doc 3
Doc 4
new home sales top forecasts
home sales rise in july
increase in home sales in july
july new home sales rise
Exercise 1.2
Consider these documents:
Doc 1
Doc 2
Doc 3
Doc 4
breakthrough drug for schizophrenia
new schizophrenia drug
new approach for treatment of schizophrenia
new hopes for schizophrenia patients
a. Draw the term-document incidence matrix for this document collection.
Online edition (c) 2009 Cambridge UP
[ ⋆]
10
1 Boolean retrieval
Brutus
−→
1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Calpurnia
−→
2 → 31 → 54 → 101
Intersection
=⇒
2 → 31
◮ Figure 1.5 Intersecting the postings lists for Brutus and Calpurnia from Figure 1.3.
b. Draw the inverted index representation for this collection, as in Figure 1.3 (page 7).
Exercise 1.3
[⋆]
For the document collection shown in Exercise 1.2, what are the returned results for
these queries:
a. schizophrenia AND drug
b. for AND NOT (drug OR approach)
1.3
SIMPLE CONJUNCTIVE
QUERIES
(1.1)
Processing Boolean queries
How do we process a query using an inverted index and the basic Boolean
retrieval model? Consider processing the simple conjunctive query:
Brutus AND Calpurnia
over the inverted index partially shown in Figure 1.3 (page 7). We:
1. Locate Brutus in the Dictionary
2. Retrieve its postings
3. Locate Calpurnia in the Dictionary
4. Retrieve its postings
5. Intersect the two postings lists, as shown in Figure 1.5.
POSTINGS LIST
INTERSECTION
POSTINGS MERGE
The intersection operation is the crucial one: we need to efficiently intersect
postings lists so as to be able to quickly find documents that contain both
terms. (This operation is sometimes referred to as merging postings lists:
this slightly counterintuitive name reflects using the term merge algorithm for
a general family of algorithms that combine multiple sorted lists by interleaved advancing of pointers through each; here we are merging the lists
with a logical AND operation.)
There is a simple and effective method of intersecting postings lists using
the merge algorithm (see Figure 1.6): we maintain pointers into both lists
Online edition (c) 2009 Cambridge UP
1.3 Processing Boolean queries
11
I NTERSECT ( p1, p2 )
1 answer ← h i
2 while p1 6= NIL and p2 6= NIL
3 do if docID ( p1) = docID ( p2)
4
then A DD ( answer, docID ( p1 ))
5
p1 ← next( p1 )
6
p2 ← next( p2 )
7
else if docID ( p1) < docID ( p2) 8 then p1 ← next( p1 ) 9 else p2 ← next( p2 ) 10 return answer ◮ Figure 1.6 Algorithm for the intersection of two postings lists p1 and p2 . and walk through the two postings lists simultaneously, in time linear in the total number of postings entries. At each step, we compare the docID pointed to by both pointers. If they are the same, we put that docID in the results list, and advance both pointers. Otherwise we advance the pointer pointing to the smaller docID. If the lengths of the postings lists are x and y, the intersection takes O( x + y) operations. Formally, the complexity of querying is Θ( N ), where N is the number of documents in the collection.6 Our indexing methods gain us just a constant, not a difference in Θ time complexity compared to a linear scan, but in practice the constant is huge. To use this algorithm, it is crucial that postings be sorted by a single global ordering. Using a numeric sort by docID is one simple way to achieve this. We can extend the intersection operation to process more complicated queries like: (1.2) QUERY OPTIMIZATION (1.3) (Brutus OR Caesar) AND NOT Calpurnia Query optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done by the system. A major element of this for Boolean queries is the order in which postings lists are accessed. What is the best order for query processing? Consider a query that is an AND of t terms, for instance: Brutus AND Caesar AND Calpurnia For each of the t terms, we need to get its postings, then AND them together. The standard heuristic is to process terms in order of increasing document 6. The notation Θ(·) is used to express an asymptotically tight bound on the complexity of an algorithm. Informally, this is often written as O (·), but this notation really expresses an asymptotic upper bound, which need not be tight (Cormen et al. 1990). Online edition (c) 2009 Cambridge UP 12 1 Boolean retrieval I NTERSECT (ht1, . . . , tn i) 1 terms ← S ORT B Y I NCREASING F REQUENCY (ht1 , . . . , tn i) 2 result ← postings( f irst(terms)) 3 terms ← rest(terms) 4 while terms 6= NIL and result 6= NIL 5 do result ← I NTERSECT (result, postings( f irst(terms))) 6 terms ← rest(terms) 7 return result ◮ Figure 1.7 Algorithm for conjunctive queries that returns the set of documents containing each term in the input list of terms. frequency: if we start by intersecting the two smallest postings lists, then all intermediate results must be no bigger than the smallest postings list, and we are therefore likely to do the least amount of total work. So, for the postings lists in Figure 1.3 (page 7), we execute the above query as: (1.4) (Calpurnia AND Brutus) AND Caesar This is a first justification for keeping the frequency of terms in the dictionary: it allows us to make this ordering decision based on in-memory data before accessing any postings list. Consider now the optimization of more general queries, such as: (1.5) (madding OR crowd) AND (ignoble OR strife) AND (killed OR slain) As before, we will get the frequencies for all terms, and we can then (conservatively) estimate the size of each OR by the sum of the frequencies of its disjuncts. We can then process the query in increasing order of the size of each disjunctive term. For arbitrary Boolean queries, we have to evaluate and temporarily store the answers for intermediate expressions in a complex expression. However, in many circumstances, either because of the nature of the query language, or just because this is the most common type of query that users submit, a query is purely conjunctive. In this case, rather than viewing merging postings lists as a function with two inputs and a distinct output, it is more efficient to intersect each retrieved postings list with the current intermediate result in memory, where we initialize the intermediate result by loading the postings list of the least frequent term. This algorithm is shown in Figure 1.7. The intersection operation is then asymmetric: the intermediate results list is in memory while the list it is being intersected with is being read from disk. Moreover the intermediate results list is always at least as short as the other list, and in many cases it is orders of magnitude shorter. The postings Online edition (c) 2009 Cambridge UP 13 1.3 Processing Boolean queries intersection can still be done by the algorithm in Figure 1.6, but when the difference between the list lengths is very large, opportunities to use alternative techniques open up. The intersection can be calculated in place by destructively modifying or marking invalid items in the intermediate results list. Or the intersection can be done as a sequence of binary searches in the long postings lists for each posting in the intermediate results list. Another possibility is to store the long postings list as a hashtable, so that membership of an intermediate result item can be calculated in constant rather than linear or log time. However, such alternative techniques are difficult to combine with postings list compression of the sort discussed in Chapter 5. Moreover, standard postings list intersection operations remain necessary when both terms of a query are very common. ? [ ⋆] Exercise 1.4 For the queries below, can we still run through the intersection in time O( x + y), where x and y are the lengths of the postings lists for Brutus and Caesar? If not, what can we achieve? a. Brutus AND NOT Caesar b. Brutus OR NOT Caesar [ ⋆] Exercise 1.5 Extend the postings merge algorithm to arbitrary Boolean query formulas. What is its time complexity? For instance, consider: c. (Brutus OR Caesar) AND NOT (Antony OR Cleopatra) Can we always merge in linear time? Linear in what? Can we do better than this? Exercise 1.6 [⋆⋆] We can use distributive laws for AND and OR to rewrite queries. a. Show how to rewrite the query in Exercise 1.5 into disjunctive normal form using the distributive laws. b. Would the resulting query be more or less efficiently evaluated than the original form of this query? c. Is this result true in general or does it depend on the words and the contents of the document collection? Exercise 1.7 Recommend a query processing order for d. (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes) given the following postings list sizes: Online edition (c) 2009 Cambridge UP [ ⋆] 14 1 Boolean retrieval Term eyes kaleidoscope marmalade skies tangerine trees Postings size 213312 87009 107913 271658 46653 316812 Exercise 1.8 [⋆] If the query is: e. friends AND romans AND (NOT countrymen) how could we use the frequency of countrymen in evaluating the best query evaluation order? In particular, propose a way of handling negation in determining the order of query processing. Exercise 1.9 [⋆⋆] For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal? Explain why it is, or give an example where it isn’t. Exercise 1.10 Write out a postings merge algorithm, in the style of Figure 1.6 (page 11), for an x query. [⋆⋆] OR y Exercise 1.11 [⋆⋆] How should the Boolean query x AND NOT y be handled? Why is naive evaluation of this query normally very expensive? Write out a postings merge algorithm that evaluates this query efficiently. 1.4 RANKED RETRIEVAL MODEL FREE TEXT QUERIES PROXIMITY OPERATOR The extended Boolean model versus ranked retrieval The Boolean retrieval model contrasts with ranked retrieval models such as the vector space model (Section 6.3), in which users largely use free text queries, that is, just typing one or more words rather than using a precise language with operators for building up query expressions, and the system decides which documents best satisfy the query. Despite decades of academic research on the advantages of ranked retrieval, systems implementing the Boolean retrieval model were the main or only search option provided by large commercial information providers for three decades until the early 1990s (approximately the date of arrival of the World Wide Web). However, these systems did not have just the basic Boolean operations (AND, OR, and NOT) which we have presented so far. A strict Boolean expression over terms with an unordered results set is too limited for many of the information needs that people have, and these systems implemented extended Boolean retrieval models by incorporating additional operators such as term proximity operators. A proximity operator is a way of specifying that two terms in a query Online edition (c) 2009 Cambridge UP 1.4 The extended Boolean model versus ranked retrieval 15 must occur close to each other in a document, where closeness may be measured by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph. ✎ Example 1.1: Commercial Boolean searching: Westlaw. Westlaw (http://www.westlaw.com/) is the largest commercial legal search service (in terms of the number of paying subscribers), with over half a million subscribers performing millions of searches a day over tens of terabytes of text data. The service was started in 1975. In 2005, Boolean search (called “Terms and Connectors” by Westlaw) was still the default, and used by a large percentage of users, although ranked free text querying (called “Natural Language” by Westlaw) was added in 1992. Here are some example Boolean queries on Westlaw: Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company. Query: "trade secret" /s disclos! /s prevent /s employe! Information need: Requirements for disabled people to be able to access a workplace. Query: disab! /p access! /s work-site work-place (employment /3 place) Information need: Cases about a host’s responsibility for drunk guests. Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest Note the long, precise queries and the use of proximity operators, both uncommon in web search. Submitted queries average about ten words in length. Unlike web search conventions, a space between words represents disjunction (the tightest binding operator), & is AND and /s, /p, and /k ask for matches in the same sentence, same paragraph or within k words respectively. Double quotes give a phrase search (consecutive words); see Section 2.4 (page 39). The exclamation mark (!) gives a trailing wildcard query (see Section 3.2, page 51); thus liab! matches all words starting with liab. Additionally work-site matches any of worksite, work-site or work site; see Section 2.2.1 (page 22). Typical expert queries are usually carefully defined and incrementally developed until they obtain what look to be good results to the user. Many users, particularly professionals, prefer Boolean query models. Boolean queries are precise: a document either matches the query or it does not. This offers the user greater control and transparency over what is retrieved. And some domains, such as legal materials, allow an effective means of document ranking within a Boolean model: Westlaw returns documents in reverse chronological order, which is in practice quite effective. In 2007, the majority of law librarians still seem to recommend terms and connectors for high recall searches, and the majority of legal users think they are getting greater control by using them. However, this does not mean that Boolean queries are more effective for professional searchers. Indeed, experimenting on a Westlaw subcollection, Turtle (1994) found that free text queries produced better results than Boolean queries prepared by Westlaw’s own reference librarians for the majority of the information needs in his experiments. A general problem with Boolean search is that using AND operators tends to produce high precision but low recall searches, while using OR operators gives low precision but high recall searches, and it is difficult or impossible to find a satisfactory middle ground. In this chapter, we have looked at the structure and construction of a basic Online edition (c) 2009 Cambridge UP 16 1 Boolean retrieval inverted index, comprising a dictionary and postings lists. We introduced the Boolean retrieval model, and examined how to do efficient retrieval via linear time merges and simple query optimization. In Chapters 2–7 we will consider in detail richer query models and the sort of augmented index structures that are needed to handle them efficiently. Here we just mention a few of the main additional things we would like to be able to do: 1. We would like to better determine the set of terms in the dictionary and to provide retrieval that is tolerant to spelling mistakes and inconsistent choice of words. 2. It is often useful to search for compounds or phrases that denote a concept such as “operating system”. As the Westlaw examples show, we might also wish to do proximity queries such as Gates NEAR Microsoft. To answer such queries, the index has to be augmented to capture the proximities of terms in documents. TERM FREQUENCY 3. A Boolean model only records term presence or absence, but often we would like to accumulate evidence, giving more weight to documents that have a term several times as opposed to ones that contain it only once. To be able to do this we need term frequency information (the number of times a term occurs in a document) in postings lists. 4. Boolean queries just retrieve a set of matching documents, but commonly we wish to have an effective method to order (or “rank”) the returned results. This requires having a mechanism for determining a document score which encapsulates how good a match a document is for a query. With these additional ideas, we will have seen most of the basic technology that supports ad hoc searching over unstructured information. Ad hoc searching over documents has recently conquered the world, powering not only web search engines but the kind of unstructured search that lies behind the large eCommerce websites. Although the main web search engines differ by emphasizing free text querying, most of the basic issues and technologies of indexing and querying remain the same, as we will see in later chapters. Moreover, over time, web search engines have added at least partial implementations of some of the most popular operators from extended Boolean models: phrase search is especially popular and most have a very partial implementation of Boolean operators. Nevertheless, while these options are liked by expert searchers, they are little used by most people and are not the main focus in work on trying to improve web search engine performance. ? Exercise 1.12 [⋆] Write a query using Westlaw syntax which would find any of the words professor, teacher, or lecturer in the same sentence as a form of the verb explain. Online edition (c) 2009 Cambridge UP 1.5 References and further reading 17 Exercise 1.13 [ ⋆] Try using the Boolean search features on a couple of major web search engines. For instance, choose a word, such as burglar, and submit the queries (i) burglar, (ii) burglar AND burglar, and (iii) burglar OR burglar. Look at the estimated number of results and top hits. Do they make sense in terms of Boolean logic? Often they haven’t for major search engines. Can you make sense of what is going on? What about if you try different words? For example, query for (i) knight, (ii) conquer, and then (iii) knight OR conquer. What bound should the number of results from the first two queries place on the third query? Is this bound observed? 1.5 References and further reading The practical pursuit of computerized information retrieval began in the late 1940s (Cleverdon 1991, Liddy 2005). A great increase in the production of scientific literature, much in the form of less formal technical reports rather than traditional journal articles, coupled with the availability of computers, led to interest in automatic document retrieval. However, in those days, document retrieval was always based on author, title, and keywords; full-text search came much later. The article of Bush (1945) provided lasting inspiration for the new field: “Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and, to coin one at random, ‘memex’ will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.” The term Information Retrieval was coined by Calvin Mooers in 1948/1950 (Mooers 1950). In 1958, much newspaper attention was paid to demonstrations at a conference (see Taube and Wooster 1958) of IBM “auto-indexing” machines, based primarily on the work of H. P. Luhn. Commercial interest quickly gravitated towards Boolean retrieval systems, but the early years saw a heady debate over various disparate technologies for retrieval systems. For example Mooers (1961) dissented: “It is a common fallacy, underwritten at this date by the investment of several million dollars in a variety of retrieval hardware, that the algebra of George Boole (1847) is the appropriate formalism for retrieval system design. This view is as widely and uncritically accepted as it is wrong.” The observation of AND vs. OR giving you opposite extremes in a precision/ recall tradeoff, but not the middle ground comes from (Lee and Fox 1988). Online edition (c) 2009 Cambridge UP 18 REGULAR EXPRESSIONS 1 Boolean retrieval The book (Witten et al. 1999) is the standard reference for an in-depth comparison of the space and time efficiency of the inverted index versus other possible data structures; a more succinct and up-to-date presentation appears in Zobel and Moffat (2006). We further discuss several approaches in Chapter 5. Friedl (2006) covers the practical usage of regular expressions for searching. The underlying computer science appears in (Hopcroft et al. 2000). Online edition (c) 2009 Cambridge UP An “information dynamic” Communal labor Semantic labor Universal labor observation of the natural world, counting, spoken & written language pursuit of greater exactness Syntactic labor logical and mathematical formalisms signal transmission, machines to rearrange signs, mechanical implementation of logic and math pursuit of greater exactness Intellectual “property” • “Congress shall have Power ... To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries …” • “Facts” are not copyrightable … but compilations of facts may be, if its facts have been “selected, coordinated, or arranged in such a way that the resulting work as a whole constitutes an original work of authorship.” • “Rural's selection of listings … lacks the modicum of creativity necessary to transform mere selection into copyrightable expression … there is nothing remotely creative about arranging names alphabetically in a white pages directory. It is an age-old practice, firmly rooted in tradition and so commonplace that it has come to be expected as a matter of course.” Formally representing text for Boolean retrieval Converting one digital code to another x≥4? N Y x≥2? N x≥6? Y x≥1? N Y N x≥3? N Y Y x≥5? N Y x≥7? N Y x=0 x=1 x=2 x=3 x=4 x=5 x=6 x=7 000 001 010 011 100 101 110 111 Converting one digital code to another 1. Does the letter come after O? No 2. Does the letter come after G? Yes 3. Does the letter come after K? No 4. Does the letter come after I? Yes 5. Does the letter come after J? Yes A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0 0 1 0 1 1 0 1 1 0 0 0 1 1 0 1 0 1 1 1 0 0 1 1 1 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 1 1 1 0 1 0 0 1 0 1 0 1 1 0 1 1 0 1 0 1 1 1 1 1 0 0 0 1 1 0 0 1 1 1 0 1 0 “… any function capable of being described as a precise logical statement can be implemented by … [a] system of switches.” – Danny Hillis, The Pattern on the Stone p. 3 base chocolate marshmallow fudge maple walnut fudge vanilla cherry fudge vanilla cherry walnut fudge Exhaustivity and exclusivity • A set of classes are exhaustive if their sum (∨) equals the universe class • A set of classes are exclusive if, for each pair of classes, their product (&) equals the empty class • In classifications, the classes at each level should be exclusive, and exhaustive with respect to their superordinate (containing) class • • • c fudge made with a chocolate base m fudge made with a maple walnut base v fudge made with a vanilla base • c&m ⟺ ∅ m&v ⟺ ∅ v&c ⟺ ∅ • c, m, and v are (mutually) exclusive •c ∨ m ∨ v ⟺ U • a fudge made with marshmallows added • c, m, and v are (collectively) exhaustive • w fudge made with walnuts added • a and ~a are exclusive and exhaustive • h fudge made with cherries added • same for w and ~w, and for h and ~h Analyzing textual meaning: the “objective” way “The method of identifying subjects by counting references is a hopeless one unless we allow ourselves to count indirect references, to group items referred to and consider a reference to a member of the group to be an indirect reference to the group. But if we allow ourselves to do this, it is quickly apparent that no unique results can be expected, for there are many possible ways of grouping any set of items referred to…” – Wilson, p. 85 “Unstructured” text • What does it mean to call text unstructured? • Text has structure: words, phrases, sentences, headings, paragraphs, footnotes, introductions, conclusions, etc. • The “structure” it lacks is: explicit, systematic categorization and naming by people who understand it • “Unstructured” in this case means: unprocessed by semantic labor Applications • Making text computable enables syntactic labor to carry out: • “Ad hoc” search • Browsing and filtering • Clustering • Classification Information retrieval • Given a collection of resources and a user's query, find relevant resources • Collection: repository of retrievable resources • Query: representation of the user’s interest • Relevance: relation between query and resources grep • The simplest way to search • But, we need “plain text” • Even “plain” text can be rather complicated • Digital documents are just sequences of bytes—these must be properly converted into characters • For anything more complex than grep, we need to index structure & analysis retrieval organization & storage Indexing Museums Art Endowments Philanthropy UNC Rembrandt Harmenszoon van Rijn Ackland Art Museum Sheldon Peck Leena Peck Bag of words Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving oﬀ the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coﬀin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats oﬀ—then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me. the 10 all 1 street 1 from 1 i 9 money 1 hypos 1 would 1 and 7 hats 1 sea 1 warehouses 1 to 5 soon 1 substitute 1 there 1 a 5 years 1 driving 1 long 1 me 5 stepping 1 sail 1 cato 1 it 5 before 1 pistol 1 their 1 of 4 1 never 1 call 1 whenever 4 1 degree 1 way 1 my 4 methodical ly deliberate ly then 1 funeral 1 throws 1 is 4 his 1 involuntarily 1 interest 1 in 4 very 1 people's 1 that 1 find 2 account 1 november 1 purse 1 get 2 every 1 strong 1 but 1 some 2 nearly 1 on 1 part 1 about 2 they 1 cherish 1 particular 1 little 2 world 1 damp 1 circulation 1 with 2 bringing 1 quietly 1 up 1 myself 2 grim 1 prevent 1 can 1 this 2 rear 1 into 1 growing 1 as 2 upper 1 knocking 1 meet 1 nothing 2 towards 1 feelings 1 coffin 1 off 2 drizzly 1 philosophical 1 almost 1 or 2 soul 1 moral 1 mind 1 time 2 see 1 precisely 1 an 1 Inverted Index https://developer.apple.com/library/mac/documentation/UserExperience/Conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html Boolean retrieval “car” 1 2 6 “jaguar” 3 4 5 7 “automobile” all documents Thursday • Assignment #2 will be returned • Assignment #3 will be handed out; due 1 week later, on November 14 • Hjørland, “Classical Databases and Knowledge Organization: A Case for Boolean Retrieval and Human DecisionMaking during Searches.” • Why/how has Boolean retrieval been criticized? • What are Hjørland's arguments in favor of Boolean retrieval? What examples does he use? • What does he identify as diﬀerences between “computer science” and “information science” traditions? Boolean retrieval Boolean retrieval “car” 1 2 6 “jaguar” 3 4 5 7 “automobile” all documents Boolean search strategy 1. Identify diﬀerent concepts: global warming, crops 2. List variant terms for each concept: global warming, greenhouse effect, greenhouse gases crops, crop yields, crop production, food supply 3. Join variants with OR and concepts with AND: (global warming OR greenhouse effect OR greenhouse gases) AND (crops OR crop yields OR crop production OR food supply) Tokenization • “Call me Ishmael.” → [“Call”, “me”, “Ishmael”] • “I spent $15.75 on one cocktail!” → [“I”, “spent”, “$15.75”, “on”, “one”, “cocktail”] Token vs. type vs. term • Tokens: Night, • Types: Night, • Terms: night, town, Night, town, a, glass, Color, mahogany, Color, mahogany, center, Rose, is, a, rose, is, a, rose, is, a, rose town, a, glass, Color, mahogany, center, Rose, is, rose town, glass, color, mahogany, center, rose (Depends on how we process tokens) Downcasing Tar Tar tAr tar Heels heels HeElS heels tar heels Stemming abominable abominably abominates abominations abomin Stemming universal universally universe universities university univers Lemmatization universal universally universe universities university universal universe university Stopwords http://www.ranks.nl/stopwords https://rybesh.github.io/indexing/ Next week • Evaluating information retrieval • How do we know when an IR system is working well? • Buckland, Michael. “Evaluation of Selection Methods.” In Information and Society, 153–164. Classical Databases and Knowledge Organization: A Case for Boolean Retrieval and Human Decision-Making During Searches Birger Hjørland Royal School of Library and Information Science, University of Copenhagen, 6 Birketinget, DK-2300 Copenhagen S, Denmark. E-mail: pnl617@iva.ku.dk This paper considers classical bibliographic databases based on the Boolean retrieval model (such as MEDLINE and PsycInfo). This model is challenged by modern search engines and information retrieval (IR) researchers, who often consider Boolean retrieval a less efficient approach. The paper examines this claim and argues for the continued value of Boolean systems, and suggests two further considerations: (a) the important role of human expertise in searching (expert searchers and “information literate” users) and (b) the role of library and information science and knowledge organization (KO) in the design and use of classical databases. An underlying issue is the kind of retrieval system for which one should aim. Warner’s (2010) differentiation between the computer science traditions and an older libraryoriented tradition seems important; the former aim to transform queries automatically into (ranked) sets of relevant documents, whereas the latter aims to increase the “selection power” of users. The Boolean retrieval model is valuable in providing users with the power to make informed searches and have full control over what is found and what is not. These issues may have significant implications for the maintenance of information science and KO as research fields as well as for the information profession as a profession in its own right. Introduction Knowledge organization (KO) (also termed information organization) is a research field within library and information science (LIS) that is concerned with, among other things, indexing, classification, metadata, and the optimization of bibliographical records (in physical bibliographies/ catalogs as well as in electronic databases). Information retrieval (IR) is another research field located in both Received October 29, 2013; revised January 22, 2014; accepted January 22, 2014 © 2014 ASIS&T • Published online 4 June 2014 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asi.23250 LIS and (today overwhelmingly) computer science. However, these two fields have the same basic objective: to make documents and “information” retrievable. They should therefore be seen as competing fields, and their relative strengths and weaknesses should be examined. Such an examination is difficult, however, because KO and IR are influenced by different ideas and techniques and have different research traditions. The difficulty of achieving coherence may be caused in part by the fragmented and conceptually not highly coherent nature of the research traditions described. As Ellis (1996, pp. ix–x) wrote, “. . . in a field as diverse as information retrieval research, it is impossible to provide any overall conclusion on the state of the art.” This view is confirmed by the recent debate on whether progress is being made in the field of IR (Armstrong, Moat, Webber, & Zobel, 2009a, 2009b; Puurula, 2013; Trotman & Keeler, 2011). This paper is based on the view that it is important to uncover the often hidden assumptions in research fields. As Einstein wrote, “Science without epistemology is — insofar as it is thinkable at all — primitive and muddled” (1949, p. 684). KO and IR comprise different approaches and research traditions. In KO, for example, citation analysis, facet analysis, and user-based approaches have been described (see Hjørland, 2013a, 2013b, 2013c). The relation between KO and IR has not yet been thoroughly examined. This paper begins such an examination but is only concerned with systems based on the Boolean retrieval model. The most prestigious IR models (such as vector space models, probabilistic models, and language models) are left to a later paper.1 The Boolean retrieval model has predominated in classical bibliographical databases such as MEDLINE, PsycINFO, Science Citation Index, and other databases available through database-hosts such as Dialog (now ProQuest Dialog) and DataStar. They are typically used by information professionals (intermediaries between databases and end users) or by advanced users. Therefore, there JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 66(8):1559–1575, 2015 seems to be a close connection between (a) Boolean search systems, (b) the need for human searchers (expert searchers and advanced searchers), and (c) information science (developing knowledge concerning optimal search strategies and information literacy). These three issues are therefore considered together in this article. The paper is written from the perspective of KO and the author has a background as a professional searcher. It is organized as follows: The next section introduces the field of IR and discusses its connection to KO. It concludes with a view of basic different assumptions in the two fields, which is used as the foundation for this article. The third section introduces searching as a field of research and practice within library and information science. The fourth section looks closer at classical databases and how they involve KO. The fifth section looks at the Boolean retrieval model and contains a subsection presenting and discussing the criticism that has been raised against it. The basic strength of Boolean retrieval is its well-defined selection criteria and the advanced user’s ability to fully controll the search process. It is argued that much of the criticism of Boolean retrieval seems to be based on problematic assumptions (which may be overcome by expert searchers in properly indexed databases). This section also contains a subsection that presents empirical studies on searching for systematic reviews. In the Conclusion it is claimed that not only searchers but also the entire field of information science may suffer if concepts such as “recall devices” and “precision devices” no longer have well-defined meanings due to the existence of retrieval models without well-defined matching criteria. The Field of Information Retrieval The 1950s saw the introduction of the computer into the field of documentation together with the idea of information storage and retrieval. This concept led some library researchers and documentalists to approach their field from a new theoretical perspective: Libraries could be understood as information-processing systems (and so could scientific and scholarly communication—journals, conferences, informal communication, reviews, etc.). The concept of information storage and retrieval gave rise to the field of information science, considered by some a new field, by others as just a new and better name for library science and documentation. The term information retrieval was introduced by Mooers (1919–1994), who defined it thus: Information retrieval is the name for the process or method whereby a prospective user of information is able to convert his need for information into an actual list of citations to documents in storage containing information useful to him. It is the finding or discovery process with respect to stored information. It is another, more general, name for the production of a demand bibliography. Information retrieval embraces the intellectual aspects of the description of information and its specification for search, and also whatever systems, technique, or machines 1560 are employed to carry out the operation. Information retrieval is crucial to documentation and organization of knowledge. (Mooers, 1951, p. 25) Van Rijsbergen also stated (in 1979)2 that IR is about finding documents: Information retrieval is a wide, often loosely-defined term but in these pages I shall be concerned only with automatic information retrieval systems. Automatic as opposed to manual and information as opposed to data or fact. Unfortunately the word information can be very misleading. In the context of information retrieval (IR), information, in the technical meaning given in Shannon’s theory of communication, is not readily measured (Shannon & Weaver, 1964). In fact in many cases, one can adequately describe the kind of retrieval by simply substituting “document” for “information.” Nevertheless, “information retrieval” has become accepted as a description of the kind of work published by Cleverdon, Salton, Spark Jones, Lancaster and others. A perfectly straightforward definition along this line is given by Lancaster (1968): “Information retrieval is the term conventionally, though somewhat inaccurately, applied to the type of activity discussed in this volume. An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.” This specifically excludes QuestionAnswering systems as typified by Winograd (1972) and those described by Minsky (1968). It also excludes data retrieval systems such as used by, say, the stock exchange for on-line quotations. (p. 1) A related view was expressed by Kemp (1988), who, however, argued that “knowledge retrieval” should substitute for “information retrieval.”3 Another definition is provided by Baeza-Yates and Ribeiro-Neto (1999): Information retrieval (IR) [is the] part of computer science which studies the retrieval of information (not data) from a collection of written documents. The retrieved documents aim at satisfying a user information need usually expressed in natural language. (p. 444) In a second edition the same authors formulated it thus: Information retrieval (IR) is a broad area of Computer Science focused primarily on providing the users with easy access to information of their interest, as follows: Information retrieval deals with the representation, storage, organization of, and access to information items such as documents, Web-pages, online catalogs, structured and semi-structured records, multimedia objects. The representation and organization of the information should be such as to provide the users with easy access to information of their interest. (Baeza-Yates & Ribeiro-Neto, 2011, p. 1) It is clear from the quotations above that Mooers (1951) considered KO to be part of IR: it included the “description of information,” “specification of information for search,” and is crucial to “organization of knowledge.” Mooers JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—August 2015 DOI: 10.1002/asi (1951) and Kemp (1988) found that IR covers both manual and automatic retrieval, whereas van Rijsbergen (1979) and Baeza-Yates and Ribeiro-Neto (2011) limited it to automatic approaches, and the latter even to computer science.4 By limiting IR to automatic processes, as van Rijsbergen (1979) does, the possibility of considering the role of human expertise (as in this paper) seems lost; by limiting the field to computer science, as Baeza-Yates and Ribeiro-Neto (2011) do, the contributions from other fields (such as linguistics, LIS, and science studies) seem to be neglected. It is also clear that the term information is not generally accepted in this field (and the difference between KO and IR is therefore not likely to be based on differences in meaning of information versus knowledge). The quotations also show that the term IR is not used consistently: We cannot find univocity in the use of the terms IR and KO. On the other hand, people in information science have a fairly good impression of the meaning of these two expressions. The core traditions of IR include the works of researchers such as Salton (1991), Spärck Jones (1972), van Rijsbergen (1979), and Robertson (1975), all of whom seem today to be more closely connected to computer science than to information science (as this field is understood here, equivalent to LIS). In KO, on the other hand, researchers such as Dahlberg (1974), Ranganathan (1967), Soergel (1999), Svenonius (2000), and Vickery (1986) are among the most well-known and cited. A suggestion could be that IR is about automatic methods in contrast to “manual” or “intellectual” methods in KO. This way of understanding the difference is not without basis in the two fields because most IR researchers work with automated retrieval and at least some KO researchers still think of KO as being about physical stores and manual indexing. However, given the importance of digital media, it seems problematic to base the scholarly field of KO upon the technologies of yesterday. If this field is to earn a place in the world, it should be able to demonstrate findings that are also useful in the digital world. The distinction between “human-based indexing” and “computer-based indexing” can also be considered unfruitful because it is not theoretically well supported: Because human beings sometimes perform like a machine (i.e., mechanically following rules of indexing taught in schools of librarianship), we need a deeper understanding of the indexing process, whether it is done by humans or by computers. By implication, the distinction between human indexing versus computer indexing does not represent two theories of indexing. In Hjørland (2011b), I demonstrated how different theories of knowledge (rationalism, empiricism, historicism, and pragmatism) represent the basic theories of IR as well as of KO. Warner (2010) found that two information and computer paradigms, the cognitive and the physical, have been distinguished in IR research, but argued that “For the purposes of discussion . . . they can be considered as a single heterogeneous paradigm, linked but not united” (pp. 4–5). He found that this information and computer paradigm has not always been explicit about its own values or examined its own basic assumptions. In his analysis, the basic assumption in that tradition may broadly be termed query transformation, which implies that a user’s verbal query— formulated before the start of the search—is transformed by an information system into a set of documents or bibliographical records. According to Warner, these records have been evaluated according to their relevance (by measures such as “recall” and “precision”) in relation to the query. Warner (2010) finds that the underlying methodology tends to reify the concept of relevance and that the underlying indexing philosophy in the searched material is neglected and taken as given. Finally, he finds that this approach contains an implicit teleology aimed at the construction of a perfect system. Contrary to the dominant paradigm of the computer/information-oriented tradition, Warner presents a tradition that is based more on library science and the practice of indexing. This tradition is far older, but less influential today. In his opinion, there are two especially valuable elements in this tradition. The first is the explicit priority of “selection power,”5 that is the user’s ability to make relevant distinctions during a search. The second is the recognition of the need for human labor to create this selection power.6 One way to interpret Warner’s (2010) distinction is to say that IR researchers tend to see IR as an input–output process, in which the knowledge represented is a “black box,” whereas KO is to a higher degree concerned with what is inside the box, that is, how the universe of recorded knowledge and culture has been represented in bibliographical records and databases. From this perspective, the users are not primarily interacting with computers, but with the universe of recorded knowledge, thus making studies of knowledge and culture important for information science. Warner’s (2010) differentiation between IR as query transformation versus KO as providing “selection power” seems important and is taken as the point of departure in my examination of IR in both this and a following article. There is, however, one major issue to address. Whereas the former mentioned researchers (Salton, Spärck Jones, van Rijsbergen, and Robertson) seem to fit well with Warner’s (2010) characterization of IR, the tradition of online searching, represented by researchers such as Harter (1986) and Lancaster (2003), seems less likely to do so. With the arrival of “classical” bibliographic databases (such as MEDLINE, Chemical Abstracts, and PsycINFO) and database hosts such as Dialog, a rapid development of search techniques followed, mainly based on Boolean logic (the extended Boolean model includes techniques such as proximity operators, truncation or tail wildcard, ranking, and explosion or expansion).7 Such classical databases (and the extended Boolean model) do not correspond to Warner’s criticism of query transformation; on the contrary, they represent the most advanced form of selection power ever developed in relation to bibliographical searches. In this respect, they form a contrast to IR techniques as developed JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—August 2015 DOI: 10.1002/asi 1561 in typical search engines for the Internet. The fact that the Boolean model is normally discussed in the literature of IR (and not usually, so far, in the literature of KO) does not make Warner’s suggestion less relevant, as he has also indicated, “. . . the Boolean logic used in many retrieval systems, does, under certain conditions, have the advantage of relative transparency to the searcher. The objection that it is an ineffective way of transforming an information need into a set of relevant records is no longer tenable.” (2000, p. 80) Warner’s two approaches (query transformation and selection power) may be useful as fundamental precepts on which to base an understanding of the role of KO in the electronic world. Information Searching and Information Professionals In schools of LIS, students are trained in searching databases and using Internet search engines. They are trained in using systems developed mainly by computer scientists. Classical bibliographical databases developed rapidly in the 1960s (see Bourne & Hahn, 2003; Hahn, 1998). Although these systems were developed outside LIS, their use quickly found a place in many LIS curricula. Until the development of systems for end users around 1990, searching databases was mainly done by professionals (whether they were termed documentalists, research librarians, librarians, or information specialists). These were mediated searches undertaken by professionals on behalf of the users or with the users. These classical databases were essentially based on the Boolean retrieval model (explained below).8 In about 1990, Internet search engines came along and made end user searching easy and this became the dominant way of searching digital information. Such systems were typically based on alternatives to the Boolean model (explained below). Bawden wrote, . . . at that time [about 1995] there were a number of search engines beside AltaVista: Lycos, Excite, Yahoo, Infoseek, Hotbot and more. They all had different searching functions, and were better for different types of material. People like me ran courses on how to get the best out of them. There was even a book, Search Engines for the World Wide Web, written by Alfred and Emily Glossbrenner in 1998 [3rd ed., 2001], which covered them all, and advised that we had to learn the strengths and weaknesses of each one, after first mastering the idiosyncrasies of them all. . . . And then Google came along and changed everything. Not because it was necessarily a better system. (2013, electronic source, no page) Bawden writes in the past tense: that he used to “[run] courses on how to get the best out of [search engines].” Does he mean that Google has made the teaching of online 1562 searching (and searching in general) obsolete? Or is he just saying that there are no real competitors to Google today and that the focus of teaching has changed accordingly? I do believe that teaching the strong and weak aspects of different kinds of systems is still common in LIS schools, and I have difficulty in imagining the purpose of LIS education if searching information/documents is not among the core competencies of our students. The classical databases were not driven out of business by the Internet and its search engines, although information searching by professionals was damaged.9 Professional searching still exists, partly in services offered by research libraries and most professionally in the fields of medicine, law, and patenting. In medicine, expert searching is in particular done in relation to systematic reviewing in the approach known as evidence-based practice (EBP; see, e.g., Hjørland, 2011a; Kitchenham & Charters, 2007) according to which it is mandatory to identify all documents fulfilling certain criteria. Information professionals today have to know both classical databases as well as search engines— among other things—to help users find the information and documents they need. At least until recently, Harter’s (1986) textbook on searching classical databases has been used and was extremely popular in LIS schools. To my knowledge, no one seems to have written a good successor yet: There are many books on information seeking (studies of the actual conduct of seeking)10 and information retrieval.11 There are also many practical books on searching (e.g., Bell, 2012; Hock, 2013), but they seem to lack a research basis and searching seems to have been neglected by information scientists,12 something that was also found bibliometrically by Zhao and Strotmann (2008).13 Searching as a field is here understood as the study of the optimal conduct of searching, both in classical databases and the use of search engines, but it also includes the classic term literature searching (and “hand searching”). As a field of research, searching should provide research-based support for normative decisions during search processes.14 A main point here is that classical databases and Boolean retrieval are still taught and are still considered important by LIS schools,15 but research activities and the development of research-based teaching materials in the field of searching seem to have suffered (medical information science seems, however, to be the exception confirming the rule; see the section on empirical studies). Computer science is mostly about making the systems, whether we speak of classical database systems such as Dialog (now Dialog-ProQuest) or of modern search engines. The main role of librarians and information specialists has never been to design such systems but—as Bawden (2013) wrote—to know the possibilities and weaknesses in the concrete task of helping users find information.16 As stated previously, it is not primarily human–computer interaction that is the main issue for LIS, it is rather the interaction with the world of recorded knowledge. This difference is important to keep in mind and it provides information science with JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—August 2015 DOI: 10.1002/asi a unique research perspective (at least potentially). An example may explain the difference. The database host Dialog had, in addition to real databases (e.g., the biological database BIOSIS Previews), special ONTAP (ONline Training And Practice) databases (e.g., ONTAP BIOSIS Previews) containing only portions of the complete databases (which were free or much cheaper to use). They contained records from a short time period and may or may not have been current. They were intended for training only. The point is that they may be adequate for learning the narrow technical aspects of database searching. However, for most people it is more interesting to search the real databases, in which queries can be tested in relation to what has been recorded on a given subject. It is much more interesting to interact with the world of recorded knowledge indexed in real-life databases. This perspective is important from the point of view of LIS because it goes beyond human–computer interaction. From this perspective, Warner’s (2010) distinction between the computer science16 traditions and an older library-oriented tradition also seems to be important; the former aimed to transform queries automatically into (ranked) sets of relevant documents, whereas the latter aimed to increase the “selection power” of users. He writes: The value placed on query transformation [in computer science] is dissonant with common practice, where users may prefer to explore an area and may value fully informed exploration. (2010, p. 5) Warner’s view is related to what might be termed a hermeneutical approach to searching (cf. Boell & CecezKecmanovic, 2010) as opposed to a positivist approach. The positivist view implies that searching can be done in a formal way (algorithmic) that retrieves relevant knowledge without bias in the search (and this is the assumption behind evidence-based practice). The hermeneutic approach is based on the assumption that there is a constant reinterpretation of the relevant literature, implying the need for great flexibility in search criteria and a great level of iteration in search processes—and, most important, an understanding of what is going on during the search. Booth et al. (2013) also found that Bates’ “berry picking” is in line with this hermeneutic approach: “Rather than sticking to an a priori search protocol Bates described how a searcher’s concept of a query is influenced by every new item of information that they encounter” (p. 2). This paper argues that Boolean search—although developed in computer science—is fully consonant with Warner’s goal of increasing the selection power of users and also that Boolean search is a better fit to the hermeneutical understanding of the search when compared with most alternative retrieval models. KO in Classical Bibliographic Databases A given database, say, the medical database MEDLINE, may be a comprehensive bibliographical database with a broad and deep coverage of its domain (MEDLINE contains more than 20 million references). However, despite the impression of completeness, any bibliographical database is always selective. Such selectiveness is often mainly formal (e.g., limited to journal articles), but it is also in the end connected to views of what is relevant and what constitutes the field (e.g., of medicine) even if it includes many references to adjacent fields (often based on implicit views of the domain). As we see later, even the coverage of a comprehensive database like MEDLINE is, for some important purposes, not sufficient. The selection of material to the bibliography constitutes the first and most basic level of KO. Because the meaning of terms is implicitly understood in this disciplinary context and to the extent that classification and indexing is based on the principle of “literary warrant,” this selection influences the developments of thesauri and ontologies, which may thus be understood as a higher level of KO. Another level of KO is the bibliographical record and its organization into fields (and the corresponding organization of data in linear and inverted files). Such records vary from database to database and from host to host. They are very complicated structures that professional searchers have to learn to master.17 It is not possible here to go too deeply into this field, but just state a few points: • A given record contains a mixture of fields derived from the document it represents, as well as information added by the database producer (it may contain the whole document in addition to value-added information). It may also contain information imported from third parties, as well as user-added information in the case of social tagging and related technologies. • Some fields contain controlled vocabularies developed by information professionals (e.g., descriptors and classification codes). Other fields contain “natural language” (i.e., the authors’ language for special purposes). Many databases include citation indexing, that is, the possibility of searching bibliographical references in each document represented. • It is important to realize that the efficiency of given fields for optimal search strategies is relative from domain to domain (the value of searching document titles, for example, varies according to how titles are used in different domains). In the social sciences the use of metaphors may thus limit the value of title searches (for more, see Hjørland & Kyllesbech Nielsen, 2001). • By implication, the experienced searcher should know not just about the database systems and the bibliographic records (or full-text records) but also about the concepts and genres of the primary literature. This aspect connects information science with fields such as scholarly communication, written composition, genre studies, and language for special purposes. Whereas KO in the narrow sense is about the design of bibliographical records and systems of controlled vocabularies, KO in the broader sense is about how knowledge is organized in different domains and how this can be used for IR. This broader perspective on KO becomes increasingly important in the context of full-text databases and the Internet. • The specific requirement of indexing for Boolean searches is to represent different facets of the document that are in the JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—August 2015 DOI: 10.1002/asi 1563 search process to be combined by Boolean “and.” This is known as “building blocks search strategy.”18 Such indexing looks like the approach known as facet analysis (see Hjørland, 2013b), but whereas the tradition of facet analysis is mainly logical or speculative, it is important to anticipate what facets need to be combined during a search. In evidencebased practice the methodological facet is important (e.g., a descriptor for randomized clinical trials; which, by the way, has only been considered in MEDLINE since about 1994, after the breakthrough of the EBP paradigm). The construction of facets should therefore be based on studies of researchers’ criteria of relevance (which are most explicate in EBP). A third level of KO in classical databases consists of the information retrieval thesaurus,19 ontologies, and other kinds of controlled vocabulary constructed by information specialists.20 There is a huge literature on thesauri, ontologies, and controlled vocabularies and their construction. From the IR point of view, their value has been questioned (e.g., by Salton, 1996) or neglected (cf. Warner, 2010, pp. 3, 148). An important point about controlled vocabularies in general is that they cannot once and for all provide a neutral and objective connection between concepts. To quote Tuominen, Talja, and Savolainen (2003), While unitary documentary languages ensure a maximum of mutual understanding [. . .], they do so by legitimizing a particular ideological and sociopolitical worldview, and by silencing other meanings, voices, and ways of knowing [. . .]. Unitary documentary languages embody a belief in the existence of a unified body of knowledge. They express a belief in the possibility to capture reality isomorphically in “information,” and presuppose a neutral ground from which to judge the truth-value of different theories. (p. 563) Each thesaurus and other controlled vocabularies reflect in some way a philosophy of KO (e.g., a user-based view, a facet-analytical view, or a domain-analytical view). The value of such a tool is in the end connected to the principles and qualifications on which they are based: It is meaningless to talk about the usefulness of thesauri while disregarding their individual qualities. Skepticism concerning concrete systems may thus be justified, owing to the specific principles and (lack of) qualifications of their constructors. They may be based on the problematic view that relations in thesauri are “context-free, definitional, and true in all possible worlds” (see Svenonius, 2000, 2004, and Hjørland, in press, for a criticism) and related essentialist philosophy. For the most part, it has not been taken into account that “a controlled vocabulary is a way to insert an interpretive layer of semantics between the term entered by the user and the underlying database to better represent the original intention of the terms of the user” (Fast, Leise, & Steckel, 2002; electronic source, no page). To the degree that existing thesauri and ontologies have been constructed under the influence of problematic assumptions, there might be room for improvement. 1564 A fourth level of KO in Boolean systems is generated by the searcher: • During search the user organizes the document representations in sets based on Boolean algebra. The final set is supposed to contain references to the documents that satisfy a set of criteria defined by the user (it may also contain noise; the set is meant to be examined afterwards record by record. The set is therefore not meant to satisfy the user’s information need in a direct way. It is only meant to delimit the full database to a set which has a high probability of containing the relevant documents). In other words, the aim of KO and IR is purely economic: to save the time of the searcher by reducing the number of records to be considered. • Theoretically, no bibliographical record or system allows 100% perfect subject retrieval and the efficiency of the search is dependent on: 1. The quality of the bibliographical records, their indexing, metadata, and relevant discriminatory power (relevant also in relation to the “voices” mentioned by Tuominen et al. (2003) cited previously; see also Hjørland, 2010). These may be termed objective search possibilities. 2. The qualifications of the searcher (the subjective search possibilities). For example, if a searcher does not know about citation indexing, an objective possibility is not utilized. Studies have found that users are—to some degree— able to adapt their searching to system performance; by making an extra effort, they may compensate for shortcomings in the system.21 • Organization of Boolean sets allowing a high level of recall and an acceptable level of precision thus depends on the underlying information in either indexing or other available information. The decisive issue is whether relevant documents can unequivocally be recognized at first hand (whether criteria for what should be found can be established). This is a question of the philosophy and methodology of science (e.g., to the extent that an interpretation of EBP results in norms such as “studies should be based on RCT,” such norms are also the norms of relevance in IR and the norms that the indexers have to meet; documents based on RCT should thus be distinguishable from those which are not). The understanding of requirements for indexing and searching should therefore be produced backwards, that is, sets of relevant documents should be identified and described in a way that clearly differentiates them from all other records. If searches fail, the reason should be identified and the indexing improved (known as “failure analysis”).22 The real difficulty in indexing is thus in determining criteria of relevance, which are to be negotiated by epistemological criteria. In conclusion, KO in the context of bibliographical databases and full-text databases focuses on the optimal construction of bibliographical records, subject access points, metadata, and controlled vocabularies/thesauri/ ontologies. However, this optimal construction has to be based in the broader study of and involvement with documents, genres, discourses, epistemologies, “voices,” and domains. JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—August 2015 DOI: 10.1002/asi Boolean Retrieval Boolean retrieval is based on Boolean algebra, named after the English mathematician Boole (1815–1864).23 This has had a great influence on computer science and plays an important role in online searching. However, its usefulness in relation to other retrieval models has often been disputed, as discussed below. Boolean retrieval is commonly one of the most used and taught models for IR, but it is often less efficient compared to other retrieval models, as discussed in standard textbooks such as that of Baeza-Yates and Ribeiro-Neto (2011). In most Boolean retrieval systems, bibliographical records are contained in “linear files,” associated with “inverted files” in which words, strings, or other symbols are contained with references to record numbers in the linear files. Usually only types (different terms/symbols), not tokens (repeated types), are identified, but it is possible to specify the type in a given field (e.g., in the title or in the descriptor field), or in a given distance from other types and thereby provide more precise search criteria. Searching is done in the inverted files and matching documents (or rather just their identification numbers) are organized in sets, which may be combined with the logical “AND,” “OR,” or “NOT” (the Boolean operators).24 Each operator—and each complete search profile—can visually be described using Venn diagrams. To understand the possibilities of Boolean search when used in its most advanced ways, it is necessary to know about bibliographical records in online databases, as briefly introduced above. Among the subject access points, there can be many aspects of information such as whether a journal is peer-reviewed or not, its impact factor, and other kinds of information. Such kinds of information are available to the professional searcher. Semiotically, it could be said that advanced bibliographical records consist of many “voices” (in the sense of Bakhtin, 1986). Advanced searchers learn to utilize these voices and possibilities in creative ways to optimize the search. During the search, sets of documents are created, which may be combined by Boolean operators in complex ways. This is a powerful technique but, as already stated, it depends on the quality of the records, the facilities of the database host, and the knowledge and experience of the searcher. Although Boolean logic is a two-value logic (a given word form or symbol, representing a property with some degree of certainty, is either present or not present without possibilities of graduation), many combinations of sets based on two-valued logic can compensate for this limitation. Information science is not only about how to improve records and search facilities, but also about how to optimize the use of existing records and facilities (e.g., about the comparative value of citation-based retrieval and descriptor-based retrieval in different domains) and about educating professional searchers. Although Boolean searches may be said to transform a query into sets of potentially relevant records, it is much more in line with Warner’s (2010) aim of increasing the “selection power” of users because of the iterative nature of the search and users’ control over the process. During the process, users may explore how terms are used, how they co-occur, and how knowledge is organized. In other words, the query is not a static entity, but is modified during the search process. In advanced systems, there is a very wide range of possibilities to choose from (and searches should never be performed without access to the documentation of the system’s basic commands as well as to the structure and search fields in the single databases used). Boolean retrieval is a kind of “exact-match retrieval,” which is opposed to “best-match models.” Turtle and Croft defined exact-match retrieval as follows: • The query specifies precise retrieval criteria. • Every document either matches or fails to match the query. • The result is a set of documents.25 (Croft, 1998; electronic source, no page) The opposite of exact match is termed “best match” (or “partial match” or “scoring systems”). Turtle and Croft defined best-match retrieval in the following way: • The query describes a good or “best” matching document. • The result is a ranked list of documents. • The result may include an estimate of quality. (Croft, 1998) Boolean approaches differ from best-match systems in that the searchers maintain full control over the search process, whereas search engines work behind the back of the users, so to speak. Boolean searching is a kind of “exact-match retrieval” in that given documents or document representations either fulfill the demands or fail to fulfill the demands expressed in search logic. It requires that the users obtain the necessary knowledge to be able to evaluate the consequences of their decisions made during the search. Of course, it is easier not to involve oneself with such issues, but this comes at the cost of not knowing what is not found and why this is the case. In some cases this may be fine, that is, in cases where overview of a specific part of the information is not necessary. Especially (but not exclusively) for academic purposes, it seems important that users maintain control over the search process. In relation to this, it is noteworthy that informationretrieval specialists seldom discuss the evaluation of systems in connection with their own research activities and experiences, but tend to consider users’ information needs without making this move of reflexivity. Reflexivity means that the principles one suggests must also be valid in relation to oneself: In the most general terms, considered as a feature or style of inquiry, reflexivity is a concept that expresses a concern to turn ‘subject’ into ‘object,’ to turn the activity of inquiry into a topic for (its own) inquiry. To ‘be reflexive’ is to bring the process of doing an activity into the purview of that activity as a feature of it. (Ashmore, 2001, p. 12881) JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—August 2015 DOI: 10.1002/asi 1565 [The sociology of science] would be reflexive. In principle its patterns of explanation would have to be applicable to sociology itself. (Bloor, 1976, p. 7) In relation to information science, reflexivity means that information scientists should also consider themselves to be users of information systems and—IR—researchers should consider information systems and information needs in relation to their own behaviour. In other words, IR researchers should consider their own systems in relation to their own research activities and design systems that also suit themselves26 (which also implies that they should be able to formulate criteria for which documents are relevant to their own research). It should be said that different retrieval models may be combined—and often have been combined. The extended Boolean model (Salton, Fox, & Wu, 1983)27 makes use of partial matching and term weights as in the vector space model, but still uses the Boolean queries. There are many techniques and the traditional Boolean search systems are often improved with the inclusion of other features such as proximity searching and ranking. From the point of view of this paper, the important thing is whether the user has full control over the search—if he or she wants to have it. It is also clear that all kinds of tools that may provide the searcher with useful information in making decisions (e.g., information about term frequencies and citation patterns in documents and in the whole database) should be provided. Criticism of Boolean Retrieval Göker and Davies (2009) wrote, “The Boolean model is the first model of information retrieval and probably also the most criticized model” (p. 3). It has been claimed that a serious limitation of Boolean searches is that they are based on a two-value logic. Documents are divided into two sets: relevant documents and nonrelevant documents. However, in real life most documents have degrees of relevance.28 A further limitation that has also been mentioned is that Boolean logic only concerns the relevance of documents, not their mutual relations.29 Many texts, for example, Belkin and Croft (1987), Baeza-Yates and Ribeiro-Neto (1999, pp. 27, 281; 2011, p. 66), Chowdhury (2010, pp. 206–207), Chu (2010, pp. 112–114), Cooper (1988), Korfhage (1997), and Turtle and Croft (1992) provide related criticism of exact-match models. Baeza-Yates and Ribeiro-Neto (2011) wrote, “Today, it is well known that index term weighting can lead to a substantial improvement in retrieval quality [compared to Boolean search]” (p. 66). Turtle and Croft (1992) found that: “For some kinds of searching (e.g., known-item searching) the deterministic and predictable exact-match models perform very well, while for more general searching or for untrained users the vector space and probabilistic models have clear performance advantages” (p. 289). Van Rijsbergen considers Boolean retrieval a tool “so ‘blunt’ that it is almost useless” (1991, p. 85) and (seemingly 1566 to the contrary): “a very attractive approach indeed, unfortunately most systems will either retrieve a set much too large or nothing at all. It is often the case that although a document is not a model of a query, it nevertheless has some relevance to the user need and indeed is about the query. So here we have a situation where a two-valued logic just will not work. The difficulty is how to extend the Boolean logic without at the same time losing its advantages” (van Rijsbergen & Lalmas, 1996, p. 387). Here, one of the critiques will be considered in more detail. Larson (2010, p. 2557) lists seven problems with Boolean searches. Each of these will be presented and discussed in turn: 1. “Boolean logic requires that the users be trained in construction of queries, because it is not intuitively obvious, and differs from common natural language usage. (E.g., a user seeking books on ‘cats and dogs’ probably wants books on either, not just those that treat both.)” Comment: It is correct that training is necessary and that Boolean logic is somewhat counterintuitive; the NOT command, in particular, is known to be problematic and counterintuitive to use; however, despite these difficulties, Boolean logic is powerful: when the technique has been learned, the searcher is empowered in ways with which no other technique can compete. The first objection is only relevant if commercial criteria are taken into account or if extreme simplicity is mandatory. 2. “Boolean AND operations tend to be too restrictive. A search for “A AND B AND C” will eliminate any records that do not have all these terms. Those that have one or two of the three terms are rejected along with those that have none of the terms.” Comment: If a search for “A AND B AND C” turns out to be too restrictive, the searcher may just search: (A AND B) OR (A AND C) OR (B AND C). Many additional strategies are available to increase recall (dependent on the content and structure of the database). Larson’s second point of criticism is therefore wrong. 3. “Boolean OR operations tend to be too inclusive. A search for “A OR B OR C” will retrieve any record with at least one of the terms, but no...

Turn in your highest-quality paper
Get a qualified writer to help you with

“ Boolean and Bayesian Scenarios with explanation ”

Get high-quality paper

NEW! AI matching with writer