Wednesday 14 November 2007

Quick PDF sorting and searching: SWISH++

Problem: searching and sorting a large collection of PDF files automatically.
Solution: SWISH++ together with pdftotext, find, and a bit of Bash or Perl scripting can index PDF documents and search inside them quickly.

The common way is to use Beagle or some other desktop search tool, but I will show how SWISH++ can do the same job much faster and with far fewer resources.


Introduction: How to index PDF files
Perl lovers like to say that "there is more than one way to do it". So, here is my way to do it. Briefly, the solution consists of these steps:

  • use find to locate all PDF documents and convert them to text with the pdftotext tool
  • index these text files with index++ to get an index file
  • choose a relevance threshold experimentally
  • search the index file by keywords with search++
  • move the files found into the required directory
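
The first, second, and fourth steps can be put together in one small shell function. This is only a sketch: pdf_search is a name I made up for illustration, and it assumes that pdftotext, index++, and search++ are installed and that it is run from the top directory of the collection. It stops at the search step and leaves the threshold and the moving to you.

```shell
# pdf_search is a hypothetical helper name; the three commands inside it
# are the ones used in this post.
pdf_search() {
    # 1. Convert every PDF below the current directory to a .txt file
    #    placed next to it.
    find . -name '*.pdf' -exec pdftotext -nopgbrk -q {} \; || return 1

    # 2. Index all .txt files; index++ writes swish++.index here.
    index++ -e "text:*.txt" . || return 1

    # 3. Search the index with the words given as arguments.
    search++ "$@"
}
```

It would be called with the query words as arguments, for example: pdf_search morphology and erosion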

Finding PDF documents and extracting text from them

Simply ask find to locate all *.pdf files and run pdftotext in quiet mode on each one. This can be done with the command:
find . -name '*.pdf' -exec pdftotext -nopgbrk -q {} \;
This works only for English text; other languages are not supported.
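
The one-liner converts every PDF again on each run. If the collection grows over time, a variant that skips PDFs with an up-to-date text file may save some time. This is a sketch under my own assumptions: convert_pdfs is a made-up name, and it relies on the shell's -nt ("newer than") file test, which bash and dash both provide.

```shell
# convert_pdfs is a hypothetical helper name, not part of pdftotext.
# It converts each PDF under a directory (default: the current one)
# unless an up-to-date .txt file already sits next to it.
convert_pdfs() {
    find "${1:-.}" -name '*.pdf' -exec sh -c '
        for pdf in "$@"; do
            txt=${pdf%.pdf}.txt
            # Skip PDFs whose text file exists and is not older than the PDF.
            if [ -e "$txt" ] && ! [ "$pdf" -nt "$txt" ]; then
                continue
            fi
            pdftotext -nopgbrk -q "$pdf" "$txt" ||
                echo "conversion failed: $pdf" >&2
        done
    ' sh {} +
}
```

Because find passes the file names as arguments to the inner shell, names with spaces survive intact.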

Making the index file
Here it is even simpler: just ask index++ to index all of our text files, from the current directory down:
index++ -e "text:*.txt" .
The dot at the end is required!


What is SWISH++
There are few mentions of SWISH++ on the Net: only the project homepage and an article about applying this system in a real search engine. Some people say that SWISH++ is the fastest search engine ever.
A description of this excellent search system can be found in the Debian package: Simple Document Indexing System for Humans: C++ version. It is especially suitable for building a fast and efficient search engine.
Here are some advantages of SWISH++:
  • Lightning-fast indexing
  • Indexes META elements, ALT, and other attributes
  • Selectively not index text within HTML or XHTML elements
  • Intelligently index mail and news files
  • Index Unix manual page files
  • Apply filters to files on-the-fly prior to indexing
  • Index non-text files such as Microsoft Office documents (antiword required)
  • Modular indexing architecture
  • Index new files incrementally
  • Index remote web sites
  • Handles large collections of files
  • Lightning-fast searching
  • Optional word stemming (suffix stripping)
  • Ability to run as a search server
  • Easy-to-parse results format
SWISH++ consists of two tools: index++ and search++. The first one indexes files, and the second one searches within the index. It's like your personal Google, but small, fast, and console-based. :-)


Indexing files
index++ builds the index file from the text documents produced by pdftotext (oh yeah, the UNIX way!). It supports formats such as text, HTML, XML, LaTeX, and mail: anything that can be converted to text, perhaps with a bit of markup left in. On my desktop machine (Intel P4 630, 3 GHz, 2 GB RAM) indexing is very fast: about 270 files in 5 seconds.

With a verbosity level of 3, one can get more information about the indexing process:
index++ -v3 -e "text:*.txt" .
The dot at the end is important; the manual page explains more. The output will look like this:

watters_etal_paleobio_2001.txt (2704 words)
WaveMetriconChip64.txt (1351 words)
wshedtopoalgoJMIV.txt (4042 words)
Ye.IJDAR.1.txt (4470 words)
YucelITIP01.txt (1678 words)
./edg:
morphology.txt (753 words)
LuengoEtAl_IbPRIA05.txt (1227 words)
Cuisenaire2005_1250.txt (1162 words)
icpr2004_nucleus.txt (1234 words)
OrtizEtAl_SPIE01.txt (1463 words)
Angulo_VIIP04.txt (1658 words)
682.txt (1901 words)
comorph.txt (1948 words)
index++: ranking index...
index++: writing index...
index++: done:
00:05 (min:sec) elapsed time
548 files, 271 indexed
2465116 words, 1046139 indexed, 56281 unique
The result is the swish++.index file, which holds all the information about the indexed files.
Great: this huge collection of articles was indexed so fast! Now we are ready to search for something in it.
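
The per-file word counts in the verbose output can also point at PDFs that gave almost no text (scanned pages, for example). The sketch below assumes that each file is reported on its own line in the form "name (N words)"; low_word_files is a made-up helper name, not a SWISH++ tool.

```shell
# low_word_files is a hypothetical helper name.
# It reads verbose index++ output on stdin and prints "<words> <name>"
# for every file with fewer than LIMIT words (first argument).
low_word_files() {
    awk -v limit="$1" '
        # Only lines ending in " (N words)" describe an indexed file.
        match($0, / \([0-9]+ words\)$/) {
            name  = substr($0, 1, RSTART - 1)
            # The number starts after " (" and the match carries
            # 9 other characters: the space, "(", " words", and ")".
            count = substr($0, RSTART + 2, RLENGTH - 9)
            if (count + 0 < limit) print count, name
        }'
}
```

A possible call, merging both output streams because I am not sure which one index++ uses for verbose messages: index++ -v3 -e "text:*.txt" . 2>&1 | low_word_files 100. Directory lines such as "./edg:" are skipped, so the names are printed without their directory.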


Searching files
Let's find something in our collection of files by keywords. To do this, ask search++ to look in the swish++.index database. For example, I can search for papers about morphological analysis of images that do not mention medicine:

$ search++ morphology and erosion and dilation not medicine
And here are the results (output is shortened):
# results: 125
99 ./Krylov2.txt 3771 Krylov2.txt
49 ./13300407.txt 3103 13300407.txt
46 ./morph1.slides.printing.6.txt 4369 morph1.slides.printing.6.txt
37 ./lecture_morphology_sara.txt 6746 lecture_morphology_sara.txt
30 ./SIGGRAPH2002_Sketch-Mitchell.txt 5308 SIGGRAPH2002_Sketch-Mitchell.txt
26 ./MorphologicalImageProcessing.txt 7642 MorphologicalImageProcessing.txt
25 ./phdsymp2002_ledda.txt 8298 phdsymp2002_ledda.txt
23 ./lab2_manual.txt 9313 lab2_manual.txt
23 ./Project 1.txt 9946 Project 1.txt
22 ./morphology.txt 11212 morphology.txt
22 ./edg/morphology.txt 11212 morphology.txt
22 ./slides-6-geometry.txt 11717 slides-6-geometry.txt
22 ./V1BFOGG8.txt 10797 V1BFOGG8.txt
18 ./71650638.txt 13978 71650638.txt
The first column is the relevance rank, the second is the relative file path, the third is the file size, and the fourth is the name. Simple and clean. So it's very easy to find an article if you remember something about it (author name, keywords, or even a phrase from it).
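
The remaining steps, picking a relevance threshold and moving the files found, can be scripted on top of this output. What follows is a sketch, not the exact script I used: select_hits and move_hits are made-up names, the result lines are assumed to look exactly like the ones above, and every indexed path is assumed to end in .txt with the original PDF sitting next to it.

```shell
# select_hits and move_hits are hypothetical helper names.

# select_hits MIN_RANK
# Reads search++ output on stdin and prints the path of every result
# whose rank is at least MIN_RANK. Comment lines ("# ...") are ignored.
select_hits() {
    # Keep "<rank> <path>"; the path may contain spaces, so it is taken
    # as everything up to ".txt" that is followed by the numeric size.
    sed -n 's/^\([0-9][0-9]*\) \(.*\.txt\) [0-9][0-9]* .*$/\1 \2/p' |
    while read -r rank path; do
        if [ "$rank" -ge "$1" ]; then
            printf '%s\n' "$path"
        fi
    done
}

# move_hits MIN_RANK DEST
# Moves the PDF that belongs to each selected text file into DEST.
move_hits() {
    mkdir -p "$2" || return 1
    select_hits "$1" | while IFS= read -r txt; do
        pdf=${txt%.txt}.pdf
        target=$2/$(basename "$pdf")
        if [ ! -f "$pdf" ]; then
            continue
        elif [ -e "$target" ]; then
            # Two results may share a file name (see morphology.txt above).
            echo "skipped, name already taken: $pdf" >&2
        else
            mv -- "$pdf" "$target"
        fi
    done
}
```

For example, to move every paper with a rank of 25 or more into a directory of its own: search++ morphology and erosion and dilation not medicine | move_hits 25 ./morphology. Running select_hits alone with different thresholds first is a cheap way to choose the cut-off before anything is moved.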


What we get
I have a vast collection of scientific articles in English, and it's very hard to remember the exact title and content of each paper. Using this approach, I sorted more than 2400 papers in about 2 hours. The task was harder for SWISH++ because the papers' content is so homogeneous. I estimate the precision at approximately 60-70%. Of course, I looked through the sorted papers myself, so it was a semi-automatic mode ;-)


Links: I can't say everything about this shiny search system in one post, but I tried to show how quickly and easily I work with loads of PDF articles on my Debian box.
For further information, you may be interested in the project's SourceForge page. Here are many articles about search engines and, in particular, about SWISH++, and documentation about SWISH-e is here. I hope that with this post there will be one more article about this very useful system, SWISH++.
