Parallel all-pairs similarity search algorithms in OCaml

Sources

The repository contains the 1.0 sources, a release will be made soon.

Citation

If you use this code, please cite the following paper. It is currently under review at IJPP.

https://arxiv.org/abs/1402.3010

Title: 1-D and 2-D Parallel Algorithms for All-Pairs Similarity Problem

Authors: Eray Özkural, Cevdet Aykanat (Submitted on 13 Feb 2014)

Abstract: All-pairs similarity problem asks to find all vector pairs in a set of vectors the similarities of which surpass a given similarity threshold, and it is a computational kernel in data mining and information retrieval for several tasks. We investigate the parallelization of a recent fast sequential algorithm. We propose effective 1-D and 2-D data distribution strategies that preserve the essential optimizations in the fast algorithm. 1-D parallel algorithms distribute either dimensions or vectors, whereas the 2-D parallel algorithm distributes data both ways. Additional contributions to the 1-D vertical distribution include a local pruning strategy to reduce the number of candidates, a recursive pruning algorithm, and block processing to reduce imbalance. The parallel algorithms were programmed in OCaml which affords much convenience. Our experiments indicate that the performance depends on the dataset, therefore a variety of parallelizations is useful.

The paper is included in the sources.

Comments

The code is quite interesting, as it shows how to effectively use OCaml for MPI code. There is a bunch of well-written parallel functional code that I will extract from this codebase and release separately. You need the latest ocamlmpi release as that contains the patches I made to make this code work.

License

This code is released under AGPL-3.0. Please do not ask me to release it under BSD license. If you need a commercial license, you should purchase it.

Happy hacking!

ANN: parallpairs

Eray Özkural

Eray Özkural has obtained his PhD in computer engineering from Bilkent University, Ankara. He has a deep and long-running interest in human-level AI. His name appears in the acknowledgements of Marvin Minsky's The Emotion Machine. He has collaborated briefly with the founder of algorithmic information theory Ray Solomonoff, and in response to a challenge he posed, invented Heuristic Algorithmic Memory, which is a long-term memory design for general-purpose machine learning. Some other researchers have been inspired by HAM and call the approach "Bayesian Program Learning". He has designed a next-generation general-purpose machine learning architecture. He is the recipient of 2015 Kurzweil Best AGI Idea Award for his theoretical contributions to universal induction. He has previously invented an FPGA virtualization scheme for Global Supercomputing, Inc. which was internationally patented. He has also proposed a cryptocurrency called Cypher, and an energy based currency which can drive green energy proliferation. You may find his blog at https://log.examachine.net and some of his free software projects at https://github.com/examachine/.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.