Filtering long sentences with regular expressions

It often happens that, in order to make an article clearer or more incisive, I need to identify very long sentences in a draft and break them down into smaller, simpler units.

LaTeX editors typically do not offer a built-in option to identify long sentences. However, regular expressions make it possible to work around this limitation.

In editors that support searching with regular expressions (such as Sublime Text or Texpad), the following snippet allows us to search for sentences with more than 20 words:

((\w+,\s+)|(\w+\s+)){20,}(\w+[\.?!])

It is not too hard to break this regular expression down into its elementary constituents. Let us just recall a few ideas concerning regular expressions:

  • () encloses a group
  • | indicates the OR operation
  • \w+ matches a series of one or more alphanumeric characters (a word)
  • \s+ matches a series of one or more whitespace characters
  • \. matches a literal dot (inside the square brackets, ? and ! also match themselves)
  • {number_1,number_2} looks for at least number_1 and at most number_2 repetitions of the previous element; {20,} alone means at least 20 repetitions, with no upper limit

Therefore, in plain language, the above regular expression reads:

(a word followed by a comma and some space) OR (a word followed by some space), REPEATED AT LEAST 20 TIMES, followed by (a word ending with a full stop OR a question mark OR an exclamation mark).

This clearly allows us to detect sentences that may be long, very long, very very long, at least as long as this very sentence!
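
The same search can also be run outside the editor. Here is a minimal Python sketch that applies the pattern to a whole draft (the file name draft.tex is just a placeholder):

import re

# the same pattern as above: 20 or more words followed by a sentence terminator
long_sentence = re.compile(r'((\w+,\s+)|(\w+\s+)){20,}(\w+[\.?!])')

with open('draft.tex') as f:  # hypothetical draft file
    text = f.read()

# print every overlong sentence found in the draft
for match in long_sentence.finditer(text):
    print(match.group(0))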

[Screenshot: an example of a match in Sublime Text. Notice that the regular expression button (.*) in the bottom left corner of the search field is pressed.]

Segmenting 3d biological data

I have recently been given the opportunity to study the segmentation of 3D data. The group of Dr. C. Hammond of the School of Physiology, Pharmacology and Neuroscience in Bristol studies malformations in the tissues of zebrafish, a model organism which can be genetically manipulated relatively easily.

A major task is to identify bone malformations or osteoarthritis. Hammond's group manages to image hundreds of zebrafish in three dimensions, so that bone structures can be visualised. Identifying bone deformations in the spine, for example, is key to associating them with specific genetic markers. To do so, a quantitative analysis of the structure of the individual vertebrae is necessary.

It turns out that this is possible via image analysis techniques that are publicly available in Python: the key libraries that I employed are scipy.ndimage and scikit-image. Identifying the vertebrae in 3D means performing a segmentation of volumes and surfaces in 3D images.
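
To give a feeling of this kind of pipeline, here is a minimal sketch (not the actual analysis; the file name scan.npy and the array layout are hypothetical) that thresholds a 3D stack and labels its connected components:

import numpy as np
from scipy import ndimage
from skimage.filters import threshold_otsu

# load a 3D image stack as a NumPy array (placeholder file name)
volume = np.load('scan.npy')

# binarise the volume: bone voxels are brighter than the background
binary = volume > threshold_otsu(volume)

# label the connected components in 3D: ideally, one label per vertebra
labels, n_objects = ndimage.label(binary)
print('found', n_objects, 'connected components')

# measure each component, for example its volume in voxels
sizes = ndimage.sum(binary, labels, index=range(1, n_objects + 1))

In practice, some smoothing (for example with ndimage.gaussian_filter) and a cleanup of spurious small components are usually needed before the labels correspond to individual vertebrae.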

An example of the vertebrae, individually resolved, can be visualised in 3D below:

[3D visualisation of the individually resolved vertebrae]

Searching for a module

When installing software on a High Performance Computing cluster, additional packages are often handled by the module system.

To get a list of all the available modules, it is sufficient to type


module avail

One then often retrieves a very long list of possible modules, in alphabetical order. This is not very convenient if one is looking for a particular feature and does not really know how it has been categorised.

One may think that grep would suffice to filter the results. This is almost true: module avail prints its list to standard error, so one first needs to reformat the output into a single column with the -t option, redirect standard error (labelled by 2 in Bash) to standard output (labelled by 1, so that the redirection is 2>&1), and then pipe the result to grep.

For example, if we want to search for all the modules containing “python” in their name we would type:


module avail -t 2>&1 | grep -i python

and we can even write a convenient script named modsearch in our ~/bin:


#!/bin/bash
# list the available modules matching a pattern, case-insensitively
module avail -t 2>&1 | grep -i "$1"

so that in the future, after making the script executable with chmod +x ~/bin/modsearch, we will just have to type


modsearch python

Clustering and periodic boundaries

Clustering in Python can be nicely done using the statistical tools provided by the sklearn library.

For example, the DBSCAN method implements a clustering algorithm that detects connected regions, given a maximum distance between two elements of the same cluster.

However, the library does not natively support periodic boundary conditions, which can sometimes be annoying. An easy workaround, though, exploits precisely the power of the library: methods like DBSCAN can accept a precomputed distance matrix as input, and the clustering is then computed on that.

The workaround is thus to compute a distance matrix with the periodic boundaries built into it. The easiest way that I have found is to use the scipy function pdist on each coordinate separately, correct the one-dimensional distances for the periodic boundaries, and then combine the results into a distance matrix (in square form) that can be digested by DBSCAN.

The following example may give you a better feeling of how it works.

import pylab as pl
from sklearn.cluster import DBSCAN
from scipy.spatial.distance import pdist, squareform

# box size
L = 5.
threshold = 0.3
# create data: a blob of points around the origin...
X = pl.uniform(-1, 1, size=(500, 2))
# ...wrapped into the box [0, L], so that the blob is split
# across the four corners
X[X < 0] += L

# finding clusters, no periodic boundaries
db = DBSCAN(eps=threshold).fit(X)

pl.scatter(X[:, 0], X[:, 1], c=db.labels_, s=3, edgecolors='None')
pl.figure()

# 1) find the correct distance matrix, one coordinate at a time
total = pl.zeros(X.shape[0] * (X.shape[0] - 1) // 2)
for d in range(X.shape[1]):
    # find all 1-d distances (condensed form, as returned by pdist)
    pd = pdist(X[:, d].reshape(X.shape[0], 1))
    # apply the minimum-image convention: distances larger than L/2
    # wrap around the boundary (the sign is irrelevant once squared)
    pd[pd > L * 0.5] -= L
    # accumulate the squared 1-d contributions
    total += pd ** 2
# take the square root to obtain the Euclidean distances...
total = pl.sqrt(total)
# ...and turn the condensed distance matrix into a square one
square = squareform(total)
# 2) cluster again, this time on the precomputed periodic distances
db = DBSCAN(eps=threshold, metric='precomputed').fit(square)
pl.scatter(X[:, 0], X[:, 1], c=db.labels_, s=3, edgecolors='None')
pl.show()

Before the periodic boundaries (Lx = Ly = 5):

[Figure 1]

… and after (Lx = Ly = 5):

[Figure 2]

Concatenate pdfs from the Terminal

Oftentimes it can be convenient to merge different PDF documents in order to get a single, continuous document that can easily be sent via email for review or correction.

If one has just a few documents, this can be done directly through Preview.app on the Mac, but for more documents (or when we want to repeat the merge many times) a command-line application can be very convenient.

On Linux, or on the Mac, poppler is the set of tools that does the trick (on the Mac you can install it with Homebrew).

In particular, you will find that the package includes a program called pdfunite. Its usage is straightforward: the input PDFs are merged in the order given, and the last argument is the name of the output file:

pdfunite file_in_a.pdf file_in_b.pdf file_in_c.pdf fileout.pdf

brew update --force

Homebrew is a very convenient package manager for Mac OS X. It makes the installation of numerous utilities and programs incredibly easy. It is based on a database of instructions (Ruby formulae) that is kept up to date using Git.

Keeping the database up-to-date is normally done with

brew update

Sometimes, however, it can fail. It has already happened to me a few times that I was unable to retrieve the latest version of the database, which makes it impossible to install new software.

If the internal diagnostic tool

brew doctor

is not sufficient to identify and solve the issue, there is a way to force the update. As indicated on these pages, one can use Git directly to recover the database:

cd `brew --prefix`
git remote add origin https://github.com/mxcl/homebrew.git
git fetch origin
git reset --hard origin/master