MongoDB IO performance

It was not the first time we had seen this problem, and not the first time we could not figure out what was causing it. But today I want to share the experience.

A little bit of history: our production system is a replica set of three nodes. When the issue happens, the system responds slowly; it feels like the web is down, because some requests take a very long time to get an answer.

You know the system is easily fixable, because after you restart the affected node it is up and running normally again. The problem is that you have to delete all the data in the affected replica and start a synchronisation from scratch. The first time it happened on the master node; the last time it happened on a slave node (a different machine).

One good thing is that the last time it happened we were testing New Relic, so we got some new information; not much, but something. Here are some pictures I extracted from our New Relic monitoring:

The CPU usage grows a little at the point of the peaks:

[New Relic screenshots: CPU usage during the peaks]

But the real problem is the disk IO (not the network IO, as you can see in the pictures):

[New Relic screenshot: disk IO vs. network IO]

Here is a better graph of the IO problem; you can see that the writes increased like hell during that time, using 100% of the hard disk's IO capacity.

[New Relic screenshot: disk writes saturating the IO]

We do not know the real reason at the time of writing this post. However, there is an old issue (fixed in 2.5.0) where mongod showed this particular problem when flushing the working memory to disk.

I would like to write down everything we know at this point; maybe it will help us find a solution at some point and clarify our “flushing data” theory:

  • There were not many users on the website at that time. As far as we can see, the traffic (or its consequences) is not what creates the issue.
  • The issue appears when the system is not doing much. It is not the most relaxed time for our MongoDB infrastructure, but there is not much going on at that particular time.
  • The number of connections grew from ~150 to ~2000 (see the monitoring sketch after this list).
  • Query response times grew to ~20s or more.
  • If you graph the response times of the collections, the peaks line up with log entries about flushing data (we use MMAPv1):

2015-11-22T10:07:45.645+0100 I STORAGE [DataFileSync] flushing mmaps took 426819ms for 615 files
2015-11-22T11:13:31.755+0100 I STORAGE [DataFileSync] flushing mmaps took 429132ms for 615 files

  • The node still looks alive to the master, but it is really slow for queries. This blocks the web, because the driver does not see the node as down and keeps sending requests to it.
  • One of the web nodes was using all of its heap memory during that period, and it went back to normal after we fixed the problem with the node.
  • One bad thing about the monitoring is that the instrumentation currently does not work with our version of MongoDB, which is why I cannot connect the web nodes with the database and add more information about the issue.
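
Since the instrumentation is not working, we can at least poll those numbers ourselves. A minimal sketch, assuming pymongo and a mongod reachable on localhost:27017 (host and port are placeholders for our setup):

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
status = client.admin.command("serverStatus")

# Connections: during the incident this grew from ~150 to ~2000.
conns = status["connections"]
print("connections: current=%d available=%d" % (conns["current"], conns["available"]))

# Under MMAPv1, backgroundFlushing reports how long the periodic flush of the
# memory-mapped files to disk takes; our logs showed flushes of more than 400s.
flush = status.get("backgroundFlushing")
if flush is not None:
    print("last flush: %d ms (average %.1f ms)" % (flush["last_ms"], flush["average_ms"]))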

We have tested the performance of the hard disk from the Linux side and from the Mongo side, and we did not get any clue about what is wrong: the results were fine in both cases.
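
For reference, the kind of raw sequential-write check you can run from the Linux side looks roughly like this Python sketch (the file path and sizes are arbitrary choices, not our exact test):

import os
import time

PATH = "/tmp/disk_write_test.bin"  # arbitrary location on the disk under test
CHUNK = b"\0" * (1024 * 1024)      # write in 1 MiB chunks
TOTAL_MB = 512

start = time.time()
with open(PATH, "wb") as f:
    for _ in range(TOTAL_MB):
        f.write(CHUNK)
    f.flush()
    os.fsync(f.fileno())  # force the bytes to the disk, not just the page cache
elapsed = time.time() - start

print("wrote %d MB in %.1f s (%.1f MB/s)" % (TOTAL_MB, elapsed, TOTAL_MB / elapsed))
os.remove(PATH)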

For the time being, we have switched the storage engine from MMAPv1 to WiredTiger to see if the behaviour is different; in theory it should not flush data from memory to disk in that way.
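
To double-check which engine a node is actually running after the restart, serverStatus reports it (the storageEngine field is available since MongoDB 3.0; host and port are placeholders again):

from pymongo import MongoClient

status = MongoClient("localhost", 27017).admin.command("serverStatus")
print(status["storageEngine"]["name"])  # prints "wiredTiger" after the switch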

If you have any suggestions, we have posted the same information in the MongoDB user group, or you can write a comment here :).

A programming language for AI

I am curious about which programming language is most useful for Artificial Intelligence. “Choose the language you are most proficient in” is not an option for me; choosing the right tool for the right problem is better in my case.

I was looking on Quora and found some results. “What is the best programming language for Artificial Intelligence projects?” is one of the most interesting questions, and I read the answers there. The conclusion among the results is: Python (because it is fast for developing things and has interesting libraries), C/C++ (because of speed and performance) or Java.

Taking a look on Google, I found a tutorial written by Günter Neumann, from the German Research Center for Artificial Intelligence, entitled Programming Languages in Artificial Intelligence. In the tutorial you can read why functional and symbolic programming languages are more useful for AI, followed by an introduction to Lisp and a small section on Prolog.

It is a simple introduction to Lisp, but I could not avoid remembering µlisp (a small Lisp interpreter that I wrote in C, based on another book, Build Your Own Lisp). At that point there are no standard libraries; you have to build them yourself, and I was wondering whether, from that point, you could start to create a language that helps you represent the world.
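
µlisp itself is written in C, but to show how small the core of such an interpreter can be, here is a toy S-expression evaluator sketched in Python (arithmetic only; the environment dictionary plays the role of the standard library you have to build yourself). It is an illustration of the idea, not the actual µlisp code:

import operator

def tokenize(src):
    return src.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    token = tokens.pop(0)
    if token == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return expr
    try:
        return int(token)
    except ValueError:
        return token  # a symbol

# The "standard library": an environment you have to populate yourself.
ENV = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.floordiv}

def evaluate(expr):
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return ENV[expr]
    fn = evaluate(expr[0])
    args = [evaluate(arg) for arg in expr[1:]]
    return fn(*args)

print(evaluate(parse(tokenize("(+ 1 (* 2 3))"))))  # prints 7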

As always, it was a crazy idea: create a programming language that experts say is useful for artificial intelligence, and build the standard libraries to represent the part of the world the system has to work with. As you follow the idea further, you realise you cannot represent the complete real world with that approach, but… could you somehow mix implementing part of the world in the programming language with learning part of the world from experience? That was my thought; maybe there would be a way to do it.

By the way, my mind took me to start reading the new book Deep Learning by Ian Goodfellow, Aaron Courville and Yoshua Bengio. In the introduction there is a reference to Cyc (Lenat and Guha, 1989) and the knowledge base approach.

A computer can reason about statements in these formal languages automatically using logical inference rules. This is known as the knowledge base approach to artificial intelligence. None of these projects has led to a major success. One of the most famous such projects is Cyc (Lenat and Guha, 1989). [extracted from the draft]

I still think it could work, because my approach is not to write every single rule of the world in the programming language, but rather to have some base, using a language prepared for that specific problem like a DSL, yet going further and without any limitation from the language itself. Either way, it is just an idea. I will continue reading the deep learning book, which looks really promising, and I will take a look at the reviews of the Lenat & Guha book; maybe I can figure out more.

Deep learning in a large scale distributed system

Deep learning is interesting in many ways. But when you consider doing it on thousands of cores that can process millions of parameters, the problem becomes more interesting and more complex at the same time.

[Image: Google datacenter (via Google)]

Google did an interesting experiment: training a deep network with millions of parameters on thousands of CPUs. The goal was to train on very large datasets without limiting the form of the model.

The paper describes DistBelief, a framework created for distributed parallel computing applied to deep learning training. Among the features the framework manages by itself:

The framework automatically parallelises computation in each machine using all available cores, and manages communication, synchronisation and data transfer between machines during both training and inference.

I could not find much information about it, only what is written in the paper.

They applied two algorithms: SGD (Stochastic Gradient Descent) and L-BFGS. These algorithms usually work well, but they do not scale to very large data sets, which is why they introduce some modifications to them (Downpour SGD and Sandblaster L-BFGS). The paper gives you more interesting details about the optimisations in both algorithms.
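
The core of Downpour SGD is an asynchronous parameter server: many workers compute gradients on their own shard of the data and push updates to the shared parameters without waiting for each other. The following toy sketch shows the pattern with threads and a linear least-squares model; everything in it (model, data, learning rate) is invented for illustration and it is not the DistBelief implementation:

import threading
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(10_000, 2))
y = X @ true_w + rng.normal(scale=0.1, size=10_000)

w = np.zeros(2)          # shared parameters (the "parameter server" state)
lock = threading.Lock()
LR, STEPS, BATCH = 0.01, 500, 32

def worker(shard_x, shard_y, seed):
    local_rng = np.random.default_rng(seed)
    for _ in range(STEPS):
        idx = local_rng.integers(0, len(shard_x), size=BATCH)
        xb, yb = shard_x[idx], shard_y[idx]
        grad = 2 * xb.T @ (xb @ w - yb) / BATCH  # computed on a possibly stale w
        with lock:
            w[:] -= LR * grad                    # push the update asynchronously

shards = np.array_split(np.arange(len(X)), 4)    # four workers, four data shards
threads = [threading.Thread(target=worker, args=(X[s], y[s], i))
           for i, s in enumerate(shards)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("learned w:", w)  # should end up close to [2, -3]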

I found the idea of distributed parallel computing working with very large datasets in such algorithms really interesting.

You can read “Large Scale Distributed Deep Networks”, or the PDF version if you prefer. Have fun!

Classifying documents using Apache Mahout

I was wondering how to do some text classification with Java and Apache Mahout. Isabel Drost-Fromm gave a talk at the Lucene/Solr Revolution conference (Dublin, 2013) where she spoke about the topic and how Apache Mahout and Lucene can help you.

It is a good introduction to the topic. I really enjoyed what was presented in the talk.

Lucene, Mahout and Hadoop (only a little bit) sound really great for a talk about how to do text classification.

The general idea behind the complete process of classifying documents follows these steps:

HTML >> Apache Tika

Fulltext >> Lucene Analyzer

Tokenstream >> FeatureVectorEncoder

Vector >> Online Learner
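
The stack in the talk is Java (Tika, Lucene, Mahout), but to make the stages concrete, here is a toy sketch of the same flow in Python: tokenise the text, hash the tokens into a fixed-size vector (the hashing trick, which is what a feature vector encoder applies) and feed an online logistic-regression learner. The documents and sizes are made up for illustration:

import math
import re

DIM = 256               # size of the feature vector (arbitrary)
weights = [0.0] * DIM
LR = 0.5

def tokenize(text):     # stand-in for the Lucene Analyzer stage
    return re.findall(r"[a-z]+", text.lower())

def encode(tokens):     # stand-in for the FeatureVectorEncoder stage
    vec = [0.0] * DIM
    for tok in tokens:
        vec[hash(tok) % DIM] += 1.0
    return vec

def predict(vec):
    z = sum(w * x for w, x in zip(weights, vec))
    return 1.0 / (1.0 + math.exp(-z))

def learn(vec, label):  # one step of the online learner (SGD on logistic loss)
    error = label - predict(vec)
    for i, x in enumerate(vec):
        if x:
            weights[i] += LR * error * x

docs = [("cheap pills buy now", 1), ("meeting notes for monday", 0),
        ("buy cheap watches now", 1), ("lunch on monday?", 0)]
for text, label in docs * 20:
    learn(encode(tokenize(text)), label)

print(predict(encode(tokenize("buy now"))))          # close to 1 (spam-like)
print(predict(encode(tokenize("monday meeting"))))   # close to 0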

Of course, Isabel's advice was to reuse the libraries you have at hand, take a look inside at the algorithms used there, and improve them if you need to. As a first approach, it is really good for me to see how things work.

Mahout is a really good library for machine learning. It used MapReduce to integrate perfectly with Hadoop (v1.0), although since April 2014 they have decided to move forward:

The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. (You can read that on their web site.)

At the end of the video there is a recommendation for everyone to participate in the project: fixing bugs, documentation, reporting bugs… There are always a lot of things to do in open source projects. If you are using these libraries and are interested in the project, I recommend subscribing to the mailing lists.

I really recommend watching the video if you are interested in the field; I think she gave a good talk about a good topic. You can take a look at the slides too.

Two interesting books to start with Machine Learning

There are a lot of books in the field of Machine Learning; a quick search on Amazon gives you more than 25,000 books. I wanted to filter all those books and choose the most useful ones. I looked on Google and Quora, and read some posts I found around the internet. A lot of people give lists of 10-20 books about machine learning, statistical learning, reinforcement learning… I just wanted to find two interesting books to get into the field.

With these books it is possible to learn the general aspects of the topic and later go deeper into the part that sounds most interesting.

Machine Learning

The “book” that everyone recommends as a good starting point, written by Tom M. Mitchell (professor at Carnegie Mellon University).

This is an introductory book for the field. You do not need any previous knowledge of Machine Learning.

Some topics you will find in the book: decision tree learning, artificial neural networks, Bayesian learning, computational learning theory, genetic algorithms, reinforcement learning and more.

Pattern Recognition and Machine Learning (Information Science and Statistics)

The author is Christopher M. Bishop, a Distinguished Scientist at Microsoft Research Cambridge, where he leads the Machine Learning and Perception group.

This book will give you a really good introduction to the most commonly used algorithms in Machine Learning.

Both books are theoretical and will give you a good introduction. Of course there are many more books in the area, some of them more practical, some about statistical learning… but I think it is good to have a simple starting point.

I have started with Tom M. Mitchell’s book. I will give you my impressions when I have finished it.

The Agile Samurai

When you have done many projects, from scratch or with legacy code… for different types of customers (banks, telecoms, retail…) in different types of companies (big, small, startups…), you always move to the next project thinking that you will do better next time. But how?

For me, Scrum changed how things could be done better. Agile is what this book is about, but I personally feel that both are really connected: the names of the meetings or events are different, but how they are organised looks similar.

The book is about how to execute your projects in a way that makes your customer feel more confident about the job you are doing. It is not only about agile; it is about how to execute projects so that we can deal with changes and still have quality, have immediate feedback about the current status, and be ready for production from the beginning.

Not all customers are the same, not all product owners are the same, not all companies are the same; in conclusion: not all the XXX are the same.

I like the idea of the Inception Deck. It is really good to have everyone in the team working on the big picture as a way to start: a mirror where everybody can see how the project looks to them and how things are going to be. After that you can start, and change things later if you need to.

In general, the book is good for: feeling what it could be like if you organise the project in an agile way; what the problems are going to be; how you can engage the customer/product owner; how the team should work; how testing and continuous deployment should work; how transparent the status of the project is going to be; how you deal with changes from the beginning… A lot of things together in a few pages :).

To know more, you should read it.

Thanks to Carlos Díez for lending me the book.

Re-launch

So many times I find myself writing lists of articles to read, or notes about them or about books… but I do not always write those notes in the same place. I decided it would be good to put all of those notes together in one place.

Some time ago I used to write on Tumblr, but it has not been updated in a long time.

I needed to put all those interesting notes, comments, ideas and investigations together. I think the evolution of those ideas, the experiences, the articles read… all of those could be interesting to share and collect here.