How to Import a Large Number of Nodes or Other Entities

One common Drupal task is importing data into the system. Importing a large number of articles, or entities of any type, is a bit challenging. There are two tricks here. One is obvious: we need to find a way to break one mega task into many small tasks. The other is less obvious: we have to minimize the damage of every mistake and prevent a small failure (an uncaught exception) from breaking the whole import process. To achieve that, we isolate the processing of each node.

Here is a case study from one of my previous projects. A customer needed to import and constantly update 55,000 schedule records every day. We imported TV channel schedule data as node entities. There are two ways to achieve the goal: we can either use the Feeds module or build our own custom modules.

The Feeds module provides a full set of functions for the job; it took less than an hour to set everything up. But it hit a timeout error every time after importing around 2,000 nodes. The Feeds module does not divide a big task into small chunks, so we would have to implement the Feeds fetch hook ourselves. I had used the Feeds module when I was working at TVO.ORG, and it was not as efficient as we had expected: a problem in a single node would break the whole import process. Since we simply cannot guarantee clean source data, it is difficult to be error-free for every node. The data involved human input, and a content manager may make a mistake at some point.

So I chose to build custom modules. Following the Feeds module's architecture, the import process has three parts: the "Fetcher", the "Parser" and the "Processor". Our module fetches the data and creates a Drupal system queue item for each node. We would still have a timeout issue if we put over 50,000 nodes into the queue at once, so we handle 5,000 nodes at a time and use an internal counter to keep track. The program logic in the fetcher depends on the kind of data source. The fetcher is responsible for building the queue, and for ensuring there are no missing or duplicated records. Then, in the queue worker callback function, we built the parser and processor for each queue task. One task deals with one node only: it creates that node. By doing this, we successfully detached the fetcher from the parser and processor. A breakdown while parsing or processing one node no longer affects the others.
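To make the parser/processor side concrete, here is a minimal sketch of what the queue worker registration and callback could look like, assuming Drupal 7's queue API. The module and function names (`schedule_import_*`) and the content type `schedule` are illustrative, not from the original project:

```php
<?php

/**
 * Implements hook_cron_queue_info().
 *
 * Registers the worker that acts as both parser and processor. Each queue
 * item holds the raw source data for exactly one node.
 */
function schedule_import_cron_queue_info() {
  $queues['schedule_import'] = array(
    'worker callback' => 'schedule_import_worker',
    // Maximum seconds a single cron run may spend on this queue.
    'time' => 60,
  );
  return $queues;
}

/**
 * Queue worker callback: parse one source record and save it as a node.
 */
function schedule_import_worker($item) {
  // Parser: map the raw record onto node fields (assumed field names).
  $node = new stdClass();
  $node->type = 'schedule';
  $node->title = $item['title'];
  $node->language = LANGUAGE_NONE;
  node_object_prepare($node);

  // Processor: save the node. If this throws, only this one item fails;
  // the rest of the queue is unaffected.
  node_save($node);
}
```

Because each item is processed independently, a bad record fails in isolation instead of aborting the whole 55,000-record run.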

Here is a little more about the fetcher. We used cron to transfer the source data into the Drupal system queue. It would certainly time out if we put all 55,000 nodes into the queue at one time; in most cases Drupal cannot handle over 20,000 nodes in one task. So we break the job into small tasks: a single cron run handles just 5,000 records, reading them from the source data and putting them into the queue. Note that we have not started creating any nodes yet. By the end of this stage, all the source data has been transferred into the system queue: 55,000 tasks, each holding the data to create one node. Use the contributed Queue UI module to see the created tasks.
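A fetcher along these lines could be sketched as follows, again assuming Drupal 7. The offset variable name and the helper `schedule_import_fetch_records()`, which would read from the external schedule source, are hypothetical:

```php
<?php

/**
 * Implements hook_cron().
 *
 * Fetcher sketch: each cron run moves at most 5,000 source records into
 * the queue, tracking its position in a persistent variable so that no
 * record is missed or queued twice.
 */
function schedule_import_cron() {
  // Internal counter: where the previous cron run stopped.
  $offset = variable_get('schedule_import_offset', 0);

  // Hypothetical helper that reads a slice of the source data.
  $records = schedule_import_fetch_records($offset, 5000);

  $queue = DrupalQueue::get('schedule_import');
  foreach ($records as $record) {
    // One queue item per future node; no node is created here.
    $queue->createItem($record);
  }

  // Advance the counter, or reset it once the source is exhausted.
  $offset = count($records) < 5000 ? 0 : $offset + 5000;
  variable_set('schedule_import_offset', $offset);
}
```

Each cron run stays well under the timeout, and after roughly a dozen runs the queue holds the full set of 55,000 tasks.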

Now each node import is a separate task in its own right. The failure of one does not affect the others.

Since the parser and processor code already live in the queue worker callback function, the last thing to do is run through the queue and finish all the tasks on it. Drupal cron runs queue tasks by default, but I wanted better control over creating and draining the queue, handled by a separate program. The mob_queue module does exactly this: it finishes the queued tasks with a Drush command and prevents the queue from running during the default cron execution. Mob_queue also lets you assign how much time to spend executing the queue. Drush has a default queue execution command, but it does not deal with cron-declared queues. We can invoke the Drush command from the Linux crontab.
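The crontab entry could look something like the following. The Drupal root path, log path, and the assumption that mob_queue exposes a `mobq` Drush command taking a queue name and a time limit are all illustrative; check the module's documentation for the exact syntax:

```shell
# Every 10 minutes, process the schedule_import queue for up to 300 seconds.
# (Paths and the "mobq" command name are assumptions, not verified here.)
*/10 * * * * cd /var/www/drupal && drush mobq schedule_import 300 >> /var/log/schedule_import.log 2>&1
```

Running the queue from crontab, independently of Drupal's own cron, is what gives the fine-grained control over when and for how long the import work happens.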

In the end, we built our own customized module to import and update all those nodes. It is much more lightweight, with less overhead, than the Feeds module, and it gives us more control over the import process.
