Many Internet projects need a database that shows users lists of provinces and municipalities (in Spain's case) or of regions, departments, Länder... This information is not usually available for quick consumption: you have to go to the official bodies' websites, hunt for the CSVs (when they work) and build the database yourself.
So today I decided to gather these territorial organizations, taking the official bodies as the sources, so that anyone can download an SQL dump of the municipalities, provinces and autonomous communities of Spain (MySQL format), and another MySQL dump of the European Union countries with their regions, departments, Länder, federal states or whatever their territorial organization happens to be.
To that end I have created a repository on GitHub where, besides the municipalities and provinces, there are more SQL and CSV files, so everything can be kept up to date whenever the sources change.
MySQL with municipalities and provinces of Spain
The data has been taken from the source provided by the INE (Instituto Nacional de Estadística), and a MySQL structure has been created so it can be reused. The data is from 2012.
You can download the database table by table.
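If you just want the data locally, loading the dump into MySQL is a one-liner. The database and file names below are placeholders; check the repository for the real ones:

```shell
DB=espana            # placeholder database name
DUMP=municipios.sql  # placeholder dump file from the repository
# Create an empty UTF-8 database and load the dump into it:
mysql -u root -p -e "CREATE DATABASE $DB CHARACTER SET utf8;"
mysql -u root -p "$DB" < "$DUMP"
```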
MySQL with the regions of the European Union
In the repository you will also find the NUTS classification system, which is roughly the same thing as the Spanish case but at the European level. The data has been taken from EUROSTAT, the tool the European Commission makes available to its users.
Some examples of the data you will find in this table are:
Federal states (Länder) and administrative regions of
French regions and departments
Italian regions and provinces
Greek administrations and peripheries
And so on with 81 countries (including
If you ever needed a MySQL table with all the languages detailed in the ISO-639 language codes, here it is. I took the list from the registration authority and created the table in MySQL.
There are 21 languages that have alternative codes for
bibliographic or terminology purposes. In those cases I took the
bibliographic ones. The script contains the 2 and 3
letter ISO-639 language codes, as well as the English
and French names of the languages.
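Once the dump is loaded, lookups are trivial. The table and column names here are a guess at the schema, so adjust them to the actual dump:

```shell
# Look up Spanish ('es') by its two-letter code; table/column names are assumptions.
QUERY="SELECT iso_639_2, iso_639_1, english_name FROM languages WHERE iso_639_1 = 'es'"
mysql -u root -p languages_db -e "$QUERY"
```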
I've been playing recently with several distributed databases with the aim of choosing the best solution for my needs. Since there isn't much documentation on the web giving a general overview of the subject, I'm writing down some comments, thoughts and my humble experience here. I hope it's useful for you. This document is not a performance comparison, or a "mine is bigger than yours"; just some ideas ;)
My background
My experience being based on relational databases (like MySQL or Postgres) and object-oriented databases (like ZODB), it was very easy for me to get hooked on this new challenge: anything but relational databases (no offense). Last year, in Mainz, Germany, after having heard a lot of buzz on the subject, I attended an interesting lecture on CouchDB by Jan Lehnardt. (BTW, thanks Jan for all the questions you answered for me.) That event was the firestarter, and since then I've been trying things... So I am a newbie in this expert area.
Now I am working on a project that performs WRITES on a very large set of data. None of this data needs to be in a relational database, since it's a simple activity collector, flat and with a variable structure, that writes some gigabytes every day which must be processed at night under different criteria.
I started doing some research on what I could use to store that data, and here are some of the things I tried. The principles were:
Able to handle and process a lot of non-trivial data
Able to write at least 1000 reqs/s, non-bulk
(I still have a decision to make)
Tokyo Cabinet + Tokyo Tyrant (key/value)
Tokyo Cabinet is a library of routines for managing a database which, in conjunction with Tokyo Tyrant, the network interface, creates an excellent network database. Tokyo supports up to 8 exabytes of data and it's really fast, both for writes and reads. It is currently used in Japan's Mixi social network (more than 14 billion page views monthly). It's a key/value database.
From the sysadmin point of view it's a lovely product: easy to install, start, manage, back up or replicate. Everything is as easy as they state on the website, and it works. If you, like me, are a Memcached lover, Tokyo is fully compatible with the memcached protocol, allowing you such delights as "persistent but volatile data". You can run the database in memory only or with persistence (if you want in-memory, I think memcached is slightly faster for that purpose, but that's just my impression from a few tests; do your own or Google it).
To start a master-slave system is as easy as (you might want only a master, or several slaves):
Start the Master: ttserver -port 1978 -sid 1 casket-1.tch
Where the casket files are the databases and their extensions define which algorithm you want to use. Easy as hell. Just be sure to set a different, unique identifier (sid) for every server and tell the slaves which host is the master (mhost). That would run a master and a slave on the same machine on ports 1978 and 1979.
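Only the master invocation is shown above; a matching slave on port 1979 would look roughly like this (the flag names are from the Tokyo Tyrant manual, the file names are arbitrary):

```shell
# Start the slave: unique -sid, point -mhost/-mport at the master,
# and give Tyrant a replication timestamp file (-rts) so it can resume.
ttserver -port 1979 -sid 2 -mhost 127.0.0.1 -mport 1978 \
         -rts casket-2.rts casket-2.tch
```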
OK, that was really exciting for me, but I found that although Tokyo is wonderful and everything, as in memcached you need to know the KEY of the element you want to retrieve (yeah, that's what a key/value store is supposed to do). In my case I need to map/reduce the vast amount of data, so Tokyo is not an option for me, but it might be what you've been looking for. I'll surely use it for some other project.
The weak points I found were that when used through the HTTP wrapper the performance drops significantly. For instance, I got much better results using curl POSTs with CouchDB on a single node. Oh, and I hope you speak Japanese; you won't find much documentation in English ;)
Basic example on how to write data in Tokyo using the HTTP wrapper:
curl -s -X PUT -d "my_value" "http://127.0.0.1:1978/my_key"
But the fastest way is using the bundled command-line tools:
tcrmgr put -port 1978 127.0.0.1 "my_key" "my_value"
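Reading the value back is symmetric, either over HTTP or with the bundled client:

```shell
KEY=my_key
# Over the HTTP wrapper:
curl -s "http://127.0.0.1:1978/$KEY"
# Or with the command-line client:
tcrmgr get -port 1978 127.0.0.1 "$KEY"
```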
CouchDB is a schema-free, document-oriented database. A wonderful Apache project with a bigger community behind it than the others. CouchDB exposes a RESTful HTTP/JSON API, meaning that with simple HTML and JS you can create a dynamic website. Of course you can use other wrappers like PHPillow if you want to use PHP.
Among all these systems it's the one I like most as a product; although it doesn't have a daemon for managing the service (I tried 0.9 from checkout), it is very easy to install and run.
Some things I really like:
No scripting language needed
The replication is bi-directional, peer-based
Leave your nodes offline for a while, reconnect them, and it's as if nothing happened.
The Futon interface. Some sort of, let's say, phpMyAdmin where you can play and test your views
If you want to insert a lot of documents you should insert them in bulk mode. It's faster and your database will be more compact. Do not perform this test over a million iterations, though, or your disk usage will grow a lot.
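As a sketch, a bulk insert against CouchDB's _bulk_docs endpoint looks like this (the database name "activity" and the document fields are made up):

```shell
# Two documents in a single POST instead of two separate PUTs.
PAYLOAD='{"docs":[{"type":"pageview","path":"/a"},{"type":"pageview","path":"/b"}]}'
curl -s -X POST -H "Content-Type: application/json" \
     -d "$PAYLOAD" "http://127.0.0.1:5984/activity/_bulk_docs"
```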
MongoDB is, like CouchDB, another document-oriented database. This project was born this year, 2009, so it's very young. A colleague and I tried to install the server and run it with the PHP libraries under CentOS, but it was completely impossible.
Despite all that, I'll keep an eye on it; it's very interesting as well, because:
You can filter data by fields without knowing the keys (as in CouchDB)
You can create indexes to accelerate the data filtering
Sexy functionalities to operate with the database
Internal drivers ready for C, Ruby or Python. A PHP module is available.
Not much of a learning curve
Replication, failover and auto-sharding coming soon.
Project Voldemort (key/value)
Behind this curious name from Harry Potter there is an interesting distributed key/value database written in Java. Project Voldemort handles replication and partitioning automatically: every server contains only a subset of the data, and if a node fails your data integrity is not compromised, nor is there a central point of coordination. The balancer reads from any node but writes to all of them at a time, which can slow down your writes. This system is used in some parts of the well-known LinkedIn, with 12 TB of data. But I am not a big fan of Java, so I played with it a little bit and, after going nuts running the examples and installing the Eclipse stuff, I tried something else. Not for me :$ (not a lover of Tomcat either).
Again, another key/value store. It doesn't fit my needs at this moment.
HBase is a key/value Hadoop project. I am using it right now through the HTTP interface and it's very fast at writing (the quickest in my tests). But if you want to serve your website content from this database, you'd better skip this one: the latency is high. I jumped from here to Cloudera.
curl -v -T ./y.row http://ip.to.hbase:60050/api/table1/scanner?column=a:
# That returns an ID, like b959591, retrieve it:
curl -v http://ip.to.hbase:60050/api/table1/scanner/b959591
Cloudera (Hadoop cluster)
Cloudera is a company that offers an Apache-licensed packaged solution based on the Hadoop family. If you are interested in cloud computing and you have been following projects like Yahoo!'s Pig or Facebook's Hive, then you'll be glad to know that Cloudera brings all that power to 'mere mortals' in an easy RPM.
Cloudera combines the Hadoop cluster and the HDFS filesystem with Hive, and spares you all the pain of configuring a Hadoop cluster. You will be able to summarize, analyze and query a huge amount of data. Have a look at the latest training-session videos to get an idea of what the SQL-like Hive query language looks like. You can even download a VMware virtual machine with the solution running.
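To give a feel for that SQL-like syntax, here is a hypothetical Hive query run from the shell; the table and columns are invented:

```shell
# Aggregate one day's activity per URL; "activity_log" is a made-up table.
QUERY="SELECT url, COUNT(1) AS hits FROM activity_log WHERE dt = '2009-06-01' GROUP BY url ORDER BY hits DESC LIMIT 10"
hive -e "$QUERY"
```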
I've been looking at this recently. It's a lot more complex than any of the other solutions, but since I'll need to store and process several gigabytes every day, this seems the way to go... :_/ sniff...
If you are interested in this, you can also have Hadoop running via Amazon Elastic MapReduce; billing is hourly and per machine.
More interesting projects:
The following projects aren't any less interesting; I just haven't had time to test them fully, and they aren't suitable for my current needs. They are worth a mention anyway:
Disco Project: another map/reduce framework, started at Nokia. It uses Python and is quite simple.
LightCloud: Distributed key-value database using the Tokyo Tyrant network interface (Plurk open source)