Working with some messy address or name data? It helps to split each one into separate components.
parserator is a framework for making parsers using natural language processing (NLP) methods.
Structured text, like written addresses and names, doesn’t have rules so much as tendencies. Instead of relying on a set of hard-and-fast parsing rules, parserator takes a statistical approach and learns these tendencies from real examples.
For a given address or name, we use a statistical model called a conditional random field to find the most probable sequence of field labels for its tokens.
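To make the idea of "the most probable sequence of fields" concrete, here is a minimal sketch of sequence decoding in Python using the Viterbi algorithm. It is not parserator's model: the label set is a small illustrative subset, and the scoring functions (`emission_score`, `transition_score`) and their weights are hypothetical and set by hand, whereas a trained conditional random field would learn its weights from labeled examples.

```python
# Illustrative subset of labels; the weights below are hand-set assumptions,
# not values learned by any real parser.
LABELS = ["AddressNumber", "StreetName", "StreetNamePostType"]

# Hypothetical scores for one label following another.
TRANSITION = {
    ("AddressNumber", "StreetName"): 1.0,
    ("StreetName", "StreetName"): 0.5,
    ("StreetName", "StreetNamePostType"): 1.0,
}


def emission_score(token, label):
    """Toy score for how well a single token fits a label."""
    score = 0.0
    if token.isdigit():
        score += 2.0 if label == "AddressNumber" else -1.0
    if token.lower().rstrip(".") in {"st", "ave", "rd", "blvd"}:
        score += 2.0 if label == "StreetNamePostType" else -1.0
    if token.isalpha() and label == "StreetName":
        score += 0.5
    return score


def transition_score(prev_label, label):
    """Toy score for a label pair; unlisted pairs get a small penalty."""
    return TRANSITION.get((prev_label, label), -0.5)


def viterbi(tokens):
    """Return the highest-scoring (token, label) sequence for the tokens."""
    if not tokens:
        return []
    # best[i][label] = (score of best path ending in label at token i, backpointer)
    best = [{label: (emission_score(tokens[0], label), None) for label in LABELS}]
    for token in tokens[1:]:
        column = {}
        for label in LABELS:
            prev_label, prev_score = max(
                ((p, best[-1][p][0] + transition_score(p, label)) for p in LABELS),
                key=lambda pair: pair[1],
            )
            column[label] = (prev_score + emission_score(token, label), prev_label)
        best.append(column)
    # Start from the best final label and follow backpointers to the first token.
    label = max(LABELS, key=lambda name: best[-1][name][0])
    path = [label]
    for column in reversed(best[1:]):
        label = column[label][1]
        path.append(label)
    path.reverse()
    return list(zip(tokens, path))


print(viterbi(["123", "Main", "St"]))
```

The point of the sketch is that no single rule decides a token's label. "Main" is scored as a street name mostly because it follows a number and precedes "St", which is the kind of tendency a trained model picks up from examples.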
Machine learning for addresses, in a nutshell
usaddress, probablepeople and the parserator framework were all built in partnership with the Atlanta Journal-Constitution and DataMade as part of the Entity-Focused Data System, a platform that lets journalists continually link information about political figures, campaign filings, contracts and lobbyist disclosures to drive investigations.
We built everything open source under the MIT License so others could benefit from this work.
With the probabilistic approach that parserator parsers use, we can keep adding examples to our training dataset, and the model will keep learning and improving. For this purpose, and for service monitoring, we may retain text submitted to this site.
We only log an address when an error occurs or when an incorrect parse is reported. If we deem an address a good candidate for training, we will train the parser on a different, publicly available address with similar features, not on the submitted address itself.
Need help parsing a lot of addresses? Looking for guidance on creating your own parser? Contact DataMade.